ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.5
Committed: Wed Apr 25 06:37:12 2018 UTC (6 years, 2 months ago) by root
Branch: MAIN
Changes since 1.4: +5 -4 lines
Log Message:
*** empty log message ***

File Contents

# User Rev Content
1 root 1.1 =head1 REGISTRATION INFORMATION
2    
3 root 1.3 Tag 256 (stringref-namespace)
4 root 1.1 Data Item multiple
5     Semantics mark value as having string references
6     Reference http://cbor.schmorp.de/stringref
7     Contact Marc A. Lehmann <cbor@schmorp.de>
8    
9 root 1.3 Tag 25 (stringref)
10 root 1.1 Data Item unsigned integer
11     Semantics reference the nth previously seen string
12     Reference http://cbor.schmorp.de/stringref
13     Contact Marc A. Lehmann <cbor@schmorp.de>
14    
15     =head1 RATIONALE
16    
17     These two tags can be used to efficiently share constant strings, which
18     are common in many data structures. They are an optimisation only,
19     similar to LZSS compresssion, for the string major types in CBOR.
20    
21     Background: many data structures or protocols contain repeated string
22     constants (often called atoms or quarks). For example, this is a
23     real-world (JSON-formatted) data structure that contains many repeated
24     strings used as map keys:
25    
26     [
27     {
28     "name" : "Cocktail",
29     "count" : 417,
30     "rank" : 4
31     },
32     {
33     "rank" : 4,
34     "count" : 312,
35     "name" : "Bath"
36     },
37     {
38     "count" : 691,
39     "name" : "Food",
40     "rank" : 4
41     }
42     ]
43    
44     This can be encoded nicely with CBOR, but each occurrence of "name",
45     "count" and "rank" will be encoded separately. In highly repetitive data
46     structures (the above is taken from a game save file) these can take up
47     considerable space.
48    
49     This scheme can be used to reduce this overhead with a simple scheme that
50     is easily implementable.
51    
52     =head1 DESCRIPTION
53    
54 root 1.3 Stringref consists of two tags, stringref-namespace (value C<256>),
55 root 1.1 which marks a value as containing string references, and stringref (value
56 root 1.3 C<25>), which references a string previously encoded in the value.
57 root 1.1
58     The stringref-namespace tag is used to define a namespace for the string
59     reference ids. stringref tags are only valid inside CBOR values marked
60     with stringref-namespace.
61    
62     The purpose of stringref-namespace is three-fold:
63    
64     =over 4
65    
66     =item 1. It signals the decoder that it needs to keep a copy of every
67     string that is being decoded while decoding the tagged value (and
68     conversely, that there is no need for this extra overhead without this
69     tag).
70    
71     =item 2. It can be used to confine string references to certain subtrees,
72     which might increase encoding efficiency by being able to use smaller
73     references.
74    
75     =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76     existing CBOR streams, as the indices are relative to the outer namespace
77     tag, and don't change when embedded.
78    
79     =back
80    
81     Within a value tagged with stringref-namespace, every string that is
82     encoded with a definite length and has a minimum length is assigned an
83     implicit index, starting from zero.
84    
85     The minimum length required to get an index depends on the index value
86     that would be assigned to it, as stringref references should be shorter
87     than the string they reference, and there is no point in wasting indices
88     on strings that should not be referenced anyway.
89    
90     The lengths are as follows:
91    
92     string index minimum string length in octets
93     0 .. 23 3
94     24 .. 255 4
95     256 .. 65535 5
96 root 1.2 65536 .. 4294967295 7
97     4294967296 .. 11
98 root 1.1
99     The minimum string length is simply the length of the stringref tag (2
100     octets), plus the minimum size the resulting index takes up, using major
101     type 0 encoding.
102    
103     Each time a string is to be encoded by the encoder, it can choose to
104     encode it as a normal string, or attempt to encode it as a stringref.
105    
106     If it attempts to encode a stringref, it needs to check whether the string
107     has been assigned an index already. If yes, it can emit a stringref tag
108     with the index.
109    
110     In all other cases the encoder must check whether it has to assign an
111     index value to it by checking against the minimum length required for the
112     next index value to be assigned, and assigning the next index value if the
113     string has minimum length or is longer.
114    
115     A typical encoder might want to look up every string (using the major
116     type and the octets) it encodes in some kind of internal map that stores
117     previously-assigned indices. If found, it will emit a stringref. If not,
118     it attempts to assign an index to the string and stores that in its map.
119    
120     Decoders might choose not to look up previous indices for a string, for
121     example, because it only wants to compress strings used as map keys, or
122     because it intrinsically knows that the string is only used once. This is
123     fine, but each time the encoder emits a string, it MUST assign an index if
124     the minimum length is reached or exceeded, even if it doesn't store that
125     index for later use.
126    
127     Byte strings and text strings share the same index namespace, but are
128     distinct, even if their contents are bit-identical. Indefinite length
129     strings and the chunks they consist of are never assigned an index.
130    
131     A typical encoder would emit the stringref-namespace tag once to tag
132     the top-level value in a CBOR stream, but nothing keeps an encoder
133     from nesting stringref-namespace tags. Nesting might be done for
134     convenience, to improve coding efficiency, to embed values or for any
135     other reason. Similarly, an encoder can sort map keys for greater coding
136     efficiency, but is not required to.
137    
138     A decoder follows the same process of assigning indices. A possible
139     implementation could be outlined like this:
140    
141     When a decoder supports stringref, then, upon encountering a
142     stringref-namespace tag, it should initialise an array and start decoding
143     the tagged value. The array stores, for each index, the major type (byte
144     or text) and the string value. Implementations are free to use their
145     internal format, of course - languages that do not distinguish between
146     text and byte strings can just store their internal string type in the
147     array.
148    
149     Since stringref-namespace tags can be nested, the decoder needs to save
150     and restore the outer array before starting and after ending the decoding
151     of the tagged value. In a recursive implementation this could simply look
152     like this:
153    
154     global stringref_array; // global per encoder
155     local outer_stringref_array;
156    
157     outer_stringref_array = stringref_array;
158     stringref_array = allocate_array ();
159    
160     value = decode_cbor_value ();
161    
162     deallocate_array (stringref_array);
163     stringref_array = outer_stringref_array;
164    
165     return value;
166    
167     When a stringref tag is encountered, it MUST contain an unsigned integer
168     (major type 0). The integer MUST be less than the number of entries in the
169     array. The decoder than looks up the string in the array, given the index,
170     and copies it to the result (string values should not be shared unless
171     the language normally does so - if strings are mutable, a copy should be
172     decoded).
173    
174     When a definite-length text or byte string is encountered, it is decoded
175     normally. Afterwards, the decoder checks whether it needs to assign an
176     index to the string, by comparing the string length against the minimum
177     length required by the index. If the string is shorter, nothing happens,
178     otherwise the next free index is used and the string (and the information
179     whether it is a byte or text string) is recorded at the end of the array.
180    
181     In languages with dynamic arrays, this could be accomplished by using
182     the array length as the next index to be assigned, and pushing the
183     string onto the end of the array when it is long enough.
184    
185     =head2 IMPLEMENTATION NOTE
186    
187 root 1.3 The semantics of stringref tags require the decoder to be aware and the
188     encoder to be under control of the sequence in which data items are
189     encoded into the CBOR stream. This means these tags cannot be implemented
190     on top of every generic CBOR encoder/decoder (which might reorder entries
191     in a map); they typically need to be integrated into their works.
192 root 1.1
193     =head1 EXAMPLES
194    
195 root 1.3 The array-of-maps from the rationale example would normally compress to a
196     CBOR text of 83 bytes. Using this extension where possible, this reduces
197     to 74 bytes:
198    
199     d9 0100 # tag(256)
200     83 # array(3)
201     a3 # map(3)
202     44 # bytes(4)
203     72616e6b # "rank"
204     04 # unsigned(4)
205     45 # bytes(5)
206     636f756e74 # "count"
207     19 01a1 # unsigned(417)
208     44 # bytes(4)
209     6e616d65 # "name"
210     48 # bytes(8)
211     436f636b7461696c # "Cocktail"
212     a3 # map(3)
213     d8 19 # tag(25)
214     02 # unsigned(2)
215     44 # bytes(4)
216     42617468 # "Bath"
217     d8 19 # tag(25)
218     01 # unsigned(1)
219     19 0138 # unsigned(312)
220     d8 19 # tag(25)
221     00 # unsigned(0)
222     04 # unsigned(4)
223     a3 # map(3)
224     d8 19 # tag(25)
225     02 # unsigned(2)
226     44 # bytes(4)
227     466f6f64 # "Food"
228     d8 19 # tag(25)
229     01 # unsigned(1)
230     19 02b3 # unsigned(691)
231     d8 19 # tag(25)
232     00 # unsigned(0)
233     04 # unsigned(4)
234    
235     The following JSON array illustrates the effect of the index on the
236     minimum string length:
237    
238     [ "1", "222", "333", "4", "555", "666", "777", "888", "999",
239     "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
240     "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
241     "333",
242     "ssss",
243     "qqq", "rrr", "ssss"]
244    
245 root 1.5 The strings "1", "4" and "rrr" are too short to get an index assigned. All
246     others that are not encoded with a stringref do (this assumes that JSON
247     strings are encoded as CBOR byte strings):
248 root 1.3
249     d9 0100 # tag(256)
250     98 20 # array(32)
251     41 # bytes(1)
252     31 # "1"
253     43 # bytes(3)
254     323232 # "222"
255     43 # bytes(3)
256     333333 # "333"
257     41 # bytes(1)
258     34 # "4"
259     43 # bytes(3)
260     353535 # "555"
261     43 # bytes(3)
262     363636 # "666"
263     43 # bytes(3)
264     373737 # "777"
265     43 # bytes(3)
266     383838 # "888"
267     43 # bytes(3)
268     393939 # "999"
269     43 # bytes(3)
270     616161 # "aaa"
271     43 # bytes(3)
272     626262 # "bbb"
273     43 # bytes(3)
274     636363 # "ccc"
275     43 # bytes(3)
276     646464 # "ddd"
277     43 # bytes(3)
278     656565 # "eee"
279     43 # bytes(3)
280     666666 # "fff"
281     43 # bytes(3)
282     676767 # "ggg"
283     43 # bytes(3)
284     686868 # "hhh"
285     43 # bytes(3)
286     696969 # "iii"
287     43 # bytes(3)
288     6a6a6a # "jjj"
289     43 # bytes(3)
290     6b6b6b # "kkk"
291     43 # bytes(3)
292     6c6c6c # "lll"
293     43 # bytes(3)
294     6d6d6d # "mmm"
295     43 # bytes(3)
296     6e6e6e # "nnn"
297     43 # bytes(3)
298     6f6f6f # "ooo"
299     43 # bytes(3)
300     707070 # "ppp"
301     43 # bytes(3)
302     717171 # "qqq"
303     43 # bytes(3)
304     727272 # "rrr"
305 root 1.5 d8 19 # tag(25)
306     01 # unsigned(1)
307 root 1.3 44 # bytes(4)
308     73737373 # "ssss"
309     d8 19 # tag(25)
310     17 # unsigned(23)
311     43 # bytes(3)
312     727272 # "rrr"
313     d8 19 # tag(25)
314     18 18 # unsigned(24)
315    
316     This example shows three stringref-namespace tags, two of which are nested
317     inside another:
318    
319     256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])
320    
321     d9 0100 # tag(256)
322     85 # array(5)
323     63 # text(3)
324     616161 # "aaa"
325     d8 19 # tag(25)
326     00 # unsigned(0)
327     d9 0100 # tag(256)
328     83 # array(3)
329     63 # text(3)
330     626262 # "bbb"
331     63 # text(3)
332     616161 # "aaa"
333     d8 19 # tag(25)
334     01 # unsigned(1)
335     d9 0100 # tag(256)
336     82 # array(2)
337     63 # text(3)
338     636363 # "ccc"
339     d8 19 # tag(25)
340     00 # unsigned(0)
341     d8 19 # tag(25)
342     00 # unsigned(0)
343    
344     The decoded data structure might look like this:
345    
346     ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]
347    
348 root 1.4 =head1 IMPLEMENTATIONS
349    
350     This section lists known implementations of this extension (L<drop me a
351     mail|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be
352     listed here).
353    
354     =over 4
355    
356     =item * [Perl] L<CBOR::XS|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation)
357    
358     =back
359 root 1.1