ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.6
Committed: Mon Apr 30 11:24:17 2018 UTC (6 years, 1 month ago) by root
Branch: MAIN
CVS Tags: rel-1_71
Changes since 1.5: +19 -0 lines
Log Message:
*** empty log message ***

File Contents

# User Rev Content
1 root 1.1 =head1 REGISTRATION INFORMATION
2    
3 root 1.3 Tag 256 (stringref-namespace)
4 root 1.1 Data Item multiple
5     Semantics mark value as having string references
6     Reference http://cbor.schmorp.de/stringref
7     Contact Marc A. Lehmann <cbor@schmorp.de>
8    
9 root 1.3 Tag 25 (stringref)
10 root 1.1 Data Item unsigned integer
11     Semantics reference the nth previously seen string
12     Reference http://cbor.schmorp.de/stringref
13     Contact Marc A. Lehmann <cbor@schmorp.de>
14    
15     =head1 RATIONALE
16    
17     These two tags can be used to efficiently share constant strings, which
18     are common in many data structures. They are an optimisation only,
19     similar to LZSS compresssion, for the string major types in CBOR.
20    
21     Background: many data structures or protocols contain repeated string
22     constants (often called atoms or quarks). For example, this is a
23     real-world (JSON-formatted) data structure that contains many repeated
24     strings used as map keys:
25    
26     [
27     {
28     "name" : "Cocktail",
29     "count" : 417,
30     "rank" : 4
31     },
32     {
33     "rank" : 4,
34     "count" : 312,
35     "name" : "Bath"
36     },
37     {
38     "count" : 691,
39     "name" : "Food",
40     "rank" : 4
41     }
42     ]
43    
44     This can be encoded nicely with CBOR, but each occurrence of "name",
45     "count" and "rank" will be encoded separately. In highly repetitive data
46     structures (the above is taken from a game save file) these can take up
47     considerable space.
48    
49     This scheme can be used to reduce this overhead with a simple scheme that
50     is easily implementable.
51    
52     =head1 DESCRIPTION
53    
54 root 1.3 Stringref consists of two tags, stringref-namespace (value C<256>),
55 root 1.1 which marks a value as containing string references, and stringref (value
56 root 1.3 C<25>), which references a string previously encoded in the value.
57 root 1.1
58     The stringref-namespace tag is used to define a namespace for the string
59     reference ids. stringref tags are only valid inside CBOR values marked
60     with stringref-namespace.
61    
62     The purpose of stringref-namespace is three-fold:
63    
64     =over 4
65    
66     =item 1. It signals the decoder that it needs to keep a copy of every
67     string that is being decoded while decoding the tagged value (and
68     conversely, that there is no need for this extra overhead without this
69     tag).
70    
71     =item 2. It can be used to confine string references to certain subtrees,
72     which might increase encoding efficiency by being able to use smaller
73     references.
74    
75     =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76     existing CBOR streams, as the indices are relative to the outer namespace
77     tag, and don't change when embedded.
78    
79     =back
80    
81     Within a value tagged with stringref-namespace, every string that is
82     encoded with a definite length and has a minimum length is assigned an
83     implicit index, starting from zero.
84    
85     The minimum length required to get an index depends on the index value
86     that would be assigned to it, as stringref references should be shorter
87     than the string they reference, and there is no point in wasting indices
88     on strings that should not be referenced anyway.
89    
90     The lengths are as follows:
91    
92     string index minimum string length in octets
93     0 .. 23 3
94     24 .. 255 4
95     256 .. 65535 5
96 root 1.2 65536 .. 4294967295 7
97     4294967296 .. 11
98 root 1.1
99     The minimum string length is simply the length of the stringref tag (2
100     octets), plus the minimum size the resulting index takes up, using major
101     type 0 encoding.
102    
103     Each time a string is to be encoded by the encoder, it can choose to
104     encode it as a normal string, or attempt to encode it as a stringref.
105    
106     If it attempts to encode a stringref, it needs to check whether the string
107     has been assigned an index already. If yes, it can emit a stringref tag
108     with the index.
109    
110     In all other cases the encoder must check whether it has to assign an
111     index value to it by checking against the minimum length required for the
112     next index value to be assigned, and assigning the next index value if the
113     string has minimum length or is longer.
114    
115     A typical encoder might want to look up every string (using the major
116     type and the octets) it encodes in some kind of internal map that stores
117     previously-assigned indices. If found, it will emit a stringref. If not,
118     it attempts to assign an index to the string and stores that in its map.
119    
120     Decoders might choose not to look up previous indices for a string, for
121     example, because it only wants to compress strings used as map keys, or
122     because it intrinsically knows that the string is only used once. This is
123     fine, but each time the encoder emits a string, it MUST assign an index if
124     the minimum length is reached or exceeded, even if it doesn't store that
125     index for later use.
126    
127     Byte strings and text strings share the same index namespace, but are
128     distinct, even if their contents are bit-identical. Indefinite length
129     strings and the chunks they consist of are never assigned an index.
130    
131     A typical encoder would emit the stringref-namespace tag once to tag
132     the top-level value in a CBOR stream, but nothing keeps an encoder
133     from nesting stringref-namespace tags. Nesting might be done for
134     convenience, to improve coding efficiency, to embed values or for any
135     other reason. Similarly, an encoder can sort map keys for greater coding
136     efficiency, but is not required to.
137    
138     A decoder follows the same process of assigning indices. A possible
139     implementation could be outlined like this:
140    
141     When a decoder supports stringref, then, upon encountering a
142     stringref-namespace tag, it should initialise an array and start decoding
143     the tagged value. The array stores, for each index, the major type (byte
144     or text) and the string value. Implementations are free to use their
145     internal format, of course - languages that do not distinguish between
146     text and byte strings can just store their internal string type in the
147     array.
148    
149     Since stringref-namespace tags can be nested, the decoder needs to save
150     and restore the outer array before starting and after ending the decoding
151     of the tagged value. In a recursive implementation this could simply look
152     like this:
153    
154     global stringref_array; // global per encoder
155     local outer_stringref_array;
156    
157     outer_stringref_array = stringref_array;
158     stringref_array = allocate_array ();
159    
160     value = decode_cbor_value ();
161    
162     deallocate_array (stringref_array);
163     stringref_array = outer_stringref_array;
164    
165     return value;
166    
167     When a stringref tag is encountered, it MUST contain an unsigned integer
168     (major type 0). The integer MUST be less than the number of entries in the
169     array. The decoder than looks up the string in the array, given the index,
170     and copies it to the result (string values should not be shared unless
171     the language normally does so - if strings are mutable, a copy should be
172     decoded).
173    
174     When a definite-length text or byte string is encountered, it is decoded
175     normally. Afterwards, the decoder checks whether it needs to assign an
176     index to the string, by comparing the string length against the minimum
177     length required by the index. If the string is shorter, nothing happens,
178     otherwise the next free index is used and the string (and the information
179     whether it is a byte or text string) is recorded at the end of the array.
180    
181     In languages with dynamic arrays, this could be accomplished by using
182     the array length as the next index to be assigned, and pushing the
183     string onto the end of the array when it is long enough.
184    
185     =head2 IMPLEMENTATION NOTE
186    
187 root 1.3 The semantics of stringref tags require the decoder to be aware and the
188     encoder to be under control of the sequence in which data items are
189     encoded into the CBOR stream. This means these tags cannot be implemented
190     on top of every generic CBOR encoder/decoder (which might reorder entries
191     in a map); they typically need to be integrated into their works.
192 root 1.1
193 root 1.6 =head2 DESIGN RATIONALE
194    
195     The stringref tag was chosen to be short, without requiring standards
196     action. The namespace tag is rare, so doesn't benefit from a short
197     encoding as much.
198    
199     Implicit tagging/counting was chosen to support stream encoders. Having
200     to tag strings first requires either multiple passes over the data (which
201     might not be available, ruling out some encoders) or tagging more strings
202     than needed (wasting space). Explicit tagging also isn't necessarily
203     better even under optimal conditions, as the explicit tags waste space.
204    
205     Stream decoders are affected less by implicit tagging than encoders.
206    
207     The namespace tag was introduced for two reasons: first to allow embedding
208     of CBOR strings into other CBOR strings, secondly for decoding efficiency
209     - the decoder only has to expect stringref tags inside namespaces and
210     therefore doesn't have to maintain extra state outside of them.
211    
212 root 1.1 =head1 EXAMPLES
213    
214 root 1.3 The array-of-maps from the rationale example would normally compress to a
215     CBOR text of 83 bytes. Using this extension where possible, this reduces
216     to 74 bytes:
217    
218     d9 0100 # tag(256)
219     83 # array(3)
220     a3 # map(3)
221     44 # bytes(4)
222     72616e6b # "rank"
223     04 # unsigned(4)
224     45 # bytes(5)
225     636f756e74 # "count"
226     19 01a1 # unsigned(417)
227     44 # bytes(4)
228     6e616d65 # "name"
229     48 # bytes(8)
230     436f636b7461696c # "Cocktail"
231     a3 # map(3)
232     d8 19 # tag(25)
233     02 # unsigned(2)
234     44 # bytes(4)
235     42617468 # "Bath"
236     d8 19 # tag(25)
237     01 # unsigned(1)
238     19 0138 # unsigned(312)
239     d8 19 # tag(25)
240     00 # unsigned(0)
241     04 # unsigned(4)
242     a3 # map(3)
243     d8 19 # tag(25)
244     02 # unsigned(2)
245     44 # bytes(4)
246     466f6f64 # "Food"
247     d8 19 # tag(25)
248     01 # unsigned(1)
249     19 02b3 # unsigned(691)
250     d8 19 # tag(25)
251     00 # unsigned(0)
252     04 # unsigned(4)
253    
254     The following JSON array illustrates the effect of the index on the
255     minimum string length:
256    
257     [ "1", "222", "333", "4", "555", "666", "777", "888", "999",
258     "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
259     "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
260     "333",
261     "ssss",
262     "qqq", "rrr", "ssss"]
263    
264 root 1.5 The strings "1", "4" and "rrr" are too short to get an index assigned. All
265     others that are not encoded with a stringref do (this assumes that JSON
266     strings are encoded as CBOR byte strings):
267 root 1.3
268     d9 0100 # tag(256)
269     98 20 # array(32)
270     41 # bytes(1)
271     31 # "1"
272     43 # bytes(3)
273     323232 # "222"
274     43 # bytes(3)
275     333333 # "333"
276     41 # bytes(1)
277     34 # "4"
278     43 # bytes(3)
279     353535 # "555"
280     43 # bytes(3)
281     363636 # "666"
282     43 # bytes(3)
283     373737 # "777"
284     43 # bytes(3)
285     383838 # "888"
286     43 # bytes(3)
287     393939 # "999"
288     43 # bytes(3)
289     616161 # "aaa"
290     43 # bytes(3)
291     626262 # "bbb"
292     43 # bytes(3)
293     636363 # "ccc"
294     43 # bytes(3)
295     646464 # "ddd"
296     43 # bytes(3)
297     656565 # "eee"
298     43 # bytes(3)
299     666666 # "fff"
300     43 # bytes(3)
301     676767 # "ggg"
302     43 # bytes(3)
303     686868 # "hhh"
304     43 # bytes(3)
305     696969 # "iii"
306     43 # bytes(3)
307     6a6a6a # "jjj"
308     43 # bytes(3)
309     6b6b6b # "kkk"
310     43 # bytes(3)
311     6c6c6c # "lll"
312     43 # bytes(3)
313     6d6d6d # "mmm"
314     43 # bytes(3)
315     6e6e6e # "nnn"
316     43 # bytes(3)
317     6f6f6f # "ooo"
318     43 # bytes(3)
319     707070 # "ppp"
320     43 # bytes(3)
321     717171 # "qqq"
322     43 # bytes(3)
323     727272 # "rrr"
324 root 1.5 d8 19 # tag(25)
325     01 # unsigned(1)
326 root 1.3 44 # bytes(4)
327     73737373 # "ssss"
328     d8 19 # tag(25)
329     17 # unsigned(23)
330     43 # bytes(3)
331     727272 # "rrr"
332     d8 19 # tag(25)
333     18 18 # unsigned(24)
334    
335     This example shows three stringref-namespace tags, two of which are nested
336     inside another:
337    
338     256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])
339    
340     d9 0100 # tag(256)
341     85 # array(5)
342     63 # text(3)
343     616161 # "aaa"
344     d8 19 # tag(25)
345     00 # unsigned(0)
346     d9 0100 # tag(256)
347     83 # array(3)
348     63 # text(3)
349     626262 # "bbb"
350     63 # text(3)
351     616161 # "aaa"
352     d8 19 # tag(25)
353     01 # unsigned(1)
354     d9 0100 # tag(256)
355     82 # array(2)
356     63 # text(3)
357     636363 # "ccc"
358     d8 19 # tag(25)
359     00 # unsigned(0)
360     d8 19 # tag(25)
361     00 # unsigned(0)
362    
363     The decoded data structure might look like this:
364    
365     ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]
366    
367 root 1.4 =head1 IMPLEMENTATIONS
368    
369     This section lists known implementations of this extension (L<drop me a
370     mail|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be
371     listed here).
372    
373     =over 4
374    
375     =item * [Perl] L<CBOR::XS|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation)
376    
377     =back
378 root 1.1