ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.3
Committed: Thu Nov 28 10:16:50 2013 UTC (10 years, 7 months ago) by root
Branch: MAIN
CVS Tags: rel-1_11, rel-1_12, rel-1_1, rel-1_0, rel-1_3, rel-1_5, rel-1_4, rel-1_6, rel-1_25, rel-1_26, rel-1_41
Changes since 1.2: +161 -10 lines
Log Message:
*** empty log message ***

File Contents

# User Rev Content
1 root 1.1 =head1 REGISTRATION INFORMATION
2    
3 root 1.3 Tag 256 (stringref-namespace)
4 root 1.1 Data Item multiple
5     Semantics mark value as having string references
6     Reference http://cbor.schmorp.de/stringref
7     Contact Marc A. Lehmann <cbor@schmorp.de>
8    
9 root 1.3 Tag 25 (stringref)
10 root 1.1 Data Item unsigned integer
11     Semantics reference the nth previously seen string
12     Reference http://cbor.schmorp.de/stringref
13     Contact Marc A. Lehmann <cbor@schmorp.de>
14    
15     =head1 RATIONALE
16    
17     These two tags can be used to efficiently share constant strings, which
18     are common in many data structures. They are an optimisation only,
19     similar to LZSS compresssion, for the string major types in CBOR.
20    
21     Background: many data structures or protocols contain repeated string
22     constants (often called atoms or quarks). For example, this is a
23     real-world (JSON-formatted) data structure that contains many repeated
24     strings used as map keys:
25    
26     [
27     {
28     "name" : "Cocktail",
29     "count" : 417,
30     "rank" : 4
31     },
32     {
33     "rank" : 4,
34     "count" : 312,
35     "name" : "Bath"
36     },
37     {
38     "count" : 691,
39     "name" : "Food",
40     "rank" : 4
41     }
42     ]
43    
44     This can be encoded nicely with CBOR, but each occurrence of "name",
45     "count" and "rank" will be encoded separately. In highly repetitive data
46     structures (the above is taken from a game save file) these can take up
47     considerable space.
48    
49     This scheme can be used to reduce this overhead with a simple scheme that
50     is easily implementable.
51    
52     =head1 DESCRIPTION
53    
54 root 1.3 Stringref consists of two tags, stringref-namespace (value C<256>),
55 root 1.1 which marks a value as containing string references, and stringref (value
56 root 1.3 C<25>), which references a string previously encoded in the value.
57 root 1.1
58     The stringref-namespace tag is used to define a namespace for the string
59     reference ids. stringref tags are only valid inside CBOR values marked
60     with stringref-namespace.
61    
62     The purpose of stringref-namespace is three-fold:
63    
64     =over 4
65    
66     =item 1. It signals the decoder that it needs to keep a copy of every
67     string that is being decoded while decoding the tagged value (and
68     conversely, that there is no need for this extra overhead without this
69     tag).
70    
71     =item 2. It can be used to confine string references to certain subtrees,
72     which might increase encoding efficiency by being able to use smaller
73     references.
74    
75     =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76     existing CBOR streams, as the indices are relative to the outer namespace
77     tag, and don't change when embedded.
78    
79     =back
80    
81     Within a value tagged with stringref-namespace, every string that is
82     encoded with a definite length and has a minimum length is assigned an
83     implicit index, starting from zero.
84    
85     The minimum length required to get an index depends on the index value
86     that would be assigned to it, as stringref references should be shorter
87     than the string they reference, and there is no point in wasting indices
88     on strings that should not be referenced anyway.
89    
90     The lengths are as follows:
91    
92     string index minimum string length in octets
93     0 .. 23 3
94     24 .. 255 4
95     256 .. 65535 5
96 root 1.2 65536 .. 4294967295 7
97     4294967296 .. 11
98 root 1.1
99     The minimum string length is simply the length of the stringref tag (2
100     octets), plus the minimum size the resulting index takes up, using major
101     type 0 encoding.
102    
103     Each time a string is to be encoded by the encoder, it can choose to
104     encode it as a normal string, or attempt to encode it as a stringref.
105    
106     If it attempts to encode a stringref, it needs to check whether the string
107     has been assigned an index already. If yes, it can emit a stringref tag
108     with the index.
109    
110     In all other cases the encoder must check whether it has to assign an
111     index value to it by checking against the minimum length required for the
112     next index value to be assigned, and assigning the next index value if the
113     string has minimum length or is longer.
114    
115     A typical encoder might want to look up every string (using the major
116     type and the octets) it encodes in some kind of internal map that stores
117     previously-assigned indices. If found, it will emit a stringref. If not,
118     it attempts to assign an index to the string and stores that in its map.
119    
120     Decoders might choose not to look up previous indices for a string, for
121     example, because it only wants to compress strings used as map keys, or
122     because it intrinsically knows that the string is only used once. This is
123     fine, but each time the encoder emits a string, it MUST assign an index if
124     the minimum length is reached or exceeded, even if it doesn't store that
125     index for later use.
126    
127     Byte strings and text strings share the same index namespace, but are
128     distinct, even if their contents are bit-identical. Indefinite length
129     strings and the chunks they consist of are never assigned an index.
130    
131     A typical encoder would emit the stringref-namespace tag once to tag
132     the top-level value in a CBOR stream, but nothing keeps an encoder
133     from nesting stringref-namespace tags. Nesting might be done for
134     convenience, to improve coding efficiency, to embed values or for any
135     other reason. Similarly, an encoder can sort map keys for greater coding
136     efficiency, but is not required to.
137    
138     A decoder follows the same process of assigning indices. A possible
139     implementation could be outlined like this:
140    
141     When a decoder supports stringref, then, upon encountering a
142     stringref-namespace tag, it should initialise an array and start decoding
143     the tagged value. The array stores, for each index, the major type (byte
144     or text) and the string value. Implementations are free to use their
145     internal format, of course - languages that do not distinguish between
146     text and byte strings can just store their internal string type in the
147     array.
148    
149     Since stringref-namespace tags can be nested, the decoder needs to save
150     and restore the outer array before starting and after ending the decoding
151     of the tagged value. In a recursive implementation this could simply look
152     like this:
153    
154     global stringref_array; // global per encoder
155     local outer_stringref_array;
156    
157     outer_stringref_array = stringref_array;
158     stringref_array = allocate_array ();
159    
160     value = decode_cbor_value ();
161    
162     deallocate_array (stringref_array);
163     stringref_array = outer_stringref_array;
164    
165     return value;
166    
167     When a stringref tag is encountered, it MUST contain an unsigned integer
168     (major type 0). The integer MUST be less than the number of entries in the
169     array. The decoder than looks up the string in the array, given the index,
170     and copies it to the result (string values should not be shared unless
171     the language normally does so - if strings are mutable, a copy should be
172     decoded).
173    
174     When a definite-length text or byte string is encountered, it is decoded
175     normally. Afterwards, the decoder checks whether it needs to assign an
176     index to the string, by comparing the string length against the minimum
177     length required by the index. If the string is shorter, nothing happens,
178     otherwise the next free index is used and the string (and the information
179     whether it is a byte or text string) is recorded at the end of the array.
180    
181     In languages with dynamic arrays, this could be accomplished by using
182     the array length as the next index to be assigned, and pushing the
183     string onto the end of the array when it is long enough.
184    
185     =head2 IMPLEMENTATION NOTE
186    
187 root 1.3 The semantics of stringref tags require the decoder to be aware and the
188     encoder to be under control of the sequence in which data items are
189     encoded into the CBOR stream. This means these tags cannot be implemented
190     on top of every generic CBOR encoder/decoder (which might reorder entries
191     in a map); they typically need to be integrated into their works.
192 root 1.1
193     =head1 EXAMPLES
194    
195 root 1.3 The array-of-maps from the rationale example would normally compress to a
196     CBOR text of 83 bytes. Using this extension where possible, this reduces
197     to 74 bytes:
198    
199     d9 0100 # tag(256)
200     83 # array(3)
201     a3 # map(3)
202     44 # bytes(4)
203     72616e6b # "rank"
204     04 # unsigned(4)
205     45 # bytes(5)
206     636f756e74 # "count"
207     19 01a1 # unsigned(417)
208     44 # bytes(4)
209     6e616d65 # "name"
210     48 # bytes(8)
211     436f636b7461696c # "Cocktail"
212     a3 # map(3)
213     d8 19 # tag(25)
214     02 # unsigned(2)
215     44 # bytes(4)
216     42617468 # "Bath"
217     d8 19 # tag(25)
218     01 # unsigned(1)
219     19 0138 # unsigned(312)
220     d8 19 # tag(25)
221     00 # unsigned(0)
222     04 # unsigned(4)
223     a3 # map(3)
224     d8 19 # tag(25)
225     02 # unsigned(2)
226     44 # bytes(4)
227     466f6f64 # "Food"
228     d8 19 # tag(25)
229     01 # unsigned(1)
230     19 02b3 # unsigned(691)
231     d8 19 # tag(25)
232     00 # unsigned(0)
233     04 # unsigned(4)
234    
235     The following JSON array illustrates the effect of the index on the
236     minimum string length:
237    
238     [ "1", "222", "333", "4", "555", "666", "777", "888", "999",
239     "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
240     "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
241     "333",
242     "ssss",
243     "qqq", "rrr", "ssss"]
244    
245     The strings "1", "4" and "rrr" are too short to get an index assigned. All others that are
246     not encoded with a stringref do:
247    
248     d9 0100 # tag(256)
249     98 20 # array(32)
250     41 # bytes(1)
251     31 # "1"
252     43 # bytes(3)
253     323232 # "222"
254     43 # bytes(3)
255     333333 # "333"
256     41 # bytes(1)
257     34 # "4"
258     43 # bytes(3)
259     353535 # "555"
260     43 # bytes(3)
261     363636 # "666"
262     43 # bytes(3)
263     373737 # "777"
264     43 # bytes(3)
265     383838 # "888"
266     43 # bytes(3)
267     393939 # "999"
268     43 # bytes(3)
269     616161 # "aaa"
270     43 # bytes(3)
271     626262 # "bbb"
272     43 # bytes(3)
273     636363 # "ccc"
274     43 # bytes(3)
275     646464 # "ddd"
276     43 # bytes(3)
277     656565 # "eee"
278     43 # bytes(3)
279     666666 # "fff"
280     43 # bytes(3)
281     676767 # "ggg"
282     43 # bytes(3)
283     686868 # "hhh"
284     43 # bytes(3)
285     696969 # "iii"
286     43 # bytes(3)
287     6a6a6a # "jjj"
288     43 # bytes(3)
289     6b6b6b # "kkk"
290     43 # bytes(3)
291     6c6c6c # "lll"
292     43 # bytes(3)
293     6d6d6d # "mmm"
294     43 # bytes(3)
295     6e6e6e # "nnn"
296     43 # bytes(3)
297     6f6f6f # "ooo"
298     43 # bytes(3)
299     707070 # "ppp"
300     43 # bytes(3)
301     717171 # "qqq"
302     43 # bytes(3)
303     727272 # "rrr"
304     44 # bytes(4)
305     73737373 # "ssss"
306     d8 19 # tag(25)
307     01 # unsigned(1)
308     d8 19 # tag(25)
309     17 # unsigned(23)
310     43 # bytes(3)
311     727272 # "rrr"
312     d8 19 # tag(25)
313     18 18 # unsigned(24)
314    
315     This example shows three stringref-namespace tags, two of which are nested
316     inside another:
317    
318     256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])
319    
320     d9 0100 # tag(256)
321     85 # array(5)
322     63 # text(3)
323     616161 # "aaa"
324     d8 19 # tag(25)
325     00 # unsigned(0)
326     d9 0100 # tag(256)
327     83 # array(3)
328     63 # text(3)
329     626262 # "bbb"
330     63 # text(3)
331     616161 # "aaa"
332     d8 19 # tag(25)
333     01 # unsigned(1)
334     d9 0100 # tag(256)
335     82 # array(2)
336     63 # text(3)
337     636363 # "ccc"
338     d8 19 # tag(25)
339     00 # unsigned(0)
340     d8 19 # tag(25)
341     00 # unsigned(0)
342    
343     The decoded data structure might look like this:
344    
345     ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]
346    
347 root 1.1