ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.7
Committed: Mon Apr 1 04:17:50 2019 UTC (5 years, 1 month ago) by root
Branch: MAIN
CVS Tags: rel-1_8, rel-1_82, rel-1_83, rel-1_81, rel-1_86, rel-1_87, rel-1_84, rel-1_85, HEAD
Changes since 1.6: +2 -0 lines
Log Message:
*** empty log message ***

File Contents

# Content
1 =head1 REGISTRATION INFORMATION
2
3 Tag 256 (stringref-namespace)
4 Data Item multiple
5 Semantics mark value as having string references
6 Reference http://cbor.schmorp.de/stringref
7 Contact Marc A. Lehmann <cbor@schmorp.de>
8
9 Tag 25 (stringref)
10 Data Item unsigned integer
11 Semantics reference the nth previously seen string
12 Reference http://cbor.schmorp.de/stringref
13 Contact Marc A. Lehmann <cbor@schmorp.de>
14
15 =head1 RATIONALE
16
17 These two tags can be used to efficiently share constant strings, which
18 are common in many data structures. They are an optimisation only,
19 similar to LZSS compresssion, for the string major types in CBOR.
20
21 Background: many data structures or protocols contain repeated string
22 constants (often called atoms or quarks). For example, this is a
23 real-world (JSON-formatted) data structure that contains many repeated
24 strings used as map keys:
25
26 [
27 {
28 "name" : "Cocktail",
29 "count" : 417,
30 "rank" : 4
31 },
32 {
33 "rank" : 4,
34 "count" : 312,
35 "name" : "Bath"
36 },
37 {
38 "count" : 691,
39 "name" : "Food",
40 "rank" : 4
41 }
42 ]
43
44 This can be encoded nicely with CBOR, but each occurrence of "name",
45 "count" and "rank" will be encoded separately. In highly repetitive data
46 structures (the above is taken from a game save file) these can take up
47 considerable space.
48
49 This scheme can be used to reduce this overhead with a simple scheme that
50 is easily implementable.
51
52 =head1 DESCRIPTION
53
54 Stringref consists of two tags, stringref-namespace (value C<256>),
55 which marks a value as containing string references, and stringref (value
56 C<25>), which references a string previously encoded in the value.
57
58 The stringref-namespace tag is used to define a namespace for the string
59 reference ids. stringref tags are only valid inside CBOR values marked
60 with stringref-namespace.
61
62 The purpose of stringref-namespace is three-fold:
63
64 =over 4
65
66 =item 1. It signals the decoder that it needs to keep a copy of every
67 string that is being decoded while decoding the tagged value (and
68 conversely, that there is no need for this extra overhead without this
69 tag).
70
71 =item 2. It can be used to confine string references to certain subtrees,
72 which might increase encoding efficiency by being able to use smaller
73 references.
74
75 =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76 existing CBOR streams, as the indices are relative to the outer namespace
77 tag, and don't change when embedded.
78
79 =back
80
81 Within a value tagged with stringref-namespace, every string that is
82 encoded with a definite length and has a minimum length is assigned an
83 implicit index, starting from zero.
84
85 The minimum length required to get an index depends on the index value
86 that would be assigned to it, as stringref references should be shorter
87 than the string they reference, and there is no point in wasting indices
88 on strings that should not be referenced anyway.
89
90 The lengths are as follows:
91
92 string index minimum string length in octets
93 0 .. 23 3
94 24 .. 255 4
95 256 .. 65535 5
96 65536 .. 4294967295 7
97 4294967296 .. 11
98
99 The minimum string length is simply the length of the stringref tag (2
100 octets), plus the minimum size the resulting index takes up, using major
101 type 0 encoding.
102
103 Each time a string is to be encoded by the encoder, it can choose to
104 encode it as a normal string, or attempt to encode it as a stringref.
105
106 If it attempts to encode a stringref, it needs to check whether the string
107 has been assigned an index already. If yes, it can emit a stringref tag
108 with the index.
109
110 In all other cases the encoder must check whether it has to assign an
111 index value to it by checking against the minimum length required for the
112 next index value to be assigned, and assigning the next index value if the
113 string has minimum length or is longer.
114
115 A typical encoder might want to look up every string (using the major
116 type and the octets) it encodes in some kind of internal map that stores
117 previously-assigned indices. If found, it will emit a stringref. If not,
118 it attempts to assign an index to the string and stores that in its map.
119
120 Decoders might choose not to look up previous indices for a string, for
121 example, because it only wants to compress strings used as map keys, or
122 because it intrinsically knows that the string is only used once. This is
123 fine, but each time the encoder emits a string, it MUST assign an index if
124 the minimum length is reached or exceeded, even if it doesn't store that
125 index for later use.
126
127 Byte strings and text strings share the same index namespace, but are
128 distinct, even if their contents are bit-identical. Indefinite length
129 strings and the chunks they consist of are never assigned an index.
130
131 A typical encoder would emit the stringref-namespace tag once to tag
132 the top-level value in a CBOR stream, but nothing keeps an encoder
133 from nesting stringref-namespace tags. Nesting might be done for
134 convenience, to improve coding efficiency, to embed values or for any
135 other reason. Similarly, an encoder can sort map keys for greater coding
136 efficiency, but is not required to.
137
138 A decoder follows the same process of assigning indices. A possible
139 implementation could be outlined like this:
140
141 When a decoder supports stringref, then, upon encountering a
142 stringref-namespace tag, it should initialise an array and start decoding
143 the tagged value. The array stores, for each index, the major type (byte
144 or text) and the string value. Implementations are free to use their
145 internal format, of course - languages that do not distinguish between
146 text and byte strings can just store their internal string type in the
147 array.
148
149 Since stringref-namespace tags can be nested, the decoder needs to save
150 and restore the outer array before starting and after ending the decoding
151 of the tagged value. In a recursive implementation this could simply look
152 like this:
153
154 global stringref_array; // global per encoder
155 local outer_stringref_array;
156
157 outer_stringref_array = stringref_array;
158 stringref_array = allocate_array ();
159
160 value = decode_cbor_value ();
161
162 deallocate_array (stringref_array);
163 stringref_array = outer_stringref_array;
164
165 return value;
166
167 When a stringref tag is encountered, it MUST contain an unsigned integer
168 (major type 0). The integer MUST be less than the number of entries in the
169 array. The decoder than looks up the string in the array, given the index,
170 and copies it to the result (string values should not be shared unless
171 the language normally does so - if strings are mutable, a copy should be
172 decoded).
173
174 When a definite-length text or byte string is encountered, it is decoded
175 normally. Afterwards, the decoder checks whether it needs to assign an
176 index to the string, by comparing the string length against the minimum
177 length required by the index. If the string is shorter, nothing happens,
178 otherwise the next free index is used and the string (and the information
179 whether it is a byte or text string) is recorded at the end of the array.
180
181 In languages with dynamic arrays, this could be accomplished by using
182 the array length as the next index to be assigned, and pushing the
183 string onto the end of the array when it is long enough.
184
185 =head2 IMPLEMENTATION NOTE
186
187 The semantics of stringref tags require the decoder to be aware and the
188 encoder to be under control of the sequence in which data items are
189 encoded into the CBOR stream. This means these tags cannot be implemented
190 on top of every generic CBOR encoder/decoder (which might reorder entries
191 in a map); they typically need to be integrated into their works.
192
193 =head2 DESIGN RATIONALE
194
195 The stringref tag was chosen to be short, without requiring standards
196 action. The namespace tag is rare, so doesn't benefit from a short
197 encoding as much.
198
199 Implicit tagging/counting was chosen to support stream encoders. Having
200 to tag strings first requires either multiple passes over the data (which
201 might not be available, ruling out some encoders) or tagging more strings
202 than needed (wasting space). Explicit tagging also isn't necessarily
203 better even under optimal conditions, as the explicit tags waste space.
204
205 Stream decoders are affected less by implicit tagging than encoders.
206
207 The namespace tag was introduced for two reasons: first to allow embedding
208 of CBOR strings into other CBOR strings, secondly for decoding efficiency
209 - the decoder only has to expect stringref tags inside namespaces and
210 therefore doesn't have to maintain extra state outside of them.
211
212 =head1 EXAMPLES
213
214 The array-of-maps from the rationale example would normally compress to a
215 CBOR text of 83 bytes. Using this extension where possible, this reduces
216 to 74 bytes:
217
218 d9 0100 # tag(256)
219 83 # array(3)
220 a3 # map(3)
221 44 # bytes(4)
222 72616e6b # "rank"
223 04 # unsigned(4)
224 45 # bytes(5)
225 636f756e74 # "count"
226 19 01a1 # unsigned(417)
227 44 # bytes(4)
228 6e616d65 # "name"
229 48 # bytes(8)
230 436f636b7461696c # "Cocktail"
231 a3 # map(3)
232 d8 19 # tag(25)
233 02 # unsigned(2)
234 44 # bytes(4)
235 42617468 # "Bath"
236 d8 19 # tag(25)
237 01 # unsigned(1)
238 19 0138 # unsigned(312)
239 d8 19 # tag(25)
240 00 # unsigned(0)
241 04 # unsigned(4)
242 a3 # map(3)
243 d8 19 # tag(25)
244 02 # unsigned(2)
245 44 # bytes(4)
246 466f6f64 # "Food"
247 d8 19 # tag(25)
248 01 # unsigned(1)
249 19 02b3 # unsigned(691)
250 d8 19 # tag(25)
251 00 # unsigned(0)
252 04 # unsigned(4)
253
254 The following JSON array illustrates the effect of the index on the
255 minimum string length:
256
257 [ "1", "222", "333", "4", "555", "666", "777", "888", "999",
258 "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
259 "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
260 "333",
261 "ssss",
262 "qqq", "rrr", "ssss"]
263
264 The strings "1", "4" and "rrr" are too short to get an index assigned. All
265 others that are not encoded with a stringref do (this assumes that JSON
266 strings are encoded as CBOR byte strings):
267
268 d9 0100 # tag(256)
269 98 20 # array(32)
270 41 # bytes(1)
271 31 # "1"
272 43 # bytes(3)
273 323232 # "222"
274 43 # bytes(3)
275 333333 # "333"
276 41 # bytes(1)
277 34 # "4"
278 43 # bytes(3)
279 353535 # "555"
280 43 # bytes(3)
281 363636 # "666"
282 43 # bytes(3)
283 373737 # "777"
284 43 # bytes(3)
285 383838 # "888"
286 43 # bytes(3)
287 393939 # "999"
288 43 # bytes(3)
289 616161 # "aaa"
290 43 # bytes(3)
291 626262 # "bbb"
292 43 # bytes(3)
293 636363 # "ccc"
294 43 # bytes(3)
295 646464 # "ddd"
296 43 # bytes(3)
297 656565 # "eee"
298 43 # bytes(3)
299 666666 # "fff"
300 43 # bytes(3)
301 676767 # "ggg"
302 43 # bytes(3)
303 686868 # "hhh"
304 43 # bytes(3)
305 696969 # "iii"
306 43 # bytes(3)
307 6a6a6a # "jjj"
308 43 # bytes(3)
309 6b6b6b # "kkk"
310 43 # bytes(3)
311 6c6c6c # "lll"
312 43 # bytes(3)
313 6d6d6d # "mmm"
314 43 # bytes(3)
315 6e6e6e # "nnn"
316 43 # bytes(3)
317 6f6f6f # "ooo"
318 43 # bytes(3)
319 707070 # "ppp"
320 43 # bytes(3)
321 717171 # "qqq"
322 43 # bytes(3)
323 727272 # "rrr"
324 d8 19 # tag(25)
325 01 # unsigned(1)
326 44 # bytes(4)
327 73737373 # "ssss"
328 d8 19 # tag(25)
329 17 # unsigned(23)
330 43 # bytes(3)
331 727272 # "rrr"
332 d8 19 # tag(25)
333 18 18 # unsigned(24)
334
335 This example shows three stringref-namespace tags, two of which are nested
336 inside another:
337
338 256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])
339
340 d9 0100 # tag(256)
341 85 # array(5)
342 63 # text(3)
343 616161 # "aaa"
344 d8 19 # tag(25)
345 00 # unsigned(0)
346 d9 0100 # tag(256)
347 83 # array(3)
348 63 # text(3)
349 626262 # "bbb"
350 63 # text(3)
351 616161 # "aaa"
352 d8 19 # tag(25)
353 01 # unsigned(1)
354 d9 0100 # tag(256)
355 82 # array(2)
356 63 # text(3)
357 636363 # "ccc"
358 d8 19 # tag(25)
359 00 # unsigned(0)
360 d8 19 # tag(25)
361 00 # unsigned(0)
362
363 The decoded data structure might look like this:
364
365 ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]
366
367 =head1 IMPLEMENTATIONS
368
369 This section lists known implementations of this extension (L<drop me a
370 mail|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be
371 listed here).
372
373 =over 4
374
375 =item * [Perl] L<CBOR::XS|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation)
376
377 =item * [C++] L<jsoncons|https://github.com/danielaparker/jsoncons>
378
379 =back
380