ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.5
Committed: Wed Apr 25 06:37:12 2018 UTC (6 years, 1 month ago) by root
Branch: MAIN
Changes since 1.4: +5 -4 lines
Log Message:
*** empty log message ***

File Contents

# Content
1 =head1 REGISTRATION INFORMATION
2
3 Tag 256 (stringref-namespace)
4 Data Item multiple
5 Semantics mark value as having string references
6 Reference http://cbor.schmorp.de/stringref
7 Contact Marc A. Lehmann <cbor@schmorp.de>
8
9 Tag 25 (stringref)
10 Data Item unsigned integer
11 Semantics reference the nth previously seen string
12 Reference http://cbor.schmorp.de/stringref
13 Contact Marc A. Lehmann <cbor@schmorp.de>
14
15 =head1 RATIONALE
16
17 These two tags can be used to efficiently share constant strings, which
18 are common in many data structures. They are an optimisation only,
19 similar to LZSS compresssion, for the string major types in CBOR.
20
21 Background: many data structures or protocols contain repeated string
22 constants (often called atoms or quarks). For example, this is a
23 real-world (JSON-formatted) data structure that contains many repeated
24 strings used as map keys:
25
26 [
27 {
28 "name" : "Cocktail",
29 "count" : 417,
30 "rank" : 4
31 },
32 {
33 "rank" : 4,
34 "count" : 312,
35 "name" : "Bath"
36 },
37 {
38 "count" : 691,
39 "name" : "Food",
40 "rank" : 4
41 }
42 ]
43
44 This can be encoded nicely with CBOR, but each occurrence of "name",
45 "count" and "rank" will be encoded separately. In highly repetitive data
46 structures (the above is taken from a game save file) these can take up
47 considerable space.
48
49 This scheme can be used to reduce this overhead with a simple scheme that
50 is easily implementable.
51
52 =head1 DESCRIPTION
53
54 Stringref consists of two tags, stringref-namespace (value C<256>),
55 which marks a value as containing string references, and stringref (value
56 C<25>), which references a string previously encoded in the value.
57
58 The stringref-namespace tag is used to define a namespace for the string
59 reference ids. stringref tags are only valid inside CBOR values marked
60 with stringref-namespace.
61
62 The purpose of stringref-namespace is three-fold:
63
64 =over 4
65
66 =item 1. It signals the decoder that it needs to keep a copy of every
67 string that is being decoded while decoding the tagged value (and
68 conversely, that there is no need for this extra overhead without this
69 tag).
70
71 =item 2. It can be used to confine string references to certain subtrees,
72 which might increase encoding efficiency by being able to use smaller
73 references.
74
75 =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76 existing CBOR streams, as the indices are relative to the outer namespace
77 tag, and don't change when embedded.
78
79 =back
80
81 Within a value tagged with stringref-namespace, every string that is
82 encoded with a definite length and has a minimum length is assigned an
83 implicit index, starting from zero.
84
85 The minimum length required to get an index depends on the index value
86 that would be assigned to it, as stringref references should be shorter
87 than the string they reference, and there is no point in wasting indices
88 on strings that should not be referenced anyway.
89
90 The lengths are as follows:
91
92 string index minimum string length in octets
93 0 .. 23 3
94 24 .. 255 4
95 256 .. 65535 5
96 65536 .. 4294967295 7
97 4294967296 .. 11
98
99 The minimum string length is simply the length of the stringref tag (2
100 octets), plus the minimum size the resulting index takes up, using major
101 type 0 encoding.
102
103 Each time a string is to be encoded by the encoder, it can choose to
104 encode it as a normal string, or attempt to encode it as a stringref.
105
106 If it attempts to encode a stringref, it needs to check whether the string
107 has been assigned an index already. If yes, it can emit a stringref tag
108 with the index.
109
110 In all other cases the encoder must check whether it has to assign an
111 index value to it by checking against the minimum length required for the
112 next index value to be assigned, and assigning the next index value if the
113 string has minimum length or is longer.
114
115 A typical encoder might want to look up every string (using the major
116 type and the octets) it encodes in some kind of internal map that stores
117 previously-assigned indices. If found, it will emit a stringref. If not,
118 it attempts to assign an index to the string and stores that in its map.
119
120 Decoders might choose not to look up previous indices for a string, for
121 example, because it only wants to compress strings used as map keys, or
122 because it intrinsically knows that the string is only used once. This is
123 fine, but each time the encoder emits a string, it MUST assign an index if
124 the minimum length is reached or exceeded, even if it doesn't store that
125 index for later use.
126
127 Byte strings and text strings share the same index namespace, but are
128 distinct, even if their contents are bit-identical. Indefinite length
129 strings and the chunks they consist of are never assigned an index.
130
131 A typical encoder would emit the stringref-namespace tag once to tag
132 the top-level value in a CBOR stream, but nothing keeps an encoder
133 from nesting stringref-namespace tags. Nesting might be done for
134 convenience, to improve coding efficiency, to embed values or for any
135 other reason. Similarly, an encoder can sort map keys for greater coding
136 efficiency, but is not required to.
137
138 A decoder follows the same process of assigning indices. A possible
139 implementation could be outlined like this:
140
141 When a decoder supports stringref, then, upon encountering a
142 stringref-namespace tag, it should initialise an array and start decoding
143 the tagged value. The array stores, for each index, the major type (byte
144 or text) and the string value. Implementations are free to use their
145 internal format, of course - languages that do not distinguish between
146 text and byte strings can just store their internal string type in the
147 array.
148
149 Since stringref-namespace tags can be nested, the decoder needs to save
150 and restore the outer array before starting and after ending the decoding
151 of the tagged value. In a recursive implementation this could simply look
152 like this:
153
154 global stringref_array; // global per encoder
155 local outer_stringref_array;
156
157 outer_stringref_array = stringref_array;
158 stringref_array = allocate_array ();
159
160 value = decode_cbor_value ();
161
162 deallocate_array (stringref_array);
163 stringref_array = outer_stringref_array;
164
165 return value;
166
167 When a stringref tag is encountered, it MUST contain an unsigned integer
168 (major type 0). The integer MUST be less than the number of entries in the
169 array. The decoder than looks up the string in the array, given the index,
170 and copies it to the result (string values should not be shared unless
171 the language normally does so - if strings are mutable, a copy should be
172 decoded).
173
174 When a definite-length text or byte string is encountered, it is decoded
175 normally. Afterwards, the decoder checks whether it needs to assign an
176 index to the string, by comparing the string length against the minimum
177 length required by the index. If the string is shorter, nothing happens,
178 otherwise the next free index is used and the string (and the information
179 whether it is a byte or text string) is recorded at the end of the array.
180
181 In languages with dynamic arrays, this could be accomplished by using
182 the array length as the next index to be assigned, and pushing the
183 string onto the end of the array when it is long enough.
184
185 =head2 IMPLEMENTATION NOTE
186
187 The semantics of stringref tags require the decoder to be aware and the
188 encoder to be under control of the sequence in which data items are
189 encoded into the CBOR stream. This means these tags cannot be implemented
190 on top of every generic CBOR encoder/decoder (which might reorder entries
191 in a map); they typically need to be integrated into their works.
192
193 =head1 EXAMPLES
194
195 The array-of-maps from the rationale example would normally compress to a
196 CBOR text of 83 bytes. Using this extension where possible, this reduces
197 to 74 bytes:
198
199 d9 0100 # tag(256)
200 83 # array(3)
201 a3 # map(3)
202 44 # bytes(4)
203 72616e6b # "rank"
204 04 # unsigned(4)
205 45 # bytes(5)
206 636f756e74 # "count"
207 19 01a1 # unsigned(417)
208 44 # bytes(4)
209 6e616d65 # "name"
210 48 # bytes(8)
211 436f636b7461696c # "Cocktail"
212 a3 # map(3)
213 d8 19 # tag(25)
214 02 # unsigned(2)
215 44 # bytes(4)
216 42617468 # "Bath"
217 d8 19 # tag(25)
218 01 # unsigned(1)
219 19 0138 # unsigned(312)
220 d8 19 # tag(25)
221 00 # unsigned(0)
222 04 # unsigned(4)
223 a3 # map(3)
224 d8 19 # tag(25)
225 02 # unsigned(2)
226 44 # bytes(4)
227 466f6f64 # "Food"
228 d8 19 # tag(25)
229 01 # unsigned(1)
230 19 02b3 # unsigned(691)
231 d8 19 # tag(25)
232 00 # unsigned(0)
233 04 # unsigned(4)
234
235 The following JSON array illustrates the effect of the index on the
236 minimum string length:
237
238 [ "1", "222", "333", "4", "555", "666", "777", "888", "999",
239 "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
240 "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
241 "333",
242 "ssss",
243 "qqq", "rrr", "ssss"]
244
245 The strings "1", "4" and "rrr" are too short to get an index assigned. All
246 others that are not encoded with a stringref do (this assumes that JSON
247 strings are encoded as CBOR byte strings):
248
249 d9 0100 # tag(256)
250 98 20 # array(32)
251 41 # bytes(1)
252 31 # "1"
253 43 # bytes(3)
254 323232 # "222"
255 43 # bytes(3)
256 333333 # "333"
257 41 # bytes(1)
258 34 # "4"
259 43 # bytes(3)
260 353535 # "555"
261 43 # bytes(3)
262 363636 # "666"
263 43 # bytes(3)
264 373737 # "777"
265 43 # bytes(3)
266 383838 # "888"
267 43 # bytes(3)
268 393939 # "999"
269 43 # bytes(3)
270 616161 # "aaa"
271 43 # bytes(3)
272 626262 # "bbb"
273 43 # bytes(3)
274 636363 # "ccc"
275 43 # bytes(3)
276 646464 # "ddd"
277 43 # bytes(3)
278 656565 # "eee"
279 43 # bytes(3)
280 666666 # "fff"
281 43 # bytes(3)
282 676767 # "ggg"
283 43 # bytes(3)
284 686868 # "hhh"
285 43 # bytes(3)
286 696969 # "iii"
287 43 # bytes(3)
288 6a6a6a # "jjj"
289 43 # bytes(3)
290 6b6b6b # "kkk"
291 43 # bytes(3)
292 6c6c6c # "lll"
293 43 # bytes(3)
294 6d6d6d # "mmm"
295 43 # bytes(3)
296 6e6e6e # "nnn"
297 43 # bytes(3)
298 6f6f6f # "ooo"
299 43 # bytes(3)
300 707070 # "ppp"
301 43 # bytes(3)
302 717171 # "qqq"
303 43 # bytes(3)
304 727272 # "rrr"
305 d8 19 # tag(25)
306 01 # unsigned(1)
307 44 # bytes(4)
308 73737373 # "ssss"
309 d8 19 # tag(25)
310 17 # unsigned(23)
311 43 # bytes(3)
312 727272 # "rrr"
313 d8 19 # tag(25)
314 18 18 # unsigned(24)
315
316 This example shows three stringref-namespace tags, two of which are nested
317 inside another:
318
319 256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])
320
321 d9 0100 # tag(256)
322 85 # array(5)
323 63 # text(3)
324 616161 # "aaa"
325 d8 19 # tag(25)
326 00 # unsigned(0)
327 d9 0100 # tag(256)
328 83 # array(3)
329 63 # text(3)
330 626262 # "bbb"
331 63 # text(3)
332 616161 # "aaa"
333 d8 19 # tag(25)
334 01 # unsigned(1)
335 d9 0100 # tag(256)
336 82 # array(2)
337 63 # text(3)
338 636363 # "ccc"
339 d8 19 # tag(25)
340 00 # unsigned(0)
341 d8 19 # tag(25)
342 00 # unsigned(0)
343
344 The decoded data structure might look like this:
345
346 ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]
347
348 =head1 IMPLEMENTATIONS
349
350 This section lists known implementations of this extension (L<drop me a
351 mail|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be
352 listed here).
353
354 =over 4
355
356 =item * [Perl] L<CBOR::XS|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation)
357
358 =back
359