1 | =head1 REGISTRATION INFORMATION |
1 | =head1 REGISTRATION INFORMATION |
2 | |
2 | |
3 | Tag <unassigned> (stringref-namespace) |
3 | Tag 256 (stringref-namespace) |
4 | Data Item multiple |
4 | Data Item multiple |
5 | Semantics mark value as having string references |
5 | Semantics mark value as having string references |
6 | Reference http://cbor.schmorp.de/stringref |
6 | Reference http://cbor.schmorp.de/stringref |
7 | Contact Marc A. Lehmann <cbor@schmorp.de> |
7 | Contact Marc A. Lehmann <cbor@schmorp.de> |
8 | |
8 | |
9 | Tag <unassigned> (stringref) |
9 | Tag 25 (stringref) |
10 | Data Item unsigned integer |
10 | Data Item unsigned integer |
11 | Semantics reference the nth previously seen string |
11 | Semantics reference the nth previously seen string |
12 | Reference http://cbor.schmorp.de/stringref |
12 | Reference http://cbor.schmorp.de/stringref |
13 | Contact Marc A. Lehmann <cbor@schmorp.de> |
13 | Contact Marc A. Lehmann <cbor@schmorp.de> |
14 | |
14 | |
… | |
… | |
49 | This extension can reduce this overhead with a simple scheme that |
49 | This extension can reduce this overhead with a simple scheme that |
50 | is easy to implement. |
50 | is easy to implement. |
51 | |
51 | |
52 | =head1 DESCRIPTION |
52 | =head1 DESCRIPTION |
53 | |
53 | |
54 | Stringref consists of two tags, stringref-namespace (value <unassigned>), |
54 | Stringref consists of two tags, stringref-namespace (value C<256>), |
55 | which marks a value as containing string references, and stringref (value |
55 | which marks a value as containing string references, and stringref (value |
56 | <unassigned>), which references a string previously encoded in the value. |
56 | C<25>), which references a string previously encoded in the value. |
57 | |
57 | |
58 | The stringref-namespace tag is used to define a namespace for the string |
58 | The stringref-namespace tag is used to define a namespace for the string |
59 | reference ids. stringref tags are only valid inside CBOR values marked |
59 | reference ids. stringref tags are only valid inside CBOR values marked |
60 | with stringref-namespace. |
60 | with stringref-namespace. |
61 | |
61 | |
… | |
… | |
182 | the array length as the next index to be assigned, and pushing the |
182 | the array length as the next index to be assigned, and pushing the |
183 | string onto the end of the array when it is long enough. |
183 | string onto the end of the array when it is long enough. |
184 | |
184 | |
185 | =head2 IMPLEMENTATION NOTE |
185 | =head2 IMPLEMENTATION NOTE |
186 | |
186 | |
187 | The semantics of stringref tags require the decoder to be aware and |
187 | The semantics of stringref tags require the decoder to be aware of, and |
188 | the encoder to be under control of the sequence in which data items |
188 | the encoder to be in control of, the sequence in which data items are |
189 | are encoded into the CBOR stream. This means these tags cannot be |
189 | encoded into the CBOR stream. This means these tags cannot be implemented |
190 | implemented on top of every generic CBOR encoder/decoder (which might |
190 | on top of every generic CBOR encoder/decoder (which might reorder entries |
191 | reorder entries in a map); they need to be integrated into their works. |
191 | in a map); they typically need to be integrated into their inner workings. |
|
|
192 | |
|
|
193 | =head2 DESIGN RATIONALE |
|
|
194 | |
|
|
195 | The stringref tag was chosen to be short, without requiring standards |
|
|
196 | action. The namespace tag is rare, so doesn't benefit from a short |
|
|
197 | encoding as much. |
|
|
198 | |
|
|
199 | Implicit tagging/counting was chosen to support stream encoders. Having |
|
|
200 | to tag strings first requires either multiple passes over the data (which |
|
|
201 | might not be available, ruling out some encoders) or tagging more strings |
|
|
202 | than needed (wasting space). Explicit tagging also isn't necessarily |
|
|
203 | better even under optimal conditions, as the explicit tags waste space. |
|
|
204 | |
|
|
205 | Stream decoders are affected less by implicit tagging than encoders. |
|
|
206 | |
|
|
207 | The namespace tag was introduced for two reasons: first to allow embedding |
|
|
208 | of CBOR strings into other CBOR strings, secondly for decoding efficiency |
|
|
209 | - the decoder only has to expect stringref tags inside namespaces and |
|
|
210 | therefore doesn't have to maintain extra state outside of them. |
192 | |
211 | |
193 | =head1 EXAMPLES |
212 | =head1 EXAMPLES |
194 | |
213 | |
195 | <TBD> |
214 | The array-of-maps from the rationale example would normally encode to |


215 | 83 bytes of CBOR. Using this extension where possible, this reduces |


216 | to 72 bytes: |
196 | |
217 | |
|
|
218 | d9 0100 # tag(256) |
|
|
219 | 83 # array(3) |
|
|
220 | a3 # map(3) |
|
|
221 | 44 # bytes(4) |
|
|
222 | 72616e6b # "rank" |
|
|
223 | 04 # unsigned(4) |
|
|
224 | 45 # bytes(5) |
|
|
225 | 636f756e74 # "count" |
|
|
226 | 19 01a1 # unsigned(417) |
|
|
227 | 44 # bytes(4) |
|
|
228 | 6e616d65 # "name" |
|
|
229 | 48 # bytes(8) |
|
|
230 | 436f636b7461696c # "Cocktail" |
|
|
231 | a3 # map(3) |
|
|
232 | d8 19 # tag(25) |
|
|
233 | 02 # unsigned(2) |
|
|
234 | 44 # bytes(4) |
|
|
235 | 42617468 # "Bath" |
|
|
236 | d8 19 # tag(25) |
|
|
237 | 01 # unsigned(1) |
|
|
238 | 19 0138 # unsigned(312) |
|
|
239 | d8 19 # tag(25) |
|
|
240 | 00 # unsigned(0) |
|
|
241 | 04 # unsigned(4) |
|
|
242 | a3 # map(3) |
|
|
243 | d8 19 # tag(25) |
|
|
244 | 02 # unsigned(2) |
|
|
245 | 44 # bytes(4) |
|
|
246 | 466f6f64 # "Food" |
|
|
247 | d8 19 # tag(25) |
|
|
248 | 01 # unsigned(1) |
|
|
249 | 19 02b3 # unsigned(691) |
|
|
250 | d8 19 # tag(25) |
|
|
251 | 00 # unsigned(0) |
|
|
252 | 04 # unsigned(4) |
|
|
253 | |
|
|
254 | The following JSON array illustrates the effect of the index on the |
|
|
255 | minimum string length: |
|
|
256 | |
|
|
257 | [ "1", "222", "333", "4", "555", "666", "777", "888", "999", |
|
|
258 | "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", |
|
|
259 | "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr", |
|
|
260 | "333", |
|
|
261 | "ssss", |
|
|
262 | "qqq", "rrr", "ssss"] |
|
|
263 | |
|
|
264 | The strings "1", "4" and "rrr" are too short to be assigned an index ("rrr" |


265 | because index 24 needs at least 4 bytes). All other strings not encoded as a |


266 | stringref get one (this assumes JSON strings are encoded as CBOR byte strings): |
|
|
267 | |
|
|
268 | d9 0100 # tag(256) |
|
|
269 | 98 20 # array(32) |
|
|
270 | 41 # bytes(1) |
|
|
271 | 31 # "1" |
|
|
272 | 43 # bytes(3) |
|
|
273 | 323232 # "222" |
|
|
274 | 43 # bytes(3) |
|
|
275 | 333333 # "333" |
|
|
276 | 41 # bytes(1) |
|
|
277 | 34 # "4" |
|
|
278 | 43 # bytes(3) |
|
|
279 | 353535 # "555" |
|
|
280 | 43 # bytes(3) |
|
|
281 | 363636 # "666" |
|
|
282 | 43 # bytes(3) |
|
|
283 | 373737 # "777" |
|
|
284 | 43 # bytes(3) |
|
|
285 | 383838 # "888" |
|
|
286 | 43 # bytes(3) |
|
|
287 | 393939 # "999" |
|
|
288 | 43 # bytes(3) |
|
|
289 | 616161 # "aaa" |
|
|
290 | 43 # bytes(3) |
|
|
291 | 626262 # "bbb" |
|
|
292 | 43 # bytes(3) |
|
|
293 | 636363 # "ccc" |
|
|
294 | 43 # bytes(3) |
|
|
295 | 646464 # "ddd" |
|
|
296 | 43 # bytes(3) |
|
|
297 | 656565 # "eee" |
|
|
298 | 43 # bytes(3) |
|
|
299 | 666666 # "fff" |
|
|
300 | 43 # bytes(3) |
|
|
301 | 676767 # "ggg" |
|
|
302 | 43 # bytes(3) |
|
|
303 | 686868 # "hhh" |
|
|
304 | 43 # bytes(3) |
|
|
305 | 696969 # "iii" |
|
|
306 | 43 # bytes(3) |
|
|
307 | 6a6a6a # "jjj" |
|
|
308 | 43 # bytes(3) |
|
|
309 | 6b6b6b # "kkk" |
|
|
310 | 43 # bytes(3) |
|
|
311 | 6c6c6c # "lll" |
|
|
312 | 43 # bytes(3) |
|
|
313 | 6d6d6d # "mmm" |
|
|
314 | 43 # bytes(3) |
|
|
315 | 6e6e6e # "nnn" |
|
|
316 | 43 # bytes(3) |
|
|
317 | 6f6f6f # "ooo" |
|
|
318 | 43 # bytes(3) |
|
|
319 | 707070 # "ppp" |
|
|
320 | 43 # bytes(3) |
|
|
321 | 717171 # "qqq" |
|
|
322 | 43 # bytes(3) |
|
|
323 | 727272 # "rrr" |
|
|
324 | d8 19 # tag(25) |
|
|
325 | 01 # unsigned(1) |
|
|
326 | 44 # bytes(4) |
|
|
327 | 73737373 # "ssss" |
|
|
328 | d8 19 # tag(25) |
|
|
329 | 17 # unsigned(23) |
|
|
330 | 43 # bytes(3) |
|
|
331 | 727272 # "rrr" |
|
|
332 | d8 19 # tag(25) |
|
|
333 | 18 18 # unsigned(24) |
|
|
334 | |
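The index assignment in the example above can be reproduced with a short script. This is only a minimal sketch, not the reference implementation: the helper names `min_ref_length` and `assign_indices` are hypothetical, and the length thresholds are an assumption taken from the rule that a string is only registered when a stringref to its index would be shorter than the string itself (a two-byte tag plus the integer encoding of the index). The sketch also assumes an encoder that emits a stringref whenever one is available.

```python
def min_ref_length(index):
    """Minimum string length in bytes for a string to be assigned `index`.

    Assumed thresholds: a stringref costs 2 bytes for tag 25 plus
    1, 2, 3, 5 or 9 bytes for the index itself.
    """
    if index <= 23:
        return 3
    if index <= 0xFF:
        return 4
    if index <= 0xFFFF:
        return 5
    if index <= 0xFFFFFFFF:
        return 7
    return 11


def assign_indices(strings):
    """Walk strings in encoding order and decide literal vs. stringref.

    Returns (plan, table): plan has one ("literal", s) or ("ref", index)
    entry per input string, table lists the strings that got an index.
    """
    table = []  # index -> string
    seen = {}   # string -> index
    plan = []
    for s in strings:
        if s in seen:
            plan.append(("ref", seen[s]))
            continue
        plan.append(("literal", s))
        # The next index is the current table length; register the
        # string only if it is long enough for that index.
        if len(s.encode("utf-8")) >= min_ref_length(len(table)):
            seen[s] = len(table)
            table.append(s)
    return plan, table


data = ["1", "222", "333", "4", "555", "666", "777", "888", "999",
        "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
        "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
        "333", "ssss", "qqq", "rrr", "ssss"]

plan, table = assign_indices(data)
# Strings written literally that never received an index.
unindexed = sorted({s for kind, s in plan if kind == "literal" and s not in table})
# Indices used by the emitted stringrefs, in encoding order.
refs = [v for kind, v in plan if kind == "ref"]
print(unindexed)
print(refs)
```

The first "rrr" arrives when 24 strings are already registered, so it would need index 24 and therefore at least 4 bytes; it is written literally both times, while "ssss" is long enough to take that index.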
|
|
335 | This example shows three stringref-namespace tags, two of which are nested |
|
|
336 | inside another: |
|
|
337 | |
|
|
338 | 256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)]) |
|
|
339 | |
|
|
340 | d9 0100 # tag(256) |
|
|
341 | 85 # array(5) |
|
|
342 | 63 # text(3) |
|
|
343 | 616161 # "aaa" |
|
|
344 | d8 19 # tag(25) |
|
|
345 | 00 # unsigned(0) |
|
|
346 | d9 0100 # tag(256) |
|
|
347 | 83 # array(3) |
|
|
348 | 63 # text(3) |
|
|
349 | 626262 # "bbb" |
|
|
350 | 63 # text(3) |
|
|
351 | 616161 # "aaa" |
|
|
352 | d8 19 # tag(25) |
|
|
353 | 01 # unsigned(1) |
|
|
354 | d9 0100 # tag(256) |
|
|
355 | 82 # array(2) |
|
|
356 | 63 # text(3) |
|
|
357 | 636363 # "ccc" |
|
|
358 | d8 19 # tag(25) |
|
|
359 | 00 # unsigned(0) |
|
|
360 | d8 19 # tag(25) |
|
|
361 | 00 # unsigned(0) |
|
|
362 | |
|
|
363 | The decoded data structure might look like this: |
|
|
364 | |
|
|
365 | ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"] |
|
|
366 | |
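The nested-namespace behaviour can be sketched as a resolver over an already parsed item tree. This is an illustration under stated assumptions, not how CBOR::XS does it: the `Tag` class and the `resolve` function are hypothetical stand-ins for whatever a CBOR parser produces, the tree is assumed to preserve the order in which items appeared in the stream, and the minimum-length thresholds are the assumed "stringref must be shorter than the string" rule.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Tag:
    """Hypothetical representation of a tagged CBOR item."""
    number: int
    value: Any


def min_ref_length(index):
    # Assumed thresholds: 2-byte tag plus the encoded size of the index.
    for limit, length in ((23, 3), (0xFF, 4), (0xFFFF, 5), (0xFFFFFFFF, 7)):
        if index <= limit:
            return length
    return 11


def resolve(item, table=None):
    """Replace stringrefs by the strings they refer to.

    `table` is the string array of the innermost enclosing namespace,
    or None when outside of any stringref-namespace.
    """
    if isinstance(item, Tag):
        if item.number == 256:
            # A namespace starts with a fresh, empty table; the outer
            # table is untouched and is used again after returning.
            return resolve(item.value, [])
        if item.number == 25:
            if table is None:
                raise ValueError("stringref outside of a stringref-namespace")
            return table[item.value]
        return Tag(item.number, resolve(item.value, table))
    if isinstance(item, list):
        return [resolve(x, table) for x in item]
    if isinstance(item, dict):
        out = {}
        for key, value in item.items():  # key first, then its value
            resolved_key = resolve(key, table)
            out[resolved_key] = resolve(value, table)
        return out
    if isinstance(item, (str, bytes)) and table is not None:
        size = len(item.encode("utf-8")) if isinstance(item, str) else len(item)
        # The array length is the next index; push only if long enough.
        if size >= min_ref_length(len(table)):
            table.append(item)
    return item


encoded = Tag(256, ["aaa", Tag(25, 0),
                    Tag(256, ["bbb", "aaa", Tag(25, 1)]),
                    Tag(256, ["ccc", Tag(25, 0)]),
                    Tag(25, 0)])
decoded = resolve(encoded)
print(decoded)
```

Passing a new list down for each tag 256 and relying on the call stack to restore the outer one keeps the namespaces independent, which is why the final `25(0)` resolves to "aaa" again.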
|
|
367 | =head1 IMPLEMENTATIONS |
|
|
368 | |
|
|
369 | This section lists known implementations of this extension (L<drop me a |
|
|
370 | mail|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be |
|
|
371 | listed here). |
|
|
372 | |
|
|
373 | =over 4 |
|
|
374 | |
|
|
375 | =item * [Perl] L<CBOR::XS|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation) |
|
|
376 | |
|
|
377 | =back |
|
|
378 | |