1 |
=head1 REGISTRATION INFORMATION |
2 |
|
3 |
Tag 256 (stringref-namespace) |
4 |
Data Item multiple |
5 |
Semantics mark value as having string references |
6 |
Reference http://cbor.schmorp.de/stringref |
7 |
Contact Marc A. Lehmann <cbor@schmorp.de> |
8 |
|
9 |
Tag 25 (stringref) |
10 |
Data Item unsigned integer |
11 |
Semantics reference the nth previously seen string |
12 |
Reference http://cbor.schmorp.de/stringref |
13 |
Contact Marc A. Lehmann <cbor@schmorp.de> |
14 |
|
15 |
=head1 RATIONALE |
16 |
|
17 |
These two tags can be used to efficiently share constant strings, which |
18 |
are common in many data structures. They are an optimisation only, |
19 |
similar to LZSS compresssion, for the string major types in CBOR. |
20 |
|
21 |
Background: many data structures or protocols contain repeated string |
22 |
constants (often called atoms or quarks). For example, this is a |
23 |
real-world (JSON-formatted) data structure that contains many repeated |
24 |
strings used as map keys: |
25 |
|
26 |
[ |
27 |
{ |
28 |
"name" : "Cocktail", |
29 |
"count" : 417, |
30 |
"rank" : 4 |
31 |
}, |
32 |
{ |
33 |
"rank" : 4, |
34 |
"count" : 312, |
35 |
"name" : "Bath" |
36 |
}, |
37 |
{ |
38 |
"count" : 691, |
39 |
"name" : "Food", |
40 |
"rank" : 4 |
41 |
} |
42 |
] |
43 |
|
44 |
This can be encoded nicely with CBOR, but each occurrence of "name", |
45 |
"count" and "rank" will be encoded separately. In highly repetitive data |
46 |
structures (the above is taken from a game save file) these can take up |
47 |
considerable space. |
48 |
|
49 |
This scheme can be used to reduce this overhead with a simple scheme that |
50 |
is easily implementable. |
51 |
|
52 |
=head1 DESCRIPTION |
53 |
|
54 |
Stringref consists of two tags, stringref-namespace (value C<256>), |
55 |
which marks a value as containing string references, and stringref (value |
56 |
C<25>), which references a string previously encoded in the value. |
57 |
|
58 |
The stringref-namespace tag is used to define a namespace for the string |
59 |
reference ids. stringref tags are only valid inside CBOR values marked |
60 |
with stringref-namespace. |
61 |
|
62 |
The purpose of stringref-namespace is three-fold: |
63 |
|
64 |
=over 4 |
65 |
|
66 |
=item 1. It signals the decoder that it needs to keep a copy of every |
67 |
string that is being decoded while decoding the tagged value (and |
68 |
conversely, that there is no need for this extra overhead without this |
69 |
tag). |
70 |
|
71 |
=item 2. It can be used to confine string references to certain subtrees, |
72 |
which might increase encoding efficiency by being able to use smaller |
73 |
references. |
74 |
|
75 |
=item 3. CBOR objects encoded with this scheme can be blindly embedded in |
76 |
existing CBOR streams, as the indices are relative to the outer namespace |
77 |
tag, and don't change when embedded. |
78 |
|
79 |
=back |
80 |
|
81 |
Within a value tagged with stringref-namespace, every string that is |
82 |
encoded with a definite length and has a minimum length is assigned an |
83 |
implicit index, starting from zero. |
84 |
|
85 |
The minimum length required to get an index depends on the index value |
86 |
that would be assigned to it, as stringref references should be shorter |
87 |
than the string they reference, and there is no point in wasting indices |
88 |
on strings that should not be referenced anyway. |
89 |
|
90 |
The lengths are as follows: |
91 |
|
92 |
string index minimum string length in octets |
93 |
0 .. 23 3 |
94 |
24 .. 255 4 |
95 |
256 .. 65535 5 |
96 |
65536 .. 4294967295 7 |
97 |
4294967296 .. 11 |
98 |
|
99 |
The minimum string length is simply the length of the stringref tag (2 |
100 |
octets), plus the minimum size the resulting index takes up, using major |
101 |
type 0 encoding. |
102 |
|
103 |
Each time a string is to be encoded by the encoder, it can choose to |
104 |
encode it as a normal string, or attempt to encode it as a stringref. |
105 |
|
106 |
If it attempts to encode a stringref, it needs to check whether the string |
107 |
has been assigned an index already. If yes, it can emit a stringref tag |
108 |
with the index. |
109 |
|
110 |
In all other cases the encoder must check whether it has to assign an |
111 |
index value to it by checking against the minimum length required for the |
112 |
next index value to be assigned, and assigning the next index value if the |
113 |
string has minimum length or is longer. |
114 |
|
115 |
A typical encoder might want to look up every string (using the major |
116 |
type and the octets) it encodes in some kind of internal map that stores |
117 |
previously-assigned indices. If found, it will emit a stringref. If not, |
118 |
it attempts to assign an index to the string and stores that in its map. |
119 |
|
120 |
Decoders might choose not to look up previous indices for a string, for |
121 |
example, because it only wants to compress strings used as map keys, or |
122 |
because it intrinsically knows that the string is only used once. This is |
123 |
fine, but each time the encoder emits a string, it MUST assign an index if |
124 |
the minimum length is reached or exceeded, even if it doesn't store that |
125 |
index for later use. |
126 |
|
127 |
Byte strings and text strings share the same index namespace, but are |
128 |
distinct, even if their contents are bit-identical. Indefinite length |
129 |
strings and the chunks they consist of are never assigned an index. |
130 |
|
131 |
A typical encoder would emit the stringref-namespace tag once to tag |
132 |
the top-level value in a CBOR stream, but nothing keeps an encoder |
133 |
from nesting stringref-namespace tags. Nesting might be done for |
134 |
convenience, to improve coding efficiency, to embed values or for any |
135 |
other reason. Similarly, an encoder can sort map keys for greater coding |
136 |
efficiency, but is not required to. |
137 |
|
138 |
A decoder follows the same process of assigning indices. A possible |
139 |
implementation could be outlined like this: |
140 |
|
141 |
When a decoder supports stringref, then, upon encountering a |
142 |
stringref-namespace tag, it should initialise an array and start decoding |
143 |
the tagged value. The array stores, for each index, the major type (byte |
144 |
or text) and the string value. Implementations are free to use their |
145 |
internal format, of course - languages that do not distinguish between |
146 |
text and byte strings can just store their internal string type in the |
147 |
array. |
148 |
|
149 |
Since stringref-namespace tags can be nested, the decoder needs to save |
150 |
and restore the outer array before starting and after ending the decoding |
151 |
of the tagged value. In a recursive implementation this could simply look |
152 |
like this: |
153 |
|
154 |
global stringref_array; // global per encoder |
155 |
local outer_stringref_array; |
156 |
|
157 |
outer_stringref_array = stringref_array; |
158 |
stringref_array = allocate_array (); |
159 |
|
160 |
value = decode_cbor_value (); |
161 |
|
162 |
deallocate_array (stringref_array); |
163 |
stringref_array = outer_stringref_array; |
164 |
|
165 |
return value; |
166 |
|
167 |
When a stringref tag is encountered, it MUST contain an unsigned integer |
168 |
(major type 0). The integer MUST be less than the number of entries in the |
169 |
array. The decoder than looks up the string in the array, given the index, |
170 |
and copies it to the result (string values should not be shared unless |
171 |
the language normally does so - if strings are mutable, a copy should be |
172 |
decoded). |
173 |
|
174 |
When a definite-length text or byte string is encountered, it is decoded |
175 |
normally. Afterwards, the decoder checks whether it needs to assign an |
176 |
index to the string, by comparing the string length against the minimum |
177 |
length required by the index. If the string is shorter, nothing happens, |
178 |
otherwise the next free index is used and the string (and the information |
179 |
whether it is a byte or text string) is recorded at the end of the array. |
180 |
|
181 |
In languages with dynamic arrays, this could be accomplished by using |
182 |
the array length as the next index to be assigned, and pushing the |
183 |
string onto the end of the array when it is long enough. |
184 |
|
185 |
=head2 IMPLEMENTATION NOTE |
186 |
|
187 |
The semantics of stringref tags require the decoder to be aware and the |
188 |
encoder to be under control of the sequence in which data items are |
189 |
encoded into the CBOR stream. This means these tags cannot be implemented |
190 |
on top of every generic CBOR encoder/decoder (which might reorder entries |
191 |
in a map); they typically need to be integrated into their works. |
192 |
|
193 |
=head1 EXAMPLES |
194 |
|
195 |
The array-of-maps from the rationale example would normally compress to a |
196 |
CBOR text of 83 bytes. Using this extension where possible, this reduces |
197 |
to 74 bytes: |
198 |
|
199 |
d9 0100 # tag(256) |
200 |
83 # array(3) |
201 |
a3 # map(3) |
202 |
44 # bytes(4) |
203 |
72616e6b # "rank" |
204 |
04 # unsigned(4) |
205 |
45 # bytes(5) |
206 |
636f756e74 # "count" |
207 |
19 01a1 # unsigned(417) |
208 |
44 # bytes(4) |
209 |
6e616d65 # "name" |
210 |
48 # bytes(8) |
211 |
436f636b7461696c # "Cocktail" |
212 |
a3 # map(3) |
213 |
d8 19 # tag(25) |
214 |
02 # unsigned(2) |
215 |
44 # bytes(4) |
216 |
42617468 # "Bath" |
217 |
d8 19 # tag(25) |
218 |
01 # unsigned(1) |
219 |
19 0138 # unsigned(312) |
220 |
d8 19 # tag(25) |
221 |
00 # unsigned(0) |
222 |
04 # unsigned(4) |
223 |
a3 # map(3) |
224 |
d8 19 # tag(25) |
225 |
02 # unsigned(2) |
226 |
44 # bytes(4) |
227 |
466f6f64 # "Food" |
228 |
d8 19 # tag(25) |
229 |
01 # unsigned(1) |
230 |
19 02b3 # unsigned(691) |
231 |
d8 19 # tag(25) |
232 |
00 # unsigned(0) |
233 |
04 # unsigned(4) |
234 |
|
235 |
The following JSON array illustrates the effect of the index on the |
236 |
minimum string length: |
237 |
|
238 |
[ "1", "222", "333", "4", "555", "666", "777", "888", "999", |
239 |
"aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", |
240 |
"jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr", |
241 |
"333", |
242 |
"ssss", |
243 |
"qqq", "rrr", "ssss"] |
244 |
|
245 |
The strings "1", "4" and "rrr" are too short to get an index assigned. All others that are |
246 |
not encoded with a stringref do: |
247 |
|
248 |
d9 0100 # tag(256) |
249 |
98 20 # array(32) |
250 |
41 # bytes(1) |
251 |
31 # "1" |
252 |
43 # bytes(3) |
253 |
323232 # "222" |
254 |
43 # bytes(3) |
255 |
333333 # "333" |
256 |
41 # bytes(1) |
257 |
34 # "4" |
258 |
43 # bytes(3) |
259 |
353535 # "555" |
260 |
43 # bytes(3) |
261 |
363636 # "666" |
262 |
43 # bytes(3) |
263 |
373737 # "777" |
264 |
43 # bytes(3) |
265 |
383838 # "888" |
266 |
43 # bytes(3) |
267 |
393939 # "999" |
268 |
43 # bytes(3) |
269 |
616161 # "aaa" |
270 |
43 # bytes(3) |
271 |
626262 # "bbb" |
272 |
43 # bytes(3) |
273 |
636363 # "ccc" |
274 |
43 # bytes(3) |
275 |
646464 # "ddd" |
276 |
43 # bytes(3) |
277 |
656565 # "eee" |
278 |
43 # bytes(3) |
279 |
666666 # "fff" |
280 |
43 # bytes(3) |
281 |
676767 # "ggg" |
282 |
43 # bytes(3) |
283 |
686868 # "hhh" |
284 |
43 # bytes(3) |
285 |
696969 # "iii" |
286 |
43 # bytes(3) |
287 |
6a6a6a # "jjj" |
288 |
43 # bytes(3) |
289 |
6b6b6b # "kkk" |
290 |
43 # bytes(3) |
291 |
6c6c6c # "lll" |
292 |
43 # bytes(3) |
293 |
6d6d6d # "mmm" |
294 |
43 # bytes(3) |
295 |
6e6e6e # "nnn" |
296 |
43 # bytes(3) |
297 |
6f6f6f # "ooo" |
298 |
43 # bytes(3) |
299 |
707070 # "ppp" |
300 |
43 # bytes(3) |
301 |
717171 # "qqq" |
302 |
43 # bytes(3) |
303 |
727272 # "rrr" |
304 |
44 # bytes(4) |
305 |
73737373 # "ssss" |
306 |
d8 19 # tag(25) |
307 |
01 # unsigned(1) |
308 |
d8 19 # tag(25) |
309 |
17 # unsigned(23) |
310 |
43 # bytes(3) |
311 |
727272 # "rrr" |
312 |
d8 19 # tag(25) |
313 |
18 18 # unsigned(24) |
314 |
|
315 |
This example shows three stringref-namespace tags, two of which are nested |
316 |
inside another: |
317 |
|
318 |
256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)]) |
319 |
|
320 |
d9 0100 # tag(256) |
321 |
85 # array(5) |
322 |
63 # text(3) |
323 |
616161 # "aaa" |
324 |
d8 19 # tag(25) |
325 |
00 # unsigned(0) |
326 |
d9 0100 # tag(256) |
327 |
83 # array(3) |
328 |
63 # text(3) |
329 |
626262 # "bbb" |
330 |
63 # text(3) |
331 |
616161 # "aaa" |
332 |
d8 19 # tag(25) |
333 |
01 # unsigned(1) |
334 |
d9 0100 # tag(256) |
335 |
82 # array(2) |
336 |
63 # text(3) |
337 |
636363 # "ccc" |
338 |
d8 19 # tag(25) |
339 |
00 # unsigned(0) |
340 |
d8 19 # tag(25) |
341 |
00 # unsigned(0) |
342 |
|
343 |
The decoded data structure might look like this: |
344 |
|
345 |
["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"] |
346 |
|
347 |
|