CBOR-XS/doc/stringref.pod

=head1 REGISTRATION INFORMATION

   Tag             256 (stringref-namespace)
   Data Item       multiple
   Semantics       mark value as having string references
   Reference       http://cbor.schmorp.de/stringref
   Contact         Marc A. Lehmann <cbor@schmorp.de>

   Tag             25 (stringref)
   Data Item       unsigned integer
   Semantics       reference the nth previously seen string
   Reference       http://cbor.schmorp.de/stringref
   Contact         Marc A. Lehmann <cbor@schmorp.de>

=head1 RATIONALE

These two tags can be used to efficiently share constant strings, which
are common in many data structures. They are an optimisation only,
similar to LZSS compresssion, for the string major types in CBOR.

Background: many data structures or protocols contain repeated string
constants (often called atoms or quarks). For example, this is a
real-world (JSON-formatted) data structure that contains many repeated
strings used as map keys:

  [
     {
       "name" : "Cocktail",
       "count" : 417,
       "rank" : 4
     },
     {
       "rank" : 4,
       "count" : 312,
       "name" : "Bath"
     },
     {
       "count" : 691,
       "name" : "Food",
       "rank" : 4
     }
  ]

This can be encoded nicely with CBOR, but each occurrence of "name",
"count" and "rank" will be encoded separately.  In highly repetitive data
structures (the above is taken from a game save file) these can take up
considerable space.

This scheme can be used to reduce this overhead with a simple scheme that
is easily implementable.

=head1 DESCRIPTION

Stringref consists of two tags, stringref-namespace (value C<256>),
which marks a value as containing string references, and stringref (value
C<25>), which references a string previously encoded in the value.

The stringref-namespace tag is used to define a namespace for the string
reference ids. stringref tags are only valid inside CBOR values marked
with stringref-namespace.

The purpose of stringref-namespace is three-fold:

=over 4

=item 1. It signals the decoder that it needs to keep a copy of every
string that is being decoded while decoding the tagged value (and
conversely, that there is no need for this extra overhead without this
tag).

=item 2. It can be used to confine string references to certain subtrees,
which might increase encoding efficiency by being able to use smaller
references.

=item 3. CBOR objects encoded with this scheme can be blindly embedded in
existing CBOR streams, as the indices are relative to the outer namespace
tag, and don't change when embedded.

=back

Within a value tagged with stringref-namespace, every string that is
encoded with a definite length and has a minimum length is assigned an
implicit index, starting from zero.

The minimum length required to get an index depends on the index value
that would be assigned to it, as stringref references should be shorter
than the string they reference, and there is no point in wasting indices
on strings that should not be referenced anyway.

The lengths are as follows:

   string index              minimum string length in octets
   0 .. 23                   3
   24 .. 255                 4
   256 .. 65535              5
   65536 .. 4294967295       7
   4294967296 ..             11

The minimum string length is simply the length of the stringref tag (2
octets), plus the minimum size the resulting index takes up, using major
type 0 encoding.

Each time a string is to be encoded by the encoder, it can choose to
encode it as a normal string, or attempt to encode it as a stringref.

If it attempts to encode a stringref, it needs to check whether the string
has been assigned an index already. If yes, it can emit a stringref tag
with the index.

In all other cases the encoder must check whether it has to assign an
index value to it by checking against the minimum length required for the
next index value to be assigned, and assigning the next index value if the
string has minimum length or is longer.

A typical encoder might want to look up every string (using the major
type and the octets) it encodes in some kind of internal map that stores
previously-assigned indices. If found, it will emit a stringref. If not,
it attempts to assign an index to the string and stores that in its map.

Decoders might choose not to look up previous indices for a string, for
example, because it only wants to compress strings used as map keys, or
because it intrinsically knows that the string is only used once. This is
fine, but each time the encoder emits a string, it MUST assign an index if
the minimum length is reached or exceeded, even if it doesn't store that
index for later use.

Byte strings and text strings share the same index namespace, but are
distinct, even if their contents are bit-identical. Indefinite length
strings and the chunks they consist of are never assigned an index.

A typical encoder would emit the stringref-namespace tag once to tag
the top-level value in a CBOR stream, but nothing keeps an encoder
from nesting stringref-namespace tags. Nesting might be done for
convenience, to improve coding efficiency, to embed values or for any
other reason. Similarly, an encoder can sort map keys for greater coding
efficiency, but is not required to.

A decoder follows the same process of assigning indices. A possible
implementation could be outlined like this:

When a decoder supports stringref, then, upon encountering a
stringref-namespace tag, it should initialise an array and start decoding
the tagged value. The array stores, for each index, the major type (byte
or text) and the string value. Implementations are free to use their
internal format, of course - languages that do not distinguish between
text and byte strings can just store their internal string type in the
array.

Since stringref-namespace tags can be nested, the decoder needs to save
and restore the outer array before starting and after ending the decoding
of the tagged value. In a recursive implementation this could simply look
like this:

   global stringref_array; // global per encoder
   local outer_stringref_array;

   outer_stringref_array = stringref_array;
   stringref_array = allocate_array ();

   value = decode_cbor_value ();

   deallocate_array (stringref_array);
   stringref_array = outer_stringref_array;

   return value;

When a stringref tag is encountered, it MUST contain an unsigned integer
(major type 0). The integer MUST be less than the number of entries in the
array. The decoder than looks up the string in the array, given the index,
and copies it to the result (string values should not be shared unless
the language normally does so - if strings are mutable, a copy should be
decoded).

When a definite-length text or byte string is encountered, it is decoded
normally. Afterwards, the decoder checks whether it needs to assign an
index to the string, by comparing the string length against the minimum
length required by the index. If the string is shorter, nothing happens,
otherwise the next free index is used and the string (and the information
whether it is a byte or text string) is recorded at the end of the array.

In languages with dynamic arrays, this could be accomplished by using
the array length as the next index to be assigned, and pushing the
string onto the end of the array when it is long enough.

=head2 IMPLEMENTATION NOTE

The semantics of stringref tags require the decoder to be aware and the
encoder to be under control of the sequence in which data items are
encoded into the CBOR stream. This means these tags cannot be implemented
on top of every generic CBOR encoder/decoder (which might reorder entries
in a map); they typically need to be integrated into their works.

=head1 EXAMPLES

The array-of-maps from the rationale example would normally compress to a
CBOR text of 83 bytes. Using this extension where possible, this reduces
to 74 bytes:

   d9 0100                      # tag(256)
      83                        # array(3)
         a3                     # map(3)
            44                  # bytes(4)
               72616e6b         # "rank"
            04                  # unsigned(4)
            45                  # bytes(5)
               636f756e74       # "count"
            19 01a1             # unsigned(417)
            44                  # bytes(4)
               6e616d65         # "name"
            48                  # bytes(8)
               436f636b7461696c # "Cocktail"
         a3                     # map(3)
            d8 19               # tag(25)
               02               # unsigned(2)
            44                  # bytes(4)
               42617468         # "Bath"
            d8 19               # tag(25)
               01               # unsigned(1)
            19 0138             # unsigned(312)
            d8 19               # tag(25)
               00               # unsigned(0)
            04                  # unsigned(4)
         a3                     # map(3)
            d8 19               # tag(25)
               02               # unsigned(2)
            44                  # bytes(4)
               466f6f64         # "Food"
            d8 19               # tag(25)
               01               # unsigned(1)
            19 02b3             # unsigned(691)
            d8 19               # tag(25)
               00               # unsigned(0)
            04                  # unsigned(4)

The following JSON array illustrates the effect of the index on the
minimum string length:

   [  "1", "222", "333",   "4", "555", "666", "777", "888", "999",
    "aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
    "jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
    "333",
    "ssss",
    "qqq", "rrr", "ssss"]

The strings "1", "4" and "rrr" are too short to get an index assigned. All
others that are not encoded with a stringref do (this assumes that JSON
strings are encoded as CBOR byte strings):

   d9 0100           # tag(256)
      98 20          # array(32)
         41          # bytes(1)
            31       # "1"
         43          # bytes(3)
            323232   # "222"
         43          # bytes(3)
            333333   # "333"
         41          # bytes(1)
            34       # "4"
         43          # bytes(3)
            353535   # "555"
         43          # bytes(3)
            363636   # "666"
         43          # bytes(3)
            373737   # "777"
         43          # bytes(3)
            383838   # "888"
         43          # bytes(3)
            393939   # "999"
         43          # bytes(3)
            616161   # "aaa"
         43          # bytes(3)
            626262   # "bbb"
         43          # bytes(3)
            636363   # "ccc"
         43          # bytes(3)
            646464   # "ddd"
         43          # bytes(3)
            656565   # "eee"
         43          # bytes(3)
            666666   # "fff"
         43          # bytes(3)
            676767   # "ggg"
         43          # bytes(3)
            686868   # "hhh"
         43          # bytes(3)
            696969   # "iii"
         43          # bytes(3)
            6a6a6a   # "jjj"
         43          # bytes(3)
            6b6b6b   # "kkk"
         43          # bytes(3)
            6c6c6c   # "lll"
         43          # bytes(3)
            6d6d6d   # "mmm"
         43          # bytes(3)
            6e6e6e   # "nnn"
         43          # bytes(3)
            6f6f6f   # "ooo"
         43          # bytes(3)
            707070   # "ppp"
         43          # bytes(3)
            717171   # "qqq"
         43          # bytes(3)
            727272   # "rrr"
         d8 19       # tag(25)
            01       # unsigned(1)
         44          # bytes(4)
            73737373 # "ssss"
         d8 19       # tag(25)
            17       # unsigned(23)
         43          # bytes(3)
            727272   # "rrr"
         d8 19       # tag(25)
            18 18    # unsigned(24)

This example shows three stringref-namespace tags, two of which are nested
inside another:

   256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])

   d9 0100               # tag(256)
      85                 # array(5)
         63              # text(3)
            616161       # "aaa"
         d8 19           # tag(25)
            00           # unsigned(0)
         d9 0100         # tag(256)
            83           # array(3)
               63        # text(3)
                  626262 # "bbb"
               63        # text(3)
                  616161 # "aaa"
               d8 19     # tag(25)
                  01     # unsigned(1)
         d9 0100         # tag(256)
            82           # array(2)
               63        # text(3)
                  636363 # "ccc"
               d8 19     # tag(25)
                  00     # unsigned(0)
         d8 19           # tag(25)
            00           # unsigned(0)

The decoded data structure might look like this:

  ["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]

=head1 IMPLEMENTATIONS

This section lists known implementations of this extension (L<drop me a
mail|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be
listed here).

=over 4

=item * [Perl] L<CBOR::XS|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation)

=back

Revision:	1.5
Committed:	Wed Apr 25 06:37:12 2018 UTC (6 years, 1 month ago) by root
Branch:	MAIN
Changes since 1.4:	+5 -4 lines
Log Message:	* empty log message *
#	Content
1	=head1 REGISTRATION INFORMATION
2
3	Tag 256 (stringref-namespace)
4	Data Item multiple
5	Semantics mark value as having string references
6	Reference http://cbor.schmorp.de/stringref
7	Contact Marc A. Lehmann <cbor@schmorp.de>
8
9	Tag 25 (stringref)
10	Data Item unsigned integer
11	Semantics reference the nth previously seen string
12	Reference http://cbor.schmorp.de/stringref
13	Contact Marc A. Lehmann <cbor@schmorp.de>
14
15	=head1 RATIONALE
16
17	These two tags can be used to efficiently share constant strings, which
18	are common in many data structures. They are an optimisation only,
19	similar to LZSS compresssion, for the string major types in CBOR.
20
21	Background: many data structures or protocols contain repeated string
22	constants (often called atoms or quarks). For example, this is a
23	real-world (JSON-formatted) data structure that contains many repeated
24	strings used as map keys:
25
26	[
27	{
28	"name" : "Cocktail",
29	"count" : 417,
30	"rank" : 4
31	},
32	{
33	"rank" : 4,
34	"count" : 312,
35	"name" : "Bath"
36	},
37	{
38	"count" : 691,
39	"name" : "Food",
40	"rank" : 4
41	}
42	]
43
44	This can be encoded nicely with CBOR, but each occurrence of "name",
45	"count" and "rank" will be encoded separately. In highly repetitive data
46	structures (the above is taken from a game save file) these can take up
47	considerable space.
48
49	This scheme can be used to reduce this overhead with a simple scheme that
50	is easily implementable.
51
52	=head1 DESCRIPTION
53
54	Stringref consists of two tags, stringref-namespace (value C<256>),
55	which marks a value as containing string references, and stringref (value
56	C<25>), which references a string previously encoded in the value.
57
58	The stringref-namespace tag is used to define a namespace for the string
59	reference ids. stringref tags are only valid inside CBOR values marked
60	with stringref-namespace.
61
62	The purpose of stringref-namespace is three-fold:
63
64	=over 4
65
66	=item 1. It signals the decoder that it needs to keep a copy of every
67	string that is being decoded while decoding the tagged value (and
68	conversely, that there is no need for this extra overhead without this
69	tag).
70
71	=item 2. It can be used to confine string references to certain subtrees,
72	which might increase encoding efficiency by being able to use smaller
73	references.
74
75	=item 3. CBOR objects encoded with this scheme can be blindly embedded in
76	existing CBOR streams, as the indices are relative to the outer namespace
77	tag, and don't change when embedded.
78
79	=back
80
81	Within a value tagged with stringref-namespace, every string that is
82	encoded with a definite length and has a minimum length is assigned an
83	implicit index, starting from zero.
84
85	The minimum length required to get an index depends on the index value
86	that would be assigned to it, as stringref references should be shorter
87	than the string they reference, and there is no point in wasting indices
88	on strings that should not be referenced anyway.
89
90	The lengths are as follows:
91
92	string index minimum string length in octets
93	0 .. 23 3
94	24 .. 255 4
95	256 .. 65535 5
96	65536 .. 4294967295 7
97	4294967296 .. 11
98
99	The minimum string length is simply the length of the stringref tag (2
100	octets), plus the minimum size the resulting index takes up, using major
101	type 0 encoding.
102
103	Each time a string is to be encoded by the encoder, it can choose to
104	encode it as a normal string, or attempt to encode it as a stringref.
105
106	If it attempts to encode a stringref, it needs to check whether the string
107	has been assigned an index already. If yes, it can emit a stringref tag
108	with the index.
109
110	In all other cases the encoder must check whether it has to assign an
111	index value to it by checking against the minimum length required for the
112	next index value to be assigned, and assigning the next index value if the
113	string has minimum length or is longer.
114
115	A typical encoder might want to look up every string (using the major
116	type and the octets) it encodes in some kind of internal map that stores
117	previously-assigned indices. If found, it will emit a stringref. If not,
118	it attempts to assign an index to the string and stores that in its map.
119
120	Decoders might choose not to look up previous indices for a string, for
121	example, because it only wants to compress strings used as map keys, or
122	because it intrinsically knows that the string is only used once. This is
123	fine, but each time the encoder emits a string, it MUST assign an index if
124	the minimum length is reached or exceeded, even if it doesn't store that
125	index for later use.
126
127	Byte strings and text strings share the same index namespace, but are
128	distinct, even if their contents are bit-identical. Indefinite length
129	strings and the chunks they consist of are never assigned an index.
130
131	A typical encoder would emit the stringref-namespace tag once to tag
132	the top-level value in a CBOR stream, but nothing keeps an encoder
133	from nesting stringref-namespace tags. Nesting might be done for
134	convenience, to improve coding efficiency, to embed values or for any
135	other reason. Similarly, an encoder can sort map keys for greater coding
136	efficiency, but is not required to.
137
138	A decoder follows the same process of assigning indices. A possible
139	implementation could be outlined like this:
140
141	When a decoder supports stringref, then, upon encountering a
142	stringref-namespace tag, it should initialise an array and start decoding
143	the tagged value. The array stores, for each index, the major type (byte
144	or text) and the string value. Implementations are free to use their
145	internal format, of course - languages that do not distinguish between
146	text and byte strings can just store their internal string type in the
147	array.
148
149	Since stringref-namespace tags can be nested, the decoder needs to save
150	and restore the outer array before starting and after ending the decoding
151	of the tagged value. In a recursive implementation this could simply look
152	like this:
153
154	global stringref_array; // global per encoder
155	local outer_stringref_array;
156
157	outer_stringref_array = stringref_array;
158	stringref_array = allocate_array ();
159
160	value = decode_cbor_value ();
161
162	deallocate_array (stringref_array);
163	stringref_array = outer_stringref_array;
164
165	return value;
166
167	When a stringref tag is encountered, it MUST contain an unsigned integer
168	(major type 0). The integer MUST be less than the number of entries in the
169	array. The decoder than looks up the string in the array, given the index,
170	and copies it to the result (string values should not be shared unless
171	the language normally does so - if strings are mutable, a copy should be
172	decoded).
173
174	When a definite-length text or byte string is encountered, it is decoded
175	normally. Afterwards, the decoder checks whether it needs to assign an
176	index to the string, by comparing the string length against the minimum
177	length required by the index. If the string is shorter, nothing happens,
178	otherwise the next free index is used and the string (and the information
179	whether it is a byte or text string) is recorded at the end of the array.
180
181	In languages with dynamic arrays, this could be accomplished by using
182	the array length as the next index to be assigned, and pushing the
183	string onto the end of the array when it is long enough.
184
185	=head2 IMPLEMENTATION NOTE
186
187	The semantics of stringref tags require the decoder to be aware and the
188	encoder to be under control of the sequence in which data items are
189	encoded into the CBOR stream. This means these tags cannot be implemented
190	on top of every generic CBOR encoder/decoder (which might reorder entries
191	in a map); they typically need to be integrated into their works.
192
193	=head1 EXAMPLES
194
195	The array-of-maps from the rationale example would normally compress to a
196	CBOR text of 83 bytes. Using this extension where possible, this reduces
197	to 74 bytes:
198
199	d9 0100 # tag(256)
200	83 # array(3)
201	a3 # map(3)
202	44 # bytes(4)
203	72616e6b # "rank"
204	04 # unsigned(4)
205	45 # bytes(5)
206	636f756e74 # "count"
207	19 01a1 # unsigned(417)
208	44 # bytes(4)
209	6e616d65 # "name"
210	48 # bytes(8)
211	436f636b7461696c # "Cocktail"
212	a3 # map(3)
213	d8 19 # tag(25)
214	02 # unsigned(2)
215	44 # bytes(4)
216	42617468 # "Bath"
217	d8 19 # tag(25)
218	01 # unsigned(1)
219	19 0138 # unsigned(312)
220	d8 19 # tag(25)
221	00 # unsigned(0)
222	04 # unsigned(4)
223	a3 # map(3)
224	d8 19 # tag(25)
225	02 # unsigned(2)
226	44 # bytes(4)
227	466f6f64 # "Food"
228	d8 19 # tag(25)
229	01 # unsigned(1)
230	19 02b3 # unsigned(691)
231	d8 19 # tag(25)
232	00 # unsigned(0)
233	04 # unsigned(4)
234
235	The following JSON array illustrates the effect of the index on the
236	minimum string length:
237
238	[ "1", "222", "333", "4", "555", "666", "777", "888", "999",
239	"aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
240	"jjj", "kkk", "lll", "mmm", "nnn", "ooo", "ppp", "qqq", "rrr",
241	"333",
242	"ssss",
243	"qqq", "rrr", "ssss"]
244
245	The strings "1", "4" and "rrr" are too short to get an index assigned. All
246	others that are not encoded with a stringref do (this assumes that JSON
247	strings are encoded as CBOR byte strings):
248
249	d9 0100 # tag(256)
250	98 20 # array(32)
251	41 # bytes(1)
252	31 # "1"
253	43 # bytes(3)
254	323232 # "222"
255	43 # bytes(3)
256	333333 # "333"
257	41 # bytes(1)
258	34 # "4"
259	43 # bytes(3)
260	353535 # "555"
261	43 # bytes(3)
262	363636 # "666"
263	43 # bytes(3)
264	373737 # "777"
265	43 # bytes(3)
266	383838 # "888"
267	43 # bytes(3)
268	393939 # "999"
269	43 # bytes(3)
270	616161 # "aaa"
271	43 # bytes(3)
272	626262 # "bbb"
273	43 # bytes(3)
274	636363 # "ccc"
275	43 # bytes(3)
276	646464 # "ddd"
277	43 # bytes(3)
278	656565 # "eee"
279	43 # bytes(3)
280	666666 # "fff"
281	43 # bytes(3)
282	676767 # "ggg"
283	43 # bytes(3)
284	686868 # "hhh"
285	43 # bytes(3)
286	696969 # "iii"
287	43 # bytes(3)
288	6a6a6a # "jjj"
289	43 # bytes(3)
290	6b6b6b # "kkk"
291	43 # bytes(3)
292	6c6c6c # "lll"
293	43 # bytes(3)
294	6d6d6d # "mmm"
295	43 # bytes(3)
296	6e6e6e # "nnn"
297	43 # bytes(3)
298	6f6f6f # "ooo"
299	43 # bytes(3)
300	707070 # "ppp"
301	43 # bytes(3)
302	717171 # "qqq"
303	43 # bytes(3)
304	727272 # "rrr"
305	d8 19 # tag(25)
306	01 # unsigned(1)
307	44 # bytes(4)
308	73737373 # "ssss"
309	d8 19 # tag(25)
310	17 # unsigned(23)
311	43 # bytes(3)
312	727272 # "rrr"
313	d8 19 # tag(25)
314	18 18 # unsigned(24)
315
316	This example shows three stringref-namespace tags, two of which are nested
317	inside another:
318
319	256(["aaa", 25(0), 256(["bbb", "aaa", 25(1)]), 256(["ccc", 25(0)]), 25(0)])
320
321	d9 0100 # tag(256)
322	85 # array(5)
323	63 # text(3)
324	616161 # "aaa"
325	d8 19 # tag(25)
326	00 # unsigned(0)
327	d9 0100 # tag(256)
328	83 # array(3)
329	63 # text(3)
330	626262 # "bbb"
331	63 # text(3)
332	616161 # "aaa"
333	d8 19 # tag(25)
334	01 # unsigned(1)
335	d9 0100 # tag(256)
336	82 # array(2)
337	63 # text(3)
338	636363 # "ccc"
339	d8 19 # tag(25)
340	00 # unsigned(0)
341	d8 19 # tag(25)
342	00 # unsigned(0)
343
344	The decoded data structure might look like this:
345
346	["aaa","aaa",["bbb","aaa","aaa"],["ccc","ccc"],"aaa"]
347
348	=head1 IMPLEMENTATIONS
349
350	This section lists known implementations of this extension (L<drop me a
351	mail\|mailto:cbor@schmorp.de?Subject=CBOR-stringref> if you want to be
352	listed here).
353
354	=over 4
355
356	=item * [Perl] L<CBOR::XS\|http://software.schmorp.de/pkg/CBOR-XS.html> (reference implementation)
357
358	=back
359