1 |
root |
1.1 |
=head1 REGISTRATION INFORMATION |
2 |
|
|
|
3 |
|
|
Tag <unassigned> (stringref-namespace) |
4 |
|
|
Data Item multiple |
5 |
|
|
Semantics mark value as having string references |
6 |
|
|
Reference http://cbor.schmorp.de/stringref |
7 |
|
|
Contact Marc A. Lehmann <cbor@schmorp.de> |
8 |
|
|
|
9 |
|
|
Tag <unassigned> (stringref) |
10 |
|
|
Data Item unsigned integer |
11 |
|
|
Semantics reference the nth previously seen string |
12 |
|
|
Reference http://cbor.schmorp.de/stringref |
13 |
|
|
Contact Marc A. Lehmann <cbor@schmorp.de> |
14 |
|
|
|
15 |
|
|
=head1 RATIONALE |
16 |
|
|
|
17 |
|
|
These two tags can be used to efficiently share constant strings, which |
18 |
|
|
are common in many data structures. They are an optimisation only, |
19 |
|
|
similar to LZSS compresssion, for the string major types in CBOR. |
20 |
|
|
|
21 |
|
|
Background: many data structures or protocols contain repeated string |
22 |
|
|
constants (often called atoms or quarks). For example, this is a |
23 |
|
|
real-world (JSON-formatted) data structure that contains many repeated |
24 |
|
|
strings used as map keys: |
25 |
|
|
|
26 |
|
|
[ |
27 |
|
|
{ |
28 |
|
|
"name" : "Cocktail", |
29 |
|
|
"count" : 417, |
30 |
|
|
"rank" : 4 |
31 |
|
|
}, |
32 |
|
|
{ |
33 |
|
|
"rank" : 4, |
34 |
|
|
"count" : 312, |
35 |
|
|
"name" : "Bath" |
36 |
|
|
}, |
37 |
|
|
{ |
38 |
|
|
"count" : 691, |
39 |
|
|
"name" : "Food", |
40 |
|
|
"rank" : 4 |
41 |
|
|
} |
42 |
|
|
] |
43 |
|
|
|
44 |
|
|
This can be encoded nicely with CBOR, but each occurrence of "name", |
45 |
|
|
"count" and "rank" will be encoded separately. In highly repetitive data |
46 |
|
|
structures (the above is taken from a game save file) these can take up |
47 |
|
|
considerable space. |
48 |
|
|
|
49 |
|
|
This scheme can be used to reduce this overhead with a simple scheme that |
50 |
|
|
is easily implementable. |
51 |
|
|
|
52 |
|
|
=head1 DESCRIPTION |
53 |
|
|
|
54 |
|
|
Stringref consists of two tags, stringref-namespace (value <unassigned>), |
55 |
|
|
which marks a value as containing string references, and stringref (value |
56 |
|
|
<unassigned>), which references a string previously encoded in the value. |
57 |
|
|
|
58 |
|
|
The stringref-namespace tag is used to define a namespace for the string |
59 |
|
|
reference ids. stringref tags are only valid inside CBOR values marked |
60 |
|
|
with stringref-namespace. |
61 |
|
|
|
62 |
|
|
The purpose of stringref-namespace is three-fold: |
63 |
|
|
|
64 |
|
|
=over 4 |
65 |
|
|
|
66 |
|
|
=item 1. It signals the decoder that it needs to keep a copy of every |
67 |
|
|
string that is being decoded while decoding the tagged value (and |
68 |
|
|
conversely, that there is no need for this extra overhead without this |
69 |
|
|
tag). |
70 |
|
|
|
71 |
|
|
=item 2. It can be used to confine string references to certain subtrees, |
72 |
|
|
which might increase encoding efficiency by being able to use smaller |
73 |
|
|
references. |
74 |
|
|
|
75 |
|
|
=item 3. CBOR objects encoded with this scheme can be blindly embedded in |
76 |
|
|
existing CBOR streams, as the indices are relative to the outer namespace |
77 |
|
|
tag, and don't change when embedded. |
78 |
|
|
|
79 |
|
|
=back |
80 |
|
|
|
81 |
|
|
Within a value tagged with stringref-namespace, every string that is |
82 |
|
|
encoded with a definite length and has a minimum length is assigned an |
83 |
|
|
implicit index, starting from zero. |
84 |
|
|
|
85 |
|
|
The minimum length required to get an index depends on the index value |
86 |
|
|
that would be assigned to it, as stringref references should be shorter |
87 |
|
|
than the string they reference, and there is no point in wasting indices |
88 |
|
|
on strings that should not be referenced anyway. |
89 |
|
|
|
90 |
|
|
The lengths are as follows: |
91 |
|
|
|
92 |
|
|
string index minimum string length in octets |
93 |
|
|
0 .. 23 3 |
94 |
|
|
24 .. 255 4 |
95 |
|
|
256 .. 65535 5 |
96 |
|
|
65536 .. 4294967295 6 |
97 |
|
|
4294967296 .. 7 |
98 |
|
|
|
99 |
|
|
The minimum string length is simply the length of the stringref tag (2 |
100 |
|
|
octets), plus the minimum size the resulting index takes up, using major |
101 |
|
|
type 0 encoding. |
102 |
|
|
|
103 |
|
|
Each time a string is to be encoded by the encoder, it can choose to |
104 |
|
|
encode it as a normal string, or attempt to encode it as a stringref. |
105 |
|
|
|
106 |
|
|
If it attempts to encode a stringref, it needs to check whether the string |
107 |
|
|
has been assigned an index already. If yes, it can emit a stringref tag |
108 |
|
|
with the index. |
109 |
|
|
|
110 |
|
|
In all other cases the encoder must check whether it has to assign an |
111 |
|
|
index value to it by checking against the minimum length required for the |
112 |
|
|
next index value to be assigned, and assigning the next index value if the |
113 |
|
|
string has minimum length or is longer. |
114 |
|
|
|
115 |
|
|
A typical encoder might want to look up every string (using the major |
116 |
|
|
type and the octets) it encodes in some kind of internal map that stores |
117 |
|
|
previously-assigned indices. If found, it will emit a stringref. If not, |
118 |
|
|
it attempts to assign an index to the string and stores that in its map. |
119 |
|
|
|
120 |
|
|
Decoders might choose not to look up previous indices for a string, for |
121 |
|
|
example, because it only wants to compress strings used as map keys, or |
122 |
|
|
because it intrinsically knows that the string is only used once. This is |
123 |
|
|
fine, but each time the encoder emits a string, it MUST assign an index if |
124 |
|
|
the minimum length is reached or exceeded, even if it doesn't store that |
125 |
|
|
index for later use. |
126 |
|
|
|
127 |
|
|
Byte strings and text strings share the same index namespace, but are |
128 |
|
|
distinct, even if their contents are bit-identical. Indefinite length |
129 |
|
|
strings and the chunks they consist of are never assigned an index. |
130 |
|
|
|
131 |
|
|
A typical encoder would emit the stringref-namespace tag once to tag |
132 |
|
|
the top-level value in a CBOR stream, but nothing keeps an encoder |
133 |
|
|
from nesting stringref-namespace tags. Nesting might be done for |
134 |
|
|
convenience, to improve coding efficiency, to embed values or for any |
135 |
|
|
other reason. Similarly, an encoder can sort map keys for greater coding |
136 |
|
|
efficiency, but is not required to. |
137 |
|
|
|
138 |
|
|
A decoder follows the same process of assigning indices. A possible |
139 |
|
|
implementation could be outlined like this: |
140 |
|
|
|
141 |
|
|
When a decoder supports stringref, then, upon encountering a |
142 |
|
|
stringref-namespace tag, it should initialise an array and start decoding |
143 |
|
|
the tagged value. The array stores, for each index, the major type (byte |
144 |
|
|
or text) and the string value. Implementations are free to use their |
145 |
|
|
internal format, of course - languages that do not distinguish between |
146 |
|
|
text and byte strings can just store their internal string type in the |
147 |
|
|
array. |
148 |
|
|
|
149 |
|
|
Since stringref-namespace tags can be nested, the decoder needs to save |
150 |
|
|
and restore the outer array before starting and after ending the decoding |
151 |
|
|
of the tagged value. In a recursive implementation this could simply look |
152 |
|
|
like this: |
153 |
|
|
|
154 |
|
|
global stringref_array; // global per encoder |
155 |
|
|
local outer_stringref_array; |
156 |
|
|
|
157 |
|
|
outer_stringref_array = stringref_array; |
158 |
|
|
stringref_array = allocate_array (); |
159 |
|
|
|
160 |
|
|
value = decode_cbor_value (); |
161 |
|
|
|
162 |
|
|
deallocate_array (stringref_array); |
163 |
|
|
stringref_array = outer_stringref_array; |
164 |
|
|
|
165 |
|
|
return value; |
166 |
|
|
|
167 |
|
|
When a stringref tag is encountered, it MUST contain an unsigned integer |
168 |
|
|
(major type 0). The integer MUST be less than the number of entries in the |
169 |
|
|
array. The decoder than looks up the string in the array, given the index, |
170 |
|
|
and copies it to the result (string values should not be shared unless |
171 |
|
|
the language normally does so - if strings are mutable, a copy should be |
172 |
|
|
decoded). |
173 |
|
|
|
174 |
|
|
When a definite-length text or byte string is encountered, it is decoded |
175 |
|
|
normally. Afterwards, the decoder checks whether it needs to assign an |
176 |
|
|
index to the string, by comparing the string length against the minimum |
177 |
|
|
length required by the index. If the string is shorter, nothing happens, |
178 |
|
|
otherwise the next free index is used and the string (and the information |
179 |
|
|
whether it is a byte or text string) is recorded at the end of the array. |
180 |
|
|
|
181 |
|
|
In languages with dynamic arrays, this could be accomplished by using |
182 |
|
|
the array length as the next index to be assigned, and pushing the |
183 |
|
|
string onto the end of the array when it is long enough. |
184 |
|
|
|
185 |
|
|
=head2 IMPLEMENTATION NOTE |
186 |
|
|
|
187 |
|
|
The semantics of stringref tags require the decoder to be aware and |
188 |
|
|
the encoder to be under control of the sequence in which data items |
189 |
|
|
are encoded into the CBOR stream. This means these tags cannot be |
190 |
|
|
implemented on top of every generic CBOR encoder/decoder (which might |
191 |
|
|
reorder entries in a map); they need to be integrated into their works. |
192 |
|
|
|
193 |
|
|
=head1 EXAMPLES |
194 |
|
|
|
195 |
|
|
<TBD> |
196 |
|
|
|