ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.2
Committed: Thu Nov 28 09:13:12 2013 UTC (10 years, 6 months ago) by root
Branch: MAIN
Changes since 1.1: +2 -2 lines
Log Message:
*** empty log message ***

File Contents

# Content
1 =head1 REGISTRATION INFORMATION
2
3 Tag <unassigned> (stringref-namespace)
4 Data Item multiple
5 Semantics mark value as having string references
6 Reference http://cbor.schmorp.de/stringref
7 Contact Marc A. Lehmann <cbor@schmorp.de>
8
9 Tag <unassigned> (stringref)
10 Data Item unsigned integer
11 Semantics reference the nth previously seen string
12 Reference http://cbor.schmorp.de/stringref
13 Contact Marc A. Lehmann <cbor@schmorp.de>
14
15 =head1 RATIONALE
16
17 These two tags can be used to efficiently share constant strings, which
18 are common in many data structures. They are an optimisation only,
19 similar to LZSS compresssion, for the string major types in CBOR.
20
21 Background: many data structures or protocols contain repeated string
22 constants (often called atoms or quarks). For example, this is a
23 real-world (JSON-formatted) data structure that contains many repeated
24 strings used as map keys:
25
26 [
27 {
28 "name" : "Cocktail",
29 "count" : 417,
30 "rank" : 4
31 },
32 {
33 "rank" : 4,
34 "count" : 312,
35 "name" : "Bath"
36 },
37 {
38 "count" : 691,
39 "name" : "Food",
40 "rank" : 4
41 }
42 ]
43
44 This can be encoded nicely with CBOR, but each occurrence of "name",
45 "count" and "rank" will be encoded separately. In highly repetitive data
46 structures (the above is taken from a game save file) these can take up
47 considerable space.
48
49 This scheme can be used to reduce this overhead with a simple scheme that
50 is easily implementable.
51
52 =head1 DESCRIPTION
53
54 Stringref consists of two tags, stringref-namespace (value <unassigned>),
55 which marks a value as containing string references, and stringref (value
56 <unassigned>), which references a string previously encoded in the value.
57
58 The stringref-namespace tag is used to define a namespace for the string
59 reference ids. stringref tags are only valid inside CBOR values marked
60 with stringref-namespace.
61
62 The purpose of stringref-namespace is three-fold:
63
64 =over 4
65
66 =item 1. It signals the decoder that it needs to keep a copy of every
67 string that is being decoded while decoding the tagged value (and
68 conversely, that there is no need for this extra overhead without this
69 tag).
70
71 =item 2. It can be used to confine string references to certain subtrees,
72 which might increase encoding efficiency by being able to use smaller
73 references.
74
75 =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76 existing CBOR streams, as the indices are relative to the outer namespace
77 tag, and don't change when embedded.
78
79 =back
80
81 Within a value tagged with stringref-namespace, every string that is
82 encoded with a definite length and has a minimum length is assigned an
83 implicit index, starting from zero.
84
85 The minimum length required to get an index depends on the index value
86 that would be assigned to it, as stringref references should be shorter
87 than the string they reference, and there is no point in wasting indices
88 on strings that should not be referenced anyway.
89
90 The lengths are as follows:
91
92 string index minimum string length in octets
93 0 .. 23 3
94 24 .. 255 4
95 256 .. 65535 5
96 65536 .. 4294967295 7
97 4294967296 .. 11
98
99 The minimum string length is simply the length of the stringref tag (2
100 octets), plus the minimum size the resulting index takes up, using major
101 type 0 encoding.
102
103 Each time a string is to be encoded by the encoder, it can choose to
104 encode it as a normal string, or attempt to encode it as a stringref.
105
106 If it attempts to encode a stringref, it needs to check whether the string
107 has been assigned an index already. If yes, it can emit a stringref tag
108 with the index.
109
110 In all other cases the encoder must check whether it has to assign an
111 index value to it by checking against the minimum length required for the
112 next index value to be assigned, and assigning the next index value if the
113 string has minimum length or is longer.
114
115 A typical encoder might want to look up every string (using the major
116 type and the octets) it encodes in some kind of internal map that stores
117 previously-assigned indices. If found, it will emit a stringref. If not,
118 it attempts to assign an index to the string and stores that in its map.
119
120 Decoders might choose not to look up previous indices for a string, for
121 example, because it only wants to compress strings used as map keys, or
122 because it intrinsically knows that the string is only used once. This is
123 fine, but each time the encoder emits a string, it MUST assign an index if
124 the minimum length is reached or exceeded, even if it doesn't store that
125 index for later use.
126
127 Byte strings and text strings share the same index namespace, but are
128 distinct, even if their contents are bit-identical. Indefinite length
129 strings and the chunks they consist of are never assigned an index.
130
131 A typical encoder would emit the stringref-namespace tag once to tag
132 the top-level value in a CBOR stream, but nothing keeps an encoder
133 from nesting stringref-namespace tags. Nesting might be done for
134 convenience, to improve coding efficiency, to embed values or for any
135 other reason. Similarly, an encoder can sort map keys for greater coding
136 efficiency, but is not required to.
137
138 A decoder follows the same process of assigning indices. A possible
139 implementation could be outlined like this:
140
141 When a decoder supports stringref, then, upon encountering a
142 stringref-namespace tag, it should initialise an array and start decoding
143 the tagged value. The array stores, for each index, the major type (byte
144 or text) and the string value. Implementations are free to use their
145 internal format, of course - languages that do not distinguish between
146 text and byte strings can just store their internal string type in the
147 array.
148
149 Since stringref-namespace tags can be nested, the decoder needs to save
150 and restore the outer array before starting and after ending the decoding
151 of the tagged value. In a recursive implementation this could simply look
152 like this:
153
154 global stringref_array; // global per encoder
155 local outer_stringref_array;
156
157 outer_stringref_array = stringref_array;
158 stringref_array = allocate_array ();
159
160 value = decode_cbor_value ();
161
162 deallocate_array (stringref_array);
163 stringref_array = outer_stringref_array;
164
165 return value;
166
167 When a stringref tag is encountered, it MUST contain an unsigned integer
168 (major type 0). The integer MUST be less than the number of entries in the
169 array. The decoder than looks up the string in the array, given the index,
170 and copies it to the result (string values should not be shared unless
171 the language normally does so - if strings are mutable, a copy should be
172 decoded).
173
174 When a definite-length text or byte string is encountered, it is decoded
175 normally. Afterwards, the decoder checks whether it needs to assign an
176 index to the string, by comparing the string length against the minimum
177 length required by the index. If the string is shorter, nothing happens,
178 otherwise the next free index is used and the string (and the information
179 whether it is a byte or text string) is recorded at the end of the array.
180
181 In languages with dynamic arrays, this could be accomplished by using
182 the array length as the next index to be assigned, and pushing the
183 string onto the end of the array when it is long enough.
184
185 =head2 IMPLEMENTATION NOTE
186
187 The semantics of stringref tags require the decoder to be aware and
188 the encoder to be under control of the sequence in which data items
189 are encoded into the CBOR stream. This means these tags cannot be
190 implemented on top of every generic CBOR encoder/decoder (which might
191 reorder entries in a map); they need to be integrated into their works.
192
193 =head1 EXAMPLES
194
195 <TBD>
196