ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/CBOR-XS/doc/stringref.pod
Revision: 1.1
Committed: Tue Nov 26 09:43:39 2013 UTC (10 years, 6 months ago) by root
Branch: MAIN
Log Message:
*** empty log message ***

File Contents

# User Rev Content
1 root 1.1 =head1 REGISTRATION INFORMATION
2    
3     Tag <unassigned> (stringref-namespace)
4     Data Item multiple
5     Semantics mark value as having string references
6     Reference http://cbor.schmorp.de/stringref
7     Contact Marc A. Lehmann <cbor@schmorp.de>
8    
9     Tag <unassigned> (stringref)
10     Data Item unsigned integer
11     Semantics reference the nth previously seen string
12     Reference http://cbor.schmorp.de/stringref
13     Contact Marc A. Lehmann <cbor@schmorp.de>
14    
15     =head1 RATIONALE
16    
17     These two tags can be used to efficiently share constant strings, which
18     are common in many data structures. They are an optimisation only,
19     similar to LZSS compresssion, for the string major types in CBOR.
20    
21     Background: many data structures or protocols contain repeated string
22     constants (often called atoms or quarks). For example, this is a
23     real-world (JSON-formatted) data structure that contains many repeated
24     strings used as map keys:
25    
26     [
27     {
28     "name" : "Cocktail",
29     "count" : 417,
30     "rank" : 4
31     },
32     {
33     "rank" : 4,
34     "count" : 312,
35     "name" : "Bath"
36     },
37     {
38     "count" : 691,
39     "name" : "Food",
40     "rank" : 4
41     }
42     ]
43    
44     This can be encoded nicely with CBOR, but each occurrence of "name",
45     "count" and "rank" will be encoded separately. In highly repetitive data
46     structures (the above is taken from a game save file) these can take up
47     considerable space.
48    
49     This scheme can be used to reduce this overhead with a simple scheme that
50     is easily implementable.
51    
52     =head1 DESCRIPTION
53    
54     Stringref consists of two tags, stringref-namespace (value <unassigned>),
55     which marks a value as containing string references, and stringref (value
56     <unassigned>), which references a string previously encoded in the value.
57    
58     The stringref-namespace tag is used to define a namespace for the string
59     reference ids. stringref tags are only valid inside CBOR values marked
60     with stringref-namespace.
61    
62     The purpose of stringref-namespace is three-fold:
63    
64     =over 4
65    
66     =item 1. It signals the decoder that it needs to keep a copy of every
67     string that is being decoded while decoding the tagged value (and
68     conversely, that there is no need for this extra overhead without this
69     tag).
70    
71     =item 2. It can be used to confine string references to certain subtrees,
72     which might increase encoding efficiency by being able to use smaller
73     references.
74    
75     =item 3. CBOR objects encoded with this scheme can be blindly embedded in
76     existing CBOR streams, as the indices are relative to the outer namespace
77     tag, and don't change when embedded.
78    
79     =back
80    
81     Within a value tagged with stringref-namespace, every string that is
82     encoded with a definite length and has a minimum length is assigned an
83     implicit index, starting from zero.
84    
85     The minimum length required to get an index depends on the index value
86     that would be assigned to it, as stringref references should be shorter
87     than the string they reference, and there is no point in wasting indices
88     on strings that should not be referenced anyway.
89    
90     The lengths are as follows:
91    
92     string index minimum string length in octets
93     0 .. 23 3
94     24 .. 255 4
95     256 .. 65535 5
96     65536 .. 4294967295 6
97     4294967296 .. 7
98    
99     The minimum string length is simply the length of the stringref tag (2
100     octets), plus the minimum size the resulting index takes up, using major
101     type 0 encoding.
102    
103     Each time a string is to be encoded by the encoder, it can choose to
104     encode it as a normal string, or attempt to encode it as a stringref.
105    
106     If it attempts to encode a stringref, it needs to check whether the string
107     has been assigned an index already. If yes, it can emit a stringref tag
108     with the index.
109    
110     In all other cases the encoder must check whether it has to assign an
111     index value to it by checking against the minimum length required for the
112     next index value to be assigned, and assigning the next index value if the
113     string has minimum length or is longer.
114    
115     A typical encoder might want to look up every string (using the major
116     type and the octets) it encodes in some kind of internal map that stores
117     previously-assigned indices. If found, it will emit a stringref. If not,
118     it attempts to assign an index to the string and stores that in its map.
119    
120     Decoders might choose not to look up previous indices for a string, for
121     example, because it only wants to compress strings used as map keys, or
122     because it intrinsically knows that the string is only used once. This is
123     fine, but each time the encoder emits a string, it MUST assign an index if
124     the minimum length is reached or exceeded, even if it doesn't store that
125     index for later use.
126    
127     Byte strings and text strings share the same index namespace, but are
128     distinct, even if their contents are bit-identical. Indefinite length
129     strings and the chunks they consist of are never assigned an index.
130    
131     A typical encoder would emit the stringref-namespace tag once to tag
132     the top-level value in a CBOR stream, but nothing keeps an encoder
133     from nesting stringref-namespace tags. Nesting might be done for
134     convenience, to improve coding efficiency, to embed values or for any
135     other reason. Similarly, an encoder can sort map keys for greater coding
136     efficiency, but is not required to.
137    
138     A decoder follows the same process of assigning indices. A possible
139     implementation could be outlined like this:
140    
141     When a decoder supports stringref, then, upon encountering a
142     stringref-namespace tag, it should initialise an array and start decoding
143     the tagged value. The array stores, for each index, the major type (byte
144     or text) and the string value. Implementations are free to use their
145     internal format, of course - languages that do not distinguish between
146     text and byte strings can just store their internal string type in the
147     array.
148    
149     Since stringref-namespace tags can be nested, the decoder needs to save
150     and restore the outer array before starting and after ending the decoding
151     of the tagged value. In a recursive implementation this could simply look
152     like this:
153    
154     global stringref_array; // global per encoder
155     local outer_stringref_array;
156    
157     outer_stringref_array = stringref_array;
158     stringref_array = allocate_array ();
159    
160     value = decode_cbor_value ();
161    
162     deallocate_array (stringref_array);
163     stringref_array = outer_stringref_array;
164    
165     return value;
166    
167     When a stringref tag is encountered, it MUST contain an unsigned integer
168     (major type 0). The integer MUST be less than the number of entries in the
169     array. The decoder than looks up the string in the array, given the index,
170     and copies it to the result (string values should not be shared unless
171     the language normally does so - if strings are mutable, a copy should be
172     decoded).
173    
174     When a definite-length text or byte string is encountered, it is decoded
175     normally. Afterwards, the decoder checks whether it needs to assign an
176     index to the string, by comparing the string length against the minimum
177     length required by the index. If the string is shorter, nothing happens,
178     otherwise the next free index is used and the string (and the information
179     whether it is a byte or text string) is recorded at the end of the array.
180    
181     In languages with dynamic arrays, this could be accomplished by using
182     the array length as the next index to be assigned, and pushing the
183     string onto the end of the array when it is long enough.
184    
185     =head2 IMPLEMENTATION NOTE
186    
187     The semantics of stringref tags require the decoder to be aware and
188     the encoder to be under control of the sequence in which data items
189     are encoded into the CBOR stream. This means these tags cannot be
190     implemented on top of every generic CBOR encoder/decoder (which might
191     reorder entries in a map); they need to be integrated into their works.
192    
193     =head1 EXAMPLES
194    
195     <TBD>
196