--- JSON-XS/README 2008/03/19 22:31:00 1.23
+++ JSON-XS/README 2008/03/27 06:37:35 1.24
@@ -608,6 +608,216 @@
            JSON::XS->new->decode_prefix ("[1] the tail")
            => ([], 3)
 
+INCREMENTAL PARSING
+    [This section and the API it details are still EXPERIMENTAL]
+
+    In some cases, there is the need for incremental parsing of JSON texts.
+    While this module always has to keep both JSON text and resulting Perl
+    data structure in memory at one time, it does allow you to parse a JSON
+    stream incrementally. It does so by accumulating text until it has a
+    full JSON object, which it then can decode. This process is similar to
+    using "decode_prefix" to see if a full JSON object is available, but is
+    much more efficient (JSON::XS will only attempt to parse the JSON text
+    once it is sure it has enough text to get a decisive result, using a
+    very simple but truly incremental parser).
+
+    The following three methods deal with this.
+
+    [void, scalar or list context] = $json->incr_parse ([$string])
+        This is the central parsing function. It can both append new text
+        and extract objects from the stream accumulated so far (both of
+        these functions are optional).
+
+        If $string is given, then this string is appended to the already
+        existing JSON fragment stored in the $json object.
+
+        After that, if the function is called in void context, it will
+        simply return without doing anything further. This can be used to
+        add more text in as many chunks as you want.
+
+        If the method is called in scalar context, then it will try to
+        extract exactly *one* JSON object. If that is successful, it will
+        return this object, otherwise it will return "undef". If there is a
+        parse error, this method will croak just as "decode" would do (one
+        can then use "incr_skip" to skip the erroneous part). This is the
+        most common way of using the method.
+
+        And finally, in list context, it will try to extract as many objects
+        from the stream as it can find and return them, or the empty list
+        otherwise.
+        For this to work, there must be no separators between the
+        JSON objects or arrays; instead, they must be concatenated
+        back-to-back. If an error occurs, an exception will be raised as in
+        the scalar context case. Note that in this case, any
+        previously-parsed JSON texts will be lost.
+
+    $lvalue_string = $json->incr_text
+        This method returns the currently stored JSON fragment as an lvalue,
+        that is, you can manipulate it. This *only* works when a preceding
+        call to "incr_parse" in *scalar context* successfully returned an
+        object. Under all other circumstances you must not call this
+        function (I mean it: although in simple tests it might actually
+        work, it *will* fail under real world conditions). As a special
+        exception, you can also call this method before having parsed
+        anything.
+
+        This function is useful in two cases: a) finding the trailing text
+        after a JSON object or b) parsing multiple JSON objects separated by
+        non-JSON text (such as commas).
+
+    $json->incr_skip
+        This will reset the state of the incremental parser and will remove
+        the parsed text from the input buffer. This is useful after
+        "incr_parse" died, in which case the input buffer and incremental
+        parser state are left unchanged, to skip the text parsed so far and
+        to reset the parse state.
+
+  LIMITATIONS
+    All options that affect decoding are supported, except "allow_nonref".
+    The reason for this is that it cannot be made to work sensibly: JSON
+    objects and arrays are self-delimited, i.e. you can concatenate them
+    back to back and still decode them perfectly. This does not hold true
+    for JSON numbers, however.
+
+    For example, is the string 1 a single JSON number, or is it simply the
+    start of 12? Or is 12 a single JSON number, or the concatenation of 1
+    and 2? In neither case can you tell, and this is why JSON::XS takes the
+    conservative route and disallows this case.
+
+  EXAMPLES
+    Some examples will make all this clearer.
+    First, a simple example that
+    works similarly to "decode_prefix": We want to decode the JSON object at
+    the start of a string and identify the portion after the JSON object:
+
+       my $text = "[1,2,3] hello";
+
+       my $json = new JSON::XS;
+
+       my $obj = $json->incr_parse ($text)
+          or die "expected JSON object or array at beginning of string";
+
+       my $tail = $json->incr_text;
+       # $tail now contains " hello"
+
+    Easy, isn't it?
+
+    Now for a more complicated example: Imagine a hypothetical protocol
+    where you read some requests from a TCP stream, and each request is a
+    JSON array, without any separation between them (in fact, it is often
+    useful to use newlines as "separators", as these get interpreted as
+    whitespace at the start of the JSON text, which makes it possible to
+    test said protocol with "telnet"...).
+
+    Here is how you'd do it (it is trivial to write this in an event-based
+    manner):
+
+       my $json = new JSON::XS;
+
+       # read some data from the socket
+       while (sysread $socket, my $buf, 4096) {
+
+          # split and decode as many requests as possible
+          for my $request ($json->incr_parse ($buf)) {
+             # act on the $request
+          }
+       }
+
+    Another complicated example: Assume you have a string with JSON objects
+    or arrays, all separated by (optional) comma characters (e.g. "[1],[2],
+    [3]"). To parse them, we have to skip the commas between the JSON texts,
+    and here is where the lvalue-ness of "incr_text" comes in useful:
+
+       my $text = "[1],[2], [3]";
+       my $json = new JSON::XS;
+
+       # void context, so no parsing done
+       $json->incr_parse ($text);
+
+       # now extract as many objects as possible. note the
+       # use of scalar context so incr_text can be called.
+       while (my $obj = $json->incr_parse) {
+          # do something with $obj
+
+          # now skip the optional comma
+          $json->incr_text =~ s/^ \s* , //x;
+       }
+
+    Now let's go for a very complex example: Assume that you have a gigantic
+    JSON array-of-objects, many gigabytes in size, and you want to parse it,
+    but you cannot load it into memory fully (this has actually happened in
+    the real world :).
+
+    Well, you lost, you have to implement your own JSON parser. But JSON::XS
+    can still help you: You implement a (very simple) array parser and let
+    JSON decode the array elements, which are all full JSON objects on their
+    own (this wouldn't work if the array elements could be JSON numbers, for
+    example):
+
+       my $json = new JSON::XS;
+
+       # open the monster
+       open my $fh, "<bigfile.json"
+          or die "bigfile: $!";
+
+       # first parse the initial "["
+       for (;;) {
+          sysread $fh, my $buf, 65536
+             or die "read error: $!";
+          $json->incr_parse ($buf); # void context, so no parsing
+
+          # Exit the loop once we found and removed(!) the initial "[".
+          # In essence, we are (ab-)using the $json object as a simple scalar
+          # we append data to.
+          last if $json->incr_text =~ s/^ \s* \[ //x;
+       }
+
+       # now we have skipped the initial "[", so continue
+       # parsing all the elements.
+       for (;;) {
+          # in this loop we read data until we got a single JSON object
+          for (;;) {
+             if (my $obj = $json->incr_parse) {
+                # do something with $obj
+                last;
+             }
+
+             # add more data
+             sysread $fh, my $buf, 65536
+                or die "read error: $!";
+             $json->incr_parse ($buf); # void context, so no parsing
+          }
+
+          # in this loop we read data until we either found and parsed the
+          # separating "," between elements, or the final "]"
+          for (;;) {
+             # first skip whitespace
+             $json->incr_text =~ s/^\s*//;
+
+             # if we find "]", we are done
+             if ($json->incr_text =~ s/^\]//) {
+                print "finished.\n";
+                exit;
+             }
+
+             # if we find ",", we can continue with the next element
+             if ($json->incr_text =~ s/^,//) {
+                last;
+             }
+
+             # if we find anything else, we have a parse error!
+             if (length $json->incr_text) {
+                die "parse error near ", $json->incr_text;
+             }
+
+             # else add more data
+             sysread $fh, my $buf, 65536
+                or die "read error: $!";
+             $json->incr_parse ($buf); # void context, so no parsing
+          }
+
+    This is a complex example, but most of the complexity comes from the
+    fact that we are trying to be correct (bear with me if I am wrong, I
+    never ran the above example :).
+
 MAPPING
     This section describes how JSON::XS maps Perl values to JSON values and
     vice versa. These mappings are designed to "do the right thing" in most
@@ -736,16 +946,16 @@
 
         You can not currently force the type in other, less obscure, ways.
         Tell me if you need this capability (but don't forget to explain why
-        its needed :).
+        it's needed :).
 
 ENCODING/CODESET FLAG NOTES
     The interested reader might have seen a number of flags that signify
     encodings or codesets - "utf8", "latin1" and "ascii". There seems to be
     some confusion on what these do, so here is a short comparison:
 
-    "utf8" controls wether the JSON text created by "encode" (and expected
+    "utf8" controls whether the JSON text created by "encode" (and expected
     by "decode") is UTF-8 encoded or not, while "latin1" and "ascii" only
-    control wether "encode" escapes character values outside their
+    control whether "encode" escapes character values outside their
     respective codeset range. Neither of these flags conflict with each
     other, although some combinations make less sense than others.
 
@@ -833,96 +1043,6 @@
         in mail), and works because ASCII is a proper subset of most 8-bit
         and multibyte encodings in use in the world.
 
-COMPARISON
-    As already mentioned, this module was created because none of the
-    existing JSON modules could be made to work correctly. First I will
-    describe the problems (or pleasures) I encountered with various existing
-    JSON modules, followed by some benchmark values. JSON::XS was designed
-    not to suffer from any of these problems or limitations.
-
-    JSON 2.xx
-        A marvellous piece of engineering, this module either uses JSON::XS
-        directly when available (so will be 100% compatible with it,
-        including speed), or it uses JSON::PP, which is basically JSON::XS
-        translated to Pure Perl, which should be 100% compatible with
-        JSON::XS, just a bit slower.
-
-        You cannot really lose by using this module, especially as it tries
-        very hard to work even with ancient Perl versions, while JSON::XS
-        does not.
-
-    JSON 1.07
-        Slow (but very portable, as it is written in pure Perl).
-
-        Undocumented/buggy Unicode handling (how JSON handles Unicode values
-        is undocumented. One can get far by feeding it Unicode strings and
-        doing en-/decoding oneself, but Unicode escapes are not working
-        properly).
-
-        No round-tripping (strings get clobbered if they look like numbers,
-        e.g. the string 2.0 will encode to 2.0 instead of "2.0", and that
-        will decode into the number 2.
-
-    JSON::PC 0.01
-        Very fast.
-
-        Undocumented/buggy Unicode handling.
-
-        No round-tripping.
-
-        Has problems handling many Perl values (e.g. regex results and other
-        magic values will make it croak).
-
-        Does not even generate valid JSON ("{1,2}" gets converted to "{1:2}"
-        which is not a valid JSON text.
-
-        Unmaintained (maintainer unresponsive for many months, bugs are not
-        getting fixed).
-
-    JSON::Syck 0.21
-        Very buggy (often crashes).
-
-        Very inflexible (no human-readable format supported, format pretty
-        much undocumented. I need at least a format for easy reading by
-        humans and a single-line compact format for use in a protocol, and
-        preferably a way to generate ASCII-only JSON texts).
-
-        Completely broken (and confusingly documented) Unicode handling
-        (Unicode escapes are not working properly, you need to set
-        ImplicitUnicode to *different* values on en- and decoding to get
-        symmetric behaviour).
-
-        No round-tripping (simple cases work, but this depends on whether
-        the scalar value was used in a numeric context or not).
-
-        Dumping hashes may skip hash values depending on iterator state.
-
-        Unmaintained (maintainer unresponsive for many months, bugs are not
-        getting fixed).
-
-        Does not check input for validity (i.e. will accept non-JSON input
-        and return "something" instead of raising an exception. This is a
-        security issue: imagine two banks transferring money between each
-        other using JSON. One bank might parse a given non-JSON request and
-        deduct money, while the other might reject the transaction with a
-        syntax error. While a good protocol will at least recover, that is
-        extra unnecessary work and the transaction will still not succeed).
-
-    JSON::DWIW 0.04
-        Very fast. Very natural. Very nice.
-
-        Undocumented Unicode handling (but the best of the pack. Unicode
-        escapes still don't get parsed properly).
-
-        Very inflexible.
-
-        No round-tripping.
-
-        Does not generate valid JSON texts (key strings are often unquoted,
-        empty keys result in nothing being output)
-
-        Does not check input for validity.
-
   JSON and YAML
     You often hear that JSON is a subset of YAML. This is, however, a mass
     hysteria(*) and very far from the truth (as of the time of this
@@ -1077,19 +1197,22 @@
     This module is *not* guaranteed to be thread safe and there are no
     plans to change this until Perl gets thread support (as opposed to the
     horribly slow so-called "threads" which are simply slow and bloated
-    process simulations - use fork, its *much* faster, cheaper, better).
+    process simulations - use fork, it's *much* faster, cheaper, better).
 
     (It might actually work, but you have been warned).
 
 BUGS
     While the goal of this module is to be correct, that unfortunately does
-    not mean its bug-free, only that I think its design is bug-free. It is
+    not mean it's bug-free, only that I think its design is bug-free. It is
     still relatively early in its development. If you keep reporting bugs
     they will be fixed swiftly, though.
 
     Please refrain from using rt.cpan.org or any other bug reporting service.
     I put the contact address into my modules for a reason.
 
+SEE ALSO
+    The json_xs command line utility for quick experiments.
+
 AUTHOR
     Marc Lehmann
     http://home.schmorp.de/