--- JSON-XS/XS.pm 2007/10/11 22:52:52 1.62
+++ JSON-XS/XS.pm 2007/10/11 23:07:43 1.63
@@ -105,9 +105,8 @@
 
 =item $json_text = to_json $perl_scalar
 
-Converts the given Perl data structure (a simple scalar or a reference to
-a hash or array) to a UTF-8 encoded, binary string (that is, the string contains
-octets only). Croaks on error.
+Converts the given Perl data structure to a UTF-8 encoded, binary string
+(that is, the string contains octets only). Croaks on error.
 
 This function call is functionally identical to:
 
@@ -117,9 +116,9 @@
 
 =item $perl_scalar = from_json $json_text
 
-The opposite of C<to_json>: expects an UTF-8 (binary) string and tries to
-parse that as an UTF-8 encoded JSON text, returning the resulting simple
-scalar or reference. Croaks on error.
+The opposite of C<to_json>: expects an UTF-8 (binary) string and tries
+to parse that as an UTF-8 encoded JSON text, returning the resulting
+reference. Croaks on error.
 
 This function call is functionally identical to:
 
@@ -139,6 +138,54 @@
 
 =back
 
+=head1 A FEW NOTES ON UNICODE AND PERL
+
+Since this often leads to confusion, here are a few very clear words on
+how Unicode works in Perl, modulo bugs.
+
+=over 4
+
+=item 1. Perl strings can store characters with ordinal values > 255.
+
+This enables you to store unicode characters as single characters in a
+Perl string - very natural.
+
+=item 2. Perl does I<not> associate an encoding with your strings.
+
+Unless you force it to, e.g. when matching it against a regex, or printing
+the scalar to a file, in which case Perl either interprets your string as
+locale-encoded text, octets/binary, or as Unicode, depending on various
+settings. In no case is an encoding stored together with your data, it is
+I<use> that decides encoding, not any magical metadata.
+
+=item 3. The internal utf-8 flag has no meaning with regards to the
+encoding of your string.
+
+Just ignore that flag unless you debug a Perl bug, a module written in
+XS or want to dive into the internals of perl. Otherwise it will only
+confuse you, as, despite the name, it says nothing about how your string
+is encoded. You can have unicode strings with that flag set, with that
+flag clear, and you can have binary data with that flag set and that flag
+clear. Other possibilities exist, too.
+
+If you didn't know about that flag, just the better, pretend it doesn't
+exist.
+
+=item 4. A "Unicode String" is simply a string where each character can be
+validly interpreted as a Unicode codepoint.
+
+If you have UTF-8 encoded data, it is no longer a Unicode string, but a
+Unicode string encoded in UTF-8, giving you a binary string.
+
+=item 5. A string containing "high" (> 255) character values is I<not> a UTF-8 string.
+
+Its a fact. Learn to live with it.
+
+=back
+
+I hope this helps :)
+
+
 =head1 OBJECT-ORIENTED INTERFACE
 
 The object oriented interface lets you configure your own encoding or