--- libecb/ecb.pod	2011/05/31 21:52:31	1.25
+++ libecb/ecb.pod	2021/08/20 20:09:12	1.99
@@ -12,14 +12,14 @@
 
     http://software.schmorp.de/pkg/libecb
 
-It mainly provides a number of wrappers around GCC built-ins, together
-with replacement functions for other compilers. In addition to this,
-it provides a number of other lowlevel C utilities, such as endianness
-detection, byte swapping or bit rotations.
-
-Or in other words, things that should be built into any standard C system,
-but aren't, implemented as efficient as possible with GCC, and still
-correct with other compilers.
+It mainly provides a number of wrappers around many compiler built-ins,
+together with replacement functions for other compilers. In addition
+to this, it provides a number of other lowlevel C utilities, such as
+endianness detection, byte swapping or bit rotations.
+
+Or in other words, things that should be built into any standard C
+system, but aren't, implemented as efficient as possible with GCC (clang,
+msvc...), and still correct with other compilers.
 
 More might come.
 
@@ -56,44 +56,200 @@
 is usually implemented as a macro. Specifically, a "bool" in this manual
 refers to any kind of boolean value, not a specific type.
 
-=head2 GCC ATTRIBUTES
+=head2 TYPES / TYPE SUPPORT
 
-A major part of libecb deals with GCC attributes. These are additional
-attributes that you cna assign to functions, variables and sometimes even
-types - much like C<const> or C<volatile> in C.
-
-While GCC allows declarations to show up in many surprising places,
-but not in many expeted places, the safest way is to put attribute
-declarations before the whole declaration:
+ecb.h makes sure that the following types are defined (in the expected way):
 
-   ecb_const int mysqrt (int a);
-   ecb_unused int i;
+   int8_t       uint8_
+   int16_t      uint16_t
+   int32_t      uint32_
+   int64_t      uint64_t
+   int_fast8_t  uint_fast8_t
+   int_fast16_t uint_fast16_t
+   int_fast32_t uint_fast32_t
+   int_fast64_t uint_fast64_t
+   intptr_t     uintptr_t
+
+The macro C<ECB_PTRSIZE> is defined to the size of a pointer on this
+platform (currently C<4> or C<8>) and can be used in preprocessor
+expressions.
+
+For C<ptrdiff_t> and C<size_t> use C<stddef.h>/C<cstddef>.
+
+=head2 LANGUAGE/ENVIRONMENT/COMPILER VERSIONS
+
+All the following symbols expand to an expression that can be tested in
+preprocessor instructions as well as treated as a boolean (use C<!!> to
+ensure it's either C<0> or C<1> if you need that).
+
+=over
+
+=item ECB_C
+
+True if the implementation defines the C<__STDC__> macro to a true value,
+while not claiming to be C++, i..e C, but not C++.
+
+=item ECB_C99
+
+True if the implementation claims to be compliant to C99 (ISO/IEC
+9899:1999) or any later version, while not claiming to be C++.
+
+Note that later versions (ECB_C11) remove core features again (for
+example, variable length arrays).
+
+=item ECB_C11, ECB_C17
+
+True if the implementation claims to be compliant to C11/C17 (ISO/IEC
+9899:2011, :20187) or any later version, while not claiming to be C++.
+
+=item ECB_CPP
+
+True if the implementation defines the C<__cplusplus__> macro to a true
+value, which is typically true for C++ compilers.
+
+=item ECB_CPP11, ECB_CPP14, ECB_CPP17
+
+True if the implementation claims to be compliant to C++11/C++14/C++17
+(ISO/IEC 14882:2011, :2014, :2017) or any later version.
+
+Note that many C++20 features will likely have their own feature test
+macros (see e.g. L<http://eel.is/c++draft/cpp.predefined#1.8>).
+
+=item ECB_OPTIMIZE_SIZE
+
+Is C<1> when the compiler optimizes for size, C<0> otherwise. This symbol
+can also be defined before including F<ecb.h>, in which case it will be
+unchanged.
+
+=item ECB_GCC_VERSION (major, minor)
+
+Expands to a true value (suitable for testing by the preprocessor) if the
+compiler used is GNU C and the version is the given version, or higher.
+
+This macro tries to return false on compilers that claim to be GCC
+compatible but aren't.
+
+=item ECB_EXTERN_C
+
+Expands to C<extern "C"> in C++, and a simple C<extern> in C.
+
+This can be used to declare a single external C function:
+
+   ECB_EXTERN_C int printf (const char *format, ...);
+
+=item ECB_EXTERN_C_BEG / ECB_EXTERN_C_END
+
+These two macros can be used to wrap multiple C<extern "C"> definitions -
+they expand to nothing in C.
+
+They are most useful in header files:
+
+   ECB_EXTERN_C_BEG
+
+   int mycfun1 (int x);
+   int mycfun2 (int x);
+
+   ECB_EXTERN_C_END
+
+=item ECB_STDFP
+
+If this evaluates to a true value (suitable for testing by the
+preprocessor), then C<float> and C<double> use IEEE 754 single/binary32
+and double/binary64 representations internally I<and> the endianness of
+both types match the endianness of C<uint32_t> and C<uint64_t>.
+
+This means you can just copy the bits of a C<float> (or C<double>) to an
+C<uint32_t> (or C<uint64_t>) and get the raw IEEE 754 bit representation
+without having to think about format or endianness.
+
+This is true for basically all modern platforms, although F<ecb.h> might
+not be able to deduce this correctly everywhere and might err on the safe
+side.
+
+=item ECB_64BIT_NATIVE
+
+Evaluates to a true value (suitable for both preprocessor and C code
+testing) if 64 bit integer types on this architecture are evaluated
+"natively", that is, with similar speeds as 32 bit integers. While 64 bit
+integer support is very common (and in fact required by libecb), 32 bit
+cpus have to emulate operations on them, so you might want to avoid them.
+
+=item ECB_AMD64, ECB_AMD64_X32
+
+These two macros are defined to C<1> on the x86_64/amd64 ABI and the X32
+ABI, respectively, and undefined elsewhere.
+
+The designers of the new X32 ABI for some inexplicable reason decided to
+make it look exactly like amd64, even though it's completely incompatible
+to that ABI, breaking about every piece of software that assumed that
+C<__x86_64> stands for, well, the x86-64 ABI, making these macros
+necessary.
 
-For variables, it is often nicer to put the attribute after the name, and
-avoid multiple declarations using commas:
+=back
+
+=head2 MACRO TRICKERY
+
+=over
+
+=item ECB_CONCAT (a, b)
+
+Expands any macros in C<a> and C<b>, then concatenates the result to form
+a single token. This is mainly useful to form identifiers from components,
+e.g.:
+
+   #define S1 str
+   #define S2 cpy
+
+   ECB_CONCAT (S1, S2)(dst, src); // == strcpy (dst, src);
+
+=item ECB_STRINGIFY (arg)
+
+Expands any macros in C<arg> and returns the stringified version of
+it. This is mainly useful to get the contents of a macro in string form,
+e.g.:
+
+   #define SQL_LIMIT 100
+   sql_exec ("select * from table limit " ECB_STRINGIFY (SQL_LIMIT));
+
+=item ECB_STRINGIFY_EXPR (expr)
 
-   int i ecb_unused;
+Like C<ECB_STRINGIFY>, but additionally evaluates C<expr> to make sure it
+is a valid expression. This is useful to catch typos or cases where the
+macro isn't available:
 
-=over 4
+   #include <errno.h>
 
-=item ecb_attribute ((attrs...))
+   ECB_STRINGIFY      (EDOM); // "33" (on my system at least)
+   ECB_STRINGIFY_EXPR (EDOM); // "33"
 
-A simple wrapper that expands to C<__attribute__((attrs))> on GCC, and to
-nothing on other compilers, so the effect is that only GCC sees these.
+   // now imagine we had a typo:
 
-Example: use the C<deprecated> attribute on a function.
+   ECB_STRINGIFY      (EDAM); // "EDAM"
+   ECB_STRINGIFY_EXPR (EDAM); // error: EDAM undefined
 
-  ecb_attribute((__deprecated__)) void
-  do_not_use_me_anymore (void);
+=back
+
+=head2 ATTRIBUTES
+
+A major part of libecb deals with additional attributes that can be
+assigned to functions, variables and sometimes even types - much like
+C<const> or C<volatile> in C. They are implemented using either GCC
+attributes or other compiler/language specific features. Attributes
+declarations must be put before the whole declaration:
+
+   ecb_const int mysqrt (int a);
+   ecb_unused int i;
+
+=over
 
 =item ecb_unused
 
 Marks a function or a variable as "unused", which simply suppresses a
-warning by GCC when it detects it as unused. This is useful when you e.g.
-declare a variable but do not always use it:
+warning by the compiler when it detects it as unused. This is useful when
+you e.g. declare a variable but do not always use it:
 
   {
-    int var ecb_unused;
+    ecb_unused int var;
 
     #ifdef SOMECONDITION
        var = ...;
@@ -103,9 +259,34 @@
     #endif
   }
 
+=item ecb_deprecated
+
+Similar to C<ecb_unused>, but marks a function, variable or type as
+deprecated. This makes some compilers warn when the type is used.
+
+=item ecb_deprecated_message (message)
+
+Same as C<ecb_deprecated>, but if possible, the specified diagnostic is
+used instead of a generic depreciation message when the object is being
+used.
+
+=item ecb_inline
+
+Expands either to (a compiler-specific equivalent of) C<static inline> or
+to just C<static>, if inline isn't supported. It should be used to declare
+functions that should be inlined, for code size or speed reasons.
+
+Example: inline this function, it surely will reduce codesize.
+
+   ecb_inline int
+   negmul (int a, int b)
+   {
+     return - (a * b);
+   }
+
 =item ecb_noinline
 
-Prevent a function from being inlined - it might be optimised away, but
+Prevents a function from being inlined - it might be optimised away, but
 not inlined into other functions. This is useful if you know your function
 is rarely called and large enough for inlining not to be helpful.
 
@@ -125,6 +306,27 @@
 In this case, the compiler would probably be smart enough to deduce it on
 its own, so this is mainly useful for declarations.
 
+=item ecb_restrict
+
+Expands to the C<restrict> keyword or equivalent on compilers that support
+them, and to nothing on others. Must be specified on a pointer type or
+an array index to indicate that the memory doesn't alias with any other
+restricted pointer in the same scope.
+
+Example: multiply a vector, and allow the compiler to parallelise the
+loop, because it knows it doesn't overwrite input values.
+
+   void
+   multiply (ecb_restrict float *src,
+             ecb_restrict float *dst,
+             int len, float factor)
+   {
+     int i;
+
+     for (i = 0; i < len; ++i)
+       dst [i] = src [i] * factor;
+   }
+
 =item ecb_const
 
 Declares that the function only depends on the values of its arguments,
@@ -186,7 +388,7 @@
 functions), this knowledge can be used in other ways, for example, the
 function will be optimised for size, as opposed to speed, and codepaths
 leading to calls to those functions can automatically be marked as if
-C<ecb_unlikely> had been used to reach them.
+C<ecb_expect_false> had been used to reach them.
 
 Good examples for such functions would be error reporting functions, or
 functions only called in exceptional or rare cases.
@@ -194,7 +396,7 @@
 =item ecb_artificial
 
 Declares the function as "artificial", in this case meaning that this
-function is not really mean to be a function, but more like an accessor
+function is not really meant to be a function, but more like an accessor
 - many methods in C++ classes are mere accessor functions, and having a
 crash reported in such a method, or single-stepping through them, is not
 usually so helpful, especially when it's inlined to just a few instructions.
@@ -222,9 +424,9 @@
 
 =head2 OPTIMISATION HINTS
 
-=over 4
+=over
 
-=item bool ecb_is_constant(expr)
+=item bool ecb_is_constant (expr)
 
 Returns true iff the expression can be deduced to be a compile-time
 constant, and false otherwise.
@@ -252,18 +454,18 @@
       : (n * (uint32_t)rndm16 ()) >> 16;
   }
 
-=item bool ecb_expect (expr, value)
+=item ecb_expect (expr, value)
 
 Evaluates C<expr> and returns it. In addition, it tells the compiler that
 the C<expr> evaluates to C<value> a lot, which can be used for static
 branch optimisations.
 
-Usually, you want to use the more intuitive C<ecb_likely> and
-C<ecb_unlikely> functions instead.
+Usually, you want to use the more intuitive C<ecb_expect_true> and
+C<ecb_expect_false> functions instead.
 
-=item bool ecb_likely (cond)
+=item bool ecb_expect_true (cond)
 
-=item bool ecb_unlikely (cond)
+=item bool ecb_expect_false (cond)
 
 These two functions expect a expression that is true or false and return
 C<1> or C<0>, respectively, so when used in the condition of an C<if> or
@@ -271,18 +473,18 @@
 
   /* these two do the same thing */
   if (some_condition) ...;
-  if (ecb_likely (some_condition)) ...;
+  if (ecb_expect_true (some_condition)) ...;
 
-However, by using C<ecb_likely>, you tell the compiler that the condition
-is likely to be true (and for C<ecb_unlikely>, that it is unlikely to be
-true).
+However, by using C<ecb_expect_true>, you tell the compiler that the
+condition is likely to be true (and for C<ecb_expect_false>, that it is
+unlikely to be true).
 
 For example, when you check for a null pointer and expect this to be a
-rare, exceptional, case, then use C<ecb_unlikely>:
+rare, exceptional, case, then use C<ecb_expect_false>:
 
   void my_free (void *ptr)
   {
-    if (ecb_unlikely (ptr == 0))
+    if (ecb_expect_false (ptr == 0))
       return;
   }
 
@@ -290,6 +492,12 @@
 tell the compiler what the hot path through a function is can increase
 performance considerably.
 
+You might know these functions under the name C<likely> and C<unlikely>
+- while these are common aliases, we find that the expect name is easier
+to understand when quickly skimming code. If you wish, you can use
+C<ecb_likely> instead of C<ecb_expect_true> and C<ecb_unlikely> instead of
+C<ecb_expect_false> - these are simply aliases.
+
 A very good example is in a function that reserves more space for some
 memory block (for example, inside an implementation of a string stream) -
 each time something is added, you have to check for a buffer overrun, but
@@ -299,26 +507,27 @@
   ecb_inline void
   reserve (int size)
   {
-    if (ecb_unlikely (current + size > end))
+    if (ecb_expect_false (current + size > end))
       real_reserve_method (size); /* presumably noinline */
   }
 
-=item bool ecb_assume (cond)
+=item ecb_assume (cond)
 
-Try to tell the compiler that some condition is true, even if it's not
-obvious.
+Tries to tell the compiler that some condition is true, even if it's not
+obvious. This is not a function, but a statement: it cannot be used in
+another expression.
 
 This can be used to teach the compiler about invariants or other
 conditions that might improve code generation, but which are impossible to
 deduce form the code itself.
 
-For example, the example reservation function from the C<ecb_unlikely>
+For example, the example reservation function from the C<ecb_expect_false>
 description could be written thus (only C<ecb_assume> was added):
 
   ecb_inline void
   reserve (int size)
   {
-    if (ecb_unlikely (current + size > end))
+    if (ecb_expect_false (current + size > end))
       real_reserve_method (size); /* presumably noinline */
 
     ecb_assume (current + size <= end);
@@ -333,13 +542,13 @@
 completely, as it knows that C<< current + 1 > end >> is false and the
 call will never be executed.
 
-=item bool ecb_unreachable ()
+=item ecb_unreachable ()
 
 This function does nothing itself, except tell the compiler that it will
 never be executed. Apart from suppressing a warning in some cases, this
-function can be used to implement C<ecb_assume> or similar functions.
+function can be used to implement C<ecb_assume> or similar functionality.
 
-=item bool ecb_prefetch (addr, rw, locality)
+=item ecb_prefetch (addr, rw, locality)
 
 Tells the compiler to try to prefetch memory at the given C<addr>ess
 for either reading (C<rw> = 0) or writing (C<rw> = 1). A C<locality> of
@@ -349,6 +558,9 @@
 need to be accessible (it could be a null pointer for example), but C<rw>
 and C<locality> must be compile-time constants.
 
+This is a statement, not a function: you cannot use it as part of an
+expression.
+
 An obvious way to use this is to prefetch some data far away, in a big
 array you loop over. This prefetches memory some 128 array elements later,
 in the hope that it will be ready when the CPU arrives at that location.
@@ -377,9 +589,9 @@
 
 =back
 
-=head2 BIT FIDDLING / BITSTUFFS
+=head2 BIT FIDDLING / BIT WIZARDRY
 
-=over 4
+=over
 
 =item bool ecb_big_endian ()
 
@@ -393,44 +605,634 @@
 
 =item int ecb_ctz32 (uint32_t x)
 
+=item int ecb_ctz64 (uint64_t x)
+
+=item int ecb_ctz (T x) [C++]
+
 Returns the index of the least significant bit set in C<x> (or
 equivalently the number of bits set to 0 before the least significant bit
-set), starting from 0. If C<x> is 0 the result is undefined. A common use
-case is to compute the integer binary logarithm, i.e.,  C<floor (log2
-(n))>. For example:
+set), starting from 0. If C<x> is 0 the result is undefined.
+
+For smaller types than C<uint32_t> you can safely use C<ecb_ctz32>.
+
+The overloaded C++ C<ecb_ctz> function supports C<uint8_t>, C<uint16_t>,
+C<uint32_t> and C<uint64_t> types.
+
+For example:
 
   ecb_ctz32 (3) = 0
   ecb_ctz32 (6) = 1
 
+=item bool ecb_is_pot32 (uint32_t x)
+
+=item bool ecb_is_pot64 (uint32_t x)
+
+=item bool ecb_is_pot (T x) [C++]
+
+Returns true iff C<x> is a power of two or C<x == 0>.
+
+For smaller types than C<uint32_t> you can safely use C<ecb_is_pot32>.
+
+The overloaded C++ C<ecb_is_pot> function supports C<uint8_t>, C<uint16_t>,
+C<uint32_t> and C<uint64_t> types.
+
+=item int ecb_ld32 (uint32_t x)
+
+=item int ecb_ld64 (uint64_t x)
+
+=item int ecb_ld64 (T x) [C++]
+
+Returns the index of the most significant bit set in C<x>, or the number
+of digits the number requires in binary (so that C<< 2**ld <= x <
+2**(ld+1) >>). If C<x> is 0 the result is undefined. A common use case is
+to compute the integer binary logarithm, i.e. C<floor (log2 (n))>, for
+example to see how many bits a certain number requires to be encoded.
+
+This function is similar to the "count leading zero bits" function, except
+that that one returns how many zero bits are "in front" of the number (in
+the given data type), while C<ecb_ld> returns how many bits the number
+itself requires.
+
+For smaller types than C<uint32_t> you can safely use C<ecb_ld32>.
+
+The overloaded C++ C<ecb_ld> function supports C<uint8_t>, C<uint16_t>,
+C<uint32_t> and C<uint64_t> types.
+
 =item int ecb_popcount32 (uint32_t x)
 
-Returns the number of bits set to 1 in C<x>. For example:
+=item int ecb_popcount64 (uint64_t x)
+
+=item int ecb_popcount (T x) [C++]
+
+Returns the number of bits set to 1 in C<x>.
+
+For smaller types than C<uint32_t> you can safely use C<ecb_popcount32>.
+
+The overloaded C++ C<ecb_popcount> function supports C<uint8_t>, C<uint16_t>,
+C<uint32_t> and C<uint64_t> types.
+
+For example:
 
   ecb_popcount32 (7) = 3
   ecb_popcount32 (255) = 8
 
+=item uint8_t  ecb_bitrev8  (uint8_t  x)
+
+=item uint16_t ecb_bitrev16 (uint16_t x)
+
+=item uint32_t ecb_bitrev32 (uint32_t x)
+
+=item T ecb_bitrev (T x) [C++]
+
+Reverses the bits in x, i.e. the MSB becomes the LSB, MSB-1 becomes LSB+1
+and so on.
+
+The overloaded C++ C<ecb_bitrev> function supports C<uint8_t>, C<uint16_t> and C<uint32_t> types.
+
+Example:
+
+   ecb_bitrev8 (0xa7) = 0xea
+   ecb_bitrev32 (0xffcc4411) = 0x882233ff
+
+=item T ecb_bitrev (T x) [C++]
+
+Overloaded C++ bitrev function.
+
+C<T> must be one of C<uint8_t>, C<uint16_t> or C<uint32_t>.
+
 =item uint32_t ecb_bswap16 (uint32_t x)
 
 =item uint32_t ecb_bswap32 (uint32_t x)
 
-These two functions return the value of the 16-bit (32-bit) value C<x>
-after reversing the order of bytes (0x11223344 becomes 0x44332211).
+=item uint64_t ecb_bswap64 (uint64_t x)
 
-=item uint32_t ecb_rotr32 (uint32_t x, unsigned int count)
+=item T ecb_bswap (T x)
+
+These functions return the value of the 16-bit (32-bit, 64-bit) value
+C<x> after reversing the order of bytes (0x11223344 becomes 0x44332211 in
+C<ecb_bswap32>).
+
+The overloaded C++ C<ecb_bswap> function supports C<uint8_t>, C<uint16_t>,
+C<uint32_t> and C<uint64_t> types.
+
+=item uint8_t  ecb_rotl8  (uint8_t  x, unsigned int count)
+
+=item uint16_t ecb_rotl16 (uint16_t x, unsigned int count)
 
 =item uint32_t ecb_rotl32 (uint32_t x, unsigned int count)
 
-These two functions return the value of C<x> after rotating all the bits
-by C<count> positions to the right or left respectively.
+=item uint64_t ecb_rotl64 (uint64_t x, unsigned int count)
+
+=item uint8_t  ecb_rotr8  (uint8_t  x, unsigned int count)
+
+=item uint16_t ecb_rotr16 (uint16_t x, unsigned int count)
+
+=item uint32_t ecb_rotr32 (uint32_t x, unsigned int count)
+
+=item uint64_t ecb_rotr64 (uint64_t x, unsigned int count)
+
+These two families of functions return the value of C<x> after rotating
+all the bits by C<count> positions to the right (C<ecb_rotr>) or left
+(C<ecb_rotl>). There are no restrictions on the value C<count>, i.e. both
+zero and values equal or larger than the word width work correctly. Also,
+notwithstanding C<count> being unsigned, negative numbers work and shift
+to the opposite direction.
+
+Current GCC/clang versions understand these functions and usually compile
+them to "optimal" code (e.g. a single C<rol> or a combination of C<shld>
+on x86).
+
+=item T ecb_rotl (T x, unsigned int count) [C++]
+
+=item T ecb_rotr (T x, unsigned int count) [C++]
+
+Overloaded C++ rotl/rotr functions.
+
+C<T> must be one of C<uint8_t>, C<uint16_t>, C<uint32_t> or C<uint64_t>.
+
+=back
+
+=head2 BIT MIXING, HASHING
+
+Sometimes you have an integer and want to distribute its bits well, for
+example, to use it as a hash in a hashtable. A common example is pointer
+values, which often only have a limited range (e.g. low and high bits are
+often zero).
+
+The following functions try to mix the bits to get a good bias-free
+distribution. They were mainly made for pointers, but the underlying
+integer functions are exposed as well.
+
+As an added benefit, the functions are reversible, so if you find it
+convenient to store only the hash value, you can recover the original
+pointer from the hash ("unmix"), as long as your pinters are 32 or 64 bit
+(if this isn't the case on your platform, drop us a note and we will add
+functions for other bit widths).
+
+The unmix functions are very slightly slower than the mix functions, so
+it is equally very slightly preferable to store the original values wehen
+convenient.
+
+The underlying algorithm if subject to change, so currently these
+functions are not suitable for persistent hash tables, as their result
+value can change between diferent versions of libecb.
+
+=over
+
+=item uintptr_t ecb_ptrmix (void *ptr)
+
+Mixes the bits of a pointer so the result is suitable for hash table
+lookups. In other words, this hashes the pointer value.
+
+=item uintptr_t ecb_ptrmix (T *ptr) [C++]
+
+Overload the C<ecb_ptrmix> function to work for any pointer in C++.
+
+=item void *ecb_ptrunmix (uintptr_t v)
+
+Unmix the hash value into the original pointer. This only works as long
+as the hash value is not truncated, i.e. you used C<uintptr_t> (or
+equivalent) throughout to store it.
+
+=item T *ecb_ptrunmix<T> (uintptr_t v) [C++]
+
+The somewhat less useful template version of C<ecb_ptrunmix> for
+C++. Example:
+
+   sometype *myptr;
+   uintptr_t hash = ecb_ptrmix (myptr);
+   sometype *orig = ecb_ptrunmix<sometype> (hash);
+
+=item uint32_t ecb_mix32 (uint32_t v)
+
+=item uint64_t ecb_mix64 (uint64_t v)
+
+Sometimes you don't have a pointer but an integer whose values are very
+badly distributed. In this case you cna sue these integer versions of the
+mixing function. No C++ template is provided currently.
+
+=item uint32_t ecb_unmix32 (uint32_t v)
+
+=item uint64_t ecb_unmix64 (uint64_t v)
+
+The reverse of the C<ecb_mix> functions - they take a mixed/hashed value
+and recover the original value.
+
+=back
+
+=head2 HOST ENDIANNESS CONVERSION
+
+=over
+
+=item uint_fast16_t ecb_be_u16_to_host (uint_fast16_t v)
+
+=item uint_fast32_t ecb_be_u32_to_host (uint_fast32_t v)
+
+=item uint_fast64_t ecb_be_u64_to_host (uint_fast64_t v)
+
+=item uint_fast16_t ecb_le_u16_to_host (uint_fast16_t v)
+
+=item uint_fast32_t ecb_le_u32_to_host (uint_fast32_t v)
+
+=item uint_fast64_t ecb_le_u64_to_host (uint_fast64_t v)
+
+Convert an unsigned 16, 32 or 64 bit value from big or little endian to host byte order.
+
+The naming convention is C<ecb_>(C<be>|C<le>)C<_u>C<16|32|64>C<_to_host>,
+where C<be> and C<le> stand for big endian and little endian, respectively.
+
+=item uint_fast16_t ecb_host_to_be_u16 (uint_fast16_t v)
+
+=item uint_fast32_t ecb_host_to_be_u32 (uint_fast32_t v)
+
+=item uint_fast64_t ecb_host_to_be_u64 (uint_fast64_t v)
+
+=item uint_fast16_t ecb_host_to_le_u16 (uint_fast16_t v)
+
+=item uint_fast32_t ecb_host_to_le_u32 (uint_fast32_t v)
+
+=item uint_fast64_t ecb_host_to_le_u64 (uint_fast64_t v)
+
+Like above, but converts I<from> host byte order to the specified
+endianness.
 
-Current GCC versions understand these functions and usually compile them
-to "optimal" code (e.g. a single C<roll> on x86).
+=back
+
+In C++ the following additional template functions are supported:
+
+=over
+
+=item T ecb_be_to_host (T v)
+
+=item T ecb_le_to_host (T v)
+
+=item T ecb_host_to_be (T v)
+
+=item T ecb_host_to_le (T v)
+
+=back
+
+These functions work like their C counterparts, above, but use templates,
+which make them useful in generic code.
+
+C<T> must be one of C<uint8_t>, C<uint16_t>, C<uint32_t> or C<uint64_t>
+(so unlike their C counterparts, there is a version for C<uint8_t>, which
+again can be useful in generic code).
+
+=head2 UNALIGNED LOAD/STORE
+
+These function load or store unaligned multi-byte values.
+
+=over
+
+=item uint_fast16_t ecb_peek_u16_u (const void *ptr)
+
+=item uint_fast32_t ecb_peek_u32_u (const void *ptr)
+
+=item uint_fast64_t ecb_peek_u64_u (const void *ptr)
+
+These functions load an unaligned, unsigned 16, 32 or 64 bit value from
+memory.
+
+=item uint_fast16_t ecb_peek_be_u16_u (const void *ptr)
+
+=item uint_fast32_t ecb_peek_be_u32_u (const void *ptr)
+
+=item uint_fast64_t ecb_peek_be_u64_u (const void *ptr)
+
+=item uint_fast16_t ecb_peek_le_u16_u (const void *ptr)
+
+=item uint_fast32_t ecb_peek_le_u32_u (const void *ptr)
+
+=item uint_fast64_t ecb_peek_le_u64_u (const void *ptr)
+
+Like above, but additionally convert from big endian (C<be>) or little
+endian (C<le>) byte order to host byte order while doing so.
+
+=item ecb_poke_u16_u (void *ptr, uint16_t v)
+
+=item ecb_poke_u32_u (void *ptr, uint32_t v)
+
+=item ecb_poke_u64_u (void *ptr, uint64_t v)
+
+These functions store an unaligned, unsigned 16, 32 or 64 bit value to
+memory.
+
+=item ecb_poke_be_u16_u (void *ptr, uint_fast16_t v)
+
+=item ecb_poke_be_u32_u (void *ptr, uint_fast32_t v)
+
+=item ecb_poke_be_u64_u (void *ptr, uint_fast64_t v)
+
+=item ecb_poke_le_u16_u (void *ptr, uint_fast16_t v)
+
+=item ecb_poke_le_u32_u (void *ptr, uint_fast32_t v)
+
+=item ecb_poke_le_u64_u (void *ptr, uint_fast64_t v)
+
+Like above, but additionally convert from host byte order to big endian
+(C<be>) or little endian (C<le>) byte order while doing so.
+
+=back
+
+In C++ the following additional template functions are supported:
+
+=over
+
+=item T ecb_peek<T>      (const void *ptr)
+
+=item T ecb_peek_be<T>   (const void *ptr)
+
+=item T ecb_peek_le<T>   (const void *ptr)
+
+=item T ecb_peek_u<T>    (const void *ptr)
+
+=item T ecb_peek_be_u<T> (const void *ptr)
+
+=item T ecb_peek_le_u<T> (const void *ptr)
+
+Similarly to their C counterparts, these functions load an unsigned 8, 16,
+32 or 64 bit value from memory, with optional conversion from big/little
+endian.
+
+Since the type cannot be deduced, it has to be specified explicitly, e.g.
+
+   uint_fast16_t v = ecb_peek<uint16_t> (ptr);
+
+C<T> must be one of C<uint8_t>, C<uint16_t>, C<uint32_t> or C<uint64_t>.
+
+Unlike their C counterparts, these functions support 8 bit quantities
+(C<uint8_t>) and also have an aligned version (without the C<_u> prefix),
+all of which hopefully makes them more useful in generic code.
+
+=item ecb_poke      (void *ptr, T v)
+
+=item ecb_poke_be   (void *ptr, T v)
+
+=item ecb_poke_le   (void *ptr, T v)
+
+=item ecb_poke_u    (void *ptr, T v)
+
+=item ecb_poke_be_u (void *ptr, T v)
+
+=item ecb_poke_le_u (void *ptr, T v)
+
+Again, similarly to their C counterparts, these functions store an
+unsigned 8, 16, 32 or z64 bit value to memory, with optional conversion to
+big/little endian.
+
+C<T> must be one of C<uint8_t>, C<uint16_t>, C<uint32_t> or C<uint64_t>.
+
+Unlike their C counterparts, these functions support 8 bit quantities
+(C<uint8_t>) and also have an aligned version (without the C<_u> prefix),
+all of which hopefully makes them more useful in generic code.
+
+=back
+
+=head2 FAST INTEGER TO STRING
+
+Libecb defines a set of very fast integer to decimal string (or integer
+to ascii, short C<i2a>) functions.  These work by converting the integer
+to a fixed point representation and then successively multiplying out
+the topmost digits. Unlike some other, also very fast, libraries, ecb's
+algorithm should be completely branchless per digit, and does not rely on
+the presence of special cpu functions (such as clz).
+
+There is a high level API that takes an C<int32_t>, C<uint32_t>,
+C<int64_t> or C<uint64_t> as argument, and a low-level API, which is
+harder to use but supports slightly more formatting options.
+
+=head3 HIGH LEVEL API
+
+The high level API consists of four functions, one each for C<int32_t>,
+C<uint32_t>, C<int64_t> and C<uint64_t>:
+
+Example:
+
+   char buf[ECB_I2A_MAX_DIGITS + 1];
+   char *end = ecb_i2a_i32 (buf, 17262);
+   *end = 0;
+   // buf now contains "17262"
+
+=over
+
+=item ECB_I2A_I32_DIGITS (=11)
+
+=item char *ecb_i2a_u32 (char *ptr, uint32_t value)
+
+Takes an C<uint32_t> I<value> and formats it as a decimal number starting
+at I<ptr>, using at most C<ECB_I2A_I32_DIGITS> characters. Returns a
+pointer to just after the generated string, where you would normally put
+the terminating C<0> character. This function outputs the minimum number
+of digits.
+
+=item ECB_I2A_U32_DIGITS (=10)
+
+=item char *ecb_i2a_i32 (char *ptr, int32_t value)
+
+Same as C<ecb_i2a_u32>, but formats a C<int32_t> value, including a minus
+sign if needed.
+
+=item ECB_I2A_I64_DIGITS (=20)
+
+=item char *ecb_i2a_u64 (char *ptr, uint64_t value)
+
+=item ECB_I2A_U64_DIGITS (=21)
+
+=item char *ecb_i2a_i64 (char *ptr, int64_t value)
+
+Similar to their 32 bit counterparts, these take a 64 bit argument.
+
+=item ECB_I2A_MAX_DIGITS (=21)
+
+Instead of using a type specific length macro, you can just use
+C<ECB_I2A_MAX_DIGITS>, which is good enough for any C<ecb_i2a> function.
+
+=back
+
+=head3 LOW-LEVEL API
+
+The functions above use a number of low-level APIs which have some strict
+limitations, but can be used as building blocks (studying C<ecb_i2a_i32>
+and related functions is recommended).
+
+There are three families of functions: functions that convert a number
+to a fixed number of digits with leading zeroes (C<ecb_i2a_0N>, C<0>
+for "leading zeroes"), functions that generate up to N digits, skipping
+leading zeroes (C<_N>), and functions that can generate more digits, but
+the leading digit has limited range (C<_xN>).
+
+None of the functions deal with negative numbers.
+
+Example: convert an IP address in an u32 into dotted-quad:
+
+   uint32_t ip = 0x0a000164; // 10.0.1.100
+   char ips[3 * 4 + 3 + 1];
+   char *ptr = ips;
+   ptr = ecb_i2a_3 (ptr,  ip >> 24        ); *ptr++ = '.';
+   ptr = ecb_i2a_3 (ptr, (ip >> 16) & 0xff); *ptr++ = '.';
+   ptr = ecb_i2a_3 (ptr, (ip >>  8) & 0xff); *ptr++ = '.';
+   ptr = ecb_i2a_3 (ptr,  ip        & 0xff); *ptr++ = 0;
+   printf ("ip: %s\n", ips); // prints "ip: 10.0.1.100"
+
+=over
+
+=item char *ecb_i2a_02  (char *ptr, uint32_t value) // 32 bit
+
+=item char *ecb_i2a_03  (char *ptr, uint32_t value) // 32 bit
+
+=item char *ecb_i2a_04  (char *ptr, uint32_t value) // 32 bit
+
+=item char *ecb_i2a_05  (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_06  (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_07  (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_08  (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_09  (char *ptr, uint32_t value) // 64 bit
+
+The C<< ecb_i2a_0I<N> > functions take an unsigned I<value> and convert
+them to exactly I<N> digits, returning a pointer to the first character
+after the digits. The I<value> must be in range. The functions marked with
+I<32 bit> do their calculations internally in 32 bit, the ones marked with
+I<64 bit> internally use 64 bit integers, which might be slow on 32 bit
+architectures (the high level API decides on 32 vs. 64 bit versions using
+C<ECB_64BIT_NATIVE>).
+
+=item char *ecb_i2a_2   (char *ptr, uint32_t value) // 32 bit
+
+=item char *ecb_i2a_3   (char *ptr, uint32_t value) // 32 bit
+
+=item char *ecb_i2a_4   (char *ptr, uint32_t value) // 32 bit
+
+=item char *ecb_i2a_5   (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_6   (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_7   (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_8   (char *ptr, uint32_t value) // 64 bit
+
+=item char *ecb_i2a_9   (char *ptr, uint32_t value) // 64 bit
+
+Similarly, the C<< ecb_i2a_I<N> > functions take an unsigned I<value>
+and convert them to at most I<N> digits, suppressing leading zeroes, and
+returning a pointer to the first character after the digits.
+
+=item ECB_I2A_MAX_X5 (=59074)
+
+=item char *ecb_i2a_x5  (char *ptr, uint32_t value) // 32 bit
+
+=item ECB_I2A_MAX_X10 (=2932500665)
+
+=item char *ecb_i2a_x10 (char *ptr, uint32_t value) // 64 bit
+
+The C<< ecb_i2a_xI<N> >> functions are similar to the C<< ecb_i2a_I<N> >
+functions, but they can generate one digit more, as long as the number
+is within range, which is given by the symbols C<ECB_I2A_MAX_X5> (almost
+16 bit range) and C<ECB_I2A_MAX_X10> (a bit more than 31 bit range),
+respectively.
+
+For example, the digit part of a 32 bit signed integer just fits into the
+C<ECB_I2A_MAX_X10> range, so while C<ecb_i2a_x10> cannot convert a 10
+digit number, it can convert all 32 bit signed numbers. Sadly, it's not
+good enough for 32 bit unsigned numbers.
+
+=back
+
+=head2 FLOATING POINT FIDDLING
+
+=over
+
+=item ECB_INFINITY [-UECB_NO_LIBM]
+
+Evaluates to positive infinity if supported by the platform, otherwise to
+a truly huge number.
+
+=item ECB_NAN [-UECB_NO_LIBM]
+
+Evaluates to a quiet NAN if supported by the platform, otherwise to
+C<ECB_INFINITY>.
+
+=item float ecb_ldexpf (float x, int exp) [-UECB_NO_LIBM]
+
+Same as C<ldexpf>, but always available.
+
+=item uint32_t ecb_float_to_binary16  (float  x) [-UECB_NO_LIBM]
+
+=item uint32_t ecb_float_to_binary32  (float  x) [-UECB_NO_LIBM]
+
+=item uint64_t ecb_double_to_binary64 (double x) [-UECB_NO_LIBM]
+
+These functions each take an argument in the native C<float> or C<double>
+type and return the IEEE 754 bit representation of it (binary16/half,
+binary32/single or binary64/double precision).
+
+The bit representation is just as IEEE 754 defines it, i.e. the sign bit
+will be the most significant bit, followed by exponent and mantissa.
+
+This function should work even when the native floating point format isn't
+IEEE compliant, of course at a speed and code size penalty, and of course
+also within reasonable limits (it tries to convert NaNs, infinities and
+denormals, but will likely convert negative zero to positive zero).
+
+On all modern platforms (where C<ECB_STDFP> is true), the compiler should
+be able to completely optimise away the 32 and 64 bit functions.
+
+These functions can be helpful when serialising floats to the network - you
+can serialise the return value like a normal uint16_t/uint32_t/uint64_t.
+
+Another use for these functions is to manipulate floating point values
+directly.
+
+Silly example: toggle the sign bit of a float.
+
+   /* On gcc-4.7 on amd64, */
+   /* this results in a single add instruction to toggle the bit, and 4 extra */
+   /* instructions to move the float value to an integer register and back. */
+
+   x = ecb_binary32_to_float (ecb_float_to_binary32 (x) ^ 0x80000000U)
+
+=item float  ecb_binary16_to_float  (uint16_t x) [-UECB_NO_LIBM]
+
+=item float  ecb_binary32_to_float  (uint32_t x) [-UECB_NO_LIBM]
+
+=item double ecb_binary64_to_double (uint64_t x) [-UECB_NO_LIBM]
+
+The reverse operation of the previous function - takes the bit
+representation of an IEEE binary16, binary32 or binary64 number (half,
+single or double precision) and converts it to the native C<float> or
+C<double> format.
+
+This function should work even when the native floating point format isn't
+IEEE compliant, of course at a speed and code size penalty, and of course
+also within reasonable limits (it tries to convert normals and denormals,
+and might be lucky for infinities, and with extraordinary luck, also for
+negative zero).
+
+On all modern platforms (where C<ECB_STDFP> is true), the compiler should
+be able to optimise away this function completely.
+
+=item uint16_t ecb_binary32_to_binary16 (uint32_t x)
+
+=item uint32_t ecb_binary16_to_binary32 (uint16_t x)
+
+Convert a IEEE binary32/single precision to binary16/half format, and vice
+versa, handling all details (round-to-nearest-even, subnormals, infinity
+and NaNs) correctly.
+
+These are functions are available under C<-DECB_NO_LIBM>, since
+they do not rely on the platform floating point format. The
+C<ecb_float_to_binary16> and C<ecb_binary16_to_float> functions are
+usually what you want.
 
 =back
 
 =head2 ARITHMETIC
 
-=over 4
+=over
 
 =item x = ecb_mod (m, n)
 
@@ -444,11 +1246,11 @@
 
 C<n> must be strictly positive (i.e. C<< >= 1 >>), while C<m> must be
 negatable, that is, both C<m> and C<-m> must be representable in its
-type (this typically includes the minimum signed integer value, the same
+type (this typically excludes the minimum signed integer value, the same
 limitation as for C</> and C<%> in C).
 
-Current GCC versions compile this into an efficient branchless sequence on
-many systems.
+Current GCC/clang versions compile this into an efficient branchless
+sequence on almost all CPUs.
 
 For example, when you want to rotate forward through the members of an
 array for increasing C<m> (which might be negative), then you should use
@@ -458,11 +1260,20 @@
    for (m = -100; m <= 100; ++m)
      int elem = myarray [ecb_mod (m, ecb_array_length (myarray))];
 
+=item x = ecb_div_rd (val, div)
+
+=item x = ecb_div_ru (val, div)
+
+Returns C<val> divided by C<div> rounded down or up, respectively.
+C<val> and C<div> must have integer types and C<div> must be strictly
+positive. Note that these functions are implemented with macros in C
+and with function templates in C++.
+
 =back
 
 =head2 UTILITY
 
-=over 4
+=over
 
 =item element_count = ecb_array_length (name)
 
@@ -476,4 +1287,53 @@
 
 =back
 
+=head2 SYMBOLS GOVERNING COMPILATION OF ECB.H ITSELF
+
+These symbols need to be defined before including F<ecb.h> the first time.
+
+=over
+
+=item ECB_NO_THREADS
+
+If F<ecb.h> is never used from multiple threads, then this symbol can
+be defined, in which case memory fences (and similar constructs) are
+completely removed, leading to more efficient code and fewer dependencies.
+
+Setting this symbol to a true value implies C<ECB_NO_SMP>.
+
+=item ECB_NO_SMP
+
+The weaker version of C<ECB_NO_THREADS> - if F<ecb.h> is used from
+multiple threads, but never concurrently (e.g. if the system the program
+runs on has only a single CPU with a single core, no hyperthreading and so
+on), then this symbol can be defined, leading to more efficient code and
+fewer dependencies.
+
+=item ECB_NO_LIBM
+
+When defined to C<1>, do not export any functions that might introduce
+dependencies on the math library (usually called F<-lm>) - these are
+marked with [-UECB_NO_LIBM].
+
+=back
+
+=head1 UNDOCUMENTED FUNCTIONALITY
+
+F<ecb.h> is full of undocumented functionality as well, some of which is
+intended to be internal-use only, some of which we forgot to document, and
+some of which we hide because we are not sure we will keep the interface
+stable.
+
+While you are welcome to rummage around and use whatever you find useful
+(we can't stop you), keep in mind that we will change undocumented
+functionality in incompatible ways without thinking twice, while we are
+considerably more conservative with documented things.
+
+=head1 AUTHORS
+
+C<libecb> is designed and maintained by:
+
+   Emanuele Giaquinta <e.giaquinta@glauco.it>
+   Marc Alexander Lehmann <schmorp@schmorp.de>
+