2015-11-15 Variable byte encoding of unsigned integers.
Order can be preserved with variable byte integer encoding. Different implementations indicate optimisations in JavaScript.
A common way to encode unsigned integer is to have a sequence of bytes, with 7-bit data, and 1-bit length encoding per byte. The naive implementation usually does not preserve the integer order into the lexicographical order of the encoded bytes, but is possible.
Encoding:
bytes.append(number % 128)
number = number / 128
while number > 0:
number = number – 1
bytes.append(128 + (number % 128))
number = number / 128
bytes.reverse()
Decoding:
number = 0
while true:
byte = getNextByte()
number = number * 128 + (byte % 128)
if byte & 128:
number = number + 1
else:
break
The key elements to ensure that the lexicographical order of the binary strings preserve the order of the integers are:
- The binary string for an integer is unique by adding/subtracting 1 to the number. This both adds slightly larger rang, and guarantees we do not have the case where 1000000 00000001 and 00000001 would both decode to 1, – in contrast to some variable byte integer encodings.
- When being order-preserving, it is given that the most significant bit must be the one used for byte length encoding, and it must be 1 if there are more bytes to read and zero otherwise.
This is useful if you have a lexicographically ordered binary keys, and want to encode arbitrary unsigned integers where you do not know the size beforehand. The concrete case I was thinking about was efficient encoding of timestamps appended into a sorted data structure, without hitting the year 2038 bug.
I implemented a couple of versions in JavaScript with the following conclusions on v8:
- Implementation using typed arrays are 10-30x faster than using arrays (when optimised correctly).
- Optimising for only encoding 32bit integers, instead of up to 53 integer bits from a double, can yield about 30% performance.
- You need to benchmark each version individually, as the JIT otherwise can make the results incomparable, (even if you warm it up a bit).
- In JavaScript on V8 on a slow laptop, with above optimisations you can encode 1MB/s version, versus 40KB/s with the naïve version.
- GIST available here: https://gist.github.com/rasmuserik/47ce8e20e88f8443a79d