The essence of the trick: make the sign bit the least significant bit. Then all the leading bytes for integers of small magnitude remain zeros (not 0xFF like in two's-complement integers), and compression becomes easy.
This is an amazing explanation, and it immediately makes the formula `(val<<1) ^ (val>>31)` intuitive, compared to trying to interpret it as the piecewise mapping x -> 2x for x >= 0 and x -> 2|x| - 1 for x < 0.
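For concreteness, here's a minimal sketch of the formula in C++ (32-bit case); the caveats about shifting signed values raised further down the thread apply to this naive version:

```cpp
#include <cstdint>

// Zigzag round trip, exactly as the formula states it.
// Mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
int32_t zigzag_encode(int32_t val) { return (val << 1) ^ (val >> 31); }
int32_t zigzag_decode(int32_t enc) { return (enc >> 1) ^ -(enc & 1); }
```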
I used this trick for delta compression in an ArrayBuffer-backed data exchange format. Since it was getting sent from a web server, we got gzip for free, but we could still shrink data sizes by filtering the data to make it easier to compress. Delta-compressing correlated data points yields a lot of small but randomly distributed positive/negative values, so by zigzag encoding them but still saving them as 32-bit values, you end up with a lot of runs of 0 bytes, which get compressed very efficiently and only require a quick pass through the data to restore the original values.
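The filtering pass might look roughly like this (a C++ sketch of the idea; the names and the 32-bit signed samples are my assumptions, and the arithmetic is kept unsigned so wraparound is well defined):

```cpp
#include <cstdint>
#include <vector>

// Delta-encode, then zigzag, keeping full 32-bit slots so that small
// deltas turn into long runs of zero bytes for gzip to squeeze.
std::vector<uint32_t> filter(const std::vector<int32_t>& samples) {
    std::vector<uint32_t> out;
    out.reserve(samples.size());
    int32_t prev = 0;
    for (int32_t s : samples) {
        // wrapping delta in unsigned arithmetic
        uint32_t d = static_cast<uint32_t>(s) - static_cast<uint32_t>(prev);
        out.push_back((d << 1) ^ (0u - (d >> 31)));  // zigzag
        prev = s;
    }
    return out;
}
```

Restoring the data is the inverse: un-zigzag each value, then take a running sum.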
Came here to say the same thing. When I first saw the protobuf zigzag encoding, I saw the same pattern. At first, I thought I must be wrong because they aren't describing it this way. Nope, it really is that simple.
In earlier computer systems (pre-1980s), this kind of variety was more common. In my career, I have only been exposed to two's-complement or float formats. I would bet that this is true for many other engineers as well.
I'm glad the protobuf folks used it. I'm also glad Dr. Lemire wrote an article about it. It is another clever trick to have in the toolbox.
The `fast_decode` as written is undefined behavior for some inputs (INT_MAX, for example) because of the signed integer overflow in `2*x`. This can be fixed by casting to an unsigned int first.
Also a warning, right shifting a negative number gives the expected result (arithmetic right shift) in C++20, but is still implementation defined even in the C23 draft spec. Likely you're good, but the spec doesn't guarantee it.
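A version that dodges both problems by doing all the arithmetic on unsigned values might look like this (a sketch, not necessarily the article's exact functions):

```cpp
#include <cstdint>

// All operations are on uint32_t, where wraparound is well defined,
// so there's no signed overflow and no right shift of a negative value.
uint32_t zigzag_encode_safe(int32_t v) {
    uint32_t u = static_cast<uint32_t>(v);
    return (u << 1) ^ (0u - (u >> 31));  // (u >> 31) is 0 or 1; subtracting gives a 0 or all-ones mask
}

int32_t zigzag_decode_safe(uint32_t z) {
    return static_cast<int32_t>((z >> 1) ^ (0u - (z & 1)));
}
```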
It's pretty cool how a simple encoding scheme can have profound effects on compression. This is a big part of what makes DCT-style block compression work as well as it does. You rig the game such that all of your higher frequency components wind up at the end - ideally as zeroes after quantization. So, a simple RLE scheme can do a lot more work than it otherwise could.
The idea behind this is to scan diagonally by frequency so that big numbers are grouped together at the start. You can choose to zero out some of the coefficients near the end, since those only contain high-frequency components that are harder to notice. RLE can then work on the trailing zeroes, or other methods can be used instead.
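For reference, the scan order itself is easy to generate; here's a sketch of the classic 8x8 anti-diagonal walk used by JPEG-style codecs:

```cpp
#include <algorithm>
#include <array>

constexpr int N = 8;

// Lists block indices anti-diagonal by anti-diagonal, alternating
// direction, so low-frequency coefficients come first and the
// mostly-zero high-frequency ones trail at the end for RLE.
std::array<int, N * N> zigzag_scan_order() {
    std::array<int, N * N> order{};
    int idx = 0;
    for (int s = 0; s < 2 * N - 1; ++s) {
        if (s % 2 == 0) {  // walk up-right along the diagonal
            for (int r = std::min(s, N - 1); r >= 0 && s - r < N; --r)
                order[idx++] = r * N + (s - r);
        } else {           // walk down-left along the diagonal
            for (int c = std::min(s, N - 1); c >= 0 && s - c < N; --c)
                order[idx++] = (s - c) * N + c;
        }
    }
    return order;
}
```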
Reminds me of this video on using "-2" as a base for numbers (it can represent all positive and negative integers without a sign bit): https://www.youtube.com/watch?v=GWA_NraxOUw
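A quick sketch of the conversion, since it's fun how little code it takes (this assumes C++'s truncating integer division):

```cpp
#include <string>

// Convert an integer to base -2 ("negabinary"): every integer,
// positive or negative, gets a digit string with no sign bit.
std::string to_negabinary(long long n) {
    if (n == 0) return "0";
    std::string digits;
    while (n != 0) {
        long long r = n % -2;            // remainder in {-1, 0, 1}
        n /= -2;
        if (r < 0) { r += 2; n += 1; }   // normalize remainder into {0, 1}
        digits.insert(digits.begin(), char('0' + r));
    }
    return digits;  // e.g. 2 -> "110", -2 -> "10", 3 -> "111"
}
```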
Protobuf was the first major project I was aware of that uses zigzag encoding for integers. They then encode those integers using the typical 7-bit encoding scheme where the MSB indicates there's another byte following. I'm sure they weren't the first by any means, though.
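That scheme, sketched (this is the usual LEB128-style varint, not protobuf's exact source):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each byte carries 7 payload bits; a set high bit means more bytes follow.
void varint_encode(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v | 0x80));  // low 7 bits + continuation flag
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));             // final byte, high bit clear
}

uint64_t varint_decode(const uint8_t* p, size_t* consumed) {
    uint64_t v = 0;
    int shift = 0;
    size_t i = 0;
    for (; p[i] & 0x80; ++i, shift += 7)
        v |= static_cast<uint64_t>(p[i] & 0x7F) << shift;
    v |= static_cast<uint64_t>(p[i]) << shift;
    *consumed = i + 1;
    return v;
}
```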
I'm currently using this in a binary format for serializing dynamic JSON-like values that I invented and am implementing in Rust. I will release it as open source sometime next year.
Yeah, I don't like that method for encoding variable-length integers.
I usually use something I came up with years ago where the low 2 or 3 bits are the length in bytes minus 1. It's more compact and can be serialized/deserialized without branches in the common case. It uses the count-leading-zeroes instruction for encoding. I'm sure somebody else has come up with the idea as well.
The downside is you can only represent 30 bits for u32 and 61 for u64. In many cases you know you don't have values that large, so it's fine.
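Something in the spirit of what's described might look like this (my names and my guesses at the details, not any particular library's; assumes a little-endian machine and C++20's std::countl_zero):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Low 2 bits of the first byte hold (length in bytes - 1); the value
// occupies the remaining bits, so a u32 tops out at 30 bits.
size_t prefix_varint_encode_u32(uint32_t v, uint8_t out[4]) {
    int bits = 32 - std::countl_zero(v | 1u);            // v == 0 still needs 1 bit
    size_t len = static_cast<size_t>(bits + 2 + 7) / 8;  // round up, leaving room for the 2 tag bits
    uint32_t tagged = (v << 2) | static_cast<uint32_t>(len - 1);
    std::memcpy(out, &tagged, len);                      // branch-free store
    return len;
}

size_t prefix_varint_decode_u32(const uint8_t in[4], uint32_t* v) {
    size_t len = static_cast<size_t>(in[0] & 0x3) + 1;
    uint32_t tagged = 0;
    std::memcpy(&tagged, in, len);
    *v = tagged >> 2;
    return len;
}
```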
If you're encoding arrays of integers, Lemire has a library that uses vector instructions to encode them in the optimal number of bits. That's even better, but the use cases are far more restricted.
They use a variant of this in most modern SAT solvers for the literal encoding, by the way, i.e. to encode negation of variables, which are traditionally represented with negative integers. Mostly because they then use the encoded literal as an index into an array. Historically this also had performance reasons, if I remember correctly. I feel it has no benefit anymore and people simply got used to it (and never touch that part of the code anyway). But I might be wrong; I haven't benchmarked any alternatives yet, though it is on my todo list (with low priority).
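For reference, the encoding in question looks roughly like this (along the lines of MiniSat's literal representation):

```cpp
// Variable v with sign s becomes 2*v + s: a literal and its negation
// are adjacent, and literals can index arrays directly without any
// negative values.
using Var = int;
using Lit = int;

inline Lit make_lit(Var v, bool negated) { return (v << 1) | (negated ? 1 : 0); }
inline Lit negate_lit(Lit l) { return l ^ 1; }  // flip polarity by toggling the low bit
inline Var lit_var(Lit l)   { return l >> 1; }
inline bool lit_sign(Lit l) { return l & 1; }
```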
A nice bijection from integers to natural numbers. Mappings exist for rationals -> natural numbers and other countable sets, but may be less practical.
That either doesn't map onto the rationals (it makes 1/1 and 2/2 two different numbers) or isn't a bijection (0x0000000100000001 and 0x0000000200000002 both map back to 1), and it doesn't work on “the integers”.
Most people just care about having a lossless round-trip conversion, from the source representation to a storage representation and then back to the source representation, not about having every possible bit sequence in the storage representation being valid. Univocally mapping the rationals onto a set of contiguous naturals is extremely tricky, computationally.
> Univocally mapping the rationals onto a set of contiguous naturals is extremely tricky, computationally.
It is not that difficult. https://en.wikipedia.org/wiki/Calkin–Wilf_tree#Stern's_diato... defines fusc(n), a function that's fairly easy to compute (it is recursively defined, but only requires a recursion depth equal to the number of bits in the binary representation of n), with fusc(n)/fusc(n+1) being a mapping from the natural numbers to all positive rational numbers.
“The parent of any rational number can be determined after placing the number into simplest terms, as a fraction a/b for which the greatest common divisor of a and b is 1. If a/b < 1, the parent of a/b is a/(b − a); if a/b > 1, the parent of a/b is (a − b)/b.”
Repeatedly applying that procedure, you eventually reach 1, and then you have found the path from 1 down to your starting number a/b. Once you have that, finding the n for which fusc(n)/fusc(n+1) = a/b is trivial.
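fusc itself is a few lines (a sketch, using the standard recurrences fusc(0)=0, fusc(1)=1, fusc(2n)=fusc(n), fusc(2n+1)=fusc(n)+fusc(n+1)):

```cpp
#include <cstdint>

// Stern's diatomic sequence. fusc(n)/fusc(n+1) enumerates every
// positive rational exactly once as n runs over the naturals.
// Recursion depth is the bit length of n.
uint64_t fusc(uint64_t n) {
    if (n < 2) return n;                    // fusc(0) = 0, fusc(1) = 1
    if (n % 2 == 0) return fusc(n / 2);     // fusc(2k)   = fusc(k)
    return fusc(n / 2) + fusc(n / 2 + 1);   // fusc(2k+1) = fusc(k) + fusc(k+1)
}
```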
2/2 is not irreducible, so it's not an issue. But yes, it does create some waste in the encoding. It's not obvious how to fix that, as coprimality is hard to encode for.
It's cool that just the name is enough to tell you what the encoding is. He didn't mention it in the post, but it can be visualized as starting from 0, then zigzagging right and left to 1, -1, 2, -2, ... and labelling each point with consecutive natural numbers.