Making all your integers positive with zigzag encoding (lemire.me)
109 points by jjgreen on Nov 25, 2022 | 40 comments


The essence of the trick: make the sign bit the least significant bit. Then all the leading bytes for integers of small magnitude remain zeros (not 0xFF as in two's-complement integers), and compression becomes easy.
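As a concrete sketch of that mapping and its inverse (assuming 32-bit two's complement and an arithmetic right shift for the signed `>>`):

```c
#include <stdint.h>

/* Zigzag encode: move the sign bit to the least significant position.
   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   Small magnitudes keep their leading bytes zero, which compresses well. */
static uint32_t zigzag_encode(int32_t v) {
    /* v >> 31 assumes an arithmetic right shift (all ones for negative v) */
    return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31);
}

static int32_t zigzag_decode(uint32_t u) {
    /* 0u - (u & 1) is 0xFFFFFFFF when the low (sign) bit is set, else 0 */
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```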


This is an amazing explanation, and it immediately makes the formula `(val<<1) ^ (val>>31)` intuitive, compared to trying to interpret it as mapping non-negative x -> 2x and negative x -> 2|x| - 1.


I used this trick for delta compression in an ArrayBuffer-backed data exchange format. Since it was sent from a web server, we got gzip for free, but we could still shrink data sizes by filtering the data in such a way as to make it easier to compress. Delta-compressing correlated data points yields a lot of small but randomly distributed positive/negative values. By zigzag-encoding them but still saving them as 32-bit values, you end up with a lot of runs of 0 bytes, which compress very efficiently and only require a quick pass through the data to restore the original values.
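A hypothetical sketch of that delta + zigzag filter (the function names are made up; the output would then be handed to a general-purpose compressor such as gzip):

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t zz(int32_t v)   { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
static int32_t  unzz(uint32_t u){ return (int32_t)((u >> 1) ^ (0u - (u & 1))); }

/* Correlated samples become small deltas; zigzag keeps them non-negative,
   so the three high bytes of each stored u32 are almost always zero.
   Assumes each delta fits in an int32_t. */
static void delta_zigzag(const int32_t *in, uint32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = zz(in[i] - prev);
        prev = in[i];
    }
}

/* The quick restoration pass: undo zigzag, then prefix-sum the deltas. */
static void undelta_zigzag(const uint32_t *in, int32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += unzz(in[i]);
        out[i] = prev;
    }
}
```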


Came here to say the same thing. When I first saw the protobuf zigzag encoding, I saw the same pattern. At first, I thought I must be wrong because they aren't describing it this way. Nope, it really is that simple.

In earlier computer systems (pre 1980s), this kind of variety was more common. In my career, I have only been exposed to twos-complement or float formats. I would bet that this is true for many other engineers as well.

I'm glad the protobuf folks used it. I'm also glad Dr. Lemire wrote an article about it. It is another clever trick to have in the toolbox.


In other words, a left rotation by 1 bit.


No. Starting from the typical two's-complement number encoding used on all modern computers, it is not a simple rotation.

Think about the encoding of minus 1 for an example that differs...
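For minus 1 specifically: -1 in two's complement is 0xFFFFFFFF, which a left rotation by 1 leaves unchanged, while zigzag maps it to 1.

```c
#include <stdint.h>

/* Left rotation by 1 bit on a 32-bit word. */
static uint32_t rotl1(uint32_t x) { return (x << 1) | (x >> 31); }

/* Zigzag encoding, for comparison (assumes arithmetic right shift). */
static uint32_t zigzag(int32_t v) { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
```

The two agree on non-negative inputs (both reduce to `x << 1`), but diverge on every negative one.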


Argh, you're right. Classic mistake of tunnel-visioning on a subset of the domain.


The `fast_decode` as written is undefined behavior for some inputs (INT_MAX for example) because of the integer overflow in `2*x`. This can be fixed by casting to an unsigned int first.

Also, a warning: right-shifting a negative number gives the expected result (arithmetic right shift) in C++20, but it is still implementation-defined even in the C23 draft spec. Likely you're good, but the spec doesn't guarantee it.
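Given those two caveats, a sketch of an encode/decode pair that avoids both: all arithmetic is done on `uint32_t` (unsigned overflow wraps, so no UB), and the only right shifts are on unsigned values (no implementation-defined behavior).

```c
#include <stdint.h>

static uint32_t safe_encode(int32_t v) {
    uint32_t u = (uint32_t)v;
    /* 0u - (u >> 31) is 0 for non-negative v and 0xFFFFFFFF for negative v:
       the same mask an arithmetic shift would produce, built portably. */
    return (u << 1) ^ (0u - (u >> 31));
}

static int32_t safe_decode(uint32_t u) {
    /* (u >> 1) is fine: u is unsigned, so this is a logical shift. */
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```

The final cast back to `int32_t` is value-preserving on two's-complement targets, i.e. every machine you are realistically compiling for.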


It's pretty cool how a simple encoding scheme can have profound effects on compression. This is a big part of what makes DCT-style block compression work as well as it does. You rig the game such that all of your higher frequency components wind up at the end - ideally as zeroes after quantization. So, a simple RLE scheme can do a lot more work than it otherwise could.


Sorry, are you implying this is how DCT works, or are you making a comparison?

I've just read up on DCT and don't really have a clue, but it doesn't seem applicable?

That being so, I'm not sure how RLE at the bit level would help. Surely encoding run lengths of 5-ish bits isn't going to compress much of anything.


The idea behind this is to scan diagonally by frequency so that big numbers are grouped together at the start. You can choose to zero out some of the bits near the end since those only contain high frequency components that are harder to notice. RLE then has the ability to work on the trailing zeroes, or other methods can be used instead.
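The diagonal scan can be sketched as follows (a generic version of the JPEG-style zigzag scan over an N x N coefficient block; not any particular codec's code):

```c
/* Emit the zigzag scan order for an N x N block as flat indices:
   walk the anti-diagonals (constant i + j), alternating direction,
   so low-frequency coefficients come first and the high-frequency
   ones (mostly zero after quantization) trail at the end. */
#define N 8
static void zigzag_scan_order(int order[N * N]) {
    int k = 0;
    for (int s = 0; s <= 2 * (N - 1); s++) {
        if (s % 2 == 0) {
            /* even diagonal: walk up-right (row index decreasing) */
            for (int i = (s < N ? s : N - 1); i >= 0 && s - i < N; i--)
                order[k++] = i * N + (s - i);
        } else {
            /* odd diagonal: walk down-left (column index decreasing) */
            for (int j = (s < N ? s : N - 1); j >= 0 && s - j < N; j--)
                order[k++] = (s - j) * N + j;
        }
    }
}
```

For N = 8 this reproduces the familiar JPEG order 0, 1, 8, 16, 9, 2, 3, 10, ...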


Zigzag is a phase after DCT+quantization where you iterate through the resulting matrix in a particular order. Nothing to do with the DCT itself.

https://commons.wikimedia.org/wiki/File:JPEG_example_zigzag....


reminds me of this video on using "-2" as a base for numbers (can represent all positive and negative integers without a sign bit) https://www.youtube.com/watch?v=GWA_NraxOUw


Protobuf was the first major project I was aware of that uses zigzag encoding for integers. They then encode those integers using the typical 7 bit encoding scheme where the msb indicates there's another byte following. I'm sure they weren't the first by any means though.
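A minimal sketch of that 7-bit continuation scheme (LEB128-style, as protobuf uses for its varints):

```c
#include <stdint.h>
#include <stddef.h>

/* 7 payload bits per byte; the msb is set on every byte except the last.
   Returns the number of bytes written (at most 5 for a uint32_t). */
static size_t varint_encode(uint32_t v, uint8_t *out) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Returns the number of bytes consumed. */
static size_t varint_decode(const uint8_t *in, uint32_t *out) {
    uint32_t v = 0;
    size_t n = 0;
    int shift = 0;
    do {
        v |= (uint32_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *out = v;
    return n;
}
```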

I'm currently using this in a binary format for serializing dynamic JSON-like values that I invented and am implementing in Rust. I will release it as open source sometime next year.


This format makes it impossible to write fast decoders, as discussed elsewhere on the same blog as TFA: https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-...

So it would be sweet to instead use a format for which it is possible to write fast decoders.


Yeah, I don't like that method for encoding variable length integers.

I usually use something I came up with years ago where the low 2 or 3 bits are the length - 1 in bytes. It's more compact and can be serialized/deserialized without branches in the common case. It uses the count leading zeroes instruction for encoding. I'm sure somebody else has come up with the idea as well.

The downside is you can only represent 30 bits for u32 and 61 for u64. In many cases you know you don't have values that large, so it's fine.
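A hypothetical sketch of this scheme (my own reconstruction from the description, not the commenter's code): the low 2 bits of the first byte carry length - 1 in bytes, stored little-endian, so the decoder learns the length without scanning. Uses GCC/Clang's `__builtin_clz`.

```c
#include <stdint.h>
#include <stddef.h>

/* Precondition: v < 2^30 (only 30 payload bits are available for u32). */
static size_t prefix_encode(uint32_t v, uint8_t out[4]) {
    /* Count the bits needed for the payload plus the 2-bit length tag;
       the | 1 keeps __builtin_clz away from its undefined zero input. */
    int bits = 32 - __builtin_clz((v << 2) | 1);
    size_t len = (size_t)((bits + 7) / 8);
    uint32_t word = (v << 2) | (uint32_t)(len - 1);
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(word >> (8 * i));
    return len;
}

static size_t prefix_decode(const uint8_t *in, uint32_t *v) {
    size_t len = (size_t)(in[0] & 3) + 1;   /* length known from one byte */
    uint32_t word = 0;
    for (size_t i = 0; i < len; i++)
        word |= (uint32_t)in[i] << (8 * i);
    *v = word >> 2;
    return len;
}
```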

If you're encoding arrays of integers, Lemire has a library that uses vector instructions to encode them in the optimal number of bits. That's even better, but the use cases are far more restricted.


I first ran into this encoding scheme in Avro, which you could also describe as a format for dynamic JSON-like values :)


Does your protocol provide any advantages over bincode?


Yes, but not as a general-purpose encoding. It's more fit for its intended purpose.


Used in the [postcard](https://docs.rs/postcard/latest/postcard/) crate for encoding variably sized signed integers.


I first encountered this in the FLAC source[1]. They do the bit twiddling a bit differently than in this article:

    uval = (val<<1) ^ (val>>31);
Variations of this have probably been used countless times in other libraries.

[1] https://github.com/LordJZ/libflac/blob/master/src/libFLAC/bi...


Author of postcard here, I don't remember exactly where I got my bit twiddling example from, lemme see if I have notes from that.

postcard's zigzag encoding matches phoboslab's pseudocode.

Edit, not totally sure, but this wiki page rings a bell, and is probably where I got my impl from: https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...

Edit 2: I also explain why I do this (it compresses better) in my wire format specification: https://postcard.jamesmunns.com/wire-format.html#signed-inte...


A special case of Hopcroft and Ullman's pairing functions!

https://en.wikipedia.org/wiki/Pairing_function


Nice! The same idea is the essence of Hilbert's paradox of the Grand Hotel (https://en.wikipedia.org/wiki/Hilbert%27s_paradox_of_the_Gra...), especially the case of infinitely many new guests.


Most modern SAT solvers use a variant of that for the literal encoding, by the way, i.e. to encode negation of variables, which are traditionally represented with negative integers. Mostly because they then use the encoded literal as an index into an array. Historically this also had performance reasons, if I remember correctly. I feel it has no benefit anymore, and people simply got used to it (and never touch that part of the code anyway). But I might be wrong. I have not yet benchmarked any alternative, but it is on my todo list (with low priority).
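A sketch of that literal encoding as it commonly appears in SAT solvers (names are illustrative): DIMACS literal n or -n becomes a small non-negative index usable directly into solver arrays.

```c
/* DIMACS convention: variables are numbered from 1; -n negates n. */
static unsigned lit_encode(int dimacs_lit) {
    unsigned var = (unsigned)(dimacs_lit < 0 ? -dimacs_lit : dimacs_lit);
    return 2 * var + (dimacs_lit < 0); /* lsb = negation flag */
}

static int lit_decode(unsigned lit) {
    int var = (int)(lit / 2);
    return (lit & 1) ? -var : var;
}
```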


Wouldn’t they use a plain bitmask instead, the lsb indicating negation, with 3 being the negation of 2? That’s zigzag encoding with a negative zero.


If I understand you correctly, then that's exactly what I am talking about. You might see this as a bitmask. That's correct.


A nice bijection from integers to natural numbers. Mappings exist for rationals -> natural numbers and other countable sets, but may be less practical.


Easy enough. If you have a rational class represented as a pair of irreducible int32_ts, you can simply do ((u64)numerator << 32) | (u64)denominator.


That either doesn’t map to rationals (making 1/1 and 2/2 two different numbers) or isn’t a bijection (0x0000000100000001 and 0x0000000200000002 both map back to 1), and it doesn’t work on “the integers”.


Most people just care about having a lossless round-trip conversion, from the source representation to a storage representation and then back to the source representation, not about having every possible bit sequence in the storage representation being valid. Univocally mapping the rationals onto a set of contiguous naturals is extremely tricky, computationally.


> Univocally mapping the rationals onto a set of contiguous naturals is extremely tricky, computationally.

It is not that difficult. https://en.wikipedia.org/wiki/Calkin–Wilf_tree#Stern's_diato... defines fusc(n), a function that’s fairly easy to compute (it is recursively defined, but only requires a recursion depth of the number of bits in the binary presentation of n), with fusc(n)/fusc(n+1) being a mapping from the natural numbers to all positive rational numbers.

That Wikipedia page also hints at how to implement the reverse. https://en.wikipedia.org/wiki/Calkin–Wilf_tree#Definition_an...:

“The parent of any rational number can be determined after placing the number into simplest terms, as a fraction a/b for which the greatest common divisor of a and b is 1. If a/b < 1, the parent of a/b is a/(b − a); if a/b > 1, the parent of a/b is (a − b)/b.”

Repeatedly using that procedure, you eventually reach 1, and then you have found the path from 1 down to your starting number a/b. Once you have that, finding the n for which fusc(n)/fusc(n+1) = a/b is trivial.
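A direct sketch of fusc from its recursive definition (fusc(0) = 0, fusc(1) = 1, fusc(2n) = fusc(n), fusc(2n+1) = fusc(n) + fusc(n+1); recursion depth is bounded by the bit length of n):

```c
#include <stdint.h>

/* Stern's diatomic sequence: fusc(n)/fusc(n+1) enumerates every
   positive rational exactly once as n runs over the naturals. */
static uint64_t fusc(uint64_t n) {
    if (n < 2) return n;
    if (n % 2 == 0) return fusc(n / 2);
    return fusc(n / 2) + fusc(n / 2 + 1);
}
```

The first few values are 0, 1, 1, 2, 1, 3, 2, 3, 1, ..., giving the ratio sequence 1/1, 1/2, 2/1, 1/3, 3/2, 2/3, 3/1, ...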


"pair of irreducible int32_ts"

2/2 is not irreducible, so not an issue. But yes, it does create some waste in the encoding. Not obvious how to fix it, as coprimality is hard to encode for.


Found this recently in a proprietary code base for (in abstract terms) storing enormous numbers of price values. Very interesting.


I have to fill out a captcha just to read an article / blog post? (Edit: and enable cookies)


Or install uBlock Origin in Firefox, block the JavaScript, and read the entire article without a captcha getting in the way.


I had uBlock Origin and uMatrix installed. Not sure what changed, but I can access the article now.


It's cool that just the name is enough to tell you what the encoding is. He didn't mention it in the post, but it can be visualized as starting from 0, then zigzagging right and left to 1, -1, 2, -2, ... and labelling each point with consecutive natural numbers.
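Decoding consecutive naturals reproduces exactly that walk (in protobuf's convention the walk visits -n just before n, i.e. 0, -1, 1, -2, 2, ...):

```c
#include <stdint.h>

/* Inverse zigzag: natural number back to signed integer. */
static int32_t zz_decode(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```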


Can anyone suggest high-quality resources for learning about compression?


A really tremendous varint/VLQ encoder (using a zigzag encoding and a generalized base):

https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16...

and the decoder:

https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e...

The common varint, which you see in applications everywhere (e.g., git), is just the base-128 version of the above general scheme!

But base-32 or base-8 or base-64 varints can be a big win for some purposes. Remove the zig-zag encoding if negative integers don't occur.
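A sketch of the general scheme, parameterized by B payload bits per group (B = 7 gives the common base-128 varint; B = 5 gives base-32, and so on). For simplicity this packs one group per byte; a real bit-packed writer would concatenate (B+1)-bit groups.

```c
#include <stdint.h>
#include <stddef.h>

/* Precondition: 1 <= B <= 7 so each group plus its continuation bit
   fits in one byte.  Returns the number of groups written. */
static size_t varint_encode_base(uint32_t v, unsigned B, uint8_t *out) {
    uint32_t mask = (1u << B) - 1, cont = 1u << B;
    size_t n = 0;
    while (v > mask) {
        out[n++] = (uint8_t)((v & mask) | cont);
        v >>= B;
    }
    out[n++] = (uint8_t)v;
    return n;
}

static size_t varint_decode_base(const uint8_t *in, unsigned B, uint32_t *v) {
    uint32_t mask = (1u << B) - 1;
    uint32_t acc = 0;
    size_t n = 0;
    unsigned shift = 0;
    do {
        acc |= (uint32_t)(in[n] & mask) << shift;
        shift += B;
    } while (in[n++] & (1u << B));
    *v = acc;
    return n;
}
```

As the comment notes, zigzag would be applied before this step when negative integers can occur, and skipped when they can't.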



