Making all your integers positive with zigzag encoding (lemire.me)
109 points by jjgreen on Nov 25, 2022 | 40 comments


The essence of the trick: make the sign bit the least significant bit. Then all the leading bytes for integers of small magnitude remain zeros (not 0xFF as in two's-complement integers), and compression becomes easy.
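As a concrete sketch of that mapping and its inverse (assuming 32-bit two's complement and an arithmetic right shift for the signed `>>`):

```c
#include <stdint.h>

/* Zigzag encode: move the sign bit to the least significant position.
   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   Small magnitudes keep their leading bytes zero, which compresses well. */
static uint32_t zigzag_encode(int32_t v) {
    /* v >> 31 assumes an arithmetic right shift (all ones for negative v) */
    return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31);
}

static int32_t zigzag_decode(uint32_t u) {
    /* 0u - (u & 1) is 0xFFFFFFFF when the low (sign) bit is set, else 0 */
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```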


This is an amazing explanation, and it immediately makes the formula `(val<<1) ^ (val>>31)` intuitive, compared to trying to interpret it as mapping non-negative x -> 2x and negative x -> 2|x| - 1.


I used this trick for delta compression in an ArrayBuffer-backed data exchange format. Since it was sent from a web server, we got gzip for free, but we could still shrink data sizes by filtering the data in such a way as to make it easier to compress. Delta-compressing correlated data points yields a lot of small but randomly distributed positive/negative values. By zigzag-encoding them but still saving them as 32-bit values, you end up with a lot of runs of 0 bytes, which compress very efficiently and only require a quick pass through the data to restore the original values.
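A hypothetical sketch of that delta + zigzag filter (the function names are made up; the output would then be handed to a general-purpose compressor such as gzip):

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t zz(int32_t v)   { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
static int32_t  unzz(uint32_t u){ return (int32_t)((u >> 1) ^ (0u - (u & 1))); }

/* Correlated samples become small deltas; zigzag keeps them non-negative,
   so the three high bytes of each stored u32 are almost always zero.
   Assumes each delta fits in an int32_t. */
static void delta_zigzag(const int32_t *in, uint32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = zz(in[i] - prev);
        prev = in[i];
    }
}

/* The quick restoration pass: undo zigzag, then prefix-sum the deltas. */
static void undelta_zigzag(const uint32_t *in, int32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += unzz(in[i]);
        out[i] = prev;
    }
}
```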


Came here to say the same thing. When I first saw the protobuf zigzag encoding, I saw the same pattern. At first, I thought I must be wrong because they aren't describing it this way. Nope, it really is that simple.

In earlier computer systems (pre 1980s), this kind of variety was more common. In my career, I have only been exposed to twos-complement or float formats. I would bet that this is true for many other engineers as well.

I'm glad the protobuf folks used it. I'm also glad Dr. Lemire wrote an article about it. It is another clever trick to have in the toolbox.


In other words, a left rotation by 1 bit.


No. Starting from the typical two's-complement number encoding used on all modern computers, it is not a simple rotation.

Think about the encoding of minus 1 for an example that differs...
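For minus 1 specifically: -1 in two's complement is 0xFFFFFFFF, which a left rotation by 1 leaves unchanged, while zigzag maps it to 1.

```c
#include <stdint.h>

/* Left rotation by 1 bit on a 32-bit word. */
static uint32_t rotl1(uint32_t x) { return (x << 1) | (x >> 31); }

/* Zigzag encoding, for comparison (assumes arithmetic right shift). */
static uint32_t zigzag(int32_t v) { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
```

The two agree on non-negative inputs (both reduce to `x << 1`), but diverge on every negative one.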


Argh, you're right. Classic mistake of tunnel-visioning on a subset of the domain.


The `fast_decode` as written is undefined behavior for some inputs (INT_MAX for example) because of the integer overflow in `2*x`. This can be fixed by casting to an unsigned int first.

Also, a warning: right-shifting a negative number gives the expected result (arithmetic right shift) in C++20, but it is still implementation-defined even in the C23 draft spec. Likely you're good, but the spec doesn't guarantee it.
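Given those two caveats, a sketch of an encode/decode pair that avoids both: all arithmetic is done on `uint32_t` (unsigned overflow wraps, so no UB), and the only right shifts are on unsigned values (no implementation-defined behavior).

```c
#include <stdint.h>

static uint32_t safe_encode(int32_t v) {
    uint32_t u = (uint32_t)v;
    /* 0u - (u >> 31) is 0 for non-negative v and 0xFFFFFFFF for negative v:
       the same mask an arithmetic shift would produce, built portably. */
    return (u << 1) ^ (0u - (u >> 31));
}

static int32_t safe_decode(uint32_t u) {
    /* (u >> 1) is fine: u is unsigned, so this is a logical shift. */
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```

The final cast back to `int32_t` is value-preserving on two's-complement targets, i.e. every machine you are realistically compiling for.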


It's pretty cool how a simple encoding scheme can have profound effects on compression. This is a big part of what makes DCT-style block compression work as well as it does. You rig the game such that all of your higher frequency components wind up at the end - ideally as zeroes after quantization. So, a simple RLE scheme can do a lot more work than it otherwise could.


Sorry, are you implying this is how DCT works, or are you making a comparison?

I've just read up on DCT and don't really have a clue, but it doesn't seem applicable?

That being so, I'm not sure how RLE at the bit level would help. Surely encoding run lengths of 5-ish bits isn't going to compress much of anything.


The idea behind this is to scan diagonally by frequency so that big numbers are grouped together at the start. You can choose to zero out some of the bits near the end since those only contain high frequency components that are harder to notice. RLE then has the ability to work on the trailing zeroes, or other methods can be used instead.
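The diagonal scan can be sketched as follows (a generic version of the JPEG-style zigzag scan over an N x N coefficient block; not any particular codec's code):

```c
/* Emit the zigzag scan order for an N x N block as flat indices:
   walk the anti-diagonals (constant i + j), alternating direction,
   so low-frequency coefficients come first and the high-frequency
   ones (mostly zero after quantization) trail at the end. */
#define N 8
static void zigzag_scan_order(int order[N * N]) {
    int k = 0;
    for (int s = 0; s <= 2 * (N - 1); s++) {
        if (s % 2 == 0) {
            /* even diagonal: walk up-right (row index decreasing) */
            for (int i = (s < N ? s : N - 1); i >= 0 && s - i < N; i--)
                order[k++] = i * N + (s - i);
        } else {
            /* odd diagonal: walk down-left (column index decreasing) */
            for (int j = (s < N ? s : N - 1); j >= 0 && s - j < N; j--)
                order[k++] = (s - j) * N + j;
        }
    }
}
```

For N = 8 this reproduces the familiar JPEG order 0, 1, 8, 16, 9, 2, 3, 10, ...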


Zigzag is a phase after DCT+quantization where you iterate through the resulting matrix in a particular order. Nothing to do with the DCT itself.

https://commons.wikimedia.org/wiki/File:JPEG_example_zigzag....


reminds me of this video on using "-2" as a base for numbers (can represent all positive and negative integers without a sign bit) https://www.youtube.com/watch?v=GWA_NraxOUw


Protobuf was the first major project I was aware of that uses zigzag encoding for integers. They then encode those integers using the typical 7 bit encoding scheme where the msb indicates there's another byte following. I'm sure they weren't the first by any means though.
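A minimal sketch of that 7-bit continuation scheme (LEB128-style, as protobuf uses for its varints):

```c
#include <stdint.h>
#include <stddef.h>

/* 7 payload bits per byte; the msb is set on every byte except the last.
   Returns the number of bytes written (at most 5 for a uint32_t). */
static size_t varint_encode(uint32_t v, uint8_t *out) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Returns the number of bytes consumed. */
static size_t varint_decode(const uint8_t *in, uint32_t *out) {
    uint32_t v = 0;
    size_t n = 0;
    int shift = 0;
    do {
        v |= (uint32_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *out = v;
    return n;
}
```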

I'm currently using this in a binary format for serializing dynamic JSON-like values that I invented and am implementing in Rust. I will release it as open source sometime next year.


This format makes it impossible to write fast decoders, as discussed elsewhere on the same blog as TFA: https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-...

So it would be sweet to instead use a format for which it is possible to write fast decoders.


Yeah, I don't like that method for encoding variable length integers.

I usually use something I came up with years ago where the low 2 or 3 bits are the length - 1 in bytes. It's more compact and can be serialized/deserialized without branches in the common case. It uses the count leading zeroes instruction for encoding. I'm sure somebody else has come up with the idea as well.

The downside is you can only represent 30 bits for u32 and 61 for u64. In many cases you know you don't have values that large, so it's fine.
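A hypothetical sketch of this scheme (my own reconstruction from the description, not the commenter's code): the low 2 bits of the first byte carry length - 1 in bytes, stored little-endian, so the decoder learns the length without scanning. Uses GCC/Clang's `__builtin_clz`.

```c
#include <stdint.h>
#include <stddef.h>

/* Precondition: v < 2^30 (only 30 payload bits are available for u32). */
static size_t prefix_encode(uint32_t v, uint8_t out[4]) {
    /* Count the bits needed for the payload plus the 2-bit length tag;
       the | 1 keeps __builtin_clz away from its undefined zero input. */
    int bits = 32 - __builtin_clz((v << 2) | 1);
    size_t len = (size_t)((bits + 7) / 8);
    uint32_t word = (v << 2) | (uint32_t)(len - 1);
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(word >> (8 * i));
    return len;
}

static size_t prefix_decode(const uint8_t *in, uint32_t *v) {
    size_t len = (size_t)(in[0] & 3) + 1;   /* length known from one byte */
    uint32_t word = 0;
    for (size_t i = 0; i < len; i++)
        word |= (uint32_t)in[i] << (8 * i);
    *v = word >> 2;
    return len;
}
```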

If you're encoding arrays of integers, Lemire has a library that uses vector instructions to encode them in the optimal number of bits. That's even better, but the use cases are far more restricted.


I first ran into this encoding scheme in Avro, which you could also describe as a format for dynamic JSON-like values :)


Does your protocol provide any advantages over bincode?


Yes, but not as a general-purpose encoding. It's more fit for its intended purpose.


Used in the [postcard](https://docs.rs/postcard/latest/postcard/) crate for encoding variably sized signed integers.


I first encountered this in the FLAC source[1]. They do the bit twiddling a bit differently than in this article:

    uval = (val<<1) ^ (val>>31);
Variations of this have probably been used countless times in other libraries.

[1] https://github.com/LordJZ/libflac/blob/master/src/libFLAC/bi...


Author of postcard here, I don't remember exactly where I got my bit twiddling example from, lemme see if I have notes from that.

postcard's zigzag encoding matches phoboslab's pseudocode.

Edit, not totally sure, but this wiki page rings a bell, and is probably where I got my impl from: https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...

Edit 2: I also explain why I do this (it compresses better) in my wire format specification: https://postcard.jamesmunns.com/wire-format.html#signed-inte...


A special case of Hopcroft and Ullman's pairing functions!

https://en.wikipedia.org/wiki/Pairing_function


Nice! The same idea is the essence of Hilbert's paradox of the Grand Hotel (https://en.wikipedia.org/wiki/Hilbert%27s_paradox_of_the_Gra...), especially the case of infinitely many new guests.


Most modern SAT solvers use a variant of that for the literal encoding, by the way, i.e. to encode negation of variables, which are traditionally represented with negative integers. Mostly because they then use the encoded literal as an index into an array. Historically this also had performance reasons, if I remember correctly. I feel it has no benefit anymore, and people simply got used to it (and never touch that part of the code anyway). But I might be wrong. I have not yet benchmarked any alternative, but it is on my todo list (with low priority).
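A sketch of that literal encoding as it commonly appears in SAT solvers (names are illustrative): DIMACS literal n or -n becomes a small non-negative index usable directly into solver arrays.

```c
/* DIMACS convention: variables are numbered from 1; -n negates n. */
static unsigned lit_encode(int dimacs_lit) {
    unsigned var = (unsigned)(dimacs_lit < 0 ? -dimacs_lit : dimacs_lit);
    return 2 * var + (dimacs_lit < 0); /* lsb = negation flag */
}

static int lit_decode(unsigned lit) {
    int var = (int)(lit / 2);
    return (lit & 1) ? -var : var;
}
```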


Wouldn’t they use a plain bitmask instead, the lsb indicating negation, with 3 being the negation of 2? That’s zigzag encoding with a negative zero.


If I understand you correctly, then that's exactly what I am talking about. You might see this as a bitmask. That's correct.


A nice bijection from integers to natural numbers. Mappings exist for rationals -> natural numbers and other countable sets, but may be less practical.


Easy enough. If you have a rational class represented as a pair of irreducible int32_ts, you can simply do ((u64)numerator << 32) | (u64)denominator.


That either doesn’t map to rationals (making 1/1 and 2/2 two different numbers) or isn’t a bijection (0x0000000100000001 and 0x0000000200000002 both map back to 1), and it doesn’t work on “the integers”.


Most people just care about having a lossless round-trip conversion, from the source representation to a storage representation and then back to the source representation, not about having every possible bit sequence in the storage representation being valid. Univocally mapping the rationals onto a set of contiguous naturals is extremely tricky, computationally.


> Univocally mapping the rationals onto a set of contiguous naturals is extremely tricky, computationally.

It is not that difficult. https://en.wikipedia.org/wiki/Calkin–Wilf_tree#Stern's_diato... defines fusc(n), a function that’s fairly easy to compute (it is recursively defined, but only requires a recursion depth of the number of bits in the binary presentation of n), with fusc(n)/fusc(n+1) being a mapping from the natural numbers to all positive rational numbers.

That Wikipedia page also hints at how to implement the reverse. https://en.wikipedia.org/wiki/Calkin–Wilf_tree#Definition_an...:

“The parent of any rational number can be determined after placing the number into simplest terms, as a fraction a/b for which the greatest common divisor of a and b is 1. If a/b < 1, the parent of a/b is a/(b − a); if a/b > 1, the parent of a/b is (a − b)/b.”

Repeatedly using that procedure, you eventually reach 1, and then you have found the path from 1 down to your starting number a/b. Once you have that, finding the n for which fusc(n)/fusc(n+1) = a/b is trivial.
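A direct sketch of fusc from its recursive definition (fusc(0) = 0, fusc(1) = 1, fusc(2n) = fusc(n), fusc(2n+1) = fusc(n) + fusc(n+1); recursion depth is bounded by the bit length of n):

```c
#include <stdint.h>

/* Stern's diatomic sequence: fusc(n)/fusc(n+1) enumerates every
   positive rational exactly once as n runs over the naturals. */
static uint64_t fusc(uint64_t n) {
    if (n < 2) return n;
    if (n % 2 == 0) return fusc(n / 2);
    return fusc(n / 2) + fusc(n / 2 + 1);
}
```

The first few values are 0, 1, 1, 2, 1, 3, 2, 3, 1, ..., giving the ratio sequence 1/1, 1/2, 2/1, 1/3, 3/2, 2/3, 3/1, ...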


"pair of irreducible int32_ts"

2/2 is not irreducible, so not an issue. But yes, it does create some waste in the encoding. Not obvious how to fix it, as coprimality is hard to encode for.


Found this recently in a proprietary code base for (in abstract terms) storing enormous numbers of price values. Very interesting.


I have to fill out a captcha just to read an article / blog post? (Edit: and enable cookies)


Or install uBlock Origin in Firefox, block the JavaScript, and read the entire article without a captcha getting in the way.


I had uBlock Origin and uMatrix installed. Not sure what changed, but I can access the article now.


It's cool that just the name is enough to tell you what the encoding is. He didn't mention it in the post, but it can be visualized as starting from 0, then zigzagging right and left to 1, -1, 2, -2, ... and labelling each point with consecutive natural numbers.
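Decoding consecutive naturals reproduces exactly that walk (in protobuf's convention the walk visits -n just before n, i.e. 0, -1, 1, -2, 2, ...):

```c
#include <stdint.h>

/* Inverse zigzag: natural number back to signed integer. */
static int32_t zz_decode(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));
}
```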


Can anyone suggest high-quality resources for learning about compression?


A really tremendous varint/VLQ encoder (using a zigzag encoding and a generalized base):

https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16...

and the decoder:

https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e...

The common varint, which you see in applications everywhere (e.g., git), is just the base-128 version of the above general scheme!

But base-32 or base-8 or base-64 varints can be a big win for some purposes. Remove the zig-zag encoding if negative integers don't occur.
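A sketch of the general scheme, parameterized by B payload bits per group (B = 7 gives the common base-128 varint; B = 5 gives base-32, and so on). For simplicity this packs one group per byte; a real bit-packed writer would concatenate (B+1)-bit groups.

```c
#include <stdint.h>
#include <stddef.h>

/* Precondition: 1 <= B <= 7 so each group plus its continuation bit
   fits in one byte.  Returns the number of groups written. */
static size_t varint_encode_base(uint32_t v, unsigned B, uint8_t *out) {
    uint32_t mask = (1u << B) - 1, cont = 1u << B;
    size_t n = 0;
    while (v > mask) {
        out[n++] = (uint8_t)((v & mask) | cont);
        v >>= B;
    }
    out[n++] = (uint8_t)v;
    return n;
}

static size_t varint_decode_base(const uint8_t *in, unsigned B, uint32_t *v) {
    uint32_t mask = (1u << B) - 1;
    uint32_t acc = 0;
    size_t n = 0;
    unsigned shift = 0;
    do {
        acc |= (uint32_t)(in[n] & mask) << shift;
        shift += B;
    } while (in[n++] & (1u << B));
    *v = acc;
    return n;
}
```

As the comment notes, zigzag would be applied before this step when negative integers can occur, and skipped when they can't.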



