How to compare two packed bitfields without having to unpack each field (2019) (devblogs.microsoft.com/oldnewthing)
81 points by luu on May 22, 2024 | 26 comments


The author says, who needs SIMD registers! And indeed, early processors like the 68000 had no dedicated SIMD registers, so to do SIMD within a register (SWAR) you had to use some variant of the techniques presented.

This was used in a fast four-sample player that could do math on four 8-bit wavetable indices in one 32-bit operation on the 68000. I had forgotten the details, so this was a nice reminder of how it must have worked; I would otherwise have had to disassemble it, since the source no longer exists. Speed was necessary because it was for the Atari ST and the work was done in a fairly frequent interrupt, so it was important to keep the overhead as minimal as possible.
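Not necessarily how that player did it (the source is gone), but for the curious, the standard SWAR per-byte add looks something like this sketch: mask off the top bit of each byte so lane carries can't propagate, add, then XOR the top bits back in.

  #include <cstdint>

  // Add four 8-bit lanes packed in one 32-bit word; each lane wraps
  // at 256 without carrying into its neighbour.
  uint32_t swar_add_u8x4(uint32_t x, uint32_t y) {
      const uint32_t H = 0x80808080u;      // top bit of each byte
      uint32_t low = (x & ~H) + (y & ~H);  // add low 7 bits; carries stop at bit 7
      return low ^ ((x ^ y) & H);          // fold the top bits back in, carry-free
  }

Advancing all four wavetable indices is then one 32-bit operation instead of four byte-sized loads, adds, and stores.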


what does the disassembly look like?


To find that out, one would need to get the executable of The Duel: Test Drive II for the Atari ST and disassemble the title screen.

But in any case, you can compare the chiptune version of the title music in Test Drive I vs. the MOD-player-style tune in II:

I https://www.youtube.com/watch?v=RGOdHT29Jpc

II https://www.youtube.com/watch?v=1GrdvcghDXE


You can get a fast equivalent to his zero-padding method on recent-ish x86 and ARM SVE. The x86 assembly would look like (untested):

  compare:  ;; lhs in eax, rhs in ebx
    mov ecx, 0b111110111111011111    ; RGB565 fields with a zero guard bit above each
    pdep eax, eax, ecx
    pdep ebx, ebx, ecx
    sub eax, ebx
    test eax, 0b1000001000001000000  ; guard bits 5, 12, 18 catch per-field borrows
    mov eax, 0
    setnz al                         ; 1 if some component of lhs is less than rhs's
    ret
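Roughly the same thing can be written with the BMI2 intrinsics where inline assembly is inconvenient. An untested sketch (the function name echoes the article's; needs -mbmi2 or equivalent): deposit each RGB565 field with a zero guard bit above it, subtract, and check the guards, since a guard ends up set exactly when the field below it produced a borrow.

  #include <cstdint>
  #include <immintrin.h>  // _pdep_u32 (BMI2)

  bool IsEveryComponentGreaterThanOrEqual(uint16_t x, uint16_t y) {
      constexpr uint32_t kDeposit = 0b111110111111011111;   // r:5 | 0 | g:6 | 0 | b:5
      constexpr uint32_t kGuards  = 0b1000001000001000000;  // guard bits 5, 12, 18
      uint32_t d = _pdep_u32(x, kDeposit) - _pdep_u32(y, kDeposit);
      return (d & kGuards) == 0;  // no guard set => no component of x was less
  }

One caveat: pdep is fast on Intel since Haswell but microcoded and slow on AMD before Zen 3.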


I just benchmarked the three versions (in Rust, with criterion) on my Coffee Lake CPU. The bracketed numbers are criterion's lower bound, estimate, and upper bound:

  Naive version, best case:   [481.40 ps 487.84 ps 494.56 ps]
  Naive version, mid case:    [751.56 ps 758.84 ps 766.17 ps]
  Naive version, worst case:  [953.71 ps 970.95 ps 994.65 ps]
  Packed bitfield version:    [685.98 ps 698.25 ps 711.57 ps]

So on average, the naive and packed-bitfield versions are within 10% of one another. Modern CPUs frequently don't benefit much from these kinds of tricks anymore.


The suggested code tests for bitfield borrows with

    auto c = (~x & y) | (~(x ^ y) & (x - y));
    c &= 0x8410;
    return c == 0;
where the literal bitmap contains the most significant bit of each bitfield.

Couldn't this be written equivalently as

     auto c = x ^ y ^ (x - y);
     c &= 0x10820;
where the literal bitmap now contains the bits just left of each bitfield?
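One way to settle it is brute force: a quick sketch that checks the proposed form against the article's formula over all 2^32 pairs of 16-bit values (a few seconds of CPU time).

  #include <cstdint>
  #include <cstdio>

  int main() {
      for (uint32_t x = 0; x <= 0xFFFF; ++x) {
          for (uint32_t y = 0; y <= 0xFFFF; ++y) {
              // the article's borrow formula, masked at the MSB of each field
              uint32_t c = (~x & y) | (~(x ^ y) & (x - y));
              bool a = (c & 0x8410) == 0;
              // the proposed form, masked just left of each field
              bool b = ((x ^ y ^ (x - y)) & 0x10820) == 0;
              if (a != b) {
                  std::printf("mismatch at x=%04x y=%04x\n",
                              (unsigned)x, (unsigned)y);
                  return 1;
              }
          }
      }
      std::puts("the two forms agree on all pairs");
      return 0;
  }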


I think part of the challenge is to not exceed the word size, 16 bits in this case.


that's useful with explicit simd instructions which let you process 32 (pairs of corresponding) 16-bit pixels with one avx-512 instruction, or on a 16-bit machine like the 80286, but unnecessary if you're processing a single pair of 16-bit pixels with 32-bit instructions. on the other hand, you might have 32-bit vectors or something, for example for the standard axis-aligned bounding box test


Is the vector method really faster than mask/subtract/compare?

They never really say in the article.


I haven't benchmarked it, but almost certainly yes - the unpacking shouldn't be that big of a deal, but three branches vs zero branches likely is.

And the function as a whole is much more vectorisable - if it's in an inner loop, the compiler ought to be able to inline it and autovectorise the loop.


You can reduce the branches by eliminating the early returns and the short-circuiting: use bitwise AND instead, to encourage the compiler not to generate extra branches (the compiler is generally free to do this anyway, since there is no side effect in the example code that would prohibit the optimization). So what we should really compare the improved version against is:

  bool IsEveryComponentGreaterThanOrEqual(uint16_t x, uint16_t y) {
    int rv;
    auto xr = x & 0xF800; auto yr = y & 0xF800; rv = xr >= yr;   // r field: bits 11-15
    auto xg = x & 0x07E0; auto yg = y & 0x07E0; rv &= xg >= yg;  // g field: bits 5-10
    auto xb = x & 0x001F; auto yb = y & 0x001F; rv &= xb >= yb;  // b field: bits 0-4
    return rv != 0;
  }

According to Compiler Explorer, the one presented in the article still compiles to far fewer instructions on x86_64. Interestingly, the optimization of eliminating the bitshifts doesn't change how it is compiled, which says something about trusting your compiler with some of this stuff. https://godbolt.org/z/zWf5GnPPb

One more interesting thing, however: on ARMv8 the two are nearly identical in size, 12 instructions for the one in the article vs 11 for my modified version. https://godbolt.org/z/d4TPqGhKe


Thanks for the links! Seems like we don't need to mask prior to the first compare, since the low order bits only matter if the high order bits are equal (in which case, we can also safely compare lower bits). Or am I missing something?

  bool IsEveryComponentGreaterThanOrEqual(uint16_t x, uint16_t y) {
    int rv = x >= y;
    auto xg = x & 0x07E0; auto yg = y & 0x07E0; rv &= xg >= yg;
    auto xb = x & 0x001F; auto yb = y & 0x001F; rv &= xb >= yb;
    return rv != 0;
  }


Those branches are unnecessary and easily avoided.


Based on my quick-and-dirty benchmarking, it's similar in terms of performance. The vector comparison is consistent; the unpacking comparison is about 1.5x faster when it can exit early, but 4x slower otherwise. Here are the results: https://pastebin.com/QL4sNzhj (the code is straight from the article; the benchmarking was done with Google Benchmark).


It is much worse than just the three extra test-and-branch operations, because they will be serialized through one ALU.


Many years ago I saw, on some text-only bit-twiddling-methods page in the wild, a method for computing the approximate distance between two 2D vectors using only bitwise operators. Anybody know it? I've been unable to find it for about 8 years.
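No idea which page that was, but it may have been something in the alpha-max-plus-beta-min family, e.g. the shift-only "octagonal" approximation sketched below (the function name is mine): approximate sqrt(dx^2 + dy^2) as max + min/2, which overestimates by at most about 12% and uses only compares, adds, and shifts.

  #include <cstdint>
  #include <cstdlib>  // std::abs

  uint32_t approx_dist(int32_t x0, int32_t y0, int32_t x1, int32_t y1) {
      uint32_t dx = std::abs(x1 - x0);
      uint32_t dy = std::abs(y1 - y0);
      uint32_t mx = dx > dy ? dx : dy;   // larger component
      uint32_t mn = dx > dy ? dy : dx;   // smaller component
      return mx + (mn >> 1);             // no multiply, no sqrt
  }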


Something I've been greatly enjoying in using Zig is builtin support for packed bitfields in the form of packed structs.

The data structure from the article would be:

    const RGB = packed struct(u16) {
        b: u5,
        g: u6,
        r: u5,
    };
So the test becomes

       const gte: bool = x.r >= y.r and x.g >= y.g and x.b >= y.b;
Okay, so this doesn't get us the optimal form described in the article.

Or does it? Since the compiler knows about packed structs, it could perform the optimization for me.

Does it, right this instant? Eh, probably not. But compilers have a tendency to improve, and this is a local pattern which would be fairly easy to recognize. The recognition is the point: it's much, much harder to recognize the intent in the initial implementations described in the article, and then replace them with the optimal version. The way I was able to write it in Zig, in addition to being far more expressive of the algorithm's intent, conveys to the compiler: compare the values of these bitfields, I don't care how. The compiler doesn't have to prove that I have no other reason for the other operations, besides making said comparison possible: it can just emit the optimal code.


C++ also supports this. However, you still end up needing to do a bunch of operations; the compiler just hides them from you, which is why I think Chen opted not to use it.

  struct my_bitfield
  {
    uint16_t b:5;
    uint16_t g:6;
    uint16_t r:5;
  };

  bool IsEveryComponentGreaterThanOrEqual4(my_bitfield x, my_bitfield y)
  {
    return x.b >= y.b && x.g >= y.g && x.r >= y.r;
  }


Some differences: the standard doesn't guarantee that C++ bitfields will be densely packed (this is mostly a technicality), you can't bind a non-const reference to a bitfield, and you can't take a pointer to one. Zig's pointer-to-bitfield is a special type, for obvious reasons, but they can come in handy.

I have no idea if the requirement that each bitfield be reified as its declared type inhibits optimization of that example or not, in practice. Maybe, maybe not.


C bitfields aren't packed. You have to resort to compiler extensions to get that behavior.


Not sure where you got that from. I think you’re confusing it with structs.

> Consecutive bit fields are packed together, but each bit field must fit within a single object of its specified type

[1] https://www.gnu.org/software/c-intro-and-ref/manual/html_nod...


Which C compiler doesn't pack bitfields? Anything on x86 or ARM is bound by the ABI to pack in a standard, link-compatible way.

It seems there would only be a problem on some proprietary compiler for a bespoke embedded target.


I think SDCC has weird bitfields, originally due to a bug but now baked into the ABI for some of its platforms?

But there are only a handful of ways to lay out bitfields:

* Use little-endian or big-endian bit order? (this need not match the byte order, but usually does - IIRC recent GCC no longer supports any platforms where it doesn't?). This is the only one you can't control. Networking headers rely on a 2-way preprocessor branch, so no other layouts are really possible, at least for units up to `unsigned int` (which may be only 16 bits).

* If padding is needed, is it on the left or on the right? or, on the top or on the bottom? (this might always be tied to bit-endianness, but theoretically it is detached). For maximum portability, you must add explicit padding, doing math with `CHAR_BIT`.

* What exactly happens if an over-sized bitfield is attempted? (beware, there is no stable ABI for this in practice!)

* Do zero-sized bitfields align to the next unit?

* Can bitfields cross units? Is perhaps a larger-than-user-specified unit used? Consider `struct {uint8_t a:4, b:8, c:4};`

* If a mixture of types is used, do they get aggregated unconditionally, by size, by type? Ignoring signedness or not? Consider `struct { int a:1; long b:1; long long c:1; }`

* Is `int` the same as `signed int` or `unsigned int`, and for what widths?

It's unfortunate that testing some of these can't be done at compile time (a few can via `sizeof`), so you may need to either manually inspect assembly code, or else rely on optimization to do the emit-constants-via-strings trick when cross-compiling (C++, if supported on your platform, may do better than C at accessible compile-time optimization).
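For the first point, at least, a run-time probe is easy. A sketch (assumes CHAR_BIT == 8 and a single-byte layout for the hypothetical `Probe` struct):

  #include <cstdio>
  #include <cstring>
  #include <cstdint>

  struct Probe { uint8_t first : 1, rest : 7; };

  int main() {
      Probe p{};      // all bits zero
      p.first = 1;    // set the first-declared field
      uint8_t raw;
      std::memcpy(&raw, &p, 1);
      if (raw == 0x01)
          std::puts("bitfields fill from the least significant bit");
      else if (raw == 0x80)
          std::puts("bitfields fill from the most significant bit");
      else
          std::puts("something more exotic is going on");
  }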


Interesting optimization, particularly for something like an FPGA implementation, but, as the article says, also useful for improving an inner loop. Thanks for sharing it.


correct me if i'm wrong, but in an fpga, you can just directly compute the borrow bits you want? nothing obliges you to keep routing them up the carry chain, does it?




