purplesyringa's post below is ~7 assembly instructions without any need of lookup tables.
And the instructions purplesyringa uses are Add, AND, OR, and Compare. Modern CPUs can execute ~4 ADD/AND/OR instructions per clock tick (even if they are SIMD or even AVX512 instructions), so you're looking at under 4ish clock cycles to purplesyringa's solution.
Its likely the best solution in the whole thread. The main advantage of pshufb is its extreme flexibility due to the lookup table, so there's advantages in learning that methodology.
But in general, sequences of simpler instructions are much faster.
---------
The "hard part" of purplesyringa's code is that its branchless.
Learning to write branchless (especially using the pcmpeq* line of instructions) is a bit of a hassle and worth practicing for sure. AVX512 added explicit mask registers that do it better but 128-bit SSE / 256-bit AVX is still more widespread today.
And the instructions purplesyringa uses are Add, AND, OR, and Compare. Modern CPUs can execute ~4 ADD/AND/OR instructions per clock tick (even if they are SIMD or even AVX512 instructions), so you're looking at under 4ish clock cycles to purplesyringa's solution.
Its likely the best solution in the whole thread. The main advantage of pshufb is its extreme flexibility due to the lookup table, so there's advantages in learning that methodology.
But in general, sequences of simpler instructions are much faster.
---------
The "hard part" of purplesyringa's code is that its branchless.
Learning to write branchless (especially using the pcmpeq* line of instructions) is a bit of a hassle and worth practicing for sure. AVX512 added explicit mask registers that do it better but 128-bit SSE / 256-bit AVX is still more widespread today.