Its not clear to me what the advantage is. The regular algorithm does not have a conditional in the inner loop and runs over data in a fairly cache friendly was. It's also O(N^3). What's the benefit here?
As mentioned in the post, the conditional is trivially converted to a cmov or a bitwise expression. A column-oriented algorithm makes sense when the data is naturally laid out with bitpacked columns.