
From the SMHasher test results, the quality of xxhash seems higher: it has less bias / higher uniformity than CRC.

What bothers me about probability calculations is that they always assume perfect uniformity. I've never seen any estimates of how bias affects collision probability, or how to modify the probability formula to account for a hash function's imperfect uniformity.
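
To make the question concrete, here's a rough back-of-envelope sketch (my own, not taken from any reference): if a hash's outputs follow a distribution p over m values, two independent inputs collide with probability c = sum(p_i^2), which equals 1/m only for a perfectly uniform hash, and the usual birthday estimate can use that c in place of 1/m. The "biased" distribution below is a made-up toy, purely for illustration.

    import math

    def collision_probability(n, c):
        """Birthday estimate: P(at least one collision) among n items,
        where any two items collide with probability c."""
        pairs = n * (n - 1) / 2
        return 1 - math.exp(-pairs * c)

    m = 2**32                          # 32-bit hash output space
    uniform_c = 1 / m
    # toy bias: 1% of the output space soaks up 10% of the probability mass
    hot = int(0.01 * m)
    biased_c = hot * (0.10 / hot) ** 2 + (m - hot) * (0.90 / (m - hot)) ** 2

    n = 10_000
    print(collision_probability(n, uniform_c))   # ~0.0116
    print(collision_probability(n, biased_c))    # ~0.0209, about 1.8x worse for this toy bias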



It doesn't matter, though. xxhash is better than crc32 for hashing keys in a hash table, but both of them are inappropriate for file checksums -- especially as part of a data archival/durability strategy.

It's not obvious to me that per-page checksums in an archive format for comic books are useful at all, but if you really wanted them for some reason, then crc32 (fast, common, should detect bad RAM or a decoder bug) or sha256 (slower, common, should detect any change to the bitstream) seem like reasonable choices, and xxhash/xxh3 seems like LARPing.
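
For what it's worth, both of those are a one-liner from the Python standard library (zlib.crc32 and hashlib.sha256), whereas xxhash needs a third-party package. The page bytes below are just a placeholder.

    import hashlib
    import zlib

    page = b"...decoded page bytes..."            # placeholder payload

    crc = f"{zlib.crc32(page) & 0xffffffff:08x}"  # 32-bit check value, cheap to compute and verify
    digest = hashlib.sha256(page).hexdigest()     # 256-bit digest, catches any change in practice

    print(crc)
    print(digest)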


> both of them are inappropriate for file checksums

CRCs like CRC32 were born for exactly this kind of work: detecting corruption when transmitting or storing data. What do you mean when you say they're inappropriate for file checksums? They're ideal for file checksums.


Uniformity isn’t directly important for error detection. CRC-32 has the nice property that it’s guaranteed to detect all burst errors up to 32 bits in size, while a b-bit hash only detects them with probability about 1 − 2^−b, i.e. it misses with probability roughly 2^−b. (But it’s valid to care about detecting larger errors with higher probability, yes.)
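
A quick empirical illustration of the burst-error guarantee (a sketch, not a proof -- the real argument is that a burst of length <= 32 corresponds to an error polynomial that the degree-32 generator, with its nonzero constant term, can't divide): flip a contiguous run of up to 32 bits anywhere in a random message, and zlib.crc32 always changes.

    import os
    import random
    import zlib

    def flip_burst(data: bytes, start_bit: int, length: int) -> bytes:
        """Flip a contiguous run of `length` bits starting at `start_bit`."""
        n = int.from_bytes(data, "big")
        total_bits = len(data) * 8
        mask = ((1 << length) - 1) << (total_bits - start_bit - length)
        return (n ^ mask).to_bytes(len(data), "big")

    msg = os.urandom(256)
    orig = zlib.crc32(msg)
    for _ in range(10_000):
        length = random.randint(1, 32)
        start = random.randint(0, len(msg) * 8 - length)
        assert zlib.crc32(flip_burst(msg, start, length)) != orig
    print("no undetected bursts of <= 32 bits in this sample")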


> Uniformity isn’t directly important for error detection.

Is there any proof of this? I'm interested in reading more about it.

> detect all burst errors up to 32 bits in size

What if errors are not consecutive bits?


There’s a whole field’s worth of really cool stuff about error correction that I wish I knew even a fraction of well enough to give reading recommendations, but my comment wasn’t that deep. With hashes you obviously care about distribution, because that’s almost the entire point of non-cryptographic hashes; with error detection you only care that x ≠ y implies f(x) ≠ f(y) with high probability. The two are only directly related in the obvious way of making use of the output space (though they’re probably indirectly related in some interesting, subtler ways).

E.g. f(x) = concat(xxhash32(x), 0xf00) is just as good at error detection as xxhash32 but is a terrible hash, and, as mentioned, CRC-32 is infinitely better at detecting certain types of errors than any universal hash family.
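
A toy version of that construction, assuming the third-party "xxhash" package (pip install xxhash): padding the digest with a constant changes nothing about error detection, since the constant bits never differ, but as a hash-table hash it's awful because those bits carry no information.

    import xxhash

    def padded_hash(data: bytes) -> bytes:
        # f(x) = concat(xxhash32(x), 0xf00), as in the comment above
        return xxhash.xxh32(data).digest() + (0xF00).to_bytes(2, "big")

    a = padded_hash(b"page 1")
    b = padded_hash(b"page 1 with a flipped bit")
    print(a.hex(), b.hex())     # differ exactly when the xxhash32 prefixes differ
    print(a[-2:] == b[-2:])     # True: the padded bits are constant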


This seems to make sense, but I need to read more about error correction to fully understand it. I was considering the possibility that data could also contain patterns where error detection performs poorly due to bias, and I haven't seen how to include that in the probability calculations.



