I think it's just reading unaligned. That costs only about 2x in throughput when the data is coming from L1, and as soon as the problem is large enough that the working set doesn't reliably fit in L1, it doesn't matter a bit anymore.
In general for x86, unaligned writes are worth doing work to avoid, but reads are in most situations not really an issue.
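For anyone who wants to poke at this, here is a minimal sketch of an unaligned read loop in C. The names `sum_words` and `sums_match` are made up for illustration, and it only checks that the aligned and shifted reads give the same result. It doesn't measure throughput, so you would have to wrap your own timing around `sum_words` with buffers smaller and larger than your L1 to see the effect described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum n_words 64-bit values starting at p, which may have any alignment.
   memcpy is the portable way to express an unaligned load; optimizing
   compilers typically lower it to a single plain load on x86-64. */
uint64_t sum_words(const unsigned char *p, size_t n_words)
{
    assert(p != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n_words; i++) {
        uint64_t w;
        memcpy(&w, p + i * sizeof w, sizeof w);
        total += w;
    }
    return total;
}

/* Hypothetical self-check: the same data summed from its original buffer
   and from a copy shifted by `shift` bytes must give the same total. */
int sums_match(size_t shift)
{
    enum { N = 1024 };
    static uint64_t aligned[N + 1];
    static unsigned char shifted[(N + 1) * sizeof(uint64_t)];

    assert(shift <= sizeof(uint64_t)); /* keep the copy inside `shifted` */

    for (size_t i = 0; i < N; i++)
        aligned[i] = i * 2654435761u;  /* arbitrary non-trivial pattern */
    memcpy(shifted + shift, aligned, N * sizeof(uint64_t));

    return sum_words((const unsigned char *)aligned, N)
        == sum_words(shifted + shift, N);
}
```

Calling `sums_match(1)` or `sums_match(3)` exercises reads that straddle the natural 8-byte boundaries; on x86 these are legal and just potentially slower.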