Recognizing your username, I probably shouldn't argue, but wouldn't iterating across (say) two component arrays only cost you two pages in the cache at any given time, since you're doing a sequential scan? You should have the same number of cache misses overall, unless you're doing something very complicated for each entity that causes the cache to evict one of the pages.
Of course, if you access N components you need N pages in the cache concurrently, which is going to fall over for a not-too-large N. But N=2 or N=3 seems unlikely to kill spatial locality.
I can imagine it gets a little more complicated with prefetch, but you're still using the prefetched pages -- you just need to prefetch pages for two separate arrays (potentially at different rates based on component size) rather than one. Do these details end up snowballing in a way I'm not seeing, or are there details I'm just missing outright?
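Concretely, the pattern I have in mind is two parallel sequential scans, something like this sketch (the Position/Velocity components and the integrate function are made-up names for illustration, not from any particular ECS):

    #include <cstddef>
    #include <vector>

    struct Position { float x, y, z; };
    struct Velocity { float dx, dy, dz; };

    // One sequential read stream (velocities) and one sequential
    // read/write stream (positions), advancing in lockstep.
    void integrate(std::vector<Position>& positions,
                   const std::vector<Velocity>& velocities,
                   float dt) {
        for (std::size_t i = 0; i < positions.size(); ++i) {
            positions[i].x += velocities[i].dx * dt;
            positions[i].y += velocities[i].dy * dt;
            positions[i].z += velocities[i].dz * dt;
        }
    }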
> two component arrays only cost you two pages in the cache at any given time
It sounds like you're thinking of virtual memory (i.e. pages of memory either being in RAM or on disk). But CPU caching is about the much smaller L1, L2, and L3 caches directly on the chip itself.
Let's say you have two kinds of components, A and B. You have those stored in two contiguous arrays:
AAAAAAAAAAAAAAAAAAA...
BBBBBBBBBBBBBBBBBBB...
Each entity has an A and B and your system needs to access both of those to do its work. The code will look like:
for each entity:
    access some data in component A
    access some data in component B
    do some computation
On the first access of A, the A component for that entity and a bunch of subsequent As for other entities get loaded into a cache line. On the next access of B, the B for that entity along with a bunch of subsequent Bs gets loaded into a cache line. If you are lucky, the A and B arrays will be at addresses such that the chip is able to keep them in different cache lines. At that point, I think you'll mostly be OK.
But if you're unlucky and they are fighting for the same cache line, then each access can end up evicting the previous one, forcing a main-memory lookup for every single component.
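To make "fighting for the same cache line" concrete: in a set-associative cache, the set an address lands in comes from the middle bits of that address. Here's a toy model of the mapping (the 64-byte line and 64-set geometry are assumed numbers for illustration; real chips vary):

    #include <cstdint>
    #include <cstdio>

    constexpr std::uintptr_t kLineSize = 64;  // bytes per line (assumed)
    constexpr std::uintptr_t kNumSets  = 64;  // sets in the cache (assumed)

    // Which set an address maps to: drop the offset-within-line bits,
    // then take the low bits of what remains.
    std::uintptr_t cache_set(std::uintptr_t addr) {
        return (addr / kLineSize) % kNumSets;
    }

    int main() {
        // If the A and B arrays start a multiple of kLineSize * kNumSets
        // (4 KiB here) apart, their elements map to the same sets and
        // compete for the few ways in each set -- in a direct-mapped
        // cache they evict each other on every access.
        std::uintptr_t a = 0x100000;
        std::uintptr_t b = a + kLineSize * kNumSets;  // worst-case spacing
        std::printf("set(A[0]) = %ju, set(B[0]) = %ju\n",
                    (std::uintmax_t)cache_set(a),
                    (std::uintmax_t)cache_set(b));
        return 0;
    }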
> It sounds like you're thinking of virtual memory (i.e. pages of memory either being in RAM or on disk).
I think I mostly just used the wrong term (page instead of line) -- I'm on board with the distinction you're drawing.
> But if you're unlucky and they are fighting for the same cache line, then each access can end up evicting the previous one and forcing a main memory look up for every single component.
Yeah, I think this specifically is what I'm having a hard time imagining. I guess it depends on which memory regions get mapped into which cache lines, and I think I'm assuming something closer to fully-associative, while you're allowing for something closer to direct-mapped.
I can see the trouble -- you don't really have much control over the mapping, unless you make some really deep assumptions about the CPU architecture and go to great lengths to position your arrays appropriately in memory. That would certainly be in the original spirit of data-oriented design, but it's perhaps a bit beyond reasonable for most systems.
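(For what it's worth, the cheapest version of "positioning" I can picture is just staggering each array's starting offset by a line or two so their bases don't land in the same cache set. A rough sketch, assuming a 64-byte line; a real version would also record the base pointer so it can be freed later:)

    #include <cstdlib>
    #include <cstddef>

    // Allocate the index-th component array with an extra cache-line
    // stagger, so array bases don't all map to the same cache set.
    void* alloc_staggered(std::size_t bytes, std::size_t index) {
        constexpr std::size_t kLine = 64;    // assumed line size
        std::size_t offset = index * kLine;  // array 0: +0, array 1: +64, ...
        char* base = static_cast<char*>(std::malloc(bytes + offset));
        // NB: a real allocator would stash `base` somewhere to free it.
        return base ? base + offset : nullptr;
    }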
I'm not an expert at the details of hardware architecture, but my understanding is that you're basically stuck with whatever associativity and cache policy the chip supports. That, and the addresses that your components happen to be at, will determine whether they end up fighting over cache lines or not.
'false sharing' == two items next to each other in memory are accessed by different threads; because they are close in memory they end up in the same cache line, which is 'shared' between two CPUs: read-only is okay, but read & write is a performance killer.
Here it's a cache (associativity) conflict; as gugagore said, it can impact performance even in a read-only situation. And yes, you're describing a way to avoid the conflicts.
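For the false-sharing case specifically, the usual fix is padding/aligning per-thread data out to the line size, e.g. (a sketch assuming a 64-byte line and C++17):

    #include <atomic>
    #include <thread>
    #include <vector>

    // alignas(64) gives each counter its own cache line, so two threads
    // incrementing neighboring counters never write to the same line.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    int main() {
        std::vector<PaddedCounter> counters(4);
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t)
            workers.emplace_back([&counters, t] {
                for (int i = 0; i < 1000000; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& w : workers) w.join();
        return 0;
    }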