The domain is split into sub-domains so that each sub-domain fits in cache, taking into account the cache actually available to each core when the code runs in parallel on a multicore processor. When cache conflicts do not occur, the number of data accesses per field update is reduced to

2 × (3 + 3 + 3) / 6 = 3    (3.1)

instead of 6, as seen previously in relation 2.5 of section 2.2. Let us explain this count. For the electric field update, we need to load the three components of the magnetic field, load the three components of the electric field, and then write the three components of the electric field, which gives the 3 + 3 + 3 accesses. The magnetic field update costs the same, hence the factor 2, and the total is divided by the 6 field components.

Without cache conflicts, we may therefore compute twice as many values inside the sub-domains as in the monolithic case, but the values at the interfaces between sub-domains remain to be computed. This extra cost becomes negligible if the sub-domains are large, which is not possible if the cache is too small. In practice, a 2 MB data cache is required to really improve performance. If this article is read by processor manufacturers, let us ask them for more cache per core, not only more cores. In any case, we prefer more cache to higher clock frequencies.
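To make the trade-off between cache size and interface cost concrete, here is a minimal sketch in Python. It rests on assumptions that are not stated in the text: values are stored in double precision (8 bytes), a sub-domain is a cube of n × n × n cells holding all 6 field components, and the interface cost is modelled as the fraction of cells lying in the one-cell-thick outer layer of the cube. The function names are hypothetical and chosen only for this illustration.

```python
def accesses_per_field_update() -> float:
    """Access count of relation (3.1): 2 updates x (3 + 3 + 3) accesses / 6 components."""
    return 2 * (3 + 3 + 3) / 6


def max_cubic_edge(cache_bytes: int, n_components: int = 6, bytes_per_value: int = 8) -> int:
    """Largest n such that an n^3 sub-domain with all field components fits in cache.

    Assumption: double precision storage and a cubic sub-domain.
    """
    cells = cache_bytes // (n_components * bytes_per_value)
    n = round(cells ** (1.0 / 3.0))
    # Correct the floating-point estimate so that n^3 <= cells < (n + 1)^3.
    while n ** 3 > cells:
        n -= 1
    while (n + 1) ** 3 <= cells:
        n += 1
    return n


def interface_fraction(n: int) -> float:
    """Fraction of cells of an n^3 block located in its outer one-cell layer.

    Assumption: this layer is a rough proxy for the interface work.
    """
    if n <= 2:
        return 1.0
    return 1.0 - ((n - 2) / n) ** 3


if __name__ == "__main__":
    print("accesses per field update:", accesses_per_field_update())
    for label, size in (("256 KB", 256 * 1024), ("2 MB", 2 * 1024 * 1024)):
        n = max_cubic_edge(size)
        print(f"{label}: edge = {n} cells, interface fraction = {interface_fraction(n):.2f}")
```

Under these assumptions, a 256 KB cache holds a cube of edge 17 while a 2 MB cache holds a cube of edge 35, and the share of interface cells drops accordingly. This is consistent with the remark that the extra cost only becomes small when the cache is large enough, although the real threshold depends on the actual data layout and on how much of the cache each core can use.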