Cache and memory locality
Accessing DRAM is expensive and consumes many processing-unit cycles. Processors therefore include intermediate in-die levels of memory, called caches, that act as read-write buffers to optimize and speed up memory accesses. Caches are small memory levels with high bandwidth and low latency. Intel KNL processors, for instance, have the MCDRAM and two levels of cache:
- The MCDRAM can be configured as a large cache level with a peak bandwidth of 460 GB/s, almost 5 times the DRAM bandwidth. Its latency, however, is equivalent to that of the DDR.
- The L2 cache is a 1 MB cache level shared between 2 cores, with a peak bandwidth above 2 TB/s.
- The L1 cache is the last level before the vector registers and is private to each core. It is the smallest cache, with the largest bandwidth, above 9 TB/s.
Memory locality refers to where the data resides (in DDR, in cache, etc.) when the processing unit wants to access it. Good memory locality means that the data to be processed is as close as possible to the processing unit (ideally in L1 cache) and that the data to be used next is located so as to minimize memory transfers and optimize bandwidth. With bad memory locality, processing units may have to wait until the data is found. Cache levels are scanned from the closest (L1) to the farthest (L2 on KNL, L3 on Haswell), and then the MCDRAM or the DRAM. When the data is not present at a given cache level, this is called a cache miss; finding it there is a cache hit. Good memory locality also means that regularly reused data should stay at an intermediate cache level: this is called cache reuse. For instance, a loop can operate on the same piece of data over multiple iterations. If this piece of data is not in cache, it will be reloaded from and written back to DRAM multiple times (cache misses), potentially slowing the code down. When the code speed is limited by these memory accesses, the code is said to be memory bound.
Improving memory locality is an essential optimization that can enhance performance on previous, current and future architectures. When the compiler cannot do it efficiently by itself, it can consist of manually dividing a large data set into subsets that are small enough to fit in the different cache levels. Data accesses are then organized so that blocks are loaded at the most local level when used by the computing units. This technique is usually referred to as cache blocking, and it goes along with better cache reuse. Typical examples are stencil problems and matrix-matrix multiplication.
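Cache blocking can be illustrated on a matrix-matrix multiplication. The sketch below (in C, for illustration only; the block size BS is a tunable assumption) restructures the classic triple loop into blocks so that the sub-blocks of A, B and C currently being worked on stay cache-resident:

```c
#include <stddef.h>

/* Cache-blocked matrix-matrix multiply C += A * B for n x n row-major
   matrices. BS should be chosen so that three BS x BS double-precision
   blocks fit in the target cache level (illustrative value here). */
#define BS 64

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* Work confined to one block triple: the touched parts of
                   A, B and C stay in cache across the inner iterations. */
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The result is identical to the naive triple loop; only the order in which memory is traversed changes, which is what improves cache reuse.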
Memory locality issues in PIC codes
A memory bottleneck that arises in PIC codes and can significantly affect overall performance is the cache reuse of the grid quantities during the interpolation processes, namely the field gathering and the charge/current deposition.
Both interpolation processes consist of a loop over all particles. In the field gathering, the electric and magnetic fields at the particle positions are interpolated from the nodes of the field grids. In the current or charge deposition, the current or charge grid is accumulated from the current or charge carried by each particle. In both cases, data is exchanged between grids and particles.
During these processes, part of the grid data is put in cache; this corresponds to the red part in Fig. 1. Particles located in the cached area access the nodes rapidly: the red part of the grid is reused up to particle 4. When the loop reaches particle 5, there is a cache miss. The code has to load the nearby nodes from DRAM and will likely evict the previously cached area to make room for the new one. In this case, particles 5 to 8 may benefit from cache reuse. Then, at particle 9, there is a new cache miss if the red part is no longer in cache.
Usually, particles contiguous in memory are also close to each other in the spatial domain after initialization. However, as time evolves, the particle distribution often becomes increasingly random, leading to numerous cache misses in the deposition/gathering routines and a decrease in performance. This problem is more likely to affect 3D simulations, which use more data for the grids and the particles.
Tiling: a L2 cache-blocking method for PIC codes
Tiling consists of dividing MPI domains into tiles, also sometimes referred to as blocks or patches. A tile is simply a further spatial subdivision of the domain. A tile contains its own local grid arrays (at least temporarily) and its own particle arrays. Tiling is an L2 cache-blocking technique on Intel Haswell and Intel KNL; it also works with older architectures. The tile size has to be carefully chosen so that a single tile fits in the L2 cache. On KNL, since L2 is shared between 2 cores, a tile should fit in half the L2 cache (512 KB) for each core. In the code, a tile is a FORTRAN structure (type(particle_tile)). This structure contains all the arrays for the particle properties. When a particle leaves a tile, its data is transferred to the new tile and removed from the previous one. In the PICSAR stand-alone FORTRAN code, tile field arrays are not permanent: the choice has been made to keep global field arrays (for each MPI domain). A temporary local array for each tile is created before each process: the field gathering, the current deposition, the charge deposition and the Maxwell solver. The local arrays are then copied back to the global ones. Tile arrays have guard cells for the interpolation processes; keeping global arrays, however, avoids exchanging guard cells between tiles.
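As an illustration, a C analogue of the type(particle_tile) structure and of the particle transfer between tiles could look like the following sketch (field names and the swap-with-last removal are assumptions, not PICSAR's actual implementation):

```c
#include <stddef.h>

/* C analogue of the FORTRAN type(particle_tile): one array per particle
   property. Names are illustrative. */
typedef struct {
    size_t np;            /* number of particles currently in the tile */
    size_t capacity;      /* allocated size of the property arrays     */
    double *x, *y, *z;    /* positions */
    double *ux, *uy, *uz; /* momenta   */
    double *w;            /* weights   */
} particle_tile;

/* Move particle p from tile src to tile dst: append its properties to
   dst, then delete it from src by swapping in the last particle (order
   is not preserved, which is fine for PIC particle arrays). Capacity
   checks are omitted; a real implementation would grow dst first. */
void tile_transfer(particle_tile *src, size_t p, particle_tile *dst)
{
    size_t d = dst->np++;
    dst->x[d]  = src->x[p];  dst->y[d]  = src->y[p];  dst->z[d]  = src->z[p];
    dst->ux[d] = src->ux[p]; dst->uy[d] = src->uy[p]; dst->uz[d] = src->uz[p];
    dst->w[d]  = src->w[p];

    size_t last = --src->np;
    src->x[p]  = src->x[last];  src->y[p]  = src->y[last];  src->z[p]  = src->z[last];
    src->ux[p] = src->ux[last]; src->uy[p] = src->uy[last]; src->uz[p] = src->uz[last];
    src->w[p]  = src->w[last];
}
```

Keeping one structure of arrays per tile is what lets a whole tile's particle data stream through the cache contiguously during the interpolation loops.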
How tiling works is schematically described in Fig. 2 for Intel Haswell processors. On Cori, Intel Haswell processors have a shared L3 cache of 40 MB and an L2 cache of 256 KB per core. Tile fields are expected to migrate into L2 during the interpolation processes, while part of the particle arrays can be stored in L3. The cache-blocking strategy can also be applied at the particle level, as explained for the field gathering: smaller blocks of particles can then be located in L2 next to the fields. On KNL, a similar schematic applies, except that L3 is replaced by the MCDRAM, which can contain the whole code if it fits within 16 GB.
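The constraint that a tile fit in a cache budget (e.g. half the KNL L2, 512 KB, as mentioned above) can be turned into a quick back-of-the-envelope estimate. Assuming six double-precision field components per cell (Ex, Ey, Ez, Bx, By, Bz) and a given number of guard cells on each side (both are assumptions for illustration), the largest cubic tile that fits is:

```c
#include <stddef.h>

/* Largest cubic tile edge (in cells) whose six double-precision field
   arrays, including nguard guard cells on each side, fit within
   cache_bytes. Illustrative estimate only: real tiles also carry
   particle data and other solver-dependent arrays. */
size_t max_tile_edge(size_t cache_bytes, size_t nguard)
{
    size_t n = 1;
    for (;;) {
        size_t side  = n + 2 * nguard;              /* cells incl. guards */
        size_t bytes = 6 * sizeof(double) * side * side * side;
        if (bytes > cache_bytes)
            return n - 1;                           /* last size that fit */
        n++;
    }
}
```

Under these assumptions, a 512 KB budget with 2 guard cells per side gives an edge of 18 cells, i.e. tiles of roughly 18^3 cells per core on KNL.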
The electromagnetic field and current arrays used for the Maxwell solver can be either tiled or not, depending on the solver used. For low-order Maxwell solver stencils, it has been shown that tiling can strongly benefit stencil computations by increasing memory locality and cache reuse. For pseudo-spectral solvers, it is preferable to keep global arrays as they are in order to perform the FFTs.
 – H. Vincenti, M. Lobet, R. Lehe, R. Sasanka, J.-L. Vay, “An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes”, Computer Physics Communications 210, 145-154 (2017). http://dx.doi.org/10.1016/j.cpc.2016.08.023
 – M. McCool, A.D. Robison, and J. Reinders, “Structured Parallel Programming: Patterns for Efficient Computation”, Morgan Kaufmann (2012).
Mathieu Lobet, Henri Vincenti, last update: March 22 2017