In PICSAR, the domain is first decomposed into MPI domains and then again decomposed into tiles (that are computed using OpenMP threads). Each tile contains a structure of arrays corresponding to the different particle properties (positions, momenta, energy…). The tile size has to be carefully chosen to fit in cache.
Interpolation processes such as the field gathering and the current deposition steps are constituted of an intensive loop over particles. In the first case, the field seen by each particle is determined from the nearby nodes of the Maxwell grids (depending of the interpolation order). In the second case, the current generated by each particle is distributed to the nearby nodes of the current grid.
Usually, particles contiguous in memory are also next each other in the space domain after initialization. However, as time evolves, the particle distribution often becomes increasingly random, leading to numerous cache misses in the deposition/gathering routines and decrease in performance. This problem is more likely to affect 3D simulations that use more data for the grids and the particles. Reordering particle to have particles contiguous in memory is the point of the particle sorting.
During interpolation processes, if field grids of tiles can fit in L2 cache, cache reuse is more efficient. Without tiling, spatial sorting of the particles play the same role. Particles are therefore aligned and close in space. A portion of the field grid put in cache for one particle will be reused for all the next particles (iteration of the particle loops) located in this area until a particle contributes to the next block.
With the tiling, fields are already located in L2 cache. The particle sorting is then of interest for vectorization. When several particles (in the particle loop) use the same field components (because located in the same cell or neighbor cells), the field components may be stored and kept in the L1 close to the vector registers for several particle iterations. Then, number of loads/store in lower cache levels is reduced since the same components are used several times.
A cell sorting algorithm, also refers to as bucket or bin sorting, is perfectly adapted here since we want particles to be aligned in memory to share the same current, density and field grid nodes. However, note that in the case of the FDTD Yee’s Maxwell discretization, each electric and magnetic field components have their own grid that can not be superimposed (staggered). The PICSAR implementation of the bin sorting is similar to what has been done in [1,2]. The sorting subroutines are located in
src/housekeeping/sorting.F90. The sorting is performed inside the tiles at the lowest parallelization level. It is called in the PIC loop just after the particle pusher, every given time steps specified by the user. The sorting is not limited to the cell size of the field grids but can be tuned to use smaller or larger sorting cell size. The bin sizes in each direction, called , , , are factors of the field cell size so that , and .
The bin sorting algorithm hardly vectorizes due to possible memory races during the sorting process. However, the bin sorting complexity is in O(np+nc) where np is the number of particles and nc the number of cells. In many cases, each cell contains at least one particle so that np>nc. This is lower than global sorting algorithms which usually have in the best case a complexity in O(np log(np)).
Nonetheless, sorting the particles is still computationally expensive (of the order of the current deposition). In practice, sorting particles at every time steps reveal inefficient in term of speedup from interpolation processes in comparison with the cost of the sorting. After being sorted, the particles will take a certain amount of time before being completely mixed so that interpolation processes will be highly impacted depending on the particle energy distribution, number of particles per cell and size of the cells. Therefore, sorting is usually performed with a period of several iterations. In PICSAR, this is for the moment specified by the user.
The following tests have been performed on the super-computer Edison at NERSC equipped of Intel Ivy Bridge processors. Moreover, the tests have been done on an old version of PICSAR dated April 2016 that does not include most optimized and vectorized subroutines.
First case: large tiles, no vectorization
We consider the case of a homogeneous plasma initialized in the entire domain having periodic boundary conditions. The domain is composed of 100x100x100 cells with 20 particles per cells. The plasma has a density of . The particles are initialized with a thermic velocity . Interpolation processes are performed at order 1.
In these first simulations, the domain is divided into 2x2x2 MPI subdomains then divided into 2x2x2 tiles. The tile size is of 33 MB for particles and 732 KB for fields. The tiles are therefore much larger than the Edison L2 cache (256 KB) and the particles can even not be stored in the L3 cache. Here, non-optimized subroutines with poor vectorization are used for the field gathering and the current deposition.
Different cases are compared, one without sorting and three others with sorting and different bin sizes corresponding to factors , , and . A factor means that the bins have the volume of the cells. However, for the FDTD scheme, the sorting grid is aligned only with one of the field components. This means that for the other field components, the particles contained in a cell are not exactly aligned in memory. A factor means that the bins have a volume ⅛ times lower than the cells. The particles in each beam share the same field grid nodes for any component. Nonetheless, fewer particles are present in bin and the sorting is interesting only for many particles per beam. Sorting is also slightly more expensive. A factor means that the bin volume is 8 times greater than the cells. The particles inside bins will not share all the fields components. However the sorting process will be faster.
As shown in Fig. 1, PICSAR runs faster with the particle sorting. It can be seen that the configuration with bin size factor has the best performances with a speedup pf 1.2. The sorting has a major effect on the field gathering with a speedup of 1.3. It also speeds up the current deposition in a lower extent with factor 1.1 time. The sorting represents a negligible part of the total simulation time, around 1%.
Effects of the sorting is differently revealed in Fig. 2 . After initialization, the particle are sorted. The first iterations therefore correspond to the fastest ones with and without sorting. Then, as the particles get mixed, the time per iteration increases due to cache misses. After each sorting process (every 20 iterations), the particles contiguous in memory are close in space as for the initial state. The time per iteration almost recovers its initial value. If the sorting period is too long, the iteration time after sorting will catch again the no-sorting iteration time. However, performed too often, the sorting computational cost can be above the benefits.
Second case: small tiles, no vectorization
In this section, the same physical case is considered. Non-optimized subroutines for the current deposition, the charge deposition and the field gathering are still used. We increase the tile number per MPI domain to 5x5x5 (tiles have 10x10x10 cells without guard cells). The tile size is now of 4 Mo for the particles with all properties and of 32 KB in average per field grid (including guard cells). Tiles now fit in L2 cache for the fields. As shown in Fig. 3, the sorting has almost no effect on the performance with an efficient tiling and no-vectorized subroutines.
Third case: efficient tiling with vectorized/optimized subroutines
For the last case on Edison, using optimized interpolation subroutines and the sorting reveals efficient again as shown in Fig. 4. The gain from the sorting is nonetheless small. The maximum speedup is given by with a factor of 1.15. The sorting represents around 1% of the simulation time.
Sorting the particle can speedup interpolation processes by improving memory accesses. This study shows that the particle sorting can still be efficient with the tiling with optimized subroutines. However, this study was performed on an outdated version of PICSAR on Edison. Similar runs should be performed on KNL with the last optimized subroutines. To conclude, we have seen fluctuating improvements due to the sorting depending on cases and architectures with in the worth case, similar performance results with and without it.
 DECYK, Viktor K., KARMESIN, Steve Roy, DE BOER, Aeint, et al. Optimization of particle‐in‐cell codes on reduced instruction set computer processors. Computers in Physics, 1996, vol. 10, no 3, p. 290-298.
Mathieu Lobet, last update: February 5 2017