## Optimized Morse-Nielson charge/current deposition

The Morse-Nielson current/charge deposition algorithm is composed of a loop over particles of a domain for which the generated current/charge is computed and projected on the surrounding nodes following a given shape factor (interpolation method).

In PICSAR, this method has been optimized to comply with large vector register on modern architectures such as INTEL Xeon Phi KNL processors and achieve good performance . The following describes the optimized PICSAR implementation of the Morse-Nielson current/charge deposition algorithm.

#### Presentation of the Morse-Nielson current/charge deposition algorithm.

This algorithm also referred to as direct deposition is simple to implement but can not be vectorized in its classical form shown in listing 1 for  order 1 (CIC) shape factor. In the code, xp, yp, zp refer to the particle position arrays for each component, xmin, ymin, zmin to the positions of the minimum grid indexes, dxi = 1/dx, dyi = 1/dy, dzi = 1/dz to the inverse of the space steps. The array rho corresponds to the charge grid ( $\rho$).

SUBROUTINE simplified_charge_deposition(...)

! Declaration and init
! Loop on particles
DO ip =1,np

! --- PART 1 -------
! Computes current position in grid units
ixp = (xp(ip)- xmin )* dxi
iyp = (yp(ip)- ymin )* dyi
izp = (zp(ip)- zmin )* dzi
! Finds node of cell containing particle
j= floor(ixp)
k= floor(iyp)
l= floor(izp)
! Computes weigths w1 .. w8
........

! --- PART 2 -------
! To the 8 vertices of current cell
rho(j,k,l) =rho(j,k,l) + w1
rho(j+1,k,l) =rho(j+1,k,l) + w2
rho(j,k+1,l) =rho(j,k+1,l) + w3
rho(j+1,k+1,l) =rho(j+1,k+1,l) + w4
rho(j,k,l+1) =rho(j,k,l+1) + w5
rho(j+1,k,l +1) =rho(j+1,k,l+1) + w6
rho(j,k+1,l +1) =rho(j,k+1,l+1) + w7
rho(j+1,k+1,l+1) =rho(j+1,k+1,l+1)+ w8
ENDDO

END SUBROUTINE simplified_charge_deposition


The main loop over the particles can be divided into two parts:

1. For each particle, cell node indices and interpolation weights w1..w8 are computed depending on the shape factor and the location of the particle in the cell.
2. For each particle, current/charge from particles are deposited on the surrounding current/charge grid nodes (rho in the example) depending on the shape factor.

The first part of the loop can be vectorized despite roundings and type mixing. In the second part, memory races occur when two particles are located in the same cells.

#### Former vectorized algorithms for Cray machines.

Several vector algorithms were imagined and used on former Cray vector machines [2-7]. However, these techniques are not adapted anymore to current architectures and yield very poor results on new SIMD machines (Intel KNC, KNL processors for instance).

#### New and portable SIMD algorithms

Good vectorization performance on SIMD architectures relies on 3 main points:

• good cache reuse
• Memory alignment
• Contiguous memory access

Some vector deposition routines proposed in contemporary PIC codes use compiler based directives or even C++ Intel intrinsics in the particular case of the Intel compiler, to increase vectorization efficiency .

An portable optimized SIMD algorithm has been developed in PICSAR to target large vector register processors such AVX512 on Intel KNL. The new vector algorithm is detailed in listing 2 for the case of the charge deposition at order 1. Similarly to the Schwarzmeier and Hewitt routine, the main particle loop is done by blocks of lvect particles and divided into two consecutive nested loops:

1. A first nested loop that computes cell indexes and particle weights
2. A second one that adds the particle weights to its 8 nearest vertices.
SUBROUTINE optimized_charge_deposition

! Declaration and init
.....
nnx = ngridx ; nnxy = nnx* ngridy

moff = (0, 1, nnx, nnx+1, nnxy, nnxy+1, nnxy+nnx, nnxy+nnx+1)
mx =(1, 0, 1, 0, 1, 0, 1, 0)
my =(1, 1, 0, 0, 1, 1, 0, 0)
mz =(1, 1, 1, 1, 0, 0, 0, 0)
sgn =(-1, 1, 1, -1, 1, -1, -1, 1)

! ___ PART 1 _________________
! Loop over chunks of particles
DO ip =1,np , LVEC

! 1.1) SIMD loop on particles: computes cell index of particle
! and their weight on vertices
! $OMP SIMD DO n=1, MIN(LVEC, np-ip+1) nn=ip+n-1 ! Calculation relative to particle n ! Computes current position in grid units x = (xp(nn)-xmin)*dxi y = (yp(nn)-ymin)*dyi z = (zp(nn)-zmin)*dzi ! Finds cell containing particles for current positions j=floor(x) ; k=floor(y) ; l=floor(z) ICELL(n)=1+j+nxguard+(k+nyguard+1)*(nx+2*nxguard) &amp;amp;amp; +(l+nzguard+1)*(ny+2*nyguard) ! Computes distance between particle and node ! for current positions sx(n) = x-j ; sy(n) = y-k ; sz(n) = z-l ! Computes particles weights wq(n)=q*w(nn)*invvol ENDDO ! 1.2) Charge deposition on vertices DO n=1, MIN(LVEC ,np-ip+1) ic=ICELL(n) ! SIMD loop on vertices: Add charge density ! contributions to vertices of the current cell !$OMP SIMD
DO nv=1, 8 !!! - VECTOR
ww =(- mx(nv)+ sx(n))*( - my(nv)+ sy(n))* &amp;amp;amp;
(-mz(nv)+ sz(n))* wq(n)* sgn(nv)
rhocells(nv, ic) = rhocells(nv, ic)+ww
ENDDO
ENDDO
ENDDO

! ___ PART 2 _________________
! Reduction of rhocells in rho
DO iz=1, ncz
DO iy=1, ncy
! \$OMP SIMD
DO ix =1, ncx !! VECTOR ( take ncx multiple of vector length )
! Index of (ix,iy,iz) in rhocell: ic
ic=ix +(iy -1)* ncx +(iz -1)* ncxy

! Index of (ix,iy,iz) in rho
igrid = orig + ic +(iy -1)* ngx +(iz -1)* ngxy

! Reduction, we turn around (ix,iy,iz) according to moff
rho(igrid + moff (1)) += rhocells (1, ic)
rho(igrid + moff (2)) += rhocells (2, ic)
rho(igrid + moff (3)) += rhocells (3, ic)
rho(igrid + moff (4)) += rhocells (4, ic)
rho(igrid + moff (5)) += rhocells (5, ic)
rho(igrid + moff (6)) += rhocells (6, ic)
rho(igrid + moff (7)) += rhocells (7, ic)
rho(igrid + moff (8)) += rhocells (8, ic)
ENDDO
ENDDO
ENDDO

END SUBROUTINE optimized_charge_deposition


The new algorithm addresses the two main bottlenecks of the Schwarzmeier and Hewitt algorithm with the two following new features:

• A new data structure for rho is introduced, named rhocells, which enables memory alignment and unit-stride access when depositing charge on the 8 vertices. In rhocells, the 8-nearest vertices are stored contiguously for each cell. The array rhocells are thus of size (8, NCELLS) with NCELLS the total number of cells. The element rhocells(1, icell) is therefore 64 bytes-memory aligned for a given cell icell and the elements rhocells(1:8, icell) entirely fit in one cache line allowing for efficient vector load/stores. This new data-structure in described in Fig. 1. Looking at the left, for each cell icell = (j, k, l) in the main charge array rho, rhocells stores the 8 nearest vertices (j, k, l), (j+1, k, l), (j, k+1, l), (j+1, k+1, l), (j, k, l+1), (j+1, k, l+1), (j, k+1, l+1) and (j+1, k+1, l+1) contiguously for CIC (order 1) shape factor. This array has to remain sufficiently small to be contained in L2 in addition to rho allowing for an efficient cache reuse. For a tile of 11x11x11 cells (8x8x8 cells with 3 guard cells in each direction for order 1 shape factor), rhocells has a size of approximately 80 kB. However, the equivalent of rhocells for the current deposition has a size of 250 kB that is close to the L2 cache size on Haswell (256 kB) and half the L2 cache on KNL for a single core (512 kB).
• For each particle, the 8 different weights ww are now computed using a generic formula that suppresses gather instructions (see section 1.2 in listing 2) formerly needed in the SH algorithm. This also avoids implementing non-portable efficient transpose between the first and second loop, rendering this new algorithm fully portable. This corresponds to the second inner loop (1.2 in listing 2) with the charge deposition on rhocells. During this phase, the vectorization is not applied on the particles but on the vertices for each particle. The elements rhocells(1:8, icell) entirely fit in one vector cache line allowing for efficient vector store (no scatter on non-aligned grid elements). Fig 1. – Description of the previous and new data structure (rhocells) for vectorization of the current and charge deposition at order 1.

After the deposition is done for all particles, rhocells has to be reduced to the main charge array rho (part 2 in listing 2). This step can be vectorized but does not lead to optimal performances due to the non-contiguous access in rho that induces gather-scatter instructions. However, this process is executed Nc times (again, Nc, or ncells, being the total number of cells) against Np times for the direct or Schwarzmeier and Hewit algorithm (Np being the number of macro-particles). In many simulations, the number of particles is much higher than the number of cells, Np >> Nc. Moreover, the reduction is performed just once after all species have contributed to the rhocells data structure. Fig. 2 – rhocellsdata structure for order 2 (left) and order 3 shape factors. For TSC (order 2), the particle deposits its charge to the 27 neighboring vertices (red rectangles and blue points). For a given cell icell = (j,k,l), rhocells store contiguously the 8 vertices (red points) (j, k−1, l−1), (j, k, l−1), (j, k+1, l−1), (j, k−1, l), (j, k+1, l), (j, k−1, l+1), (j, k, l+1) and (j, k+1, l+1). The blue points are not stored in rhocells and are treated scalarly in the algorithm. For QSP (order 3) particle shape factor, the particle deposits its charge to the 64 neighboring vertices (red rectangles). For a given cell icell = (j, k, l), rhocells stores contiguously the 8 vertices (delimited by red areas) (j, k − 1, l − 1), (j, k, l − 1), (j, k + 1, l − 1), (j, k + 1, l − 1), (j, k − 1, l), (j, k, l), (j, k + 1, l), (j, k + 1, l).

For higher shape factor orders that use more than 8 vertices, the algorithm is adapted. The new data structures are shown schematically in Fig. 2. The new data structure depends on the interpolation order:

• TSC (order 2) particle shape: In this case, the particles deposit their charge to the 27 neighboring vertices. However, storing 27 contiguous vertices per cell in rhocells would not be efficient as the reduction of rhocells to rho would be much more expensive with potential cache-reuse inefficiency. Instead, while the same size for rhocells(1:8, 1:NCELLS) is used, the vertices are now grouped in a different way. The new structure for rhocells(1:8, 1:NCELLS) groups 8 points in a (y, z) plane  for each cell icell (red rectangles in Fig. 2). For each cell, each particle adds its charge contribution to 24 points in the three planes at icell − 1, icell and icell + 1. The three remaining central points (blue triangle markers in Fig. 2) can be either treated scalarly for 512-bits wide vector registers or vectorized for 256-bits by artificially adding a virtual point that does not contribute to any charge. Notice that we did not find a generic formulation for the weights ww and we are therefore still performing a ‘‘gather’’ instruction for ww in the loop on the vertex. However, this gather is performed in the y and z directions for the first plane of 8 points (plane ic=−1 on panel) and is subsequently reused on the two other planes ic=0 and ic=1. Gather is thus performed only 8 times out of 24 points and thus has a limited impact on performance, as shown below in the reported test results.
• QSP (order 3) particle shape: In this case, particles deposit their charge to the neighboring vertices. rhocells(1:8, 1:NCELLS) also group 8 points in a (y, z) plane but differently from the TSC case (see red areas in Fig. 2). For each cell, each particle adds its charge contribution to 64 points in the 8 different (y, z) planes at icell−ncx−1, icell−ncx, icell−ncx+1, icell−ncx+2, icell+ncx−1, icell+ncx, icell+ncx+1 and icell+ ncx+2 where ncx is the number of cells in the x direction. This might reduce the flop/byte ratio of the second loop when nnx is large enough so that elements rhocells(1:8, icell) and rhocells(1:8, icell+nnx−1) are not in L1 cache. The vertices could have been grouped in (y, z) planes of 16 points instead of 8 points but this would imply a bigger reduction loop of rhocells in rho and worst performances for a low number of particles. Notice that here again, we did not find an efficient generic formulation for the weights ww and we are therefore still performing a ‘‘gather’’ instruction. However, this gather is performed in the y and z directions and gathered values are subsequently reused for computing the weights at different positions in x. Gather is thus performed only 16 times out of 64 points and thus has a limited impact on performance, as shown below in the reported test results.

For the current deposition, the vectorized deposition algorithm is similar but we use three structures jxcells, jycells and jzcells (analogous to rhocells for the deposition of rho) for the current components jx, jy, jz along directions x, y and z. More details about these algorithms can be found in .

In PICSAR sources, the charge and current depositions routines are localized in the folder particle_deposition. This folder contains charge_deposition and current_deposition.

#### Performance results

The new deposition algorithm has been tested and benchmarked on a single Haswell node of Cori. The version of the code used for these tests is the one available for this paper . Note that for this reason, the results coming from the last version of PICSAR may slightly differ from the presented speedups. An homogeneous hydrogen thermalized plasma is used as a test case. The domain as 100x100x100 cells. Tiles have a size of 10x10x10 cells for the charge deposition and 12x12x12 for the current. There is two species: protons and electrons.

In this study,  we compare the scalar and the vector subroutines for order 1, 2 and 3 interpolation shape factors. Fig. 3 and Fig. 4 respectively show the speedups brought by the vector subroutines (compared with the scalar ones) for the charge and the current depositions as a function of the number of particles per cell. In average, the speedup obtained with the vector algorithm are positive until 2.7 for order 1 interpolation and 64 particles per cell. A perfect speedup of 8 can not be achieved because of all constraints we discussed before: type mixing, gather instructions, reduction. For very low number of particles, the reduction of rhocells or jcells into rho or j becomes relatively costly compared to the deposition itself. When the number of particles increases performances are even better because the reduction operation of rhocells or jcells structures in regular arrays rho and j becomes more and more negligible relatively to particle loops. Moreover, notice that as the vectorization is on vertices, there is no performance bottleneck related to a possibly inhomogeneous distribution of particles on the simulation domain (vectorization works even for 1 particle). Fig 3. – Speed-up given by the optimized charge deposition subroutines as a function of the number of particles per cells for order 1 (red), order 2 (green) and order 3 (blue). Fig 4. – Speed-up given by the optimized current deposition subroutines as a function of the number of particles per cells for order 1 (red), order 2 (green), order 3 (blue).

#### References

 Vincenti, H., Lehe, R., Sasanka, R., & Vay, J. L. “An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes”. Computer Physics Communications (accepted) (2016).
 A. Nishiguchi, S. Orii, T. Yabe, J. Comput. Phys. 61 519 (1985).
 E.J. Horowitz, J. Comput. Phys. 68 56 (1987).
 J.L. Schwarzmeier, T.G. Hewitt, Proceedings, 12th Conf. on Numerical 40 Simulation of Plasmas, San Francisco, 1987.
 A. Heron, J.C. Adam, J. Comput. Phys. 85 284–301 (1989) .
 G. Paruolo, J. Comput. Phys. 89 462–482 (1990).
 David V. Anderson, Dan E. Shumaker, Comput. Phys. Comm. 87 16–34 (1995).
 R. A. Fonseca, J. Vieira, F. Fiuza, A. Davidson, F. S. Tsung, W. B. Mori, and L. O. Silva. Exploiting multi-scale parallelism for large scale numerical modelling of laser wakeeld accelerators. Plasma Physics and Controlled Fusion, 55(12):124011, (2013).
Henri Vincenti, Mathieu Lobet, last update: January 24, 2017