## Optimization of the field gathering

The field gathering is an interpolation process in which electric and magnetic fields previously updated via the Maxwell solver are gathered from the grid nodes to the particle position in order to estimate electric and magnetic fields seen by each particle. The fields known at the particle positions will then be used by the pusher. The field gathering process is schematically described in 2D in Fig 1.

Mathematically, the field gathering in 3D corresponds to:

$E_p = \sum_{i,j,k} E_{i,j,k} S_{\alpha} \left( \| \mathbf{x}_{i,j,k} - \mathbf{x}_p \| \right)$

In this formulae, $E_p$ is the particle electric field located at the position $\mathbf{x}_p$. $E_{i,j,k}$ represents the field for the node $i,j,k$ at the position $\mathbf{x}_{i,j,k}$. $S_{\alpha}$ is the interpolation shape factor of order $\alpha$. This shape factor determines how $E_p$ is interpolated from the nodes and how many nodes are used. $\| \mathbf{x}_{i,j,k} - \mathbf{x}_p \|$ is the distance between the node $i,j,k$ and the particle. In PIC codes, the shape functions are standard B-splines which are defined as the successive convolution products of the rectangular function by itself $S_\alpha =(S_0 * S_0 * ... * S_0)_{\alpha\ times}$ where $S_0 = 1_[-1/2;1/2]$.

At order 1, the particle electric field is determined using the nodes of the cell where this particle is located: $E_{i,j,k}$, $E_{i+1,j,k}$, $E_{i,j+1,k}$, $E_{i,j,k+1}$, $E_{i+1,j+1,k}$, $E_{i,j+1,k+1}$, $E_{i+1,j,k+1}$, $E_{i+1,j+1,k+1}$. They correspond to the blue circle markers in Fig. 1. If the particle is located on the lower left corner of the cell, purple and blue nodes will be necessary for order 2. At order 3, all nodes in Fig. 1 including the green ones will be used.

In the following, we consider order 1 as an example because it is simple to understand and the algorithm is shorter. However, order 3 is more often used in production runs.

### The original field gathering subroutine

The original subroutine for the electric field gathering in 3D at an arbitrary order (between 1 and 3)  is shown in List. 1 in a simplified way. The full version of this code is available in the file src/field_gathering/energy_conserving/field_gathering_on_3d.F90.

After variable declaration and initialization, the field gathering is constituted of a loop over all particles. For each particle, the indexes of the cell where this particle is located is first determined. Then, the weights for the interpolation sx, sy, sz, sx0, sy0, sz0 are computed for all contributing nodes depending on the given shape factor orders nox, noy, noz. The last part of the particle loop is the gathering process itself. This step loops over the contributing nodes (via jj,kk,ll)) to add the node field contributions exg(j0+jj,k+kk,l+ll) to the particle field variable ex(ip).

SUBROUTINE pxr_gete3d_n_energy_conserving

! Variable and array declaration

dxi = 1.0_num/dx
dyi = 1.0_num/dy
dzi = 1.0_num/dz

! Indexes for the different grids
ixmin = -int(nox/2)
ixmax =  int((nox+1)/2)-1
! Similar for iymin, iymax, izmin, izmax
...

ixmin0 = -int((nox)/2)
ixmax0 =  int((nox+1)/2)
! Similar for iymin0, iymax0, izmin0, izmax0
...

! Weights for the interpolation
ALLOCATE(sx(ixmin:ixmax),sy(iymin:iymax),sz(izmin:izmax))
ALLOCATE(sx0(ixmin0:ixmax0),sy0(iymin0:iymax0),sz0(izmin0:izmax0))
! Loop over particles
DO ip=1,np

x = (xp(ip)-xmin)*dxi
y = (yp(ip)-ymin)*dyi
z = (zp(ip)-zmin)*dzi

! Compute grid node indexes j and j0 depending on nox
IF (nox==2*(nox/2)) THEN
j=nint(x)
j0=floor(x-0.5_num)
ELSE
j=floor(x)
j0=floor(x)
END IF
! Compute similarly grid node indexes k,k0,l,l0
! depending on noy and noz
...

xint=x-j
yint=y-k
zint=z-l

! Compute sx coefficients depending
! on interpolation order parameter nox
IF (nox==1) THEN
...
ELSEIF (nox==2) THEN
...
ELSEIF (nox==3) THEN
...
END IF
! Compute similarly sy and sz coefficients depending
! on interpolation order parameters: noy and noz
...

xint=x-0.5_num-j0
yint=y-0.5_num-k0
zint=z-0.5_num-l0

! Compute sx0 coefficients depending
! on interpolation order parameter nox
IF (nox==1) THEN
...
ELSEIF (nox==2) THEN
...
ELSEIF (nox==3) THEN
...
END IF
! Compute similarly sy0 and sz0 coefficients depending
! on interpolation order parameters: noy and noz
...

! Gathering process on Ex
DO ll = izmin, izmax+1
DO kk = iymin, iymax+1
DO jj = ixmin0, ixmax0
ex(ip) += sx0(jj)*sy(kk)*sz(ll)*exg(j0+jj,k+kk,l+ll)
END DO
END DO
END DO

! Gathering process on Ey
...
! Gathering process on Ez
...

ENDDO
END SUBROUTINE


This subroutine contains several issues that prevent a good efficiency on KNL:

• The main loop over particles, that has the largest trip counts, contains a lot of branches (if-statements) for managing all shape factor orders. Branches are slow (especially on KNL due to the lower frequency) and can sometime prevent vectorization. Here, branches are independence of the loop index and only depends on the shape factor orders nox, noy, noz.
• The main loop contains small inner loops that prevent vectorization. The most inner loop is too small (maximum 4 trip counts at order 3) to be vectorized efficiently.

### Optimization of the field gathering

This optimization is constituted of several steps.

#### 1) Removing the branches

Instead of having a single subroutine for all orders, we duplicate the code and create order-specific subroutines (for nox=noy=noz=1, nox=noy=noz=2, and nox=noy=noz=3). By this way, all branches are removed from the intensive particle loop. This simple and trivial optimization can provide tremendous speedups on KNL. On Intel Haswell processors, a speedup of 3.7 at order 1 and 1.6 at order 3 have been reported with a version of PICSAR dated June 2016. On Intel KNL, speedups are larger with values of 5.8 at order 1 and 7 at order 3. At order 1 for instance, the corresponding subroutine in 3D can be seen in src/field_gathering/energy_conserving/field_gathering_o1_3d.F90 with the name gete3d_energy_conserving_scalar_1_1_1 for the electric field.

#### 2) Linearization of the gathering inner loops.

The vetcorization of the particle loop is still prevented due to the gathering inner loop as in List. 2. The idea is to linearize these loops for each component Ex, Ey , Ez.


! Gathering process on Ex
DO ll = izmin, izmax+1
DO kk = iymin, iymax+1
DO jj = ixmin0, ixmax0
ex(ip) += sx0(jj)*sy(kk)*sz(ll)*exg(j0+jj,k+kk,l+ll)
END DO
END DO
END DO

END SUBROUTINE


The resulting code for Ex and a order 1 shape factor is shown in List. 3.

! Linearized gathering process on Ex
ex(ip) += sx0(0)*sy(0)*sz(0)*exg(j0,k,l)
ex(ip) += sx0(1)*sy(0)*sz(0)*exg(j0+1,k,l)
ex(ip) += sx0(0)*sy(1)*sz(0)*exg(j0,k+1,l)
ex(ip) += sx0(1)*sy(1)*sz(0)*exg(j0+1,k+1,l)
ex(ip) += sx0(0)*sy(0)*sz(1)*exg(j0,k,l+1)
ex(ip) += sx0(1)*sy(0)*sz(1)*exg(j0+1,k,l+1)
ex(ip) += sx0(0)*sy(1)*sz(1)*exg(j0,k+1,l+1)
ex(ip) += sx0(1)*sy(1)*sz(1)*exg(j0+1,k+1,l+1)



To allow vectorization, an OpenMP !$OMP SIMD pragma can be added before the particle loop. The last modification concerns the weigth arrays sx, sy, sz, sx0, sy0, sz0. The !$OMP SIMD pragma automatically put variables into private (a vector copy is created for each vector compoenent) but this is not the case for arrays such as the weights. Therefore, private(sx,sy,sz,sx0,sy0,sz0) should be added to the !$OMP SIMD pragma. This solution is not unique. One can also add a new dimension in sx, sy, sz, sx0, sy0, sz0 that corresponds to the vector length lvect. The allocation will becomes:  ! Weights for the interpolation ALLOCATE(sx(lvect,ixmin:ixmax), & sy(lvect,iymin:iymax), & sz(lvect,izmin:izmax)) ALLOCATE(sx0(lvect,ixmin0:ixmax0), & sy0(lvect,iymin0:iymax0), & sz0(lvect,izmin0:izmax0))  Using this method also means dividing the particle loop by block of size lvect:  ! Loop over particles by block of size lvect DO ip=1,np,lvect ! Loop over particles of the block DO n=1,MIN(lvect,np-ip+1) ! Absolute particle indexe nn = ip+n-1 Computation... ENDDO END DO  The computation is then the same except than both relative index n and absolute one nn are used. By comparing both method, we found that using the pragma !$OMP SIMD private(...) is slightly more efficient. The time is almost the same but using pragma is better in a sens that it is more portable.

#### 3) Fused multiply add (FMA) operations

In the present form, the field gathering subroutine can be vectorized but vectorization is still less efficient than the scalar subroutines. The reason is because the linearization of the gathering loops is not optimal and contained a lot of redundancies. For instance in List. 3, all lines have the multiplication by sz(0). This could be done only once. This can be optimized by factoring redundant computation and doing the more FMA as possible. The resulting optimized code is given in List. 6.

! Compute Ex on particle
a = (sx0(n,0)*exg(j0,k,l) &
+ sx0(n,1)*exg(j0+1,k,l))*sy(n,0)
a = a + (sx0(n,0)*exg(j0,k+1,l) &
+ sx0(n,1)*exg(j0+1,k+1,l))*sy(n,1)
ex(nn) = ex(nn) + a*sz(n,0)

a = (sx0(n,0)*exg(j0,k,l+1) &
+ sx0(n,1)*exg(j0+1,k,l+1))*sy(n,0)
a = a + (sx0(n,0)*exg(j0,k+1,l+1) &
+ sx0(n,1)*exg(j0+1,k+1,l+1))*sy(n,1)
ex(nn) = ex(nn) + a*sz(n,1)


This simple rearranging of the code makes the vectorization efficient. The new code is now only composed of FMA. This optimization is particularly important for high interpolation orders such as order 3 for which the number of flops/byte of the gathering process is higher.

Doing the factorization by hand can be tedious. We have therefore developed python scripts that directly generate these Fortran subroutines with factorized calculations. These scripts are located in utils/subroutine_generators.

#### 4) Cache-blocking

We mentioned above the possibility to divide the particle loop by block as shown in List. 5. This method can be considered as cache-blocking. If the fields do not occupied the entire L2 cache, the remaining part can be used to load a block of particle. By dividing the loop in blocks of size blocksize, the block particle properties will be loaded in cache. Fig. 2 helps to understand this idea on processors with L3 cache. On KNL, memory directly goes from MCDRAM to L2 cache.

Although this is not fully understood, this provides small speedups on Haswell and KNL. The gain is higher on KNL due to the lack of L3 cache.

#### 5) Merging loops

Finally, another speedups have been achieved by merging subroutines. At first, the electric and magnetic field gathering have been combined in the same particle loop. This enables to share index and weight computation. Then, the particle pusher has been merged with the field gathering. The two processes are performed together by block. These subroutines can be found in /src/particle_pushers/particle_pusher_manager_3d.F90.

### Final code version

Finally, the code looks like something like List. 7 at order 1.

SUBROUTINE field_gathering_plus_particle_pusher_1_1_1

! Variable and array declaration
! Weights for the interpolation
REAL(num), DIMENSION(0:1) :: sx,sy,sz
REAL(num), DIMENSION(0:1) :: sx0,sy0,sz0

dxi = 1.0_num/dx
dyi = 1.0_num/dy
dzi = 1.0_num/dz

! Loop over blocks of particles of size lvect
DO ip=1,np,lvect

blocksize = MIN(lvect,np-ip+1)

!$OMP SIMD private(sx,sy,sz,sx0,sy0,sz0) ! Loop over particles of the block ! for field gathering DO nn=ip,blocksize+ip-1 x = (xp(nn)-xmin)*dxi y = (yp(nn)-ymin)*dyi z = (zp(nn)-zmin)*dzi ! Compute grid node indexes j,j0,k,k0,l,l0 j=floor(x) j0=floor(x) ... xint=x-j yint=y-k zint=z-l ! Compute sx,sy,sz coefficients ... xint=x-0.5_num-j0 yint=y-0.5_num-k0 zint=z-0.5_num-l0 ! Compute sx0,sy0,sz0 coefficients ... ! Gathering process on Ex a = (sx0(n,0)*exg(j0,k,l) & + sx0(n,1)*exg(j0+1,k,l))*sy(n,0) a = a + (sx0(n,0)*exg(j0,k+1,l) & + sx0(n,1)*exg(j0+1,k+1,l))*sy(n,1) ex(nn) = ex(nn) + a*sz(n,0) a = (sx0(n,0)*exg(j0,k,l+1) & + sx0(n,1)*exg(j0+1,k,l+1))*sy(n,0) a = a + (sx0(n,0)*exg(j0,k+1,l+1) & + sx0(n,1)*exg(j0+1,k+1,l+1))*sy(n,1) ex(nn) = ex(nn) + a*sz(n,1) ! Gathering processes on Ey, Ez, Bx, By, Bz ENDDO ! Loop over particles of the block ! for particle pusher !$OMP SIMD
DO nn=ip,blocksize+ip-1

! Push particles

ENDDO

ENDDO
END SUBROUTINE


### Performance

Fig. 3 shows the times per iteration per particle for the field gathering and the particle pusher after different levels of optimization on a KNL configured in quadrant cache. The test case is a homogeneous hydrogen plasma with a domain 100x100x100 cells and 40 particles per cell. The transition from the original subroutines to the scalar subroutines corresponding to the optimization 1) in previous section leads to speedups of  5.8 at order 1 and 7 at order 3. Once vectorized with cache blocking (optimization 2) and 3)), a new speedup of 1.2 at order 1 and 1.5 at order order 3. The last optimization leads to an additional speedup of 1.6 at order 1.

Finally, all optimizations bring a total average speedup of 11 at order 1 and 3. The most important optimization is the first one. Vectorization comparatively brings a small speedup far from the maximum number. Additional analysis have to performed to improve the field gathering. The current version has gather instructions when accessing the nodes of the field grids. It mixes as for the current and the charge deposition different data types. Intel Advisor is also warning about potential register spilling due to the high number of variables to compute for each particle.

Mathieu Lobet, last update: March 28 2017