A new method [1] for solving time-dependent problems proposed to apply the cartesian domain decomposition technique currently used with low order FDTD solvers to very high order-pseudo-spectral solvers. As in the case of low order schemes, the simulation domain would be divided into several subdomains (see Fig. 1 left) and Maxwell’s equations solved locally on each subdomain either using local convolution or local FFT’s when the order p is very large.

However, there were still two challenges to face before being able to use this technique in production simulation:

- First, this technique induces small truncation errors that can affect simulation results and that needed to be characterized. The fundamental argument still legitimating this decomposition method is that physical information cannot travel faster than the speed of light. Choosing large enough guard regions should therefore ensure that spurious signal coming from stencil truncations at subdomains boundaries would remain in guard regions and would not enter the simulation domain (see Fig. 1 left) after one time step. Truncations errors were recently extensively characterized and an analytical model [2] was developed thanks to which we can now predict the amount of truncation errors in simulations as a function of the numerical parameters used (solver order, guard cell sizes, mesh size). In particular, this model shows that one should use very high finite order-p solvers instead of infinite order solvers. Indeed, truncation errors with finite very high order solvers decreases much faster with the number of guard cells while still bringing pseudo-spectral accuracy on almost the whole range of frequency accessible in the simulation. Practically, the truncation error model shows that ten to twelve guard cells are sufficient to lower truncation errors below machine precisions with up to order 128 Maxwell solvers, for which pseudo-spectral precision is obtained on almost the whole frequency domain. This model will be especially crucial to determine the number of guard cells to use for a given solver order to minimize truncation errors in production simulations.
- Then, this technique will need efficient shared memory implementation of the PIC algorithm. Even tough the number of guard cells required (nguards = 10) seems low, it can however still affect the scaling of the PIC code to a large number of processors. One solution is therefore to enable good shared-memory parallelization of the PIC loop to decrease the total number of MPI-processes (and thus the number of volumes of MPI exchanges) and improve scaling of the pseudo-spectral code to a large number of cores. Shared memory implementation of particle routines and standard solvers have been addressed in the previous section of the this chapter. For FFTs, we use the FFTW library that provide rather good threaded implementation of FFTs. On Xeon Phi architectures, one might also consider using the Intel MKL library that has been designed to fully exploit hardware features of many-core architectures.

Thanks to this recent work, an efficient parallel implementation of pseudo-spectral solvers is now available in WARP/PXR that was recently tested at large scale on the MIRA [3] cluster at Argonne National Lab and benchmarked against standard solvers for relativistic laser-plasma mirror interactions. Scaling tests on MIRA are presented in the performance section and show efficient scaling of pseudo-spectral solvers on up to 800; 000 cores. Benchmarks against standard solvers show significant improvements in terms of time-to-solution and memory footprint required for a given accuracy [4].

### References

[1] – Jean-Luc Vay, Irving Haber, and Brendan B Godfrey. A domain decomposition method for pseudo-spectral electromagnetic simulations of plasmas. Journal of Computational Physics, 243:260-268, jun 2013.

[2] – H. Vincenti and J.-L. Vay. Detailed analysis of the effects of stencil spatial variations with arbitrary high-order finite-difference Maxwell solver. Computer Physics Communications, 200:147-167, marsh 2016.

[3] – Mira team. Mira super-computer. https://www.alcf.anl.gov/mira. Accessed: 2017-02-07.

[4] – G. Blaclard, H. Vincenti, R. Lehe, and J.-L. Vay. An efficient and portable simd algorithm for charge/current deposition in particle-in-cell codes. ArXiv preprint arXiv:1608.05739, 2016.