## Multi-node performance on Mira

As part of the Director’s Discretionary Allocation obtained on June 2nd 2016, a series of scaling tests has been performed on the Mira supercomputer at Argonne. Mira is a 10-petaflops IBM Blue Gene/Q system [1] composed of 49 152 nodes, each with 16 PowerPC A2 cores running at 1600 MHz (786 432 cores in total), interconnected by a 5D torus network.

The case of a homogeneous thermalized plasma of thermal velocity $0.1c$ is considered again, with a domain of 1024x1024x3072 grid cells and 20 macro-particles per cell. These dimensions allow the grid cells to be divided equally between MPI domains in all the tests.
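The even-division property can be checked with a few lines of Python. This is only a minimal sketch: the 64 x 64 x 192 process grid is a hypothetical decomposition chosen for illustration (the decompositions actually used in the tests are not stated here), and `cells_per_domain` is an illustrative helper rather than a routine of the code.

```python
from math import prod

GRID = (1024, 1024, 3072)   # global grid cells (from the text)
PARTICLES_PER_CELL = 20     # macro-particles per cell (from the text)


def cells_per_domain(grid, procs):
    """Return the local grid shape if every axis divides evenly, else None."""
    if any(n % p != 0 for n, p in zip(grid, procs)):
        return None
    return tuple(n // p for n, p in zip(grid, procs))


# Hypothetical process grid for 786 432 MPI ranks (assumption, not from the source).
procs = (64, 64, 192)
local_shape = cells_per_domain(GRID, procs)

total_cells = prod(GRID)
total_particles = total_cells * PARTICLES_PER_CELL

print("MPI ranks:          ", prod(procs))
print("local grid shape:   ", local_shape)
print("total cells:        ", total_cells)
print("total particles:    ", total_particles)
```

Because 1024 is a power of two and 3072 is three times a power of two, any process count of the form $2^k$ or $3 \times 2^k$ (within range) admits such an even split, which covers the full machine (786 432 = $3 \times 2^{18}$ cores).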

First, a strong scaling test is performed with MPI only (1 OpenMP thread per MPI process), with a number of cores ranging from 20 000 to 800 000. The finite-difference time-domain (FDTD) Maxwell solver is used with a stencil of order 2 and 2 guard cells per MPI domain, and the Pseudo-Spectral Analytic Time-Domain (PSATD) Maxwell solver is used with a pseudo-stencil of order 128 and 12 guard cells per MPI domain. The results are presented in Fig. 1. The low-order FDTD Maxwell solver (circle markers) scales very well to the full machine, with an efficiency of 98% on approximately 800 000 cores. The high-order solver (square markers) also scales very well, with an efficiency of 83% on half of the Mira machine. The lower efficiency of the high-order solver compared to the low-order FDTD solver is mainly due to the larger number of guard cells that need to be exchanged.
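As a reminder of how these figures are defined, the sketch below computes a strong-scaling efficiency relative to a reference run, and the volume of guard cells relative to the interior of a local domain. It assumes the usual definition $E = (T_\mathrm{ref} N_\mathrm{ref}) / (T_N N)$; the timings and the 16x16x16 local domain are placeholder values for illustration only, not measurements from these tests, and both function names are hypothetical.

```python
from math import prod


def strong_scaling_efficiency(t_ref, n_ref, t, n):
    """Efficiency of a run on n cores relative to a reference run on n_ref cores.

    The problem size is fixed (strong scaling), so ideal scaling gives
    t = t_ref * n_ref / n and an efficiency of 1.
    """
    return (t_ref * n_ref) / (t * n)


def guard_cell_ratio(local_shape, n_guard):
    """Guard-cell volume divided by interior volume for one MPI domain.

    Assumes n_guard guard cells on both sides of every axis.
    """
    interior = prod(local_shape)
    padded = prod(n + 2 * n_guard for n in local_shape)
    return (padded - interior) / interior


# Placeholder timings (NOT measured data): 8 s on 1000 cores, 1.25 s on 8000 cores.
eff = strong_scaling_efficiency(t_ref=8.0, n_ref=1000, t=1.25, n=8000)
print(f"efficiency: {eff:.2f}")

# Hypothetical 16x16x16 local domain with the two guard-cell counts from the text.
for n_guard in (2, 12):
    ratio = guard_cell_ratio((16, 16, 16), n_guard)
    print(f"{n_guard:2d} guard cells: guard/interior volume = {ratio:.3f}")
```

For a fixed local domain, the guard-cell volume grows quickly with the number of guard cells, which is consistent with the efficiency gap between the two solvers described above.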

Strong OpenMP scaling tests for the particle and field routines have then been performed with a fixed number of 49 152 MPI tasks. They show that a speedup of 11 can be achieved with 16 OpenMP threads per MPI process.
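To put the OpenMP result in perspective, the quoted speedup can be converted into a thread efficiency and into an estimate of the non-parallel fraction. The sketch below uses the Karp-Flatt metric, which is an interpretation added here under the assumption of Amdahl-type behaviour; it is not an analysis taken from the tests themselves.

```python
def parallel_efficiency(speedup, n_threads):
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup / n_threads


def karp_flatt_serial_fraction(speedup, n_threads):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / n_threads) / (1.0 - 1.0 / n_threads)


# Values quoted in the text: speedup of 11 with 16 OpenMP threads.
eff = parallel_efficiency(11, 16)
serial = karp_flatt_serial_fraction(11, 16)

print(f"thread efficiency: {eff:.1%}")
print(f"estimated serial fraction: {serial:.1%}")
```

With the figures quoted above, this gives a thread efficiency of about 69% and an estimated serial fraction of about 3%.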

### References

[1] – Mira team. Mira super-computer. https://www.alcf.anl.gov/mira. Accessed: 2016-07-22.

Henri Vincenti, Mathieu Lobet. Last update: February 6, 2017