## Multi-node performance on CORI

The multi-node performance of stand-alone PICSAR is studied on the KNL partition of the supercomputer CORI. CORI is a new Cray XC40 supercomputer hosted at NERSC, at the Lawrence Berkeley National Laboratory (LBNL) in California. A system overview is available on the NERSC website. Cori is equipped with 2,004 Intel Haswell compute nodes (Intel® Xeon™ Processor E5-2698 v3) and 9,304 Intel Xeon Phi Knights Landing (KNL) compute nodes (Intel® Xeon Phi™ Processor 7250). More information can also be found in our tutorial page on how to use PICSAR on KNL.

### Weak scaling

#### Case description

A weak scaling study has been performed with the PICSAR stand-alone code on up to 4096 nodes. For this purpose, a homogeneous hydrogen plasma has been simulated. In this plasma, all electrons have the same speed $v = 0.99 c$ with a random propagation direction. This ensures that a non-negligible number of particles cross the domain boundaries and that this number is well balanced between all boundaries. Simulations are run with 128 KNL nodes configured in quadrant cache mode. Each node uses 4 MPI ranks and 32 OpenMP threads per rank (hyper-threading with 2 threads per core). Each MPI domain has a size of 100x50x50 cells and is divided into 12x6x6 tiles. The tile size is therefore around 8x8x8 cells. There are 40 particles per cell per species (and 2 species). Current deposition is done at order 1. The problem fits entirely in MCDRAM.
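The per-rank workload implied by this configuration can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the figures quoted above, the variable names are our own, and it does not reflect how PICSAR itself stores or computes these quantities.

```python
# Figures taken from the case description above; variable names are illustrative.
cells_per_domain = (100, 50, 50)   # cells per MPI domain along x, y, z
tiles_per_domain = (12, 6, 6)      # tiles per MPI domain along x, y, z
particles_per_cell = 40            # per species
n_species = 2
mpi_ranks_per_node = 4
omp_threads_per_rank = 32

# Average tile extent along each axis, in cells.
tile_size = tuple(c / t for c, t in zip(cells_per_domain, tiles_per_domain))

n_cells = cells_per_domain[0] * cells_per_domain[1] * cells_per_domain[2]
n_tiles = tiles_per_domain[0] * tiles_per_domain[1] * tiles_per_domain[2]

# Macro-particle counts per MPI domain and per node.
particles_per_domain = n_cells * particles_per_cell * n_species
particles_per_node = particles_per_domain * mpi_ranks_per_node

# Average number of tiles available to each OpenMP thread.
tiles_per_thread = n_tiles / omp_threads_per_rank

print("average tile size (cells):", [round(s, 2) for s in tile_size])
print("cells per MPI domain:", n_cells)
print("tiles per MPI domain:", n_tiles)
print("particles per MPI domain:", particles_per_domain)
print("particles per node:", particles_per_node)
print("tiles per OpenMP thread:", tiles_per_thread)
```

Each MPI domain therefore holds 250,000 cells and 20 million macro-particles, which gives 80 million macro-particles per node. The 432 tiles per domain give each of the 32 threads about 13.5 tiles on average.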

#### Scaling results

The evolution of the efficiency as a function of the number of nodes in the weak scaling study is shown in Fig. 1. The scaling is given for the kernel, the computation, the particle communications and the grid communications. Relative times are given in Fig. 2. The study starts at 1 node and ends at 4096 nodes, corresponding to 16384 MPI processes and a total of 262144 cores.
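The MPI process and core counts quoted above follow from the per-node layout given in the case description (4 MPI ranks per node, 32 OpenMP threads per rank, 2 threads per core). A minimal sketch of that bookkeeping, with a hypothetical helper name `machine_footprint` that is not part of PICSAR:

```python
def machine_footprint(n_nodes, ranks_per_node=4, threads_per_rank=32,
                      threads_per_core=2):
    """Return (MPI ranks, physical cores) used by a run on n_nodes nodes.

    Defaults are the values from the case description; the function
    itself is only an illustration.
    """
    ranks = n_nodes * ranks_per_node
    # Each rank runs threads_per_rank threads, threads_per_core per core.
    cores = ranks * threads_per_rank // threads_per_core
    return ranks, cores


for n_nodes in (1, 2, 2048, 4096):
    ranks, cores = machine_footprint(n_nodes)
    print(f"{n_nodes:5d} nodes: {ranks:6d} MPI ranks, {cores:7d} cores")
```

With these defaults, each rank occupies 16 cores and each node uses 64 cores, so 4096 nodes correspond to 16384 MPI ranks and 262144 cores.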

The single-node run is the reference. On a single node, computation represents 70% of the total kernel time, grid communications 10% and particle communications 20%. At 2 nodes, the relative computation time drops to 50%. From 2 to 2048 nodes, it decreases further from 50% to 40% of the kernel time. This is mainly due to the increase of the grid communication time: above 1 node, MPI communications use the network. At 2 nodes, the relative time of the grid communications rises to 30%, and it then continues to increase overall, reaching 40% at 2048 nodes. Surprisingly, the relative time of the particle communications stays almost constant at 20% between 1 and 4096 nodes.

In terms of weak scaling efficiency, the computational efficiency is almost constant, close to 95%, except for the last points, where it is close to 85%. The worst results come from the grid communications, with an efficiency below 20%. The particle communication efficiency fluctuates between 45% and 80%. The result is an overall efficiency of 64% at 2 nodes that decreases to 47% at 2048 nodes. If we look at the efficiency between 2 and 2048 nodes, for which the network is used, the scaling is reasonable, with a decrease of 25% at 2048 nodes.
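To make these percentages concrete, the sketch below assumes the usual weak scaling definition, in which the efficiency at N nodes is the reference time on 1 node divided by the time on N nodes. This page does not state the exact formula used for the figures, so treat the definition and the function names as assumptions. Only the overall efficiencies quoted above are used as input.

```python
def weak_scaling_efficiency(t_ref, t_n):
    """Weak scaling efficiency, assuming the usual definition t_ref / t_n.

    t_ref: time of the reference run (here, 1 node).
    t_n:   time of the run on N nodes with N times the total work.
    A value of 1.0 means perfect weak scaling.
    """
    return t_ref / t_n


def relative_drop(e_start, e_end):
    """Relative loss of efficiency when going from e_start to e_end."""
    return (e_start - e_end) / e_start


# Overall efficiencies quoted in the text above.
e_2_nodes = 0.64
e_2048_nodes = 0.47

drop = relative_drop(e_2_nodes, e_2048_nodes)
print(f"relative efficiency drop from 2 to 2048 nodes: {drop:.1%}")

# Under the assumed definition, an efficiency E means the run takes
# 1 / E times as long as the single-node reference.
print(f"time at 2048 nodes relative to 1 node: {1 / e_2048_nodes:.2f}x")
```

Under this assumption, going from 64% to 47% is a relative loss of roughly a quarter, and an overall efficiency of 47% means the 2048-node run takes a little more than twice as long as the single-node reference.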

The surprising behavior is that of the last point, at 4096 nodes, whose performance seems close to that of the single node. This simulation has been run only once, and more runs are needed for better statistics.

Mathieu Lobet, last update: February 6 2017