Single-node performance on the NERSC Cori system

The NERSC Cori system

Cori is a new Cray XC40 supercomputer hosted at NERSC at the Lawrence Berkeley National Laboratory (LBNL) in California. The system overview is detailed on the NERSC website. Cori is equipped with 2,004 Intel Haswell compute nodes and 9,304 Intel KNL compute nodes. The system has a theoretical peak of 27.88 PFlops and was ranked 5th in the November 2016 TOP500 list of the most powerful supercomputers, with a maximum LINPACK performance of 14 PFlops.

Single-node performance

Global kernel performance

Global performance of the PICSAR kernel has been measured on a single node of the Haswell and KNL partitions using the Fortran stand-alone kernel. For this purpose, the same test case has been used on both architectures: a homogeneous thermalized plasma that fills the entire domain. The domain is divided into 100x100x100 cells with 2 species (electrons and protons) and 40 particles per cell and per species. The use of a thermalized homogeneous plasma ensures that the load remains balanced between processors during the whole simulation and that similar amounts of particles cross all boundaries. For this study, the code has been compiled with the Intel compiler 17 and Intel MPI. Direct deposition is used for both the charge and the current deposition.
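
To put the problem size in perspective, the short Python sketch below estimates the total number of macro-particles and their approximate memory footprint; the number of bytes per macro-particle is an assumption for illustration, not the actual PICSAR data layout.

    # Rough size estimate of the test case described above.
    nx, ny, nz = 100, 100, 100      # cells in each direction
    nspecies = 2                    # electrons and protons
    ppc = 40                        # particles per cell and per species

    nparticles = nx * ny * nz * nspecies * ppc
    print(f"Total macro-particles: {nparticles:.2e}")       # 8.00e+07

    # Assumption: ~10 double-precision attributes per particle
    # (positions, momenta, weight, gathered fields) -> 80 bytes each.
    bytes_per_particle = 10 * 8
    footprint_gib = nparticles * bytes_per_particle / 1024**3
    print(f"Approximate particle memory: {footprint_gib:.1f} GiB")   # ~6 GiB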

The time per iteration and per particle on a Haswell node (2 Intel Haswell sockets) is given in Fig. 1 for order 1 and order 3 interpolation. In physical simulations, order 3 is usually preferred for accuracy. This study shows a speed-up of the optimized kernel over the non-optimized one of 2.4 at order 1 and 1.7 at order 3.

haswell_time_ppart_pit
Fig. 1 – Simulation time per iteration per particle on a Haswell node of Cori phase I between the non-optimized (right) and the optimized (left) PICSAR kernel.

The time per iteration and per particle on a KNL node (a single KNL processor) is given in Fig. 2 for order 1 and order 3 interpolation. The system used here is not Cori but Carl, the small KNL cluster available before the arrival of Cori. This study shows a speed-up of 3.7 at order 1 and 5 at order 3.

knl_time_ppart_pit
Fig. 2 – Simulation time per iteration per particle on a KNL node (configured in SNC4 flat mode) between the non-optimized and the optimized PICSAR kernel.

Optimization efforts performed in PICSAR now enable better performance on Many Integrated Core (MIC) architectures such as the KNL than on previous multi-core architectures. A speed-up of 1.8 is observed at order 3 between Haswell and KNL. Similar simulation times are obtained at order 1.

OpenMP strong scaling on a single KNL node

An OpenMP strong-scaling study has been performed on a node of Carl (Intel compiler with Intel MPI) using a KNL configured in SNC4 flat mode. For this purpose, we use 4 MPI ranks, one per NUMA (Non-Uniform Memory Access) domain. The number of OpenMP threads per MPI rank ranges from 1 to 64 when hyper-threading is used. Hyper-threading comes into play above 17 threads per MPI rank, since the node has 68 physical cores (17 per rank). The results are shown in Fig. 3. The OpenMP scaling efficiency diminishes slowly and remains good up to 16 OpenMP threads per rank, with an efficiency of 70%. Above that, with hyper-threading, the efficiency drops and reaches 20% for 64 OpenMP threads per MPI rank. This is mainly due to the communications, as shown by the red and green curves. Grid communications, i.e. guard-cell exchanges between MPI domains, scale very badly but only represent between 1% and 10% of the simulation time. Particle communications represent between 20% and 40% of the simulation time.

knl_strong_scaling
Fig. 3 – Strong OpenMP scaling on KNL configured in SNC4 flat mode. The black curve is the efficiency of the whole kernel, the blue curve is for the computation, the green curve for communication of the guard cells between grids and the red curve for particle communications.
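
The efficiency quoted above is the usual strong-scaling efficiency, E(N) = T(1) / (N x T(N)). A minimal Python sketch of this computation is given below; the timings are hypothetical placeholders, not the measured values plotted in Fig. 3.

    # Strong-scaling efficiency: E(N) = T(1) / (N * T(N)).
    # The timings below are hypothetical placeholders (seconds per iteration).
    times = {1: 100.0, 2: 52.0, 4: 27.0, 8: 15.0, 16: 9.0, 32: 7.0, 64: 8.0}

    t1 = times[1]
    for nthreads, t in sorted(times.items()):
        efficiency = t1 / (nthreads * t)
        print(f"{nthreads:3d} OpenMP threads: efficiency = {efficiency:.0%}")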

Hyper-threading

In some cases, using hyper-threading can bring an additional speed-up. One should bear in mind that on KNL all cores have to be used to saturate the memory bandwidth. For codes with a high arithmetic intensity (high flop-per-byte ratio), hyper-threading can be interesting. However, bandwidth-bound codes or codes that exploit the totality of the L2 cache can suffer from a loss of performance. Using more threads per core means that the L2 cache is shared by more processing units, leading to additional L2 cache misses once it saturates.
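
To illustrate this sharing effect, the sketch below computes the L2 cache available per thread on a KNL tile (1 MB of L2 shared by the 2 cores of a tile, each core supporting up to 4 hardware threads). These are generic KNL characteristics, not PICSAR measurements.

    # L2 cache available per thread on a KNL tile:
    # 1 MB of L2 is shared by the 2 cores of a tile, and each core
    # can run up to 4 hardware threads.
    l2_per_tile_kb = 1024
    cores_per_tile = 2

    for threads_per_core in (1, 2, 4):
        threads_per_tile = cores_per_tile * threads_per_core
        print(f"{threads_per_core} thread(s) per core: "
              f"{l2_per_tile_kb // threads_per_tile} kB of L2 per thread")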

The study performed on a single KNL node shows that PICSAR can benefit from hyper-threading up to 2 threads per core. Using 4 threads per core leads to performance similar to 1 thread per core, as shown in Fig. 4. This behavior is observed similarly in quadrant flat, SNC4 flat and quadrant cache modes.

picsar_hyperthreading_2
Fig. 4 – Effect of using hyper-threading on a single node. Time per iteration per particle for 3 parallel configurations using 4 MPI ranks with 16, 32 and 64 OpenMP threads. Using more than 16 OpenMP threads per MPI rank means that more than one thread per physical core is used. Three KNL configurations have been considered: quadrant flat (blue), SNC4 flat (orange) and quadrant cache (green).

Although PICSAR can be considered partially memory bound, the stand-alone Fortran code can be run with 2 threads per core to obtain a 20% speed-up. As shown in Fig. 3 in the previous section, hyper-threading does not scale and the efficiency drops rapidly above one thread per core.

Hybrid parallel configuration

PICSAR uses a hybrid MPI-OpenMP parallelization. In order to determine the best ratio of MPI ranks to OpenMP threads in terms of performance, a parametric study has been conducted in quadrant flat, quadrant cache and SNC4 flat modes. The results are shown in Fig. 5.

picsar_knl_conf_study_2
Fig. 5 – Study of the best ratio of MPI ranks to OpenMP threads in terms of performance. For KNL nodes configured in quadrant flat (blue), SNC4 flat (orange) and quadrant cache (green) modes, the times per iteration per particle are reported for 5 hybrid cases, always using 2 threads per core on 64 cores.

On a single node, for the homogeneous test case, the simulation time does not vary much between the 4 MPI ranks/32 OpenMP threads case and the full MPI case with 64 MPI ranks/2 OpenMP threads. Only the full OpenMP case appears nearly 40% slower. This study also shows that on a single node, performance of the 3 studied KNL configurations is very similar when the simulation fits entirely in MCDRAM (less than 16 GB).
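
As a reference for the configurations swept in Fig. 5, the sketch below enumerates the possible MPI x OpenMP decompositions of a node when 64 cores and 2 threads per core are used (128 threads in total); the five cases actually measured are those reported in the figure.

    # Possible hybrid decompositions of 64 cores with 2 threads per core.
    total_threads = 64 * 2   # 128 hardware threads in use

    for nranks in (1, 2, 4, 8, 16, 32, 64, 128):
        nthreads = total_threads // nranks
        print(f"{nranks:3d} MPI ranks x {nthreads:3d} OpenMP threads per rank")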

DDR versus MCDRAM

When the simulation fits entirely in MCDRAM, memory-bound codes can greatly benefit from its high bandwidth. This study shows the advantage of the MCDRAM for the PICSAR stand-alone kernel when the problem is below 16 GB. Fig. 6 shows the time per iteration per particle for a series of 5 simulations with different hybrid parallel configurations (see abscissa) and for 2 KNL configurations: quadrant flat using the MCDRAM only in one case (numactl -m 1, blue) and the DDR only in the other (numactl -m 0, orange).

picsar_mcdram_vs_ddr
Fig. 6 – Time per iteration per particle for simulations performed on a KNL configured in quadrant flat mode using the MCDRAM only (blue, numactl -m 1) and the DDR only (orange, numactl -m 0), for the different hybrid MPI/OpenMP configurations given in abscissa.

For every hybrid parallel configuration considered here, all using hyper-threading (2 threads per core), there is on average a factor of 2 between the simulation time in MCDRAM and in DDR. Using the MCDRAM, either as addressable memory or as a cache, is an advantage for PICSAR simulations when the problem fits in it.
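
As an order-of-magnitude explanation, the sketch below compares the approximate peak bandwidths of the two memories; these are approximate published figures for KNL, not measurements from this study. The observed factor of 2 is smaller than the raw bandwidth ratio because PICSAR is only partially memory bound.

    # Approximate peak memory bandwidths on KNL (published orders of
    # magnitude, not measured here).
    mcdram_bw_gbs = 450.0   # 16 GB on-package MCDRAM
    ddr_bw_gbs = 90.0       # 6 DDR4 channels

    print(f"MCDRAM / DDR bandwidth ratio: ~{mcdram_bw_gbs / ddr_bw_gbs:.0f}x")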

Effect of L3 cache

Intel KNL processors do not have an L3 cache (last-level cache, LLC), contrary to Intel Haswell processors, but they have the MCDRAM that can be used in cache mode. On the Haswell processors of Cori, the L3 cache is a medium-size (40 MB per socket) high-bandwidth memory.

picsar_transferred_byte_ratio
Fig. 7 – Transferred byte ratio between KNL and Haswell as a function of the number of particles per cell.

Mathieu Lobet, last update: January 27, 2017