WARP+PICSAR performance

The particle-in-cell code WARP and the library PICSAR have been coupled in order to run WARP efficiently on KNL and to improve performance on older architectures as well.

WARP+PICSAR versus WARP alone

To compare WARP+PICSAR and WARP alone, the considered test case is a homogeneous thermalized hydrogen plasma with a thermal velocity v_{th} = 0.1c, where c is the speed of light in vacuum. The plasma fills the entire domain of 64 × 64 × 64 μm, discretized into 400 × 400 × 400 cells. There are 40 macro-particles per cell per species to represent the plasma (composed of two species, electrons and protons), which corresponds to a total of 5,120,000,000 macro-particles. Simulations are run on 128 KNL nodes configured in quadrant cache mode with 4 MPI ranks per node. The 512 MPI subdomains therefore divide the main domain into cubes of 50 × 50 × 50 cells, and the tiling further divides each MPI subdomain into cubes of approximately 8 × 8 × 8 cells. Each MPI rank has 16 OpenMP threads to handle the tiles. Current deposition is performed with an order-3 B-spline shape factor. On each node, the MCDRAM is large enough to contain all allocations.
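As a sanity check, the decomposition and particle count can be recomputed from these parameters (a minimal Python sketch; all numbers are taken from the setup just described):

```python
# Recompute the domain decomposition and particle count of the test case.
nx = ny = nz = 400                    # cells in each direction
nodes, ranks_per_node = 128, 4        # 128 KNL nodes, 4 MPI ranks per node
ranks = nodes * ranks_per_node        # 512 MPI subdomains

cells_per_side = nx // round(ranks ** (1.0 / 3.0))   # 512 ranks = 8 x 8 x 8 cubes
print(cells_per_side)                 # 50 -> subdomains of 50 x 50 x 50 cells

species = 2                           # electrons and protons
ppc_per_species = 40                  # macro-particles per cell per species
total_particles = nx * ny * nz * ppc_per_species * species
print(f"{total_particles:,}")         # 5,120,000,000 macro-particles
```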

The original version of WARP appears extremely slow on KNL. Running WARP alone is similar to running PICSAR with the least optimized subroutines and MPI parallelism only: there is no tiling, and OpenMP is not used as efficiently as in PICSAR. We therefore obtain a simulation time per particle per iteration per node of around 150 ns. With WARP+PICSAR, the time per particle per iteration per node drops to 19 ns, which corresponds to a speedup of almost 8.
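For reference, the metric used throughout this page and the quoted speedup can be reproduced as follows (a minimal sketch; the definition of the metric is our assumption based on its name, and only the 150 ns and 19 ns values come from the measurements above):

```python
# Assumed definition of "time per particle per iteration per node":
# wall-clock time divided by the number of iterations and by the number of
# macro-particles handled per node.
def time_per_particle_ns(wall_time_s, iterations, particles, nodes):
    particles_per_node = particles / nodes
    return wall_time_s / (iterations * particles_per_node) * 1e9

# Speedup of WARP+PICSAR over WARP alone on KNL, from the measured values above:
warp_alone_ns, warp_picsar_ns = 150.0, 19.0
print(f"speedup ~ {warp_alone_ns / warp_picsar_ns:.1f}")   # ~7.9, i.e. almost 8
```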

The same test case has been run on the Haswell partition of CORI and on Edison. We keep the same number of nodes, which means that the number of cores and the computational power are not equivalent across machines. On the CORI Haswell partition, WARP performs the simulation in around 150 ns per particle per iteration per node. WARP+PICSAR behaves very similarly to KNL, with a time per particle per iteration per node of 19 ns (speedup of 8). On Edison, WARP performs the test case in 96 ns per particle per iteration per node. With PICSAR, this drops to 33 ns (speedup of 3), showing that Haswell and KNL perform better than the previous-generation Ivy Bridge.

We have also run on KNL a binary compiled for Haswell (-xCORE-AVX2 instead of -xMIC-AVX512). The KNL architecture can execute AVX2 instructions, but they do not use the 512-bit vector registers as efficiently as AVX-512 instructions. We obtain a time of 27 ns per particle per iteration and per node.

Some performance tests have also been performed with large physical cases. The first one is a 2D physical case of harmonic generation in the interaction of a high-intensity laser with a solid target. The domain is 4300 × 6600 cells with 400 million particles. We use the Yee scheme with 16,000 iterations, including diagnostics every 1,000 time steps. Run on 96 nodes on both Edison and CORI KNL, this simulation turns out to be about twice as fast on Edison, with a simulation time of 6377 seconds against 13504 seconds on KNL.

Best KNL configuration

Hyper-threading

Hyper-threading is first tested with 4 parallel configurations on a KNL node configured in quadrant cache mode. The 4 parallel configurations are given on the abscissa of Fig. 1; the ordinate is the time per iteration per particle (and per node). For each case, 1, 2 and 4 threads per core are tested (corresponding respectively to the blue, orange and green markers). Fig. 1 shows that WARP+PICSAR does not benefit from hyper-threading. The best performance is obtained with 1 thread per core. Using 2 threads per core, which is the best choice with PICSAR stand-alone, slightly slows down the code, and using 4 threads per core brings no benefit to WARP+PICSAR.
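For clarity, the following sketch enumerates the OpenMP thread counts per MPI rank corresponding to the configurations of Fig. 1 (assuming the 64 cores used per node stated in the caption; the actual run-time environment settings are not shown here):

```python
# OpenMP thread counts per MPI rank for the parallel configurations of Fig. 1,
# assuming 64 cores are used per KNL node in every case.
cores_per_node = 64
for mpi_ranks in (1, 4, 8, 32):
    cores_per_rank = cores_per_node // mpi_ranks
    for threads_per_core in (1, 2, 4):          # no HT, 2-way HT, 4-way HT
        omp_threads = cores_per_rank * threads_per_core
        print(f"{mpi_ranks:2d} MPI ranks, {cores_per_rank:2d} cores/rank, "
              f"{threads_per_core} thread(s)/core -> {omp_threads} OpenMP threads/rank")
```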

warp_picsar_knl_hyperthreading
Fig. 1 – Study of hyper-threading with WARP+PICSAR on KNL configured in quadrant cache mode. The figure shows the simulation time per iteration per particle in ns for 4 different parallel configurations: 1 MPI rank and 64 cores, 4 MPI ranks and 16 cores, 8 MPI ranks and 8 cores, and 32 MPI ranks and 2 cores. Every case uses 64 cores. For each one, 1, 2 and 4 threads per core are considered, corresponding respectively to the blue, orange and green markers.

Fig. 1 also reveals that the most efficient hybrid parallel distribution between MPI and OpenMP is the last case, with 32 MPI ranks and 2 cores per rank. This is studied further in the next section.

NUMA and memory KNL configurations

The best NUMA and memory KNL configuration is now studied. We focus on 3 configurations: quadrant flat, quadrant cache and SNC4 flat; SNC2 is not studied here. The results are shown in Fig. 2. The abscissa represents different hybrid parallel configurations, as in Fig. 1 but with more data points, and the ordinate is the time per iteration per particle per node in nanoseconds. Fig. 2 reveals that quadrant cache is surprisingly faster than quadrant flat, even though the problem fits in MCDRAM: averaged over all hybrid parallel distributions, the speedup factor is 1.15. SNC4 flat is almost as efficient as quadrant cache, although more data points would be needed to conclude. The difference between all configurations is nonetheless very small. Since quadrant cache is the default mode on CORI (these nodes have their own partition, so no KNL reboot is needed), we recommend using this mode for large production cases.
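The quoted average factor can be obtained as follows (a minimal sketch; the assumed definition is the mean of the per-configuration time ratios, and the time lists are hypothetical placeholders, not the measured values of Fig. 2):

```python
# Sketch of how the average speedup factor of quadrant cache over quadrant flat
# is computed: the mean of the time ratios over all hybrid parallel configurations.
# The values below are hypothetical placeholders, not the data of Fig. 2.
quadrant_flat_ns  = [40.0, 25.0, 22.0, 21.0]   # hypothetical times per configuration
quadrant_cache_ns = [34.0, 22.0, 19.0, 18.5]   # hypothetical times per configuration

ratios = [flat / cache for flat, cache in zip(quadrant_flat_ns, quadrant_cache_ns)]
print(sum(ratios) / len(ratios))               # average speedup factor (reported: 1.15)
```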

warp_picsar_knl_conf_study
Fig. 2 – Comparison of 3 NUMA and memory KNL configurations: quadrant cache (blue), SNC4 flat (orange) and quadrant flat (green) for the different hybrid parallel configurations given on the abscissa. The ordinate is the time per iteration per particle in ns.

This is also a study of the best hybrid parallel distribution between MPI and OpenMP. As with Fig. 1, Fig. 2 confirms that using a large number of MPI ranks is the most efficient choice for WARP+PICSAR: 32 MPI ranks with 2 OpenMP threads per rank appears to be the best configuration, with 8 or 64 MPI ranks performing almost as well. Using only OpenMP (1 MPI rank) is clearly slower than the best quadrant cache configuration, by a factor of 1.8. Carefully setting up a run with the best hybrid parallel distribution can therefore have a real impact on performance. In our case, running in pure MPI remains an efficient option for WARP+PICSAR; note that tiling is activated even with 1 OpenMP thread per MPI rank.

DRAM versus MCDRAM

We now study the impact of MCDRAM when the problem fits in the 16 GB available. Fig. 3 shows the results of the study performed on KNL configured in flat mode, using the MCDRAM only and then the DDR only. Placing the code in MCDRAM speeds up the runs for every parallel configuration, but only by a small average factor of 1.13 in our case.

WARP_PICSAR_mcdram_vs_ddr.png
Fig. 3 – Time per iteration per particle in ns for different parallel configurations, from 1 MPI rank and 64 OpenMP threads to 64 MPI ranks and 1 OpenMP thread per rank, for two memory configurations in quadrant flat mode: using the MCDRAM (blue) and the DRAM (red).

Huge pages

Huge pages are virtual memory pages that are bigger than the default base page size of 4 KB. At NERSC, using larger pages can speed up applications. To use them, the corresponding module has to be loaded; for instance, to use 16 MB pages, the module craype-hugepages16M needs to be loaded. The code then has to be recompiled and run with this module.

We have studied the effect of huge pages on WARP+PICSAR for 16 and 128 KNL nodes in quadrant cache mode. The results are shown in Fig. 4. For WARP+PICSAR, the times with huge pages and with the default page size are very similar. Using huge pages therefore does not seem to speed up the code, at least for the considered test case.

hugepage
Fig. 4 – Huge-page study with WARP+PICSAR. The time per iteration per particle per node as a function of the page size is given for 2 simulation configurations, both on KNL in quadrant cache mode: one with 16 nodes (blue) and the other with 128 nodes, as in the previous studies (red).

Weak scaling

This section presents weak scaling studies performed with WARP+PICSAR on the KNL and Haswell partitions of CORI.

warp_picsar_weak_scaling
Fig. 5 – WARP+PICSAR weak scaling on KNL in quadrant cache mode with 4 and 64 MPI ranks, and on Haswell with 2 MPI ranks and 16 OpenMP threads. The abscissa is the number of nodes (KNL or Haswell) and the ordinate the efficiency (time on a single node divided by the time on N nodes).
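The efficiency plotted in Fig. 5 follows directly from the definition given in the caption (a minimal sketch; the example values are hypothetical, not read off the figure):

```python
# Weak-scaling efficiency as defined in the caption of Fig. 5:
# time on a single node divided by the time on N nodes,
# with the problem size per node kept constant.
def weak_scaling_efficiency(time_one_node_s, time_n_nodes_s):
    return time_one_node_s / time_n_nodes_s

# Hypothetical example: a run that takes 100 s on 1 node and 105 s on many nodes.
print(weak_scaling_efficiency(100.0, 105.0))   # ~0.95, i.e. 95 % efficiency
```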

Conclusion

WARP coupled with PICSAR is faster than WARP alone, with an average speedup of 8 on CORI (both on Intel Haswell and KNL). According to the results of this study, we recommend keeping the quadrant cache mode on KNL: it provides good performance and is the default KNL mode on CORI. Up to 512 KNL nodes, using many MPI ranks (32 MPI ranks per node) is more efficient than using many OpenMP threads. For our case, 32 MPI ranks per node with 2 OpenMP threads per rank (1 thread per core, without hyper-threading) is the best hybrid configuration.

Mathieu Lobet, last update: February 9, 2017
