The CORI supercomputer

CORI is a new Cray XC40 supercomputer hosted at NERSC, at the Lawrence Berkeley National Laboratory (LBNL) in California. The system overview is detailed on the NERSC website. Cori is equipped with 2,004 Intel Haswell compute nodes (Intel® Xeon™ Processor E5-2698 v3) and 9,304 Intel Xeon Phi Knights Landing (KNL) compute nodes (Intel® Xeon Phi™ Processor 7250). The system has a theoretical peak performance of 27.9 petaflops and ranked 5th in the November 2016 TOP500 list of the most powerful supercomputers, with a measured maximum performance of 14 petaflops. The system uses the Cray Aries high-speed interconnect with a Dragonfly topology.

Intel Xeon Phi Knights Landing

Knights Landing (KNL) processors are the latest Intel product of the Xeon Phi family (the second generation, after Knights Corner) and are based on the Many Integrated Core (MIC) architecture. Intel KNLs can exist in different forms:

  • self-hosted: the KNL runs the system OS itself
  • co-processor: the KNL is seen as an accelerator (like a graphics card)
  • with or without on-package high-bandwidth memory (called MCDRAM)

NERSC KNLs are self-hosted processors (they run their own OS), as were the processors of previous systems. Each node contains a single KNL. KNLs have been designed to be more energy efficient, i.e. to provide the maximum computing power per watt consumed. What makes KNLs so different from the previous generation of processors (Intel Xeon, such as Haswell) is:

  • Many more cores than before: NERSC KNLs have 68 cores per processor (16 cores on Haswell), and some models go up to 72. Each core has 4 hardware threads (2 threads on Haswell). The system can therefore see up to 272 logical CPUs per processor (32 on Haswell).
  • A lower frequency per core: around 1.4 GHz (against 2.3 GHz on Haswell).
  • Larger vector pipelines: each KNL core has two Vector Processing Units (VPUs) supporting AVX-512 instructions, with a hardware vector length of 512 bits (eight double-precision elements). Larger vector instructions compensate for the decrease in frequency and contribute to making KNLs more energy efficient.
  • On-package, high-bandwidth memory (MCDRAM): 16 GB of memory with a bandwidth (>460 GB/s) about 5x higher than that of the DDR4 DRAM. Its latency, however, is equivalent to that of DDR4. In addition, KNL has no L3 cache, contrary to Haswell (40 MB).
  • NUMA domains: the processor cores are connected by a 2D mesh network, with 2 cores per tile. NUMA effects depend on the configuration.
  • L2 cache: 1 MB per tile (shared by the 2 cores of the tile) versus 256 KB per core on Haswell.
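
The counts quoted in the list above follow from simple arithmetic. The short shell sketch below recomputes them from the per-processor figures given on this page; the variable names are illustrative only and are not part of any NERSC or PICSAR tool.

```shell
#!/bin/bash
# Figures taken from the list above (per processor).
knl_cores=68;      knl_threads_per_core=4
haswell_cores=16;  haswell_threads_per_core=2
vector_bits=512;   double_bits=64

# Logical CPUs (hardware threads) visible to the OS.
knl_logical=$(( knl_cores * knl_threads_per_core ))
haswell_logical=$(( haswell_cores * haswell_threads_per_core ))

# Double-precision elements held by one AVX-512 vector register.
doubles_per_vector=$(( vector_bits / double_bits ))

# Tiles on the 2D mesh (2 cores per tile), each with 1 MB of L2 cache.
knl_tiles=$(( knl_cores / 2 ))

echo "KNL logical CPUs:     $knl_logical"
echo "Haswell logical CPUs: $haswell_logical"
echo "Doubles per vector:   $doubles_per_vector"
echo "KNL tiles:            $knl_tiles (total L2: ${knl_tiles} MB)"
```

For the 68-core model this gives 272 logical CPUs, 8 doubles per vector and 34 tiles, which is the scale of parallelism an application has to expose to fill a single KNL.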

All these parameters create a wide variety of possible configurations. The best one depends on your application and your test case. This also makes KNLs more complicated to use efficiently than, for instance, Intel Haswell processors.

KNL configurations

MCDRAM can be used in 3 ways:

  • As a cache memory (cache mode): the entire MCDRAM is managed by the system as a cache level, and arrays cannot be explicitly allocated in this space.
  • As an allocatable memory (flat mode): like the traditional DDR, this space is allocatable memory.
  • As a hybrid cache/allocatable memory: the user can divide the MCDRAM into two memory spaces, one used as cache and the other in flat mode.

In addition to the memory configuration, KNL nodes can be configured in 3 NUMA configurations:

  • quadrant: the KNL node is seen as a single socket and the MCDRAM is fully available from any core, with potential NUMA effects.
  • SNC4: the KNL node is seen as 4 separate sockets and the MCDRAM is split into 4 partitions.
  • SNC2: the KNL node is seen as 2 separate sockets and the MCDRAM is split into 2 partitions.

Performance tests have shown that PICSAR benefits from MCDRAM when the problem fits in it. In this case, the quadrant, SNC4 and SNC2 modes perform similarly in both cache and flat modes on the homogeneous plasma test case.

The default KNL mode on CORI is quadrant cache, on the partition knl. These KNLs cannot be reconfigured. The partition knl_reboot allows the KNLs to be reconfigured in the desired mode. In this case, the system reboots each KNL, which takes approximately 20 minutes.
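
The NUMA and memory modes described above map onto the Slurm constraint keywords used in the batch script later on this page (knl,quad/snc2/snc4,flat/cache). As a minimal sketch, the hypothetical helper below (knl_constraint is a name invented here, not a NERSC tool) validates a pair of modes and prints the matching constraint line. The hybrid memory mode is rejected only because this page gives no keyword for it.

```shell
#!/bin/bash
# Hypothetical helper: build the Slurm constraint string for a KNL mode.
# Usage: knl_constraint <numa_mode> <memory_mode>
knl_constraint() {
    local numa="$1" mem="$2"
    case "$numa" in
        quad|snc2|snc4) ;;
        *) echo "unknown NUMA mode: $numa" >&2; return 1 ;;
    esac
    case "$mem" in
        cache|flat) ;;
        *) echo "unknown memory mode: $mem" >&2; return 1 ;;
    esac
    echo "knl,${numa},${mem}"
}

# Print one #SBATCH line per combination named on this page.
for numa in quad snc2 snc4; do
    for mem in cache flat; do
        echo "#SBATCH -C $(knl_constraint "$numa" "$mem")"
    done
done
```

Only the knl,quad,cache combination is available without a reboot, according to the paragraph above; the others imply the knl_reboot partition.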

Configuring KNL nodes for PICSAR on CORI

General and detailed information on how to create a batch script for the KNL partition on CORI can be found on the NERSC website. The information provided here can also help to configure PICSAR on another KNL cluster.

Since CORI has different environments, I like to define a file ~/.knl_config that contains the environment variables and module loads for KNL (see the listing below). This file can be sourced manually on the login nodes or in an interactive session, and can be sourced in the batch script to set up the modules and the paths. I also like to keep different installations of PICSAR and other libraries for different compilers, processors and libraries, and therefore several such files to rapidly change and adapt my environment.

if [ "$NERSC_HOST" = "cori" ]; then

   # Modules: swap the Haswell target for the KNL one
   module unload craype-haswell
   module load craype-mic-knl

   # Path to PICSAR compiled with Intel for KNL
   # (set the PICSAR variable here; it is used below)

   # PICSAR paths
   export PYTHONPATH="$PICSAR/python_bin/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/example_scripts_python/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/python_libs/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/postproc_python_script/:$PYTHONPATH"

fi

Listing 2 is an example of a batch script that launches PICSAR on 128 KNL nodes configured in quadrant cache mode (the default mode on CORI).

The KNL partition and configuration have to be specified via #SBATCH -C knl,quad,cache (#SBATCH -C haswell for the Haswell partition). The available configurations are knl,quad/snc2/snc4,flat/cache. #SBATCH -p debug/regular selects the type of queue. #SBATCH -S 4 can be used to specify how many cores are reserved specifically for the OS. In practice, no difference has been seen with PICSAR when 4 cores are simply left idle without this option.

We recommend running PICSAR with OpenMP. In this script, we use 128 KNL nodes by specifying #SBATCH -N 128. Each KNL runs 4 MPI tasks with 32 OpenMP threads per task (hyperthreading is activated with 2 threads per core). We therefore specify export OMP_NUM_THREADS=32. The 4 MPI tasks per node are obtained in the srun command via -n 512 (128 nodes x 4 tasks). With -c 64, each task is given 64 logical CPUs, i.e. 16 cores, so that only 64 of the 68 cores of each node are used. For OpenMP, we increase the stack size with export OMP_STACKSIZE=128M, otherwise segmentation faults can occur.

Thread placement can be controlled via OMP_PLACES and OMP_PROC_BIND. OMP_PLACES specifies where the threads are placed. OMP_PROC_BIND binds the threads to the specified places (they will not move) and sets how they are distributed (master, close, spread). The OpenMP schedule is set with OMP_SCHEDULE=dynamic for load balancing between tiles.

#!/bin/bash -l
#SBATCH --job-name=picsar_simulation
#SBATCH --time=00:30:00
#SBATCH -N 128
#SBATCH -p debug
#SBATCH -C knl,quad,cache
#SBATCH -e error_file
#SBATCH -o output_file

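# OpenMP settings: 32 threads per MPI task, larger stack to avoid segfaults
export OMP_NUM_THREADS=32
export OMP_STACKSIZE=128M
export OMP_DISPLAY_ENV=true
export OMP_PROC_BIND=spread
export OMP_PLACES=cores"(16)"
export OMP_SCHEDULE=dynamic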

source ~/.knl_config
cp $PICSAR/fortran_bin/picsar_cori2 .
# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./picsar_cori2 /tmp/picsar_cori2

numactl -H

srun -n 512 -c 64 --cpu_bind=cores ./picsar_cori2
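
The numbers in the srun line above are tied together by the node and task counts. The sketch below recomputes them from the values quoted on this page, so that they can be adapted to another decomposition. It only prints the resulting lines and launches nothing; the variable names are illustrative, and reading -c as a count of logical CPUs per task is an assumption based on the values used here.

```shell
#!/bin/bash
# Values from the batch script above.
nodes=128
tasks_per_node=4
cores_used_per_node=64     # 4 of the 68 cores are left idle
hw_threads_per_core=4      # hardware threads per KNL core
omp_threads_per_core=2     # OpenMP threads actually run on each core

total_tasks=$(( nodes * tasks_per_node ))                    # srun -n
cores_per_task=$(( cores_used_per_node / tasks_per_node ))   # OMP_PLACES=cores(N)
cpus_per_task=$(( cores_per_task * hw_threads_per_core ))    # srun -c
omp_threads=$(( cores_per_task * omp_threads_per_core ))     # OMP_NUM_THREADS

echo "export OMP_NUM_THREADS=$omp_threads"
echo "export OMP_PLACES=cores($cores_per_task)"
echo "srun -n $total_tasks -c $cpus_per_task --cpu_bind=cores ./picsar_cori2"
```

With the values above this reproduces -n 512, -c 64, 16 places and 32 threads per task. Changing tasks_per_node to 8, for example, would give 8 cores and 16 threads per task on the same number of nodes.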

For the quad cache mode, no memory binding is required. In flat mode, however, you need to specify where the memory will be allocated (DDR or MCDRAM) and how the MCDRAM is combined with the DDR. To do so, you can use:

  • MEMKIND: this library allows some arrays to be explicitly allocated in MCDRAM via directives in the code or via environment variables in batch scripts. PICSAR does not use MEMKIND for the moment.
  • NUMACTL: numactl (NUMA control) binds the code to specific memory partitions. The numactl command has to be placed in the srun line, just before the application name (./picsar_cori2 in the script above).

In quadrant flat mode (#SBATCH -C knl,quad,flat), two NUMA memory nodes are available: the full DDR (node 0) and the full MCDRAM (node 1). For instance, to use the MCDRAM in preference, the command becomes:

srun -n 512 -c 64 --cpu_bind=cores numactl -p 1 ./picsar_cori2

Here, -p means that the MCDRAM is preferred: once it is full, allocation continues in DDR. Alternatively, you can use -m to force allocation in MCDRAM only or in DDR only. In that case, a segfault occurs when the selected partition is full.

In SNC4 flat mode, KNLs are divided into 4 NUMA domains. NUMA nodes 0 to 3 therefore correspond to the 4 domains in DDR, and nodes 4 to 7 to the 4 domains in MCDRAM. The srun command to use MCDRAM in preference becomes:

srun -n 512 -c 64 --cpu_bind=cores numactl -p 4,5,6,7 ./picsar_cori2
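
The node numbering above can be wrapped in a small helper so that the launch line follows the chosen mode. This is only a sketch: mcdram_nodes and numactl_prefix are hypothetical names, the quadrant and SNC4 node IDs come from this page, and the SNC2 IDs are an assumption that the same DDR-first numbering applies. Whether a given numactl version accepts several nodes with -p is worth checking in its manual page.

```shell
#!/bin/bash
# Hypothetical helper: NUMA node IDs that hold MCDRAM in flat mode.
mcdram_nodes() {
    case "$1" in
        quad) echo "1" ;;
        snc2) echo "2,3" ;;      # assumption: DDR nodes first, as in SNC4
        snc4) echo "4,5,6,7" ;;
        *) echo "unknown NUMA mode: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: numactl prefix for a NUMA mode and a policy.
#   prefer -> -p (fall back to DDR once MCDRAM is full)
#   bind   -> -m (allocate only on the listed nodes)
numactl_prefix() {
    local nodes flag
    nodes=$(mcdram_nodes "$1") || return 1
    case "$2" in
        prefer) flag="-p" ;;
        bind)   flag="-m" ;;
        *) echo "unknown policy: $2" >&2; return 1 ;;
    esac
    echo "numactl $flag $nodes"
}

# Print (do not run) the launch lines for two cases.
echo "srun -n 512 -c 64 --cpu_bind=cores $(numactl_prefix quad prefer) ./picsar_cori2"
echo "srun -n 512 -c 64 --cpu_bind=cores $(numactl_prefix snc4 bind) ./picsar_cori2"
```

Running numactl -H on a compute node, as the batch script above does, lists the NUMA nodes actually present and is a simple way to confirm the numbering before binding.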

A useful interactive batch script generator is also available on MyNERSC to help you design your own launcher.

Mathieu Lobet, Last update: February 15 2017
