Profiling PICSAR on KNL with Vtune

In this tutorial, we present how to profile PICSAR with Intel Vtune on CORI KNL nodes at NERSC. The procedure can be adapted to other systems with small modifications.

You can find additional information on how to use Vtune on CORI on the NERSC website.

Presentation of Intel Vtune

Vtune is a performance analysis tool that helps you find serial and parallel code bottlenecks and speed up execution. It collects samples from your program at run time and prepares various performance reports.

On the command line, Vtune is invoked with the command amplxe-cl followed by various options, with the application name at the end of the line:

amplxe-cl -collect collection -r project_name -- ./picsar

collection corresponds to the type of survey you want to perform. For the list of available collections, you can go to the Vtune website or use this command:

amplxe-cl -help collect

For information related to a specific collection, append its name (written here as collection):

amplxe-cl -help collect collection

Basic collections that can be done first are:

  • hotspots: determine the most time-consuming part of the code
  • advanced-hotspots: better analysis of the hotspots with more metrics
  • general-exploration: helps to understand how the performance relates to the hardware (CPU and cache metrics).

Then, to go deeper in the code, specific collections can be performed:

  • memory-access: analysis of the memory performance and issues (bandwidth, memory usage…).

The project name project_name specified via -r is the directory that will contain Vtune files.
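As a rough illustration of the points above, the following sketch prints (without running them) one amplxe-cl command per basic collection, each writing to its own result directory. The helper name build_collect_cmd and the vtune_<collection> directory naming are assumptions made for this example. Only the collection names and the -collect and -r options come from the text above.

```shell
#!/bin/bash
# Dry-run sketch: commands are printed, not executed.
# build_collect_cmd is a hypothetical helper, not part of Vtune or PICSAR.
build_collect_cmd() {
    local collection="$1"   # type of analysis, e.g. hotspots
    local binary="$2"       # application to profile
    # One result directory per collection (naming scheme is an assumption).
    echo "amplxe-cl -collect ${collection} -r vtune_${collection} -- ${binary}"
}

# Basic collections listed in this tutorial, from coarse to detailed.
for collection in hotspots advanced-hotspots general-exploration; do
    build_collect_cmd "$collection" ./picsar
done
```

Using a separate result directory for each collection keeps the reports from one analysis type apart from the others.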

Many additional arguments can be added to the amplxe command to enrich the analysis:

  • -knob analyze-openmp: identify OpenMP overheads (for advanced-hotspots for instance)

We recommend consulting the Vtune collection webpage for the full list of additional arguments.

On KNL, the finalization of the analysis can take a lot of time. Therefore, it is recommended to specify -no-auto-finalize and to finalize on a login node. If Vtune cannot find the source code (this happens when the finalization is not done on the same node), you can add:

-source-search-dir=

You can specify the binary path using this argument:

-search-dir=

It is also recommended to raise the memory usage threshold by adding:

-data-limit=

Vtune can use a lot of memory and if the threshold is reached the collection will stop. For MPI codes, the following argument should be added:

-trace-mpi
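The two-step workflow described above (collect on a compute node, finalize on a login node) could look like the sketch below. The commands are only printed, not executed. The result directory name, the source and binary paths, and the choice of the hotspots collection are placeholders chosen for this example. The -finalize action belongs to amplxe-cl, but its exact behavior may depend on your Vtune version (see amplxe-cl -help finalize).

```shell
#!/bin/bash
# Dry-run sketch: commands are printed, not executed.
# All names and paths below are hypothetical placeholders.
RESULT_DIR="vtune_analysis"
SRC_DIR="/path/to/picsar/src"
BIN_DIR="/path/to/picsar/fortran_bin"

# Step 1 (compute node): collect the data but skip the finalization.
collect_cmd() {
    echo "amplxe-cl -collect hotspots -no-auto-finalize -data-limit=5000 -trace-mpi -r ${RESULT_DIR} -- ./picsar"
}

# Step 2 (login node): finalize, telling Vtune where sources and binaries are.
finalize_cmd() {
    echo "amplxe-cl -finalize -r ${RESULT_DIR} -source-search-dir=${SRC_DIR} -search-dir=${BIN_DIR}"
}

collect_cmd
finalize_cmd
```

The name of the result directory actually written by the collection may differ from the one passed to -r, so check it before running the finalization.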

Set up your environment for Vtune on CORI at NERSC

We will create a new installation of PICSAR. We recommend installing PICSAR and running Vtune in the SCRATCH directory. Clone or download the PICSAR sources into the directory of your choice on SCRATCH.

Then you need to prepare your environment. The Intel compiler is used.

You can create a file ~/.knl_vtune_config:

if [ "$NERSC_HOST" == "cori" ]
then

   # Modules
   module unload craype-haswell
   module unload PrgEnv-gnu
   module unload darshan
   module load PrgEnv-intel
   module load craype-mic-knl
   module load vtune

   # Path to PICSAR compiled with Intel for KNL and Vtune
   PICSAR=$SCRATCH/Codes/install_intel_knl_vtune/picsar/

   # PICSAR paths
   export PYTHONPATH="$PICSAR/python_bin/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/example_scripts_python/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/python_libs/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/postproc_python_script/:$PYTHONPATH"

fi

Then source it with source ~/.knl_vtune_config, or set up your environment manually. Note that you can replace the PICSAR path with the location of your choice.
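Once the environment is set up, a quick sanity check can confirm that the Vtune executables are reachable. The sketch below is one possible way to do it. The helper name check_tool is an assumption made for this example, and the script only tests whether the commands are on the PATH.

```shell
#!/bin/bash
# Return 0 if the given command is available on PATH, 1 otherwise.
# check_tool is a hypothetical helper written for this example.
check_tool() {
    command -v "$1" > /dev/null 2>&1
}

# Vtune command-line and GUI executables used in this tutorial.
for tool in amplxe-cl amplxe-gui; do
    if check_tool "$tool"; then
        echo "$tool found: $(command -v "$tool")"
    else
        echo "$tool not found: check that the vtune module is loaded"
    fi
done
```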

Compiling PICSAR for Intel Vtune on KNL at NERSC

The minimum compiler arguments required for Vtune are

  • -g: Produces symbolic debug information in the object file.
  • -Bdynamic: Links the libraries dynamically.

The PICSAR Makefile lets you compile directly for Vtune with:

make SYS=cori2 MODE=vtune

Additional arguments are used in our Makefile:

  • -D VTUNE=1: this flag prepares PICSAR to be used with Vtune. In this case, PICSAR uses Vtune API functions to exclude the initialization from the profiling reports. The profiling starts just before the kernel and stops right after it.
  • -I /opt/intel/vtune_amplifier_xe_2017.2.0.499904/include: Include path to the Vtune API for C files. Note that this path, valid on CORI, depends on your Vtune installation.
  • /opt/intel/vtune_amplifier_xe_2017.2.0.499904/lib64/libittnotify.a: to be added at the linking step for Vtune API functions.

The compilation will generate a binary file called picsar_cori2_vtune.
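Since the include and library paths depend on the Vtune installation, it can help to derive both from a single directory variable. The sketch below only prints the resulting arguments. The helper names are hypothetical, and the way these arguments are actually arranged in the PICSAR Makefile may differ.

```shell
#!/bin/bash
# Print the Vtune-related compile arguments for a given Vtune install directory.
# Both helpers are hypothetical and only assemble strings.
vtune_compile_flags() {
    echo "-g -D VTUNE=1 -I ${1}/include"
}

# Print the Vtune-related link arguments for a given Vtune install directory.
vtune_link_flags() {
    echo "-Bdynamic ${1}/lib64/libittnotify.a"
}

# Path valid on CORI at the time of writing; adapt it to your installation.
VTUNE_DIR="/opt/intel/vtune_amplifier_xe_2017.2.0.499904"
echo "compile: $(vtune_compile_flags "$VTUNE_DIR")"
echo "link:    $(vtune_link_flags "$VTUNE_DIR")"
```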

Batch script for Intel Vtune

An example batch script is presented below. Here, a general exploration of PICSAR is performed on a single KNL node using 4 MPI ranks and 32 OpenMP threads per rank. Vtune analyses have to be run on the $SCRATCH file system. On the HOME or PROJECT partitions, which are GPFS file systems, the finalization will fail. It is recommended to add #SBATCH --vtune to the batch script: some Vtune collections (but not all) require a special kernel module to be loaded, and they will fail without it. We are not sure this is still required on CORI.

#!/bin/bash -l
#SBATCH --job-name=vtune_analysis
#SBATCH --time=01:00:00
#SBATCH -N 1
##SBATCH -S 4
#SBATCH -p knl
#SBATCH -C knl,quad,cache
#SBATCH --vtune
#SBATCH -e vtune_analysis.err
#SBATCH -o vtune_analysis.out

export OMP_NUM_THREADS=32
export OMP_STACKSIZE=128M
export OMP_DISPLAY_ENV=true
export OMP_PROC_BIND=spread
export OMP_PLACES=cores"(16)"
export OMP_SCHEDULE=dynamic

source ~/.knl_vtune_config

cp $PICSAR/fortran_bin/picsar_cori2_vtune .

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

numactl -H

srun -n 4 -c 64 --cpu_bind=cores amplxe-cl -data-limit=5000 -collect general-exploration -r vtune_analysis -trace-mpi -- ./picsar_cori2_vtune
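The values -c 64 and OMP_PLACES=cores(16) in the script above follow from simple arithmetic on the node geometry. The sketch below reproduces it under two assumptions: 64 of the KNL cores are used by the application, and each core exposes 4 hardware threads. The helper names are hypothetical.

```shell
#!/bin/bash
# Assumptions for this sketch: 64 cores used by the application,
# 4 hardware threads (logical CPUs) per KNL core.
CORES_USED=64
HW_THREADS_PER_CORE=4

# Physical cores available to each MPI rank.
cores_per_rank() {
    echo $(( CORES_USED / $1 ))
}

# Logical CPUs per MPI rank, i.e. the value passed to srun -c.
cpus_per_task() {
    echo $(( (CORES_USED / $1) * HW_THREADS_PER_CORE ))
}

RANKS=4
echo "srun -c value:      $(cpus_per_task "$RANKS")"
echo "OMP_PLACES=cores($(cores_per_rank "$RANKS"))"
```

With 4 ranks this gives 16 cores and 64 logical CPUs per rank, so 32 OpenMP threads per rank correspond to 2 threads per core.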

Start and stop Vtune functions in PICSAR

A Vtune analysis often takes a long time to run and consumes a lot of memory, especially on KNL due to the lower clock frequency. Therefore, it can be more convenient to focus the analysis on a specific part of the code. For instance, one can decide not to include the initialization or some parts of the code that are known to be negligible or already optimized. For this purpose, the Vtune API provides C functions that start and stop the Vtune analysis when added to a code. In PICSAR, these functions have been wrapped in Fortran subroutines. These subroutines are located in src/profiling/api_fortran_itt.c and in src/profiling/itt_fortran.F90.

To analyze the code section of your choice, just add at the beginning:

start_vtune_collection()

And then

stop_vtune_collection()

at the end of the section. Vtune will automatically start and stop when passing through these subroutines.

Visualization of the results

You can open the GUI Vtune interface with this command:

amplxe-gui

At NERSC, it is recommended to use NX for network performance.
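When a graphical session is not practical, text reports can be produced from the command line instead. The sketch below prints two such commands without running them. The helper name report_cmd is hypothetical, and the result directory vtune_analysis is taken from the batch script above. The report types available depend on the collection and on your Vtune version (see amplxe-cl -help report).

```shell
#!/bin/bash
# Dry-run sketch: commands are printed, not executed.
# report_cmd is a hypothetical helper written for this example.
report_cmd() {
    local report_type="$1"   # e.g. summary or hotspots
    local result_dir="$2"    # directory given to -r at collection time
    echo "amplxe-cl -report ${report_type} -r ${result_dir}"
}

report_cmd summary vtune_analysis
report_cmd hotspots vtune_analysis
```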

Mathieu Lobet, Oleksyi Kononenko, last update: January 25, 2017
