Profiling PICSAR with Intel Advisor

In this tutorial, we will present how to profile PICSAR with Intel Advisor on KNL. We will use the Cori supercomputer at NERSC, but what we present can easily be transposed to other systems with small modifications.

Presentation of Intel Advisor

Intel Advisor is composed of two tools, Vectorization Advisor and Threading Advisor, that help ensure that your code reaches its full performance potential.

In this tutorial, we will focus on Vectorization Advisor, a vectorization optimization tool that:

  • helps identify time-consuming loops that can benefit from vectorization or that are already vectorized;
  • helps identify vectorization and efficiency issues (dependencies, register spilling, memory access patterns…) and proposes solutions;
  • helps ensure that vectorization is safe, and quantifies its effects (vectorization efficiency, roofline performance model).

On the command line, Advisor is invoked via advixe-cl:

advixe-cl -collect analysis -trace-mpi -- ./application

Here, analysis refers to the type of collection you want to run. It can be one of:

  • survey: general overview of the performance and of the vectorization state of the code.

  • tripcounts: improves the survey by dynamically exploring loop iteration execution in order to support better decisions about your vectorization strategy. It measures the FLOP count and the cumulative data traffic needed for the Roofline performance model.

  • dependencies: refines the analysis by checking for real data dependencies in loops the compiler did not vectorize because of assumed dependencies.

  • map (Memory Access Pattern): refines the analysis by checking for various memory issues, such as non-contiguous memory accesses and unit-stride vs. non-unit-stride accesses.
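Putting these collection types together, a typical workflow runs them in order against a single project directory. A minimal sketch, where advisor_analysis and ./application are illustrative names and the commands are echoed as a dry run:

```shell
# Dry-run sketch of a typical Advisor collection sequence.
# "advisor_analysis" and "./application" are illustrative names;
# drop the leading 'echo' to actually run the collections.
PROJ=advisor_analysis
APP=./application
for COLLECT in survey tripcounts dependencies map; do
  echo advixe-cl -collect $COLLECT --project-dir $PROJ -trace-mpi -- $APP
done
```

Survey and tripcounts are cheap enough to run on every study; dependencies and map are heavier and are usually restricted to the loops flagged by the survey.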

Then, a series of additional arguments can be used. You can choose the name of the analysis project directory using the following argument:

-project-dir <directory-name>

You can specify the location of the sources using this option:

--search-dir src:r=<path-to-sources>

In order to take into account the masks in the vector operations on KNL for specific collections such as tripcounts, you have to specify:

-flops-and-masks

The finalization of the results on KNL can take a long time. You can decide not to finalize the results after the analysis by specifying:

-no-auto-finalize

More details on these analyses can be found in the Intel Advisor documentation on the Intel website.

Compiling PICSAR for Intel Advisor on KNL at NERSC

First of all, you have to install PICSAR with the right flags for Advisor. We recommend creating a new PICSAR installation in your SCRATCH directory. We also recommend using the Intel compiler and the corresponding libraries for KNL:

module unload craype-haswell
module load craype-mic-knl
module unload PrgEnv-gnu
module load PrgEnv-intel

To set up your environment quickly, you can create a file that you source, for instance source ~/.knl_config_advisor. This file can look like this:

if [ "$NERSC_HOST" == "cori" ]; then

   # Modules
   module unload craype-haswell
   module unload PrgEnv-gnu
   module unload darshan
   module load PrgEnv-intel
   module load craype-mic-knl

   # Not required for compilation but needed at execution
   module load advisor
   # This activates the Cache-Aware Roofline feature
   export ADVIXE_EXPERIMENTAL=roofline

   # Path to PICSAR compiled with Intel for KNL and Advisor
   # (placeholder path: adjust to your installation)
   export PICSAR=$SCRATCH/picsar

   # PICSAR paths
   export PYTHONPATH="$PICSAR/python_bin/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/example_scripts_python/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/python_libs/:$PYTHONPATH"
   export PYTHONPATH="$PICSAR/postproc_python_script/:$PYTHONPATH"

fi

In this script, PICSAR is assumed to be installed in your SCRATCH directory, pointed to by the $PICSAR variable. Change this path according to your installation. Instead of sourcing a file, you can also set up your environment manually.

You can easily compile PICSAR for Advisor using our Makefile:

make SYS=cori2 MODE=advisor

The compilation will generate a binary file called picsar_cori2_advisor.

The compilation is made with the following flags:

  • -g: produces symbolic debug information in the object file (required by Advisor).
  • -dynamic: enables dynamic linking (recommended for the Advisor runtime).
  • -O3 -xMIC-AVX512 -qopenmp: enables OpenMP, optimization, and AVX-512 vectorization for KNL.
  • -debug inline-debug-info: adds more debug information for inlined code.
  • -align array64byte: aligns arrays on 64-byte boundaries for better vectorization efficiency.
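For reference, the flag set above corresponds to a compile line along the following lines. This is only a sketch: ftn is the Cray Fortran compiler wrapper used on Cori, the source file name is a placeholder, and the actual build is driven by the Makefile.

```shell
# Illustrative compile line with the Advisor flags listed above
# (dry run: drop the leading 'echo' to compile; main.F90 is a placeholder).
FFLAGS="-g -dynamic -O3 -xMIC-AVX512 -qopenmp -debug inline-debug-info -align array64byte"
echo ftn $FFLAGS -o picsar_cori2_advisor main.F90
```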

Batch script for Intel Advisor

To use Advisor, we have to load the corresponding module on Cori: module load advisor. In the following example, this is done by sourcing the setup file: source ~/.knl_config_advisor.

In the following batch script (List. 2), the basic survey is performed on PICSAR on a KNL node configured in quadrant cache with 1 MPI rank and 64 OpenMP threads.

#!/bin/bash -l
#SBATCH --job-name=advisor_analysis
#SBATCH --time=00:30:00
#SBATCH -p knl
#SBATCH -C knl,quad,cache
#SBATCH -e advisor_analysis.err
#SBATCH -o advisor_analysis.out

export OMP_DISPLAY_ENV=true
export OMP_PROC_BIND=spread
export OMP_PLACES="cores(64)"
export OMP_NUM_THREADS=64      # 64 threads, 1 per physical core
export OMP_SCHEDULE=dynamic

source ~/.knl_config_advisor

cp $PICSAR/fortran_bin/picsar_cori2_advisor .

# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe

numactl -H

# Survey
srun -n 1 -c 272 --cpu_bind=cores advixe-cl -collect survey -project-dir advisor_analysis -trace-mpi -- ./picsar_cori2_advisor
# Tripcounts
srun -n 1 -c 272 --cpu_bind=cores advixe-cl -collect tripcounts -project-dir advisor_analysis -flops-and-masks -trace-mpi -- ./picsar_cori2_advisor
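Once the survey has identified problematic loops, the refinement collections can be appended to the batch script above. A sketch, assuming Advisor's -mark-up-list option to select loops by the IDs shown in the survey report (the IDs 3 and 5 here are purely illustrative):

```shell
# Job-script fragment (sketch): refine loops flagged by the survey.
# The loop IDs after -mark-up-list are illustrative; take them from
# the survey report of your own run.
srun -n 1 -c 272 --cpu_bind=cores advixe-cl -collect dependencies \
     -mark-up-list=3,5 -project-dir advisor_analysis -trace-mpi -- ./picsar_cori2_advisor
srun -n 1 -c 272 --cpu_bind=cores advixe-cl -collect map \
     -mark-up-list=3,5 -project-dir advisor_analysis -trace-mpi -- ./picsar_cori2_advisor
```

Restricting these heavier collections to a few loops keeps their large runtime overhead manageable.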

Visualization of the results

We use the Advisor GUI to visualize the data. You can open the GUI on a login node by first loading the Advisor module (module load advisor) and then using the following command:

advixe-gui

You can pass the path to your project directory directly on the command line or open it from the GUI.

If you wish to use the Cache-Aware Roofline feature with Advisor, do not forget to export:

export ADVIXE_EXPERIMENTAL=roofline

At NERSC, it is recommended to run the Advisor GUI through NX for better remote display performance.
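If a GUI session is not convenient, a text summary can also be generated on the command line with the -report action. A sketch, assuming the advisor_analysis project directory created by the batch script above:

```shell
# CLI fragment (sketch): print a text summary of the survey results
# on a login node, without opening the GUI.
advixe-cl -report survey -project-dir advisor_analysis
```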

Fig 1. – Screenshot of the loop properties from the Advisor GUI after a simple survey. The loops have been sorted so that the vectorized ones appear first (orange marker on the left). The first column lists the loops. The 9th column, called Efficiency, gives an estimate of the vectorization efficiency, and the 10th the estimated speedup (Gain). Loops of the particle pusher are efficiently vectorized. Loops of the field gathering and of the current deposition, however, do not reach 100%. Reasons are given in the 3rd column, called Vector Issues. By clicking on a loop, one can access a detailed summary and advice to improve vectorization.

Mathieu Lobet, last update: January 31, 2017
