This tutorial shows how to profile PICSAR with Intel Vtune on the CORI KNL nodes at NERSC. The procedure can be transposed to other systems with small modifications.
You can find additional information on how to use Vtune on CORI on the NERSC website.
Presentation of Intel Vtune
Vtune is a performance analysis tool that helps you find serial and parallel code bottlenecks and speed up execution. It collects samples from your program at run time and prepares various performance reports.
On the command line, Vtune is invoked with amplxe-cl, followed by various options, with the application name at the end of the line:
amplxe-cl -collect collection -r project_name -- ./picsar
Here, collection is the type of analysis you want to perform. For the list of available collections, see the Vtune website or use this command:
amplxe-cl -help collect
For information on a specific collection, append its name (for example hotspots) in place of collection:
amplxe-cl -help collect collection
Basic collections that can be done first are:
- hotspots: determines the most time-consuming parts of the code
- advanced-hotspots: finer analysis of the hotspots with more metrics
- general-exploration: shows how performance relates to the hardware (CPU and cache metrics).
Then, to go deeper into the code, more specific collections can be performed:
- memory-access: analysis of the memory performance and issues (bandwidth, memory usage…).
The project name project_name, specified via -r, is the directory that will contain the Vtune files.
Many additional arguments can be added to the amplxe-cl command to enrich the analysis:
-knob analyze-openmp: identifies OpenMP overheads (for advanced-hotspots, for instance)
See the Vtune collection webpage for the full list of additional arguments.
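To make the command structure described above concrete, here is a minimal sketch that only prints the amplxe-cl command lines instead of running them. The helper vtune_cmd and the result directory names are hypothetical, and the =true suffix on the knob is an assumption about the syntax expected by your Vtune version; check amplxe-cl -help collect before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of PICSAR or Vtune): prints the amplxe-cl
# command line for a given collection instead of running it.
vtune_cmd() {
    local collection="$1"   # e.g. hotspots, advanced-hotspots, general-exploration
    local result_dir="$2"   # directory passed to -r
    shift 2
    # Remaining arguments: extra amplxe-cl options, then "--" and the application
    printf '%s\n' "amplxe-cl -collect ${collection} -r ${result_dir} $*"
}

# First pass: find the most time-consuming parts of the code
vtune_cmd hotspots vtune_hotspots -- ./picsar

# Second pass: more metrics plus OpenMP overheads
# (the "=true" form of the knob is an assumption, see above)
vtune_cmd advanced-hotspots vtune_adv -knob analyze-openmp=true -- ./picsar
```

Once the printed lines look right, they can be copied into a batch script or run directly on a system where Vtune is available.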
On KNL, the finalization of the analysis can take a long time. It is therefore recommended to specify -no-auto-finalize and to finalize on a login node. If Vtune cannot find the code sources (this happens when the finalization is not done on the same node), you can add:
You can specify the binary path using this argument:
It is also recommended to raise the memory usage threshold by adding:
Vtune can use a lot of memory, and if the threshold is reached the collection will stop. For MPI codes, the following argument should be added:
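As an illustration of the two-step workflow described in this section, the sketch below prints a collection command that skips finalization and a separate finalization command for a login node. The -data-limit=5000 and -trace-mpi options are the ones used in the batch script of this tutorial. The -finalize, -source-search-dir and -search-dir options, and the SRC_DIR and BIN_DIR locations, are assumptions on our part; check amplxe-cl -help for the exact names in your Vtune version.

```shell
#!/bin/bash
# Placeholder locations (assumptions): adapt them to your installation.
RESULT_DIR=vtune_analysis
SRC_DIR="${PICSAR:-.}/src"          # where the PICSAR sources are assumed to be
BIN_DIR="${PICSAR:-.}/fortran_bin"  # where the PICSAR binary is assumed to be

# Step 1 (compute node, inside the batch job): collect without finalizing
collect_cmd="amplxe-cl -collect general-exploration -no-auto-finalize -data-limit=5000 -trace-mpi -r ${RESULT_DIR} -- ./picsar"

# Step 2 (login node, once the job has ended): finalize the result,
# telling Vtune where to look for sources and binaries
finalize_cmd="amplxe-cl -finalize -r ${RESULT_DIR} -source-search-dir ${SRC_DIR} -search-dir ${BIN_DIR}"

# Print both commands for inspection instead of running them
echo "$collect_cmd"
echo "$finalize_cmd"
```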
Set up your environment for Vtune on CORI at NERSC
We will create a new installation of PICSAR. We recommend installing PICSAR and running Vtune in the SCRATCH directory. Clone or download the PICSAR sources into the directory of your choice on SCRATCH.
You then need to prepare your environment. The Intel compiler is used.
You can create a configuration file containing:
if [ "$NERSC_HOST" == "cori" ]
then
  # Modules
  module unload craype-haswell
  module unload PrgEnv-gnu
  module unload darshan
  module load PrgEnv-intel
  module load craype-mic-knl
  module load vtune
  # Path to PICSAR compiled with Intel for KNL and Vtune
  PICSAR=$SCRATCH/Codes/install_intel_knl_vtune/picsar/
  # PICSAR paths
  export PYTHONPATH="$PICSAR/python_bin/:$PYTHONPATH"
  export PYTHONPATH="$PICSAR/example_scripts_python/:$PYTHONPATH"
  export PYTHONPATH="$PICSAR/python_libs/:$PYTHONPATH"
  export PYTHONPATH="$PICSAR/postproc_python_script/:$PYTHONPATH"
fi
and source it:
source ~/.knl_vtune_config
You can also set up your environment manually. Note that you can change the PICSAR path to the location of your choice.
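After sourcing the configuration, a quick sanity check can save a failed batch job. The following sketch reports whether the tools used later in this tutorial are on your PATH. The helper check_tool is hypothetical, and the list of tools is only a suggestion.

```shell
#!/bin/bash
# Hypothetical helper: reports whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools this tutorial relies on once the environment is set up
for tool in amplxe-cl srun; do
    check_tool "$tool" || true   # keep going even if one tool is missing
done

# Show the PICSAR path defined by the configuration file, if any
echo "PICSAR=${PICSAR:-<unset>}"
```

If amplxe-cl is reported as missing, the vtune module is probably not loaded in the current shell.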
Compiling PICSAR for Intel Vtune on KNL at NERSC
The minimum compiler arguments required for Vtune are:
-g: produces symbolic debug information in the object file.
-Bdynamic: links the libraries dynamically.
The PICSAR Makefile can compile directly for Vtune with:
make SYS=cori2 MODE=vtune
Additional arguments are used in our Makefile:
-D VTUNE=1: this flag prepares PICSAR to be used with Vtune. In this case, PICSAR uses Vtune API functions to remove the initialization from the profiling reports. The profiling starts just before the kernel and stops right after.
-I /opt/intel/vtune_amplifier_xe_2017.2.0.499904/include: Include path to the Vtune API for C files. Note that this path, valid on CORI, depends on your Vtune installation.
/opt/intel/vtune_amplifier_xe_2017.2.0.499904/lib64/libittnotify.a: to be added at the linking step for Vtune API functions.
The compilation will generate a binary file called
Batch script for Intel Vtune
An example batch script is presented in List. 2. Here, a general exploration of PICSAR is performed on a single KNL node using 4 MPI ranks and 32 OpenMP threads per rank. Vtune analyses have to be run on the $SCRATCH file system. On the PROJECT partitions, the finalization will fail because they are GPFS file systems, which the Vtune finalization does not support. It is recommended to add #SBATCH --vtune to the batch script. Although we are not sure this is still required on CORI, some Vtune collections (but not all) require a special kernel module to be loaded. If you forget this option with these specific collections, they will fail.
#!/bin/bash -l
#SBATCH --job-name=vtune_analysis
#SBATCH --time=01:00:00
#SBATCH -N 1
##SBATCH -S 4
#SBATCH -p knl
#SBATCH -C knl,quad,cache
#SBATCH --vtune
#SBATCH -e vtune_analysis.err
#SBATCH -o vtune_analysis.out
export OMP_NUM_THREADS=32
export OMP_STACKSIZE=128M
export OMP_DISPLAY_ENV=true
export OMP_PROC_BIND=spread
export OMP_PLACES=cores"(16)"
export OMP_SCHEDULE=dynamic
source ~/.knl_vtune_config
cp $PICSAR/fortran_bin/picsar_cori2_vtune .
# Add the following "sbcast" line here for jobs larger than 1500 MPI tasks:
# sbcast ./mycode.exe /tmp/mycode.exe
numactl -H
srun -n 4 -c 64 --cpu_bind=cores amplxe-cl -data-limit=5000 -collect general-exploration -r vtune_analysis -trace-mpi -- ./picsar_cori2_vtune
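The numbers in the batch script are tied together, which matters if you change the number of MPI ranks. The short worked example below recomputes them. It assumes that 64 cores of the node are used for the run and that each KNL core provides 4 hardware threads; the other values are taken from the script above.

```shell
#!/bin/bash
# Assumptions: 64 cores of the node are used and each KNL core has
# 4 hardware threads. The other values come from the batch script.
cores_used=64
hw_threads_per_core=4
mpi_ranks=4        # srun -n 4
omp_threads=32     # OMP_NUM_THREADS=32

cores_per_rank=$(( cores_used / mpi_ranks ))
cpus_per_task=$(( cores_per_rank * hw_threads_per_core ))
threads_per_core=$(( omp_threads / cores_per_rank ))

echo "cores per rank:        ${cores_per_rank}"    # compare with OMP_PLACES=cores(16)
echo "logical CPUs per rank: ${cpus_per_task}"     # compare with srun -c 64
echo "OpenMP threads/core:   ${threads_per_core}"
```

With these inputs, each rank gets 16 cores and 64 logical CPUs, and the 32 OpenMP threads land at 2 per core. Doubling mpi_ranks to 8 would halve the value to pass to -c.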
Start and stop Vtune functions in PICSAR
A Vtune analysis often takes a long time to run and consumes a lot of memory, especially on KNL because of its lower clock frequency. It can therefore be more convenient to focus the analysis on a specific part of the code. For instance, one can decide to exclude the initialization or parts of the code that are known to be negligible or already optimized. For this purpose, the Vtune API provides C functions that start and stop the Vtune analysis when called from the code. In PICSAR, these functions have been wrapped in Fortran subroutines. These subroutines are located in src/profiling/api_fortran_itt.c and in
To analyze the code section of your choice, just add at the beginning:
at the end of the section. Vtune will automatically start and stop when passing through these subroutines.
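When the collection is controlled from inside the code, it can help to launch amplxe-cl with data collection initially paused, so that nothing is recorded until the code resumes it. The sketch below prints such a command. The -start-paused option is an assumption about your Vtune version, the result directory name is hypothetical, and whether the option is needed at all depends on how the PICSAR subroutines are called; the batch script of this tutorial does not use it.

```shell
#!/bin/bash
# Hypothetical result directory name
RESULT_DIR=vtune_kernel_only

# Start with collection paused; the start/stop calls added to the code
# are then expected to delimit the region that is actually profiled.
cmd="amplxe-cl -collect hotspots -start-paused -r ${RESULT_DIR} -- ./picsar_cori2_vtune"

# Print the command for inspection instead of running it
echo "$cmd"
```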
Visualization of the results
You can open the GUI Vtune interface with this command:
At NERSC, it is recommended to use NX for better network performance with the GUI.
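If the GUI is too slow over the network, text reports can also be produced on the command line. The sketch below is a hedged example: the function show_reports is hypothetical, and the summary and hotspots report types are assumptions to check with amplxe-cl -help report. On typical installations the GUI itself is started with amplxe-gui, optionally followed by the result directory.

```shell
#!/bin/bash
# Hypothetical helper: prints text reports for a finalized result directory,
# or a short message when Vtune or the directory is not available.
show_reports() {
    local result_dir="${1:-vtune_analysis}"
    if command -v amplxe-cl >/dev/null 2>&1 && [ -d "$result_dir" ]; then
        amplxe-cl -report summary -r "$result_dir"
        amplxe-cl -report hotspots -r "$result_dir"
    else
        echo "amplxe-cl or ${result_dir} not available; nothing to report"
    fi
}

show_reports vtune_analysis
```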
Mathieu Lobet, Oleksyi Kononenko, last update: January 25, 2017