The roofline performance model provides a visual analysis of the resources that constrain computation on any system, from single-core to many-core architectures. It consists of a 2D graph that combines floating-point performance, operational intensity (also referred to as arithmetic intensity), and memory performance. It gives a synthesized view of how efficiently a kernel uses the system resources, how far the kernel is from the machine peak performance, and what the limitations and trade-offs are. The rooflines can be computed using the bandwidth and traffic of the RAM or of any cache level (and, more recently, of the High-Bandwidth Memory that equips KNLs).
In a roofline graph (schematically shown in Fig. 1), the ordinate represents the floating-point performance (in Gflop/s), whereas the abscissa represents the arithmetic intensity (AI), which is the ratio of the total number of floating-point operations to the total data movement in bytes (flop/byte). Roofline graphs are composed of 2 parts:
- The left roofline is a rising slope whose steepness is given by the system peak memory bandwidth, which can be measured on the system itself or taken from the vendor specifications.
- The right, flat roofline is given by the peak floating-point performance. A kernel can only reach it when its arithmetic intensity is high enough that memory bandwidth is no longer the limiting resource.
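The two parts described above can be summarized as the minimum of two bounds: the bandwidth-limited slope (AI times bandwidth) and the flat compute roof. The following is a minimal sketch of that relation; the function names and the machine numbers are illustrative assumptions, not measured or vendor values.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound (Gflop/s) for a kernel of arithmetic intensity `ai` (flop/byte)."""
    # Left part: slope given by the bandwidth. Right part: flat peak.
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (flop/byte) where the slope meets the flat roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 2000.0    # assumed peak floating-point performance
BANDWIDTH_GBS = 100.0   # assumed peak memory bandwidth

for ai in (0.5, 1.0, 20.0, 100.0):
    bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    side = "memory-bound" if ai < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS) else "compute-bound"
    print(f"AI = {ai:6.1f} flop/byte -> bound = {bound:7.1f} Gflop/s ({side})")
```

Kernels whose arithmetic intensity lies to the left of the ridge point sit under the slope, and kernels to the right sit under the flat roof.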
There are several roofline models. In this tutorial we will focus on 2 of them:
- The classical roofline model [1,2] (Fig. 2): Originally, the arithmetic intensity is determined from the number of bytes measured in the DRAM traffic. The classical roofline model can be generalized to any given memory or cache level if the corresponding traffic can be measured.
- The Cache-Aware Roofline Model (CARM) [3] (Fig. 3): The operational intensity is determined from the total number of bytes transferred from all levels of the memory hierarchy to the CPU, that is, the total amount of data delivered to the processing units. From another point of view, it essentially corresponds to the L1 traffic (L1 operational intensity).
Computing the Classical Roofline Model
As shown in Fig. 2, the classical roofline model requires 3 ingredients:
- Number of flops
- Number of bytes loaded from DRAM
- The computational time
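Once the three ingredients listed above are known, the position of the code on the classical roofline follows from two divisions. The sketch below assumes the totals are already summed over all MPI ranks; the function name and the sample values are hypothetical placeholders, not PICSAR measurements.

```python
def roofline_point(total_flops, dram_bytes, seconds):
    """Return (arithmetic intensity in flop/byte, performance in Gflop/s)."""
    if dram_bytes <= 0 or seconds <= 0:
        raise ValueError("dram_bytes and seconds must be positive")
    ai = total_flops / dram_bytes          # abscissa of the roofline graph
    gflops = total_flops / seconds / 1e9   # ordinate of the roofline graph
    return ai, gflops


# Placeholder inputs (assumptions): flops from SDE, bytes from VTune,
# time from the internal timers of the code.
ai, gflops = roofline_point(total_flops=4.0e12, dram_bytes=2.0e12, seconds=50.0)
print(f"AI = {ai} flop/byte, performance = {gflops} Gflop/s")
```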
In this section, we will use the method of Douglas Doerfler presented on the NERSC website. This method is also presented in [4] and applied to different codes on Intel KNL processors. To compute the total number of flops for the full code or for a specific function, the Intel Software Development Emulator toolkit (SDE) is used. For more information on how to use SDE, you can go to our SDE tutorial webpage. The following command shows how to use srun with SDE (SDE is used with 4 MPI ranks) and PICSAR:
srun -n 4 -c 1 sde -knl -d -iform 1 -omix my_mix.out -i -global_region \
    -start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat -- ./picsar
SDE will generate a separate folder per MPI rank containing reports for different metrics, including flop counts for all kinds of variables. Douglas Doerfler has developed bash scripts to analyze these files and extract the total number of flops for all kinds of floats. These scripts can be downloaded from the Arithmetic Intensity NERSC webpage. You can get the number of flops by parsing all outputs using the script:
sh parse-sde.sh sde*
To compute the arithmetic intensity, the number of bytes is now required. To this end, we run the code with Intel VTune on the command line with a memory-access collection. For more information about how to use Intel VTune with PICSAR, you can go to our VTune tutorial. An example of srun command with Intel VTune is the following (VTune is used with 4 MPI ranks):
srun -n 4 -c 1 amplxe-cl -start-paused -r vtune_results \
    -collect memory-access -no-auto-finalize -trace-mpi -- code
VTune will create a folder called vtune_results containing the analysis output files. To make the profiling results easier to parse, a report then has to be generated from these output files:
amplxe-cl -report summary -r vtune_results > vtune_report
This creates a new file called vtune_report. As for SDE, Douglas Doerfler has developed a bash script to analyze this report and extract the number of bytes transferred from DRAM. This script, called parse-vtune.sh, can be downloaded from the Arithmetic Intensity NERSC webpage. You can finally obtain the total number of bytes transferred from DRAM (you can also get information about MCDRAM and L1) by doing:
sh parse-vtune.sh vtune_report
The final ingredient that you need is the simulation time. PICSAR provides time statistics using internal timers at the end of the runs (printed in the terminal or .err files). Run PICSAR without any profiler to avoid overheads.
In order to draw the rooflines, one needs the bandwidths and the peak performance of the machine. Rooflines can be drawn from the theoretical vendor values or from experimentally measured values. In [4], the maximum sustained bandwidth at each level of the cache hierarchy and the maximum sustained floating-point rate of the KNL processor are measured using the Empirical Roofline Toolkit (ERT), and the results are given in the article.
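As an illustration of how the ceilings can be turned into curves ready for plotting, the sketch below samples the arithmetic-intensity axis on a log scale and evaluates one roofline per memory level. The helper names, the peak value, and the bandwidths are assumptions chosen for the example; they are not the ERT results of the article.

```python
def log_spaced(start, stop, count):
    """Return `count` values from `start` to `stop`, evenly spaced on a log scale."""
    if count < 2 or start <= 0 or stop <= 0:
        raise ValueError("need count >= 2 and positive bounds")
    ratio = (stop / start) ** (1.0 / (count - 1))
    return [start * ratio ** i for i in range(count)]


def roofline_curves(peak_gflops, bandwidths_gbs, ai_values):
    """Map each memory level to its roofline sampled at `ai_values`."""
    return {
        level: [min(peak_gflops, ai * bw) for ai in ai_values]
        for level, bw in bandwidths_gbs.items()
    }


# Hypothetical ceilings (assumed values, for illustration only).
peak = 2000.0                                  # Gflop/s
bandwidths = {"DRAM": 80.0, "MCDRAM": 400.0}   # GB/s

ai_axis = log_spaced(0.01, 100.0, 9)
curves = roofline_curves(peak, bandwidths, ai_axis)
for level, values in curves.items():
    print(level, [round(v, 2) for v in values])
```

Each list can then be drawn against `ai_axis` on log-log axes with a plotting library such as Matplotlib, and the measured (AI, Gflop/s) point of the code placed on the same graph.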
Computing the Cache-Aware Roofline Model
The Cache-Aware Roofline is a model that can be computed with the latest version of Intel Advisor. For detailed information on Advisor, you can go to our Intel Advisor Tutorial. In Intel Advisor, the rooflines are computed using an internal benchmark run at the beginning of the profiling. Intel Advisor is AVX512 mask-aware. The roofline model of Advisor plots arithmetic intensities and performances for loops or functions. Intel Advisor measures the traffic between the L1 cache and the registers, that is, what the CPU requests from the memory subsystem to perform a computation.
For the Roofline feature, the survey and the tripcounts collections have to be performed together. On the command line on Cori, this corresponds to:
srun -n 1 -c 1 advixe-cl -collect survey --trace-mpi -- ./code
srun -n 1 -c 1 advixe-cl -collect tripcounts \
    -flops-and-masks --trace-mpi -- ./code
In these examples, the analysis is performed on 1 core. Since Intel Advisor splits the analysis by MPI rank (one analysis output folder per MPI rank), it is better to keep a single MPI rank and use 64 OpenMP threads in order to exploit the full potential of the bandwidth and the caches.
The roofline can be visualized using the Advisor GUI as shown in Fig. 4.
For the moment, this feature does not allow several code versions to be compared on a single roofline. We propose another way to visualize the roofline profiling results using Python and Matplotlib, which also makes it possible to compare different loops with different optimizations. The main information from the analysis output files can be extracted to a csv file (other formats are available) using the following command:
advixe-cl --report survey --show-all-columns --format=csv \
    --project-dir ./advisor_analysis --report-output advixe_report.csv
To read the resulting file advixe_report.csv, Tuomas Koskela and Mathieu Lobet have developed a small Python library called pyAdvisor, available on GitHub. This library also makes it possible to plot the rooflines, to compare different Advisor surveys, and to select and sort interesting points in order to build custom plots.
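For readers who only need a few columns, the same kind of selection and sorting can be sketched with the Python standard library. This is not the pyAdvisor API: the function name, the column names (`loop`, `ai`, `gflops`), and the sample rows below are hypothetical, so the column names must be replaced by the headers actually present in your advixe_report.csv.

```python
import csv
import io


def read_points(csv_text, name_col, ai_col, gflops_col):
    """Extract (name, AI, Gflop/s) tuples, sorted by decreasing Gflop/s."""
    points = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            point = (row[name_col], float(row[ai_col]), float(row[gflops_col]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable numeric values
        points.append(point)
    return sorted(points, key=lambda p: p[2], reverse=True)


# Hypothetical sample standing in for the real report content.
SAMPLE = """loop,ai,gflops
loop_a,0.10,4.0
loop_b,0.25,12.5
loop_without_data,,
"""

for name, ai, gflops in read_points(SAMPLE, "loop", "ai", "gflops"):
    print(f"{name}: AI = {ai} flop/byte, {gflops} Gflop/s")
```

With a real report, the text can be obtained with `open("advixe_report.csv").read()` before calling the function. Rows that lack numeric values are skipped instead of raising an error, which keeps the sketch tolerant of summary or header-like rows.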
Here is a list of good webpages and videos on how to use the Cache-Aware roofline model with Intel Advisor:
- YouTube tutorial about Intel Advisor
- Presentation given at the Intel HPC Developer Conference
- Intel Roofline presentation
[1] – Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65-76, 2009.
[2] – Roofline performance model. http://crd.lbl.gov/departments/computerscience/PAR/research/roofline/. Accessed: 2016-07-22.
[3] – A. Ilic et al., IEEE Computer Architecture Letters (2014)
[4] – Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincent. Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor, https://crd.lbl.gov/assets/Uploads/ixpug16-roofline.pdf.
Mathieu Lobet, Last update: January 29, 2017