Using the Roofline Performance Model with PICSAR

The roofline performance model provides a visual analysis of the resources that constrain computation on any system, from single-core to many-core architectures. It consists of a 2D graph combining information on floating-point performance, operational intensity (also referred to as arithmetic intensity), and memory performance. It gives a synthesized view of how efficiently a kernel uses the system resources, how far the kernel is from the machine peak performance, and what the limitations and trade-offs are. The rooflines can be computed using the bandwidth and traffic of the RAM or of a cache level (and, more recently, of the High-Bandwidth Memory that equips KNLs).

roofline_general.png
Fig. 1 – General description of the roofline model.

In a roofline graph (schematically shown in Fig. 1), the ordinate represents the number of floating-point operations per unit of time (in Gflop/s), whereas the abscissa represents the arithmetic intensity (AI), which is the ratio of total floating-point operations to total data movement (bytes). Roofline graphs are composed of 2 parts:

  • The left roofline is a rising slope whose steepness is given by the system peak memory bandwidth, which can be measured on the system itself or taken from the vendor specifications.
  • The right flat roofline is given by the peak floating-point performance, which can only be reached once the arithmetic intensity is high enough that memory bandwidth is no longer the limiting resource.
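
The two parts described above can be summarized as a single bound: attainable performance is the smaller of the peak floating-point rate and the product of arithmetic intensity and memory bandwidth. The following is a minimal sketch of that bound in Python. The function names and the machine values are hypothetical and chosen only for illustration; they are not measurements of any real system.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound in Gflop/s for an arithmetic intensity `ai` (flop/byte)."""
    # Left part: ai * bandwidth (rising slope). Right part: peak (flat roof).
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the rising slope meets the flat roof."""
    return peak_gflops / bandwidth_gbs


# Placeholder machine values (assumptions, not vendor or measured numbers).
PEAK = 2000.0      # Gflop/s
BANDWIDTH = 100.0  # GB/s

print(ridge_point(PEAK, BANDWIDTH))              # 20.0
print(attainable_gflops(0.5, PEAK, BANDWIDTH))   # 50.0 (memory-bound)
print(attainable_gflops(64.0, PEAK, BANDWIDTH))  # 2000.0 (compute-bound)
```

A kernel whose arithmetic intensity lies to the left of the ridge point sits under the slope and is limited by memory bandwidth; to the right of it, the kernel is limited by the floating-point peak.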

There are several roofline models. In this tutorial we will focus on 2 of them:

  • The classical roofline model [1,2] (Fig. 2): Originally, the arithmetic intensity is determined using the bytes measured from the DRAM traffic. The classical roofline model can be generalized to any given memory or cache level if the traffic can be measured.
classical_roofline
Fig. 2 – The classical roofline model.
  • The Cache-Aware Roofline Model (CARM) [3] (Fig. 3): Operational intensity is determined from the total number of bytes transferred from all levels of the memory hierarchy to the CPU, that is, the total amount of bytes delivered to the processing units. From another point of view, it essentially corresponds to the L1 traffic (L1 operational intensity).
Cache_aware_roofline.jpg
Fig. 3 – The Cache-Aware Roofline Model. Here, DRAM -> C Bandwidth means the bandwidth between the DRAM and the CPU core.

Computing the Classical Roofline Model

As shown in Fig. 2, the classical roofline model requires 3 ingredients:

  • Number of flops
  • Number of bytes loaded from DRAM
  • The computational time

In this section, we will use the method of Douglas Doerfler presented on the NERSC website. This method is also presented in [4], where it is applied to different codes on Intel KNL processors. To compute the total number of flops for the full code or for a specific function, the Intel Software Development Emulator toolkit (SDE) is used. For more information on how to use SDE, you can go to our SDE tutorial webpage. The following command shows how to use srun with SDE and PICSAR (here with 4 MPI ranks):

srun -n 4 -c 1 sde -knl -d -iform 1 -omix my_mix.out -i -global_region \
-start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat -- ./picsar

SDE will generate a separate folder per MPI rank containing reports for different metrics, including flops for all kinds of variables. Douglas Doerfler has developed bash scripts to analyze these files and extract the total number of flops from all kinds of floats. These scripts can be downloaded from the Arithmetic Intensity NERSC webpage. You can get the number of flops by parsing all outputs using the script parse-sde.sh:

sh parse-sde.sh sde*

To compute the arithmetic intensity, the number of bytes is now required. For this purpose, we run the code with Intel Vtune from the command line and perform a memory-access collection. For more information about how to use Intel Vtune with PICSAR, you can go to our Vtune tutorial. An example of an srun command with Intel Vtune is the following (Vtune is used with 4 MPI ranks):

srun -n 4 -c 1 amplxe-cl -start-paused -r vtune_results \
-collect memory-access -no-auto-finalize -trace-mpi -- code

Vtune will create a folder called vtune_results containing the analysis output files. To make the profiling results easier to parse, a report then has to be generated from these output files by doing:

amplxe-cl -report summary -r vtune_results > vtune_report

This creates a new file called vtune_report. As for SDE, Douglas Doerfler has developed a bash script to analyze this report and extract the number of bytes transferred from DRAM. This script, called parse-vtune.sh, can be downloaded from the Arithmetic Intensity NERSC webpage. You can finally obtain the total number of bytes transferred from DRAM (you can also get information about MCDRAM and L1) by doing:

sh parse-vtune.sh vtune_report

The final ingredient that you need is the simulation time. PICSAR provides time statistics using internal timers at the end of the runs (printed in the terminal or .err files). Run PICSAR without any profiler to avoid overheads.
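
Once the three ingredients are known, they combine into a single point on the roofline graph. The sketch below shows this arithmetic in Python. The function name and the input values are hypothetical placeholders, and it assumes that the flop count, the byte count and the time all cover the same region of the code.

```python
def roofline_point(total_flops, total_dram_bytes, runtime_s):
    """Return (arithmetic intensity in flop/byte, performance in Gflop/s)."""
    if total_dram_bytes <= 0 or runtime_s <= 0:
        raise ValueError("byte count and runtime must be positive")
    ai = total_flops / total_dram_bytes    # abscissa of the point
    gflops = total_flops / runtime_s / 1e9  # ordinate of the point
    return ai, gflops


# Placeholder inputs (assumptions): in practice these would come from
# parse-sde.sh, parse-vtune.sh and the PICSAR internal timers.
ai, gflops = roofline_point(
    total_flops=4.0e12,
    total_dram_bytes=8.0e12,
    runtime_s=100.0,
)
print(f"AI = {ai:.2f} flop/byte, performance = {gflops:.1f} Gflop/s")
```

Replacing the DRAM byte count with the MCDRAM or L1 byte count gives the arithmetic intensity for that memory level instead.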

In order to draw the rooflines, one needs the bandwidths and the peak performance of the machine. Rooflines can be drawn from the theoretical vendor values or from computed experimental values. In [4] the maximum sustained bandwidth at each level of the cache hierarchy and the maximum sustained floating-point rate for the KNL processor are measured using the Empirical Roofline Toolkit (ERT) and the results are given in the article.
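
As an illustration of the drawing step, the sketch below samples one roofline curve per memory level on a logarithmic arithmetic-intensity axis, using only the Python standard library. The helper names, the peak performance and the bandwidths are all hypothetical placeholders; they are not the ERT measurements reported in [4] and should be replaced by vendor or measured values.

```python
def log_spaced(start, stop, count):
    """Return `count` values from `start` to `stop`, evenly spaced on a log scale."""
    ratio = (stop / start) ** (1.0 / (count - 1))
    return [start * ratio ** i for i in range(count)]


def roofline_curves(peak_gflops, bandwidths_gbs, ai_values):
    """Map each memory level to a list of (AI, Gflop/s) samples of its roofline."""
    return {
        level: [(ai, min(peak_gflops, ai * bw)) for ai in ai_values]
        for level, bw in bandwidths_gbs.items()
    }


# Placeholder machine description (assumptions, not measured values).
PEAK = 2000.0                                  # Gflop/s
BANDWIDTHS = {"DRAM": 80.0, "MCDRAM": 400.0}   # GB/s

ais = log_spaced(0.01, 100.0, 5)
curves = roofline_curves(PEAK, BANDWIDTHS, ais)
for level, points in curves.items():
    print(level, points[0], points[-1])
```

Each list of samples can then be passed to a plotting library with logarithmic axes on both dimensions. A measured kernel point should be compared with the curve of the memory level at which its bytes were counted.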

Computing the Cache-Aware Roofline Model

The Cache-Aware Roofline is a model that can be computed with the latest version of Intel Advisor. For detailed information on Advisor, you can go to our Intel Advisor Tutorial. In Intel Advisor, the rooflines are computed using an internal benchmark run at the beginning of the profiling. Intel Advisor is AVX512 mask-aware. The roofline model of Advisor plots arithmetic intensities and performances for loops or functions. Intel Advisor measures the traffic between the L1 cache and the registers, that is, what the CPU demands from the memory sub-system to perform a computation.

For the Roofline feature, the survey and the tripcounts collections both have to be performed. From the command line on Cori, this corresponds to:

srun -n 1 -c 1 advixe-cl -collect survey --trace-mpi -- ./code
srun -n 1 -c 1 advixe-cl -collect tripcounts \
-flops-and-masks --trace-mpi -- ./code

In these examples, the analysis is performed on 1 core. Since Intel Advisor splits the analysis by MPI rank (one analysis output folder for each MPI rank), it is better to keep a single MPI rank and use 64 OpenMP threads in order to exploit the full potential of the bandwidth and the caches.

The roofline can be visualized using the GUI Advisor interface as shown in Fig. 4.

advisor_roofline.png
Fig. 4 – Screenshot of the cache-aware roofline from Intel Advisor computed on KNL for an old version of PICSAR. Colored circle markers correspond to different loops of the code. The marker color (green, yellow, red) depends on the percentage of simulation time spent in the corresponding  loop.

For the moment, this feature does not allow several code versions to be compared on a single roofline. We propose another way to visualize the roofline profiling results, using Python and Matplotlib, which also makes it possible to compare different loops with different optimizations. The main information from the analysis output files can be exported to a csv file (other formats are available) using the following command:

advixe-cl --report survey --show-all-columns --format=csv \
--project-dir ./advisor_analysis --report-output advixe_report.csv

In order to read the resulting file advixe_report.csv, Tuomas Koskela and Mathieu Lobet have developed a small Python library called pyAdvisor, available on GitHub. This library also makes it possible to plot the rooflines, to compare different Advisor surveys, and to select and sort interesting points in order to produce custom plots.
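
For a quick look at such a report without pyAdvisor, the standard csv module of Python may be enough. The sketch below is not the pyAdvisor API. The function name, the column names and the sample data are hypothetical; the actual header of advixe_report.csv depends on the Advisor version and may be preceded by extra lines, so the column names should be checked against the real file.

```python
import csv
import io


def read_loops(csv_file, name_col, ai_col, gflops_col):
    """Return (name, AI, Gflop/s) tuples for rows with numeric AI and Gflop/s."""
    loops = []
    for row in csv.DictReader(csv_file):
        try:
            name = row[name_col].strip()
            ai = float(row[ai_col])
            gflops = float(row[gflops_col])
        except (KeyError, AttributeError, TypeError, ValueError):
            continue  # skip rows without usable values
        loops.append((name, ai, gflops))
    return loops


# Made-up sample standing in for advixe_report.csv (column names are assumptions).
SAMPLE = """name,self_ai,self_gflops
loop_a,0.25,12.5
loop_b,,
loop_c,1.5,80.0
"""

loops = read_loops(io.StringIO(SAMPLE), "name", "self_ai", "self_gflops")
# List the loops from the highest to the lowest floating-point rate.
for name, ai, gflops in sorted(loops, key=lambda item: item[2], reverse=True):
    print(name, ai, gflops)
```

With a real report, the in-memory sample would be replaced by a file opened with open("advixe_report.csv", newline=""), and reports from several code versions could be read the same way to place their loops on a common plot.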

Here is a list of good webpages and videos on how to use the Cache-Aware roofline model with Intel Advisor:

References

[1] – Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65-76, 2009.

[2] – Roofline performance model. http://crd.lbl.gov/departments/computerscience/PAR/research/roofline/. Accessed: 2016-07-22.

[3] – A. Ilic et al., IEEE Computer Architecture Letters (2014)

[4] – Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincent. Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor, https://crd.lbl.gov/assets/Uploads/ixpug16-roofline.pdf.

Mathieu Lobet, Last update: January 29, 2017
