The exascale challenge

What will change on exascale computers?

Building exascale computing facilities in the next decade will be a major challenge, above all in terms of energy consumption, and will drive hardware and software developments that directly affect the way we implement PIC codes.

Table 1. Energy consumption of different operations, taken from the DARPA exascale report by P. Kogge et al. (2008). The die here refers to the block of semiconductor material that holds the functional units and the fastest memories (first levels of cache). The table gives the energy required to perform different operations on current (year 2015) and projected (year 2019) computer architectures. DP stands for 'Double Precision', FMADD for 'Fused Multiply-ADD' and DRAM for 'Dynamic Random Access Memory'.

Table 1 shows the energy required to perform different operations, ranging from arithmetic operations (fused multiply-add or FMADD) to on-die memory, DRAM, socket and network memory accesses. Since a cost of 1 pJ per operation translates into 1 MW of power on an exascale machine delivering 1 exaflops (a billion billion floating-point operations per second), this simple table shows that as we move off the die, memory accesses and data movement become prohibitively expensive, far more so than arithmetic operations. In addition to this energy limitation, the drastic reduction in the power available per flop and per byte will make data movement less reliable and more sensitive to noise, which also pushes applications towards greater data locality.

At the hardware level, part of this memory-locality problem has been progressively addressed in the past few years by limiting costly network communications and grouping more computing resources that share the same memory ("fat nodes"). However, partly because of cooling issues, packing in more and more of these computing units implies reducing their clock speed. To compensate for the resulting loss of computing power, future CPUs will have much wider data registers that can process, or "vectorize", multiple data elements in a single clock cycle (Single Instruction Multiple Data, or SIMD).

In order to achieve high performance on exascale computers, programmers will thus need to exploit the following three levels of parallelism:

  • Distributed memory level (internode): usually achieved with the Message Passing Interface (MPI) API
  • Shared memory level (intranode): CUDA (on GPUs), OpenMP, OpenACC, OpenCL
  • Vectorization: compiler directives, OpenMP 4.0 directives, Intel C++ intrinsics
