What will change on exascale computers?
Building exascale computing facilities in the next decade will be a major challenge, above all in terms of energy consumption, and will drive hardware and software developments that directly impact the way we implement PIC codes.
Table 1 shows the energy required to perform different operations, ranging from arithmetic operations (fused multiply-add, or FMADD) to on-die memory, DRAM, socket, and network accesses. As an energy cost of 1 pJ per flop amounts to 1 MW of power on an exascale machine delivering 1 exaflop/s (a billion billion floating-point operations per second), this simple table shows that as soon as we go off the die, the cost of memory accesses and data movement becomes prohibitive, far exceeding that of simple arithmetic operations. In addition to this energy limitation, the drastic reduction in power per flop and per byte will make data movement less reliable and more sensitive to noise, which also pushes towards increased data locality in our applications.
At the hardware level, part of this memory-locality problem has been progressively addressed in the past few years by limiting costly network communications and grouping more computing resources that share the same memory (“fat nodes”). However, partly due to cooling issues, packing more and more of these computing units onto a node implies a reduction of their clock speed. To compensate for the resulting loss of computing power, future CPUs will have much wider data registers that can process, or “vectorize”, multiple data elements in a single clock cycle (Single Instruction Multiple Data, or SIMD).
In order to achieve high performance on exascale computers, programmers will thus need to exploit the following three levels of parallelism:
- Distributed-memory level (internode): usually achieved with the Message Passing Interface (MPI) API
- Shared-memory level (intranode): CUDA (on GPUs), OpenMP, OpenACC, OpenCL
- Vectorization: compiler directives, OpenMP 4.0 directives, Intel C++ intrinsics