
Basics of Computer Architecture as related to CFD

Authors:
  • CFD Open Series

Abstract

As we know, computers and software are among the pillars of CFD, and the next two chapters are devoted to them. We have argued that a CFD analyst need not be a computer expert; nevertheless, knowing the essentials never hurts. So it is wise to become familiar with modern computer architectures, as well as software optimization. Even if you could speed up the computational aspects of a processor infinitely fast, you would still have to load and store the data and instructions to and from a memory. Today's processors continue to creep ever closer to infinitely fast processing. But memory performance is increasing at a much slower rate (it will take longer for memory to become infinitely fast). Many of the interesting problems in high performance computing use a large amount of memory. As computers get faster, the size of the problems they tend to operate on also goes up. The trouble is that when you want to solve these problems at high speeds, you need a memory system that is large, yet at the same time fast; a big challenge.
CFD Open Series / Patch 2.45
Basics of Computer Architecture as related to CFD
Edited & Adapted by: Ideen Sadrehaghighi
Annapolis, MD
Memory Management and Smooth Programming
CPU vs. GPU
Contents
1 Introduction to Modern Computer Architectures via CFD ........................................... 8
1.1 Preliminaries ............................................................................................................................................................ 8
2 Memory Systems & Basics of Good Programming ........................................................... 9
2.1 Memory Technology ............................................................................................................................................. 9
2.1.1 Memory Access Time ......................................................................................................... 10
2.1.2 Memory Access Patterns ................................................................................................... 10
2.1.2.1 Loop Interchange to Ease Memory Access Patterns ............................................... 11
2.1.3 Virtual Memory ................................................................................................................. 11
2.2 Registers ................................................................................................................................................................. 11
2.3 Caches ...................................................................................................................................................................... 12
2.3.1 Cache Organization ............................................................................................................ 14
2.3.1.1 Direct-Mapped Cache .............................................................................................. 14
2.3.1.2 Fully Associative Cache ............................................................................................ 15
2.3.1.3 Set-Associative Cache .............................................................................................. 15
2.3.1.4 Instruction Cache ..................................................................................................... 16
2.4 Timing a Program ............................................................................................................................................... 17
2.4.1 Timing a Portion of the Program ....................................................................................... 18
2.4.2 Getting Time Information .................................................................................................. 18
2.5 Subroutine Profiling ........................................................................................................................................... 20
2.6 Loop Optimizations ............................................................................................................................................ 21
2.6.1 Operation Counting ........................................................................................................... 22
2.6.2 Basic Loop Un-Rolling ........................................................................................................ 23
2.6.3 Loops with Low Trip Counts .............................................................................................. 24
2.6.4 Fat Loops ............................................................................................................................ 24
2.6.5 Loops Containing Procedure Calls ..................................................................................... 25
2.6.6 Loops with Branches .......................................................................................................... 25
2.6.7 Nested Loops ..................................................................................................................... 26
2.6.8 Outer Loop Un-Rolling ....................................................................................................... 26
2.6.9 Loop Interchange to Move Computations to the Center .................................................. 27
2.7 Matrix Multiplication ......................................................................................................................................... 27
2.7.1 Matrix Optimization .......................................................................................................... 28
2.7.2 Blocking to Ease Memory Access Patterns ........................................................................ 29
2.8 Shared-Memory Parallel Processors ........................................................................................................... 29
2.8.1 Dependencies .................................................................................................................... 30
2.8.1.1 Control Dependencies ............................................................................................. 31
2.8.1.2 Data Dependencies ................................................................................................. 31
2.8.2 Forming a Flow Graph ....................................................................................................... 32
2.8.2.1 Loop Dependencies ............................................................................................... 33
2.8.2.2 Loop-Carried Dependencies .................................................................................... 34
2.8.2.3 Flow Dependencies ................................................................................................. 35
2.8.2.4 Output Dependencies ............................................................................................. 35
2.8.2.5 Dependencies Within an Iteration .......................................................................... 36
2.9 Pointer Ambiguity in C ...................................................................................................................................... 37
3 Performance of CFD Codes as Related to Hardware (CPU vs. GPU) ......................... 39
3.1 CFD for Next Generation High Performance Computing .................................................................... 39
3.2 Hardware Consideration and CPU vs. GPU Technology ...................................................................... 39
3.2.1 Case Study 1 - 2D Laplace Equation .................................................................. 40
3.2.1.1 Results ..................................................................................................................... 40
3.2.1.2 Future Work - Heterogeneous Computing ............................................. 41
3.2.2 Case Study 2 - Unstructured Grid Based CFD Solvers on Modern Graphics Hardware ..... 41
3.2.2.1 Background and Literature Survey .......................................................................... 41
3.2.2.2 Implementation on Graphics Hardware .................................................................. 42
3.2.2.3 Test Case .................................................................................................................. 43
3.2.3 Case Study 3 - Accelerating CFD Simulation With High Order Finite Difference Method on
Curvilinear Coordinates for Modern GPU Clusters ........................................................................... 44
3.2.3.1 Introduction ............................................................................................................. 44
3.2.3.2 Numerical Methods ................................................................................................. 46
3.2.3.3 Hardware Environment ........................................................................................... 46
3.2.3.4 Programming Implementation and Optimization of HiResX ................................... 48
3.2.3.4.1 Code Introduction .............................................................................................. 48
3.2.3.4.2 Domain Decomposition ...................................................................................... 52
3.2.3.4.3 Hardware Technique .......................................................................................... 52
3.2.3.4.4 CPU-GPU Communication Optimization ............................................................ 53
3.2.3.4.5 GPU-GPU Communication Optimization ............................................................ 54
3.2.3.4.6 Memory Utilization............................................................................................. 55
3.2.3.4.7 CUDA Kernels...................................................................................................... 57
3.2.3.4.8 Inviscid Fluxes ..................................................................................................... 58
3.2.3.4.9 Viscous Fluxes ..................................................................................................... 61
3.2.3.4.10 Kernels in Other Section of The Solver ............................................................... 63
3.2.3.5 Performance Result ................................................................................................. 63
3.2.3.5.1 Speedup Varies with Grid Size ............................................................................ 63
3.2.3.5.2 Performance of Kernels ...................................................................................... 64
3.2.3.5.3 Performance of Running with Multiple GPUs .................................................... 65
3.2.3.5.4 Case Study .......................................................................................................... 68
3.2.3.6 Conclusions .............................................................................................................. 69
3.2.3.7 References ............................................................................................................... 70
3.2.4 Case Study 4 - A Matrix-free GMRES Algorithm on GPU Clusters for Implicit Large Eddy
Simulation ......................................................................................................................................... 72
3.2.4.1 Introduction ............................................................................................................. 72
3.2.4.2 A Brief Overview of the FR/CPR Method ................................................................. 73
3.2.4.3 Matrix-free GMRES Algorithm ................................................................................. 75
3.2.4.4 Numerical Results .................................................................................................... 77
3.2.4.4.1 Multi-GPU Performance and Scalability Results ................................................ 77
3.2.4.4.2 Transitional Flow Over the T106C Low Pressure Turbine Cascade .................... 80
3.2.4.4.3 Flow Over a High-Lift Configuration of the Common Research Model .............. 84
3.2.4.5 Concluding Remarks ................................................................................................ 85
3.2.5 References ......................................................................................................................... 86
4 CFD and HPC Trends Forecasted for 2030 ....................................................................... 90
4.1 Relationship Between Semiconductors, SMT, and Microelectronics............................................. 90
4.2 Comparison of Semiconductor Fabrication Sizes in HPC ................................................................... 91
4.3 Current Status of CFD ........................................................................................................................................ 92
4.3.1 Conceptual Design ............................................................................................................. 92
4.3.2 Preliminary/Detailed Design .............................................................................................. 92
4.3.3 Product Validation and Certification ................................................................................. 93
4.3.4 CFD usage of High Performance Computing (HPC) ........................................................... 93
4.3.5 Turbulence Modeling......................................................................................................... 93
4.3.6 Process Automation .......................................................................................................... 94
4.3.7 Solution Uncertainty and Robustness ............................................................................... 94
4.3.8 Multidisciplinary Analysis and Optimization (MDAO) ....................................................... 94
4.4 Vision of CFD in 2030 as anticipated by NASA ....................................................................................... 95
4.4.1 Technology Roadmap to achieve GC Challenge ................................................................ 96
4.4.1.1 High Performance Computing (HPC) ....................................................................... 97
4.4.1.2 Physical Modeling .................................................................................................... 97
4.4.1.3 Numerical Algorithms .............................................................................................. 98
4.4.1.4 Uncertainty Quantification (UQ) ............................................................................. 99
4.4.1.5 Geometry and Grid Generation ............................................................................. 100
4.4.1.6 Knowledge Extraction ............................................................................................ 101
4.4.1.7 Multidisciplinary Design and Optimization ........................................................... 101
4.4.2 Recommendations ........................................................................................................... 102
4.5 HPC Envisioned by Department of Energy (DOE) ............................................................................ 103
4.5.1 What is Exa-scale Computing? ........................................................................................ 104
4.5.2 Why Exa-scale? ................................................................................................................ 104
4.5.3 Range of Applications may be Transformed by Going to the Exa-scale .......................... 104
4.5.3.1 Aerospace, Airframes and Jet Turbines ................................................................. 104
4.5.3.2 Combustion ........................................................................................................... 106
4.5.3.3 Climate Modeling .................................................................................................. 107
4.5.3.4 Computational Biology .......................................................................................... 108
4.5.3.5 Materials Science................................................................................................... 109
4.5.3.6 Nuclear Engineering .............................................................................................. 110
4.5.3.7 Others Disciplines .................................................................................................. 111
4.5.4 Challenges in Going to the Exa-scale ............................................................................... 111
4.5.4.1 The Hardware Challenges ...................................................................................... 112
4.5.4.2 The Applied Mathematics Challenges ................................................................... 112
4.5.4.3 Mathematical Modeling ........................................................................................ 113
4.5.4.4 Numerical Algorithms ............................................................................................ 113
4.5.4.5 The Algorithmic Challenges ................................................................................... 114
4.5.4.6 Computer Science Challenges ............................................................................... 115
4.5.4.7 Educational Challenges .......................................................................................... 116
List of Tables
Table 2.3.1 Memory Access Speed on a DEC Alpha ............................................................................................. 12
Table 3.2.1 Present GPU specs. Titan V has highest double precision operation performance. RTX
2080 Ti utilizes the newest architecture and provides highest single precision operation performance
....................................................................................................................................................................................................... 47
Table 3.2.2 Present spec of CPU. The double precision operation performance of E5-2680v3 is
slightly higher than RTX 2080 Ti’s. ................................................................................................................................ 47
Table 3.2.3 Maximum grid capacity of HiResX solver in different GPUs ..................................................... 56
Table 3.2.4 Results of the percentage of a typical simulation iteration that is spent in different parts
of the GMRES algorithm. Results for different solution order and cell types ............................................... 78
Table 3.2.5 CPU and GPU simulation final settings for the T106C test case ............................................. 81
Table 3.2.6 Comparison of mean lift and drag coefficients between the CPU simulation and different
GPU simulations. Relative error is calculated with respect to the CPU simulation ................................... 83
Table 4.5.1 Three Order of Magnitude Jump ...................................................................................................... 111
Table 4.5.2 Potential Exa-scale Computer Design for 2018 and Its Relationship to Current HPC
Designs (DOE) ...................................................................................................................................................................... 112
List of Figures
Figure 1.1.1 Contributions From Other Disciplines to CFD ............................................................................... 8
Figure 2.3.1 Cache Lines can come from Different Parts of Memory ........................................................... 12
Figure 2.3.2 Many memory addresses map to the same cache line ............................................................. 14
Figure 2.3.3 Two-Way Set-Associative Cache ........................................................................................................ 16
Figure 2.5.1 Sharp Profiling (left) vs. Flat Profiling (right) .......................................................... 20
Figure 2.8.1 (a) Control Dependency; (b) A section of your program; (c) Expensive Operation
Moved so that it's Rarely Executed ................................................................................................................................ 31
Figure 2.8.2 Types of Data Dependencies ............................................................................................................... 32
Figure 2.8.3 Flow Graph for Data Flow Analysis .................................................................................................. 33
Figure 2.8.4 Flow Graph including a loop .............................................................................................................. 34
Figure 3.2.1 Architecture Differences Between CPU and GPU ....................................................................... 40
Figure 3.2.2 Result for a V cycle Multigrid .............................................................................................................. 40
Figure 3.2.3 Heterogeneous Computing using CPUs and GPUs ...................................................................... 41
Figure 3.2.4 Pressures at the Surface and Plane for the NACA 0012 (Left) and at the Surface for
the Missile (Right) ................................................................................................................................................................. 42
Figure 3.2.5 Running Times in Double Precision Per Element Per Iteration for the NACA 0012 .... 43
Figure 3.2.6 Framework of modern GPU cluster. The computational nodes are connected with high
speed network. GPUs deliver the majority of performance of modern GPU cluster ................................. 46
Figure 3.2.7 Several typical PCIe root architectures of GPU server. For type (a), each NUMA node
has PCIe switch attached on them, and there are two GPUs mounted on each PCIe switch. For type
(b), only one PCIe switch is mounted on each NUMA node, and there are 4 GPUs attached on one PCIe
switch. For type (c), all GPUs are mounted on one NUMA node by two PCIe switches. For type (d), all
GPUs are mounted on one NUMA node. ....................................................................................................................... 48
Figure 3.2.8 Main structure of HiResX summarized by pseudo code ........................................................... 49
Figure 3.2.9 Framework of communication of HiResX without GPU peer to peer communication
technology support. For communication between processes that are all running on CPU, if processes
are located in the same node, they exchange data within RAM with MPI-3 shared memory technique,
see red path. If processes are located in different nodes, the standard MPI communication is utilized,
see yellow path that connects triangle pair. For processes with GPU acceleration, data on GPU should
be downloaded back to CPU, and then processes communicate in the same way as processes without
acceleration. ............................................................................................................................................................................. 50
Figure 3.2.10 A global glance of execution procedures of HiResX running on GPU. Except for
initialization and data performed on CPU, all computations are performed on GPU while CPU is used
to schedule kernels of GPU only. "Data 0" represents the initial data computed in CPU and uploaded
to GPU. "Data 1" represents data exchanged between processes. "Data 3" represents flow field data
to be written, and it should be downloaded to CPU and written with a new thread. ............................... 50
Figure 3.2.11 Domain decomposition strategies of HiResX. The domain is divided into several blocks
according to performance of CPU and GPU in order to balance workloads of each process. For the
blocks computed by CPU, the whole block is computed by a CPU process. For the blocks computed by
GPU, each grid cell is computed by one CUDA thread. ........................................................................................... 51
Figure 3.2.12 CPU-GPU and GPU-GPU memory access models. In (a), red path is the optimal one,
while pink path is inferior because CPU to GPU memory access across NUMA nodes is worse than
local access. In (b), if GPUs support peer-to-peer (P2P) communication technology, green path
indicates that two GPUs communicate by Nvlink, which is the fast path. Purple path indicates that
two GPUs communicate by PCIe switch (PLX), which is most efficient communication way without
Nvlink. Yellow path means that two GPUs communicate by host (CPU). Red path is the worst one
because P2P is not supported across NUMA nodes, and memory access between two GPU in this way
must be transferred by both CPUs. ................................................................................................................................ 52
Figure 3.2.13 Organization of CUDA threads for NVIDIA’s GPU and its mapping to hardware
structure .................................................................................................................................................................................... 57
Figure 3.2.14 Code Structure For Derivative of Inviscid Flux Computed With Original AFWENO . 58
Figure 3.2.15 Comparison of WENO interpolations. In the original WENO interpolation, the
characteristic space is built at i + 1/2, and u±i+1/2 are computed simultaneously in a loop. In the modified
WENO interpolation, the characteristic space is built at i, and u−i+1/2 and u+i−1/2 are computed
simultaneously in a loop. In the modified WENO interpolation, the point stencils are the same, so the
characteristic variables and smoothness indicators are shared, which reduces calculation and global
memory access. ......................................................................................................................................................
Figure 3.2.16 Code structure of derivative of inviscid flux computed with modified AFWENO ...... 61
Figure 3.2.17 Code structure of the derivative of viscous flux summarized by a pseudo code ........ 62
Figure 3.2.18 Global performance varying with grid sizes. The speedup is defined as the ratio of
elapsed time of running on CPU with one core to lapsed time of running with a GPU ............................ 63
Figure 3.2.19 Performance of main parts of the solver. Generally, the parts that contain large
amount of computation get apparent acceleration. In the sections of time advance and viscous flux,
due to the high occupancies of kernels, both GPUs get higher speedups. For evaluation of time step
and inviscid flux, Titan V gets higher speedups than RTX 2080 Ti, because Titan V has more double
precision operation processing unit, which means higher double precision operation performance,
see Table 3.2.1. ...................................................................................................................................................... 65
Figure 3.2.20 Strong scaling result of different GPU. Both GPUs’ scalabilities deviate from linear
result, but the efficiencies are all larger than 75% when 10 GPUs are utilized. The drops of scalability
come from the fact that when grid size decreases, the speedup drops too, see Figure 3.2.18. Scalability
of Titan V drops faster than RTX 2080 Ti’s, because the speedup of Titan V drops faster than RTX
2080 Ti’s when the grid size decreases. ....................................................................................................................... 66
Figure 3.2.21 Cases configuration. In figure (a), there are 100 points in the spanwise direction. Data
exchange in vertical direction (red) is larger than data exchange in horizontal direction (blue). In
figure (b). five cases are set to test the communication performance of different strategies of
computations assigned to different GPUs. The number in block presents the GPU ID in which
computation is performed. Blocks with the same color indicate that they are in the same PCIe switch.
In cases I and II, computations are performed on GPUs that are in two different PCIe switches
respectively. In cases III to V, computations are performed on GPUs belonging to both PCIe
switches. For case III, the maximum data transfers are performed across PCIe switch. In case IV, the
maximum data transfers are performed within the same PCIe switch. In case V, all data transfers are
performed across PCIe switch. ......................................................................................................................................... 67
Figure 3.2.22 Performance of HiResX running on different GPU topology. Cases in which all GPUs
are in the same PCIe switch get best communication efficiency, see case I and II. For case in which
GPUs are located in different PCIe switches, if GPU devices are optimally assigned according to
communication load, communication efficiency can be also improved, see case IV. Without GPU to
GPU optimization, the communication efficiency is lower, see case III and V. ............................................ 68
Figure 3.2.23 Shock-cell spacing Lh/h as a function of the fully expanded jet Mach number Mj . Our
result is more close to theoretical[17] and experimental results [18, 19], compared to the LES results
of Berland et al.[16]. ............................................................................................................................................................. 68
Figure 3.2.24 Strouhal number St = fh/Uj of the fundamental screech tone as a function of the fully
expanded jet Mach number Mj . Our result is in good agreement with experimental result of Ref.[18],
which is better than the LES result of Ref.[16]. ......................................................................................................... 69
Figure 3.2.25 Instantaneous snapshot of spanwise vorticity ωz and the dilatation in the plane z/h =
2.5 as the background. The isosurface of vorticity is colored with the amplitude of velocity .............. 69
Figure 3.2.26 NVIDIA V100 GPU Speedup compared to a single CPU core of an Intel Xeon CPU E5-
2620 chip .................................................................................................................................................................................. 77
Figure 3.2.27 Strong scalability study with GPU simulations on the KU cluster for a mesh with 149k
hex elements ............................................................................................................................................................................ 78
Figure 3.2.28 Strong scalability study with GPU simulations on Summit for a mesh with 15.6M hex
elements .................................................................................................................................................................................... 79
Figure 3.2.29 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic
time Tc using different physical time steps ................................................................................................................. 80
Figure 3.2.30 Mesh provided by Alhawwary et al. [38] and time-averaged solution contours of the
p2 simulation with GPU ...................................................................................................................................................... 80
Figure 3.2.31 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic
time Tc using different GMRES linear convergence tolerance ............................................................................ 81
Figure 3.2.32 Comparison of the time- and spanwise-averaged pressure coefficient between the
CPU simulation and different GPU simulations ........................................................................................................ 82
Figure 3.2.33 Comparison of the time- and spanwise-averaged coefficient of streamwise friction
between the CPU simulation and different GPU simulations .............................................................................. 82
Figure 3.2.34 Comparison of the PSD of pressure at wake point(1) ............................................................ 83
Figure 3.2.35 Views of the high-order (Q2) hybrid mesh generated by Pointwise with over 4 million
elements .................................................................................................................................................................................... 85
Figure 3.2.36 Instantaneous flow field showing iso-surfaces of the Q-criterion colored by the
stream-wise velocity for a p2 simulation .................................................................................................................... 86
Figure 4.1.1 Relationship between Microelectronics, SMT and Semiconductor.................................... 91
Figure 4.2.1 Changing Predictions About Semiconductor Sizes .................................................................... 91
Figure 4.4.1 Proposed New Computational Sciences Program Structure .............................................. 103
Figure 4.5.2 Computer speed and memory requirements for the Grand Challenge .......................... 105
Figure 4.5.3 A supersonic Jet Engine Nozzle Rapidly Accelerates High-Pressure Gas into the
Atmosphere ........................................................................................................................................................................... 106
Figure 4.5.4 Detail View of 9-Billion Atom Molecular Dynamics Simulation Instability .................. 109
1 Introduction to Modern Computer Architectures via CFD
1.1 Preliminaries
As we know, computers and software are among the pillars of CFD, and the next two chapters are devoted to them. We have argued that a CFD analyst need not be a computer expert; nevertheless, knowing the essentials never hurts (see Figure 1.1.1). So it is wise to become familiar with modern computer architectures, as well as software optimization, as detailed in Severance & Dowd¹. Even if you could speed up the computational aspects of a processor infinitely fast, you would still have to load and store the data and instructions to and from a memory. Today's processors continue to creep ever closer to infinitely fast processing. But memory performance is increasing at a much slower rate (it will take longer for memory to become infinitely fast). Many of the interesting problems in high performance computing use a large amount of memory. As computers get faster, the size of the problems they tend to operate on also goes up. The trouble is that when you want to solve these problems at high speeds, you need a memory system that is large, yet at the same time fast; a big challenge.

¹ Charles Severance, Kevin Dowd, “High Performance Computing”, Rice University, Houston, Texas, 2012.

Figure 1.1.1 Contributions From Other Disciplines to CFD
2 Memory Systems & Basics of Good Programming
Today's processors continue to creep ever closer to infinitely fast processing. But memory
performance is increasing at a much slower rate (it will take longer for memory to become infinitely
fast). Many of the interesting problems in high performance computing use a large amount of
memory. As computers are getting faster, the size of problems they tend to operate on also goes up.
The trouble is that when you want to solve these problems at high speeds, you need a memory system
that is large, yet at the same time fast; a big challenge. Possible approaches include the following:
  • Every memory system component can be made individually fast enough to respond to every memory access request.
  • Slow memory can be accessed in a round-robin fashion (hopefully) to give the effect of a faster memory system.
  • The memory system design can be made wide so that each transfer contains many bytes of information.
  • The system can be divided into faster and slower portions and arranged so that the fast portion is used more often than the slow one.
Again, economics are the dominant force in the computer business. A cheap, statistically optimized
memory system will be a better seller than a prohibitively expensive, blazingly fast one, so the first
choice is not much of a choice at all. But these choices, used in combination, can attain a good fraction
of the performance you would get if every component were fast. Chances are very good that your
high performance workstation incorporates several or all of them. Once the memory system has been
decided upon, there are things we can do in software to see that it is used efficiently. A compiler that
has some knowledge of the way memory is arranged and the details of the caches can optimize their
use to some extent. The other place for optimizations is in user applications, as we'll see later in the
book. A good pattern of memory access will work with, rather than against, the components of the
system. Next, we discuss how the pieces of a memory system work. We look at how patterns of data
and instruction access factor into your overall runtime, especially as CPU speeds increase. We also
talk a bit about the performance implications of running in a virtual memory environment.
2.1 Memory Technology
Almost all fast memories used today are semiconductor-based (magnetic core memory is still used in applications where radiation hardness, i.e., resistance to changes caused by ionizing radiation, is important). They come in two flavors: Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM). The term random means that you can address memory locations in any order. This distinguishes them from serial memories, where you have to step through all intervening locations to get to the particular one you
are interested in. An example of a storage medium that is not random is magnetic tape. The terms
dynamic and static have to do with the technology used in the design of the memory cells. DRAMs are
charge-based devices, where each bit is represented by an electrical charge stored in a very small
capacitor. The charge can leak away in a short amount of time, so the system has to be continually
refreshed to prevent data from being lost. The act of reading a bit in DRAM also discharges the bit,
requiring that it be refreshed. It's not possible to read the memory bit in the DRAM while it's being
refreshed.
SRAM is based on gates, and each bit is stored in four to six connected transistors. SRAM memories
retain their data as long as they have power, without the need for any form of data refresh. DRAM
offers the best price/performance, as well as highest density of memory cells per chip. This means
lower cost, less board space, less power, and less heat. On the other hand, some applications such as
cache and video memory require higher speed, to which SRAM is better suited. Currently, you can
choose between SRAM and DRAM at slower speeds, down to about 50 nanoseconds (ns). SRAM has
access times down to about 7 ns at higher cost, heat, power, and board space. In addition to the basic
technology to store a single bit of data, memory performance is limited by the practical
considerations of the on-chip wiring layout and the external pins on the chip that communicate the
address and data information between the memory and the processor.
2.1.1 Memory Access Time
The amount of time it takes to read or write a memory location is called the memory access time.
Whereas the access time says how quickly you can reference a memory location, cycle time describes
how often you can repeat references. They sound like the same thing, but they're not. For instance, if
you ask for data from DRAM chips with a 50-ns access time, it may be 100 ns before you can ask for
more data from the same chips. This is because the chips must internally recover from the previous
access. Also, when you are retrieving data sequentially from DRAM chips, some technologies have
improved performance. On these chips, data immediately following the previously accessed data may
be accessed as quickly as 10 ns.
2.1.2 Memory Access Patterns
The best pattern is the most straightforward: increasing and unit sequential. For an array with a
single dimension, stepping through one element at a time will accomplish this. For multiply-
dimensioned arrays, access is fastest if you iterate on the array subscript offering the smallest stride
or step size. In FORTRAN programs, this is the leftmost subscript; in C, it is the rightmost. The
FORTRAN loop below has unit stride, and therefore will run quickly:
DO J = 1 , N
DO I = 1 , N
A (I , J) = B (I , J) + C (I ,J ) * D
ENDDO
ENDDO
In contrast, the next loop is slower because its stride is N. As N increases from one to the length of the
cache line (adjusting for the length of each element), the performance worsens. Once N is longer than
the length of the cache line (again adjusted for element size), the performance won't decrease:
DO J = 1 , N
DO I =1 , N
A (J , I) = B (J , I) + C (J , I) * D
ENDDO
ENDDO
Here's a unit-stride loop like the previous one, but written in C:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = a[i][j] + c[i][j] * d;
2.1.2.1 Loop Interchange to Ease Memory Access Patterns
Loop interchange is a good technique for lessening the impact of stride memory references. Let's
revisit our FORTRAN loop with non-unit stride. The good news is that we can easily interchange the
loops; each iteration is independent of every other:
DO J = 1 , N
DO I = 1 , N
A (J , I) = B (J , I) + C (J , I) * D
ENDDO
ENDDO
After interchange, A, B, and C are referenced with the leftmost subscript varying most quickly. This
modification can make an important difference in performance. We traded three N-stride memory
references for unit strides:
DO I =1 , N
DO J = 1 , N
A (J , I) = B (J , I) + C (J , I) * D
ENDDO
ENDDO
2.1.3 Virtual Memory
Virtual memory decouples the addresses used by the program (virtual addresses) from the actual
addresses where the data is stored in memory (physical addresses). Your program sees its address
space starting at 0 and working its way up to some large number, but the actual physical addresses
assigned can be very different. It gives a degree of flexibility by allowing all processes to believe they
have the entire memory system to themselves. Another trait of virtual memory systems is that they
divide your program's memory up into pages, i.e., chunks of memory. Page sizes vary from 512 bytes to 1 MB or
larger, depending on the machine. Pages don't have to be allocated contiguously, though your
program sees them that way. By being separated into pages, programs are easier to arrange in
memory, or move portions out to disk.
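As a quick, hedged illustration (a minimal sketch assuming a POSIX system; the variable x below is just a placeholder for any program data), the following C fragment prints the page size the operating system uses and the virtual address of a variable. The address printed is a virtual address; the physical location of the data may be anywhere, and may even change as pages are moved:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);   /* size of one page in bytes */
    double x = 0.0;                           /* any program variable */

    printf("page size      : %ld bytes\n", page_size);
    printf("virtual address: %p\n", (void *)&x);
    printf("virtual page   : %lu\n",
           (unsigned long)((uintptr_t)&x / (uintptr_t)page_size));
    return 0;
}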
2.2 Registers
At least the top layer of the memory hierarchy, the CPU registers, operate as fast as the rest of the
processor. The goal is to keep operands in the registers as much as possible. This is especially
important for intermediate values used in a long computation such as:
X = G * 2.41 + A / W - W * M
While computing the value of A divided by W, we must store the result of multiplying G by 2.41. It
would be a shame to have to store this intermediate result in memory and then reload it a few
instructions later. On any modern processor with moderate optimization, the intermediate result is
stored in a register. Also, the value W is used in two computations, and so it can be loaded once and
used twice to eliminate a wasted load. Compilers have been very good at detecting these types of
optimizations and efficiently making use of the available registers since the 1970s. Adding more
registers to the processor has some performance benefit. It's not practical to add enough registers to
the processor to store the entire problem data. So we must still use the slower memory technology.
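A hedged sketch of the same idea in C (the file name foo.c is hypothetical, and the exact register usage depends on your compiler and target):

/* foo.c : with optimization enabled, the compiler keeps the intermediate
   result of g * 2.41 in a register, and the value of w is held in a single
   register and reused for both the division and the multiplication. */
double f(double g, double a, double w, double m)
{
    return g * 2.41 + a / w - w * m;
}

Compiling with, say, gcc -O2 -S foo.c and reading the generated foo.s typically shows the whole expression evaluated in floating-point registers, with no intermediate stores to memory.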
2.3 Caches
Once we go beyond the registers in the memory hierarchy, we encounter caches. Caches are small
amounts of SRAM that store a subset of the contents of the memory. The hope is that the cache will
have the right subset of main memory at the right time. The actual cache architecture has had to
change as the cycle time of the processors has improved. The
processors are so fast that off-chip SRAM chips are not even fast
enough. This has led to a multilevel cache approach with one, or
even two, levels of cache implemented as part of the processor.
Table 2.3.1 shows the approximate speed of accessing the
memory hierarchy on a 500-MHz DEC Alpha. When every
reference can be found in a cache, you say that you have a 100% hit
rate. Generally, a hit rate of 90% or better is considered good for a
level-one (L1) cache. In level-two (L2) cache, a hit rate of above
50% is considered acceptable. Below that, application
performance can drop off steeply. One can characterize the average read performance of the memory
hierarchy by examining the probability that a particular load will be satisfied at a particular level of
the hierarchy. For example, assume a memory architecture with an L1 cache speed of 10 ns, L2 speed
of 30 ns, and memory speed of 300 ns. If a memory reference were satisfied from L1 cache 75% of
the time, L2 cache 20% of the time, and main memory 5% of the time, the average memory
performance would be:
(0.75 * 10 ) + ( 0.20 * 30 ) + ( 0.05 * 300 ) = 28.5 ns
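For reference, here is a minimal C sketch of that calculation; the hit rates and latencies are the illustrative figures used above, not measurements of any particular machine:

#include <stdio.h>

/* Weighted average access time for a three-level memory hierarchy. */
static double avg_access_ns(double l1_hit, double l2_hit, double mem_hit,
                            double l1_ns, double l2_ns, double mem_ns)
{
    return l1_hit * l1_ns + l2_hit * l2_ns + mem_hit * mem_ns;
}

int main(void)
{
    /* 75% of loads hit in L1, 20% in L2, and 5% go to main memory. */
    printf("average access time = %.1f ns\n",
           avg_access_ns(0.75, 0.20, 0.05, 10.0, 30.0, 300.0));
    return 0;
}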
You can easily see why it's important to have an L1 cache hit rate of 90% or higher. Given that a cache
holds only a subset of the main memory at any time, it's important to keep an index of which areas of
the main memory are currently stored in the cache. To reduce the amount of space that must be
dedicated to tracking which memory areas are in cache, the cache is divided into a number of equal
sized slots known as lines. Each line contains some number of sequential main memory locations,
generally four to sixteen integers or real numbers. Whereas the data within a line comes from the
same part of memory, other lines can contain data that is far separated within your program, or
perhaps data from somebody else's program, as in Figure 2.3.1 (Cache lines can come from
different parts of memory). When you ask for something from memory, the computer checks to see
if the data is available within one of these cache lines. If it is, the data is returned with a minimal
delay. If it's not, your program may be delayed while a new line is fetched from main memory.
Table 2.3.1 Memory Access Speed on a DEC Alpha
Registers: 2 ns
L1 On-Chip Cache: 4 ns
L2 On-Chip Cache: 5 ns
L3 Off-Chip Cache: 30 ns
Memory: 220 ns
Figure 2.3.1 Cache Lines can come from Different Parts of Memory
Of course, if a new line is brought in, another has to be thrown out. If you're lucky, it won't be the one
containing the data you are just about to need.
On multiprocessors (several CPUs), written data must be returned to main memory so the rest of the
processors can see it, or all other processors must be made aware of local cache activity. The problem
can become very complex in a multiprocessor system. Caches are effective because programs often exhibit characteristics that help keep the hit rate high; in particular, they often step through data sequentially, one element after the other. This access pattern is called unit stride because the address of each successive data element is incremented by one and all the data retrieved into the cache is
used. The following loop is a unit-stride loop:
DO I = 1 , 1000000
SUM = SUM + A (I)
END DO
When a program accesses a large data structure using non-unit stride, performance suffers because
data is loaded into cache that is not used. For example:
DO I = 1 , 1000000 , 8
SUM = SUM + A (I)
END DO
This code would experience the same number of cache misses as the previous loop, and the same
amount of data would be loaded into the cache. However, the program needs only one of the eight
32-bit words loaded into cache. Even though this program performs one-eighth the additions of the
previous loop, its elapsed time is roughly the same as the previous loop because the memory
operations dominate performance. While this example may seem a bit contrived, there are several
situations in which non-unit strides occur quite often. First, when a FORTRAN two-dimensional
array is stored in memory, successive elements in the first column are stored sequentially followed
by the elements of the second column. If the array is processed with the row iteration as the inner
loop, it produces a unit-stride reference pattern as follows:
REAL*4 A (200 , 200)
DO J = 1 , 200
DO I = 1 , 200
SUM = SUM + A (I , J)
END DO
END DO
Interestingly, a FORTRAN programmer would most likely write the loop (in alphabetical order) as
follows, producing a non-unit stride of 800 bytes between successive load operations:
REAL*4 A (200 , 200)
DO I = 1 , 200
DO J = 1 , 200
SUM = SUM + A (I , J)
END DO
END DO
Because of this, some compilers can detect this suboptimal loop order and reverse the order of the
loops to make best use of the memory system. As we will see, however, this code transformation may
produce different results, and so you may have to give the compiler permission to interchange these
loops in this particular example (or, after reading this book, you could just code it properly in the first
place).
A second common situation in which non-unit stride access arises is when a program traverses a linked list:

while ( ptr != NULL ) ptr = ptr -> next;
The next element that is retrieved is based on the contents of the current element. This type of loop
bounces all around memory in no particular pattern. This is called pointer chasing and there are no
good ways to improve the performance of this code. A third pattern often found in certain types of
codes is called gather (or scatter) and occurs in loops such as:
DO I = 1 , N
SUM = SUM + ARR ( IND (I) )
END DO
where the IND array contains offsets into the ARR array. Again, like the linked list, the exact pattern
of memory references is known only at runtime when the values stored in the IND array are known.
Some special-purpose systems have special hardware support to accelerate this particular operation.
2.3.1 Cache Organization
The process of pairing memory locations with cache lines is called mapping. Of course, given that a
cache is smaller than main memory, you have to share the same cache lines for different memory
locations. In caches, each cache line has a record of the memory address (called the tag) it represents
and perhaps when it was last used. The tag is used to track which area of memory is stored in a
particular cache line. The way memory locations (tags) are mapped to cache lines can have a
beneficial effect on the way your program runs, because if two heavily used memory locations map
onto the same cache line, the miss rate will be higher than you would like it to be. Caches can be
organized in one of several ways: direct mapped, fully associative, and set associative.
2.3.1.1 Direct-Mapped Cache
Direct mapping, as presented in Figure 2.3.2 (Many memory addresses map to the same cache line),
is the simplest algorithm for deciding how memory maps onto the cache. Say, for example, that your
computer has a 4-KB cache. In a direct mapped scheme, memory location 0 maps into cache location
0, as do memory locations 4K, 8K, 12K, etc. In other words, memory addresses map onto the cache modulo the cache size.
Figure 2.3.2 Many memory addresses map to the same cache line
Another way to think about it is to imagine a metal spring with a chalk line marked down the side.
Every time around the spring, you encounter the chalk line at the same place modulo the
circumference of the spring. If the spring is very long, the chalk line crosses many coils, the analog
being a large memory with many locations mapping into the same cache line. Problems occur when
alternating runtime memory references in a direct-mapped cache point to the same cache line. Each
reference causes a cache miss and replaces the entry just replaced, causing a lot of overhead. The
popular word for this is thrashing. When there is lots of thrashing, a cache can be more of a liability
than an asset because each cache miss requires that a cache line be refilled, an operation that moves
more data than merely satisfying the reference directly from main memory. It is easy to construct a
pathological case that causes thrashing in a 4-KB direct-mapped cache:
REAL*4 A (1024), B (1024)
COMMON /STUFF/ A , B
DO I=1,1024
A(I) = A(I) * B(I)
END DO
The arrays A and B both take up exactly 4 KB of storage, and their inclusion together in COMMON
assures that the arrays start exactly 4 KB apart in memory. In a 4-KB direct mapped cache, the same
line that is used for A(1) is used for B(1), and likewise for A(2) and B(2), etc., so alternating references
cause repeated cache misses. To fix it, you could either adjust the size of the array A, or put some
other variables into COMMON, between them. For this reason one should generally avoid array
dimensions that are close to powers of two.
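A hedged C sketch of the second fix (the sizes are those of the example above; a struct is used so the two arrays stay contiguous, the way COMMON lays them out in the FORTRAN version). Padding inserted between the arrays shifts B onto different cache lines than A:

/* Two 4-KB arrays that would otherwise collide in a 4-KB direct-mapped
   cache: a[i] and b[i] would map to the same cache line. The pad array
   shifts b so that alternating references no longer thrash. */
struct stuff {
    float a[1024];
    float pad[16];    /* 64 bytes of padding, roughly one cache line */
    float b[1024];
};

static struct stuff s;

void scale(void)
{
    for (int i = 0; i < 1024; i++)
        s.a[i] = s.a[i] * s.b[i];
}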
2.3.1.2 Fully Associative Cache
At the other extreme from a direct mapped cache is a fully associative cache, where any memory
location can be mapped into any cache line, regardless of memory address. Fully associative caches
get their name from the type of memory used to construct them: associative memory. Associative
memory is like regular memory, except that each memory cell knows something about the data it
contains. When the processor goes looking for a piece of data, the cache lines are asked all at once
whether any of them has it. The cache line containing the data holds up its hand and says "I have it";
if none of them do, there is a cache miss. It then becomes a question of which cache line will be
replaced with the new data. Rather than map memory locations to cache lines via an algorithm, like
a direct-mapped cache, the memory system can ask the fully associative cache lines to choose among
themselves which memory locations they will represent. Usually the least recently used line is the
one that gets overwritten with new data. The assumption is that if the data hasn't been used in quite
a while, it is least likely to be used in the future. Fully associative caches have superior utilization
when compared to direct mapped caches. It's difficult to find real-world examples of programs that
will cause thrashing in a fully associative cache. The expense of fully associative caches is very high,
in terms of size, price, and speed. The associative caches that do exist tend to be small.
2.3.1.3 Set-Associative Cache
Now imagine that you have two direct mapped caches sitting side by side in a single cache unit as
shown in Figure 2.3.3 (Two-way set-associative cache). Each memory location corresponds to a
particular cache line in each of the two direct-mapped caches. The one you choose to replace during
a cache miss is subject to a decision about whose line was used last, the same way the decision was made in a fully associative cache, except that now there are only two choices. This is called a set-
associative cache. Set-associative caches generally come in two and four separate banks of cache.
These are called two-way and four-way set associative caches, respectively. Of course, there are
benefits and drawbacks to each type of cache. A set associative cache is more immune to cache
thrashing than a direct-mapped cache of the same size, because for each mapping of a memory
address into a cache line, there are two or more choices where it can go. The beauty of a direct-
mapped cache, however, is that it's easy to implement and, if made large enough, will perform
roughly as well as a set-associative design. Your machine may contain multiple caches for several
different purposes. Here's a little program for causing thrashing in a 4-KB two-way set- associative
cache:
REAL*4 A(1024), B(1024), C(1024)
COMMON /STUFF/ A,B,C
DO I=1,1024
A(I) = A(I) * B(I) + C(I)
END DO
Like the previous cache thrasher program, this forces repeated accesses to the same cache lines,
except that now there are three variables contending for the same mapping instead of two.
Again, the way to fix it would be to change the size of the arrays or insert something in between them,
in COMMON. By the way, if you accidentally arranged a program to thrash like this, it would be hard
for you to detect it; aside from a feeling that the program runs a little slow. Few vendors provide
tools for measuring cache misses.
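As a sketch of the padding fix, the arrays can be separated in COMMON by small dummy arrays; the pad length of 129 words used here is illustrative, and any length that shifts B and C off A's set mapping will do:
REAL*4 A(1024), PAD1(129), B(1024), PAD2(129), C(1024)
COMMON /STUFF/ A, PAD1, B, PAD2, C
DO I=1,1024
A(I) = A(I) * B(I) + C(I)
END DO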
2.3.1.4 Instruction Cache
So far we have glossed over the two kinds of information you would expect to find in a cache between
main memory and the CPU: instructions and data. But if you think about it, the demand for data is
separate from the demand for instructions. In superscalar processors, for example, it's possible to
execute an instruction that causes a data cache miss alongside other instructions that require no data
from cache at all, i.e., they operate on registers. It doesn't seem fair that a cache miss on a data
reference in one instruction should keep you from fetching other instructions because the cache is
tied up. Furthermore, a cache depends on locality of reference between bits of data and other bits of
data or instructions and other instructions, but what kind of interplay is there between instructions
and data? It would seem possible for instructions to bump perfectly useful data from cache, or vice
versa, with complete disregard for locality of reference.
Figure 2.3.3 Two-Way Set-Associative Cache
Many designs from the 1980s used a single cache for both instructions and data. But newer designs
are employing what is known as the Harvard Memory Architecture, where the demand for data is
segregated from the demand for instructions. Main memory is still a single large pool, but these
processors have separate data and instruction caches, possibly of different designs. By providing two
independent sources for data and instructions, the aggregate rate of information coming from
memory is increased, and interference between the two types of memory references is minimized.
Also, instructions generally have an extremely high level of locality of reference because of the
sequential nature of most programs. Because the instruction caches don't have to be particularly
large to be effective, a typical architecture is to have separate L1 caches for instructions and data and
to have a combined L2 cache. For example, the IBM/Motorola PowerPC 604e has separate 32-K four-
way set-associative L1 caches for instruction and data and a combined L2 cache.
2.4 Timing a Program
Under UNIX, you can time program execution by placing the time command before everything else
you normally type on the command line. When the program finishes, a timing summary is produced.
For instance, if your program is called foo, you can time its execution by typing time foo. If you are
using the C shell or Korn shell, time is one of the shell's built-in commands. With a Bourne shell, time
is a separate command executable in /bin. In any case, the following information appears at the end
of the run:
User time
System time
Elapsed time
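For instance, timing a hypothetical executable named foo under a Korn or Bourne shell might produce a summary along the following lines; the layout and the numbers are purely illustrative, since each shell formats the report differently:
% time foo
real 1m32.4s
user 1m28.1s
sys 0m2.9s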
These timing figures are easier to understand with a little background. As your program runs, it
switches back and forth between two fundamentally different modes: user mode and kernel mode.
The normal operating state is user mode. It is in user mode that the instructions the compiler
generated on your behalf get executed, in addition to any subroutine library calls linked with your
program. It might be enough to run in user mode forever, except that programs generally need
other services, such as I/O, and these require the intervention of the operating system, the kernel. A
kernel service request made by your program, or perhaps an event from outside your program,
causes a switch from user mode into kernel mode.
Time spent executing in the two modes is accounted for separately. The user time figure describes
time spent in user mode. Similarly, system time is a measure of the time spent in kernel mode. As far
as user time goes, each program on the machine is accounted for separately. That is, you won't be
charged for activity in somebody else's application. System time accounting works the same way, for
the most part; however, you can, in some instances, be charged for some system services performed
on other people's behalf, in addition to your own. Incorrect charging occurs because your program
may be executing at the moment some outside activity causes an interrupt. This seems unfair, but
take consolation in the fact that it works both ways: other users may be charged for your system
activity too, for the same reason. Taken together, user time and system time are called CPU time.
Generally, the user time is far greater than the system time. You would expect this because most
applications only occasionally ask for system services. In fact, a disproportionately large system time
probably indicates some trouble.
For instance, programs that are repeatedly generating exception conditions, such as page faults,
misaligned memory references, or floating-point exceptions, use an inordinate amount of system
time. Time spent doing things like seeking on a disk, rewinding a tape, or waiting for characters at
the terminal doesn't show up in CPU time. That's because these activities don't require the CPU; the
CPU is free to go on and execute other programs. The third piece of information (corresponding to
the third set of hands on the watch), elapsed time, is a measure of the actual (wall clock) time that
has passed since the program was started. For programs that spend most of their time computing,
the elapsed time should be close to the CPU time. Reasons why elapsed
time might be greater are:
You are timesharing the machine with other active programs. (The uptime command gives a rough indication of the other activity on your machine; the last three fields report the average number of processes ready to run during the last 1, 5, and 15 minutes, respectively.)
Your application performs a lot of I/O.
Your application requires more memory bandwidth than is available on the machine.
Your program was paging or swapped.
People often record the CPU time and use it as an estimate for elapsed time. Using CPU time is okay
on a single CPU machine, provided you have seen the program run when the machine was quiet and
noticed the two numbers were very close together. But for multiprocessors, the total CPU time can
be far different from the elapsed time. Whenever there is a doubt, wait until you have the machine to
your- self and time your program then, using elapsed time. It is very important to produce timing
results that can be verified using another run when the results are being used to make important
purchasing decisions.
If you are running on a Berkeley UNIX derivative, the C shell's built-in time command can report a
number of other useful statistics. Check with your csh manual page for more possibilities. In addition
to figures for CPU and elapsed time, csh time command produces information about CPU utilization,
page faults, swaps, blocked I/O operations (usually disk activity), and some measures of how much
physical memory our program occupied when it ran.
2.4.1 Timing a Portion of the Program
For some benchmarking or tuning efforts, measurements taken on the outside of the program tell you
everything you need to know. But if you are trying to isolate performance figures for individual loops
or portions of the code, you may want to include timing routines on the inside too. The basic
technique is simple enough:
1. Record the time before you start doing X.
2. Do X.
3. Record the time at completion of X.
4. Subtract the start time from the completion time.
If, for instance, X's primary job is to calculate particle positions, divide by the total time to obtain a
number for particle positions/second. You have to be careful though; too many calls to the timing
routines, and the observer becomes part of the experiment. The timing routines take time too, and
their very presence can increase instruction cache misses or paging. Furthermore, you want X to take
a significant amount of time so that the measurements are meaningful. Paying attention to the time
between timer calls is really important because the clock used by the timing functions has a limited
resolution. An event that occurs within a fraction of a second is hard to measure with any accuracy.
2.4.2 Getting Time Information
In this section, we discuss methods for getting various timer values during the execution of your
program. For FORTRAN programs, a library timing function found on many machines is called etime,
which takes a two-element REAL*4 array as an argument and fills the slots with the user CPU time
and system CPU time, respectively. The value returned by the function is the sum of the two. Here's
how etime is often used:
real*4 tarray(2), etime
real*4 start, finish
start = etime(tarray)
finish = etime(tarray)
write (*,*) 'CPU time: ', finish - start
Not every vendor supplies an etime function; in fact, one doesn't provide a timing routine for
FORTRAN at all. Try it first. If it shows up as an undefined symbol when the program is linked, you
can use the following C routine. It provides the same functionality as etime:
#include <sys/times.h>
#define TICKS 100.
float etime (parts)
struct {
float user;
float system;
} *parts;
{
struct tms local;
times (&local);
parts->user= (float) local.tms_utime/TICKS;
parts->system = (float) local.tms_stime/TICKS;
return (parts->user + parts->system);
}
There are a couple of things you might have to tweak to make it work. First of all, linking C routines
with FORTRAN routines on your computer may require you to add an underscore (_) after the
function name. This changes the entry to float etime_ (parts). Furthermore, you might have to adjust
the TICKS parameter. We assumed that the system clock had a resolution of 1/100 of a second (true
for the Hewlett-Packard machines that this version of etime was written for). 1/60 is very common.
On an RS-6000 the number would be 1000. You may find the value in a file named
/usr/include/sys/param.h on your machine, or you can determine it empirically. A C routine for
retrieving the wall time by calling gettimeofday is shown below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
void hpcwall(double *retval)
{
static long zsec = 0;
static long zusec = 0;
double esec;
struct timeval tp;
struct timezone tzp;
gettimeofday(&tp, &tzp);
if ( zsec == 0 ) zsec = tp.tv_sec;
if ( zusec == 0 ) zusec = tp.tv_usec;
*retval = (tp.tv_sec - zsec) + (tp.tv_usec - zusec ) * 0.000001 ;
}
void hpcwall_(double *retval) { hpcwall(retval); } /* Other convention */
Given that you will often need both CPU and wall time, and you will be continually computing the
difference between successive calls to these routines, you may want to write a routine that returns
the elapsed wall and CPU time upon each call.
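A minimal sketch of such a routine, built on the etime and hpcwall routines shown above, might look like the following; the subroutine name WHEN and its argument order are illustrative rather than any standard interface:
SUBROUTINE WHEN(DWALL, DCPU)
C Returns the wall-clock and CPU seconds elapsed since the previous call.
C On the first call the values are measured from when hpcwall was zeroed
C and from process start, respectively.
REAL*8 DWALL, WALL, LASTW
REAL*4 DCPU, CPU, LASTC, TARRAY(2), ETIME
SAVE LASTW, LASTC
DATA LASTW /0.0D0/, LASTC /0.0/
CALL HPCWALL(WALL)
CPU = ETIME(TARRAY)
DWALL = WALL - LASTW
DCPU = CPU - LASTC
LASTW = WALL
LASTC = CPU
END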
2.5 Subroutine Profiling
Sometimes you want more detail than the overall timing of the application. But you don't have time
to modify the code to insert several hundred etime calls into your code. Profiles are also very useful
when you have been handed a strange 20,000-line application program and told to figure out how it
works and then improve its performance. Most compilers provide a facility to automatically insert
timing calls into your code at the entry and exit of each routine at compile time. While your program
runs, the entry and exit times are recorded and then dumped into a file. A separate utility summarizes
the execution patterns and produces a report that shows the percentage of the time spent in each of
your routines and the library routines. The profile gives you a sense of the shape of the execution
profile. That is, you can see that 10% of the time is spent in subroutine A, 5% in subroutine B, etc.
Naturally, if you add all of the routines together they should account for 100% of the overall time
spent. From these percentages you can construct a picture a profile of how execution is distributed
when the program runs. Though not representative of any particular profiling tool, the histograms in
Figure 2.5.1-left (Sharp profile) and Figure 2.5.1-right (Flat profile ) depict these percentages,
sorted from left to right, with each vertical column representing a different routine. They help
illustrate different profile shapes. A sharp profile says that most of the time is spent in one or two
procedures, and if you want to improve the program's performance you should focus your efforts on
tuning those procedures. A minor optimization in a heavily executed line of code can sometimes have
a great effect on the overall runtime, given the right opportunity. A flat profile, on the other hand, tells
you that the runtime is spread across many routines, and effort spent optimizing any one or two will
have little benefit in speeding up the program. Of course, there are also programs whose execution
profile falls somewhere in the middle.
We cannot predict with absolute certainty what you are likely to find when you profile your
programs, but there are some general trends. For instance, engineering and scientific codes built
around matrix solutions often exhibit very sharp profiles. The runtime is dominated by the work
performed in a handful of routines. To tune the code, you need to focus your efforts on those routines
to make them more efficient. It may involve restructuring loops to expose parallelism, providing hints
to the compiler, or rearranging memory references. In any case, the challenge is tangible; you can see
the problems you have to fix.
Figure 2.5.1 Sharp Profiling (left) vs. Flat Profiling (right)
There are limits to how much tuning one or two routines will improve
your runtime, of course. An often quoted rule of thumb is Amdahl's Law, derived from remarks made
in 1967 by one of the designers of the IBM 360 series, and founder of Amdahl Computer, Gene
Amdahl. Strictly speaking, his remarks were about the performance potential of parallel computers,
but people have adapted Amdahl's Law to describe other things too. For our purposes, it goes like
this: Say you have a program with two parts, one that can be optimized so that it goes infinitely fast
and another that can't be optimized at all. Even if the optimizable portion makes up 50% of the initial
runtime, at best you will be able to cut the total runtime in half. That is, your runtime will eventually
be dominated by the portion that can't be optimized. This puts an upper limit on your expectations
when tuning.
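Stated as a formula, the usual form of Amdahl's Law says that if a fraction P of the original runtime can be sped up by a factor S, while the remaining fraction (1 - P) is left untouched, the overall speedup is
Speedup = 1 / ( (1 - P) + P / S )
With P = 0.5, even letting S grow without bound yields a speedup of at most 2, which is the factor-of-two ceiling described above.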
Even given the finite return on effort suggested by Amdahl's Law, tuning a program with a sharp
profile can be rewarding. Programs with flat profiles are much more difficult to tune. These are often
system codes, nonnumeric applications, and varieties of numerical codes without matrix solutions. It
takes a global tuning approach to reduce, to any justifiable degree, the runtime of a program with a
flat profile. For instance, you can sometimes optimize instruction cache usage, which is complicated
because of the program's equal distribution of activity among a large number of routines. It can also
help to reduce subroutine call overhead by folding callees into callers. Occasionally, you can find a
memory reference problem that is endemic to the whole program and one that can be fixed all at
once.
When you look at a profile, you might find an unusually large percentage of time spent in the library
routines such as log, exp, or sin. Often these functions are done in software routines rather than
inline. You may be able to rewrite your code to eliminate some of these operations. Another
important pattern to look for is when a routine takes far longer than you expect. Unexpected
execution time may indicate you are accessing memory in a pattern that is bad for performance or
that some aspect of the code cannot be optimized properly. In any case, to get a profile, you need a
profiler. One or two subroutine profilers come standard with the software development
environments on all UNIX machines. We cover two of them: prof and gprof. In addition, we mention
a few line-by-line profilers. Subroutine profilers can give you a general overall view of where time is
being spent. You probably should start with prof, if you have it (most machines do). Otherwise, use
gprof. After that, you can move to a line-by-line profiler if you need to know which statements take
the most time.
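As a rough illustration of the usual workflow (the exact compiler flags, file names, and report formats vary from vendor to vendor), a profiled run often looks something like this:
% cc -p -o foo foo.c
% foo
% prof foo
% cc -pg -o foo foo.c
% foo
% gprof foo gmon.out > foo.profile
The -p and -pg options instrument the executable; running it produces mon.out or gmon.out, which prof and gprof then turn into the percentage-by-routine report described above.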
2.6 Loop Optimizations
In nearly all high performance applications, loops are where the majority of the execution time is
spent. In this chapter we focus on techniques used to improve the performance of these "clutter-free"
loops. Sometimes the compiler is clever enough to generate the faster versions of the loops, and other
times we have to do some rewriting of the loops ourselves to help the compiler. It's important to
remember that one compiler's performance enhancing modifications are another compiler's clutter.
When you make modifications in the name of performance you must make sure you're helping by
testing the performance with and without the modifications. Also, when you move to another
architecture you need to make sure that any modifications aren't hindering performance. For this
reason, you should choose your performance-related modifications wisely. You should also keep the
original (simple) version of the code for testing on new architectures. Also if the benefit of the
modification is small, you should probably keep the code in its most simple and clear form. The
different loop optimization techniques include:
Loop unrolling
Nested loop optimization
Loop interchange
Memory reference optimization
Blocking
Out-of-core solutions
Someday, it may be possible for a compiler to perform all these loop optimizations automatically.
Typically loop unrolling is performed as part of the normal compiler optimizations. Other
optimizations may have to be triggered using explicit compile-time options. As you contemplate
making manual changes, look carefully at which of these optimizations can be done by the compiler.
Also run some tests to determine if the compiler optimizations are as good as hand optimizations.
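As a hedged illustration, with the GNU compilers the kinds of options involved include the following; other vendors use different option names, so check your compiler's documentation:
% gcc -O3 -funroll-loops foo.c
% gfortran -O3 -funroll-loops foo.f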
2.6.1 Operation Counting
Before you begin to rewrite a loop body or reorganize the order of the loops, you must have some
idea of what the body of the loop does for each iteration. Operation counting is the process of
surveying a loop to understand the operation mix. You need to count the number of loads, stores,
floating-point, integer, and library calls per iteration of the loop. From the count, you can see how
well the operation mix of a given loop matches the capabilities of the processor. Of course, operation
counting doesn't guarantee that the compiler will generate an efficient representation of a loop. But
it generally provides enough insight to the loop to direct tuning efforts. Bear in mind that an
instruction mix that is balanced for one machine may be imbalanced for another. Processors on the
market today can generally issue some combination of one to four operations per clock cycle. Address
arithmetic is often embedded in the instructions that reference memory. Because the compiler can
replace complicated loop address calculations with simple expressions (provided the pattern of
addresses is predictable), you can often ignore address arithmetic when counting operations. Let's
look at a few loops and see what we can learn about the instruction mix:
DO I = 1,N
A(I, J, K) = A(I, J, K) + B( J, I, K)
ENDDO
This loop contains one floating-point addition and three memory references (two loads and a store).
There are some complicated array index expressions, but these will probably be simplified by the
compiler and executed in the same cycle as the memory and floating-point operations. For each
iteration of the loop, we must increment the index variable and test to determine if the loop has
completed. A 3:1 ratio of memory references to floating-point operations suggests that we can hope
for no more than 1/3 peak floating-point performance from the loop unless we have more than one
path to memory. That's bad news, but good information. The ratio tells us that we ought to consider
memory reference optimizations first. The loop below contains one floating-point addition and two
memory operations: a load and a store. Operand B(J) is loop-invariant, so its value only needs to be
loaded once, upon entry to the loop:
DO I=1,N
A(I) = A(I) + B(J)
ENDDO
Again, our floating-point throughput is limited, though not as severely as in the previous loop. The
ratio of memory references to floating-point operations is 2:1. The next example shows a loop with
better prospects. It performs element-wise multiplication of two vectors of complex numbers and
assigns the results back to the first. There are six memory operations (four loads and two stores) and
six floating-point operations (two additions and four multiplications):
for (i=0; i <n; i++)
{
xr[i] = xr[i] * yr[i] - xi[i] * yi[i];
xi[i] = xr[i] * yi[i] + xi[i] * yr[i];
}
It appears that this loop is roughly balanced for a processor that can perform the same number of
memory operations and floating-point operations per cycle. However, it might not be. Many
processors perform a floating-point multiply and add in a single instruction. If the compiler is good
enough to recognize that the multiply-add is appropriate, this loop may also be limited by memory
references; each iteration would be compiled into two multiplications and two multiply-adds. Again,
operation counting is a simple way to estimate how well the requirements of a loop will map onto
the capabilities of the machine. For many loops, you often find the performance of the loops
dominated by memory references, as we have seen in the last three examples. This suggests that
memory reference tuning is very important.
2.6.2 Basic Loop Un-Rolling
The most basic form of loop optimization is loop un-rolling. It is so basic that most of today's
compilers do it automatically if it looks like there's a benefit. There has been a great deal of clutter
introduced into old dusty-deck FORTRAN programs in the name of loop unrolling that now serves
only to confuse and mislead today's compilers. We're not suggesting that you unroll any loops by
hand. The purpose of this section is twofold. First, once you are familiar with loop unrolling, you
might recognize code that was unrolled by a programmer (not you) some time ago and simplify the
code. Second, you need to understand the concepts of loop unrolling so that when you look at
generated machine code, you recognize unrolled loops. The primary benefit in loop unrolling is to
perform more computations per iteration. At the end of each iteration, the index value must be
incremented, tested, and the control is branched back to the top of the loop if the loop has more
iterations to process. By unrolling the loop, there are fewer "loop ends" per loop execution. Unrolling
also reduces the overall number of branches significantly and gives the processor more instructions
between branches (i.e., it increases the size of the basic blocks). For illustration, consider the
following loop. It has a single statement wrapped in a do-loop:
DO I=1,N
A(I) = A(I) + B(I) * C
ENDDO
You can unroll the loop, as we have below, giving you the same operations in fewer iterations with
less loop overhead. You can imagine how this would help on any computer. Because the computations
in one iteration do not depend on the computations in other iterations, calculations from different
iterations can be executed together. On a superscalar processor, portions of these four statements
may actually execute in parallel:
DO I=1,N,4
A(I) = A(I) + B(I) * C
A(I+1) = A(I+1) + B(I+1) * C
A(I+2) = A(I+2) + B(I+2) * C
A(I+3) = A(I+3) + B(I+3) * C
ENDDO
However, this loop is not exactly the same as the previous loop. The loop is unrolled four times, but
what if N is not divisible by 4? If not, there will be one, two, or three spare iterations that don't get
executed. To handle these extra iterations, we add another little loop to soak them up. The extra loop
is called a preconditioning loop:
II = MOD(N,4)
DO I=1,II
A(I) = A(I) + B(I) * C
ENDDO
DO I=1+II,N,4
A(I) = A(I) + B(I) * C
A(I+1) = A(I+1) + B(I+1) * C
A(I+2) = A(I+2) + B(I+2) * C
A(I+3) = A(I+3) + B(I+3) * C
ENDDO
The number of iterations needed in the preconditioning loop is the total iteration count modulo the
unrolling amount. If, at runtime, N turns out to be divisible by 4, there are no spare iterations, and
the preconditioning loop isn't executed. Speculative execution in the post-RISC architecture can
reduce or eliminate the need for unrolling a loop that will operate on values that must be retrieved
from main memory. Because the load operations take such a long time relative to the computations,
the loop is naturally unrolled. While the processor is waiting for the first load to finish, it may
speculatively execute three to four iterations of the loop ahead of the first load, effectively unrolling
the loop in the Instruction Reorder Buffer.
2.6.3 Loops with Low Trip Counts
To be effective, loop unrolling requires a fairly large number of iterations in the original loop. To
understand why, picture what happens if the total iteration count is low, perhaps less than 10, or
even less than 4. With a trip count this low, the preconditioning loop is doing a proportionately large
amount of the work. It's not supposed to be that way. The preconditioning loop is supposed to catch
the few leftover iterations missed by the unrolled, main loop. However, when the trip count is low,
you make one or two passes through the unrolled loop, plus one or two passes through the
preconditioning loop. In other words, you have more clutter; the loop shouldn't have been unrolled
in the first place. Probably the only time it makes sense to unroll a loop with a low trip count is when
the number of iterations is constant and known at compile time. For instance, suppose you had the
following loop:
PARAMETER (NITER = 3)
DO I=1,NITER
A(I) = B(I) * C
ENDDO
Because NITER is hardwired to 3, you can safely unroll to a depth of 3 without worrying about a
preconditioning loop. In fact, you can throw out the loop structure altogether and leave just the
unrolled loop innards:
PARAMETER (NITER = 3)
A(1) = B(1) * C
A(2) = B(2) * C
A(3) = B(3) * C
Of course, if a loop's trip count is low, it probably won't contribute significantly to the overall runtime,
unless you find such a loop at the center of a larger loop. Then you either want to unroll it completely
or leave it alone.
2.6.4 Fat Loops
Loop unrolling helps performance because it fattens up a loop with more calculations per iteration.
By the same token, if a particular loop is already fat, unrolling isn't going to help. The loop overhead
is already spread over a fair number of instructions. In fact, unrolling a fat loop may even slow your
program down because it increases the size of the text segment, placing an added burden on the
memory system (we'll explain this in greater detail shortly). A good rule of thumb is to look
elsewhere for performance when the loop innards exceed three or four statements.
2.6.5 Loops Containing Procedure Calls
As with fat loops, loops containing subroutine or function calls generally aren't good candidates for
unrolling. There are several reasons. First, they often contain a fair number of instructions already.
And if the subroutine being called is fat, it makes the loop that calls it fat as well. The size of the loop
may not be apparent when you look at the loop; the function call can conceal many more instructions.
Second, when the calling routine and the subroutine are compiled separately, it's impossible for the
compiler to intermix instructions. A loop that is unrolled into a series of function calls behaves much
like the original loop, before unrolling. Last, function call overhead is expensive. Registers have to be
saved; argument lists have to be prepared. The time spent calling and returning from a subroutine
can be much greater than that of the loop overhead. Unrolling to amortize the cost of the loop
structure over several calls doesn't buy you enough to be worth the effort. The general rule when
dealing with procedures is to first try to eliminate them in the remove clutter phase, and when this
has been done, check to see if unrolling gives an additional performance improvement.
2.6.6 Loops with Branches
Earlier we showed how to eliminate certain types of branches, but of course, we couldn't get rid of
them all. In cases of iteration-independent branches, there might be some benefit to loop unrolling.
The IF test becomes part of the operations that must be counted to determine the value of loop
unrolling. Below is a doubly nested loop. The inner loop tests the value of B(J,I):
DO I=1,N
DO J=1,N
IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
ENDDO
ENDDO
Each iteration is independent of every other, so unrolling it won't be a problem. We'll just leave the
outer loop undisturbed:
II = MOD(N,4)
DO I=1,N
DO J=1,II
IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
ENDDO
DO J=II+1,N,4
IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
IF (B(J+1,I) .GT. 1.0) A(J+1,I) = A(J+1,I) + B(J+1,I) * C
IF (B(J+2,I) .GT. 1.0) A(J+2,I) = A(J+2,I) + B(J+2,I) * C
IF (B(J+3,I) .GT. 1.0) A(J+3,I) = A(J+3,I) + B(J+3,I) * C
ENDDO
ENDDO
This approach works particularly well if the processor you are using supports conditional execution.
As described earlier, conditional execution can replace a branch and an operation with a single
conditionally executed assignment. On a superscalar processor with conditional execution, this
unrolled loop executes quite nicely.
2.6.7 Nested Loops
When you embed loops within other loops, you create a loop nest. The loop or loops in the center are
called the inner loops. The surrounding loops are called outer loops. Depending on the construction
of the loop nest, we may have some flexibility in the ordering of the loops. At times, we can swap the
outer and inner loops with great benefit. In the next sections we look at some common loop nestings
and the optimizations that can be performed on these loop nests. Often when we are working with
nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays
can lead to non-unit-stride memory access. Many of the optimizations we perform on loop nests are
meant to improve the memory access patterns. First, we examine the computation-related
optimizations followed by the memory optimizations.
2.6.8 Outer Loop Un-Rolling
If you are faced with a loop nest, one simple approach is to unroll the inner loop. Unrolling the
innermost loop in a nest isn't any different from what we saw above. You just pretend the rest of the
loop nest doesn't exist and approach it in the normal way. However, there are times when you want
to apply loop unrolling not just to the inner loop, but to outer loops as well or perhaps only to the
outer loops. Here's a typical loop nest:
for (i=0; i<n; i++)
for (j=0; j<n; j++)
for (k=0; k<n; k++)
a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
To unroll an outer loop, you pick one of the outer loop index variables and replicate the innermost
loop body so that several iterations are performed at the same time. The difference is in the index
variable for which you unroll. In the code below, we have unrolled the middle (j) loop twice:
for (i=0; i<n; i++)
for (j=0; j<n; j+=2)
for (k=0; k<n; k++)
{
a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
a[i][j+1][k] = a[i][j+1][k] + b[i][j+1][k] * c;
}
We left the k loop untouched; however, we could unroll that one, too. That would give us outer and
inner loop unrolling at the same time:
for (i=0; i<n; i++)
for (j=0; j<n; j+=2)
for (k=0; k<n; k+=2)
{
a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
a[i][j+1][k] = a[i][j+1][k] + b[i][j+1][k] * c;
a[i][j][k+1] = a[i][j][k+1] + b[i][j][k+1] * c;
a[i][j+1][k+1] = a[i][j+1][k+1] + b[i][j+1][k+1] * c;
}
We could even unroll the i loop too, leaving eight copies of the loop innards. (Notice that we
completely ignored preconditioning; in a real application, of course, we couldn't.)
2.6.9 Loop Interchange to Move Computations to the Center
When someone writes a program that represents some kind of real-world model, they often structure
the code in terms of the model. This makes perfect sense. The computer is an analysis tool; you aren't
writing the code on the computer's behalf. However, a model expressed naturally often works on one
point in space at a time, which tends to give you insignificant inner loops at least in terms of the trip
count. For performance, you might want to interchange inner and outer loops to pull the activity into
the center, where you can then do some unrolling. Let's illustrate with an example. Here's a loop
where KDIM time-dependent quantities for points in a two-dimensional mesh are being updated:
PARAMETER (IDIM = 1000, JDIM = 1000, KDIM = 3)
...
DO I=1,IDIM
DO J=1,JDIM
DO K=1,KDIM
D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
ENDDO
ENDDO
ENDDO
In practice, KDIM is probably equal to 2 or 3, where J or I, representing the number of points, may be
in the thousands. The way it is written, the inner loop has a very low trip count, making it a poor
candidate for unrolling. By interchanging the loops, you update one quantity at a time, across all of
the points. For tuning purposes, this moves larger trip counts into the inner loop and allows you to
do some strategic un-rolling:
PARAMETER (IDIM = 1000 , JDIM = 1000 , KDIM = 3)
DO K = 1 , KDIM
DO J = 1 , JDIM
DO I = 1 , IDIM
D (K , J , I) = D (K , J , I) + V (K , J , I) * DT
ENDDO
ENDDO
ENDDO
This example is straightforward; it's easy to see that there are no inter-iteration dependencies. But
how can you tell, in general, when two loops can be interchanged? Interchanging loops might violate
some dependency, or worse, only violate it occasionally, meaning you might not catch it when
optimizing. While it is possible to examine the loops by hand and determine the dependencies, it is
much better if the compiler can make the determination. Very few single-processor compilers
automatically perform loop interchange. However, the compilers for high-end vector and parallel
computers generally interchange loops if there is some benefit and if interchanging the loops won't
alter the program results.
2.7 Matrix Multiplication
Matrix multiplication is a common operation we can use to explore the options that are available in
optimizing a loop nest. A programmer who has just finished reading a linear algebra textbook would
probably write matrix multiply as it appears in the example below:
DO I=1,N
DO J=1,N
SUM = 0
DO K=1,N
SUM = SUM + A(I,K) * B(K,J)
ENDDO
C(I,J) = SUM
ENDDO
ENDDO
The problem with this loop is that the A (I,K) will be non-unit stride. Each iteration in the inner loop
consists of two loads (one non-unit stride), a multiplication, and an addition. Given the nature of the
matrix multiplication, it might appear that you can't eliminate the non-unit stride. However, with a
simple rewrite of the loops all the memory accesses can be made unit stride:
DO J = 1 , N
DO I =1 , N
C( I , J) = 0.0
ENDDO
ENDDO
DO K=1,N
DO J=1,N
SCALE = B(K,J)
DO I=1,N
C(I,J) = C(I,J) + A(I,K) * SCALE
ENDDO
ENDDO
ENDDO
Now, the inner loop accesses memory using unit stride. Each iteration performs two loads, one store,
a multiplication, and an addition. When comparing this to the previous loop, the non-unit stride loads
have been eliminated, but there is an additional store operation. Assuming that we are operating on
a cache-based system, and the matrix is larger than the cache, this extra store won't add much to the
execution time. The store is to the location in C(I,J) that was used in the load. In most cases, the store
is to a line that is already in the cache. The B(K,J) becomes a constant scaling factor within the
inner loop.
2.7.1 Matrix Optimization
In the matrix multiplication code, we encountered a non-unit stride and were able to eliminate it with
a quick interchange of the loops. Unfortunately, life is rarely this simple. Often you find some mix of
variables with unit and non-unit strides, in which case interchanging the loops moves the damage
around, but doesn't make it go away. The loop to perform a matrix transpose represents a simple
example of this dilemma:
DO I =1 , N
DO J =1 , M
A (J , I) = B (I , J)
ENDDO
ENDDO
Whichever way you interchange them, you will break the memory access pattern for either A or B.
Even more interesting, you have to make a choice between stride loads vs. stride stores: which will
it be? We really need a general method for improving the memory access patterns for both A and B,
not one or the other. We'll show you such a method later.
2.7.2 Blocking to Ease Memory Access Patterns
Blocking is another kind of memory reference optimization. As with loop interchange, the challenge
is to retrieve as much data as possible with as few cache misses as possible. We'd like to rearrange
the loop nest so that it works on data in little neighborhoods, rather than striding through memory
like a man on stilts. Given the following vector sum, how can we rearrange the loop?
DO I=1,N
DO J=1,N
A(J,I) = A(J,I) + B(I,J)
ENDDO
ENDDO
This loop involves two vectors. One is referenced with unit stride, the other with a stride of N. We
can interchange the loops, but one way or another we still have N-step array references on either A
or B, either of which is undesirable. The trick is to block references so that you grab a few elements
of A, and then a few of B, and then a few of A, and so on, in neighborhoods. We make this happen by
combining inner and outer loop unrolling. The difference is in the way the processor handles updates
of main memory from cache.
DO I=1,N,2
DO J=1,N,2
A(J,I) = A(J,I) + B(I,J)
A(J+1,I) = A(J+1,I) + B(I,J+1)
A(J,I+1) = A(J,I+1) + B(I+1,J)
A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
ENDDO
ENDDO
Use your imagination so we can show why this helps. Usually, when we think of a two-dimensional
array, we think of a rectangle or a square (arrays A and B). Remember, to make programming easier,
the compiler provides the illusion that two-dimensional arrays A and B are rectangular plots of
memory (Arrays A and B). Actually, memory is sequential storage. In FORTRAN, a two-dimensional
array is constructed in memory by logically lining memory strips up against each other, like the
pickets of a cedar fence. (It's the other way around in C: rows are stacked on top of one another.)
Array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of
the next column. Stepping through the array with unit stride traces out the shape of a backwards N
repeated over and over, moving to the right.
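A fully blocked version of the same loop nest makes the neighborhoods explicit with an extra pair of block loops; the block size of 16 used here is illustrative, and in practice it is chosen so that a block of A and a block of B fit in cache together:
DO II=1,N,16
DO JJ=1,N,16
DO I=II,MIN(II+15,N)
DO J=JJ,MIN(JJ+15,N)
A(J,I) = A(J,I) + B(I,J)
ENDDO
ENDDO
ENDDO
ENDDO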
2.8 Shared-Memory Parallel Processors
As we move into programming on multiprocessors, we must increase our understanding of
parallelism in order to understand how to effectively program these systems. In short, as we gain
more parallel resources, we need to find more parallelism in our code. Some examples of parallelism
in order of increasing grain size are:
When performing a 32-bit integer addition, using a carry-lookahead adder, you can partially
add bits 0 and 1 at the same time as bits 2 and 3.
On a pipelined processor, while decoding one instruction, you can fetch the next instruction.
On a two-way superscalar processor, you can execute any combination of an integer and a
floating-point instruction in a single cycle.
On a multiprocessor, you can divide the iterations of a loop among the four processors of the
system.
You can split a large array across four workstations attached to a network. Each workstation
can operate on its local information and then exchange boundary values at the end of each
time step.
We start at instruction-level parallelism (pipelined and superscalar) and move toward thread level
parallelism, which is what we need for multiprocessor systems. It is important to note that the
different levels of parallelism are generally not in conflict. Increasing thread parallelism at a coarser
grain size often exposes more fine-grained parallelism. The following is a loop that has plenty of
parallelism:
DO I = 1 , 16000
A (I) = B (I) * 3.14159
ENDDO
We have expressed the loop in a way that would imply that A (1) must be computed first, followed
by A (2), and so on. However, once the loop was completed, it would not have mattered if A (16000),
were computed first followed by A (15999), and so on. The loop could have computed the even values
of I and then computed the odd values of I. It would not even make a difference if all 16,000 of the
iterations were computed simultaneously using a 16,000-way superscalar processor. If the compiler
has flexibility in the order in which it can execute the instructions that make up your program, it can
execute those instructions simultaneously when parallel hardware is available.
One technique that computer scientists use to formally analyze the potential parallelism in an
algorithm is to characterize how quickly it would execute with an infinite-way superscalar processor.
Not all loops contain as much parallelism as this simple loop. We need to identify the things that limit
the parallelism in our codes and remove them whenever possible. Helping the compiler recognize
parallelism is one of the basic approaches specialists take in tuning code. A slight rewording of a loop
or some supplementary information supplied to the compiler can change a "we don't know" answer
into an opportunity for parallelism. To be certain, there are other facets to tuning as well, such as
optimizing memory access patterns so that they best suit the hardware, or recasting an algorithm.
And there is no single best approach to every problem; any tuning effort has to be a combination of
techniques.
2.8.1 Dependencies
Imagine a symphony orchestra where each musician plays without regard to the conductor or the
other musicians. At the first tap of the conductor's baton, each musician goes through all of his or her
sheet music. Some finish far ahead of others, leave the stage, and go home. The cacophony wouldn't
resemble music (come to think of it, it would resemble experimental jazz) because it would be totally
uncoordinated. Of course this isn't how music is played. A computer program, like a musical piece, is
woven on a fabric that unfolds in time (though perhaps woven more loosely). Certain things must
happen before or along with others, and there is a rate to the whole process. With computer
programs, whenever event A must occur before event B can, we say that B is dependent on A. We call
the relationship between them a dependency. Sometimes dependencies exist because of calculations
or memory operations; we call these data dependencies. Other times, we are waiting for a branch or
do-loop exit to take place; this is called a control dependency. Each is present in every program to
varying degrees. The goal is to eliminate as many dependencies as possible. Rearranging a program
so that two chunks of the computation are less dependent exposes parallelism, or opportunities to
do several things at once.
2.8.1.1 Control Dependencies
Just as variable assignments can depend on other assignments, a variable's value can also depend on
the flow of control within the program. For instance, an assignment within an if-statement can occur
only if the conditional evaluates to true. The same can be said of an assignment within a loop. If the
loop is never entered, no statements inside the loop are executed. Interestingly, this is not as far-
fetched as it might seem; on a single-instruction, multiple-data (SIMD) computer such as the
Connection Machine CM-2, with 16,384 processors, it would take only three instruction cycles to process
such a loop. When calculations occur as a consequence of the flow of control, we say there is a control
dependency, as in the code below and shown graphically in Figure 2.8.1 (a). The assignment located
inside the block-if may or may not be executed, depending on the outcome of the test X .NE. 0. In other
words, the value of Y depends on the flow of control in the code around it. Again, this may sound to
you like a concern for compiler designers, not programmers, and that's mostly true.
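A small fragment of the kind being described, with the test X .NE. 0 guarding an assignment to Y (the particular expression assigned here is only illustrative), is:
IF (X .NE. 0) THEN
Y = 1.0 / X
ENDIF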
But there are times when you might want to move control-dependent instructions around to get
expensive calculations out of the way (provided your compiler isn't smart enough to do it for you).
For example, say that Figure 2.8.1 (b) represents a little section of your program. Flow of control
enters at the top and goes through two branch decisions. Furthermore, say that there is a square root
operation at the entry point, and that the flow of control almost always goes from the top, down to
the leg containing the statement A = 0.0.
This means that the results of the calculation A = SQRT(B) are almost always discarded because A
gets a new value of 0.0 each time through. A square root operation is always "expensive" because it
takes a lot of time to execute. The trouble is that you can't just get rid of it; occasionally it's needed.
However, you could move it out of the way and continue to observe the control dependencies by
making two copies of the square root operation along the less traveled branches, as shown in Figure
2.8.1-(c). This way the SQRT would execute only along those paths where it was actually needed.
This kind of instruction scheduling will be appearing in compilers (and even hardware) more and
more as time goes on. A variation on this technique is to calculate results that might be needed at
times when there is a gap in the instruction stream (because of dependencies), thus using some spare
cycles that might otherwise be wasted.
2.8.1.2 Data Dependencies
A calculation that is in some way bound to a previous calculation is said to be data dependent upon
that calculation. In the code below, the value of B is data dependent on the value of A. That's because
you can't calculate B until the value of A is available:
A = X + Y + COS(Z)
B = A * C
Figure 2.8.1 (a) Control Dependency; (b) A Section of Your Program; (c) Expensive Operation Moved so that it's Rarely Executed
This dependency is easy to recognize, but others are not so simple. At other times, you must be careful
not to rewrite a variable with a new value before every other computation has finished using the old
value. We can group all data dependencies into three categories:
(1) flow dependencies,
(2) anti-dependencies, and
(3) output dependencies.
Figure 2.8.2 contains some simple examples to demonstrate each type of dependency. Here, we use
an arrow that starts at the source of the dependency and ends at the statement that must be delayed
by the dependency.
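In source form, the three kinds of dependency look roughly like the following statement pairs (the variables and expressions are illustrative); in each pair, the second statement must not be moved ahead of the first:
A = B + C
D = A * 2.0 (flow dependency: D reads the A just written)
E = F + 1.0
F = G * 2.0 (anti-dependency: F is rewritten after being read)
H = B + C
H = D + E (output dependency: H is written twice)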
2.8.2 Forming a Flow Graph
To illustrate, suppose that we have the flow graph (flow chart) in Figure 2.8.3. Beside each basic
block we've listed the variables it uses and the variables it defines. What can data flow analysis tell
us? Notice that a value for A is defined in block X but only used in block Y. That means that A is dead
upon exit from block Y or immediately upon taking the right-hand branch leaving X; none of the other
basic blocks uses the value of A. That tells us that any associated resources, such as a register, can be
freed for other uses. Looking at Figure 2.8.3 we can see that D is defined in basic block X, but never
used. This means that the calculations defining D can be discarded. Something interesting is
happening with the variable G. Blocks X and W both use it, but if you look closely you'll see that the
two uses are distinct from one another, meaning that they can be treated as two independent
variables. A compiler featuring advanced instruction scheduling techniques might notice that W is
the only block that uses the value for E, and so move the calculations defining E out of block Y and
into W, where they are needed. In addition to gathering data about variables, the compiler can also
keep information about subexpressions. Examining both together, it can recognize cases where
redundant calculations are being made (across basic blocks), and substitute previously computed
values in their place. If, for instance, the expression H*I appears in blocks X, Y, and W, it could be
calculated just once in block X and propagated to the others that use it.
Figure 2.8.2 Types of Data Dependencies
2.8.2.1 Loop Dependencies
Loops are the center of activity for many applications, so there is often a high payback for simplifying
or moving calculations outside. Early compilers for parallel architectures used pattern matching to
identify the bounds of their loops. This limitation meant that a hand-constructed loop using if-
statements and goto-statements would not be correctly identified as a loop. Because modern
compilers use data flow graphs, it's practical to identify loops as a particular subset of nodes in the
flow graph. To a data flow graph, a hand constructed loop looks the same as a compiler-generated
loop. Optimizations can therefore be applied to either type of loop. Once we have identified the loops,
we can apply the same kinds of data-flow analysis we applied above. Among the things we are looking
for are calculations that are unchanging within the loop and variables that change in a predictable
(linear) fashion from iteration to iteration. How does the compiler identify a loop in the flow graph?
Fundamentally, two conditions have to be met:
A given node has to dominate all other nodes within the suspected loop. This means that all
paths to any node in the loop have to pass through one particular node, the dominator. The
dominator node forms the header at the top of the loop.
There has to be a cycle in the graph. Given a dominator, if we can find a path back to it from
one of the nodes it dominates, we have a loop. This path back is known as the back edge of
the loop.
The flow graph in Figure 2.8.4 contains one loop and one red herring. You can see that node B
dominates every node below it in the subset of the flow graph. Therefore, the nodes B, C, D, and E
form a loop. The loop goes through an array of linked list start pointers and traverses the lists to
determine the total number of nodes in all lists. Letters to the extreme right correspond to the basic
block numbers in the flow graph. At first glance, it appears that the nodes C and D form a loop too.
Figure 2.8.3 Flow Graph for Data Flow Analysis
The problem is that C doesn't dominate D (and vice versa), because entry to either can be made from
B. Generally, the flow graphs that come from code segments written with even the weakest
appreciation for a structured design offer better loop candidates. After identifying a loop, the
compiler can concentrate on that portion of the flow graph, looking for instructions to remove or
push to the outside. Certain types of subexpressions, such as those found in array index expressions,
can be simplified if they change in a predictable fashion from one iteration to the next. In the
continuing quest for parallelism, loops are generally our best sources for large amounts of
parallelism. However, loops also provide new opportunities for those parallelism-killing
dependencies.
2.8.2.2 Loop-Carried Dependencies
The notion of data dependence is particularly important when we look at loops, the hub of activity
inside numerical applications. A well-designed loop can produce millions of operations that can all
be performed in parallel. However, a single misplaced dependency in the loop can force it all to be
run in serial. So the stakes are higher when looking for dependencies in loops. Some constructs are
completely independent, right out of the box. The question we want to ask is: can two different
iterations execute at the same time, or is there a data dependency between them? Consider the
following loop:
DO I=1,N
A(I) = A(I) + B(I)
ENDDO
For any two values of I and K, can we calculate the value of A(I) and A(K) at the same time? Below,
we have manually unrolled several iterations of the previous loop, so they can be executed together:
A(I) = A(I) + B(I)
A(I+1) = A(I+1) + B(I+1)
A(I+2) = A(I+2) + B(I+2)
Figure 2.8.4 Flow Graph Including a Loop
You can see that none of the results are used as an operand for another calculation. For instance, the
calculation for A(I+1) can occur at the same time as the calculation for A(I) because the calculations
are independent; you don't need the results of the first to determine the second. In fact, mixing up
the order of the calculations won't change the results in the least. Relaxing the serial order imposed
on these calculations makes it possible to execute this loop very quickly on parallel hardware.
2.8.2.3 Flow Dependencies
For comparison, look at the next code fragment:
DO I=2,N
A(I) = A(I-1) + B(I)
ENDDO
This loop has the regularity of the previous example, but one of the subscripts is changed. Again, it's
useful to manually unroll the loop and look at several iterations together:
A(I) = A(I-1) + B(I)
A(I+1) = A(I) + B(I+1)
A(I+2) = A(I+1) + B(I+2)
In this case, there is a dependency problem. The value of A (I+1) depends on the value of A (I), the
value of A (I+2) depends on A (I+1), and so on; every iteration depends on the result of a previous
one. Dependencies that extend back to a previous calculation and perhaps a previous iteration (like
this one), are loop carried flow dependencies or backward dependencies. You often see such
dependencies in applications that perform Gaussian elimination on certain types of matrices, or
numerical solutions to systems of differential equations. However, it is impossible to run such a loop
in parallel (as written); the processor must wait for intermediate results before it can proceed. In
some cases, flow dependencies are impossible to fix; calculations are so dependent upon one another
that we have no choice but to wait for previous ones to complete. Other times, dependencies are a
function of the way the calculations are expressed. For instance, the loop above can be changed to
reduce the dependency. By replicating some of the arithmetic, we can make it so that the second and
third iterations depend on the first, but not on one another. The operation count goes up (we have an
extra addition that we didn't have before), but we have reduced the dependency between iterations:
DO I=2,N,2
A(I) = A(I-1) + B(I)
A(I+1) = A(I-1) + B(I) + B(I+1)
ENDDO
The speed increase on a workstation won't be great (most machines run the recast loop more slowly).
However, some parallel computers can trade off additional calculations for reduced dependency and
chalk up a net win.
2.8.2.4 Output Dependencies
The third class of data dependencies, output dependencies, is of particular interest to users of parallel
computers, particularly multiprocessors. Output dependencies involve getting the right values to the
right variables when all calculations have been completed. Otherwise, an output dependency is
violated. The loop below assigns new values to two elements of the vector A with each iteration:
DO I=1,N
A(I) = C(I) * 2.
A(I+2) = D(I) + E
ENDDO
As always, we won't have any problems if we execute the code sequentially. But if several iterations
are performed together, and statements are reordered, then incorrect values can be assigned to the
last elements of A. For example, in the naive vectored equivalent below, A (I+2) takes the wrong value
because the assignments occur out of order:
A(I) = C(I) * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I) + E      ! output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E
Whether or not you have to worry about output dependencies depends on whether you are actually
parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate
legal code and possibly even fast code, if it's clever enough. But output dependencies occasionally
become a problem for programmers.
2.8.2.5 Dependencies Within an Iteration
We have looked at dependencies that cross iteration boundaries but we haven't looked at
dependencies within the same iteration. Consider the following code fragment:
DO I = 1,N
D = B(I) * 17
A(I) = D + 14
ENDDO
When we look at the loop, the variable D has a flow dependency. The second statement cannot start
until the first statement has completed. At first glance this might appear to limit parallelism
significantly. When we look closer and manually unroll several iterations of the loop, the situation
gets worse:
D = B(I) * 17
A(I) = D + 14
D = B(I+1) * 17
A(I+1) = D + 14
D = B(I+2) * 17
A(I+2) = D + 14
Now, the variable D has flow, output, and anti-dependencies. It looks like this loop has no hope of
running in parallel. However, there is a simple solution to this problem at the cost of some extra
memory space, using a technique called promoting a scalar to a vector. We define D as an array with
N elements and rewrite the code as follows:
DO I = 1,N
D(I) = B(I) * 17
A(I) = D(I) + 14
ENDDO
Now the iterations are all independent and can be run in parallel. Within each iteration, the first
statement must run before the second statement.
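On a shared-memory machine, an alternative to promoting the scalar is to keep D scalar and give each thread its own private copy. The fragment below is only a minimal sketch of the same loop, written in C with an OpenMP private clause for illustration; the array names simply mirror the FORTRAN example above.
#include <stdio.h>
#define N 1000

int main(void)
{
    double a[N], b[N], d;
    int i;

    for (i = 0; i < N; i++)
        b[i] = (double) i;

    /* Each thread gets its own copy of d, so the flow, output and
       anti-dependencies on d vanish and the iterations become independent. */
    #pragma omp parallel for private(d)
    for (i = 0; i < N; i++) {
        d = b[i] * 17;
        a[i] = d + 14;
    }

    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}
Compiled with OpenMP enabled the loop runs in parallel; without it the pragma is ignored and the serial result is identical.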
2.9 Pointer Ambiguity in C
FORTRAN compilers depend on programmers to observe aliasing rules. That is, programmers are
not supposed to modify locations through pointers that may be aliases of one another. They can
become aliases in several ways, such as when two dummy arguments receive pointers to the same
storage locations:
CALL BOB (A,A)
...
END
SUBROUTINE BOB (X,Y)    ! X and Y become aliases
C compilers don't appreciate the same restrictions on aliasing. In fact, there are cases where aliasing
could be desirable. Additionally, C is blessed with pointer types, increasing the opportunities for
aliasing to occur. This means that a C compiler has to approach operations through pointers more
conservatively than a FORTRAN compiler would. The following loop nest looks like a FORTRAN loop
cast in C. The arrays are declared or allocated all at once at the top of the routine, and the starting
address and leading dimensions are visible to the compiler. This is important because it means that
the storage relationship between the array elements is well known. Hence, you could expect good
performance:
#define N ...
double a[N][N], c[N][N], d;
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j] = a[i][j] + c[j][i] * d;
Now imagine what happens if you allocate the rows dynamically. This makes the address calculations
more complicated. The loop nest hasn't changed; however, there is no guaranteed stride that can get
you from one row to the next. This is because the storage relationship between the rows is unknown:
#define N ...
double *a[N], *c[N], d;
for (i=0; i<N; i++)
{
a[i] = (double *) malloc (N*sizeof(double));
c[i] = (double *) malloc (N*sizeof(double));
}
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j] = a[i][j] + c[j][i] * d;
In fact, your compiler knows even less than you might expect about the storage relationship. For
instance, how can it be sure that references to a and c aren't aliases? It may be obvious to you that
they're not. You might point out that malloc never overlaps storage. But the compiler isn't free to
assume that. Who knows? You may be substituting your own version of malloc! Let's look at a
different example, where storage is allocated all at once, though the declarations are not visible to all
routines that are using it. The following subroutine bob performs the same computation as our
previous example. However, because the compiler can't see the declarations for a and c (they're in
the main routine), it doesn't have enough information to be able to overlap memory references from
successive iterations; the references could be aliases:
#define N ...
main( )
{
double a[N][N], c[N][N], d;
...
bob (a,c,d,N);
}
bob (double *a,double *c,double d,int n)
{
int i,j;
double *ap, *cp;
for (i=0;i<n;i++)
{
ap = a + (i*n);
cp = c + i;
for (j=0; j<n; j++)
*(ap+j) = *(ap+j) + *(cp+(j*n)) * d;
}
}
To get the best performance, make available to the compiler as many details about the size and shape
of your data structures as possible. Pointers, whether in the form of formal arguments to a subroutine
or explicitly declared, can hide important facts about how you are using memory. The more
information the compiler has, the more it can overlap memory references. This information can come
from compiler directives or from making declarations visible in the routines where performance is
most critical.
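One concrete way to pass that information to a C compiler, assuming it supports the C99 restrict qualifier (or an equivalent extension such as __restrict__), is to promise that the pointer arguments do not alias. The fragment below is an illustrative variant of the subroutine above, not a prescription; with the qualifiers in place, the compiler is free to overlap memory references from successive iterations much as a FORTRAN compiler would.
/* Same computation as bob(), but the restrict qualifiers assert that the
   regions reached through a and c never overlap, so the compiler may
   overlap and reorder loads and stores from successive iterations. */
void bob_restrict(double * restrict a, double * restrict c, double d, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        double * restrict ap = a + (i * n);      /* row i of a    */
        const double * restrict cp = c + i;      /* column i of c */
        for (j = 0; j < n; j++)
            ap[j] = ap[j] + cp[j * n] * d;
    }
}
When restrict is not available, compiler directives (for example an ivdep-style "ignore vector dependencies" pragma offered by several compilers) or simply keeping the declarations visible, as suggested above, achieve a similar effect.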
3 Performance of CFD Codes as Related to Hardware (CPU vs. GPU)
The method used to assess the performance of a parallel CFD solver is becoming a topic for debate.
While some implementations use a fixed number of outer iterations to assess the performance of the
parallel solver regardless of whether a solution has been obtained or not, other implementers use a
fixed value for the residual as a basis for evaluation. Ironically, a large number of implementers do
not mention the method used in their assessment! The reason for this discrepancy is that the first
group (who uses a fixed number of outer iterations) believes that the evaluation of the parallel
performance should be done using exactly the same algorithm which justifies the use of a fixed
number of outer iterations. This can be acceptable from an algorithmic point of view. The other group
(who uses a fixed value for the maximum residual) believes that the evaluation of the parallel
performance should be done using the converged solution of the problem which justifies the use of
the maximum residual as a criterion for performance measurement. This is acceptable from an
engineering point of view and from the user point of view. In all cases, the parallel code will be used
to seek a valid solution! Now if the number of outer iterations is the same as that of the sequential
version, so much the better! The problem becomes more complicated when an algebraic multigrid (AMG) solver is
used. Depending on the method used in implementing the AMG solver, the maximum number of AMG
levels in the parallel version will usually be less than that of the sequential version which raises the
issue that one is not comparing the same algorithm. From an engineering point of view, the main
concern is to obtain a valid solution for a given problem in a reasonable amount of time and thus, a
user will not actually perform a sequential run and then a parallel run; rather, she will require the
code to use as many AMG levels as possible.
3.1 CFD for Next Generation High Performance Computing
High Performance Computing (HPC) is moving towards large-scale parallelism. The Jaguar
supercomputer, which at the time was the fastest computer in the world, has over 200,000 processing
cores. On-chip parallelism has been increasing in regular processors (dual-core, quad-core, etc.) since
2001, but larger scales of parallelism are now being seen on a single chip. The introduction of
Graphics Processing Units (GPUs), which have hundreds of cores on a single chip, into HPC
represents a large change in the architectures being used for scientific computing. The scale of
parallelism and the new architectures require novel numerical solvers to be written and optimized for
solving CFD problems.
3.2 Hardware Consideration and CPU vs. GPU Technology
According to the WEKA blog, a CPU is a generalized processor designed to carry out a wide
variety of tasks, while a GPU (Graphics Processing Unit) is a specialized processing unit with enhanced
mathematical computation capability, ideal for computer graphics and machine-learning tasks. GPUs
have traditionally been used for rendering graphics, in which several relatively simple operations are
performed identically on all parts of the input to produce an output image [6]. The nature of this work
makes graphics rendering a good candidate for parallelization, and GPUs reflect this in their architecture
by having many cores. The differences between CPU and GPU architecture can be summarized in
four points, visualized in Figure 3.2.1:
 CPUs contain few cores but have a relatively large cache (several MBs).
 CPUs have many advanced features, such as predictive branching, out-of-order execution and
deep pipelines, to improve throughput.
 GPUs have hundreds of cores split into groups which share control hardware and high-speed
memory (equivalent to cache). This high-speed memory is very small (a few KBs).
 GPU cores are 'lightweight', i.e. they lack the advanced features of CPU cores.
Figure 3.2.1 Architecture Differences Between CPU and GPU
[6] Mark Mawson, Alistair Revell & Robert Prosser, "Computational Fluid Dynamics Codes For Next Generation
High Performance Computing", Computational Fluid Dynamics Group, University of Manchester.
GPUs are similar in function to CPU: they contain cores, memory, and other components. Instead of
emphasizing context switching to manage multiple tasks, GPU acceleration emphasizes parallel data
processing through a large number of cores [WEKA blog].
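To make the contrast concrete, the sketch below shows the kind of data-parallel CUDA kernel a GPU is built for: one lightweight thread per array element, all executing the same simple operation, with no need for large caches or deep out-of-order pipelines. The kernel and its names are generic illustrations, not part of any of the solvers discussed below.
#include <cuda_runtime.h>

// One lightweight thread per element: the same simple operation applied
// identically across the whole input, which is the workload GPUs are built for.
__global__ void scale_add(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch_scale_add(const float *d_x, float *d_y, float a, int n)
{
    int threads = 256;                          // threads per block
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n elements
    scale_add<<<blocks, threads>>>(d_x, d_y, a, n);
}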
3.2.1 Case Study 1 - 2D Laplace Equation
A V-Cycle Multigrid method was written for GPUs to solve a 2D Laplace problem (Δu = 0). The Multigrid Method solves a system of linear equations by restricting the error of the initial fine-grid solution to increasingly coarse grids and performing smoothing operations on them. It has been shown that higher-frequency errors are more susceptible to smoothing operations than low-frequency errors. By restricting the solution to coarser grids, the relative frequency of the errors increases, making them susceptible to smoothing. Each level of coarseness therefore allows a low-frequency error component to be smoothed; the solution at each grid level is then summed with the level above and further smoothing is carried out to remove any errors introduced by the summation. The result is a solution that is smoothed across all frequencies of error.
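The structure just described, smooth, restrict the residual, correct from the coarse grid, smooth again, is easiest to see in code. The listing below is a self-contained, plain-C sketch of a V-cycle for the 1D model problem -u'' = f with homogeneous boundary conditions; it is meant only to illustrate the cycle and is not the authors' 2D GPU implementation.
#include <stdio.h>
#include <stdlib.h>

/* Minimal 1D V-cycle multigrid for -u'' = f on (0,1), u(0)=u(1)=0. */
static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    /* damped Jacobi: high-frequency error components decay fastest */
    double *tmp = malloc((n + 2) * sizeof(double));
    for (int s = 0; s < sweeps; s++) {
        for (int i = 1; i <= n; i++)
            tmp[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
        for (int i = 1; i <= n; i++)
            u[i] += (2.0 / 3.0) * (tmp[i] - u[i]);
    }
    free(tmp);
}

static void v_cycle(double *u, const double *f, int n, double h)
{
    smooth(u, f, n, h, 3);                          /* pre-smoothing */
    if (n > 1) {
        int nc = (n - 1) / 2;                       /* coarse interior points */
        double *r  = calloc(n + 2,  sizeof(double));
        double *fc = calloc(nc + 2, sizeof(double));
        double *uc = calloc(nc + 2, sizeof(double));
        for (int i = 1; i <= n; i++)                /* residual r = f + u''   */
            r[i] = f[i] + (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h);
        for (int i = 1; i <= nc; i++)               /* full-weighting restriction */
            fc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
        v_cycle(uc, fc, nc, 2.0 * h);               /* coarse-grid correction */
        for (int i = 1; i <= nc; i++) {             /* linear prolongation    */
            u[2 * i]     += uc[i];
            u[2 * i - 1] += 0.5 * (uc[i - 1] + uc[i]);
        }
        u[n] += 0.5 * uc[nc];                       /* last fine point        */
        free(r); free(fc); free(uc);
    }
    smooth(u, f, n, h, 3);                          /* post-smoothing */
}

int main(void)
{
    int n = 255;                                    /* 2^8 - 1 interior points */
    double h = 1.0 / (n + 1);
    double *u = calloc(n + 2, sizeof(double));
    double *f = calloc(n + 2, sizeof(double));
    for (int i = 1; i <= n; i++) f[i] = 1.0;        /* arbitrary right-hand side */
    for (int cycle = 0; cycle < 10; cycle++)
        v_cycle(u, f, n, h);
    printf("u(0.5) = %f (exact 0.125)\n", u[(n + 1) / 2]);
    free(u); free(f);
    return 0;
}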
3.2.1.1 Results
GPU and CPU implementations of the V-Cycle Multigrid Method were tested on grids of size up to 4097×4097 elements (see Figure 3.2.2). The maximum grid size was limited by the size of RAM on the GPU (4 GB); future work will include allowing partitions of larger grids to be moved to and from the GPU. The GPU implementation performed up to 12× faster than the CPU version.
Figure 3.2.2 Result for a V-Cycle Multigrid
3.2.1.2 Future Work - Heterogeneous Computing
In the V-cycle method shown, the CPU is idle while functions run on the GPU. The principle of heterogeneous computing is that the CPU will perform other tasks while the GPU is being used. For CFD applications this could include using the CPU to assist with the solver or, for time-dependent problems, post-processing the previous time step while the GPU calculates the latest time step (see Figure 3.2.3).
Figure 3.2.3 Heterogeneous Computing using CPUs and GPUs
3.2.2 Case Study 2 - Unstructured Grid Based CFD Solvers on Modern Graphics Hardware
The 3D Euler equations for inviscid, compressible flow are considered by [Corrigan et al.] [7]. Effective memory bandwidth is improved by reducing total global memory access and overlapping redundant computation, as well as using an appropriate numbering scheme and data layout. The applicability of per-block shared memory is also considered. The performance of the solver is demonstrated on two benchmark cases: a missile and the NACA 0012 wing. For a variety of mesh sizes, an average speed-up factor of roughly 9.5x is observed over the equivalent parallelized OpenMP code running on a quad-core CPU, and roughly 33x over the equivalent code running in serial.
[7] Andrew Corrigan, Fernando Camelli, Rainald Lohner, and John Wallin, "Running Unstructured Grid Based CFD Solvers on Modern Graphics Hardware", 19th AIAA Computational Fluid Dynamics, 2009.
3.2.2.1 Background and Literature Survey
Recently, GPUs (Graphics Processing Units) have seen a tremendous increase in performance. In
addition to this high computational performance, the latest modern graphics hardware offers
increasing memory capacity, as well as support for 64-bit floating point arithmetic. As massively parallel, multi-core processors, GPUs offer tremendous potential for applications in computational fluid dynamics.
In order to fully exploit the computational power of such hardware, considerable care is required in
the coding and implementation, particularly in the memory access pattern. GPUs have general-
purpose global memory, which is not automatically cached and exhibits high latency in comparison
with the instruction throughput of GPUs. Furthermore, with earlier CUDA-enabled GPUs, there were
stringent requirements for achieving optimal effective memory bandwidth, with a large loss of
performance when these requirements went unmet. With the data-dependent memory access of
unstructured grid based solvers, this loss of performance is almost assured. However, with due care,
structured grid based solvers can meet these requirements due to the regular memory access
patterns of such solvers, as described in the work of (Brandvik & Pullan), and (Tolke). Further work
on regular grid solvers includes that of (Phillips et al.), who have developed a 2D compressible Euler solver on a cluster of GPUs, and [Thibault et al.] [8], who have implemented a 3D incompressible Navier-Stokes solver for multi-GPU systems. So far, the implementation of optimized unstructured grid based solvers for modern graphics hardware has been relatively rare, perhaps due to these stringent requirements. In fact, just prior to its first release, [Owens et al.] [9] comprehensively surveyed the field of general-purpose computation on graphics hardware (GPGPU), which included a number of primarily structured grid based solvers, such as those of [Harris] [10], [Scheidegger et al.] [11], and [Hagen et al.] [12]. However, the architecture has changed substantially and many of the limitations of GPGPU via traditional graphics APIs such as OpenGL are no longer an issue.
The most recent CUDA-enabled GPUs have looser requirements for achieving high effective memory
bandwidth. Roughly speaking, memory no longer needs to be accessed in a specific order by
consecutive threads. Rather, high effective memory bandwidth can be achieved as long as
consecutive threads access nearby locations in memory, which is called coalescing. Thus, if an
appropriate memory access pattern is obtained, one can expect that modern GPUs will be capable of
achieving high effective memory bandwidth and in general high performance for unstructured grid
based CFD solvers. The purpose of this work is to study techniques which achieve this.
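The difference is easiest to see in a pair of toy kernels (illustrative only, not taken from the solver under discussion). In the first, thread i touches element i, so the 32 threads of a warp read 32 consecutive values and the hardware coalesces them into a few wide transactions; in the second, consecutive threads touch addresses a fixed stride apart, and the same warp generates many separate transactions.
// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: consecutive threads read addresses 'stride' elements apart,
// so effective bandwidth drops (in must hold at least n * stride elements).
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride];
}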
3.2.2.2 Implementation on Graphics Hardware
The performance-critical portion of the solver consists of a loop which repeatedly computes the time derivatives of the conserved variables [see Corrigan et al.] [7]. The conserved variables are then updated using an explicit Runge-Kutta time-stepping scheme. The most expensive computation consists of accumulating flux contributions and artificial viscosity across each face when computing the time derivatives. Therefore, the performance of the CUDA kernel which implements this computation is crucial in determining whether or not high performance is achieved, and is the focus of this section.
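As a rough illustration of the gather-with-redundant-computation idea described above, the sketch below assigns one thread per element, which loops over that element's faces and recomputes the face contribution locally so that every write stays in the thread's own entries. The data layout, the fixed four-face (tetrahedral) assumption and the placeholder flux expression are assumptions made here for brevity, not the paper's actual data structures or numerics.
// Schematic gather-style residual kernel: one thread per element accumulates
// its own face contributions. The flux at a shared face is recomputed by both
// neighbouring elements (redundant computation), so no scattered or atomic
// writes are needed. NVAR, NFACE, the layout and the placeholder flux are
// illustrative assumptions only.
__global__ void accumulate_residual(const float *q,        // NVAR values per element
                                    const int   *face_nbr, // neighbour of each face
                                    float       *dqdt,     // NVAR values per element
                                    int n_elem)
{
    const int NVAR = 5, NFACE = 4;
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_elem) return;

    float res[NVAR] = {0.f};
    for (int f = 0; f < NFACE; f++) {
        int nbr = face_nbr[e * NFACE + f];
        for (int v = 0; v < NVAR; v++) {
            // placeholder: a real solver evaluates the Euler flux plus
            // artificial viscosity from q[e], q[nbr] and the face geometry
            float flux_placeholder = 0.5f * (q[nbr * NVAR + v] - q[e * NVAR + v]);
            res[v] += flux_placeholder;
        }
    }
    for (int v = 0; v < NVAR; v++)
        dqdt[e * NVAR + v] = res[v];
}
A real kernel must also choose the numbering scheme and data layout so that neighbouring threads touch neighbouring addresses, in line with the coalescing remarks above.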
3.2.2.3 Test Case
The performance of the GPU code was measured on a prototype NVIDIA Tesla GPU, supporting
compute capability 1.3, with 24 multi-processors. A NACA 0012 wing in supersonic (M = 1.2 , α = 0)
flow was used as a test case; the pressure contours are plotted in Figure 3.2.4 (Left). Timing
measurements taken when running in single precision for a variety of meshes show an average
performance scaling factor of 9.4x in comparison to the OpenMP code running on four cores and
32.6x in comparison to the OpenMP code on one core. Furthermore, the code running on graphics
hardware is faster by a factor of 3.9X using redundant computation in comparison to pre-computed
flux contributions. Timing measurements when running in double-precision are given in Figure
3.2.5-(Top) for a variety of meshes, showing an average performance scaling factor of 1.56X in
comparison to the OpenMP code running on four cores and 4.7X in comparison to the OpenMP code
on one core. Furthermore, the code running on graphics hardware is faster by a factor of 1.1X using
redundant computation in comparison to pre-computed flux contributions.
A missile in supersonic (M = 1.2 , α = 8) flow was used as an additional test case. The pressure
contours are plotted in Figure 3.2.4-(right). Timing measurements when running in double-
precision are given in Figure 3.2.5-(bottom) for a variety of meshes, showing an average
performance scaling factor of 2.5X in comparison to the OpenMP code running on four cores and 7.4X
in comparison to the OpenMP code on one core. Furthermore, the code running on graphics
hardware is faster by a factor 1.63X using redundant computation in comparison to pre-computed
flux contributions. For additional details, consult [Corrigan et al.] [7].
Figure 3.2.4 Pressures at the Surface and Plane for the NACA 0012 (Left) and at the Surface for the Missile (Right)
Figure 3.2.5 Running Times in Double Precision Per Element Per Iteration for the NACA 0012
[8] Thibault, J. and Senocak, I., "CUDA Implementation of a Navier-Stokes Solver on Multi-GPU Desktop Platforms for Incompressible Flows", 47th AIAA Aerospace Sciences Meeting Including The New Horizons Forum and Aerospace Exposition, No. AIAA 2009-758, January 2009.
[9] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., and Purcell, T. J., "A Survey of General-Purpose Computation on Graphics Hardware", Computer Graphics Forum, Vol. 26, No. 1, 2007.
[10] Harris, M., "Fast Fluid Dynamics Simulation on the GPU", GPU Gems, chap. 38, Addison-Wesley, 2004.
[11] C. Scheidegger, J. Comba, R. C., "Practical CFD Simulations on the GPU using SMAC", Computer Graphics Forum, Vol. 24, 2005.
[12] Hagen, T., Lie, K.-A., and Natvig, J., "Solving the Euler Equations on Graphics Processing Units", Proceedings of the 6th International Conference on Computational Science, Vol. 3994 of Lecture Notes in Computer Science, Springer, May 2006.
3.2.3 Case Study 3 - Accelerating CFD Simulation With High Order Finite Difference Method on Curvilinear Coordinates for Modern GPU Clusters
Authors : Chuangchao Ye, Pengjunyi Zhang, Rui Yan, Dejun Sun, Zhenhua Wan
Affiliations : Department of Modern Mechanics, University of Science and Technology of China, Hefei, 230027, China
Title : Accelerating CFD simulation with high order finite difference method on curvilinear coordinates
for modern GPU clusters
Citation : Ye, C., Zhang, P., Yan, R., Sun, D., & Wan, Z. (2020). Accelerating CFD simulation with high
order finite difference method on curvilinear coordinates for modern GPU clusters. arXiv:
Computational Physics.
Source : arXiv:2006.07964v1
High-fidelity flow simulation for complex geometries at high Reynolds number (Re) is still very challenging, and requires more powerful computational capability from HPC systems [Ye et al.] [15]. However, the development of HPC with the traditional CPU architecture suffers bottlenecks due to its high power consumption and technical difficulties. Heterogeneous-architecture computation has been raised as a promising solution to the difficulties of HPC development. GPU accelerating technology has been utilized in low order scheme CFD solvers on structured grids and in high order scheme solvers on unstructured meshes. High order finite difference methods on structured grids possess many advantages, e.g. high efficiency, robustness and low storage; however, the strong dependence among points in a high order finite difference scheme still limits its application on the GPU platform. In the present work, we propose a set of hardware-aware techniques to optimize the efficiency of data transfer between CPU and GPU and the efficiency of communication between GPUs. An in-house multi-block structured CFD solver with high order finite difference methods on curvilinear coordinates is ported onto the GPU platform and obtains satisfying performance, with a maximum speedup of around 2000x over a single CPU core. This work provides an efficient solution for applying GPU computing in CFD simulation with certain high order finite difference methods on current GPU heterogeneous computers. The tests show that significant accelerating effects can be achieved for different GPUs.
[15] Chuangchao Ye, Pengjunyi Zhang, Rui Yan, Dejun Sun, Zhenhua Wan, "Accelerating CFD simulation with high order finite difference method on curvilinear coordinates for modern GPU clusters", arXiv:2006.07964v1, 2020.
Keywords: Hardware-aware, High Order, Finite Difference Methods, Curvilinear Coordinates, GPU,
M=million.
3.2.3.1 Introduction
CFD is one of the most important research methods in fluid mechanics, and it relies heavily on the computational capability of the computer, especially for accurately simulating realistic engineering flows. The deficiency of computational capability has now become one of the biggest obstacles to future CFD development. High order methods have been adopted in complex flow simulations such as turbulence simulation and computational aeroacoustics (CAA), on the merit that they can obtain more accurate results with fewer grid points than low order methods due to their low-dissipation property. However, high order methods are more computationally expensive compared to low order methods. Real flows in nature usually occur at large Reynolds numbers. Chapman estimated in 1979 that the grid-point requirement N for wall-modeled large eddy simulation (LES) scales as N ~ Re_L^(2/5), for wall-resolving LES as N ~ Re_L^(9/5), and for direct numerical simulation (DNS) as N ~ Re_L^(9/4) [1]. In 2012, Choi & Moin [2] revised Chapman's estimates to Re_L, Re_L^(13/7) and Re_L^(37/14), respectively.
Since Gordon E. Moore, the co-founder of Intel, observed that the number of transistors on a microchip doubles roughly every two years while the cost of computers is halved, Moore's law has governed the development of CPUs for decades. However, in the past few years there have been many voices about the gradual failure of this golden law, because the high temperatures of transistors would eventually make it harder to create smaller circuits. In the meantime, many heterogeneous computational architectures have arisen, such as the field-programmable gate array (FPGA), the automata processor (AP) and the graphics processing unit (GPU). Among those heterogeneous architectures, GPU computing has grown to be the most popular one in high performance computing (HPC). The GPU, originally designed for graphics rendering, delivers great computational capability due to its "many cores" architecture.
Compared to a traditional CPU, which contains only several cores, a GPU contains thousands of cores on a chip, which provide more powerful computation capability. In addition, a GPU has lower power consumption and is cheaper than a CPU for equivalent computation capability. In the latest TOP500 list of HPC, half of the 10 fastest HPC systems are equipped with GPU devices. Summit, the fastest supercomputer in the world, located at the Oak Ridge National Laboratory (ORNL) in the USA, consists of 4608 compute nodes, and each node contains 2 IBM POWER9 processors and 6 NVIDIA Volta V100 GPUs. The GPUs deliver over 95% of Summit's performance. Compared to the traditional CPU-architecture supercomputer Tianhe-2A, Summit delivers double the peak performance (200 PetaFLOPS) with nearly half the power consumption (about 10 MW).
The merits of GPU computing have attracted many attempts to port CFD simulations to the GPU platform. Tutkun et al. [3] solve the compressible Navier-Stokes equations with a compact scheme combined with filtering on a Tesla C1060 GPU, and get a speedup of 9x to 16x compared to an AMD Phenom CPU. Esfahanian et al. [4] perform 3rd and 5th order WENO schemes on a two-dimensional uniform mesh solving the Euler equations on a GPU, and get a maximum speedup of 316x. Karantasis et al. [5] perform 5th, 7th and 9th order WENO schemes on a uniform grid with 6.9 M grid points, accelerating with a Tesla S1070 GPU, and get a maximum speedup of 90x. Xu et al. [6] perform the high order finite difference WCNS scheme and the Hybrid cell-edge and cell-node Dissipative Compact Scheme (HDCS) on curvilinear coordinates.
Their GPU-only approach achieves a speedup of about 1.3x on a Tesla M2050 GPU compared with two Xeon X5670 CPUs, while the hybrid CPU and GPU approach achieves a maximum 2.3x speedup. Lai et al. [7, 8] develop a multiple-GPU algorithm for hypersonic flow computations with a second order finite volume method on curvilinear coordinates, and obtain 20x to 40x speedup accelerating with an Nvidia GTX1070 GPU compared to an 8-core Intel Xeon E2670 CPU.
Elsen et al. [9] develop a first order to sixth order finite difference code for complex geometry on the GPU platform, and obtain a speedup of 40x for a simple geometry and 20x for a complex geometry on an Nvidia 8800GTX GPU compared to an Intel Core 2 Duo. Very recently, Lei et al. [10] accelerate a compressible flow solver based on a second order MUSCL/NND code on Cartesian coordinates with a GPU. A Tesla P100 GPU is used for acceleration and gets a speedup of 260x for the MUSCL scheme and 144x for the NND scheme over an E5-2640 v4 CPU core.
In general, for flow solvers in Cartesian coordinates and with low order schemes, the application of GPUs yields conspicuous acceleration. But for a complicated coordinate system and high order schemes, the flow solvers achieve an inferior accelerating effect, which is far from the computational requirement of higher-fidelity and larger-scale simulation.
High order methods based on unstructured meshes, e.g. the flux reconstruction (FR) method and the discontinuous Galerkin (DG) method, are popular for accurately simulating flow with complex geometry. GPU computing achieves great success in this branch of high order methods due to the computational independence of the elements. Crabill et al. [11] simulated the flow over a spinning golf ball at a Reynolds number of 150,000 with a flux reconstruction (FR) solver on GPUs. Witherden et al. [12] developed a solver based on high order flux reconstruction (FR) which has been used for a variety of flows. Although it is easy to improve precision, the numerical stability of this type of method in solving high speed flows is poor, which limits its application. Furthermore, this method requires a large amount of storage, which also limits its application in large scale simulations in GPU computing. In contrast, the finite difference method is more storage-saving and more robust when strong discontinuities exist. In general, the finite difference method is still an important method in high order CFD simulations, and deserves further development.
In the past few years, the hardware performance of GPUs has been greatly improved, and a series of technologies has been developed which makes the heterogeneous computer more effective and mature. The current GPU computer is not simply a traditional CPU computer equipped with one or two GPUs; it is a highly integrated hardware and software system. One computer has multiple GPUs, and the GPUs become the main power of computing. To fully exploit the performance of a current GPU computer, programs must adjust themselves to adapt to the complicated computer architectures. In the present work, based on an architecture analysis of modern GPU servers, we propose a set of hardware-aware techniques to optimize the performance of data transfer between CPU and GPU and the communication efficiency between GPUs. An in-house code is ported onto the GPU with careful memory planning and high-efficiency kernel design. Moreover, a modified alternative formulation of the WENO scheme, which saves computation and is more efficient in GPU computing, is proposed.
3.2.3.2 Numerical Methods
The discussion regarding the RANS formulation can be obtained from the source paper [Ye et al.]. In addition, the derivatives of the inviscid fluxes using the high order Alternative Formulation of the Weighted Essentially Non-Oscillatory scheme (AFWENO) [13] are also covered there, and will not be repeated here. Readers are also advised that the present study is based on an in-house code, "HiResX (High Resolution flow Solver)", which aims to simulate compressible flows with complex geometries to high order accuracy with finite difference methods.
3.2.3.3 Hardware Environment
A traditional CPU-based supercomputing cluster usually consists of hundreds of thousands of computational nodes, and each node contains dozens of cores. These nodes are interconnected with a high speed network. Up to now, the number of cores in a single CPU is still limited. Due to this limitation, a large task must be divided into many partitions and distributed to many nodes. One obvious drawback is that the complex communication between nodes may lead to high latency.
Furthermore, the price/performance ratio of a traditional CPU-based cluster is low, due to the fact that you have to pay for the frame of each computation node when increasing CPU cores, which does not directly contribute to computational capability. The computational capability of a computing device with a specific space occupation can be represented as its "computing density". In the past few years, supercomputing has entered the heterogeneous computing era, and supercomputers with higher computing density have been developed. Those computers are accelerated with devices called many-core processor accelerators, such as GPUs and FPGAs. In the realm of heterogeneous computing, the GPU is the most popular one. Figure 3.2.6 shows the framework of a modern GPU server cluster. Each node contains two CPUs and many more GPUs than before. The GPUs in the same server communicate with each other through PCIe slots, and further communicate with GPUs in other servers over the network. A more sophisticated technology called NVLink has been developed by NVIDIA to enable direct data transfer between GPUs in the same server at a faster speed than PCIe's.
Figure 3.2.6 Framework of a modern GPU cluster. The computational nodes are connected with a high speed network. GPUs deliver the majority of the performance of a modern GPU cluster.
The two GPU servers utilized in this work are equipped with 10 NVIDIA RTX 2080 Ti GPUs and 10 NVIDIA Titan V GPUs, respectively. The RTX 2080 Ti GPU utilizes the newest Turing architecture of NVIDIA GPUs, while the Titan V utilizes the Volta architecture. The RTX 2080 Ti GPU is a consumer-oriented product aimed at video and gaming; though it has up to 4352 CUDA cores, its double precision (DP) performance is only 1/32 of its single precision (SP) performance. However, the GV100 core of the Titan V contains 5120 CUDA cores, and its DP/SP ratio is up to 1/2, which makes it still the most powerful GPU core up to now.
We also utilize the Tesla P100 GPU, based on the much older Pascal architecture. Although it contains fewer CUDA cores than the RTX 2080 Ti, it has much higher theoretical DP performance. The detailed specs of the GPUs mentioned above can be found in Table 3.2.1. The CPU of the server is an Intel Xeon E5-2680 v3, with 12 cores in a single CPU, and its spec is given in Table 3.2.2. The ideal computational performance of the dual CPUs in the server is 960 GFlops (FP64 performance). With GPU acceleration, 10 RTX 2080 Ti GPUs deliver 134.5 TFlops of SP float performance and 4.2 TFlops of DP float performance, while 10 Titan V GPUs deliver an incredible 149.0 TFlops SP and 74.5 TFlops DP performance. In this work, no NVLink is applied.
Table 3.2.1 Present GPU specs. Titan V has the highest double precision performance. RTX 2080 Ti utilizes the newest architecture and provides the highest single precision performance.
Table 3.2.2 Present spec of the CPU. The double precision performance of the E5-2680 v3 is slightly higher than the RTX 2080 Ti's.
3.2.3.4 Programming Implementation and Optimization of HiResX
3.2.3.4.1 Code Introduction
HiResX is originally written in Fortran 90, and aims at simulating various compressible flows in complex domains with high order schemes. The HiResX solver is equipped with various turbulence models, such as Reynolds-Averaged Navier-Stokes (RANS) models, an LES model and a detached eddy simulation (DES) model, to simulate realistic engineering flows with high Reynolds numbers. The Message Passing Interface (MPI) is utilized to distribute computation tasks to different CPU cores inter- or intra-node.
There are several strategies for performing computation on GPU devices. OpenACC offers a user-driven, lightweight solution: the user simply adds some directive-based clauses to the code segments that need to be accelerated on the GPU, with significantly less programming effort than required by a low-level model. CUDA Fortran, based on PGI Fortran, is another solution, provided by the PGI company, which offers programming flexibility using the Fortran language. However, CUDA Fortran relies heavily on the PGI compiler. CUDA C/C++ is a programming language based on standard C/C++ and developed by NVIDIA, and it provides better affinity and flexibility for operating NVIDIA's GPUs through low-level programming.
However, programming with CUDA C/C++ requires much more skill. Porting an existing complex code onto the GPU by completely rewriting it in another programming language is not a wise choice. In order to fully exploit the performance of a current GPU server without rewriting the whole solver, mixed programming of Fortran and C/C++ is applied, and GPU computing is achieved with CUDA C/C++. GPU computing for each process is independent of the other processes when HiResX is running in parallel; thus HiResX supports three parallel modes: CPU-only mode, GPU-only mode and collaborating (or hybrid) CPU/GPU mode. In CPU-only parallel mode, all processes perform computation on the CPU only, whether or not the computation node has GPU devices. This parallel mode is suitable for most traditional supercomputers, which support CPU computing only. In GPU-only parallel mode, all processes run on GPU devices; hence, for a supercomputing cluster, every computational node running the code must be equipped with a GPU device. The hybrid CPU/GPU mode is a parallel mode that can make full use of all CPU and GPU resources in the cluster, as mentioned in Ref. [15]. For this parallel mode, the computational capabilities of the CPU and GPU in one node should be roughly comparable.
Figure 3.2.7 Several typical PCIe root architectures of a GPU server. For type (a), each NUMA node has a PCIe switch attached, and there are two GPUs mounted on each PCIe switch. For type (b), only one PCIe switch is mounted on each NUMA node, and there are 4 GPUs attached to one PCIe switch. For type (c), all GPUs are mounted on one NUMA node by two PCIe switches. For type (d), all GPUs are mounted on one NUMA node.
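A minimal sketch of how such Fortran / CUDA C/C++ mixing can be wired together is given below; the routine name, the argument list and the per-call copies are invented for illustration and are not HiResX's actual interface (HiResX keeps the data resident on the device, as described next, rather than copying on every call). The CUDA side exposes a plain C entry point, which the Fortran side can call through ISO_C_BINDING, as indicated in the comment.
// update_q.cu -- illustrative CUDA C/C++ wrapper callable from Fortran.
// The Fortran side could declare it as:
//   interface
//     subroutine update_q(q, n) bind(C, name="update_q")
//       use iso_c_binding
//       real(c_double) :: q(*)
//       integer(c_int), value :: n
//     end subroutine
//   end interface
#include <cuda_runtime.h>

__global__ void scale_kernel(double *q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        q[i] *= 2.0;              // stand-in for a real update of the solution
}

extern "C" void update_q(double *q_host, int n)
{
    double *q_dev;
    size_t bytes = (size_t)n * sizeof(double);
    cudaMalloc((void **)&q_dev, bytes);
    cudaMemcpy(q_dev, q_host, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(q_dev, n);
    cudaMemcpy(q_host, q_dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(q_dev);
}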
The procedures of GPU computing in HiResX are illustrated in Figure 3.2.10. After the initialization of the program, a process which runs its computation on a GPU uploads the conservative variables Q, the primitive variables Qv, the Jacobian J and the metrics at the cell nodes to the GPU device. To avoid frequent data transfer between CPU and GPU, all these variables in the computational domain are uploaded to the GPU and stored in the GPU's global memory. The Jacobian J and the metrics at the cell edges, which are computed by the CPU at startup, are not uploaded, in consideration of three aspects. Firstly, if all these variables at the cell edges in the three directions were stored in GPU memory, additional memory space for 12 variables would be needed, which would reduce the maximum grid size that a single GPU can deal with. Secondly, if memory space for only one direction were allocated on the GPU, the memory space would be reduced to a third of the former case, but the frequent data transfer during direction switching would be time-consuming. Thirdly, for a GPU, the computational capability is much stronger than the memory bandwidth, hence direct interpolation from the Jacobian and metrics at the cell nodes saves more time. The calculation of the residual dQ on the right hand side, the time advancement and the updating of the primitive variables from the conservative variables run completely on the GPU. When a physical boundary condition is applied, very little data needs to be uploaded to the GPU, such as the ID numbers of the boundary points, which takes so little time that it can be ignored.
Figure 3.2.8 Main structure of HiResX summarized by pseudo code
For a traditional CPU computer, the user may not be concerned with where and how their program is running. When the program runs in parallel, the user needs only to distribute the computation to different CPU processes with MPI, or to different threads with tools such as OpenMP. Users rarely care about the hardware information when they develop the program. However, in current GPU computing, data exchanges at block-connecting boundaries are more complex, due to the complex environment of the heterogeneous computational architecture. Figure 3.2.9 shows the communication structures of our solver. When the solver runs in CPU-only mode, data exchange among blocks in the same process is straightforward: the data is simply copied without communication with any other process, as depicted in the figure with the yellow pentagon pairs.
Figure 3.2.10 A global glance at the execution procedures of HiResX running on GPU. Except for the initialization and data preparation performed on the CPU, all computations are performed on the GPU, while the CPU is used only to schedule GPU kernels. "Data 0" represents the initial data computed on the CPU and uploaded to the GPU. "Data 1" represents data exchanged between processes. "Data 3" represents flow field data to be written; it is downloaded to the CPU and written with a new thread.
Figure 3.2.9 Framework of communication of HiResX without GPU peer-to-peer communication technology support. For communication between processes that are all running on the CPU: if the processes are located in the same node, they exchange data within RAM with the MPI-3 shared memory technique (red path); if the processes are located in different nodes, standard MPI communication is utilized (yellow path connecting the triangle pair). For processes with GPU acceleration, data on the GPU is downloaded back to the CPU, and the processes then communicate in the same way as processes without acceleration.
For blocks in different processes in the same node, data exchange is achieved by means of the shared memory communication technique of the MPI-3 standard, as depicted in the figure with the red five-pointed star pairs. For blocks in different processes in different nodes, data exchange is achieved by standard MPI communication, which is marked with the brown triangle pairs.
In GPU computing, the data must first be uploaded to the GPU from the CPU, and downloaded back to the CPU when the GPU computation finishes. The data has to be transferred between the memories of the CPU and GPU through the PCIe bus. In the early stage of the development of GPU computing, each GPU was installed in a different PCIe slot, worked independently, and communicated with the CPU only. If the program is run in parallel with multiple processes on GPUs, then whenever communication among processes is involved, the data to be sent has to be downloaded from the GPU and the communication is performed entirely on the CPU through MPI. After the data is exchanged, the received data has to be uploaded to the GPU. It is obvious that in this procedure, processes on different GPUs communicate through CPU memory and explicit data copying between CPU and GPU. When the data on the GPU has been downloaded to the CPU, the communication reverts to general communication on the CPU, and all the usual techniques can be applied, as depicted with the four-pointed star pair and the seven-pointed star pair.
However, the Unified Virtual Addressing (UVA) technology and CUDA-aware MPI technology make this procedure simpler and more efficient. There is no need to perform explicit data copying, because the MPI interfaces can recognize the location of the buffer data and then find the optimal path to transfer the data. For blocks in the same process with GPU acceleration, data can be exchanged on the GPU without leaving the device.
When the program runs in hybrid CPU/GPU mode, the communications among processes without GPU acceleration (CPU-only mode) are the same as the communications among processes with GPU acceleration (GPU-only mode). The only thing to deal with is the communication between processes with and without GPU acceleration, as marked with the diamond pair in Figure 3.2.9. For the process with acceleration, the data to be sent should be downloaded to CPU memory explicitly if CUDA-aware MPI is not applied, while the process without acceleration does nothing with the GPU. When the data is downloaded to the CPU, the communication can be performed similarly to that in CPU-only mode. However, with CUDA-aware MPI technology and UVA, the data in GPU and CPU can be exchanged directly.
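The practical difference is easy to show. The fragment below sketches a halo exchange under the assumption that the MPI library is CUDA-aware, in which case the device pointers can be handed straight to MPI; the staging that a non-CUDA-aware build would need is indicated in the comments. The buffer names and the single-neighbour pattern are illustrative only.
#include <mpi.h>
#include <cuda_runtime.h>

// Exchange a halo buffer of 'count' doubles with a neighbouring rank.
// d_send and d_recv are device pointers obtained from cudaMalloc.
void exchange_halo(double *d_send, double *d_recv, int count,
                   int neighbour, MPI_Comm comm)
{
    // With CUDA-aware MPI and UVA, the library recognizes that the buffers
    // live in GPU memory and chooses the best transfer path itself.
    MPI_Sendrecv(d_send, count, MPI_DOUBLE, neighbour, 0,
                 d_recv, count, MPI_DOUBLE, neighbour, 0,
                 comm, MPI_STATUS_IGNORE);

    // Without CUDA-aware MPI the same exchange needs explicit staging:
    //   cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
    //   MPI_Sendrecv(h_send, ..., h_recv, ...);
    //   cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}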
3.2.3.4.2 Domain Decomposition
The HiResX solver is developed to simulate flows in complex domains with multi-block structured grids. Figure 3.2.11 illustrates the domain decomposition strategies of HiResX. The whole computational domain is split into many connected blocks, and these blocks are distributed to different processors. For hybrid CPU/GPU parallel computing, the sizes of the grid blocks can be different. The size of a grid block for a CPU is usually much smaller than that for a GPU, since the latter has higher computational capability. For HiResX, in order to balance the computational loads of the CPU and GPU processes, the whole domain can be split arbitrarily, without the limitation of one-to-one block connection, and the block-connecting faces and physical boundary faces can also be arbitrarily defined on the blocks.
Figure 3.2.11 Domain decomposition strategies of HiResX. The domain is divided into several blocks according to the performance of the CPU and GPU in order to balance the workloads of each process. For the blocks computed by a CPU, the whole block is computed by one CPU process. For the blocks computed by a GPU, each grid cell is computed by one CUDA thread.
3.2.3.4.3 Hardware Technique
In a current GPU server, more GPUs are installed in a single machine. Moreover, a variety of PCIe root architectures have been designed for different performance needs, as demonstrated in Figure 3.2.7. The whole hardware system needs to match the increasing performance of current GPUs, and the solver also needs to adapt to the hardware system for maximum computational performance. A complex PCIe root architecture will affect the efficiency of CPU-GPU memory access and even of processing unit (PU) to RAM access. Figure 3.2.12 (a) and Figure 3.2.12 (b) show the memory access models of GPU to CPU and GPU to GPU. The server has two CPUs interconnected with QPI, making it a dual-root system. In Figure 3.2.12 (a), a process running on processor core PU0 in CPU-0 selects GPU0 for acceleration. The data of this process is allocated in the memory (RAM) in the same root as CPU-0 and GPU0. As a result, the process visits its data in RAM with optimal bandwidth, and GPU0 also transfers data to the CPU with optimal bandwidth. This is the ideal situation. In fact, however, the PU on which the process runs is not fixed, and is managed by the system by default. For the pink path in Figure 3.2.12 (a), the process is running on PU2 in CPU-1, but selects GPU2 as the accelerator. Though it visits data allocated in the closest RAM with optimal bandwidth, GPU2 is not in the same root as CPU-1. Data transfer between CPU and GPU must traverse QPI because direct memory access is not supported, so it is very inefficient. The worst case is the blue path. The data is allocated in RAM close to CPU-0 and GPU1, but the process is running on a PU of CPU-1, resulting in poor efficiency of both GPU memory access and RAM access. The PCIe root architecture will also affect GPU to GPU communication and memory access. Figure 3.2.12 (b) gives several possible models of GPU to GPU memory access in a current GPU server.
Figure 3.2.12 CPU-GPU and GPU-GPU memory access models. In (a), the red path is the optimal one, while the pink path is inferior because CPU to GPU memory access across NUMA nodes is worse than local access. In (b), if GPUs support peer-to-peer (P2P) communication technology, the green path indicates that two GPUs communicate by NVLink, which is the fastest path. The purple path indicates that two GPUs communicate by a PCIe switch (PLX), which is the most efficient communication path without NVLink. The yellow path means that two GPUs communicate via the host (CPU). The red path is the worst one because P2P is not supported across NUMA nodes, and memory access between two GPUs in this way must be transferred by both CPUs.
The red path is the worst case since it traverses QPI, and the topology of GPU2 and GPU3 is usually marked as "SYS". For the brown path, GPU4 and GPU5 are attached to the same IOH chip, which provides better communication efficiency than "SYS"; this topology is usually called "PHB". The purple path connects two GPUs on the same PCIe switch, which provides better efficiency than "PHB", and its topology is marked as "PIX". The green path is the best: the GPUs interconnect directly with NVLink, which is the fastest way of GPU to GPU communication so far, labeled as "NV".
To exploit the optimal efficiency of the hardware system, users should know its topology. The Hardware Locality (HWLOC) software package provides an approach to gather hierarchical topology information about modern, increasingly complex parallel computing platforms, including NUMA (Non-Uniform Memory Access) memory nodes, shared caches, cores and multithreading, as well as I/O devices such as GPUs and InfiniBand HCAs. With the topology information, users can optimize their own programs to obtain the best performance. It should be noted that the optimizations of CPU-GPU communication and GPU-GPU communication are interdependent, as will be introduced next.
3.2.3.4.4 CPU-GPU Communication Optimization
To optimize the performance of memory access between CPU and GPU, the only way is to get them "closer". The word "close" here describes two devices that have a larger data transfer bandwidth between them. As mentioned above, the best strategy is to make the RAM, PU and GPU utilized by a process all lie in the same root. There are many ways to bind a process to a specified PU, including the process binding technique of HWLOC, the APIs of the NVIDIA Management Library (NVML) and the process binding techniques of the Linux system.
Many commercial and open source software packages provide memory binding. MVAPICH allows data allocation when "MPI_Init()" is called, and OpenMPI allocates it after calling "MPI_Init()". The problem is that the size of the memory space to allocate is unknown before the program calls "MPI_Init()" and reads the input files and grids, and we have to run the MPI initialization to get the process ID, so that the process can determine which grid files to read. What is more, to optimize the GPU-GPU communication, we should know the communication data load of each process; only then can we decide which GPU should be selected for acceleration.
After the GPU of a process is chosen, we can bind the process to the CPU that the selected GPU is attached to. Finally, we allocate memory and bind it to that CPU. The steps of the procedure are briefly outlined below (a minimal HWLOC sketch follows the list):
 Run the program first until each process gets its GPU ID. When the solver runs in parallel, each process gets its process ID and accordingly reads its input files. Each process calculates its communication data load according to the block connection information. Based on the communication data load and the topology of the GPUs of the local machine, the GPU-GPU communication optimization selects the optimal GPU for the process.
 Find the NUMA node (CPU) near the given GPU ID, and bind the current process or thread to it. Here we utilize the HWLOC software package to do this. Firstly, set the cpuset of the given GPU ID with the API function hwloc_cudart_get_device_cpuset(). Secondly, bind the current process or thread to this cpuset with hwloc_set_cpubind(). Finally, get the NUMA node of this cpuset with hwloc_get_obj_covering_cpuset().
 Allocate memory in RAM for the data of the process. If memory for the data has already been allocated, free it or migrate it to the CPU selected above. To allocate memory for data in the RAM attached to the selected CPU, the API function hwloc_alloc_membind() in HWLOC is utilized. This API function will allocate memory of a given size in the RAM of the local NUMA node, and return the address of this memory space. For memory that was allocated before the GPU ID was obtained, we can migrate it to the right NUMA node with the API function hwloc_set_area_membind(), given the address and the NUMA node ID.
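The sketch below strings the HWLOC calls named above together for a single process; it assumes hwloc was built with CUDA support (hwloc/cudart.h) and that the topology object has already been initialized and loaded, and it omits all error checking, so it is an outline of the steps rather than production code.
#include <hwloc.h>
#include <hwloc/cudart.h>

// Bind the calling process, and its data, close to the chosen GPU.
// 'topo' must already come from hwloc_topology_init() / hwloc_topology_load().
void bind_near_gpu(hwloc_topology_t topo, int gpu_id, size_t bytes, void **data)
{
    hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();

    // 1) cpuset of the cores closest to the selected GPU
    hwloc_cudart_get_device_cpuset(topo, gpu_id, cpuset);

    // 2) bind the current process to that cpuset
    hwloc_set_cpubind(topo, cpuset, HWLOC_CPUBIND_PROCESS);

    // 3) the NUMA node covering the cpuset (informational)
    hwloc_obj_t numa = hwloc_get_obj_covering_cpuset(topo, cpuset);
    (void)numa;

    // 4) allocate the process data bound to the memory near that cpuset;
    //    data allocated earlier would instead be migrated with
    //    hwloc_set_area_membind()
    *data = hwloc_alloc_membind(topo, bytes, cpuset, HWLOC_MEMBIND_BIND, 0);

    hwloc_bitmap_free(cpuset);
}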
3.2.3.4.5 GPU-GPU Communication Optimization
GPU-GPU communication is relatively easy to optimize, and it is efficient if the GPUDirect Peer-to-Peer (P2P) technology developed by NVIDIA is applied. Many NVIDIA GPUs support GPUDirect P2P, such as the Tesla series and Quadro series GPUs. The GPUDirect P2P technique includes P2P memory access and P2P transfer and synchronization, and it is an optimization technique for GPU-GPU communication within the same system. With it, buffers in the memories of two GPUs in the same computer can be copied directly through the PCIe lanes or NVLink. The efficiency of GPU-GPU communication using the GPUDirect P2P technique varies with the path through which the peers are connected. NVLink offers the maximum bandwidth, thus it is the most efficient path. NVIDIA has developed a new NVLink technology which enables the interconnection of any pair of GPUs in a single system that has multiple GPUs.
This strategy works well. However, it is a highly customized product which needs specific motherboards and only supports the Tesla P100 SXM and Tesla V100 SXM2. Many GPU servers do not support NVLink, and their GPUs are still installed in PCIe slots. Therefore, the GPU-GPU communication optimization is directed at those GPUs communicating through PCIe. To optimize GPU-GPU communication through PCIe, we need the topology of the local system and the sizes of the buffers to be exchanged among the processes located in this server. Then we can assign process groups with dense communication loads to GPU groups whose communication is more efficient, thus improving the global communication efficiency. The details of the optimization are introduced below (a short sketch of the underlying MPI and NVML queries follows the list):
 The solver runs in parallel, and each process reads its input file and gets the sizes of its communication buffers to other processes. Each process may communicate with zero, one, or multiple processes. Those processes may be distributed over one or multiple servers.
 Gather the processes located in the same server and let them know each other. The MPI software package offers APIs to achieve this. Firstly, a new communicator for the local server can be created with the API function MPI_Comm_split_type() with the split type "MPI_COMM_TYPE_SHARED"; for clarity we call it "MPI_SHARED_LOCAL" below. The size of the new communicator and the local rank (or ID) of the current global process in the new communicator can be obtained with the APIs MPI_Comm_size and MPI_Comm_rank, respectively. Secondly, get the groups of the global communicator "MPI_COMM_WORLD" and the local communicator "MPI_SHARED_LOCAL". Thirdly, gather the processes that the current process communicates with, and translate them into the local communicator. A global process ID that is not located in the local communicator will be marked as "MPI_UNDEFINED", so that we filter out the processes to communicate with in the local server.
 Each process in the local communicator shares its information with all other local processes. The information here is the sizes of the communication buffers to the target processes in the local communicator. Then each local process knows the local communication network and the buffer size of each connecting line. This procedure is easy to implement with the MPI API function MPI_Allgather().
 Each local process inquires about the topology of the GPU pairs in the local server. We can obtain the connection type of two GPUs by using the NVML API function nvmlDeviceGetTopologyCommonAncestor(device1, device2, &topo_type) and inspecting the returned variable topo_type. The connection type of two GPUs in the same system should be "NV", "PIX", "PXB", "PHB", or "SYS", listed in order of decreasing efficiency. We will not consider "NV", as mentioned above. Then we organize the GPUs into groups. Firstly, each CPU (NUMA node) gathers the GPUs that are attached to it; any two GPUs in this group communicate with each other with efficiency not worse than "PHB". Secondly, within each CPU, gather the GPUs in the branch in which any two GPUs communicate with each other with efficiency not worse than "PXB". Thirdly, within each "PXB", gather the GPUs that are on the same PCIe switch. GPUs on a single PCIe switch have connection type "PIX" with each other, and there may be multiple "PIX" groups attached to a CPU.
 Filter out the busy GPUs in the groups above, and keep the groups that contain idle GPUs only. The utilization of a GPU can be inquired with the NVML API function nvmlDeviceGetUtilizationRates().
 Partition the local processes according to the number of GPUs of each CPU. If the number of local processes is less than the number of GPUs of any CPU, choose the CPU that can hold all the local processes. If no single CPU can hold all the processes, divide the local processes into partitions according to the number of GPUs in each CPU, and make sure that these partitions have minimal total communication data with each other.
 In each CPU that gets a partition, divide the processes into partitions for the "PXB" groups in the fashion above. The "PIX" groups should be dealt with in a similar fashion too. With the procedure above, each process in the local server is bound to the GPU that makes it communicate efficiently with the GPUs attached to its target communicating processes. It is worth mentioning that, in a modern GPU cluster, servers usually interconnect with an InfiniBand network. With GPUDirect Remote Direct Memory Access (RDMA) technology, buffers can be sent directly from the GPU memory to a network adapter without staging through host memory. In this case, the network adapter should be included in the optimization above, and we can simply regard it as a communication target like a process.
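The two building blocks these steps rely on, a per-node communicator and the NVML topology and utilization queries, are sketched below; the grouping and partitioning heuristics themselves are omitted, and the function is only an illustration of the API calls named above.
#include <mpi.h>
#include <nvml.h>

// Per-node communicator plus NVML topology / utilization queries; the
// solver's actual grouping heuristics are omitted.
void query_local_topology(unsigned int gpu_a, unsigned int gpu_b)
{
    // All ranks sharing a node end up in 'local'
    // (the "MPI_SHARED_LOCAL" communicator of the text).
    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);

    nvmlInit();
    nvmlDevice_t dev_a, dev_b;
    nvmlDeviceGetHandleByIndex(gpu_a, &dev_a);
    nvmlDeviceGetHandleByIndex(gpu_b, &dev_b);

    // Level of the common ancestor of the two GPUs on the PCIe tree
    // (the PIX / PXB / PHB / SYS distinction reported by nvidia-smi).
    nvmlGpuTopologyLevel_t topo_type;
    nvmlDeviceGetTopologyCommonAncestor(dev_a, dev_b, &topo_type);

    // Current utilization, used to skip GPUs that are already busy.
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(dev_a, &util);

    nvmlShutdown();
    MPI_Comm_free(&local);
}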
3.2.3.4.6 Memory Utilization
How to utilize GPU memory is one of the key points in programming a GPU solver. Strategies of GPU memory utilization vary from code to code in different fields, according to the algorithms applied and the scale of the computation. In the early stage of GPU computing, GPUs were mainly designed for image and video processing, which usually involves simple algorithms like matrix operations. Performing simple algorithms on a single dataset does not occupy much GPU memory, hence GPUs of several years ago were equipped with very low memory capacity, usually not higher than 4 GB. For scientific computing involving large scale computation, memory occupation is a challenge even for the CPU, and it is hard to hold all the data in GPU memory. Due to the limitation of memory capacity, the GPU usually works as a local accelerator: most of the program's data stays in CPU memory, a part of the data is uploaded onto the GPU for computational acceleration, and the result is then downloaded back to the CPU. This strategy enables the solver to deal with much more data with the limited memory of the GPU. Data transfer between CPU and GPU through the PCIe lanes is time-consuming, and this cost must be compensated in GPU computing in order to obtain acceleration. However, with the performance of GPUs improving greatly, massive and repetitive data transfer gradually becomes the bottleneck of GPU computing. The bandwidth of PCIe has not increased significantly over the past few years, but the memory capacities of GPUs have increased a lot. In particular, some professional GPUs are designed for scientific computing and are equipped with memory up to 32 GB. Moreover, a current GPU server contains more GPUs, which equips a single server with powerful computational capacity. Hence, the strategy of memory utilization should be adjusted to adapt to current GPU computing. We introduce the details of the memory utilization of our solver in the following.
There are two basic and vital principles of memory utilization in our solver. On the one hand, allocate fixed, permanent memory space for the key datasets. On the other hand, avoid repeated memory allocation and deallocation by allocating a piece of permanent memory space that is reasonably large enough for potential public usage.
In the whole procedure of solving (Eq. (1) - see [Ye et al.]), we select several fundamental and dominant variables from all the variables. A variable is fundamental when it is used throughout the whole computation and many other variables can be obtained from it. The metrics (ξ, η, ζ)x,y,z and the Jacobian J are utilized all over the computation, and they are constants if the grid is stationary, so they are fundamental variables. The conservative flow variable Ũ in (Eq. (2) - see [Ye et al.]) is fundamental as well, because it plays a key role in the time advancement and in calculating the interpolated variables in the WENO interpolation; moreover, it is also the source for computing the primitive variables Qv = (u, v, w, p, T). For convenience in utilization, we do not include the Jacobian J in it, and only store Q = (ρ, ρu, ρv, ρw, e). Although Qv can be obtained from Q, it is used directly throughout the whole computation, so we sacrifice memory to avoid unnecessary computation time. For viscous flow computations, the molecular viscosity μ is unavoidable; it is used in computing the viscous flux and the local time step. In time advancement with the Runge-Kutta scheme, as shown in (Eq. (17) - see [Ye et al.]), Qn is a variable that should be stored. For time advancement with a local time step, the time step Δt at each grid point should also be stored.
R(Q) represents the right hand side term and contains the contributions of the derivatives of the inviscid flux and the viscous flux; it is used in the time advancement and in the residual computation. Thus R(Q) should be stored during the whole computation. In the following, we denote the size of the original grid of the current process by N, and the size of the grid containing ghost cells by M. According to the above analysis, we need to allocate memory for the fundamental and dominant variables with 32 * M elements.
During the computation of the derivatives of the inviscid and viscous fluxes, there are many temporary variables at each grid point. According to the numerical methods applied to the inviscid terms, we need memory space to store the physical flux fi and the numerical flux fi+1/2, as shown in ([Ye et al.] - Eq.(16)). The total size of the temporary variables required is 10 * M. In the viscous term, the derivatives of (u, v, w, T) with respect to (x, y, z) are widely used. Moreover, the viscous flux has to be stored according to our kernel design described in the next section. In total, we need to allocate memory space for temporary variables in the viscous term with a size of 17 * M. However, the inviscid and viscous terms are computed successively, so there is no need to allocate separate memory spaces for the temporary variables of each term. Thus we allocate one memory space of 17 * M elements, which is sufficient for the current public usage and for potential usage elsewhere.
In short, we need permanent memory space with a size of 49 * M elements. The variables mentioned above are all of floating-point type. Integer arrays are widely used in our solver as well, for example in block communication and physical boundary conditions, and we allocate public-usage memory for them in the same way. With this strategy our solver avoids unnecessary data transfers between CPU and GPU, uses GPU memory to the hilt, and increases the maximum grid size the solver can handle. The maximum grid capacities of our solver on different single GPUs are shown in Table 3.2.3; the data is stored in double-precision floating point. The result is satisfying for a single GPU. The computation speed at full load is also acceptable for general applications, because a much larger grid would lead to an unacceptably low computation speed. Therefore, the memory strategy described above is well matched with our programming and algorithm strategy.
Table 3.2.3 Maximum grid capacity of HiResX solver in different GPUs
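For illustration, the memory budget just described can be turned into a simple feasibility check before allocation. This is only a sketch under the stated assumptions (49 double-precision arrays of M elements, plus an integer pool whose size nIntWords is hypothetical); the constants are taken from the text, not from the solver source:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t M = 25000000;          // grid size including ghost cells (example value)
    const size_t nIntWords = 4 * M;     // hypothetical size of the integer pool

    // Permanent budget from the text: 49 double-precision arrays of M elements.
    const size_t needed = 49 * M * sizeof(double) + nIntWords * sizeof(int);

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("need %.2f GB, free %.2f GB of %.2f GB\n",
           needed / 1e9, freeBytes / 1e9, totalBytes / 1e9);
    return (needed <= freeBytes) ? 0 : 1;   // refuse to run if the grid cannot fit
}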
3.2.3.4.7 CUDA Kernels
The efficiency of GPU computing relies heavily on the algorithm and programming skills. To fully
exploit the parallel capability of GPU, one needs deep understanding of hardware of GPU and its
execution mechanism. The GPU architecture consists of a scalable array of Streaming
Multiprocessors (SM). Figure 3.2.13 shows the structure of GPU in software and hardware layers.
Each SM in a GPU is designed to allow concurrent execution of hundreds of threads, and generally
there are multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently
on a single GPU. In CUDA, threads are organized in groups called thread blocks. A thread block can only be scheduled on one SM and remains on that SM until its execution finishes. Hence, how to utilize threads is one of the primary considerations in kernel design. In the CUDA memory model, there are many types of memory that are programmable by the user, as shown in Figure 3.2.13. Global memory is the
largest, highest-latency, and most commonly used memory on a GPU. It is accessible to any SM
throughout the lifetime of the application.
Registers are the fastest memory space on a GPU and are partitioned among the active warps in an SM, so register variables are private to each thread. Registers hold automatic variables declared in a kernel without any other type qualifiers; arrays declared in a kernel whose indices can be determined at compile time are also stored in registers. Local memory is in fact a part of global memory; it is used to store variables in a kernel that are eligible for registers but cannot fit into the register space allocated for that kernel, including arrays referenced with indices that cannot be determined at compile time.
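The distinction between register and local-memory storage can be illustrated with a generic kernel (not taken from the solver): a small array indexed only with compile-time-known indices can stay in registers, while an array indexed by a runtime value is likely to be spilled to local memory.

// Generic illustration: 'stencil' can live in registers after unrolling, while
// 'table', indexed by a runtime value, is likely placed in local memory.
__global__ void register_vs_local(const double *in, double *out, int n, int runtimeIdx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    double stencil[5];                    // compile-time indices -> registers
    #pragma unroll
    for (int k = 0; k < 5; ++k)
        stencil[k] = in[tid] * (k + 1);

    double table[32];                     // runtime index below -> may spill to local memory
    for (int k = 0; k < 32; ++k)
        table[k] = in[tid] + k;

    out[tid] = stencil[2] + table[runtimeIdx & 31];
}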
Shared memory is a type of on-chip memory space that has much higher bandwidth and much lower latency than local or global memory. Registers and shared memory are scarce resources, and these limited resources impose a strict restriction on the number of active warps in an SM, thus affecting the parallel performance within an SM. Therefore, how to utilize these types of memory is another aspect that is vital to performance in kernel design.
Figure 3.2.13 Organization of CUDA threads for NVIDIA’s GPU and its mapping to hardware structure
In GPU computing, the CPU is called the host and the GPU is called the device. A kernel function is a piece of code that runs on the device, and the design of kernel functions is the core of GPU computing. In the current computation, most of the time in solving (see [Ye et al.] - Eq.(1)) is spent on the evaluation of the inviscid and viscous terms, which involve much more complex operations than the other parts of the computation. Therefore we give more details of them in the following.
3.2.3.4.8 Inviscid Fluxes
We consider a three-dimensional grid block with sizes (NI, NJ, NK) in the I, J, K directions. In the CPU source code, the derivatives of the convective fluxes with respect to the ξ, η, ζ directions are evaluated in sequence, and in each direction we perform the procedure of ([Ye et al.], Eq.(7) to Eq.(16)). Taking the derivative in the I direction as an example, on the JK face of the grid block, NJ * NK one-dimensional problems in the I direction are solved. For each pipeline in the I direction, conservative variables at the half-point (edge) locations (i + 1/2) are evaluated with fifth-order WENO interpolation, as shown in ([Ye et al.], Eq.(10) to Eq.(14)). To minimize oscillation, the interpolation is performed in characteristic space. There are two interpolated values, u+i+1/2 and u−i+1/2, at each half-point location, which are approximated with the values of the left-biased point stencil (i−2, i−1, i, i+1, i+2) and the right-biased point stencil (i−1, i, i+1, i+2, i+3), respectively. Then at each half point, the Steger-Warming flux is used to construct the numerical flux fi+1/2. The physical flux at each grid point can be simply obtained with (Eq.(3) - see [Ye et al.]). Finally, the derivative at each point can be approximated with ([Ye et al.], Eq.(16)). Derivatives in the other directions are computed in the same way, and the results are accumulated during the loop over directions to construct the term R(Q) in ([Ye et al.], Eq.(17)) for the Runge-Kutta integration.
Figure 3.2.14 Code Structure For Derivative of Inviscid Flux Computed With Original AFWENO
All of the above computation is done by a single process when no multi-threading technique such as OpenMP is applied. The code running on the GPU device should do all the work it does on the CPU. The structure of the GPU and its execution mechanism have been introduced earlier, and we note that tens of thousands of threads can be launched to complete a given computation.
To port a code onto the GPU, the first task is to decide how many threads to launch and how to arrange their work. In an early version of our GPU code, a thread was assigned to complete one 1D problem, so the number of threads launched equals the size of the sweeping face of the grid block, NJ * NK for the JK face. However, the acceleration performance was poor. The reason is that the number of active threads is severely restricted by the high register occupation of the kernel function. In addition, the number of threads launched is not large enough and is limited by the sweeping faces of the grid block, which usually vary with the sweeping direction.
To solve this problem, two improvements are introduced. The first is the use of atomic-level operation. Rather than assigning each thread to complete a 1D problem along a whole grid line, we assign each thread to complete the work at a single grid point; this is the so-called "atomic level operation". Thus the number of threads launched increases up to the total number of grid points, and it does not vary with the sweeping direction of the derivatives. The CUDA kernel is
therefore configured as
int blocksize = BLOCK_SIZE_INVISC;
int nblock = ((NI * NJ * NK - 1) / blocksize) + 1;
cuda_kernel <<< nblock, blocksize >>> (...);
Eq. 3.2.1
Generally, sizing workloads to fit within the boundaries of a warp, which is a group of 32 threads, leads to more efficient utilization of GPU compute resources. Thus the value of BLOCK_SIZE_INVISC should be evenly divisible by 32. An improper value that is too small or too large may reduce performance. In practical applications, 256 usually proves to give satisfying performance, but the best value should be determined by testing.
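A minimal sketch of this "atomic level operation" thread mapping is given below (the kernel name and body are hypothetical; only the launch shape follows Eq. 3.2.1): one thread handles exactly one grid point, and the last block is guarded because NI * NJ * NK is generally not a multiple of the block size.

#define BLOCK_SIZE_INVISC 256

// Hypothetical per-point kernel: one thread per grid point, independent of the sweep direction.
__global__ void point_kernel(const double *q, double *rhs, int NI, int NJ, int NK)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int nPoints = NI * NJ * NK;
    if (idx >= nPoints) return;                 // guard the partially filled last block

    // Recover (i, j, k) from the flat index when the stencil needs neighbours.
    int i = idx % NI;
    int j = (idx / NI) % NJ;
    int k = idx / (NI * NJ);
    (void)i; (void)j; (void)k;

    rhs[idx] += q[idx];                         // placeholder for the real per-point work
}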
The second improvement is the use of kernel decomposition. By decomposing one big kernel function into several small ones, the number of registers needed in each kernel decreases and the number of active threads increases. Figure 3.2.14 shows the pseudo-code for the computation of the derivatives of the convective flux. The procedure for evaluating the derivative is split into three parts, and each part is computed by its own kernel function. In the kernel cuda_afweno_split_flux(...), the numerical flux fi+1/2 is computed at each half point by one thread, and the result is written into the public memory space; after this kernel finishes, the numerical fluxes at all half-point locations in that direction are known. Then the second kernel function cuda_phys_flux(...) computes the physical flux fi at each point by one thread, again writing the result into a public array in global memory, so that at the end all physical fluxes are known. Finally, in the third kernel function cuda_cflux_df(...), the numerical and physical fluxes are read from global memory to compute the derivative at each point by one thread, and the result is stored in the array R(Q) in global memory. Theoretically, the second kernel could be merged with the first, but the merged kernel would occupy too many registers and the number of active warps would be strictly restricted by the limited register file. By separating the independent part out and computing it with a small kernel, more warps are active, which improves parallel performance.
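A host-side schematic of this three-kernel decomposition is sketched below; the kernel names follow the text, but the argument lists and empty bodies are placeholders rather than the solver's actual signatures:

// Placeholder kernel definitions; the real kernels perform the work described above.
__global__ void cuda_afweno_split_flux(const double *q, double *fnum, int n) { /* WENO + Steger-Warming split flux at half points */ }
__global__ void cuda_phys_flux(const double *q, double *fphys, int n)        { /* physical flux at every grid point */ }
__global__ void cuda_cflux_df(const double *fnum, const double *fphys, double *rhs, int n) { /* derivative, accumulated into R(Q) */ }

// One sweep direction: three small kernels instead of one register-heavy kernel.
void inviscid_derivative_sweep(const double *q, double *fnum, double *fphys,
                               double *rhs, int NI, int NJ, int NK, int blocksize)
{
    int n = NI * NJ * NK;
    int nblock = ((n - 1) / blocksize) + 1;

    cuda_afweno_split_flux<<<nblock, blocksize>>>(q, fnum, n);
    cuda_phys_flux<<<nblock, blocksize>>>(q, fphys, n);
    cuda_cflux_df<<<nblock, blocksize>>>(fnum, fphys, rhs, n);
}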
However, we do not divide the kernel cuda_afweno_split_flux(...) into two parts that calculate the interpolation of the conservative variables and the numerical fluxes separately, as is done in [6], because delivering data from one kernel to another involves access to global memory, which brings high latency. Moreover, these two parts are themselves complex, so the kernels would still occupy too many registers and the number of active threads would not increase appreciably.
Our tests show that the time saved by increasing the number of active threads does not compensate the latency of the additional global-memory accesses. It is worth mentioning that the metrics and Jacobian at the half points, which are needed to compute the numerical flux, are neither stored in global memory nor transferred from the CPU. If we saved them in global memory permanently, more memory would be required for the solver to handle a given grid size. Although the public memory space is large enough to hold them, we do not transfer them from the CPU either, because our tests suggest that recomputing them is much faster than transferring them. Thus they are interpolated from the point values directly when they are needed.
We have seen that on a GPU, as the processor-to-memory latency increases, the capacity of that memory increases. Global memory is the largest and highest-latency memory space on a GPU, so frequent access to global memory increases the execution time of a kernel. Therefore, reducing access to global memory offers an opportunity for optimization during programming and numerical algorithm design. In the current work, we also propose an improved WENO interpolation, which reduces both global-memory access and computational cost.
Figure 3.2.15 shows schematic diagrams of the original and the improved WENO interpolation. In the original method, computing the numerical flux at i + 1/2 requires the point stencil (i−2, i−1, i, i+1, i+2, i+3) and two complete WENO interpolations following (Eq.(10) to Eq.(14) - see [Ye et al.]). The two WENO interpolations share the same characteristic space located at i + 1/2. In the CUDA kernel function cuda_afweno_split_flux(...) mentioned above, for the Steger-Warming flux splitting method, f+i+1/2 and f−i+1/2 are obtained simultaneously in the same thread, so the numerical flux is obtained immediately as fi+1/2 = f+i+1/2 + f−i+1/2. In our improved method, we build the characteristic space at i and use the point stencil (i−2, i−1, i, i+1, i+2) to perform the WENO interpolation for u+i−1/2 and u−i+1/2. We do not need to perform two complete, independent WENO interpolations, because the characteristic variables and the smoothness indicators are shared by the two WENO interpolations. Since we cannot obtain f+i+1/2 and f−i+1/2 in the same thread, we compute the numerical flux fi+1/2 with a new kernel function cuda_sum_split_flux(...). The procedure is shown in Figure 3.2.16.
Figure 3.2.15 Comparison of WENO interpolations. In the original WENO interpolation, the characteristic space is built at i + 1/2, and u±i+1/2 are computed simultaneously in a loop. In the modified WENO interpolation, the characteristic space is built at i, and u−i+1/2 and u+i−1/2 are computed simultaneously in a loop; the point stencils are the same, so the characteristic variables and smoothness indicators are shared, which reduces calculation and global memory access.
3.2.3.4.9 Viscous Fluxes
Evaluating the viscous terms is slightly easier than the convective terms. According to (see [Ye et al.] - Eq.(4) and (5)), we have to compute (u, v, w, T)x,y,z before computing the viscous fluxes. Here we approximate the derivative of the viscous flux from fluxes defined at the point locations, so (u, v, w, T)x,y,z need to be computed only once and are shared by the derivatives in all directions. Computing (u, v, w, T)x,y,z is easy with the chain rule: in a general curvilinear coordinate system, derivatives with respect to (x, y, z) are obtained as
∂/∂x = ξx ∂/∂ξ + ηx ∂/∂η + ζx ∂/∂ζ
∂/∂y = ξy ∂/∂ξ + ηy ∂/∂η + ζy ∂/∂ζ
∂/∂z = ξz ∂/∂ξ + ηz ∂/∂η + ζz ∂/∂ζ
Eq. 3.2.2
Figure 3.2.16 Code structure of derivative of inviscid flux computed with modified AFWENO
(u, v, w, T)ξ,η,ζ can be approximated directly with a sixth-order central scheme. In the CPU code, the computation strategy for the viscous term is similar to that of the inviscid term: after (u, v, w, T)x,y,z are computed, the viscous flux and its derivative are obtained in each direction.
We port the code to the GPU in a similar way to the convective term. Figure 3.2.17 shows the procedure for evaluating the viscous term. There are two parts, and each part involves a loop over directions. In the first part, the contribution of (u, v, w, T)x,y,z from each direction is computed, accumulated, and stored in a public array in global memory. For example, in the ξ direction, ξx * ∂(u, v, w, T)/∂ξ, ξy * ∂(u, v, w, T)/∂ξ and ξz * ∂(u, v, w, T)/∂ξ are computed. This part is completed by a single kernel function, cuda_vars_derv_xyz(...). We have two thread-scheduling strategies for this kernel. In the first strategy, the kernel computes the contribution in each direction for the four variables (u, v, w, T) at one point per thread, so the CUDA kernel cuda_vars_derv_xyz(...) is configured as
int blocksize = BLOCK_SIZE_VISC;
int nblock = ((NI * NJ * NK - 1) / blocksize) + 1;
cuda_vars_derv_xyz <<< nblock, blocksize >>> (...);
In the second strategy, the kernel computes the contribution in each direction for only one variable at one point per thread, so the number of threads launched is four times that of the first strategy, and the CUDA kernel cuda_vars_derv_xyz(...) is configured as
int blocksize = BLOCK_SIZE_VISC;
int nblock = ((NI * NJ * NK * 4 - 1) / blocksize) + 1;
cuda_vars_derv_xyz <<< nblock, blocksize >>> (...);
Each thread must determine the indexes of the grid point and of the variable from its thread ID. The grid-point indexes are needed because a one-sided scheme is used for the derivatives near the boundaries of the grid block; in the convective term we did not need to treat derivatives near block boundaries specially. In the second part, two kernels are applied, similar to those of the convective terms. In the first kernel, the viscous flux at each grid point is computed by one thread. After the first kernel completes, the second kernel cuda_vflux_df(...) computes the derivative at each grid point by one thread, and the result is accumulated into R(Q), which already includes the contribution of the convective terms.
Figure 3.2.17 Code structure of the derivative of viscous flux summarized by a pseudo code
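For the second strategy, each thread has to recover both the grid-point index and the variable index from its flat thread ID. A minimal sketch of that decomposition is shown below (the kernel body is hypothetical; only the launch shape of NI * NJ * NK * 4 threads follows the configuration above):

// Illustrative index decomposition: NI*NJ*NK*4 threads, each handling one of
// the four variables (u, v, w, T) at one grid point.
__global__ void vars_derv_xyz_strategy2(double *derv, int NI, int NJ, int NK)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nPoints = NI * NJ * NK;
    if (tid >= 4 * nPoints) return;

    int var   = tid / nPoints;        // 0..3 -> u, v, w, T
    int point = tid % nPoints;        // flat grid-point index
    int i = point % NI;               // (i, j, k) are needed to select one-sided
    int j = (point / NI) % NJ;        // stencils near the block boundaries
    int k = point / (NI * NJ);
    (void)var; (void)i; (void)j; (void)k;

    derv[tid] = 0.0;                  // placeholder for the real derivative accumulation
}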
3.2.3.4.10 Kernels in Other Sections of the Solver
We have introduced the kernels for the convective and viscous terms, which account for the majority of the total computation and are performed entirely on the GPU. In order to perform the whole computation on the GPU, the other parts of the solver have to be ported onto the GPU as well. Fortunately, these parts involve only simple algorithms and operations that are independent at each grid point, so we apply the "atomic level operation" when designing their kernels. For the computation of the residual, in which a summation over all inner grid points is necessary, we use a "reduced-dimensional operation" rather than CUDA's atomic operations in order to get better efficiency. By "reduced-dimensional operation" we mean that, for a 3D grid block, CUDA threads are mapped onto the grid points of the largest face of the block, and each thread performs the specified operation, such as accumulation, over the points of its grid line. When this "3D → 2D" step is complete, the "2D → 1D" and "1D → single value" steps are performed in a similar fashion. The "reduced-dimensional operation" is also used to find the minimum time step for unsteady simulations controlled by the CFL number.
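A sketch of the "3D → 2D" step of this reduced-dimensional operation is shown below (a generic kernel, not the solver's implementation): one thread is mapped to each (j, k) point of the largest face and accumulates along its i grid line; the "2D → 1D" and "1D → single value" steps follow the same pattern.

// 3D -> 2D reduction: one thread per (j, k) point of the JK face, summing along i.
__global__ void reduce_3d_to_2d(const double *res, double *plane, int NI, int NJ, int NK)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= NJ * NK) return;

    int j = tid % NJ;
    int k = tid / NJ;

    double sum = 0.0;
    for (int i = 0; i < NI; ++i)                  // march along the grid line
        sum += res[(k * NJ + j) * NI + i];        // assumes i-fastest storage (illustrative)

    plane[tid] = sum;                             // one partial result per (j, k)
}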
3.2.3.5 Performance Result
3.2.3.5.1 Speedup Varies with Grid Size
Figure 3.2.18 Global performance varying with grid sizes. The speedup is defined as the ratio of the elapsed time of running on one CPU core to the elapsed time of running with a GPU
In this section we present the performance of HiResX running on a single GPU with different grid sizes, and we compare performance on different GPUs as well. Three GPUs are employed for the test: RTX 2080 Ti, Tesla P100 and Titan V; their specs and characteristics were introduced before. The maximum grid capacity of the solver on these GPUs is also listed in Table 3.2.3. Since the memory capacity of the RTX 2080 Ti is the smallest, the maximum grid size tested here is bounded to 25 M. The CPU in the test is an Intel Xeon E5-2680 v3. For CPU computing, the solver runs on a
single CPU core and all computation is performed on CPU. For GPU computing, because GPU code
cannot run independently of CPU, the present solver uses one CPU core and one GPU device, but
nearly all the computation is performed on GPU device. The speedup here is defined as the ratio of
time of CPU computing to the time of GPU computing for the same case. Global speedups of the optimized GPU version of the code compared to the serial CPU version, for grid sizes ranging from one million to 25 million, are shown in Figure 3.2.18. We obtain significant speedups on all GPUs for all grid sizes. Generally, the speedups of all GPUs increase with the grid size when the grid size is smaller than 15 million, but beyond that the RTX 2080 Ti and the Tesla P100 gradually reach their limits. The maximum speedups for the RTX 2080 Ti and the Tesla P100 are thus about 640 and 1500, respectively. The Titan V has not shown a drop as the grid size increases.
Due to the limitations of memory capacity and grid capacity, the maximum speedup of the Titan V is predicted to be about 1950. The trends of the actual measured performances match the ideal DP performances of these three GPUs, although the magnitudes deviate considerably from the ideal DP ratios. According to the specs of these GPUs, the memory bandwidths of the RTX 2080 Ti and the Titan V are close, but the measured performance of the Titan V is nearly three times that of the RTX 2080 Ti, which benefits from more CUDA cores and the higher ratio of double-precision to single-precision units (1:2) in the Titan V. The Tesla P100 has fewer CUDA cores than the RTX 2080 Ti but a higher ratio of double-precision to single-precision units, 1:2 for the Tesla P100 versus 1:32 for the RTX 2080 Ti, resulting in more double-precision units. Hence the RTX 2080 Ti is mainly bound by computation rather than by memory bandwidth.
According to NVIDIA's performance analysis tools, the large deviation from the ideal performance for these three GPUs comes from the relatively low occupancy in the convective and viscous flux kernels, which account for the majority of the global computation. The low occupancy is caused by the limited number of registers. If there are not enough active threads, the double-precision capability becomes excessive, which causes the large deviation from the ideal performance. Although we have decomposed the big kernels into smaller ones and cut down the number of registers per kernel to increase occupancy and the number of active threads, the high latency introduced by relying on other types of memory (e.g. global memory) cancels the accelerating effect of higher occupancy. There is thus a balance between occupancy and memory efficiency.
3.2.3.5.2 Performance of Kernels
Global performance relies on the performance of the kernels in each part of the solver. Figure 3.2.19 shows the performance of the main parts in which all the computation is located. As shown in the figure, the parts with massive computation, such as the evaluation of the inviscid term, viscous term and time step, and the time advancement and computation of the primitive variables, obtain clear acceleration; in these parts there is a lot of complicated calculation at each grid point. Although the residual evaluation is also performed at each point, its kernel is not designed to run with the "atomic operation" and contains no complicated calculation, so the performance of the residual evaluation is relatively low.
For the Runge-Kutta integration in the time advancement, there is only a little calculation, but it is performed at each grid point independently, so the "atomic operation" strategy brings clear acceleration. The speedup of the Titan V is double that of the RTX 2080 Ti, because the Titan V has more CUDA cores and higher bandwidth. For the convective terms, in which a lot of complicated calculation is involved, the speedup of the Titan V is relatively high, but the speedup of the RTX 2080 Ti is only one fifth of it, because the RTX 2080 Ti has fewer double-precision units and fewer CUDA cores than the Titan V. For the viscous terms, whose computation is second only to the convective terms, the accelerating effect is the best. It is worth mentioning that the gap between the speedups of the two GPUs is much smaller here than for the convective terms; this is the effect of the kernel decomposition used to increase the occupancy and the number of active threads. In each kernel of the viscous term the amount of calculation is moderate, so for the RTX 2080 Ti the number of double-precision units is not that scarce, while for the Titan V it may be excessive. The scenario is similar for the time step.
3.2.3.5.3 Performance of Running with Multiple GPUs
In this section, the performance of our solver on multiple GPUs is investigated. This test aims to show the scalability of our solver when it runs on multiple GPUs. For a traditional CPU code running in parallel, the parallel efficiency is affected by the communication among processes and their computational loads. In early GPU computing codes, besides the communication on the CPU side and the computational load, the data transfer between CPU and GPU affects the parallel efficiency as well. For modern GPU computing, the communication between GPUs is an additional factor.
The grid size of all cases in this test is fixed to 25 M (M = million) due to the limited memory capacity of the RTX 2080 Ti. A single block with 25 million grid points is divided nearly equally into several sub-blocks, and each GPU computes one sub-block. Each GPU is driven by one CPU core, so for the multi-GPU cases the solver actually runs in parallel with MPI.
Figure 3.2.19 Performance of main parts of the solver. Generally, the parts that contain a large amount of computation get clear acceleration. In the time advancement and viscous flux sections, due to the high occupancies of the kernels, both GPUs get higher speedups. For the evaluation of the time step and the inviscid flux, the Titan V gets higher speedups than the RTX 2080 Ti, because the Titan V has more double-precision processing units and hence higher double-precision performance.
As mentioned, for GPU computing nearly all the computation is performed on the GPU, so the contribution of the CPU to the global computation is negligible. The elapsed times of the solver running with 1 to 10 GPUs are shown in Figure 3.2.20. The scalability gradually deviates from the linear value as the number of GPUs increases, but the scaling efficiency is still larger than 75% when 10 GPUs are used. We observe that the scalability of the Titan V drops faster than that of the RTX 2080 Ti. However, it is not the parallel efficiency on the CPU side that causes the drop, but the fact that when the grid size per GPU decreases, the speedup of the GPU drops as well. As illustrated in Figure 3.2.18, the variation of the speedup of the Titan V is greater than that of the RTX 2080 Ti. Performance varies with the computational load, which is a remarkable feature of GPUs that differs from CPUs. Therefore, compared with a CPU, the scalability of a GPU is slightly lower.
Figure 3.2.20 Strong scaling results of different GPUs. Both GPUs' scalabilities deviate from the linear result,
but the efficiencies are all larger than 75% when 10 GPUs are utilized. The drops of scalability come from
the fact that when grid size decreases, the speedup drops too, see Figure 3.2.18. Scalability of Titan V
drops faster than RTX 2080 Ti’s, because the speedup of Titan V drops faster than RTX 2080 Ti’s when
the grid size decreases.
To investigate how the topology of the GPUs in the server influences the GPU communication efficiency, a computation model with 4 blocks is assigned to 4 GPUs in the server. The computation model is shown in Figure 3.2.21 (a); the amount of data exchanged between blocks in the horizontal direction is 1/8 of that exchanged between blocks in the vertical direction. The topology of the GPUs in our test server is similar to type (c) in Figure 3.2.7.
There are 10 Titan V GPUs in our server, and each PCIe switch has 5 GPUs mounted on it. Hence there are two "PIX" GPU groups, GPUs 0-4 and 5-9, and GPUs across the two "PIX" groups are in a "PHB" relation. As introduced before, GPUs in a "PIX" relation communicate faster than GPUs in a "PHB" relation, so placing processes that exchange large amounts of data with each other on GPUs within the same "PIX" group yields higher communication efficiency.
For better comparison, five cases are considered, as shown in Figure 3.2.21 (b). For all cases, blocks with the same color are in the same PCIe switch. Cases I and II test the communication performance of the GPUs within each PCIe switch. In case III, the processes with the largest data exchange are not placed on GPUs in the same PCIe switch. In case IV, the processes with the largest data exchange are placed on GPUs in the same PCIe switch, and this case is expected to be more efficient than case III. Case V is designed to be the worst situation. We only run this test on the server with Titan V GPUs, because the RTX 2080 Ti does not support P2P communication without NVLink, while the Titan V supports P2P over PCIe.
The results are shown in Figure 3.2.22. As expected, the cases running within a single PCIe switch are the best. Case IV also achieves performance comparable to cases I and II thanks to the reasonable distribution of the communication load. Case V gets the worst performance, because each GPU communicates only with GPUs located in the other PCIe switch, which is the worst distribution of the communication load. The results indicate that GPU-to-GPU communication optimization is crucial for programs running on current GPU servers in order to fully exploit their performance.
Figure 3.2.21 Case configuration. In figure (a), there are 100 points in the spanwise direction; the data exchange in the vertical direction (red) is larger than the data exchange in the horizontal direction (blue). In figure (b), five cases are set up to test the communication performance of different strategies for assigning the computations to different GPUs. The number in each block denotes the GPU on which the computation is performed, and blocks with the same color are in the same PCIe switch. In cases I and II, the computations are performed on GPUs in one of the two PCIe switches, respectively. In cases III and IV, the computations are performed on GPUs belonging to both PCIe switches: in case III the largest data transfers are performed across the PCIe switches, while in case IV they are performed within the same PCIe switch. In case V, all data transfers are performed across the PCIe switches.
3.2.3.5.4 Case Study
To further investigate the performance of our solver in a practical application, a planar supersonic jet is simulated with implicit large eddy simulation (ILES). The configuration is similar to Ref. [16]: we simulate an under-expanded planar jet with a pressure ratio pe/p∞ = 2.09. The computational domain is discretized by a Cartesian grid with a total of 67.7 M (M = million) points. The physical domain, which excludes the sponge zone, has dimensions 64h × 30h × 5h, with a nozzle extending over 0.6h inside the domain. The jet height is h = 3.5 mm, corresponding to a Reynolds number, based on the jet height and the acoustic speed, of Re = 8.15 × 104. The velocity profile inside the nozzle is a laminar Blasius solution with a boundary-layer thickness of δ = 0.05h; there are 12 points within the boundary layer and 70 points across the jet height. A constant time step Δt = 2.5 × 10−3 is used.
The non-dimensional time of the whole computation is 360, and only the last 210 is output for analysis. Three Titan V GPUs are used for the computation, and the execution time of the simulation is only about 41 hours.
Figure 3.2.23 shows the shock-cell spacing Ls/h as a function of the fully expanded jet Mach number Mj; various experimental results [18, 19], the LES data of Berland et al. [16] and theoretical data [17] for rectangular jets are plotted for comparison. The present result is clearly closer to the experimental results than that of Berland et al. In Figure 3.2.24, the Strouhal number of the fundamental discrete frequency is plotted against the fully expanded jet Mach number for several experiments on rectangular jets, with the experimental data of Panda [18] and the LES data of Berland et al. [16] shown for comparison. Our result is in good agreement with the experiments. Figure 3.2.25 illustrates the flow structure of the jet and its generated acoustic field, presented by the spanwise vorticity ωz and the dilatation in the plane z/h = 2.5. These results prove the reliability of our solver, and the short execution time demonstrates the high efficiency of our solver running on GPUs.
Figure 3.2.22 Performance of HiResX running on different GPU topologies. Cases in which all GPUs are in the same PCIe switch get the best communication efficiency, see cases I and II. For cases in which GPUs are located in different PCIe switches, if the GPU devices are optimally assigned according to the communication load, the communication efficiency can also be improved, see case IV. Without GPU-to-GPU optimization, the communication efficiency is lower, see cases III and V.
Figure 3.2.23 Shock-cell spacing Ls/h as a function of the fully expanded jet Mach number Mj. Our result is closer to the theoretical [17] and experimental results [18, 19] than the LES results of Berland et al. [16].
Figure 3.2.24 Strouhal number St = fh/Uj of the fundamental screech tone as a function of the fully expanded jet Mach number Mj. Our result is in good agreement with the experimental result of Ref. [18], and better than the LES result of Ref. [16].
Figure 3.2.25 Instantaneous snapshot of the spanwise vorticity ωz and the dilatation in the plane z/h = 2.5 as the background. The isosurface of vorticity is colored by the velocity magnitude.
3.2.3.6 Conclusions
Heterogeneous computing is changing the way we do scientific computing. The boom of GPU computing in the past several years shows the power and potential of GPUs, and it attracts more and more researchers to exploit its application in their fields, including CFD simulation. In the early years of GPU computing, many attempts were made to port CFD codes onto GPUs, but the results were not as satisfying as expected. With the development of GPU computing, more powerful GPUs and related technologies have appeared, which prompt us to further exploit its application in CFD.
In this work we analyze the characteristics of the architectures of current GPU servers and propose a set of techniques to improve the efficiency of data transfer between CPU and GPU and of communication between GPUs. An in-house compressible flow solver based on a high-order finite difference method on curvilinear coordinates is successfully ported to GPU with CUDA C. By careful memory planning, we avoid unnecessary data transfers between CPU and GPU without significantly sacrificing the grid capacity of our solver.
The "atomic operation" technique and the kernel decomposition technique are applied to design high-efficiency kernels. A modified AFWENO scheme is proposed, which saves computation and reduces memory access, and works better on GPU than the original WENO. Tests show that our solver achieves maximum speedups of almost 2000x on a Titan V, 1500x on a Tesla P100 and 650x on an RTX 2080 Ti, compared with one CPU core of an E5-2680 v3. The hardware-aware techniques prove to be more efficient than the unoptimized scenarios. A test case of a supersonic jet demonstrates the practical applicability of the present solver. This work provides a systematic and efficient solution for applying GPU computing to CFD simulation with certain high-order finite difference methods on current heterogeneous GPU computers.
3.2.3.7 References
[1] D. R. Chapman, Computational aerodynamics development and outlook, AIAA Journal 17 (1979).
[2] H. Choi, P. Moin, Grid-point requirements for large eddy simulation: Chapman's estimates revisited, Physics of Fluids 24 (2012) 011702.
[3] B. Tutkun, F. O. Edis, A GPU application for high-order compact finite difference scheme, Computers & Fluids 55 (2012) 29-35.
[4] V. Esfahanian, H. M. Darian, S. I. Gohari, Assessment of WENO schemes for numerical simulation of some hyperbolic equations using GPU, Computers & Fluids 80 (2013) 260-268.
[5] K. I. Karantasis, E. D. Polychronopoulos, J. A. Ekaterinaris, High order accurate simulation of compressible flows on GPU clusters over software distributed shared memory, Computers & Fluids 93 (2014) 18-29.
[6] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang, Z. Wang, et al., Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the Tianhe-1A supercomputer, Journal of Computational Physics 278 (2014) 275-297.
[7] J. Lai, Z. Tian, H. Li, S. Pan, A CFD heterogeneous parallel solver based on collaborating CPU and GPU, in: IOP Conference Series: Materials Science and Engineering, volume 326, IOP Publishing, 2018.
[8] J. Lai, H. Li, Z. Tian, Y. Zhang, A multi-GPU parallel algorithm in hypersonic flow computations, Mathematical Problems in Engineering 2019 (2019).
[9] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, Journal of Computational Physics 227 (2008) 10148-10161.
[10] J. Lei, D.-l. Li, Y.-l. Zhou, W. Liu, Optimization and acceleration of flow simulations for CFD on CPU/GPU architecture, Journal of the Brazilian Society of Mechanical Sciences and Engineering (2019).
[11] J. Crabill, F. D. Witherden, A. Jameson, A parallel direct cut algorithm for high-order overset methods with application to a spinning golf ball, Journal of Computational Physics 374 (2018).
[12] F. D. Witherden, A. M. Farrington, P. E. Vincent, PyFR: An open source framework for solving advection-diffusion type problems on streaming architectures using the flux reconstruction approach, Computer Physics Communications 185 (2014) 3028-3040.
[13] Y. Jiang, C.-W. Shu, M. Zhang, An alternative formulation of finite difference weighted ENO schemes with Lax-Wendroff time discretization for conservation laws, SIAM Journal on Scientific Computing 35 (2013) A1137-A1160.
[14] G.-S. Jiang, C.-W. Shu, Efficient implementation of weighted ENO schemes, Journal of Computational Physics 126 (1996) 202-228.
[15] G. Patnaik, A. Corrigan, K. Obenschain, D. Schwer, D. Fyfe, Efficient Utilization of a CPU-GPU Cluster, American Institute of Aeronautics and Astronautics, 2012. URL: https://arc.aiaa.org/doi/abs/10.2514/6.2012-563. doi:10.2514/6.2012-563.
[16] J. Berland, C. Bogey, C. Bailly, Numerical study of screech generation in a planar supersonic jet, Physics of Fluids 19 (2007) 075105.
[17] C. Tam, The shock-cell structures and screech tone frequencies of rectangular and non-axisymmetric supersonic jets, Journal of Sound and Vibration 121 (1988) 135-147.
[18] J. Panda, G. Raman, K. Zaman, Under expanded screeching jets from circular, rectangular and elliptic nozzles, in: 3rd AIAA/CEAS Aeroacoustics Conference, 1997.
[19] G. Raman, E. J. Rice, Instability modes excited by natural screech tones in a supersonic rectangular jet, Physics of Fluids 6 (1994) 3999-4008.
3.2.4 Case Study 4 - A Matrix-free GMRES Algorithm on GPU Clusters for Implicit Large Eddy
Simulation
Authors : Eduardo Jourdan and Z. J. Wang
Affiliations : Department of Aerospace Engineering, University of Kansas, Lawrence, KS, 66045, USA
Citation : Eduardo Jourdan de Araujo Jorge Filho and Zhi J. Wang. "A Matrix-free GMRES Algorithm on
GPU Clusters for Implicit Large Eddy Simulation," AIAA 2021-1837. AIAA Scitech 2021 Forum. January
2021.
This paper investigates the performance of the matrix-free GMRES algorithm without a
preconditioner on GPU clusters for large eddy simulations (LES) with the high-order FR/CPR method.
This implicit solution algorithm does not need to store the Jacobian matrix, which is often too large
to store on GPUs. In addition to the residual operator, it is important to have an efficient global
reduction operation. The present work considers various solution approximation orders on mixed
unstructured meshes with four different element types: hexahedrons, prisms, pyramids and
tetrahedrons. Issues related to their difference in speedup are discussed and compared with the
explicit three-stage Runge-Kutta scheme on a GPU cluster. A strong scalability study is conducted on
the Summit supercomputer with up to 1200 NVIDIA V100 GPUs. A LES benchmark case is used to
evaluate different parameters and the convergence tolerance of the GMRES solver. Preliminary
performance results are also described for a test case using the High Lift Common Research Model
from the 4th AIAA High-Lift Prediction Workshop. In this case, a reduction of the simulation cost by a
factor of 15 to 35 is obtained compared to an explicit scheme on a GPU cluster.
3.2.4.1 Introduction
A series of International Workshops on High-Order CFD Methods have been organized since 2012 [1]
where benchmark vortex-dominated turbulent flow problems have been proposed and simulated
using high-order methods. One major conclusion from them is that high-order methods are more
efficient than low order ones for scale-resolving simulations such as large eddy simulation (LES) to
achieve the same accuracy. Moreover, NASA’s 2030 CFD vision states that LES will be increasingly
utilized in order to handle separated flow problems such as flow over high lift configurations and
through aircraft engines [2].
The use of high-order methods such as discontinuous Galerkin (DG) [3] and flux reconstruction (FR)
[4] for LES has grown significantly in the last decade [510] because of the capability of these
methods in handling complex geometries with unstructured meshes. Another advantage is the fact
that they achieve high-order accuracy by increasing the number of degrees of freedom (DOFs) inside
each element, remaining a compact scheme and hence improving its parallel performance. However,
some challenges still remain, such as the need to further reduce the cost and turnaround time of these
simulations.
At the same time, modern hardware is shifting towards the use of highly parallel accelerators, like
for instance NVIDIA’s Graphical Processing Units (GPUs). Several world flagship supercomputers
such as Summit (Oak Ridge National Laboratory), Sierra (Lawrence Livermore National Laboratory)
and Piz Daint (Swiss National Supercomputing Centre) are equipped with GPU-powered nodes. GPUs
are capable of achieving a much higher theoretical peak in floating-point operations per second
(FLOPs) by using a different parallelism strategy, executing concurrently the same instruction for a
large set of data. As of October 2020, 6 of the top 10 super-computers on the Top500 list [11] have a
node configuration with GPUs. Most recent GPU developments include the launch of the Nvidia A100
and AMD MI100 cards with increasingly higher memory bandwidth, memory resources and
computational power.
Typical challenges in the use of GPUs for high-order CFD simulations involve [10, 12-20] the different parallel architecture, data-communication and scalability issues, memory access patterns and memory bandwidth limitations.
Nevertheless, high-order methods such as DG and FR are compact with intensive local operations,
with the DOFs inside an element increasing rapidly with the order of accuracy. This leads to
operations like polynomial interpolations and matrix vector multiplications being faster and more
efficient for this particular hardware architecture, making these methods a suitable choice to be used
on GPUs.
On the other hand, higher-order spatial operators are much stiffer than low order ones [21]. As a
result, the stability limit of explicit schemes is more severe for high-order methods, and matrix-based
implicit methods take much more memory than lower order ones [21]. The different memory
architecture of the GPU creates additional difficulties for the use of implicit time solvers with high-
order methods [18]. Implicit solvers also tend to require more communication and may have memory
bound algorithms like global reduction operations.
A typical implicit solver used in CFD is the Newton-Krylov method, in which Newton-type methods
for the nonlinear system are coupled with Krylov subspace methods for linear system solving [22].
Reference [18] shows some of the difficulties in accelerating implicit solvers like the GMRES method
[23] with matrix-based preconditioners like the inherently sequential incomplete lower-upper
factorization (ILU). The speedups between one Nvidia K40 GPU and one CPU core for explicit schemes are on the order of 70, while they are around 5 for a GMRES+ILU method [18]. Some additional
works with acceleration techniques include Ref. [14] with a preconditioned conjugate gradient solver
for the Poisson equation and Ref. [10] with a p-multigrid method and explicit smoothers.
In this context, the objective of the current paper is to investigate the use of a matrix-free implementation of the GMRES method on multi-GPU clusters for unsteady simulations. The matrix-free GMRES method [22] has a much smaller memory footprint; besides the residual operator, it needs an efficient global reduction algorithm and inner product calculation. Since the focus of the current work is on unsteady simulations, no preconditioner is used: because a small time step is needed to capture the turbulent dynamics, the resulting system of equations has a better condition number than in steady problems.
First, the speedup between CPU and GPU implementations of the matrix-free GMRES method is
measured for several orders of accuracy and cell types. Different parts of the algorithm such as the
residual operator and the Gram-Schmidt orthogonalization process and their performance are
investigated. Strong scalability studies from p1 to p5 are done with up to 200 nodes and 1200 GPUs
on the Summit supercomputer, which is the number two computer in the TOP500 list of June 2020
[11].
The LES benchmark T106C turbine case is used to run parametric tests involving the nonlinear
unsteady residual convergence tolerance, the GMRES linear convergence tolerance and the Krylov
subspace size. The simulation accuracy and total turnaround time are compared between a multi-GPU
GMRES implementation, an explicit RK3 time scheme run on multi-GPUs and an implicit LU-SGS
solver on a CPU cluster. Finally, the GPU GMRES algorithm performance is tested on a mixed mesh of
the high-lift configuration version of the Common Research Model intended to be studied on the 4th
AIAA CFD High Lift Prediction Workshop.
The paper is organized as follows. Section 3.2.4.2 presents the numerical formulation of the high-
order FR/CPR method and section 3.2.4.3 describes the matrix-free GMRES algorithm. Finally, Sec.
3.2.4.4 presents the numerical results and Sec. 3.2.4.5 draws some concluding remarks.
3.2.4.2 A Brief Overview of the FR/CPR Method
The FR/CPR method was originally developed by Huynh [4] in 2007 for hyperbolic partial differential
equations, and later it was extended to hybrid unstructured meshes [24]. Further developments on
the FR/CPR method are reviewed in [25,26]. This method belongs to discontinuous finite element
methods, similar to the DG method, but also has some unique advantages. For example, FR contains
a larger family of schemes [4], which may allow larger time steps than the DG method [27]. Many
groups also reported that FR is more efficient than the DG method [28,29]. Here we present a brief
introduction of the FR/CPR method starting from a hyperbolic conservation law
∂U/∂t + ∇ · F(U) = 0
Eq. 3.2.3
with non-overlapping elements, and introducing an arbitrary test function W in each element, the
weighted residual formulation of Eq. 3.2.3 on element Vi can be expressed as
∫Vi [∂U/∂t + ∇ · F(U)] W dV = 0
Eq. 3.2.4
The conservative variables inside one element are assumed to be polynomials, and expressed by
nodal values at certain points called solution points (SPs). After applying integration by parts to the
divergence of flux, replacing the normal flux term with a common Riemann flux Fncom and integrating
back by parts, we obtain
∫Vi [∂U/∂t + ∇ · F(U)] W dV + ∮∂Vi [Fncom − Fn(Ui)] W dS = 0
Eq. 3.2.5
Here, the common Riemann flux is computed with a Riemann solver
Fncom = Fncom(Ui, Ui+, n)
Eq. 3.2.6
where Ui+ stands for the solution outside the current element, and n denotes the outward normal
direction of the interface. For the current work, the Riemann solver used is the Roe scheme [30]. The
normal flux at the interface is
Fn(U) = F(U) · n
Eq. 3.2.7
Note that if the face integral in Eq. 3.2.5 can be transformed into an element integral then the test
function W will be eliminated. In order to do so, a "correction field" δi is defined in each element as
∫Vi δi W dV = ∮∂Vi [Fn] W dS
Eq. 3.2.8
where [Fn] = Fncom − Fn(Ui) is the normal flux jump. Eq. 3.2.5 and Eq. 3.2.8 result in
∫Vi [∂U/∂t + ∇ · F(U) + δi] W dV = 0
Eq. 3.2.9
The final formulation at each solution point is
∂Ui,j/∂t + Πj(∇ · F(Ui)) + δi,j = 0
Eq. 3.2.10
where Πj denotes a projection to the polynomial space, and the subscript j denotes the j-th solution point in a certain element. The formulation used for the viscous flux and the solution gradient is the so-called BR2 scheme [31].
3.2.4.3 Matrix-free GMRES Algorithm
The GMRES algorithm developed by Saad and Schultz [23] is an iterative linear system solver. It is
widely used in the CFD community, including with high-order methods [21]. However, its efficiency
is strongly influenced by the matrix condition number [22], and high-order discretizations typically lead to stiffer problems. Furthermore, the size of the Jacobian matrix arising from high-order methods grows very quickly with the solution polynomial order p; it is proportional to p^6 for 3-D simulations. The matrix-free version of the GMRES algorithm takes advantage of the fact that the Jacobian matrix is used only in matrix-vector multiplications, so one does not need to form the matrix explicitly. Different approximations can be used for this operation [22], and in this work a simple first-order finite difference approach is used, as in Eq. 3.2.15.
A robust and efficient preconditioner that scales well on massive parallel systems is still a research
challenge [21]. This current work does not use any preconditioner. The fact that only unsteady
simulations are performed helps to partially mitigate this issue. Previous works have shown the
importance of preconditioners for steady simulations starting the iterative process from the
freestream solution [21,22,32].
The framework used here is often called the Newton-Krylov method [22]. It has an outer nonlinear loop, the Newton method, which drives the unsteady residual to zero, and an inner iterative linear solver, the GMRES method, which solves the linear system originating from the Newton iterations. The following is the derivation of this framework. Assume that after the spatial discretization the semi-discrete problem can be written as
dŨ/dt = R(Ũ)
Eq. 3.2.11
in which Ũ represents all the degrees of freedom in the problem and R is the residual operator given by the spatial discretization. The problem can be rewritten so that in every physical time step the following nonlinear system is solved:
R*(Ũn+1) = dŨ/dt|n+1 − R(Ũn+1) = 0
Eq. 3.2.12
where R* is the unsteady residual. The second order backward differentiation formula (BDF2)
time scheme is used for the time derivative discretization in this work. A Newton solver can be used
to solve this system, with the Jacobian matrix J defined as
J = ∂R*(Ũ)/∂Ũ
Eq. 3.2.13
The Newton method consists of a sequence of iterations, with the following linear systems to be solved:
J(Ũk) ΔŨk = −R*(Ũk),   Ũk+1 = Ũk + ΔŨk
Eq. 3.2.14
with a given initial condition Ũ0 equal to the current time step solution. The Newton iteration can be
terminated based on a required drop in the norm of the unsteady nonlinear residual defined in Eq.
3.2.12. This nonlinear convergence tolerance here named ’time-eps’ can affect the final solution
accuracy, and its value is studied in this paper. In order to solve the linear system defined in Eq.
3.2.14 we use the matrix-free GMRES method without a preconditioner. This linear system can be
solved up to some tolerance named ’linear-eps’, and this parameter’s effect on the method efficiency
and convergence is studied in this paper too. The matrix vector multiplication in the GMRES solver
is replaced by the following approximation:
J v ≈ [R*(Ũ + εv) − R*(Ũ)] / ε
Eq. 3.2.15
where ε is a small perturbation, here chosen to be ε = |Ũk| · 10-7. The pseudocode of the GMRES method for solving the problem Jx = b is shown in Algorithm 1, following the reference [33]. The size of the Krylov subspace is given by the parameter m ≤ 15, which is very small compared to the mesh size.
It is important to draw attention to some points in this algorithm. First, lines 4 and 5 are the classical
Gram Schmidt (CGS) orthogonalization algorithm. Typically the modified Gram Schmidt process
could be used here to provide robustness for ill-conditioned problems [33], but the CGS algorithm is
chosen since it is more suitable for GPU parallelism. No impact on the robustness and stability of the algorithm was noticed when using the classical method instead of the modified one in the cases
considered in the current work. The CGS algorithm has a better performance on GPUs because it
involves matrix-vector operations instead of just vector-vector operations in the modified algorithm.
Secondly, note that the cost of line 3 is basically one residual evaluation. Therefore, it is important to
have an efficient implementation of the residual calculation [19].
The numerical results in Sec. 3.2.4.4 show that most of the runtime is spent on the residual calculation in line 3 and in the CGS algorithm. Another important aspect of the GMRES solver is that an efficient global reduction and dot-product kernel on the GPU is important for lines 1, 4 and 6 of Algorithm 1. The current implementation uses the warp shuffle reduction proposed by NVIDIA to reduce the local vector inside each GPU to a scalar, and then a CUDA-aware MPI implementation is used to obtain the final result across multiple GPUs.
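A sketch of such a warp-shuffle dot-product reduction is given below (a generic CUDA pattern, not the paper's implementation; it produces one partial sum per block, after which a second reduction on the device or host and an MPI_Allreduce across GPUs would give the global value):

// Reduce a value across the 32 threads of a warp.
__inline__ __device__ double warp_reduce_sum(double val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block-level partial dot product of the local vectors x and y.
__global__ void dot_partial(const double *x, const double *y, double *partial, int n)
{
    __shared__ double warpSums[32];                  // one slot per warp of the block
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        sum += x[i] * y[i];                          // grid-stride loop over the local DOFs

    sum = warp_reduce_sum(sum);                      // reduce within each warp
    if ((threadIdx.x & 31) == 0) warpSums[threadIdx.x >> 5] = sum;
    __syncthreads();

    if (threadIdx.x < 32) {                          // first warp reduces the per-warp sums
        int nWarps = (blockDim.x + 31) >> 5;
        double v = (threadIdx.x < nWarps) ? warpSums[threadIdx.x] : 0.0;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0) partial[blockIdx.x] = v;
    }
}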
The least-squares problem defined in line 10 has a very small size, of the order of the number m of Krylov vectors used. In the current work a common technique is used to solve it, namely transforming the Hessenberg matrix into upper triangular form with plane rotations [33]. This technique allows checking the convergence of the method at each iteration l without extra work, and the final solution update ym is calculated on the CPU by a backward substitution algorithm. In addition, note that if the linear system is not solved to the requested tolerance, the current implementation uses the GMRES output to update the solution anyway and moves on to the next nonlinear iteration; the restarted version of GMRES is not used to continue solving the linear system.
Following this description of the Newton-Krylov approach, there are three major parameters: the unsteady residual convergence tolerance 'time-eps', the linear system convergence tolerance 'linear-eps' and the maximum size m of the search space. Their impact is numerically investigated in the following section. The parameter 'time-eps' is related to the accuracy of the time discretization, while 'linear-eps' and m are related to the performance of GMRES. In addition, two other methods are used for performance comparison: the LU-SGS solver [34, 35] with the BDF2 time scheme and the third-order explicit strong-stability-preserving Runge-Kutta scheme (RK3) [36]. The LU-SGS solver is implemented only for CPUs, and its performance for the current code is shown in Ref. [8]. The RK3 scheme is also implemented on GPUs, with its performance described in Refs. [8, 19].
3.2.4.4 Numerical Results
3.2.4.4.1 Multi-GPU Performance and Scalability Results
In this section, results related to single- and multi-GPU performance are described and discussed. The goal is to investigate the difference in performance between simulations with the four cell types and different solution orders. Figure 3.2.26 shows the speedup obtained with respect to a single CPU core of an Intel Xeon E5-2620 chip. This Intel chip has 6 cores and hyper-threading capability, but the comparison here is with respect to single-core performance only.
Figure 3.2.26 NVIDIA V100 GPU speedup compared to a single CPU core of an Intel Xeon E5-2620 chip
The GPU speedup for the explicit scheme with hexahedral cells is around 400 for all orders considered. It increases slightly for p3 and p5, reaching roughly 475 and 455 respectively. On the other hand, the speedup results for tetrahedral cells increase with the solution polynomial order, growing from 450 to 1600 for p1 and p5 respectively. This is discussed in Ref. [19]; one of the reasons is that a tetrahedral cell is arithmetically more intense, with a much larger ratio of floating-point operations to global memory accesses as the polynomial degree increases [19]. As also discussed in Ref. [19], a pyramid cell follows the tetrahedron performance closely, and the prism speedup falls between that of a hexahedron and a tetrahedron.
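One common way to quantify this effect (a standard definition, not a formula taken from Ref. [19]) is the arithmetic intensity of the residual kernel,

\[
I = \frac{W}{Q}\ \left[\frac{\text{FLOPs}}{\text{byte}}\right],
\]

where W is the number of floating-point operations performed and Q is the number of bytes moved to and from global memory. Kernels with a larger I, such as the high-order tetrahedral residual, keep the GPU's arithmetic units busy relative to its memory bandwidth and therefore tend to show larger speedups over a single CPU core.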
The GMRES results in Figure 3.2.26 show a similar trend between the different cell types as the explicit scheme, but with a slightly lower speedup. For hexahedral cells, it is 390 and 380 for p3 and p5 respectively, while for tetrahedral cells it is 380 and 1300 for p1 and p5 respectively. These slightly lower values can be understood from the fact that almost all of the cost of an explicit iteration comes from the residual operator, which is arithmetically very intense as discussed before [19] and involves matrix-vector and matrix-matrix operations.
On the other hand, the GMRES solver has additional operations other than the residual calculation. As described in Sec. 3.2.4.3, these are mainly vector-vector operations and global reductions, which have a lower parallel speedup on the GPU architecture than matrix-matrix operations. Table 3.2.4 shows the measured percentage of a typical simulation iteration that is spent in the residual operator (line 3 of Algorithm 1) and in the classical Gram-Schmidt (CGS) algorithm (lines 4 and 5 of Algorithm 1). We can see that the CGS method accounts for 7% to 18% of the iteration cost. Interestingly, this percentage increases with the polynomial order for the hex and prism elements, which have a considerably higher number of degrees of freedom per cell than the pyramid and tet elements at higher orders.

Table 3.2.4 Results of the percentage of a typical simulation iteration that is spent in different parts of the GMRES algorithm, for different solution orders and cell types

Figure 3.2.27 Strong scalability study with GPU simulations on the KU cluster for a mesh with 149k hex elements
Another important aspect investigated is the strong scalability performance of the matrix-free
GMRES solver. Strong scalability tests are done with two different grids: a coarse hexahedral mesh
with 149k cells and a fine hexahedral mesh with 15.6M elements. The coarse mesh is used on the KU
cluster. In this test, 14 nodes are used, and each node has 2 V100 GPUs. Each node also has one Intel(R) Xeon(R) CPU E5-2680 v3 chip with 24 CPU cores counting hyper-threading. The network connection uses InfiniBand, but CUDA-aware MPI is not available. The results are shown in
Figure 3.2.27.
We can see that the parallel efficiency grows with the polynomial order. Moreover, the difference
between the scalability of the explicit scheme and the GMRES solver is not so significant for p3 and
p5 as it is for p1. With 28 GPUs the number of cells per partition is 5.3k and the GMRES parallel
efficiency is 43%, 67% and 87% for p1, p3 and p5 respectively.
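For reference, the strong-scaling parallel efficiency quoted here can be read through the usual definition (assuming, as is common, that it is measured relative to the smallest GPU count N0 used in the study),

\[
E(N) = \frac{N_0\,T(N_0)}{N\,T(N)} \times 100\%,
\]

where T(N) is the wall-clock time per iteration on N GPUs. Efficiency drops at large GPU counts because the 5.3k cells per partition no longer provide enough work to hide communication and kernel-launch overheads, especially at the lower polynomial orders.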
The strong scalability results for the fine hex mesh on Summit are shown in Figure 3.2.28. The Summit supercomputer is a leadership computing facility operated by the Oak Ridge National Laboratory for the United States Department of Energy. From June 2018 to June 2020, Summit was the most powerful supercomputer on the TOP500 list [11]. Each node on Summit has 6 V100 GPUs and 2 IBM POWER9 CPU chips totaling 44 CPU cores. Here, CUDA-aware MPI communication is used.
The results in Figure 3.2.28 indicate again that the GMRES parallel efficiency grows with the solution order. Other than the p1 case, the scalability of the explicit scheme and the GMRES solver seems similar. More than 88% parallel efficiency is achieved with 240 GPUs for all simulations considered. With 1200 GPUs, the number of cells per partition is 13k and the GMRES parallel efficiency is 21%, 45% and 79% for p1, p3 and p5 respectively.

Figure 3.2.28 Strong scalability study with GPU simulations on Summit for a mesh with 15.6M hex elements
3.2.4.4.2 Transitional Flow Over the T106C Low Pressure Turbine Cascade
This case setup is taken from the 4th International Workshop on High-Order CFD Methods [37]. The
blade has a chord of C = 0.093 m, a pitch to chord ratio of 0.95, and a span to chord ratio of 10%. The
inlet condition for the cascade is chosen such that the isentropic exit Mach number is Mis = 0.65 and
the Reynolds number is Re = 80,000 based on the isentropic exit velocity. The inlet flow angle for this
case is 32.7 degrees. As the inlet turbulence is very low, the flow has a laminar separation and a relatively
slow natural transition. A direct numerical simulation (DNS) of this case is established in Ref. [38]
with an extensive hp-refinement study. The solver used in this reference [38] is the CPU version of
the same solver as the current work, namely the hpMusic solver. In addition, the fine mesh used in this reference was made available for the current GPU work. This test case is used to evaluate the effect of the parameters 'time-eps', 'linear-eps' and the Krylov space size, as well as to compare the turnaround time of simulations with three different time solvers: the LU-SGS solver [8, 34, 35] on CPUs, the matrix-free GMRES on GPUs and the explicit RK3 scheme [36] on GPUs. A closeup view of the mesh around the blade is shown in Figure 3.2.30 (a). It has a total of 148,704 hexahedral elements. Periodic boundaries are used in the spanwise and pitchwise directions. For a p2 polynomial order, the total number of degrees of freedom is 4 million per equation. We use the same definition of the characteristic time as Ref. [38], Tc = C/U, with C being the blade chord and U the freestream velocity based on the isentropic exit conditions. Following the analysis in Ref. [38], all simulations are run for 180Tc, with the last 100Tc being used for time averaging. Figure 3.2.30 (b) displays the average pressure contours around the blade. The average values of the equivalent y+, x+ and z+ for all simulations are 0.5, 10 and 11 respectively.

Figure 3.2.29 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic time Tc using different physical time steps

Figure 3.2.30 Mesh provided by Alhawwary et al. [38] and time-averaged solution contours of the p2 simulation with GPU
The first test is related to the physical time step.
The GMRES method is tested with different time
steps, from a factor of 1 to 100 times the
maximum stable explicit RK3 time step. The
performance results are shown as the speedup
relative to the GPU RK3 simulation when
comparing the cost to run one characteristic time
Tc. Figure 3.2.29 shows the results for p1, p3
and p5 simulations using the ’linear-eps’ as 0.2,
the Krylov subspace size as 10 and the ’time-eps’
as 0.01. The results in Figure 3.2.29 indicate
that there is an optimum time step to be used, and
for this case it is around 10 to 15 times the time
step from the RK3 stability limit. For a bigger time
step, the GMRES solver convergence rate starts to
deteriorate as the Jacobian matrix condition
number increases. For this case, the improvement in simulation turnaround time from using GPU-GMRES compared to GPU-RK3 is about 1.5 for p1 and p3, and 2.0 for p5. The second study is related to the linear system convergence tolerance parameter. Simulations are run
varying the ’linear-eps’ between 0.2 and 0.01. In order for the GMRES solver to converge with 0.01
as the convergence tolerance, a Krylov subspace size of 30 is used in these tests. The p2 results with a time step 10 times bigger than the RK3 time step and a 'time-eps' of 0.01 are shown in Figure 3.2.31.

Table 3.2.5 CPU and GPU simulation final settings for the T106C test case

Figure 3.2.31 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic time Tc using different GMRES linear convergence tolerances
It is noticeable that for this case the final speedup value does not change significantly when varying the 'linear-eps', at least when there are enough Krylov vectors to converge. Moreover, the unsteady
residual tolerance of 0.01 is relatively high when compared to the convergence tolerance for steady
problems. Since GPU memory is a scarce resource, the ’linear-eps’ of 0.2 is chosen with a Krylov
subspace of 10, which is already enough to converge in this problem.
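A rough storage estimate illustrates why the Krylov subspace size matters on GPUs. Assuming double precision, five conserved variables, and the roughly 4 million degrees of freedom per equation quoted above for this p2 mesh (these assumptions are ours, not figures reported in the paper), the Krylov basis alone requires, in aggregate across all GPUs, approximately

\[
\text{memory} \approx (m+1)\,N_{\text{dof}}\,N_{\text{eq}} \times 8\ \text{bytes}
\approx 31 \times (4\times10^{6}) \times 5 \times 8\ \text{B} \approx 5\ \text{GB} \quad (m = 30),
\]

compared with roughly 1.8 GB for m = 10, and this storage has to fit alongside the solution, residual and metric data in the limited memory of the GPUs.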
Considering the results of the parametric tests, two GPU GMRES simulations are run for the duration
of 180Tc with ’linear-eps’ equal to 0.2 and 10 Krylov vectors. One converges the unsteady residual
tolerance to 0.01 and the other to 0.001. The complete settings are shown in Table 3.2.5 including
the one for the CPU LU-SGS run and the GPU RK3 simulation. The speedup when using the GPU
GMRES algorithm over the GPU RK3 is a factor of 1.7 and 1.2 for unsteady residual tolerances of 0.01
Figure 3.2.32 Comparison of the time- and spanwise-averaged pressure coefficient between the CPU
simulation and different GPU simulations
Figure 3.2.33 Comparison of the time- and spanwise-averaged coefficient of streamwise friction
between the CPU simulation and different GPU simulations
83
and 0.001 respectively. For this test case and mesh, the ratio between the maximum and minimum
cell Jacobian is 104, not so dramatic as the case in the next section. In order to understand the effect
of the parameter ’time-eps’ in the solution accuracy, a comparison is made in terms of time- and
spanwise-averaged coefficients. Four simulations are considered: CPU, GPU-RK3, GPU-GMRES-eps2
and GPU-GMRES-eps3. The one named ’GPU-GMRES-eps2’ has ’time-eps’ equal to 0.01 and ’GPU-
GMRES-eps3’ has ’time-eps’ equal to 0.001. Figure 3.2.32 shows the results for time- and spanwise-
averaged pressure coefficient Cp. The CPU and GPU-RK3 curves are visually on top of each other,
which helps to verify the GPU implementation. When the GMRES time solver is used, the run with
’time-eps’ equal to 0.001 is also on top of the explicit time run and the CPU simulation.
Some differences start to appear in the suction side of the blade, close to the trailing edge, when the
unsteady convergence tolerance for the GMRES is just 0.01. It is at this location that the laminar flow
separation happens with vortex breakdown and transition into turbulent structures. This region is
difficult to accurately resolve. The closer view at the trailing edge better shows the difference
between the results. It is interesting that at least for this case the GMRES solver needs to have a ’time-
eps’ equal to 0.001 to
match the results of the
LU-SGS solver with a
’time-eps’ of 0.01.
Another comparison is made for the time- and spanwise-averaged streamwise friction coefficient Cf in Figure 3.2.33. This coefficient is harder to converge and harder to bring to a statistically steady state [38]. The CPU, GPU-RK3 and GPU-GMRES-eps3 curves hardly show any visual difference in the global view. Similar to the pressure coefficient, the blade section where the difference is large for the GPU-GMRES-eps2 is on the suction side, close to the trailing edge, as shown in the enlarged view. The GPU-GMRES-eps2 slightly under-predicts the Cf peak at 0.95 of the axial chord.

Table 3.2.6 Comparison of mean lift and drag coefficients between the CPU simulation and different GPU simulations. Relative error is calculated with respect to the CPU simulation

Figure 3.2.34 Comparison of the PSD of pressure at wake point (1)
The mean lift coefficient Cl and drag coefficient Cd are also calculated and compared in Table 3.2.6. The relative error with respect to the CPU simulation is calculated and also shown in the table. Note that even the GPU-RK3 and CPU simulations have a difference of 0.1% in Cl and 4 drag counts in Cd. Possible explanations are the differences in force-calculation sampling frequency, time step and time scheme. The GPU-GMRES-eps3 simulation has very similar results to the GPU-RK3 case as well, with a difference of 2 drag counts. Interestingly, the GPU-GMRES-eps2 run has an error of only 5 drag counts when compared with the CPU run, but the Cl relative error
A final comparison is made in Figure 3.2.34 with respect to the power spectral density (PSD) of the
pressure at a point in the wake. Following reference [38], the point is located close to the trailing
edge with coordinates (0.8591, -0.5137)C. The same tool [39] is used in the PSD calculation, with
more details provided in Ref. [38]. We conduct a spanwise average using four locations in the spanwise direction, Z/C ∈ {0.0, 0.025, 0.05, 0.075}. The Strouhal number St = fC/U is calculated based on the chord C, the exit velocity U and the frequency f in Hz. Due to differences in the sampling
frequency, the Strouhal number cutoff is different between the GPU-GMRES simulations and the CPU
run. All simulations are able to capture the first fundamental peak and agree in its frequency and
magnitude. Moreover, the CPU, GPU-RK3 and GPU-GMRES-eps3 curves seem to be on top of each
other. The GPU-GMRES-eps2 appears to capture some of the 2nd and 3rd harmonic frequencies of
this first peak, but it slightly damps their magnitudes. The high-frequency slope decay is also slightly
damped by the GPU-GMRES-eps2 simulation.
3.2.4.4.3 Flow Over a High-Lift Configuration of the Common Research Model
This test case is from the most recent 4th AIAA High-Lift Prediction Workshop
(https://hiliftpw.larc.nasa.gov/). Both the geometry and flow conditions are provided on the
workshop web site. The High Lift Common Research Model (CRM-HL) is an open-source, publicly-
available wing-body high lift configuration being utilized for CFD validation within a broad
international CRM-HL ecosystem. Geometry associated with the NASA 10% scale, semi-span model
configuration tested in the QinetiQ 5-metre wind tunnel is used for this study. This selected problem
is case 2a at the highest angle of attack of 21.46 degrees. The high-order (Q2) computational mesh
was generated by Steve Karman of Pointwise. This mesh contains over 4 million mixed elements.
Selected views of the computational meshes are displayed in Figure 3.2.35. The flow conditions
are: free-stream Mach number of 0.2, and Reynolds number based on the mean aerodynamic chord
of 5.49 million. The far-field boundary is placed 1,000 chords away from the geometry. The smallest and largest element sizes differ by more than 6 orders of magnitude. We report preliminary wall-clock times for both explicit and implicit schemes for the p2 simulations on KU's NVIDIA V100 cluster. For the explicit three-stage SSP Runge-Kutta scheme, the maximum time step for stability is about 1.39e-9 s because of some small cells near the wall boundaries. The simulation used 30 V100 GPUs, and one time step took roughly 0.053 s. For the implicit run, the optimized 2nd-order BDF scheme was used together with a GMRES solver without a preconditioner. We considered two inner convergence tolerances of 0.01 and 0.001. Much bigger time steps of 5e-7 s and 1e-6 s can be used for the implicit simulation. In fact, for dt = 1e-6 s, we employed an inner tolerance of 0.01. One implicit time-marching step costs 1.1 s of wall-clock time. Considering the much bigger time step used by the implicit scheme, the speedup in this case is roughly 35. When we used an inner tolerance of 0.001, it took roughly twice the wall-clock time for the implicit scheme to converge, resulting in a speed-up of 15. We are still in the process of running this case, and generating time-converged mean and Reynolds stresses for the workshop. But the simulation is proceeding as expected.
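As a consistency check on the quoted factor of roughly 35, the speedup per unit of physical time can be estimated from the numbers above as the ratio of time-step sizes divided by the ratio of wall-clock costs per step,

\[
\text{speedup} \approx \frac{\Delta t_{\text{imp}}/\Delta t_{\text{exp}}}{t_{\text{imp}}/t_{\text{exp}}}
= \frac{1\times10^{-6}/1.39\times10^{-9}}{1.1/0.053} \approx \frac{719}{20.8} \approx 35.
\]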
A preliminary flow field is displayed in Figure 3.2.36, which shows the iso-surfaces of the Q-criterion colored by the streamwise velocity. This picture clearly depicts a vortex-dominated turbulent flow field.

Figure 3.2.36 Instantaneous flow field showing iso-surfaces of the Q-criterion colored by the stream-wise velocity for a p2 simulation
3.2.4.5 Concluding Remarks
The present paper evaluates the performance of an implementation of the matrix-free GMRES
algorithm with the high-order FR/CPR method on GPU clusters for industrial large eddy simulation.
Differences in speedup are discussed for four cell types: hexahedrons, prisms, pyramids and
tetrahedrons. The effect of increasing the polynomial order from p1 to p5 is also studied. This implicit
solver does not need to form the Jacobian matrix and uses less memory, a scarce resource on GPUs,
and no matrix preconditioner is considered.
The speedup from one CPU core to a GPU card is measured for NVIDIA V100 GPUs. It is about 380 for hexahedral cells, and it grows from 380 to 1300 for tetrahedral cells when increasing the polynomial order from p1 to p5. Depending on the solution order and cell type, the GMRES solver spends around 70% of its time in the residual operator and 15% in the classical Gram-Schmidt algorithm. This last part is important in order to keep the implicit solver with CPU/GPU speedups and scalability characteristics similar to the explicit RK3 scheme.

Figure 3.2.35 Views of the high-order (Q2) hybrid mesh generated by Pointwise with over 4 million elements
A benchmark case of the flow over the T106C turbine blade is used for parametric tests. A linear convergence tolerance of 0.2 and a Krylov subspace size of 10 seem to give the best performance in this case. Two GMRES simulations are done with different unsteady residual convergence tolerances, and a value of 0.001 seems to be necessary in order to match the explicit simulation results. In this case, the GPU GMRES solver speedup over the GPU RK3 simulation is 1.2.
A test case from the most recent 4th AIAA High-Lift Prediction Workshop is used to measure the
performance of this method on mixed meshes and with a higher Reynolds number. Preliminary
results on a GPU cluster indicate that the GMRES solver, when compared to the RK3 scheme, reduces the simulation cost by a factor of 15 when the unsteady convergence tolerance is 0.001 and by 35 when the tolerance is 0.01.
Acknowledgements
The present research has been supported by GE Global Research and the Army Research Office under
Award Number W911NF-20-1-0065. This research used resources of the Oak Ridge Leadership
Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of
the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
3.2.5 References
[1] Wang, Z. J., Fidkowski, K., Abgrall, R., Bassi, F., Caraeni, D., Cary, A., Deconinck, H., Hartmann, R.,
Hillewaert, K., Huynh, H. T., Kroll, N., May, G., Persson, P., van Leer, B., and Visbal, M., “High-Order CFD
Methods: Current Status and Perspective,”
[2] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD
Vision 2030 study: a path to revolutionary computational aero sciences,” NASA CR-2014-218178,
NASA, 2014.
[3] Cockburn, B., and Shu, C. W., “TVB Runge-Kutta Local Projection Discontinuous Galerkin Finite
Element Method for Conservation Laws V: Multidimensional Systems,” Journal of Computational
Physics, Vol. 141, 1998, pp. 199–224.
[4] Huynh, H. T., “A flux reconstruction approach to high-order schemes including discontinuous
Galerkin methods,” 18th AIAA Computational Fluid Dynamics Conference, Miami, FL, 2007.
[5] Uranga, A., Persson, P.-O., Drela, M., and Peraire, J., “Implicit Large Eddy Simulation of transition
to turbulence at low Reynolds numbers using a Discontinuous Galerkin method,” International
Journal for Numerical Methods in Engineering, Vol. 87, No. 1-5, 2011, pp. 232–261.
doi:10.1002/nme.3036.
[6] Beck, A. D., Bolemann, T., Flad, D., Frank, H., Gassner, G. J., Hindenlang, F., and Munz, C.-D., “High-
order discontinuous Galerkin spectral element methods for transitional and turbulent flow
simulations,” International Journal for Numerical Methods in Fluids, Vol. 76, No. 8, 2014, pp. 522–548.
doi:10.1002/fld.3943.
[7] Wang, Z., Li, Y., Jia, F., Laskowski, G. M., Kopriva, J., Paliath, U., and Bhaskaran, R., “Towards
industrial large eddy simulation using the FR/CPR method,” Computers & Fluids, Vol. 156, 2017.
[8] Jia, F., Wang, Z., Bhaskaran, R., Paliath, U., and Laskowski, G. M., “Accuracy, efficiency and
scalability of explicit and implicit FR/CPR schemes in large eddy simulation,” Computers & Fluids, Vol.
195, 2019, p. 104316. doi:https://doi.org/10.1016/j.compfluid.2019.104316,
URL http://www.sciencedirect.com/science/article/pii/S0045793019302762.
[9] Pazner, W., Franco, M., and Persson, P.-O., “High-order wall-resolved large eddy simulation of
transonic buffet on the OAT15A airfoil,” 2019 AIAA Aerospace Sciences Meeting, AIAA Paper No. 2019-1152, San Diego, CA, 2019.
[10] Loppi, N. A., Witherden, F. D., Jameson, A., and Vincent, P. E., “A high-order cross-platform incompressible Navier–Stokes solver via artificial compressibility with application to a turbulent jet,” Computer Physics Communications, Vol. 223, 2018, pp. 193–205.
[11] Meuer, H., Strohmaier, E., Dongarra, J., Simon, H., and Meuer, M., “TOP500 high performance computing list,” https://www.top500.org/lists/2019/11/, 2019.
[12] Vermeire, B., Witherden, F., and Vincent, P., “On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools,” Journal of Computational Physics, Vol. 334, 2017, pp. 497–521.
[13] Xu, C., Deng, X., Zhang, L., Rang, J., Wang, G., Jiang, Y., Cao, W., Che, Y., Wang, Y., Wang, Z., Liu, W.,
and Cheng, X., “Collaborating CPU and GPU for large-scale high-order CFD simulations with complex
grids on the TianHe-1A supercomputer,”Journal of Computational Physics, Vol. 278, 2014.
[14] Khajeh-Saeed, A., and Perot, J. B., “Direct numerical simulation of turbulence using GPU accelerated supercomputers,” Journal of Computational Physics, Vol. 235, 2013, pp. 241–257.
doi:https://doi.org/10.1016/j.jcp.2012.10.050,
URL http://www.sciencedirect.com/science/article/pii/S0021999112006547.
[15] Romero, J., Crabill, J., Watkins, J., Witherden, F., and Jameson, A., “ZEFR: A GPU-accelerated high-
order solver for compressible viscous flows using the flux reconstruction method,” Computer Physics
Communications, Vol. 250, 2020, p. 107169. doi:https://doi.org/10.1016/j.cpc.2020.107169.
[16] Zolfaghari, H., Becsek, B., Nestola, M. G., Sawyer, W. B., Krause, R., and Obrist, D., “High-order
accurate simulation of incompressible turbulent flows on many parallel GPUs of a hybrid-node
supercomputer,” Computer Physics Communications, Vol. 244, 2019, pp. 132–142.
doi:https://doi.org/10.1016/j.cpc.2019.06.012, URL http://www.sciencedirect.com/
science/article/pii/S0010465519301997.
[17] Oyarzun, G., Chalmoukis, I. A., Leftheriotis, G. A., and Dimas, A. A., “A GPU-based algorithm for
efficient LES of high Reynolds number flows in heterogeneous CPU/GPU supercomputers,” Applied
Mathematical Modelling, Vol. 85, 2020, pp. 141–156.
doi:https://doi.org/10.1016/j.apm.2020.04.010.
[18] Aissa, M., Verstraete, T., and Vuik, C., “Toward a GPU-aware comparison of explicit and implicit
CFD simulations on structured meshes,” Computers & Mathematics with Applications, Vol. 74, No. 1,
2017, pp. 201–217. doi:https://doi.org/10.1016/j.camwa.2017.03.003,
URL http://www.sciencedirect.com/science/article/pii/S0898122117301438, 5th European
Seminar on Computing ESCO 2016.
[19] Jourdan, E., and Wang, Z. J., “Efficient Implementation of the FR/CPR Method on GPU Clusters for
Industrial Large Eddy Simulation,” AIAA Aviation 2020 Forum, Reno, NV, 2020.
[20] Franco, M., Camier, J.-S., Andrej, J., and Pazner, W., “High-order matrix-free incompressible flow
solvers with GPU acceleration and low-order refined preconditioners,” Computers & Fluids, Vol. 203,
2020, p. 104541. doi:https://doi.org/10.1016/j.compfluid.2020.104541.
[21] May, G., and Jameson, A., “Efficient Relaxation Methods for High-Order Discretization of Steady
Problems,” Advances in Computational Fluid Dynamics, Vol. 2, 2011.
doi:10.1142/9789814313193_0013.
[22] Knoll, D., and Keyes, D., “Jacobian-free NewtonKrylov methods: a survey of approaches and
applications,” Journal of Computational Physics, Vol. 193, No. 2, 2004, pp. 357–397. doi:
https://doi.org/10.1016/j.jcp.2003.08.010, URL
http://www.sciencedirect.com/science/article/pii/S0021999103004340.
[23] Saad, Y., and Schultz, M. H., “GMRES: A Generalized Minimal Residual Algorithm for Solving
Nonsymmetric Linear Systems,” SIAM Journal on Scientific and Statistical Computing, Vol. 7, No. 3,
1986, pp. 856–869. doi:10.1137/0907058, URL https://doi.org/10.1137/0907058.
[24] Wang, Z. J., and Gao, H., “A unifying lifting collocation penalty formulation including the
discontinuous Galerkin, spectral volume/difference methods for conservation laws on mixed grids,”
Journal of Computational Physics, Vol. 228, No. 21, 2009, pp. 8161–8186.
[25] Huynh, H. T.,Wang, Z. J., and Vincent, P. E., “High-order methods for computational fluid
dynamics: a brief review of compact differential formulations on unstructured grids,” Computers &
Fluids, Vol. 98, 2014, pp. 209–220.
[26] Wang, Z. J., “A perspective on high-order methods in computational fluid dynamics,” Science
China Physics, Mechanics & Astronomy, Vol. 59, No. 1, 2016, p. 614701.
[27] Vincent, P. E., Castonguay, P., and Jameson, A., “A new class of high-order energy stable flux
reconstruction schemes,” Journal of Scientific Computing, Vol. 47, No. 1, 2011, pp. 50–72.
[28] Park, J. S., You, H., and Kim, C., “Higher-order multi-dimensional limiting process for DG and
FR/CPR methods on tetrahedral meshes,” Computers & Fluids, Vol. 154, 2017, pp. 322–334.
[29] Yu, M., Wang, Z. J., and Liu, Y., “A numerical method for solving incompressible viscous flows
problems,” Journal of Computational Physics, Vol. 259, 2014, pp. 70–95.
[30] Roe, P. L., “Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes,” Journal
of Computational Physics, Vol. 43, No. 2, 1981, pp. 357–372.
[31] Bassi, F., and Rebay, S., “A high order discontinuous Galerkin method for compressible turbulent
flows.” Discontinuous Galerkin Methods. Theory, Computation and Applications, edited by B. Cockburn,
G. Karniadakis, and C. Shu, Springer, 2000, pp. 77–88.
[32] Saad, Y., and van der Vorst, H. A., “Iterative solution of linear systems in the 20th century,” Journal
of Computational and Applied Mathematics, Vol. 123, No. 1, 2000, pp. 1–33. doi:
https://doi.org/10.1016/S0377-0427(00)00412-X, URL
http://www.sciencedirect.com/science/article/pii/S037704270000412X,
numerical Analysis 2000. Vol. III: Linear Algebra.
[33] Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed., Society for Industrial and Applied
Mathematics, 2003.
doi:10.1137/1.9780898718003, URL https://epubs.siam.org/doi/abs/10.1137/1.9780898718003.
[34] Yoon, S., and Jameson, A., “Lower-upper Symmetric-Gauss-Seidel method for the Euler and
Navier-Stokes equations,” AIAA Journal, Vol. 26, No. 9, 1988, pp. 1025–1026. doi:10.2514/3.10007.
[35] Chen, R. F., and Wang, Z. J., “Fast, block lower-upper symmetric Gauss-Seidel scheme for
arbitrary grids.” AIAA Journal, Vol. 38, No. 12, 2000, pp. 2038, 2245.
[36] Gottlieb, S., and Shu, C. W., “Total variation diminishing Runge-Kutta schemes,” Mathematics of
Computation of the American Mathematical Society, Vol. 67, No. 221, 1998, pp. 73–85.
[37] Vincent, P.,Wang, Z. J., Ekaterinaris, J., Huynh, H. T., Kroll, N., and Hillewaert, K., “4th International
Workshop on High-Order CFD Methods,” https://how4.cenearo.be, 2016.
URL https://how4.cenearo.be.
[38] Alhawwary, M., and Wang, Z. J., “On the mesh resolution of industrial LES based on the DNS of
flow over the T106C turbine,” Advances in Aerodynamics, Vol. 1, No. 21, 2019. doi:10.1186/s42774-
019-0023-6.
[39] Alhawwary, M., “A C++ toolbox for computing Discrete and Fast Fourier Transforms (DFT,FFT),
Power Spectral Density (PSD) estimates, and the sound pressure level (SPL) in (dB).”
https://github.com/mhawwary/FFTpsd, 2019. URL https://github.com/mhawwary/FFTpsd.
4 CFD and HPC Trends Forecasted for 2030
CFD codes utilize High Performance Computing (HPC) systems, so understanding where HPC technology might be in the 2030 timeframe is an important component of creating a vision for CFD codes in 2030 [18]. Of course, forecasting where HPC technologies will be in the future requires a significant amount of extrapolation, which is especially hard in such a fast-changing area as HPC. The fastest current systems can perform tens of petaFLOPS [19] (1 petaFLOPS is 10^15 floating-point operations per second), and the HPC community is working toward systems capable of 10^18 FLOPS (exaFLOPS), which are expected sometime between 2018 and 2023. Some work is even looking at 10^21 FLOPS (zettaFLOPS). However, reaching that level of performance is unlikely without radically new technologies. A common, though controversial, measure of HPC systems is the total number of floating-point operations a given system can perform in a second while solving a large linear system of equations using Gaussian elimination; this is the High Performance LINPACK (HPL) benchmark. Twice a year, a list of the top 500 systems in the world, ranked by this benchmark, is published by the Top500 organization. The current list (June 2013) is topped by the Tianhe-2 system, developed by China's National University of Defense Technology, which achieved 33.86 petaFLOPS on the LINPACK benchmark. Here, we will estimate only the peak floating-point performance in terms of the maximum number of operations that can be performed per second. We note that the performance of many applications, including CFD applications, may be more accurately estimated by using sustained memory bandwidth; for present purposes, however, peak performance is a sufficient guide, provided that other aspects of system performance remain the same.
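For orientation, peak floating-point performance is normally estimated by multiplying together the parallel resources of a system,

\[
R_{\text{peak}} = N_{\text{nodes}} \times N_{\text{sockets}} \times N_{\text{cores}} \times f_{\text{clock}} \times N_{\text{FLOP/cycle}}.
\]

As a purely hypothetical example (the numbers are ours, chosen only to show the form of the estimate), a 1,000-node system with 2 sockets per node, 12 cores per socket, a 2.5 GHz clock and 16 double-precision FLOPs per core per cycle would have a peak of about 0.96 petaFLOPS, while the sustained performance of memory-bandwidth-bound CFD codes is typically a small fraction of that peak.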
4.1 Relationship Between Semiconductors, SMT, and Microelectronics
Today’s mainstream electronics manufacturing consists mainly of semiconductor packages and
surface mount technology processes [Powell] [20]. Together, the two make up the vast majority of
readily acknowledged devices from cell phones to PCs, tablets, and laptops. Even smart high-end
toasters leverage these two technologies. The resulting products also end up in automobiles, stereos,
TVs, and remote controls. A lesser known technology, however, is working its way into both of these
markets microelectronics. Below are some abbreviated definitions, meant to gain some common
understanding before diving in further:
SMT (Surface Mount Technology): examples include a motherboard in a PC, or an FR4 board with integrated circuits (ICs), resistors, and capacitors reflow-soldered to make a completed electronic product. The number of products built with this technology is staggeringly large.
Semiconductor: consider lead frame packages with a single IC that is die-attached, wire-bonded, and over-molded. These packages are made by the millions, can be surface mounted, and soldered into through-hole substrates/FR4 boards. The number of products built with the technology is staggeringly large.
Hybrid Microelectronics: multiple ICs/packages incorporated into a space-saving package. This can include SMT and semiconductor technologies and can be very complex, with mixed technologies that allow for powerful and/or small final products. The number of products built with this technology is small compared to SMT and semiconductor packaging, but it is a growing market.
[18] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD Vision 2030 Study: A Path to Revolutionary Computational Aerosciences”, NASA/CR-2014-218178.
[19] Kraft, E. M., “Integrating Computational Science and Engineering with Testing to Re-engineer the Aeronautical Development Process”, AIAA Paper 2010-0139, 48th AIAA Aerospace Sciences Meeting, January 2010, 10.2514/6.2010-139.
[20] PTI Blog, posted by Janine Powell on Tue, Aug 20, 2013.
While SMT technology generally incorporates semiconductor products, semiconductor products do
not typically include SMT components. Microelectronics can incorporate both, and the result is
smaller and usually faster final products.
Through-hole technologies are becoming rare. Microelectronics is a growing packaging method and is working its way into both of these mainstream markets, but in a different way. Both semiconductor and microelectronic packages can utilize the same (or similar) manufacturing techniques. Both can mount bare die with conductive or non-conductive epoxies, or eutectic solders. Both generally employ traditional wire ball bonding for the first-level interconnects. These markets are distinct, but they do mix. Just as SMT technologies incorporate semiconductor packages but rarely the other way around, microelectronics can incorporate both semiconductor and SMT technologies to create complex packages and products (see Figure 4.1.1).
4.2 Comparison of Semiconductor Fabrication Sizes in HPC
A significant measure of a processor is the feature size of its components. The smaller the features,
the more elements can be placed in the same area, and hence the more powerful a processor becomes.
Feature size also has a direct impact on power consumption, and heat generation, with smaller sizes
being better. Thus, forecasting feature sizes of future processors is very important. Unfortunately,
the industry has not always been good at that forecasting, which is one reason why predicting where HPC technology will be in 2030 is particularly hard. For example, in 2005, the International Technology Roadmap for Semiconductors (ITRS) forecasted a 22-nm (1 nm = 10^-9 m) gate length by 2008; that is, the structures in a modern processor were forecast to have features with sizes around 22 nm. However, in 2008 the forecast date moved to 2011, and in 2011 it moved again to 2012. A similar slip occurred for other (smaller) gate lengths (see Figure 4.2.1). Note that the forecasts of the ITRS combine inputs from all major chip manufacturers, equipment suppliers, and research communities, so they represent the combined wisdom of the industry. Nevertheless, as Figure 4.2.1 shows, forecasting a key feature of even this basic component of processors is hard.

Figure 4.1.1 Relationship between Microelectronics, SMT and Semiconductors

Figure 4.2.1 Changing Predictions About Semiconductor Sizes

Another critical component of HPC capability in 2030 is the
advances in software infrastructure and programming methodologies that will be necessary to take
advantage of these future HPC systems. The ultimate purpose for these systems is to solve the most
pressing problems in academia and industry. In particular, industrial users pursue this technology
because of the large impact on future product designs, and the ability to avoid or minimize the use of
other, more costly methods such as wind tunnels or other types of physical tests.
4.3 Current Status of CFD
At present, CFD is used extensively in the aerospace industry for the design and analysis of air and
space vehicles and components. However, the penetration of CFD into aerospace design processes is
not uniform across vehicle types, flight conditions, or across components. CFD often plays a
complementary role to wind tunnel and rig tests, engine certification tests, and flight tests by
reducing the number of test entries and/or reducing testing hours [21, 22]. But, in many circumstances,
CFD provides the only affordable or available source of engineering data to use in product design due
to limitations either with model complexity and/or wind tunnel capability, or due to design
requirements that cannot be addressed with ground-based testing of any kind. As a result, CFD
technology development has been critical in not only minimizing product design costs, but also in
enabling the design of truly novel platforms and systems. Generally, the design process is composed
of three key phases: conceptual design, preliminary and detailed design, and product validation. The
current usage of CFD tools and processes in all three design phases is summarized below.
4.3.1 Conceptual Design
CFD is often used in the early, conceptual design of products where it has been both previously
calibrated for similar applications using data-morphing techniques, as well as for new configurations
where little or no engineering data is available to guide design decisions. Simplified models are
typically used during the conceptual optimization phase to allow reasonably accurate trades to be
made between drag, fuel consumption, weight, payload/range, thrust, or other performance
measures. Use of simplified models is necessary to allow often time consuming optimization
processes to be used in the overall design effort, but inherently places conservatism into the final
design. This conservatism derives from the use of models that are too similar within the existing
product design space, other geometric simplifications, or the use of low-fidelity CFD tools that trade
off flow physics modeling accuracy for execution speed.
4.3.2 Preliminary/Detailed Design
Once a product development program is launched, CFD is a necessary and uniformly present tool in
the detailed configuration design process. For example, CFD is indispensable in the design of cruise
wings in the presence of nacelles for commercial airplanes, and for inlet and nozzle designs; wind
tunnels are used to confirm the final designs [23, 24]. In both military and commercial aircraft design
processes, CFD is the primary source of data for aircraft load distributions and ground effect
estimations. Similarly, gas turbine engine manufacturers rely on CFD to predict component design
performance, having reduced the number of single-component rigs substantially as CFD capability
has become more mature. Increasingly, multi-component and multi-physics simulations are
performed during the design cycle, but the long clock times often associated with these processes
restrict their widespread adoption.

[21] Jameson, A., “Re-engineering the Design Process Through Computation”, AIAA Journal of Aircraft, 1999.
[22] Goldhammer, M. I., “Boeing 787 – Design for Optimal Airplane Performance”, CEAS/KATnet Conference on Key Aerodynamic Technologies, Bremen, Germany, June 2005.
[23] Malik, M. R. and Bushnell, D. M. (eds.), “Role of Computational Fluid Dynamics and Wind Tunnels in Aeronautics R&D”, NASA TP-2012-217602, September 2012.
[24] Goldhammer, M. I., “Boeing 787 – Design for Optimal Airplane Performance”, CEAS/KATnet Conference on Key Aerodynamic Technologies, Bremen, Germany, June 2005.

For space exploration, CFD is often used to gain important
insight into flow physics used to properly locate external components on the surface of launch
vehicles or spacecraft. CFD is also increasingly providing substantial portions of the aero and
propulsion performance database. In many cases, wind tunnel data is used only to anchor the CFD
data at a few test points to provide confidence in the CFD database. CFD is the primary source of data
for the hypersonic flight regime when ground testing is limited or does not exist.
4.3.3 Product Validation and Certification
As the product development process moves into the validation phase and certification testing, CFD is
often used to confirm performance test results, assess the redesign of components that show
potential for improved performance, and to answer any other questions that arise during product
testing. Typically, product configurations evolve over the testing period based on a combination of
measured results and engineering judgment bolstered by the best simulation capability available. In
general, CFD modeling capability grows to capture the required scope and physics to answer the
questions raised during testing. The expense of responding to often unplanned technical surprises, which results in more time on the test stand or in flight test, or changes in hardware, drives conservatism into aerospace designs and is a significant motivation for improving the accuracy and speed of CFD. If CFD is sufficiently accurate and fast, engineers can move from their traditional design space with greater confidence and less potential risk during testing. For each of these design phases, the performance of CFD is of critical importance.
4.3.4 CFD usage of High Performance Computing (HPC)
The effectiveness and impact of CFD on the design and analysis of aerospace products and systems
is largely driven by the power and availability of modern HPC systems. During the last decades, CFD
codes were formulated using message passing (e.g., MPI) and thread (e.g., OpenMP) software models
for expressing parallelism to run as efficiently as possible on current generation systems. However,
with the emergence of truly hierarchical memory architectures having numerous graphical
processing units (GPUs), new CFD algorithms may need to be developed to realize the potential
performance offered by such systems. Government labs, such as Oak Ridge National Lab (ORNL),
Argonne National Lab (ANL), and the NASA Advanced Supercomputing (NAS) facility at NASA
Ames research center, have often led the way with the acquisition and testing of new hardware. Much
research on testing and tailoring of CFD algorithms takes place on these platforms with heavy
participation from academia, national labs and to some extent industry as well. Government
computing resources are also used to tackle large-scale calculations of challenge problems, such as
the detailed direct numerical simulation (DNS) of spray injector atomization or high fidelity
simulations of transition and turbulent separation in turbomachinery. However, because of the high
cost of these leadership-class systems, industry and academia often purchase smaller commodity
clusters utilizing similar types of processors when the latest hardware technology is fully
demonstrated on CFD problems and other important applications.
4.3.5 Turbulence Modeling
Current practices for CFD-based workflows utilize steady Reynolds-averaged Navier-Stokes (RANS) with 1- or 2-equation turbulence models [25, 26], although hybrid unsteady RANS/LES methods are
increasingly common for certain classes of simulations in which swirling and intentionally separated flows are dominant, such as combustors. Techniques to combine near-wall RANS regions and outer flow field large-eddy simulation (LES) regions in these hybrid methods are immature. Many CFD design processes include an estimation of boundary layer transition, using a range of models, from purely empirical to coupled partial-differential equation (PDE) solutions of stability equations [27, 28]. Both approaches involve much empiricism, may be missing some modes of transition, and are evolving. As a result, no generalized transition prediction capability is in widespread use in Navier-Stokes CFD, and the default practice is to run the codes “fully turbulent”. Steady-state CFD accounts for the vast majority of simulations, while unsteady flow predictions are inherently more expensive and not yet uniformly routine in the design process, with some exceptions.

[25] Spalart, P. R. and Allmaras, S. R., “A One-Equation Turbulence Model for Aerodynamic Flows”, La Recherche Aerospatiale, No. 1, 1994, pp. 5-21.
[26] Wilcox, D. C., Turbulence Modeling for CFD, DCW Industries, 3rd edition, November 2006.
4.3.6 Process Automation
Current CFD workflows are often paced by the geometry preprocessing and grid generation phases,
which are significant bottlenecks. In some cases, where the design effort involves components of
similar configurations, specialized, automated processes are built that considerably reduce set-up
time, execution of the CFD solver, and post-processing of results. This process-to-production
capability of the CFD workflow only occurs in areas where the design work is routine and the
investment in automation makes business sense; single prototype designs and novel configurations
continue to suffer the pacing limits of human-in-the-loop workflows because the payoff for
automating is not evident. This issue is not unique to the aerospace industry.
4.3.7 Solution Uncertainty and Robustness
In practice, CFD workflows contain considerable uncertainty that is often not quantified. Numerical
uncertainties in the results come from many sources, including approximations to geometry, grid
resolution, problem setup including flow modeling and boundary conditions, and residual
convergence. Although NASA and professional organizations such as ASME and AIAA have created
standards for the verification and validation of CFD and heat transfer analyses, such techniques are
not widely used in the aerospace industry. With a few notable exceptions, CFD is carried out on fixed
grids that are generated using the best available practices to capture expected flow features, such as
attached boundary layers [29]. Such approaches cannot reliably provide adequate resolution for flow
features when locations are not known a priori, such as shocks, shear layers, and wakes. Although
grid refinement is often seen as a solution to addressing grid resolution issues, it is seldom done in
practice because uniform refinement is impractical in 3D. Adaptive mesh refinement strategies offer
the potential for superior accuracy at reduced cost, but have not seen widespread use due to
robustness, error estimation, and software complexity issues. Achieving consistent and reliable flow
solver or residual convergence remains problematic in many industrial cases. Although many CFD
codes are able to demonstrate convergence for a few simple problems, for many flows involving
difficult flow physics or complex geometries such as an aircraft in high-lift configuration, many
of the current solver techniques are not strong enough to ensure robust convergence.
Engineering judgment is required to interpret results that are not well converged, which introduces
conservatism into decision making. Furthermore, the use of steady-state flow solvers itself is in
question for many flows of engineering interest.
4.3.8 Multidisciplinary Analysis and Optimization (MDAO)
Although the basic concepts of MDAO are fairly well accepted in the community, the routine use of
MDAO methods is not, by any means, universal. At moderate levels of fidelity, it is common industrial
practice to perform coupled multidisciplinary analyses (MDA) of the most tightly integrated
disciplines in a design. Aero-structural analyses, conjugate heat transfer calculations, and aero-acoustic simulations all tend to take place in aircraft, spacecraft, jet engine, and rotorcraft analysis and design processes. High fidelity CFD is not routinely used in such MDAs, although recent years have witnessed a significant rise in the coupling of state-of-the-art CFD with additional disciplines. While frameworks for the coupling of disciplinary analyses are widely available, the ability to couple CFD with other high fidelity descriptions of participating disciplines is limited by the availability of coupling software and, more fundamentally, by a lack of general methodologies for accurate, stable, and conservative MDAs. The application of optimization techniques in industry is mostly limited to single-discipline simulations [30, 31].

[27] Stock, H. W., and Haase, W., “Navier-Stokes Airfoil Computations with eN Transition Prediction Including Transitional Flow Regions”, AIAA Journal, Vol. 38, No. 11, pp. 2059–2066, 2006, 10.2514/2.893.
[28] Langtry, R. B., Menter, F. R., “Correlation-Based Transition Modeling for Unstructured Parallelized Computational Fluid Dynamics Codes”, AIAA Journal, Vol. 47, pp. 2894-2906, 2009, 10.2514/1.42362.
[29] Mavriplis, D. J., Vassberg, J., Tinoco, E., Mani, M., Brodersen, O., Eisfeld, B., Wahls, R., Morrison, J., Zickuhr, T., Levy, D., and Murayama, M., “Grid Quality and Resolution Issues from the Drag Prediction Workshop Series”, AIAA Journal of Aircraft, Vol. 46, No. 3, pp. 935-950, March 2009.
Although conceptual design tools have long benefited from multidisciplinary optimization (MDO)
approaches, high fidelity CFD-based optimizations are still rare. During the past decade, the
development of advanced surrogate modeling techniques and the introduction of adjoint-based
optimal shape design techniques have enabled the use of CFD in aerodynamic optimization of aircraft
and gas turbine components. However, the use of optimization with multiple disciplines treated
using high-fidelity methods is still within the realm of advanced research and is by no means a routine
practice.
4.4 Vision of CFD in 2030 as anticipated by NASA
This is in fact a mirror image of the report done by USDOE, which will be covered later, but with
emphasis on CFD. Given the inherent difficulties of long-term predictions, our vision for CFD in 2030
is grounded on a desired set of capabilities that must be present for a radical improvement in CFD
predictions. Of special interests are critical flow phenomena associated with the key aerospace
application, including commercial/military aircraft, engine propulsion, rotorcraft, space exploration,
launch vehicle programs, air-breathing space-access, and spacecraft entry [32]. This set of capabilities
includes not only the accurate and efficient prediction of fluid flows of interest, but also the usability
of CFD in broader contexts (including uncertainty quantification, optimization, and multidisciplinary
applications) and in streamlined/automated industrial analysis and design processes. To complicate
things further, CFD in 2030 must effectively leverage the uncertain and evolving environment of HPC
platforms that, together with algorithmic improvements, will be responsible for a large portion of the
realized improvements. The basic set of capabilities for CFD must include, at a minimum:
Emphasis on physics-based, predictive modeling. In particular, transition, turbulence,
separation, chemically reacting flows, radiation, heat transfer, and constitutive models must
reflect the underlying physics more closely than ever before.
Management of errors and uncertainties resulting from all possible sources:
1. Physical modeling errors and uncertainties,
2. Numerical errors arising from mesh and discretization inadequacies, and
3. Uncertainties derived from natural variability, as well as epistemic uncertainties due
to lack of knowledge in the parameters of a particular fluid flow problem.
A much higher degree of automation in all steps of the analysis process is needed
including geometry creation, mesh generation and adaptation, the creation of large databases
of simulation results, the extraction and understanding of the vast amounts of information
generated, and the ability to computationally steer the process. Inherent to all these
improvements is the requirement that every step of the solution chain executes with high levels of reliability/robustness to minimize user intervention.
[30] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD Vision 2030 Study: A Path to Revolutionary Computational Aerosciences”, NASA/CR-2014-218178.
[31] Same as above.
[32] Same as above.
Ability to effectively utilize massively parallel, heterogeneous, and fault-tolerant HPC
architecture. For complex physical models with nonlocal interactions, the challenges of
mapping the underlying algorithms onto computers with multiple memory hierarchies,
latencies, and bandwidths must be overcome.
Flexibility to tackle capability- and capacity-computing tasks in both industrial and research environments, so that both individual large-scale simulations and very large ensembles of reasonably-sized solutions (such as those required to populate full-flight envelopes, operating maps, or for parameter studies and design optimization) can be accommodated.
Seamless integration with the multidisciplinary analyses that will be the norm in 2030, without sacrificing accuracy or numerical stability of the resulting coupled simulation, and without requiring so large an amount of effort that only a handful of coupled simulations are possible. Included in this desired set of capabilities is a vision of the interaction between the
engineer/scientist, the CFD software itself, its framework and all the ancillary software
dependencies (databases, modules, visualization, etc.), and the associated HPC environment.
A single engineer/scientist must be able to conceive, create, analyze, and interpret a large
ensemble of related simulations in a time-critical period (e.g., 24 hours), without individually
managing each simulation, to a pre-specified level of accuracy. There should be less emphasis
on the mechanics of running and collecting the information, and more emphasis on
interpreting and understanding the results of the work. At the moment, CFD is not yet
sufficiently predictive and automated to be used in critical/relevant engineering decisions by
the non-expert user, particularly in situations where separated flows are present [33].
Finally, we define a set of Grand Challenge (GC) problems that are bold and in fact may not
be solvable in the 2030 timeframe, but are used as drivers to identify critical technologies in
need of investment, and to serve as benchmarks for continually measuring progress toward
the long term development goals. These GC problems are chosen to embody the requirements
for CFD in 2030, and cover all important application areas of relevance to NASA’s aeronautics
mission, as well as important aspects of NASA’s space exploration mission [34]. They are:
1. LES of aircraft configuration across the full flight envelope.
2. Off-design turbofan engine transient simulation.
3. MDAO of a highly flexible advanced aircraft configuration.
4.4.1 Technology Roadmap to Achieve the Grand Challenge (GC) Problems
The CFD technology roadmap is a complete and concise view of the key research technologies and
capabilities that must be developed and integrated into production CFD. The individual elements on
the roadmap were identified based on the results of the CFD user survey, detailed technical
discussions held during the Vision 2030 CFD workshop, and from interactions among our team
members. Key technology milestones, proposed technology demonstrations, and critical decision
gates are positioned along timelines, which extend to the year 2030. Separate timelines are identified
for each of the major CFD technology elements that comprise the overall CFD process. The key
milestones indicate important advances in CFD technologies or capabilities that are needed within
each technology element. Technology demonstrations are identified to help verify and validate when
technology advances are accomplished, as well as to validate advances toward the simulations of the
GC problems identified above. Specific details of the development plan for each technology element
are given below.
[33] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD Vision 2030 Study: A Path to Revolutionary Computational Aerosciences”, NASA/CR-2014-218178.
[34] See previous.
4.4.1.1 High Performance Computing (HPC)
As mentioned previously, advances in HPC hardware systems and related computer software are
critically important to the advancement of the state of the art in CFD simulation, particularly for high
Reynolds number turbulent flow simulations. Based on feedback from the user community survey, we
envision HPC technology advancing along two separate paths. Ongoing development of exa-scale
systems, as mentioned earlier, will continue through 2030, and represents the technology that will
most likely provide the large increase in throughput for CFD simulation in the future [35]. However,
novel technologies, such as quantum computing or molecular computing, offer a true paradigm shift
in computing potential and must be carefully considered at strategic points in the overall
development plan, even though the technology is at a very low level today.
In order to properly address the HPC challenge, three specific thrusts must be supported. Firstly,
current simulation software must be ported to evolving and emerging HPC architectures with a view
toward efficiency and software maintainability. Secondly, investments must be made in the
development of new algorithms, discretizations, and solvers that are well suited for the massive levels of parallelism [36, 37]. Finally, increased access to the latest large-scale computer hardware must be
provided and maintained, not only for production runs, but also for algorithmic research and
software development projects, which will be critical for the design and validation of new simulation
tools and techniques [38]. We propose several key milestones that benchmark the advances that we
seek: modification of NASA and related CFD codes to efficiently execute on hierarchical memory
(GPU/co-processor) systems by 2020, initial evaluation of exa-scale performance on a representative
CFD problem, and a demonstration of 30 exa-FLOP performance for one or more of the proposed GC
problems in the 2030 time frame.
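To make the first thrust concrete, the following is a minimal sketch, not taken from any NASA code, of the kind of loop-level porting to hierarchical-memory (GPU/co-processor) systems that the roadmap calls for. It assumes a C compiler with OpenMP 4.5+ target-offload support; the kernel, array names, and sizes are illustrative only, and the same loop falls back to host execution if no accelerator is present.

#include <stdlib.h>

/* Minimal sketch: offloading a cell-wise solution update to an accelerator
 * with OpenMP target directives. The kernel, array names, and sizes are
 * illustrative only, not taken from any specific production CFD code. */
void update_solution(double *q, const double *res, const double *dt,
                     double vol_inv, long ncells)
{
    /* Map the arrays to device memory; on a GPU/co-processor system this is
     * where the hierarchical-memory traffic occurs. */
    #pragma omp target teams distribute parallel for \
            map(tofrom: q[0:ncells]) map(to: res[0:ncells], dt[0:ncells])
    for (long i = 0; i < ncells; ++i)
        q[i] -= dt[i] * vol_inv * res[i];
}

int main(void)
{
    long n = 1 << 20;
    double *q  = malloc(n * sizeof *q);
    double *r  = malloc(n * sizeof *r);
    double *dt = malloc(n * sizeof *dt);
    for (long i = 0; i < n; ++i) { q[i] = 1.0; r[i] = 0.1; dt[i] = 1e-3; }

    update_solution(q, r, dt, 1.0, n);

    free(q); free(r); free(dt);
    return 0;
}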
Concurrently, we stress the importance of closely observing advances in revolutionary HPC technologies, such as superconducting logic, new memory technologies, and other alternatives to current device technology. Because these technologies are in their infancy, we foresee decision gates in 2020, 2025, and 2030 to establish the ability of these systems to solve a relevant model problem. Implicit in this strategy is the need to provide access to experimental hardware on a continual basis and to explore radical new approaches to devising CFD simulation capabilities. If, at any of these decision points, the technology clearly shows its expected potential, we recommend increased investment to accelerate the use of these machines for CFD applications.
4.4.1.2 Physical Modeling
Advances in the physical modeling of turbulence for separated flows, transition, and combustion are
critically needed to achieve the desired state of CFD. For the advancement of turbulent flow
simulation, we propose three separate tracks for research: RANS-based turbulence treatments; hybrid RANS/LES approaches, where the boundary layer is modeled with RANS-based models and the outer flow is resolved with LES; and LES itself, in both wall-modeled and wall-resolved form. Details on each of the three development tracks, and for transition and combustion modeling, are given below. Additionally, a longer term, high-risk effort should investigate radically new approaches to physical modeling.
RANS-based turbulence models continue to be the standard approach used to predict a wide range
35 Kogge, P. (Ed.), "Exa-scale Computing Study: Technology Challenges in Achieving Exa-scale Systems", Contractor report for AFRL Contract No. FA8650-07-C-7724, September 2008.
36 Mavriplis, D., Darmofal, D., Keyes, D., and Turner, M., "Petaflops Opportunities for the NASA Fundamental Aeronautics Program", AIAA Paper 2007-4084, 18th AIAA Computational Fluid Dynamics Conference, June 2007, 10.2514/6.2007-4084.
37 Sarkar, V. (Ed.), "Exa-scale Software Study: Software Challenges in Extreme Scale Systems", DARPA, IPTO, AFRL report under contract FA8650-07-C-7724, September 2009.
38 Biswas, R., Aftosmis, M. J., Kiris, C., and Shen, B. W., "Petascale Computing: Impact on Future NASA Missions", Petascale Computing: Architectures and Algorithms (D. Bader, ed.), Chapman and Hall / CRC Press, 2007.
of flows for very complex configurations across virtually all aerospace product categories. They are
easy to use, computationally efficient, and generally able to capture wall-bounded flows, flows with
shear, flows with streamline curvature and rotation, and flows with mild separation. For these
reasons, as well as the fact that RANS models will remain as an important component in hybrid
RANS/LES methods, their use will continue through 2030. An advanced formulation of the RANS-based approach, where the eddy viscosity formulation is replaced with direct modeling of the Reynolds stresses, known as the Reynolds Stress Transport method, in principle will be able to capture the onset and extent of flow separation for a wider range of flows [39].
Currently, hybrid RANS/LES methods show perhaps the most promise in being able to capture more of the relevant flow physics for complex geometries at an increasingly reasonable computational cost [40]. In the user survey, the majority of participants ranked the continued development of hybrid RANS/LES methods as the top priority in the area of turbulence modeling. However, as mentioned previously, several issues still exist. First, the prediction of any separation in the boundary layer will still require improvements in RANS-based methods. Second, a seamless, automatic RANS-to-LES transition in the boundary layer is needed to enhance the robustness of these methods.
Continued investment in hybrid RANS/LES methods to address these two critical shortcomings will
be required. Additionally, more effective discretizations and solvers designed specifically for LES-type
problems must be sought. When combined with advances in HPC hardware, these three
developments will enable continued reduction in the RANS region as larger resolved LES regions
become more feasible. It is fully anticipated that hybrid RANS/LES methods will become viable in
production mode by the 2030 timeframe for problems typical of the proposed GCs.
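As one concrete example of the RANS-to-LES interface issue, the original detached-eddy simulation (DES) formulation, built on the Spalart-Allmaras model, switches between the two treatments through a single modified length scale:

\tilde{d} = \min\left(d,\; C_{DES}\,\Delta\right), \qquad \Delta = \max(\Delta x,\, \Delta y,\, \Delta z), \qquad C_{DES} \approx 0.65,

so the model behaves as standard RANS near the wall (where d < C_DES Δ) and as an LES subgrid model away from it. The grid-dependent location of this switch is precisely what the seamless, automatic RANS-to-LES transition sought above is meant to improve.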
4.4.1.3 Numerical Algorithms
The development of novel numerical algorithms will be critical to achieving the stated CFD 2030
goals. Indeed, the proposed GCs are sufficiently ambitious that advances in HPC hardware alone
during the next 20 years will not be sufficient to achieve these goals. As demonstrated in Case Study
2, even for LES of relatively simple geometries, leadership class HPC hardware in 2030 will be needed
for 24-hour turnaround if existing algorithms are used. Thus, to tackle the proposed GCs, orders-of-magnitude improvements in simulation capability must be sought from advances in numerical algorithms [41]. The focus of investment must be on discretizations and solvers that scale to massive levels of parallelism, that are well suited for the high-latency, deep memory hierarchies anticipated in future HPC hardware, and that are robust and fault tolerant. A well-balanced research program must provide for incremental advances of current techniques (e.g., extending the scalability of current CFD methods to the exa-scale level whenever possible), while at the same time investing in the fundamental areas of applied mathematics and computer science to develop new approaches with better asymptotic behavior for large-scale problems and better suitability for emerging HPC hardware.
Discretization techniques such as higher-order accurate methods offer the potential for better accuracy and scalability, although robustness and cost considerations remain [42]. Investment must focus on removing these barriers in order to unlock the superior asymptotic properties of these methods, while at the same time pursuing evolutionary improvements in other areas such as
39 Eisfeld, B., "Reynolds Stress Modeling for Complex Aerodynamic Flows", presented at the European Conference on Computational Fluid Dynamics, ECCOMAS CFD 2010, Lisbon, Portugal, June 14-17, 2010.
40 Song, F., Haase, W., Peng, S-H., and Schwamborn, D. (Eds.), Progress in Hybrid RANS-LES Modeling, Springer Press, ISBN 978-3-642-31817-7, Sept. 2011.
41 Mavriplis, D., Darmofal, D., Keyes, D., and Turner, M., "Petaflops Opportunities for the NASA Fundamental Aeronautics Program", AIAA Paper 2007-4084, 18th AIAA Computational Fluid Dynamics Conference, 2007.
42 Kroll, N., Bieler, H., Deconinck, H., Couallier, V., van der Ven, H., and Sorensen, K. (Eds.), "ADIGMA - A European Initiative on the Development of Adaptive High-Order Variational Methods for Aerospace Applications", Notes on Numerical Fluid Mechanics and Multidisciplinary Design, Vol. 11, Springer, 2010.
low-dissipation schemes, flux functions, and limiter formulations. Simultaneously, novel nontraditional
approaches, such as Lattice-Boltzmann methods or other undeveloped schemes, should be
investigated for special applications. Improved linear and nonlinear solvers must be developed, and
here as well, the focus must be on highly scalable methods that are designed to be near optimal for
the large-scale, time-implicit unsteady CFD and MDAO simulations anticipated in the future.
These may include the extension of well-known matrix-based techniques such as Krylov methods [43] and highly parallel multigrid methods [44], or the development of completely novel approaches such as systematic upscaling methods [45]. Furthermore, these methods must be extensible to tightly coupled multidisciplinary problems. Investment in discretizations and solvers must also consider the potential of these methods to operate on dynamically adapting meshes, to enable optimization procedures, and to incorporate advanced uncertainty quantification capabilities.
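As a reference point for what a Krylov solver kernel looks like, here is a minimal, self-contained sketch of an unpreconditioned conjugate gradient iteration in C, applied matrix-free to a 1-D Laplacian. It is illustrative only: production CFD solvers use preconditioned, parallel variants (and often nonsymmetric methods such as GMRES), but the dot products and matrix-vector products shown here are exactly the operations whose communication cost dominates at massive levels of parallelism.

#include <math.h>
#include <stdio.h>

#define N 64

/* Matrix-free product with the 1-D Laplacian (symmetric positive definite),
 * standing in for the sparse operators arising from implicit discretizations. */
static void apply_A(const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void)
{
    double x[N] = {0}, b[N], r[N], p[N], Ap[N];
    for (int i = 0; i < N; ++i) b[i] = 1.0;          /* right-hand side */

    /* Initial residual r = b - A*x with x = 0, so r = b. */
    for (int i = 0; i < N; ++i) { r[i] = b[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int k = 0; k < 500 && sqrt(rr) > 1e-10; ++k) {
        apply_A(p, Ap);
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("final residual norm: %.3e\n", sqrt(rr));
    return 0;
}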
In many cases, adjoint technology [46]-[47] will be required from the outset for all of these capabilities, but the potential of other more advanced technologies, such as second-order gradients (Hessians) [48]-[49], should be investigated as well. Longer term, high-risk research should focus on the development of truly enabling technologies such as monotone or entropy-stable schemes in combination with innovative solvers on large-scale HPC hardware. The technology roadmap envisions the demonstration of improved robust and scalable solvers in the 2015-2017 timeframe, for both second-order and higher-order accurate methods. The demonstration of complete configuration-grid convergence technology in the 2020 time frame relies on the use of robust higher-order discretizations combined with improved scalable solvers and adaptive h-p refinement. Toward the 2030 time frame, it is anticipated that novel entropy-stable formulations will begin to bear fruit for industrial simulations.
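For reference, the discrete adjoint approach referred to above can be summarized as follows. For a steady solution defined implicitly by the residual equations R(u, α) = 0 and an output J(u, α), a single adjoint solve gives the sensitivity of J with respect to all design variables α:

\left(\frac{\partial R}{\partial u}\right)^{T}\lambda = -\left(\frac{\partial J}{\partial u}\right)^{T}, \qquad \frac{dJ}{d\alpha} = \frac{\partial J}{\partial \alpha} + \lambda^{T}\frac{\partial R}{\partial \alpha}.

The cost is essentially independent of the number of design variables, which is why adjoints are the enabling technology for the gradient-based optimization, error estimation, and UQ capabilities mentioned throughout this plan.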
4.4.1.4 Uncertainty Quantification (UQ)
With regard to uncertainty quantification, a new thrust in the area of probabilistic large-scale CFD
for aerospace applications should be initiated. An initial thrust in this area should focus on enabling
current aerospace CFD tools with well-known uncertainty quantification techniques, such as
sensitivity analysis and propagation methods using adjoints and forward linearization, nonintrusive
polynomial chaos methods, and other reduced-order model formulations [50]-[51]. Additionally, a
concerted effort should be made to characterize important aerospace uncertainties and to make
these available to the general research community for enabling relevant UQ research in these areas.
Improved error estimation techniques must be investigated and developed, given the known
deficiencies of current approaches (including adjoint methods). This will require a foundational
program in the mathematics of error estimation and its application to CFD software. Finally, longer
43 Saad, Y., Iterative Methods for Sparse Linear Systems, Second Edition, SIAM, 2003.
44 Baker, A. H., Falgout, R. D., Kolev, Tz. V., and Yang, U. M., "Scaling Hypre's Multigrid Solvers to 100,000 Cores", High Performance Scientific Computing: Algorithms and Applications, M. Berry et al., eds., Springer, 2012.
45 Brandt, A., "Multiscale Solvers and Systematic Upscaling in Computational Physics", Computer Physics Communications, Vol. 169, Issues 1-3, pp. 438-441, July 2005.
46 Jameson, A., "Aerodynamic Design via Control Theory", ICASE Report No. 88-64, November 1988; also, J. of Scientific Computing, Vol. 3, pp. 233-260, 1988.
47 Errico, R. M., "What is an adjoint model?", Bulletin of the American Meteorological Society, pp. 2577-2591, 1997.
48 Taylor, A. C., Putko, M. M., Green, L. L., and Newman, P. A., "Some Advanced Concepts in Discrete Aerodynamic Sensitivity Analysis", AIAA Journal, Vol. 41, pp. 1224-1229, 2003, 10.2514/2.2085.
49 Rumpfkeil, M. P., and Mavriplis, D. J., "Efficient Hessian Calculations Using Automatic Differentiation and the Adjoint Method with Applications", AIAA Journal, Vol. 48, pp. 2406-2417, 10.2514/1.J050451.
50 Shankaran, S., and Jameson, A., "Robust Optimal Control using Polynomial Chaos and Adjoints for Systems with Uncertain Inputs", AIAA Paper 2011-3069, 20th AIAA Computational Fluid Dynamics Conference, 2011.
51 Ng, L. W-T., Huynh, D. B. P., and Willcox, K., "Multifidelity Uncertainty Propagation for Optimization Under Uncertainty", 12th AIAA Aviation Technology, Integration, and Operations (ATIO) Conference and 14th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2012, 10.2514/6.2012-5602.
term research must focus on statistical approaches, such as Bayesian techniques, for more accurately quantifying modeling and other nonlinear error sources [52]. The technology roadmap includes an early
target date of 2015 for the characterization of typical aerospace uncertainties in order to stimulate
work in this area. Improved error estimation techniques will be gradually brought into the simulation
capabilities and the state of these estimates will be assessed in the 2018 time frame. Comprehensive
uncertainty propagation techniques including discretization error, input and parameter
uncertainties in production-level CFD codes should be targeted for 2025, while the development of
more sophisticated stochastic and Bayesian approaches will continue through the 2030 timeframe.
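As a reminder of the machinery involved, a nonintrusive polynomial chaos expansion represents an output u as a series in orthogonal polynomials Ψ_i of the random inputs ξ:

u(\xi) \approx \sum_{i=0}^{P} u_i\,\Psi_i(\xi), \qquad \mathbb{E}[u] \approx u_0, \qquad \mathrm{Var}[u] \approx \sum_{i=1}^{P} u_i^{2}\,\langle \Psi_i^{2}\rangle,

with the coefficients u_i = ⟨u Ψ_i⟩ / ⟨Ψ_i²⟩ evaluated by quadrature or regression over a modest number of deterministic solver runs; this nonintrusiveness is what makes the approach attractive as a first step for existing production CFD codes.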
4.4.1.5 Geometry and Grid Generation
Substantial new investment in geometry and grid generation technology will be required in order to
meet the Vision CFD 2030 goals. In general, this area has seen very little NASA investment during the
last decade, although it remains one of the most important bottlenecks for large-scale complex
simulations. Focused research programs in streamlined CAD access and interfacing, large-scale mesh
generation, and automated optimal adaptive meshing techniques are required. These programs must
concentrate on the particular aspects required to make mesh generation and adaptation less
burdensome and, ultimately, invisible to the CFD process, while developing technologies that enable the capabilities that will be required by Vision 2030 CFD applications, namely very large-scale parallel mesh generation, curved mesh elements for higher-order methods [53]-[54], highly scalable dynamic overset mesh technology [55], and anisotropic adaptive methods for time-dependent
problems. It is important to realize that advances in these areas will require a mix of investments in
incremental software development, combined with advances in fundamental areas such as computational geometry, possibly with smaller components devoted to high-risk disruptive ideas such as anisotropic cut-cell meshes [56], strand mesh ideas [57], and even meshless methods [58].
Additionally, because significant technology currently resides with commercial software vendors,
particularly for CAD interfaces and access, involving these stakeholders in the appropriate focused
research programs will be critical for long-term success. Innovative approaches for achieving such partnerships must be sought out, such as the formation of consortiums for the definition and adoption of standards, or arrangements that address other potential issues such as large-scale parallel licensing of commercial software. The technology development roadmap envisions the demonstration of tight CAD coupling
and production adaptive mesh refinement (AMR) in the 2015-2017 time frame, followed by
maturation of large-scale parallel mesh generation in the 2020-2025 time frame, and leading
ultimately to fully automated in-situ mesh generation and adaptive control for large-scale time-
dependent problems by 2030.
52 Press, S. J., Subjective and Objective Bayesian Statistics: Principles, Methods and Applications, 2nd edition, Wiley, New York, 2003.
53 Wang, L., Anderson, W. K., Erwin, J., and Kapadia, S., "High-order Methods for Solutions of Three-dimensional Turbulent Flows", AIAA Paper 2013-856, 51st AIAA Aerospace Sciences Meeting, Jan 2013.
54 Persson, P-O., Willis, D., and Peraire, J., "The Numerical Simulation of Flapping Wings at Low Reynolds Numbers", AIAA Paper 2010-724, 48th AIAA Aerospace Sciences Meeting, Jan 2010.
55 Pulliam, T. H. and Jespersen, D. C., "Large Scale Aerodynamic Calculation on Pleiades", Proceedings of the 21st International Conference on Parallel Computational Fluid Dynamics, Moffett Field, California, May 18-22, 2009.
56 Modisette, J., and Darmofal, D., "Toward a Robust, Higher-Order Cut-Cell Method for Viscous Flows", AIAA Paper 2010-721, 48th AIAA Aerospace Sciences Meeting, Jan 2010.
57 Katz, A., Wissink, A., Sitaraman, J., and Sankaran, V., "Application of Strand Meshes to Complex Aerodynamic Flow Fields", AIAA Paper 2010-4934, 28th AIAA Applied Aerodynamics Conference, June 2010.
58 Katz, A., and Jameson, A., "Meshless Scheme Based on Alignment Constraints", AIAA Journal, Vol. 48, pp. 2501-2511, 2010.
4.4.1.6 Knowledge Extraction
Peta-scale and exa-scale simulations will generate vast amounts of data, and various government agencies such as the NSF and DOE have instituted major programs in data-driven simulation research. In order to make effective use of large-scale CFD and MDAO simulations in aerospace engineering, a thrust in data knowledge extraction should be initiated. Ideally, this should contain three components: visualization, database management, and variable-fidelity data integration. Methods to process and visualize very large-scale unsteady CFD simulations in real time, including results from higher-order discretizations, are required to support the advanced CFD capabilities envisioned in 2030. Although many of the current efforts in maturing visualization technology are being led by commercial vendors who continue to supply enhanced capabilities in this area, more fundamental research to directly embed visualization capabilities into production CFD tools optimized for emerging HPC platforms is needed to achieve real-time processing [59].
Moreover, the CFD capability in 2030 must provide the analyst with a more intuitive and natural interface into the flow solution to better understand complex flow physics. Foreseeing the capability of generating large databases with increasing computational power, techniques for rapidly integrating these databases and querying them in real time will be required. Finally, integrating high-fidelity simulation data with lower fidelity model data, as well as experimental data from wind tunnel tests, engine test rigs, or flight tests, will provide a powerful approach for reducing overall risk in aerospace system design [60]. Techniques for building large-scale flexible databases are in their infancy, and range from simple software infrastructures that manage large numbers of simulation jobs to more sophisticated reduced-order models [61], surrogate models, and Kriging methods [62]. The objective
of a research thrust in this area should be to apply existing techniques to current CFD simulation
capabilities at a large scale, while simultaneously performing foundational research in the
development of better reduced-order models and variable fidelity models that are applicable to
aerospace problems and can support embedded uncertainty quantification strategies.
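To indicate what a reduced-order model of this kind typically involves, a common construction (proper orthogonal decomposition, used for example in the model-reduction work cited above) proceeds from a matrix of mean-subtracted solution snapshots S = [u(t_1) - ū, ..., u(t_K) - ū]: its singular value decomposition S = Φ Σ V^T supplies an orthonormal basis, and the state is approximated as

u(t) \approx \bar{u} + \sum_{i=1}^{m} a_i(t)\,\phi_i, \qquad m \ll n,

after which Galerkin projection of the governing equations onto the m retained modes yields a small system of ODEs for the coefficients a_i(t) that can be evaluated orders of magnitude faster than the full CFD model.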
The technology roadmap envisions the demonstration of real-time analysis and visualization of a notional 10^10 point unsteady CFD simulation in 2020, and a 10^11 point simulation in 2025. These
technology demonstrations would be an integral part of the GC problems designed to benchmark
advances in other CFD areas. The development of reduced-order models and other variable fidelity
models will entail long term research and will likely remain an active research topic past the 2030
time frame. However, the technology roadmap envisions the periodic assessment of the state-of-the-
art in these areas at 5 to 10 year intervals, with investment directed toward demonstrating promising
approaches on large-scale aerospace applications.
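The raw data volumes behind these visualization targets are easy to estimate; the short C program below does so, assuming (purely for illustration) that five flow variables are stored per grid point in double precision.

#include <stdio.h>

int main(void)
{
    /* Illustrative estimate only: raw size of one snapshot of an unsteady CFD
     * solution, assuming 5 variables per point stored in double precision.
     * Point counts follow the roadmap targets (10^10 in 2020, 10^11 in 2025). */
    const double points[]   = {1e10, 1e11};
    const double vars       = 5.0;
    const double bytes_each = 8.0;

    for (int i = 0; i < 2; ++i) {
        double bytes = points[i] * vars * bytes_each;
        printf("%.0e points -> %.1f TB per snapshot\n", points[i], bytes / 1e12);
    }
    return 0;
}

Even a single snapshot at the 2025 target amounts to several terabytes, which is one reason embedded, in-situ processing rather than post hoc file-based visualization is emphasized above.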
4.4.1.7 Multidisciplinary Design and Optimization
The ability to perform CFD-based multidisciplinary analysis (MDA) and analysis/optimization
(MDAO) relies on the availability of future capabilities that need to be developed between now and
2030. Pervasive and seamless MDAs (that can be routinely exercised in industrial practice for
configuration studies, e.g., full aero-thermo-elastic/aero-acoustic simulations of entire
airframe/propulsion systems including shielding) will require the development of accepted
standards and APIs for disciplinary information and the required multidisciplinary couplings (such
59
Wang, Y., Yu, H., and Ma, K-L, “Scalable Parallel Feature Extraction and Tracking for Large Time-Varying 3D
Volume Data”, Proceedings of EGPGV 2013, May 2013, pp. 55-62.
60
The 1st Workshop on Integration of Experimental Fluid Dynamics (EFD) and Computational Fluid Dynamics
(CFD), JAXA Special Publication SP-09-002, January 2010.
61
Washabaugh, K., Amsallem, D., Zahr, M., and Farhat, C., “Nonlinear Model Reduction for CFD Problems Using
Local Reduced-Order Bases”, AIAA Paper 2012-2686, 42nd AIAA Fluid Dynamics Conference, June 2012.
62
Han, Z-H., and Görtz, S., “Hierarchical Kriging Model for Variable-Fidelity Surrogate Modeling”, AIAA Journal,
Vol.50, pp.1885-1896, 2012, 10.2514/1.J051354.
as with acoustics, combustion, structures, heat transfer, radiation). A concerted effort is envisioned
that results in a set of standards available to the community around 2016. In parallel with this effort,
it will also be necessary to develop high-fidelity coupling techniques that guarantee the accuracy and stability of high-fidelity, tightly coupled MDAs [63], while ensuring that the appropriate conservation principles are satisfied with errors below acceptable thresholds.
This capability, together with the coupling software that includes such information transfers, must be available around 2018. Together, the standards and the coupling techniques/software would enable demonstrations of two-way coupled MDAs with the best and most robust existing CFD solvers of the time, with guaranteed coupling fidelity, by the year 2020. Such demonstrations can focus on multiple
aerospace problems of interest, including aircraft aero-structural/aero-elastic analyses, aircraft
aero-acoustics, rotorcraft aero-structural and aero-acoustic couplings, unsteady combustion, re-
entry aerothermodynamics and material response, and the like. Initially, such routine MDAs would
focus on portions of an entire vehicle (around 2020) and would transition to the treatment of the
entire system around 2025.
A number of capabilities also must be developed in order to enable MDAO with and without the
presence of uncertainties (robust and reliability-based design). A major research component that is likely to span a significant period (2015-2025) is the work needed to endow industrial-strength CFD solvers with both gradient calculation and uncertainty quantification capabilities for use in multidisciplinary optimization. Some of this work has been described in the "Numerical Algorithms" section. For the gradient/sensitivity analysis capability, we envision that the CFD solver will be able
to compute this information for full unsteady flows for the turbulence models available at the time.
Finally, all these new capabilities must come together on a series of MDAO grand-challenge
demonstrations in the 2030 timeframe.
4.4.2 Recommendations
In order to effectively execute the CFD development plan described above and achieve the goals laid
out in the vision of CFD in 2030, a comprehensive research strategy and set of recommendations are
presented. This research strategy calls for the renewed preeminence of NASA in the area of computational sciences and aerodynamics, and calls for NASA to play a leading role in the pursuit of revolutionary simulation-based engineering. Aerospace engineering has had a long history of developing technology that impacts product development well beyond the boundaries of aerospace systems. As such, NASA, simply by fulfilling its mission, is a critical force in driving technology throughout aerospace engineering. Computational methods are a key example of this broad impact, as NASA has historically been a leader in the development of structural finite-element methods, computational fluid dynamics, and applications of HPC to engineering simulations. NASA's effort must be targeted toward research and technology development that can make revolutionary impacts on simulation-based engineering in the aerospace sciences. In particular, the current state of CFD is such that small, incremental improvements in existing capability have not had revolutionary effects.
In an environment of constrained resources, this will require that NASA evaluate its activities with a
critical eye toward supporting those efforts whose impact could be revolutionary. To ensure that the
technology plan and roadmap are as effective as possible, we propose specific recommendations (see
Figure 4.4.1). Naturally, individual research thrusts affect multiple technical areas, which in turn
affect the ability to meet various milestones and progress toward the GC problems.
63 "Multiphysics Simulations: Challenges and Opportunities", Argonne National Lab Report ANL/MCS-TM-321, report from workshop sponsored by the Institute for Computing in Science (ICiS), Park City, Utah, June-August 2011.
4.5 HPC Envisioned by Department of Energy (DOE)
The aim here is to consider whether or not to pursue the main issues raised by going to the exa-scale, and to provide some guidance on the level of risk involved in pursuing [64], and not pursuing, this direction of high performance computing. Going to the exa-scale will mean a radical change in computing architecture: vastly increasing the levels of parallelism to the point of millions of processors working in concert, which will force radical changes in how hardware is designed. It will also dictate how we go about solving problems (e.g., the application codes), and how we marry application codes to the underlying hardware (e.g., the compilers, I/O, middleware, and related software tools). Understanding the advantages to be gained by going to the exa-scale, and evaluating the risks involved in going down this path, requires both an evaluation of past experiences in moving from the megaflop era to the present petaflop era, as well as an assessment of the readiness of advanced applications to take transformative advantage of exa-scale computing. The challenges inherent in developing exa-scale computing as a practical endeavor are considerable, and significant investments will be needed to accomplish this.
64 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Figure 4.4.1 Proposed New Computational Sciences Program Structure, organized by technical area:
• Geometry and Grid Generation: CAD access and interfaces; large-scale parallel mesh generation; adaptive mesh refinement; curved mesh elements for higher order
• MDAO: interfaces and standards; accurate and stable coupling techniques; UQ support and sensitivities (system level)
• Knowledge Management: visualization; database management; variable-fidelity models
• HPC: increasing access to leading-edge HPC hardware; porting of current and future codes to leading-edge HPC; radical emerging HPC technologies
• Physical Modeling: RANS turbulence modeling; hybrid RANS-LES modeling (improved RANS component, seamless interface); LES (wall-modeled and wall-resolved); transition; combustion; radically new modeling approaches
• Numerical Algorithms: advances in current algorithms for HPC; novel discretizations (higher-order methods, low dissipation/dispersion schemes, foundational novel approaches); solvers (linear and non-linear scalable solvers, enhancements for MDAO and UQ)
• UQ: define aerospace uncertainties; leverage known techniques; improved error estimation techniques; statistical approaches
4.5.1 What is Exa-scale Computing?
Exa-scale computing refers to computing systems capable of at least one exa-FLOPS, that is, a billion billion (10^18) calculations per second [65]. Such capacity represents a thousand-fold increase over the first peta-scale computer, which came into operation in 2008 [66]. (One exaflops is a thousand petaflops, or a quintillion, 10^18, floating point operations per second.) At a supercomputing conference in 2009, Computerworld projected exa-scale implementation by 2018. Exa-scale computing would be considered a significant achievement in computer engineering, for it is believed to be on the order of the processing power of the human brain at the neural level (the functional level might be lower). It is, for instance, the target power of the Human Brain Project [67].
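The practical meaning of the definition is easiest to see from a time-to-solution estimate. The short C program below, using a purely hypothetical operation count of 10^21 floating point operations for a large LES-class job, compares sustained peta-scale and exa-scale execution.

#include <stdio.h>

int main(void)
{
    /* Illustrative only: wall-clock time for a job needing 1e21 floating point
     * operations (a hypothetical large LES), assuming the machine actually
     * sustains its quoted rate. */
    const double work = 1e21;                  /* floating point operations */
    const double petaflops = 1e15, exaflops = 1e18;

    printf("at 1 PF/s sustained: %.1f days\n",  work / petaflops / 86400.0);
    printf("at 1 EF/s sustained: %.1f hours\n", work / exaflops  / 3600.0);
    return 0;
}

Under these assumptions the same job drops from roughly eleven days to well under an hour, which is the kind of turnaround change that moves high-fidelity simulation into the design cycle.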
4.5.2 Why Exa-scale?
The most obvious question, the key question really is of course: why go to the exa-scale? This
question is not meant in the trivial sense that one would pose for any expenditure whatsoever in
leading edge computing technologies, but rather is motivated by the fact that the transition from
current peta-scale computing to the exa-scale will involve investments across the board from
hardware to fundamental algorithms, programming models, compilers, and application codes that
will dwarf previous levels of investment made as computer architectures have evolved in the past.
That is, we recognize that the values to society extracted from this change in computing paradigm have to be commensurate with the costs of developing this type of computing and, given the substantial
costs, we need to be sure that the extracted values are similarly substantial. We will make the
argument in the following that the extracted values are in fact very large but will do so in two stages,
first by making some general points about the present frontiers of computing independent of
discipline and then by focusing on a few example disciplines to illustrate the more general point.
4.5.3 Range of Applications that may be Transformed by Going to the Exa-scale
As discussed earlier, a key question to be addressed in considering going to the exa-scale is the
readiness of key applications to take this step, as well as the likelihood that taking this approach will
lead to transformative changes in these application areas. This question is addressed in the present
section, focusing once again on a selection of disciplines to illustrate the breadth of applications that
are ready for this transition.
4.5.3.1 Aerospace, Airframes and Jet Turbines
Computing at an extreme scale will have transformational effects on several key applications in the aerospace industry [68]. The move from RANS to LES as the industry standard and its use in the design
cycle represents a paradigm shift for the aerospace industry. In addition, there are several
outstanding scientific problems in these sectors that can be understood and hopefully controlled
using extreme scale computing. The accuracy achieved with the RANS approach for prediction of
quantities of engineering interest in the airframe industry has reached a plateau owing to the
epistemic uncertainties inherent in such turbulence models. As a result, the design of aircraft and
propulsions systems relies on an iterative process where several expensive prototypes are
constructed and tested in wind tunnels. Hybrid RANS/LES approaches with grounding in first principles can overcome the limitations of RANS and enhance the predictive capability of CFD beyond the present seemingly stagnant state of speculative trial-and-error in design [69]. In addition, building
65 From Wikipedia, the free encyclopedia.
66 National Research Council, "The potential impact of high-end capability computing on four illustrative fields of science and engineering", The National Academies, p. 11, ISBN 978-0-309-12485-0.
67 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
68 See above.
69 Wall-Modeled LES (WM-LES) and hybrid RANS-LES methods provide a clear path to first-principles design of next-generation aircraft as exa-scale computing arrives. Transitioning this technology to future exa-scale platforms will have a transformative impact upon simulation-based engineering design, making possible the design of aerodynamically optimized vehicles including integrated effects of propulsion, structures, and active controls, a "Grand Challenge" of aerodynamic design.
a complete flight-envelope characterization (accounting for irreducible uncertainties, e.g., angle-of-attack, flight conditions, and geometry) will only be possible with computing at the exa-scale and beyond. Such a design framework for aerodynamically optimized vehicles and propulsion systems is a critical resource for the design and construction of next-generation aircraft and propulsion systems. Figure 4.5.2 provides estimates of the computing requirements to meet these design goals for several Grand Challenges in aerospace systems, giving the computer speed and memory requirements for the analysis and design of airfoils, wings, and complete aircraft at three different stages of approximation.
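The reason flight-envelope LES is tied to the exa-scale can be seen from published grid-point estimates for turbulent boundary layers (e.g., the scaling arguments of Choi and Moin); treating these as rough guides rather than exact requirements, the number of grid points N needed grows with the flow Reynolds number Re_L approximately as

N_{\text{wall-resolved LES}} \sim Re_L^{13/7}, \qquad N_{\text{wall-modeled LES}} \sim Re_L, \qquad N_{\text{DNS}} \sim Re_L^{37/14},

so that at full-scale flight Reynolds numbers, even wall-modeled LES of a complete configuration across many flight conditions demands the throughput increases discussed in this section.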
One of the major problems confronting the aircraft industry is the aerodynamic noise generated by
engine exhaust jets and airframes, particularly during take-off and landing approaches. The noise
problem has been a major issue for high-speed commercial aircraft and more recently for military
aircraft, both for impact on communities surrounding airports and military bases, and on the crew
stationed on aircraft carrier decks. It is known that turbulence is a major contributor to aircraft noise.
Unfortunately, modern optical diagnostic techniques are far from adequate in measuring the spatial-
temporal data needed to reveal the mechanics of aerodynamic noise; only high-fidelity simulation
techniques, such as LES, are capable of predicting both the far-field noise as well as details of the
noise generating turbulent eddies. Exa-scale computing would have transformational impact on the
discovery of the mechanics of noise generation, and would be instrumental in designing noise
mitigation strategies. Figure 4.5.3 shows the turbulent flow from a supersonic exhaust jet (M = 1.7)
obtained from a breakthrough state-of-the-art LES computation in 2010. This first-of-a-kind calculation lacks a high-fidelity representation of the flow inside the nozzle, and the agreement with the measured noise data is only fair, presumably due to this inadequate grid resolution. As exa-scale computing tools become available, high-fidelity tools would not only be used to understand and
Figure 4.5.2 Computer speed and memory requirements for the Grand Challenge
predict flow-generated noise, they will be used to learn how to control it. Such demonstration
calculations have been extremely computer intensive, and limited to very simple flows. Exa-scale
computing would be the enabling technology for complex flow control and shape optimization (e.g.,
of aircraft wings and nozzle exits), potentially leading to a major transformational effect on the
aerospace industry.
Another outstanding technical problem in the gas-turbine industry is the migration of hot fluid parcels from the combustor to the turbine. The hot-streak migration is a limiting factor in the design of turbines, as turbine blades, designed based on mean flow temperatures, are damaged severely when encountering the migrating hot spots. High-fidelity simulation of the flow inside the combustor of a jet engine is a daunting task due to the multi-physics phenomena present. Even in modern LES computations of combustors using peta-scale class computers, reduced-order models are used for critical phenomena such as primary atomization of the injected liquid fuel into micron-size droplets, the evaporation process of the droplets, and the chemical mechanisms involved. Exa-scale computing would be the enabling technology for simulation of jet engine combustors based on first principles, which in turn promises to facilitate the discovery of mitigating strategies for the suppression of hot-streak migration [70].
4.5.3.2 Combustion
Reliable prediction requires, for example, the incorporation of heterogeneous kinetics with quantified uncertainties in turbulent combustion simulations for processes such as soot formation/burnout, and increased-fidelity coupling of high-pressure, low-temperature chemistry with turbulent transport; these vital enhanced modeling techniques will only be feasible at exa-scale computing performance levels. In particular, combustion scientists must focus on the science underlying the development of non-petroleum-based fuels, including carbon-neutral biofuels, and their optimal use in transportation. This science intrinsically involves chemistry interacting with transport at
70 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Figure 4.5.3 A supersonic Jet Engine Nozzle Rapidly Accelerates High-Pressure Gas into the Atmosphere
conditions far from equilibrium and at extreme pressures, and requires a coordinated multi-scale approach for understanding and predicting combustion in turbulent environments [71].
.
Combustion in practical devices covers a myriad of time and length scales, from the scale of the
electron to those of the largest scales of turbulence dependent upon the geometry of the device. To
tackle this daunting challenge and complexity, a multi-scale approach is adopted wherein
experiments, theory and direct computation are brought to bear on a limited range of scales (4-5
decades) and fundamental physical insights gained are encapsulated in reduced-order
parameterizations that are used to upscale knowledge to bridge the scales. Several high-fidelity
computational approaches in both the atomistic and continuum regimes utilize peta-scale computing.
Exa-scale computing would greatly facilitate higher fidelity or access to more practically relevant
parameter regimes (e.g., higher pressure, higher turbulence levels, and more complex fuels). In the
continuum regime, where turbulence scales interact with flame, ignition, and mixing scales, turbulence-chemistry interactions are important. Virtually all combustion devices operate in turbulent environments because of the enhanced mixing and greater efficiency. Many of the fundamental turbulence-chemistry interactions are amenable to investigation by first-principles direct numerical simulation (DNS) and high-fidelity large-eddy simulation (LES) of building-block, laboratory-scale flows. Whereas DNS focuses on fully resolving the fine-grained physics, LES resolves the energy-containing end of the turbulence spectrum down to a specified cut-off in the inertial or dissipative end of the spectrum, and the unresolved sub-grid scales are modeled. As such, these methods are
complementary. Both DNS and LES require the horsepower of high-performance supercomputing at
the exa-scale and beyond to resolve all relevant flow and chemical scales. Exa-scale simulations are
required, for example, to understand the coupling between low-temperature ignition kinetics and
turbulent mixing at high pressure that determines lifted flame stabilization, ignition timing, rate of
combustion, and emissions characteristics. Understanding complex low-temperature high pressure
kinetics of alternative fuels and its coupling with turbulent transport at high pressure requires much
greater resolution and the transport of large numbers of reactive scalars only afforded by extreme
scale computing power. Moreover, in-situ reduction strategies for accurate and computationally
affordable inclusion of heterogeneous kinetics with quantified uncertainties in DNS and LES are
required. The insights gained from exa-scale simulations will enable the development of predictive
multi-scale models to optimally design future evolving fuels and engines.
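A standard estimate explains why DNS of even these building-block flows is itself an extreme-scale computation: resolving from the integral scale L down to the Kolmogorov scale η, with L/η ~ Re^{3/4}, requires roughly

N_{\text{points}} \sim \left(\frac{L}{\eta}\right)^{3} \sim Re^{9/4}, \qquad N_{\text{time steps}} \sim Re^{3/4}, \qquad \text{total work} \sim Re^{3},

before any cost for the transport of large numbers of reactive scalars and detailed chemistry is added, which is why higher pressures, higher turbulence levels, and more complex fuels push these simulations beyond the peta-scale.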
Future predictive simulation tools running on exa-scale computing systems will enable deep
understanding of underlying chemical and combustion science processes, enhance combustion
engine design and performance, and ultimately yield a dramatic reduction in engine development
timescales, time to market, and development costs, while ensuring the timely achievement of energy
security and emissions goals, and enhancing the competitiveness of U.S. engine manufacturers and
fuel producers.
4.5.3.3 Climate Modeling
Although substantial uncertainty exists as to the degree and impacts of future climate change,
especially at local and regional scales, it is generally agreed that significant adaptation will be
required. Furthermore, the magnitude of climate change later in the century depends upon the near
and intermediate-term mitigation strategies used to reduce the emission of greenhouse gases. These
strategies also must satisfy an increasing energy demand of a growing global population experiencing
an improvement in its standard of living. Predicting these future climate changes and evaluating the
effects of mitigation strategies require Earth system models (ESMs) that are far more accurate and
comprehensive than those in use today. Integrated assessment models provide the framework for
climate predictions by defining the emissions scenarios and elucidating the relationships among the
natural and human systems that are at the core of climate change studies. In the next decade,
71 See previous.
integrated assessment and comprehensive ESMs will probably be combined into a single system that
could be used to investigate scientific issues and to formulate policy options for adaptation and
mitigation.
The predictions from integrated ESMs will be most credible if the important processes in the climate
system, for example mixing by ocean eddies, are simulated at their native spatial and temporal scales.
Critical organized features in the atmosphere and ocean including clouds and eddies have
characteristic sizes of 1 to 10 km. Some of the major sources of uncertainty in climate predictions
from existing models are associated with the aggregate effects of these phenomena. Experience with
current climate models suggests that simulation of climate change with a model with 10-km grid
resolution is inherently a peta-scale problem. In fact, even higher resolution is required to resolve
these features with sufficient fidelity to the physical principles underlying their formation and
evolution. Since the computational cost increases nonlinearly with higher resolution, it is likely that
predictions of societal and environmental change at 1-km resolution would require truly extreme
scale computers.
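The jump from a peta-scale to an extreme-scale problem follows from a simple (and admittedly rough) scaling argument that ignores vertical resolution and changes in the physics packages: with a CFL-limited time step, refining the horizontal grid spacing Δx increases the cost as

\text{cost} \propto \left(\frac{\Delta x_{\text{old}}}{\Delta x_{\text{new}}}\right)^{3},

so going from 10-km to 1-km resolution multiplies the work by roughly 10^3, turning a peta-scale simulation into an exa-scale one.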
4.5.3.4 Computational Biology
The ultimate goal of exa-scale computing applications to challenges in modern biology is to go from
atoms to organs or from microbes to ecosystems: for example, to enable an understanding of how
the brain works as an energy efficient, biologically-based information system, or to understand
microbial processes and their impact on the geosphere. In the process, these newly enlarged scales
of computing will resolve unfathomably complex research issues in a host of fields as diverse as
neuroscience and microbial metagenomics. At exa-scale, new scalable tools that admit a variety of
time, space and trajectory sampling methods (and fully exploit the hundreds of millions of cores of
an exa-scale machine) will enable long time integrations, implicit solvation conditions, and mixed
molecular mechanics and quantum mechanics models, to allow breakthrough science. For example,
a large biochemical network within a full-scale model of a eukaryotic cell could be modeled in the
span of a few hours.
It is important to note that the first million-atom simulation in biology was conducted just five years ago: an all-atom simulation of the ribosome performed at Los Alamos National Laboratory. This million-particle simulation milestone had already been achieved a decade earlier in materials science and cosmology (computational scientists in both of these fields now perform multibillion-particle simulations). While biology researchers have achieved impressive methodological advances that permit the modeling of the largest assemblies in the cell, it is only for short periods of time. Moreover, these simulations are unlikely to scale to the size of a single cell, even a small bacterium, for relevant times such as minutes or hours, even if researchers can employ computers capable of achieving 1,000 petaflops/s. Today, researchers are limited to the microsecond timescale for protein folding, a restriction imposed by the huge number of intermolecular interaction computations. Scientists also lack
folding required by the huge number of intermolecular, interaction computations. Scientists also lack
rigorous coarse grained models that permit the scaling up of macromolecular pathways and
supramolecular cellular processes. Similarly, systems biology methods lack the dynamic resolution
needed for coupling genomic and other data in order to fully map cellular networks, to predict their
functional states, and to control the time varying responses of living cells. Nor can current kinetics
models adequately analyze the dynamics of complex living systems. Exa-scale computing will be
needed to achieve those capabilities. Within the next decade, scientists expect to have the complete
genome sequence of more than 10,000 bacteria and archaea and other single-celled microbes. Exa-
scale computing platforms will make it possible in principle to systematically reconstruct the
metabolic networks of all sequenced microbes through automated comparative analysis, to
reconstruct their regulatory networks by integrating a variety of data sources, and to combine these
reconstructions into functional models of cellular states. Exa-scale computing will be critical to make
this a routine class of computation such that it can become part of the standard way we analyze
genomes in the future.
4.5.3.5 Materials Science
Materials innovations are central to many of the technological advances responsible for our quality of life and prosperity. In fact, many of the disruptive technological advances since the turn of the last century (modern transportation, medical treatments and prosthetics, space exploration, global communication, computers and the electronics industry) used advances arising from every corner of the materials world: metals, ceramics, semiconductors, polymers, and novel combinations of these.
Materials establish and support entire industries, and tens of millions of manufacturing jobs depend
on the availability of these advanced materials at affordable costs. A quantifiable understanding of
novel materials and their response is central as well to the technological challenges facing our
country. Whether it is ceramics for high-efficiency automobiles, photovoltaics for next-generation
solar power or smart alloys for efficient building construction, the nation requires the development
of advanced materials with superior properties that will drive the next generation of technologies. In
the highly competitive global marketplace in which we find ourselves, minimizing time to solution and
time to market is crucial.
It is instructive to consider two workhorse techniques for materials modeling, hydrodynamics and molecular dynamics, and examine the reasons why a simulation might fail to provide sufficiently useful information. Molecular dynamics simulations are characterized by a force field or potential, involving many adjustable parameters, which describes the interactions between atoms. There are no parameters required to describe the response of the material; rather, all of the constitutive response emerges naturally from the interaction potentials. Such calculations are currently limited in size to fractions of a cubic micron simulated for tens of nanoseconds, even on the largest computers. Hydrodynamics, by comparison, involves many adjustable parameters describing both the interactions and the material response. However, there is no real size or time limit in the simulation. There is a practical lower limit on resolution, as it makes no sense to model an atomically sized region of space using continuum equations. At a given
level of computing, computational scientists using
either method encounter two common barriers to
success: (a) the largest (or most finely resolved)
simulation possible is still too small (or too poorly
resolved) to capture the relevant behavior of
interest, or (b) the most complex, compute-
intensive simulation that can be solved in a
reasonable time is still too simple or approximate
to adequately describe the physics of interest. In
many cases both (a) and (b) are true, which is particularly damning, since it prevents the
investigator from performing the traditional
trade-off between these two constraints: very
often, one makes simplifying approximations to enable a larger simulation or investigates smaller
systems in order to perform a more complicated calculation.
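The cost structure described above is visible even in a toy molecular dynamics code. The sketch below, illustrative only, integrates a short 1-D chain of atoms coupled by harmonic springs with the velocity-Verlet scheme; real force fields and neighbor searches are vastly more expensive, but the cost still scales with (atoms) x (interactions per atom) x (time steps), which is what bounds the system sizes and durations that can be simulated.

#include <stdio.h>

#define NATOMS 8
#define STEPS  1000

/* Minimal molecular-dynamics sketch: velocity-Verlet integration of atoms
 * coupled by harmonic springs along a 1-D chain (rest spacing 1.0). */
static void forces(const double *x, double *f)
{
    const double k = 1.0;                       /* illustrative spring constant */
    for (int i = 0; i < NATOMS; ++i) f[i] = 0.0;
    for (int i = 0; i < NATOMS - 1; ++i) {
        double stretch = x[i + 1] - x[i] - 1.0;
        f[i]     += k * stretch;
        f[i + 1] -= k * stretch;
    }
}

int main(void)
{
    double x[NATOMS], v[NATOMS] = {0}, f[NATOMS], fnew[NATOMS];
    const double dt = 0.01, mass = 1.0;

    for (int i = 0; i < NATOMS; ++i) x[i] = i + 0.1 * (i % 2); /* perturbed chain */
    forces(x, f);

    for (int n = 0; n < STEPS; ++n) {
        for (int i = 0; i < NATOMS; ++i)                /* position update */
            x[i] += v[i] * dt + 0.5 * (f[i] / mass) * dt * dt;
        forces(x, fnew);                                /* new forces */
        for (int i = 0; i < NATOMS; ++i) {              /* velocity update */
            v[i] += 0.5 * (f[i] + fnew[i]) / mass * dt;
            f[i] = fnew[i];
        }
    }
    printf("x[0] after %d steps: %f\n", STEPS, x[0]);
    return 0;
}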
On the other hand, investigating grain formation using molecular dynamics may not be possible, even
in the simplest metals on today’s computers. The availability of an exa-scale platform will move the
location of the constraints, allowing quite generally more detailed calculations of more complex
materials. State-of-the-art calculations involving billions of atoms have been performed that demonstrate the ability to model macroscopic (i.e., continuum) materials behavior with an atomistic model that makes no assumptions about the cooperative response. Figure 4.5.4 shows a detail view of a 9-billion-atom molecular dynamics simulation of a developing Kelvin-Helmholtz instability at the sheared interface between aluminum and copper. With the development of an exa-scale computer, it is possible that such a calculation (which was heroic on a peta-scale computer) could be performed
on demand during a hydrodynamics calculation, determining, for example, the equation of state for a mixed region at precisely the temperature, pressure, and composition that were required. By tabulating this information as it is generated, one can envision that such a simulation would teach itself as it runs, learning only those regions of this three-dimensional phase space that are needed.
4.5.3.6 Nuclear Engineering
Recent studies have reviewed the status and basic science, challenges, opportunities, and research
needs for advanced nuclear energy systems, with specific attention to the role of predictive modeling
and simulations (M&S) in addressing the difficulties posed by the radioactive materials and harsh
environments found in these systems:
• Computational M&S offers the opportunity to accelerate nuclear energy development by simulating complex systems to evaluate options and predict performance, thus narrowing the technology path and optimizing testing requirements.
• Today's high-performance computational systems are capable of modeling complete reactor systems and related technologies; the availability of exa-scale systems will enable high-fidelity M&S that can further improve the performance of existing reactors and have a significant positive impact on both the design and the operation of future reactors.
Simulation has the potential for addressing the critical needs of advanced nuclear energy systems by
providing the tools necessary for safety assessments, design activities, cost, and risk reduction. One
can, for example, imagine virtual prototyping of reactor cores yielding data that leads to more
accurate identification of design margins, allows early experimentation with novel design concepts,
and ultimately significantly reduces plant certification timelines. In other areas, such as advanced
fuel fabrication, atomistic fuel simulations could ultimately make it possible to target a small subset
of promising candidate fuel types for further experimentation, greatly reducing the number of
experiments to be performed. A simulation-based methodology is within reach with exa-scale
computers. The scope of the M&S tools needed to support the design, analysis and engineering of
next-generation nuclear energy systems is daunting:
1. Integrated 3D reactor core simulations with rigorous propagation of uncertainty;
2. Coupled thermal hydraulic and primary loop simulation;
3. Advanced fuel design and performance;
4. Fuel behavior engineering;
5. Advanced secondary loop and balance of plant engineering and analysis;
6. Advanced fuel cycle design;
7. Separations facility engineering optimization;
8. Repository design including seismic, geological, chemical, and thermal modeling and simulation;
9. Overall nuclear energy systems model development suitable for alternative economic analysis.
Spent fuel reprocessing is very complicated: a large number of different materials and multiple pathways must be considered; waste streams must be treated; and improved coupling between computations and experiments must occur. Reprocessing occurs at high temperature, and is in dire need of better multi-scale M&S. The opportunities for impact on reprocessing with exa-scale M&S abound. These include developing new separation agents, full-scale plant simulations using first principles, integrating multiple codes, and separations simulations. Empirical understanding does not lead to appropriate scale-up; it will instead require exa-scale computing. Some of the payoffs for exa-scale computation include: reduced R&D cost and time; improved/accelerated design; process scale-up; reduced facility cost; opportunity for major change; and waste form design. Many challenges
confront viable and useful (predictive) M&S of fuel performance. These include the ability to reduce
fuel development and qualification time, assess life cycle performance, address safety concerns,
predict fuel rod behavior in a design basis accident (DBA), and predict current and advanced (e.g., transuranic) fuel behavior. Important effects and requirements to incorporate include material
properties, swelling, microstructural phase change, thermal properties, crack formation and
mechanical property change. High-fidelity modeling of fuel performance is inherently multiscale, e.g.,
the effects of point defects and fission products must be considered. Exa-scale platform requirements
drivers in fuel performance can be quantified.
Opportunities for exa-scale M&S of existing and future advanced reactors include eliminating unrealistic assumptions that drive designs to be more conservative and thus raise installation cost, helping to achieve higher power efficiencies, reducing the learning curves needed to reach those efficiencies, helping to reduce the required number of repositories, improving safety posture, optimizing the design of the power grid and the fuel cycle, and enabling better (more efficient) operations, including in-line monitoring and operator training. There are numerous issues confronting advanced reactor M&S today. The core is a coupled-physics problem (not currently handled very well) and the full system needs to be analyzed in one tool. Current reactor designs are excessively conservative.
4.5.3.7 Other Disciplines
Other frequently mentioned disciplines which will be greatly impacted by exa-scale computing are:
• Astrophysics
• Fusion Energy
• National Security
Users should consult the report "The Opportunities and Challenges of Exa-scale Computing" by the USDOE for further information.
4.5.4 Challenges in Going to the Exa-scale
Creating an exa-scale computer capable of effectively running the applications just described will require significant R&D breakthroughs. The previous section laid out the case for the wide range of scientific and technical advances that could be made with an exaflop computer. This section discusses the challenges involved in making that three-orders-of-magnitude jump in technology. In this type of discussion, it is often far too easy to talk about that jump as some quantitative steps in an evolutionary process, when in fact the jump implies significant qualitative changes in the way solutions must be approached. Consider Table 4.5.1, which illustrates a change of three orders of magnitude. The analogy to computing challenges is not quite the same, because we do not have to explore totally different technologies to make the leap of three orders of magnitude. However, just as we would not think of asking a marathon runner to explore the solar system, we cannot use current technology to produce an exaflop system. Below we highlight the important steps necessary to take this giant step, while users could consult the report on exa-scale computing [72] for additional information.
72 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Table 4.5.1 Three Orders of Magnitude Jump

Technology          Quantitative Rate    Qualitative Change
Marathon Runner     10 mph               Explore a town
Car                 100 mph              Explore a country
Jet                 1,000 mph            Explore a world
Spacecraft          10,000 mph           Explore the solar system
4.5.4.1 The Hardware Challenges
The architectural challenges for reaching exa-scale are dominated by power, memory, interconnection networks, and resilience. Table 4.5.2 compares current HPC designs with potential exa-scale designs from the DOE [73]. The baseline we need is a factor-of-500 change in peak system performance. The differences in factor changes for the various components show where simple scaling of systems (e.g., buying 500 2-Pf/s systems) will be inadequate. Take, for example, the power line in the table. While the peak speed goes up by a factor of 500, the power cost cannot go up by more than a factor of 3. That means that the power solution for an exaflop system has to be roughly 500/3, or over 150 times, more energy efficient than current technology. That is a huge challenge. Looking through the other entries, the table clearly echoes the sentiments of the IAA, and highlights key features that must be addressed in hardware or downstream in software. Other potential challenges are listed below:
Exaflop hardware needs major R&D progress in:
Power
System Memory
Data Movement
System Resiliency
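As a back-of-the-envelope restatement of the power argument above (the factors of 500 and 3 come
from the text; the arithmetic and rounding are ours):

\[
\text{required efficiency gain} \;\ge\; \frac{\text{peak-performance factor}}{\text{allowed power factor}}
\;=\; \frac{500}{3} \;\approx\; 167 ,
\]

which is where the statement that the power solution must be more than 150 times more efficient
than current technology comes from.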
4.5.4.2 The Applied Mathematics Challenges
The applied mathematics component of an exa-scale program should include attention to activities
with time horizons ranging from medium term to very long term, where both ends of the time scale
are essential. The description “medium-term” is deliberate because experience in adapting to new
computational modalities shows that short-term, one-off strategies are likely to be wasteful. Even
though much remains unknown about the details of exa-scale systems, a clear medium-term priority
is the definition and implementation of algorithms that are scalable at very large levels of parallelism
(such as on million-core machines) and that remain sufficiently fast under different hardware
decisions about bandwidth and latency.
73 See previous footnote.
Table 4.5.2 Potential Exa-scale Computer Design for 2018 and Its Relationship to Current HPC
Designs (DOE)
Scalability should be modeled and analyzed mathematically,
using abstractions that represent key architectural ingredients. Simulations and experiments that
indicate the effects of hardware and software perturbations on algorithmic efficiency can then guide
the definition of methods that retain scalability under a variety of hardware scenarios. In this spirit,
the strategies for applied mathematics in exa-scale science will require sustained support over time
for people-intensive activities, early identification of the hardest (and least straightforward) research
problems, and built-in flexibility to pursue unexpected and promising new directions as they arise.
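One classical example of such an abstraction (our choice of illustration, not a prescription from the
report) is a simple execution-time model that separates serial, parallel, and communication
contributions:

\[
T(p) \;=\; (1 - f)\,T_1 \;+\; \frac{f\,T_1}{p} \;+\; t_{\text{comm}}(p),
\]

where \(f\) is the parallelizable fraction of the work, \(T_1\) is the single-process time, and
\(t_{\text{comm}}(p)\) collects bandwidth- and latency-dependent costs. Perturbing
\(t_{\text{comm}}(p)\) shows directly how sensitive a given algorithm's scalability is to hardware
decisions about the interconnect.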
Some other important points are not discussed here.
4.5.4.3 Mathematical Modeling
It is natural for those developing mathematical models of practical problems to limit themselves to
formulations that can be solved numerically using currently available methods. Although essential
when the problem needs to be solved in the short term, an ab initio focus on feasibility can create a
too-rigid environment in which non-standard or “blue-sky” formulations are either never thought
about or else summarily rejected. For example, a problem formulation that represents many real-
world problems yet tends to be avoided because of its known intractability is constrained nonlinear
optimization with a mixture of continuous, discrete, and categorical variables. But the prospect of
massive increases in computational power means that modeling ideas previously dismissed as
impossible or impractical may well become realistic, and should be carefully examined and analyzed.
Creative rethinking of mathematical models is an essential strategy to address the challenges of exa-
scale science. The highly desired “transformational” changes flowing from exa-scale computing are
most likely to come from new formulations that change the way we think about problems, rather
than from applying more resources to an existing formulation to obtain a more accurate solution or
to solve a larger problem. Mathematical models are inherently an approximation of reality, and an
exa-scale initiative provides an opportunity to loosen the grip of, or even remove, computationally-
imposed simplifications. The major challenge is to devise models that capture the important details
of physical and engineered systems as they really are. This will almost certainly generate much
harder sub-problems and/or much more data, but the gains are likely to be eminently worthwhile.
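To make the problem class named above concrete, a constrained nonlinear optimization problem
with mixed continuous, discrete, and categorical variables can be written (in our own notation, not
the report's) as

\[
\min_{x \in \mathbb{R}^{n},\; z \in \mathbb{Z}^{m},\; c \in \mathcal{C}} \; f(x, z, c)
\quad \text{subject to} \quad g_i(x, z, c) \le 0, \; i = 1, \dots, p,
\]

where \(\mathcal{C}\) is a finite set of categorical choices (for example, alternative materials or model
variants). The difficulty is that the feasible set is both non-convex and partly combinatorial, so each
candidate assignment of \((z, c)\) may require its own expensive nonlinear solve, which is exactly the
kind of cost that exa-scale resources could finally make affordable.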
4.5.4.4 Numerical Algorithms
The need for scalable algorithms in an exa-scale initiative has already been stressed. Another
essential feature, highlighted in a 2009 talk by Kathy Yelick called “Ten Ways to Waste a Parallel
Computer”, is a “back to basics” approach to reformulation. Without careful analysis of both new
models and new numerical methods, there is the risk of significant inaccuracy or large computational
overhead in unexpected parts of the overall solution process, as illustrated in the following two
examples related to numerical methods for partial differential equations:
1. All indications are that memory will become the rate-limiting factor along the path to exa-
scale, and investments should accordingly be made in designing algorithms with reduced
memory requirements. Examples where this work is appropriate include:
i. Algorithmically scalable matrix-free methods (e.g., multigrid) for sparse systems of
equations, where “algorithmically scalable” means that the total resources needed to
solve the problem (flops plus memory) are proportional to the resources needed to
evaluate the associated operator (a minimal illustrative sketch follows this list);
ii. High-order methods that perform more computation to obtain greater accuracy for
each computational degree of freedom;
iii. Adaptive models/methods designed to use the smallest possible number of degrees
of freedom to obtain the needed level of accuracy.
2. Many calculations related to DOE missions involve models that depend on both space and
time. In current methods, obtaining better spatial resolution typically requires a comparable
reduction in the time step. A frequent argument for exa-scale science is that it will allow much
finer spatial resolution in numerous application domains, with (for example) meshes reduced
in size by a factor of ten.
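For the first item above, the following minimal sketch (our own, not taken from the report) shows
what “matrix-free” buys in memory: a 1D Poisson problem is solved with a Jacobi iteration in which
the operator is applied as a three-point stencil, so only a few vectors of length n are stored and no
matrix is ever assembled. A multigrid solver would wrap a hierarchy of such stencil applications, but
the storage argument is the same.

/* Minimal matrix-free sketch: solve -u'' = 1 on (0,1), u(0) = u(1) = 0,
 * with a Jacobi iteration.  The operator is applied as a stencil, so the
 * memory footprint is a few vectors of length n, never an assembled matrix. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* y = A*u for the (h^2-scaled) 1D Laplacian, applied matrix-free. */
static void apply_laplacian(const double *u, double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;   /* Dirichlet boundary */
        double right = (i < n - 1) ? u[i + 1] : 0.0;
        y[i] = 2.0 * u[i] - left - right;
    }
}

int main(void)
{
    const int    n = 63;                 /* interior grid points */
    const double h = 1.0 / (n + 1);
    double *u = calloc(n, sizeof *u);    /* solution iterate */
    double *b = malloc(n * sizeof *b);   /* right-hand side, f = 1 */
    double *r = malloc(n * sizeof *r);   /* work vector for A*u */

    for (int i = 0; i < n; ++i) b[i] = h * h;          /* b = h^2 * f */

    for (int it = 0; it < 100000; ++it) {
        apply_laplacian(u, r, n);        /* the only place A "exists" */
        double res2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = b[i] - r[i];      /* residual component */
            u[i] += d / 2.0;             /* Jacobi: divide by the diagonal of A */
            res2 += d * d;
        }
        if (sqrt(res2) < 1e-10) { printf("converged after %d iterations\n", it + 1); break; }
    }
    printf("u at x = 0.5: %.6f (exact value 0.125)\n", u[n / 2]);
    free(u); free(b); free(r);
    return 0;
}

The same storage argument is what makes multigrid attractive at exa-scale: every level applies its
operator as a stencil or an unassembled element kernel, so memory stays proportional to the number
of unknowns.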
Returning to the second example: unfortunately, simply reducing the mesh spacing by a factor of ten
could lead to a ten-fold increase in the time to solution, even with perfect weak scaling, because the
finer mesh typically forces a comparable increase in the number of time steps, a cost that adding
processors in proportion to the spatial work cannot absorb. Several strategies, all in the spirit of rethinking,
should be explored to avoid this inefficiency. For example, models can be made more implicit to
avoid restrictive time-step conditions arising from stiff processes that rapidly relax to equilibrium
(e.g., in the context of low Mach-number fluid flows). A further strategy is aggressive use of sub-
cycling in time for processes that are fast, but either are localized in physical space or involve only a
small subset of the variables in state space. A motivating example here is advection in the jet stream
in atmospheric modeling.
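The sub-cycling idea can be illustrated with a deliberately tiny sketch (ours; the time scales and step
counts are arbitrary): a slow variable is advanced with a large outer step, while a fast process that
relaxes rapidly toward it is advanced with many small sub-steps inside each outer step, keeping the
explicit scheme stable without forcing the small step onto the whole system.

/* Toy illustration of temporal sub-cycling: a slow variable s is advanced
 * with a large step DT, while a fast process q, which relaxes quickly
 * toward s, takes NSUB smaller sub-steps inside each outer step. */
#include <stdio.h>

int main(void)
{
    const double DT   = 0.01;    /* outer (slow) time step */
    const int    NSUB = 50;      /* sub-steps for the fast process */
    const double dt   = DT / NSUB;
    const double eps  = 1e-3;    /* fast relaxation time scale */

    double s = 1.0;              /* slow variable:  ds/dt = -s */
    double q = 0.0;              /* fast variable:  dq/dt = -(q - s)/eps */

    for (int n = 0; n < 100; ++n) {
        /* one cheap explicit step for the slow physics */
        s += DT * (-s);

        /* sub-cycle the stiff, fast physics with a stable small step */
        for (int k = 0; k < NSUB; ++k)
            q += dt * (-(q - s) / eps);
    }
    printf("t = %.2f  s = %.4f  q = %.4f (q should track s)\n", 100 * DT, s, q);
    return 0;
}

Here the explicit stability limit dt < 2*eps applies only to the fast variable; without sub-cycling, the
outer step DT = 10*eps would violate it and the entire calculation would have to run at the small step.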
Approaches of this flavor across the spectrum of numerical methods will almost certainly lead to
increased algorithmic complexity, in addition to the daunting software-related challenges discussed.
The substantially greater work needed to devise exa-scale numerical methods and software leads us
to observe that, for decades, there has been, roughly speaking, a dichotomy in the wish list for the
mathematical software used to solve scientific and engineering problems. On one hand, many DOE
scientists have neither time nor inclination to become experts in numerical methods and software
techniques, preferring to leave software development to mathematicians and computer scientists.
On the other hand, some scientists and engineers want to become deeply involved in producing
domain-specific methods and software to attain the highest possible efficiency for their particular
problem. An exa-scale science program needs to address the needs of both these groups. For the first,
“professional” mathematical software and libraries (meaning software developed by mathematicians
and computer scientists for relatively generic problems such as solving linear systems or eigenvalue
problems) should be developed for increasingly broad problem categories as we move toward exa-
scale. In this way, domain scientists will be able to use state-of-the-art software components that can
be shared across multiple application domains. Since writing software is universally recognized to
be time consuming and error-prone, scientists and engineers will benefit from availability of
software that they can use off the shelf while experimenting with domain-specific challenges rather
than writing their own sparse matrix package. For the second group, specific scientific case studies
should be identified that require significant involvement of domain scientists, mathematicians, and
computer scientists in end-to-end software development.
Other relevant mathematical topics include:
Mathematics for massive data
Machine learning
Compressive sampling
Symbolic computing
4.5.4.5 The Algorithmic Challenges
Advancing science in key areas requires development of next-generation physical models to satisfy
the accuracy and fidelity needs for targeted simulations. The impact of these simulation fidelity needs
on requirements for computational science is twofold.
First, more complex physical models must be developed to account for more aspects of the
physical phenomena being modeled.
Second, for the physical models being used, increases in resolution for key system variables,
such as numbers of spatial zones, time steps or chemical species, are needed to improve
simulation accuracy, which in turn places higher demands on computational hardware and
software.
Application models represent the functional requirements that drive the need for certain numerical
algorithms and software implementations. Science priorities lead to science models, and models are
implemented in the form of algorithms. Algorithm selection is based on various criteria, such as
115
appropriateness, accuracy, verification, convergence, performance, parallelism and scalability.
Moving forward to exa-scale will put heavier demands on algorithms in at least two areas:
the need for increasing amounts of data locality in order to perform computations efficiently,
the need to obtain much higher factors of fine-grained parallelism as high-end systems
support increasing numbers of compute threads.
As a consequence, parallel algorithms must adapt to this environment, and new algorithms and
implementations must be developed to extract the computational capabilities of the new hardware.
Significant new model development, algorithm re-design and science application code
reimplementation, supported by exa-scale-appropriate programming models, will be required to
exploit efficiently the power of exa-scale architectures. The transition from current sub-peta-scale
and peta-scale computing to exa-scale computing will be at least as disruptive as the transition from
vector to parallel computing in the 1990s. Uncertainty quantification will permeate the exa-scale
science workload. The demand for predictive science results will drive the development of improved
approaches for establishing levels of confidence in computational predictions. Both statistical
techniques involving large ensemble calculations and other statistical analysis tools will have
significantly different dynamic resource allocation requirements than in the past, and the significant
code redesign required for the exa-scale will present an opportunity to embed uncertainty
quantification techniques in exa-scale science applications. Some other points are:
New multicore-friendly and multicore-aware algorithms
Adaptive response to load imbalance
Multiple-precision algorithms/software
Communication-avoiding algorithms
Fast implicit solvers
Auto-tuning
Scheduling and memory management for heterogeneity and scale
Fault tolerance and robustness for large-scale systems
Building energy efficiency into algorithm foundations
Sensitivity analysis
Multiscale/multi-physics modeling
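As a small illustration of the two demands listed at the start of this subsection (data locality and
fine-grained parallelism), the sketch below (our own; the problem size and tile size are arbitrary)
applies cache blocking and OpenMP loop parallelism to a dense kernel. The same
restructure-for-reuse-then-parallelize pattern carries over to the stencil and sparse kernels typical
of CFD.

/* Cache blocking improves data locality; the OpenMP pragma exposes
 * fine-grained, loop-level parallelism.  Kernel: C = A * B, N x N. */
#include <stdio.h>

#define N  512
#define BS 64            /* tile chosen to fit comfortably in cache */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* Threads work on independent tiles of C; within a tile the blocks of
     * A, B and C are reused many times while they are cache-resident. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; ++i)
                    for (int k = kk; k < kk + BS; ++k) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BS; ++j)
                            C[i][j] += a * B[k][j];
                    }

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}

Without the blocking, each element of B would be refetched from main memory on every pass over
a row of A; with it, most accesses are served from cache, which is precisely the data-locality demand
noted above.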
4.5.4.6 Computer Science Challenges
The coming transition in computer architectures as peak capability approaches the exa-scale offers
both challenges and opportunities74. The challenges involve a paradigm shift in programming
methodologies. Existing technologies for writing parallel scientific applications have sustained HPC
application software development for the past decade and have been successful for Peta-scale
computing, but were architected for coarse-grained concurrency largely dominated by bulk
synchronous algorithms. Future hardware constraints and growth in explicit on-chip parallelism will
likely require a mass migration to new algorithms and software architecture that is as broad and
disruptive as the migration from vector to parallel computing systems that occurred 15 years ago. The
applications and algorithms will need to rely increasingly on fine-grained parallelism, strong scaling,
and fault resilience. Addressing these challenges opens up a renewed opportunity to introduce a
higher level of software engineering into current fusion application subsystems that will enhance the
modularity, portability, and performance of codes while extending their capabilities to new levels. At
the same time, past sound investments must be protected, and a migration path from current to
future environments must be elaborated.
74 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced
Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Some other themes are:
Programming Models
I/O
Getting There from Here
Tools
Fault Tolerance
4.5.4.7 Educational Challenges
Major challenges in exa-scale science include the building of understanding and awareness among
groups with high prestige in both academia and industry, and the dearth of highly competent young
scientists in this field, two issues that are not entirely unrelated. Many of the reasons for these
problems are reasonably well understood, but not easily dealt with. Application scientists who focus
primarily on building computational tools are sometimes regarded by their scientific community as
not being “real” scientists. This phenomenon is particularly noticeable in both physics and chemistry,
reflecting in part the penetration of “community codes”. From the opposite perspective, high-level
software designers and programmers may not welcome or appreciate the contributions made by
scientific disciplines to building state-of-the-art computational tools. On the bright side, interest in
computational science and engineering worldwide has measurably increased during the past 15
years.
Almost no universities, even those with faculty working on computational science and engineering,
have, or are likely to develop, a curriculum that focuses on issues associated with exa-scale science.
In addition, as our subcommittee has noted already, many of the issues in exa-scale science are not
yet understood, which means that a straightforward program of training in the usual sense is
impossible. Exa-scale hardware and its features will keep changing, so that training people too early
to think about specific hardware configurations is a bad idea. However, it is important to start soon
to lay the foundations for future thinking about exa-scale science. To be successful, an exa-scale
science education and training program needs to be devised and managed with creative flair, not
business as usual75.
75 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced
Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.