
Basics of Computer Architecture as related to CFD

Authors:
  • CFD Open Series

Abstract

As we know, computers and software are among the pillars of CFD, and the next two chapters are devoted to them. We have argued that a CFD analyst need not be a computer expert; nevertheless, knowing the essentials never hurts. So it is wise to become familiar with modern computer architectures, as well as software optimization. Even if you could speed up the computational aspects of a processor infinitely fast, you would still have to load and store the data and instructions to and from a memory. Today's processors continue to creep ever closer to infinitely fast processing. But memory performance is increasing at a much slower rate (it will take longer for memory to become infinitely fast). Many of the interesting problems in high performance computing use a large amount of memory. As computers get faster, the size of the problems they tend to operate on also goes up. The trouble is that when you want to solve these problems at high speeds, you need a memory system that is large, yet at the same time fast; a big challenge.
CFD Open Series / Patch 2.45
Basics of Computer Architecture as related to CFD
Edited & Adapted by: Ideen Sadrehaghighi
Annapolis, MD
Memory Management and Smooth Programming
CPU vs. GPU
Contents
1 Introduction to Modern Computer Architectures via CFD ........................................... 8
1.1 Preliminaries ............................................................................................................................................................ 8
2 Memory Systems & Basics of Good Programming ........................................................... 9
2.1 Memory Technology ............................................................................................................................................. 9
2.1.1 Memory Access Time ......................................................................................................... 10
2.1.2 Memory Access Patterns ................................................................................................... 10
2.1.2.1 Loop Interchange to Ease Memory Access Patterns ............................................... 11
2.1.3 Virtual Memory ................................................................................................................. 11
2.2 Registers ................................................................................................................................................................. 11
2.3 Caches ...................................................................................................................................................................... 12
2.3.1 Cache Organization ............................................................................................................ 14
2.3.1.1 Direct-Mapped Cache .............................................................................................. 14
2.3.1.2 Fully Associative Cache ............................................................................................ 15
2.3.1.3 Set-Associative Cache .............................................................................................. 15
2.3.1.4 Instruction Cache ..................................................................................................... 16
2.4 Timing a Program ............................................................................................................................................... 17
2.4.1 Timing a Portion of the Program ....................................................................................... 18
2.4.2 Getting Time Information .................................................................................................. 18
2.5 Subroutine Profiling ........................................................................................................................................... 20
2.6 Loop Optimizations ............................................................................................................................................ 21
2.6.1 Operation Counting ........................................................................................................... 22
2.6.2 Basic Loop Un-Rolling ........................................................................................................ 23
2.6.3 Loops with Low Trip Counts .............................................................................................. 24
2.6.4 Fat Loops ............................................................................................................................ 24
2.6.5 Loops Containing Procedure Calls ..................................................................................... 25
2.6.6 Loops with Branches .......................................................................................................... 25
2.6.7 Nested Loops ..................................................................................................................... 26
2.6.8 Outer Loop Un-Rolling ....................................................................................................... 26
2.6.9 Loop Interchange to Move Computations to the Center .................................................. 27
2.7 Matrix Multiplication ......................................................................................................................................... 27
2.7.1 Matrix Optimization .......................................................................................................... 28
2.7.2 Blocking to Ease Memory Access Patterns ........................................................................ 29
2.8 Shared-Memory Parallel Processors ........................................................................................................... 29
2.8.1 Dependencies .................................................................................................................... 30
2.8.1.1 Control Dependencies ............................................................................................. 31
2.8.1.2 Data Dependencies ................................................................................................. 31
2.8.2 Forming a Flow Graph ....................................................................................................... 32
2.8.2.1 Loop Dependencies ............................................................................................... 33
2.8.2.2 Loop-Carried Dependencies .................................................................................... 34
2.8.2.3 Flow Dependencies ................................................................................................. 35
2.8.2.4 Output Dependencies ............................................................................................. 35
2.8.2.5 Dependencies Within an Iteration .......................................................................... 36
2.9 Pointer Ambiguity in C ...................................................................................................................................... 37
3 Performance of CFD Codes as Related to Hardware (CPU vs. GPU) ......................... 39
3.1 CFD for Next Generation High Performance Computing .................................................................... 39
3.2 Hardware Consideration and CPU vs. GPU Technology ...................................................................... 39
3.2.1 Case Study 1 - 2D Laplace Equation .................................................................. 40
3.2.1.1 Results ..................................................................................................................... 40
3.2.1.2 Future Work - Heterogeneous Computing ............................................. 41
3.2.2 Case Study 2 - Unstructured Grid Based CFD Solvers on Modern Graphics Hardware ..... 41
3.2.2.1 Background and Literature Survey .......................................................................... 41
3.2.2.2 Implementation on Graphics Hardware .................................................................. 42
3.2.2.3 Test Case .................................................................................................................. 43
3.2.3 Case Study 3 - Accelerating CFD Simulation With High Order Finite Difference Method on
Curvilinear Coordinates for Modern GPU Clusters ........................................................................... 44
3.2.3.1 Introduction ............................................................................................................. 44
3.2.3.2 Numerical Methods ................................................................................................. 46
3.2.3.3 Hardware Environment ........................................................................................... 46
3.2.3.4 Programming Implementation and Optimization of HiResX ................................... 48
3.2.3.4.1 Code Introduction .............................................................................................. 48
3.2.3.4.2 Domain Decomposition ...................................................................................... 52
3.2.3.4.3 Hardware Technique .......................................................................................... 52
3.2.3.4.4 CPU-GPU Communication Optimization ............................................................ 53
3.2.3.4.5 GPU-GPU Communication Optimization ............................................................ 54
3.2.3.4.6 Memory Utilization............................................................................................. 55
3.2.3.4.7 CUDA Kernels...................................................................................................... 57
3.2.3.4.8 Inviscid Fluxes ..................................................................................................... 58
3.2.3.4.9 Viscous Fluxes ..................................................................................................... 61
3.2.3.4.10 Kernels in Other Section of The Solver ............................................................... 63
3.2.3.5 Performance Result ................................................................................................. 63
3.2.3.5.1 Speedup Varies with Grid Size ............................................................................ 63
3.2.3.5.2 Performance of Kernels ...................................................................................... 64
3.2.3.5.3 Performance of Running with Multiple GPUs .................................................... 65
3.2.3.5.4 Case Study .......................................................................................................... 68
3.2.3.6 Conclusions .............................................................................................................. 69
3.2.3.7 References ............................................................................................................... 70
3.2.4 Case Study 4 - A Matrix-free GMRES Algorithm on GPU Clusters for Implicit Large Eddy
Simulation ......................................................................................................................................... 72
3.2.4.1 Introduction ............................................................................................................. 72
3.2.4.2 A Brief Overview of the FR/CPR Method ................................................................. 73
3.2.4.3 Matrix-free GMRES Algorithm ................................................................................. 75
3.2.4.4 Numerical Results .................................................................................................... 77
3.2.4.4.1 Multi-GPU Performance and Scalability Results ................................................ 77
3.2.4.4.2 Transitional Flow Over the T106C Low Pressure Turbine Cascade .................... 80
3.2.4.4.3 Flow Over a High-Lift Configuration of the Common Research Model .............. 84
3.2.4.5 Concluding Remarks ................................................................................................ 85
3.2.5 References ......................................................................................................................... 86
4 CFD and HPC Trends Forecasted for 2030 ....................................................................... 90
4.1 Relationship Between Semiconductors, SMT, and Microelectronics............................................. 90
4.2 Comparison of Semiconductor Fabrication Sizes in HPC ................................................................... 91
4.3 Current Status of CFD ........................................................................................................................................ 92
4.3.1 Conceptual Design ............................................................................................................. 92
4.3.2 Preliminary/Detailed Design .............................................................................................. 92
4.3.3 Product Validation and Certification ................................................................................. 93
4.3.4 CFD usage of High Performance Computing (HPC) ........................................................... 93
4.3.5 Turbulence Modeling......................................................................................................... 93
4.3.6 Process Automation .......................................................................................................... 94
4.3.7 Solution Uncertainty and Robustness ............................................................................... 94
4.3.8 Multidisciplinary Analysis and Optimization (MDAO) ....................................................... 94
4.4 Vision of CFD in 2030 as anticipated by NASA ....................................................................................... 95
4.4.1 Technology Roadmap to achieve GC Challenge ................................................................ 96
4.4.1.1 High Performance Computing (HPC) ....................................................................... 97
4.4.1.2 Physical Modeling .................................................................................................... 97
4.4.1.3 Numerical Algorithms .............................................................................................. 98
4.4.1.4 Uncertainty Quantification (UQ) ............................................................................. 99
4.4.1.5 Geometry and Grid Generation ............................................................................. 100
4.4.1.6 Knowledge Extraction ............................................................................................ 101
4.4.1.7 Multidisciplinary Design and Optimization ........................................................... 101
4.4.2 Recommendations ........................................................................................................... 102
4.5 HPC Envisioned by Department of Energy (DOE) ............................................................................ 103
4.5.1 What is Exa-scale Computing? ........................................................................................ 104
4.5.2 Why Exa-scale? ................................................................................................................ 104
4.5.3 Range of Applications may be Transformed by Going to the Exa-scale .......................... 104
4.5.3.1 Aerospace, Airframes and Jet Turbines ................................................................. 104
4.5.3.2 Combustion ........................................................................................................... 106
4.5.3.3 Climate Modeling .................................................................................................. 107
4.5.3.4 Computational Biology .......................................................................................... 108
4.5.3.5 Materials Science................................................................................................... 109
4.5.3.6 Nuclear Engineering .............................................................................................. 110
4.5.3.7 Others Disciplines .................................................................................................. 111
4.5.4 Challenges in Going to the Exa-scale ............................................................................... 111
4.5.4.1 The Hardware Challenges ...................................................................................... 112
4.5.4.2 The Applied Mathematics Challenges ................................................................... 112
4.5.4.3 Mathematical Modeling ........................................................................................ 113
4.5.4.4 Numerical Algorithms ............................................................................................ 113
4.5.4.5 The Algorithmic Challenges ................................................................................... 114
4.5.4.6 Computer Science Challenges ............................................................................... 115
4.5.4.7 Educational Challenges .......................................................................................... 116
List of Tables
Table 2.3.1 Memory Access Speed on a DEC Alpha ............................................................................................. 12
Table 3.2.1 Present GPU specs. Titan V has highest double precision operation performance. RTX
2080 Ti utilizes the newest architecture and provides highest single precision operation performance
....................................................................................................................................................................................................... 47
Table 3.2.2 Present spec of CPU. The double precision operation performance of E5-2680v3 is
slightly higher than RTX 2080 Ti’s. ................................................................................................................................ 47
Table 3.2.3 Maximum grid capacity of HiResX solver in different GPUs ..................................................... 56
Table 3.2.4 Results of the percentage of a typical simulation iteration that is spent in different parts
of the GMRES algorithm. Results for different solution order and cell types ............................................... 78
Table 3.2.5 CPU and GPU simulation final settings for the T106C test case ............................................. 81
Table 3.2.6 Comparison of mean lift and drag coefficients between the CPU simulation and different
GPU simulations. Relative error is calculated with respect to the CPU simulation ................................... 83
Table 4.5.1 Three Order of Magnitude Jump ...................................................................................................... 111
Table 4.5.2 Potential Exa-scale Computer Design for 2018 and Its Relationship to Current HPC
Designs (DOE) ...................................................................................................................................................................... 112
List of Figures
Figure 1.1.1 Contributions From Other Disciplines to CFD ............................................................................... 8
Figure 2.3.1 Cache Lines can come from Different Parts of Memory ........................................................... 12
Figure 2.3.2 Many memory addresses map to the same cache line ............................................................. 14
Figure 2.3.3 Two-Way Set-Associative Cache ........................................................................................................ 16
Figure 2.5.1 Sharp Profiling (left) vs. Flat Profiling (right) .......................................................... 20
Figure 2.8.1 (a) Control Dependency; (b) A section of your program; (c) Expensive Operation
Moved so that it's Rarely Executed ................................................................................................................................ 31
Figure 2.8.2 Types of Data Dependencies ............................................................................................................... 32
Figure 2.8.3 Flow Graph for Data Flow Analysis .................................................................................................. 33
Figure 2.8.4 Flow Graph including a loop .............................................................................................................. 34
Figure 3.2.1 Architecture Differences Between CPU and GPU ....................................................................... 40
Figure 3.2.2 Result for a V cycle Multigrid .............................................................................................................. 40
Figure 3.2.3 Heterogeneous Computing using CPUs and GPUs ...................................................................... 41
Figure 3.2.4 Pressures at the Surface and Plane for the NACA 0012 (Left) and at the Surface for
the Missile (Right) ................................................................................................................................................................. 42
Figure 3.2.5 Running Times in Double Precision Per Element Per Iteration for the NACA 0012 .... 43
Figure 3.2.6 Framework of modern GPU cluster. The computational nodes are connected with high
speed network. GPUs deliver the majority of performance of modern GPU cluster ................................. 46
Figure 3.2.7 Several typical PCIe root architectures of GPU server. For type (a), each NUMA node
has PCIe switch attached on them, and there are two GPUs mounted on each PCIe switch. For type
(b), only one PCIe switch is mounted on each NUMA node, and there are 4 GPUs attached on one PCIe
switch. For type (c), all GPUs are mounted on one NUMA node by two PCIe switches. For type (d), all
GPUs are mounted on one NUMA node. ....................................................................................................................... 48
Figure 3.2.8 Main structure of HiResX summarized by pseudo code ........................................................... 49
Figure 3.2.9 Framework of communication of HiResX without GPU peer to peer communication
technology support. For communication between processes that are all running on CPU, if processes
are located in the same node, they exchange data within RAM with MPI-3 shared memory technique,
see red path. If processes are located in different nodes, the standard MPI communication is utilized,
see yellow path that connects triangle pair. For processes with GPU acceleration, data on GPU should
be downloaded back to CPU, and then processes communicate in the same way as processes without
acceleration. ............................................................................................................................................................................. 50
Figure 3.2.10 A global glance of execution procedures of HiResX running on GPU. Except for
initialization and data performed on CPU, all computations are performed on GPU while CPU is used
to schedule kernels of GPU only. "Data 0" represents the initial data computed in CPU and uploaded
to GPU. "Data 1" represents data exchanged between processes. "Data 3" represents flow field data
to be written, and it should be downloaded to CPU and written with a new thread. ............................... 50
Figure 3.2.11 Domain decomposition strategies of HiResX. The domain is divided into several blocks
according to performance of CPU and GPU in order to balance workloads of each process. For the
blocks computed by CPU, the whole block is computed by a CPU process. For the blocks computed by
GPU, each grid cell is computed by one CUDA thread. ........................................................................................... 51
Figure 3.2.12 CPU-GPU and GPU-GPU memory access models. In (a), red path is the optimal one,
while pink path is inferior because CPU to GPU memory access across NUMA nodes is worse than
local access. In (b), if GPUs support peer-to-peer (P2P) communication technology, green path
indicates that two GPUs communicate by Nvlink, which is the fast path. Purple path indicates that
two GPUs communicate by PCIe switch (PLX), which is most efficient communication way without
Nvlink. Yellow path means that two GPUs communicate by host (CPU). Red path is the worst one
because P2P is not supported across NUMA nodes, and memory access between two GPU in this way
must be transferred by both CPUs. ................................................................................................................................ 52
Figure 3.2.13 Organization of CUDA threads for NVIDIA’s GPU and its mapping to hardware
structure .................................................................................................................................................................................... 57
Figure 3.2.14 Code Structure For Derivative of Inviscid Flux Computed With Original AFWENO . 58
Figure 3.2.15 Comparison of WENO interpolations. In the original WENO interpolation, the
characteristic space is built at i + 1/2, and u±i+1/2 are computed simultaneously in a loop. In the modified
WENO interpolation, the characteristic space is built at i, and u−i+1/2 and u+i−1/2 are computed
simultaneously in a loop. In the modified WENO interpolation, the point stencils are the same, so the
characteristic variables and smoothness indicators are shared, which reduces calculation and global
memory access. ......................................................................................................................................................
Figure 3.2.16 Code structure of derivative of inviscid flux computed with modified AFWENO ...... 61
Figure 3.2.17 Code structure of the derivative of viscous flux summarized by a pseudo code ........ 62
Figure 3.2.18 Global performance varying with grid sizes. The speedup is defined as the ratio of
elapsed time of running on CPU with one core to lapsed time of running with a GPU ............................ 63
Figure 3.2.19 Performance of main parts of the solver. Generally, the parts that contain large
amount of computation get apparent acceleration. In the sections of time advance and viscous flux,
due to the high occupancies of kernels, both GPUs get higher speedups. For evaluation of time step
and inviscid flux, Titan V gets higher speedups than RTX 2080 Ti, because Titan V has more double
precision operation processing unit, which means higher double precision operation performance,
see Table 3.2.1. ...................................................................................................................................................... 65
Figure 3.2.20 Strong scaling result of different GPU. Both GPUs’ scalabilities deviate from linear
result, but the efficiencies are all larger than 75% when 10 GPUs are utilized. The drops of scalability
come from the fact that when grid size decreases, the speedup drops too, see Figure 3.2.18. Scalability
of Titan V drops faster than RTX 2080 Ti’s, because the speedup of Titan V drops faster than RTX
2080 Ti’s when the grid size decreases. ....................................................................................................................... 66
Figure 3.2.21 Cases configuration. In figure (a), there are 100 points in the spanwise direction. Data
exchange in vertical direction (red) is larger than data exchange in horizontal direction (blue). In
figure (b). five cases are set to test the communication performance of different strategies of
computations assigned to different GPUs. The number in block presents the GPU ID in which
computation is performed. Blocks with the same color indicate that they are in the same PCIe switch.
In cases I and II, computations are performed on GPUs that are in two different PCIe switches
respectively. In cases III to V, computations are performed on GPUs belonging to both PCIe
switches. For case III, the maximum data transfers are performed across PCIe switch. In case IV, the
maximum data transfers are performed within the same PCIe switch. In case V, all data transfers are
performed across PCIe switch. ......................................................................................................................................... 67
Figure 3.2.22 Performance of HiResX running on different GPU topology. Cases in which all GPUs
are in the same PCIe switch get best communication efficiency, see case I and II. For case in which
GPUs are located in different PCIe switches, if GPU devices are optimally assigned according to
communication load, communication efficiency can be also improved, see case IV. Without GPU to
GPU optimization, the communication efficiency is lower, see case III and V. ............................................ 68
Figure 3.2.23 Shock-cell spacing Lh/h as a function of the fully expanded jet Mach number Mj . Our
result is more close to theoretical[17] and experimental results [18, 19], compared to the LES results
of Berland et al.[16]. ............................................................................................................................................................. 68
Figure 3.2.24 Strouhal number St = fh/Uj of the fundamental screech tone as a function of the fully
expanded jet Mach number Mj . Our result is in good agreement with experimental result of Ref.[18],
which is better than the LES result of Ref.[16]. ......................................................................................................... 69
Figure 3.2.25 Instantaneous snapshot of spanwise vorticity ωz and the dilatation in the plane z/h =
2.5 as the background. The isosurface of vorticity is colored with the amplitude of velocity .............. 69
Figure 3.2.26 NVIDIA V100 GPU Speedup compared to a single CPU core of an Intel Xeon CPU E5-
2620 chip .................................................................................................................................................................................. 77
Figure 3.2.27 Strong scalability study with GPU simulations on the KU cluster for a mesh with 149k
hex elements ............................................................................................................................................................................ 78
Figure 3.2.28 Strong scalability study with GPU simulations on Summit for a mesh with 15.6M hex
elements .................................................................................................................................................................................... 79
Figure 3.2.29 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic
time Tc using different physical time steps ................................................................................................................. 80
Figure 3.2.30 Mesh provided by Alhawwary et al. [38] and time-averaged solution contours of the
p2 simulation with GPU ...................................................................................................................................................... 80
Figure 3.2.31 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic
time Tc using different GMRES linear convergence tolerance ............................................................................ 81
Figure 3.2.32 Comparison of the time- and spanwise-averaged pressure coefficient between the
CPU simulation and different GPU simulations ........................................................................................................ 82
Figure 3.2.33 Comparison of the time- and spanwise-averaged coefficient of streamwise friction
between the CPU simulation and different GPU simulations .............................................................................. 82
Figure 3.2.34 Comparison of the PSD of pressure at wake point(1) ............................................................ 83
Figure 3.2.35 Views of the high-order (Q2) hybrid mesh generated by Pointwise with over 4 million
elements .................................................................................................................................................................................... 85
Figure 3.2.36 Instantaneous flow field showing iso-surfaces of the Q-criterion colored by the
stream-wise velocity for a p2 simulation .................................................................................................................... 86
Figure 4.1.1 Relationship between Microelectronics, SMT and Semiconductor.................................... 91
Figure 4.2.1 Changing Predictions About Semiconductor Sizes .................................................................... 91
Figure 4.4.1 Proposed New Computational Sciences Program Structure .............................................. 103
Figure 4.5.2 Computer speed and memory requirements for the Grand Challenge .......................... 105
Figure 4.5.3 A supersonic Jet Engine Nozzle Rapidly Accelerates High-Pressure Gas into the
Atmosphere ........................................................................................................................................................................... 106
Figure 4.5.4 Detail View of 9-Billion Atom Molecular Dynamics Simulation Instability .................. 109
1 Introduction to Modern Computer Architectures via CFD
1.1 Preliminaries
As we know, computers and software are among the pillars of CFD, and the next two chapters are devoted to them. We have argued that a CFD analyst need not be a computer expert; nevertheless, knowing the essentials never hurts (see Figure 1.1.1). So it is wise to become familiar with modern computer architectures, as well as software optimization, as detailed in Severance & Dowd¹. Even if you could speed up the computational aspects of a processor infinitely fast, you would still have to load and store the data and instructions to and from a memory. Today's processors continue to creep ever closer to infinitely fast processing. But memory performance is increasing at a much slower rate (it will take longer for memory to become infinitely fast). Many of the interesting problems in high performance computing use a large amount of memory. As computers get faster, the size of the problems they tend to operate on also goes up. The trouble is that when you want to solve these problems at high speeds, you need a memory system that is large, yet at the same time fast; a big challenge.

¹ Charles Severance, Kevin Dowd, “High Performance Computing”, Rice University, Houston, Texas, 2012.

Figure 1.1.1 Contributions From Other Disciplines to CFD
2 Memory Systems & Basics of Good Programming
Today's processors continue to creep ever closer to infinitely fast processing. But memory
performance is increasing at a much slower rate (it will take longer for memory to become infinitely
fast). Many of the interesting problems in high performance computing use a large amount of
memory. As computers are getting faster, the size of problems they tend to operate on also goes up.
The trouble is that when you want to solve these problems at high speeds, you need a memory system
that is large, yet at the same time fast; a big challenge. Possible approaches include the following:
  • Every memory system component can be made individually fast enough to respond to every memory access request.
  • Slow memory can be accessed in a round-robin fashion (hopefully) to give the effect of a faster memory system.
  • The memory system design can be made wide so that each transfer contains many bytes of information.
  • The system can be divided into faster and slower portions and arranged so that the fast portion is used more often than the slow one.
Again, economics are the dominant force in the computer business. A cheap, statistically optimized
memory system will be a better seller than a prohibitively expensive, blazingly fast one, so the first
choice is not much of a choice at all. But these choices, used in combination, can attain a good fraction
of the performance you would get if every component were fast. Chances are very good that your
high performance workstation incorporates several or all of them. Once the memory system has been
decided upon, there are things we can do in software to see that it is used efficiently. A compiler that
has some knowledge of the way memory is arranged and the details of the caches can optimize their
use to some extent. The other place for optimizations is in user applications, as we'll see later in the
book. A good pattern of memory access will work with, rather than against, the components of the
system. Next, we discuss how the pieces of a memory system work. We look at how patterns of data
and instruction access factor into your overall runtime, especially as CPU speeds increase. We also
talk a bit about the performance implications of running in a virtual memory environment.
2.1 Memory Technology
Almost all fast memories used today are semiconductor-based (magnetic core memory is still used in applications where radiation hardness, i.e., resistance to changes caused by ionizing radiation, is important). They come in two flavors: Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM). The term random means that you can address memory locations in any order. This distinguishes them from serial memories, where you have to step through all intervening locations to get to the particular one you
are interested in. An example of a storage medium that is not random is magnetic tape. The terms
dynamic and static have to do with the technology used in the design of the memory cells. DRAMs are
charge-based devices, where each bit is represented by an electrical charge stored in a very small
capacitor. The charge can leak away in a short amount of time, so the system has to be continually
refreshed to prevent data from being lost. The act of reading a bit in DRAM also discharges the bit,
requiring that it be refreshed. It's not possible to read the memory bit in the DRAM while it's being
refreshed.
SRAM is based on gates, and each bit is stored in four to six connected transistors. SRAM memories
retain their data as long as they have power, without the need for any form of data refresh. DRAM
offers the best price/performance, as well as highest density of memory cells per chip. This means
lower cost, less board space, less power, and less heat. On the other hand, some applications such as
cache and video memory require higher speed, to which SRAM is better suited. Currently, you can
choose between SRAM and DRAM at slower speeds, down to about 50 nanoseconds (ns). SRAM has
access times down to about 7 ns at higher cost, heat, power, and board space. In addition to the basic
technology to store a single bit of data, memory performance is limited by the practical
considerations of the on-chip wiring layout and the external pins on the chip that communicate the
address and data information between the memory and the processor.
2.1.1 Memory Access Time
The amount of time it takes to read or write a memory location is called the memory access time.
Whereas the access time says how quickly you can reference a memory location, cycle time describes
how often you can repeat references. They sound like the same thing, but they're not. For instance, if
you ask for data from DRAM chips with a 50-ns access time, it may be 100 ns before you can ask for
more data from the same chips. This is because the chips must internally recover from the previous
access. Also, when you are retrieving data sequentially from DRAM chips, some technologies have
improved performance. On these chips, data immediately following the previously accessed data may
be accessed as quickly as 10 ns.
2.1.2 Memory Access Patterns
The best pattern is the most straightforward: increasing and unit sequential. For an array with a
single dimension, stepping through one element at a time will accomplish this. For multiply-
dimensioned arrays, access is fastest if you iterate on the array subscript offering the smallest stride
or step size. In FORTRAN programs, this is the leftmost subscript; in C, it is the rightmost. The
FORTRAN loop below has unit stride, and therefore will run quickly:
DO J = 1 , N
DO I = 1 , N
A (I , J) = B (I , J) + C (I ,J ) * D
ENDDO
ENDDO
In contrast, the next loop is slower because its stride is N. As N increases from one to the length of the
cache line (adjusting for the length of each element), the performance worsens. Once N is longer than
the length of the cache line (again adjusted for element size), the performance won't decrease:
DO J = 1 , N
DO I =1 , N
A (J , I) = B (J , I) + C (J , I) * D
ENDDO
ENDDO
Here's a unit-stride loop like the previous one, but written in C:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = a[i][j] + c[i][j] * d;
2.1.2.1 Loop Interchange to Ease Memory Access Patterns
Loop interchange is a good technique for lessening the impact of stride memory references. Let's
revisit our FORTRAN loop with non-unit stride. The good news is that we can easily interchange the
loops; each iteration is independent of every other:
DO J = 1 , N
DO I = 1 , N
A (J , I) = B (J , I) + C (J , I) * D
ENDDO
ENDDO
After interchange, A, B, and C are referenced with the leftmost subscript varying most quickly. This
modification can make an important difference in performance. We traded three N-stride memory
references for unit strides:
DO I =1 , N
DO J = 1 , N
A (J , I) = B (J , I) + C (J , I) * D
ENDDO
ENDDO
2.1.3 Virtual Memory
Virtual memory decouples the addresses used by the program (virtual addresses) from the actual
addresses where the data is stored in memory (physical addresses). Your program sees its address
space starting at 0 and working its way up to some large number, but the actual physical addresses
assigned can be very different. It gives a degree of flexibility by allowing all processes to believe they
have the entire memory system to themselves. Another trait of virtual memory systems is that they
divide your program's memory up into pages, i.e., chunks of memory. Page sizes vary from 512 bytes to 1 MB or
larger, depending on the machine. Pages don't have to be allocated contiguously, though your
program sees them that way. By being separated into pages, programs are easier to arrange in
memory, or move portions out to disk.
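As a quick, hedged illustration (a minimal sketch assuming a POSIX system; the variable x below is just a placeholder for any program data), the following C fragment prints the page size the operating system uses and the virtual address of a variable. The address printed is a virtual address; the physical location of the data may be anywhere, and may even change as pages are moved:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);   /* size of one page in bytes */
    double x = 0.0;                           /* any program variable */

    printf("page size      : %ld bytes\n", page_size);
    printf("virtual address: %p\n", (void *)&x);
    printf("virtual page   : %lu\n",
           (unsigned long)((uintptr_t)&x / (uintptr_t)page_size));
    return 0;
}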
2.2 Registers
At least the top layer of the memory hierarchy, the CPU registers, operate as fast as the rest of the
processor. The goal is to keep operands in the registers as much as possible. This is especially
important for intermediate values used in a long computation such as:
X = G * 2.41 + A / W - W * M
While computing the value of A divided by W, we must store the result of multiplying G by 2.41. It
would be a shame to have to store this intermediate result in memory and then reload it a few
instructions later. On any modern processor with moderate optimization, the intermediate result is
stored in a register. Also, the value W is used in two computations, and so it can be loaded once and
used twice to eliminate a wasted load. Compilers have been very good at detecting these types of
optimizations and efficiently making use of the available registers since the 1970s. Adding more
registers to the processor has some performance benefit. It's not practical to add enough registers to
the processor to store the entire problem data. So we must still use the slower memory technology.
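A hedged sketch of the same idea in C (the file name foo.c is hypothetical, and the exact register usage depends on your compiler and target):

/* foo.c : with optimization enabled, the compiler keeps the intermediate
   result of g * 2.41 in a register, and the value of w is held in a single
   register and reused for both the division and the multiplication. */
double f(double g, double a, double w, double m)
{
    return g * 2.41 + a / w - w * m;
}

Compiling with, say, gcc -O2 -S foo.c and reading the generated foo.s typically shows the whole expression evaluated in floating-point registers, with no intermediate stores to memory.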
2.3 Caches
Once we go beyond the registers in the memory hierarchy, we encounter caches. Caches are small
amounts of SRAM that store a subset of the contents of the memory. The hope is that the cache will
have the right subset of main memory at the right time. The actual cache architecture has had to
change as the cycle time of the processors has improved. The
processors are so fast that off-chip SRAM chips are not even fast
enough. This has led to a multilevel cache approach with one, or
even two, levels of cache implemented as part of the processor.
Table 2.3.1 shows the approximate speed of accessing the
memory hierarchy on a 500-MHz DEC Alpha. When every
reference can be found in a cache, you say that you have a 100% hit
rate. Generally, a hit rate of 90% or better is considered good for a
level-one (L1) cache. In level-two (L2) cache, a hit rate of above
50% is considered acceptable. Below that, application
performance can drop off steeply. One can characterize the average read performance of the memory
hierarchy by examining the probability that a particular load will be satisfied at a particular level of
the hierarchy. For example, assume a memory architecture with an L1 cache speed of 10 ns, L2 speed
of 30 ns, and memory speed of 300 ns. If a memory reference were satisfied from L1 cache 75% of
the time, L2 cache 20% of the time, and main memory 5% of the time, the average memory
performance would be:
(0.75 * 10 ) + ( 0.20 * 30 ) + ( 0.05 * 300 ) = 28.5 ns
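For reference, here is a minimal C sketch of that calculation; the hit rates and latencies are the illustrative figures used above, not measurements of any particular machine:

#include <stdio.h>

/* Weighted average access time for a three-level memory hierarchy. */
static double avg_access_ns(double l1_hit, double l2_hit, double mem_hit,
                            double l1_ns, double l2_ns, double mem_ns)
{
    return l1_hit * l1_ns + l2_hit * l2_ns + mem_hit * mem_ns;
}

int main(void)
{
    /* 75% of loads hit in L1, 20% in L2, and 5% go to main memory. */
    printf("average access time = %.1f ns\n",
           avg_access_ns(0.75, 0.20, 0.05, 10.0, 30.0, 300.0));
    return 0;
}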
You can easily see why it's important to have an L1 cache hit rate of 90% or higher. Given that a cache
holds only a subset of the main memory at any time, it's important to keep an index of which areas of
the main memory are currently stored in the cache. To reduce the amount of space that must be
dedicated to tracking which memory areas are in cache, the cache is divided into a number of equal
sized slots known as lines. Each line contains some number of sequential main memory locations,
generally four to sixteen integers or real numbers. Whereas the data within a line comes from the
same part of memory, other lines can contain data that is far separated within your program, or
perhaps data from somebody else's program, as in Figure 2.3.1 (Cache lines can come from
different parts of memory). When you ask for something from memory, the computer checks to see
if the data is available within one of these cache lines. If it is, the data is returned with a minimal
delay. If it's not, your program may be delayed while a new line is fetched from main memory.
Table 2.3.1 Memory Access Speed on a DEC Alpha
Registers: 2 ns
L1 On-Chip Cache: 4 ns
L2 On-Chip Cache: 5 ns
L3 Off-Chip Cache: 30 ns
Memory: 220 ns
Figure 2.3.1 Cache Lines can come from Different Parts of Memory
Of course, if a new line is brought in, another has to be thrown out. If you're lucky, it won't be the one
containing the data you are just about to need.
On multiprocessors (several CPUs), written data must be returned to main memory so the rest of the
processors can see it, or all other processors must be made aware of local cache activity. The problem
can become very complex in a multiprocessor system. Caches are effective because programs often exhibit characteristics that help keep the hit rate high; in particular, they often step through data sequentially, one element after the other. This access pattern is called unit stride because the address of each successive data element is incremented by one and all the data retrieved into the cache is
used. The following loop is a unit-stride loop:
DO I = 1 , 1000000
SUM = SUM + A (I)
END DO
When a program accesses a large data structure using non-unit stride, performance suffers because
data is loaded into cache that is not used. For example:
DO I = 1 , 1000000 , 8
SUM = SUM + A (I)
END DO
This code would experience the same number of cache misses as the previous loop, and the same
amount of data would be loaded into the cache. However, the program needs only one of the eight
32-bit words loaded into cache. Even though this program performs one-eighth the additions of the
previous loop, its elapsed time is roughly the same as the previous loop because the memory
operations dominate performance. While this example may seem a bit contrived, there are several
situations in which non-unit strides occur quite often. First, when a FORTRAN two-dimensional
array is stored in memory, successive elements in the first column are stored sequentially followed
by the elements of the second column. If the array is processed with the row iteration as the inner
loop, it produces a unit-stride reference pattern as follows:
REAL*4 A (200 , 200)
DO J = 1 , 200
DO I = 1 , 200
SUM = SUM + A (I , J)
END DO
END DO
Interestingly, a FORTRAN programmer would most likely write the loop (in alphabetical order) as
follows, producing a non-unit stride of 800 bytes between successive load operations:
REAL*4 A (200 , 200)
DO I = 1 , 200
DO J = 1 , 200
SUM = SUM + A (I , J)
END DO
END DO
Because of this, some compilers can detect this suboptimal loop order and reverse the order of the
loops to make best use of the memory system. As we will see, however, this code transformation may
produce different results, and so you may have to give the compiler permission to interchange these
loops in this particular example (or, after reading this book, you could just code it properly in the first
place).
A second common situation in which non-unit stride access arises is when a program traverses a linked list:

while ( ptr != NULL ) ptr = ptr -> next;
The next element that is retrieved is based on the contents of the current element. This type of loop
bounces all around memory in no particular pattern. This is called pointer chasing and there are no
good ways to improve the performance of this code. A third pattern often found in certain types of
codes is called gather (or scatter) and occurs in loops such as:
DO I = 1 , N
SUM = SUM + ARR ( IND (I) )
END DO
where the IND array contains offsets into the ARR array. Again, like the linked list, the exact pattern
of memory references is known only at runtime when the values stored in the IND array are known.
Some special-purpose systems have special hardware support to accelerate this particular operation.
2.3.1 Cache Organization
The process of pairing memory locations with cache lines is called mapping. Of course, given that a
cache is smaller than main memory, you have to share the same cache lines for different memory
locations. In caches, each cache line has a record of the memory address (called the tag) it represents
and perhaps when it was last used. The tag is used to track which area of memory is stored in a
particular cache line. The way memory locations (tags) are mapped to cache lines can have a
beneficial effect on the way your program runs, because if two heavily used memory locations map
onto the same cache line, the miss rate will be higher than you would like it to be. Caches can be
organized in one of several ways: direct mapped, fully associative, and set associative.
2.3.1.1 Direct-Mapped Cache
Direct mapping, as presented in Figure 2.3.2 (Many memory addresses map to the same cache line),
is the simplest algorithm for deciding how memory maps onto the cache. Say, for example, that your
computer has a 4-KB cache. In a direct mapped scheme, memory location 0 maps into cache location
0, as do memory locations 4K, 8K, 12K, etc. In other words, memory addresses map onto the cache modulo the cache size.
Figure 2.3.2 Many memory addresses map to the same cache line
Another way to think about it is to imagine a metal spring with a chalk line marked down the side.
Every time around the spring, you encounter the chalk line at the same place modulo the
circumference of the spring. If the spring is very long, the chalk line crosses many coils, the analog
being a large memory with many locations mapping into the same cache line. Problems occur when
alternating runtime memory references in a direct-mapped cache point to the same cache line. Each
reference causes a cache miss and replaces the entry just replaced, causing a lot of overhead. The
popular word for this is thrashing. When there is lots of thrashing, a cache can be more of a liability
than an asset because each cache miss requires that a cache line be refilled, an operation that moves
more data than merely satisfying the reference directly from main memory. It is easy to construct a
pathological case that causes thrashing in a 4-KB direct-mapped cache:
REAL*4 A (1024), B (1024)
COMMON /STUFF/ A , B
DO I=1,1024
A(I) = A(I) * B(I)
END DO
The arrays A and B both take up exactly 4 KB of storage, and their inclusion together in COMMON
assures that the arrays start exactly 4 KB apart in memory. In a 4-KB direct mapped cache, the same
line that is used for A(1) is used for B(1), and likewise for A(2) and B(2), etc., so alternating references
cause repeated cache misses. To fix it, you could either adjust the size of the array A, or put some
other variables into COMMON, between them. For this reason one should generally avoid array
dimensions that are close to powers of two.
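A hedged C sketch of the second fix (the sizes are those of the example above; a struct is used so the two arrays stay contiguous, the way COMMON lays them out in the FORTRAN version). Padding inserted between the arrays shifts B onto different cache lines than A:

/* Two 4-KB arrays that would otherwise collide in a 4-KB direct-mapped
   cache: a[i] and b[i] would map to the same cache line. The pad array
   shifts b so that alternating references no longer thrash. */
struct stuff {
    float a[1024];
    float pad[16];    /* 64 bytes of padding, roughly one cache line */
    float b[1024];
};

static struct stuff s;

void scale(void)
{
    for (int i = 0; i < 1024; i++)
        s.a[i] = s.a[i] * s.b[i];
}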
2.3.1.2 Fully Associative Cache
At the other extreme from a direct mapped cache is a fully associative cache, where any memory
location can be mapped into any cache line, regardless of memory address. Fully associative caches
get their name from the type of memory used to construct them: associative memory. Associative
memory is like regular memory, except that each memory cell knows something about the data it
contains. When the processor goes looking for a piece of data, the cache lines are asked all at once
whether any of them has it. The cache line containing the data holds up its hand and says "I have it";
if none of them do, there is a cache miss. It then becomes a question of which cache line will be
replaced with the new data. Rather than map memory locations to cache lines via an algorithm, like
a direct-mapped cache, the memory system can ask the fully associative cache lines to choose among
themselves which memory locations they will represent. Usually the least recently used line is the
one that gets overwritten with new data. The assumption is that if the data hasn't been used in quite
a while, it is least likely to be used in the future. Fully associative caches have superior utilization
when compared to direct mapped caches. It's difficult to find real-world examples of programs that
will cause thrashing in a fully associative cache. The expense of fully associative caches is very high,
in terms of size, price, and speed. The associative caches that do exist tend to be small.
2.3.1.3 Set-Associative Cache
Now imagine that you have two direct mapped caches sitting side by side in a single cache unit as
shown in Figure 2.3.3 (Two-way set-associative cache). Each memory location corresponds to a
particular cache line in each of the two direct-mapped caches. The one you choose to replace during
a cache miss is subject to a decision about whose line was used last, the same way the decision was made in a fully associative cache, except that now there are only two choices. This is called a set-
associative cache. Set-associative caches generally come in two and four separate banks of cache.
These are called two-way and four-way set associative caches, respectively. Of course, there are
benefits and drawbacks to each type of cache. A set associative cache is more immune to cache
thrashing than a direct-mapped cache of the same size, because for each mapping of a memory
address into a cache line, there are two or more choices where it can go. The beauty of a direct-
mapped cache, however, is that it's easy to implement and, if made large enough, will perform
roughly as well as a set-associative design. Your machine may contain multiple caches for several
different purposes. Here's a little program for causing thrashing in a 4-KB two-way set- associative
cache:
REAL*4 A(1024), B(1024), C(1024)
COMMON /STUFF/ A,B,C
DO I=1,1024
A(I) = A(I) * B(I) + C(I)
END DO
Like the previous cache thrasher program, this forces repeated accesses to the same cache lines,
except that now there are three variables contending for the same mapping instead of two.
Again, the way to fix it would be to change the size of the arrays or insert something in between them,
in COMMON. By the way, if you accidentally arranged a program to thrash like this, it would be hard
for you to detect it; aside from a feeling that the program runs a little slow. Few vendors provide
tools for measuring cache misses.
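As a sketch of the padding fix, the arrays can be separated in COMMON by small dummy arrays; the pad length of 129 words used here is illustrative, and any length that shifts B and C off A's set mapping will do:
REAL*4 A(1024), PAD1(129), B(1024), PAD2(129), C(1024)
COMMON /STUFF/ A, PAD1, B, PAD2, C
DO I=1,1024
A(I) = A(I) * B(I) + C(I)
END DO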
2.3.1.4 Instruction Cache
So far we have glossed over the two kinds of information you would expect to find in a cache between
main memory and the CPU: instructions and data. But if you think about it, the demand for data is
separate from the demand for instructions. In superscalar processors, for example, it's possible to
execute an instruction that causes a data cache miss alongside other instructions that require no data
from cache at all, i.e., they operate on registers. It doesn't seem fair that a cache miss on a data
reference in one instruction should keep you from fetching other instructions because the cache is
tied up. Furthermore, a cache depends on locality of reference between bits of data and other bits of
data or instructions and other instructions, but what kind of interplay is there between instructions
and data? It would seem possible for instructions to bump perfectly useful data from cache, or vice
versa, with complete disregard for locality of reference.
Figure 2.3.3 Two-Way Set-Associative Cache
Many designs from the 1980s used a single cache for both instructions and data. But newer designs
are employing what is known as the Harvard Memory Architecture, where the demand for data is
segregated from the demand for instructions. Main memory is still a single large pool, but these
processors have separate data and instruction caches, possibly of different designs. By providing two
independent sources for data and instructions, the aggregate rate of information coming from
memory is increased, and interference between the two types of memory references is minimized.
Also, instructions generally have an extremely high level of locality of reference because of the
sequential nature of most programs. Because the instruction caches don't have to be particularly
large to be effective, a typical architecture is to have separate L1 caches for instructions and data and
to have a combined L2 cache. For example, the IBM/Motorola PowerPC 604e has separate 32-K four-
way set-associative L1 caches for instruction and data and a combined L2 cache.
2.4 Timing a Program
Under UNIX, you can time program execution by placing the time command before everything else
you normally type on the command line. When the program finishes, a timing summary is produced.
For instance, if your program is called foo, you can time its execution by typing time foo. If you are
using the C shell or Korn shell, time is one of the shell's built-in commands. With a Bourne shell, time
is a separate command executable in /bin. In any case, the following information appears at the end
of the run:
User time
System time
Elapsed time
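For instance, timing a hypothetical executable named foo under a Korn or Bourne shell might produce a summary along the following lines; the layout and the numbers are purely illustrative, since each shell formats the report differently:
% time foo
real 1m32.4s
user 1m28.1s
sys 0m2.9s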
These timing figures are easier to understand with a little background. As your program runs, it
switches back and forth between two fundamentally different modes: user mode and kernel mode.
The normal operating state is user mode. It is in user mode that the instructions the compiler
generated on your behalf get executed, in addition to any subroutine library calls linked with your
program. It might be enough to run in user mode forever, except that programs generally need
other services, such as I/O, and these require the intervention of the operating system, the kernel. A
kernel service request made by your program, or perhaps an event from outside your program,
causes a switch from user mode into kernel mode.
Time spent executing in the two modes is accounted for separately. The user time figure describes
time spent in user mode. Similarly, system time is a measure of the time spent in kernel mode. As far
as user time goes, each program on the machine is accounted for separately. That is, you won't be
charged for activity in somebody else's application. System time accounting works the same way, for
the most part; however, you can, in some instances, be charged for some system services performed
on other people's behalf, in addition to your own. Incorrect charging occurs because your program
may be executing at the moment some outside activity causes an interrupt. This seems unfair, but
take consolation in the fact that it works both ways: other users may be charged for your system
activity too, for the same reason. Taken together, user time and system time are called CPU time.
Generally, the user time is far greater than the system time. You would expect this because most
applications only occasionally ask for system services. In fact, a disproportionately large system time
probably indicates some trouble.
For instance, programs that are repeatedly generating exception conditions, such as page faults,
misaligned memory references, or floating-point exceptions, use an inordinate amount of system
time. Time spent doing things like seeking on a disk, rewinding a tape, or waiting for characters at
the terminal doesn't show up in CPU time. That's because these activities don't require the CPU; the
CPU is free to go on and execute other programs. The third piece of information (corresponding to
the third set of hands on the watch), elapsed time, is a measure of the actual (wall clock) time that
has passed since the program was started. For programs that spend most of their time computing,
the elapsed time should be close to the CPU time. Reasons why elapsed
time might be greater are:
You are timesharing the machine with other active programs. (The uptime command gives a rough indication of the other activity on your machine; the last three fields report the average number of processes ready to run during the last 1, 5, and 15 minutes, respectively.)
Your application performs a lot of I/O.
Your application requires more memory bandwidth than is available on the machine.
Your program was paging or swapped.
People often record the CPU time and use it as an estimate for elapsed time. Using CPU time is okay
on a single CPU machine, provided you have seen the program run when the machine was quiet and
noticed the two numbers were very close together. But for multiprocessors, the total CPU time can
be far different from the elapsed time. Whenever there is a doubt, wait until you have the machine to
your- self and time your program then, using elapsed time. It is very important to produce timing
results that can be verified using another run when the results are being used to make important
purchasing decisions.
If you are running on a Berkeley UNIX derivative, the C shell's built-in time command can report a
number of other useful statistics. Check with your csh manual page for more possibilities. In addition
to figures for CPU and elapsed time, csh time command produces information about CPU utilization,
page faults, swaps, blocked I/O operations (usually disk activity), and some measures of how much
physical memory our program occupied when it ran.
2.4.1 Timing a Portion of the Program
For some benchmarking or tuning efforts, measurements taken on the outside of the program tell you
everything you need to know. But if you are trying to isolate performance figures for individual loops
or portions of the code, you may want to include timing routines on the inside too. The basic
technique is simple enough:
1. Record the time before you start doing X.
2. Do X.
3. Record the time at completion of X.
4. Subtract the start time from the completion time.
If, for instance, X's primary job is to calculate particle positions, divide by the total time to obtain a
number for particle positions/second. You have to be careful though; too many calls to the timing
routines, and the observer becomes part of the experiment. The timing routines take time too, and
their very presence can increase instruction cache misses or paging. Furthermore, you want X to take
a significant amount of time so that the measurements are meaningful. Paying attention to the time
between timer calls is really important because the clock used by the timing functions has a limited
resolution. An event that occurs within a fraction of a second is hard to measure with any accuracy.
2.4.2 Getting Time Information
In this section, we discuss methods for getting various timer values during the execution of your
program. For FORTRAN programs, a library timing function found on many machines is called etime,
which takes a two-element REAL*4 array as an argument and fills the slots with the user CPU time
and system CPU time, respectively. The value returned by the function is the sum of the two. Here's
how etime is often used:
real*4 tarray(2), etime
real*4 start, finish
start = etime(tarray)
finish = etime(tarray)
write (*,*) 'CPU time: ', finish - start
Not every vendor supplies an etime function; in fact, one doesn't provide a timing routine for
FORTRAN at all. Try it first. If it shows up as an undefined symbol when the program is linked, you
can use the following C routine. It provides the same functionality as etime:
#include <sys/times.h>
#define TICKS 100.
float etime (parts)
struct {
float user;
float system;
} *parts;
{
struct tms local;
times (&local);
parts->user= (float) local.tms_utime/TICKS;
parts->system = (float) local.tms_stime/TICKS;
return (parts->user + parts->system);
}
There are a couple of things you might have to tweak to make it work. First of all, linking C routines
with FORTRAN routines on your computer may require you to add an underscore (_) after the
function name. This changes the entry to float etime_ (parts). Furthermore, you might have to adjust
the TICKS parameter. We assumed that the system clock had a resolution of 1/100 of a second (true
for the Hewlett-Packard machines that this version of etime was written for). 1/60 is very common.
On an RS-6000 the number would be 1000. You may find the value in a file named
/usr/include/sys/param.h on your machine, or you can determine it empirically. A C routine for
retrieving the wall time by calling gettimeofday is shown below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
void hpcwall(double *retval)
{
static long zsec = 0;
static long zusec = 0;
double esec;
struct timeval tp;
struct timezone tzp;
gettimeofday(&tp, &tzp);
if ( zsec == 0 ) zsec = tp.tv_sec;
if ( zusec == 0 ) zusec = tp.tv_usec;
*retval = (tp.tv_sec - zsec) + (tp.tv_usec - zusec ) * 0.000001 ;
}
void hpcwall_(double *retval) { hpcwall(retval); } /* Other convention */
Given that you will often need both CPU and wall time, and you will be continually computing the
difference between successive calls to these routines, you may want to write a routine that returns
the elapsed wall and CPU time upon each call.
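A minimal sketch of such a routine, built on the etime and hpcwall routines shown above, might look like the following; the subroutine name WHEN and its argument order are illustrative rather than any standard interface:
SUBROUTINE WHEN(DWALL, DCPU)
C Returns the wall-clock and CPU seconds elapsed since the previous call.
C On the first call the values are measured from when hpcwall was zeroed
C and from process start, respectively.
REAL*8 DWALL, WALL, LASTW
REAL*4 DCPU, CPU, LASTC, TARRAY(2), ETIME
SAVE LASTW, LASTC
DATA LASTW /0.0D0/, LASTC /0.0/
CALL HPCWALL(WALL)
CPU = ETIME(TARRAY)
DWALL = WALL - LASTW
DCPU = CPU - LASTC
LASTW = WALL
LASTC = CPU
END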
2.5 Subroutine Profiling
Sometimes you want more detail than the overall timing of the application. But you don't have time
to modify the code to insert several hundred etime calls into your code. Profiles are also very useful
when you have been handed a strange 20,000-line application program and told to figure out how it
works and then improve its performance. Most compilers provide a facility to automatically insert
timing calls into your code at the entry and exit of each routine at compile time. While your program
runs, the entry and exit times are recorded and then dumped into a file. A separate utility summarizes
the execution patterns and produces a report that shows the percentage of the time spent in each of
your routines and the library routines. The profile gives you a sense of the shape of the execution
profile. That is, you can see that 10% of the time is spent in subroutine A, 5% in subroutine B, etc.
Naturally, if you add all of the routines together they should account for 100% of the overall time
spent. From these percentages you can construct a picture a profile of how execution is distributed
when the program runs. Though not representative of any particular profiling tool, the histograms in
Figure 2.5.1-left (Sharp profile) and Figure 2.5.1-right (Flat profile ) depict these percentages,
sorted from left to right, with each vertical column representing a different routine. They help
illustrate different profile shapes. A sharp profile says that most of the time is spent in one or two
procedures, and if you want to improve the program's performance you should focus your efforts on
tuning those procedures. A minor optimization in a heavily executed line of code can sometimes have
a great effect on the overall runtime, given the right opportunity. A flat profile, on the other hand, tells
you that the runtime is spread across many routines, and effort spent optimizing any one or two will
have little benefit in speeding up the program. Of course, there are also programs whose execution
profile falls somewhere in the middle.
We cannot predict with absolute certainty what you are likely to find when you profile your
programs, but there are some general trends. For instance, engineering and scientific codes built
around matrix solutions often exhibit very sharp profiles. The runtime is dominated by the work
performed in a handful of routines. To tune the code, you need to focus your efforts on those routines
to make them more efficient. It may involve restructuring loops to expose parallelism, providing hints
to the compiler, or rearranging memory references. In any case, the challenge is tangible; you can see
the problems you have to fix.
Figure 2.5.1 Sharp Profiling (left) vs. Flat Profiling (right)
There are limits to how much tuning one or two routines will improve
your runtime, of course. An often quoted rule of thumb is Amdahl's Law, derived from remarks made
in 1967 by one of the designers of the IBM 360 series, and founder of Amdahl Computer, Gene
Amdahl. Strictly speaking, his remarks were about the performance potential of parallel computers,
but people have adapted Amdahl's Law to describe other things too. For our purposes, it goes like
this: Say you have a program with two parts, one that can be optimized so that it goes infinitely fast
and another that can't be optimized at all. Even if the optimizable portion makes up 50% of the initial
runtime, at best you will be able to cut the total runtime in half. That is, your runtime will eventually
be dominated by the portion that can't be optimized. This puts an upper limit on your expectations
when tuning.
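Stated as a formula, the usual form of Amdahl's Law says that if a fraction P of the original runtime can be sped up by a factor S, while the remaining fraction (1 - P) is left untouched, the overall speedup is
Speedup = 1 / ( (1 - P) + P / S )
With P = 0.5, even letting S grow without bound yields a speedup of at most 2, which is the factor-of-two ceiling described above.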
Even given the finite return on effort suggested by Amdahl's Law, tuning a program with a sharp
profile can be rewarding. Programs with flat profiles are much more difficult to tune. These are often
system codes, nonnumeric applications, and varieties of numerical codes without matrix solutions. It
takes a global tuning approach to reduce, to any justifiable degree, the runtime of a program with a
flat profile. For instance, you can sometimes optimize instruction cache usage, which is complicated
because of the program's equal distribution of activity among a large number of routines. It can also
help to reduce subroutine call overhead by folding callees into callers. Occasionally, you can find a
memory reference problem that is endemic to the whole program and one that can be fixed all at
once.
When you look at a profile, you might find an unusually large percentage of time spent in the library
routines such as log, exp, or sin. Often these functions are done in software routines rather than
inline. You may be able to rewrite your code to eliminate some of these operations. Another
important pattern to look for is when a routine takes far longer than you expect. Unexpected
execution time may indicate you are accessing memory in a pattern that is bad for performance or
that some aspect of the code cannot be optimized properly. In any case, to get a profile, you need a
profiler. One or two subroutine profilers come standard with the software development
environments on all UNIX machines. We cover two of them: prof and gprof. In addition, we mention
a few line-by-line profilers. Subroutine profilers can give you a general overall view of where time is
being spent. You probably should start with prof, if you have it (most machines do). Otherwise, use
gprof. After that, you can move to a line-by-line profiler if you need to know which statements take
the most time.
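As a rough illustration of the usual workflow (the exact compiler flags, file names, and report formats vary from vendor to vendor), a profiled run often looks something like this:
% cc -p -o foo foo.c
% foo
% prof foo
% cc -pg -o foo foo.c
% foo
% gprof foo gmon.out > foo.profile
The -p and -pg options instrument the executable; running it produces mon.out or gmon.out, which prof and gprof then turn into the percentage-by-routine report described above.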
2.6 Loop Optimizations
In nearly all high performance applications, loops are where the majority of the execution time is
spent. In this chapter we focus on techniques used to improve the performance of these "clutter-free"
loops. Sometimes the compiler is clever enough to generate the faster versions of the loops, and other
times we have to do some rewriting of the loops ourselves to help the compiler. It's important to
remember that one compiler's performance enhancing modifications are another compiler's clutter.
When you make modifications in the name of performance you must make sure you're helping by
testing the performance with and without the modifications. Also, when you move to another
architecture you need to make sure that any modifications aren't hindering performance. For this
reason, you should choose your performance-related modifications wisely. You should also keep the
original (simple) version of the code for testing on new architectures. Also if the benefit of the
modification is small, you should probably keep the code in its most simple and clear form. The
different loop optimization techniques include:
Loop unrolling
Nested loop optimization
Loop interchange
Memory reference optimization
Blocking
Out-of-core solutions
Someday, it may be possible for a compiler to perform all these loop optimizations automatically.
Typically loop unrolling is performed as part of the normal compiler optimizations. Other
optimizations may have to be triggered using explicit compile-time options. As you contemplate
making manual changes, look carefully at which of these optimizations can be done by the compiler.
Also run some tests to determine if the compiler optimizations are as good as hand optimizations.
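As a hedged illustration, with the GNU compilers the kinds of options involved include the following; other vendors use different option names, so check your compiler's documentation:
% gcc -O3 -funroll-loops foo.c
% gfortran -O3 -funroll-loops foo.f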
2.6.1 Operation Counting
Before you begin to rewrite a loop body or reorganize the order of the loops, you must have some
idea of what the body of the loop does for each iteration. Operation counting is the process of
surveying a loop to understand the operation mix. You need to count the number of loads, stores,
floating-point, integer, and library calls per iteration of the loop. From the count, you can see how
well the operation mix of a given loop matches the capabilities of the processor. Of course, operation
counting doesn't guarantee that the compiler will generate an efficient representation of a loop. But
it generally provides enough insight to the loop to direct tuning efforts. Bear in mind that an
instruction mix that is balanced for one machine may be imbalanced for another. Processors on the
market today can generally issue some combination of one to four operations per clock cycle. Address
arithmetic is often embedded in the instructions that reference memory. Because the compiler can
replace complicated loop address calculations with simple expressions (provided the pattern of
addresses is predictable), you can often ignore address arithmetic when counting operations. Let's
look at a few loops and see what we can learn about the instruction mix:
DO I = 1,N
A(I, J, K) = A(I, J, K) + B( J, I, K)
ENDDO
This loop contains one floating-point addition and three memory references (two loads and a store).
There are some complicated array index expressions, but these will probably be simplified by the
compiler and executed in the same cycle as the memory and floating-point operations. For each
iteration of the loop, we must increment the index variable and test to determine if the loop has
completed. A 3:1 ratio of memory references to floating-point operations suggests that we can hope
for no more than 1/3 peak floating-point performance from the loop unless we have more than one
path to memory. That's bad news, but good information. The ratio tells us that we ought to consider
memory reference optimizations first. The loop below contains one floating-point addition and two
memory operations: a load and a store. Operand B(J) is loop-invariant, so its value only needs to be
loaded once, upon entry to the loop:
DO I=1,N
A(I) = A(I) + B(J)
ENDDO
Again, our floating-point throughput is limited, though not as severely as in the previous loop. The
ratio of memory references to floating-point operations is 2:1. The next example shows a loop with
better prospects. It performs element-wise multiplication of two vectors of complex numbers and
assigns the results back to the first. There are six memory operations (four loads and two stores) and
six floating-point operations (two additions and four multiplications):
for (i=0; i <n; i++)
{
xr[i] = xr[i] * yr[i] - xi[i] * yi[i];
xi[i] = xr[i] * yi[i] + xi[i] * yr[i];
}
It appears that this loop is roughly balanced for a processor that can perform the same number of
memory operations and floating-point operations per cycle. However, it might not be. Many
processors perform a floating-point multiply and add in a single instruction. If the compiler is good
enough to recognize that the multiply-add is appropriate, this loop may also be limited by memory
references; each iteration would be compiled into two multiplications and two multiply-adds. Again,
operation counting is a simple way to estimate how well the requirements of a loop will map onto
the capabilities of the machine. For many loops, you often find the performance of the loops
dominated by memory references, as we have seen in the last three examples. This suggests that
memory reference tuning is very important.
2.6.2 Basic Loop Un-Rolling
The most basic form of loop optimization is loop un-rolling. It is so basic that most of today's
compilers do it automatically if it looks like there's a benefit. There has been a great deal of clutter
introduced into old dusty-deck FORTRAN programs in the name of loop unrolling that now serves
only to confuse and mislead today's compilers. We're not suggesting that you unroll any loops by
hand. The purpose of this section is twofold. First, once you are familiar with loop unrolling, you
might recognize code that was unrolled by a programmer (not you) some time ago and simplify the
code. Second, you need to understand the concepts of loop unrolling so that when you look at
generated machine code, you recognize unrolled loops. The primary benefit in loop unrolling is to
perform more computations per iteration. At the end of each iteration, the index value must be
incremented, tested, and the control is branched back to the top of the loop if the loop has more
iterations to process. By unrolling the loop, there are fewer "loop ends" per loop execution. Unrolling
also reduces the overall number of branches significantly and gives the processor more instructions
between branches (i.e., it increases the size of the basic blocks). For illustration, consider the
following loop. It has a single statement wrapped in a do-loop:
DO I=1,N
A(I) = A(I) + B(I) * C
ENDDO
You can unroll the loop, as we have below, giving you the same operations in fewer iterations with
less loop overhead. You can imagine how this would help on any computer. Because the computations
in one iteration do not depend on the computations in other iterations, calculations from different
iterations can be executed together. On a superscalar processor, portions of these four statements
may actually execute in parallel:
DO I=1,N,4
A(I) = A(I) + B(I) * C
A(I+1) = A(I+1) + B(I+1) * C
A(I+2) = A(I+2) + B(I+2) * C
A(I+3) = A(I+3) + B(I+3) * C
ENDDO
However, this loop is not exactly the same as the previous loop. The loop is unrolled four times, but
what if N is not divisible by 4? If not, there will be one, two, or three spare iterations that don't get
executed. To handle these extra iterations, we add another little loop to soak them up. The extra loop
is called a preconditioning loop:
II = MOD(N,4)
DO I=1,II
A(I) = A(I) + B(I) * C
ENDDO
DO I=1+II,N,4
A(I) = A(I) + B(I) * C
A(I+1) = A(I+1) + B(I+1) * C
A(I+2) = A(I+2) + B(I+2) * C
A(I+3) = A(I+3) + B(I+3) * C
ENDDO
The number of iterations needed in the preconditioning loop is the total iteration count modulo the
unrolling amount. If, at runtime, N turns out to be divisible by 4, there are no spare iterations, and
the preconditioning loop isn't executed. Speculative execution in the post-RISC architecture can
reduce or eliminate the need for unrolling a loop that will operate on values that must be retrieved
from main memory. Because the load operations take such a long time relative to the computations,
the loop is naturally unrolled. While the processor is waiting for the first load to finish, it may
speculatively execute three to four iterations of the loop ahead of the first load, effectively unrolling
the loop in the Instruction Reorder Buffer.
2.6.3 Loops with Low Trip Counts
To be effective, loop unrolling requires a fairly large number of iterations in the original loop. To
understand why, picture what happens if the total iteration count is low, perhaps less than 10, or
even less than 4. With a trip count this low, the preconditioning loop is doing a proportionately large
amount of the work. It's not supposed to be that way. The preconditioning loop is supposed to catch
the few leftover iterations missed by the unrolled, main loop. However, when the trip count is low,
you make one or two passes through the unrolled loop, plus one or two passes through the
preconditioning loop. In other words, you have more clutter; the loop shouldn't have been unrolled
in the first place. Probably the only time it makes sense to unroll a loop with a low trip count is when
the number of iterations is constant and known at compile time. For instance, suppose you had the
following loop:
PARAMETER (NITER = 3)
DO I=1,NITER
A(I) = B(I) * C
ENDDO
Because NITER is hardwired to 3, you can safely unroll to a depth of 3 without worrying about a
preconditioning loop. In fact, you can throw out the loop structure altogether and leave just the
unrolled loop innards:
PARAMETER (NITER = 3)
A(1) = B(1) * C
A(2) = B(2) * C
A(3) = B(3) * C
Of course, if a loop's trip count is low, it probably won't contribute significantly to the overall runtime,
unless you find such a loop at the center of a larger loop. Then you either want to unroll it completely
or leave it alone.
2.6.4 Fat Loops
Loop unrolling helps performance because it fattens up a loop with more calculations per iteration.
By the same token, if a particular loop is already fat, unrolling isn't going to help. The loop overhead
is already spread over a fair number of instructions. In fact, unrolling a fat loop may even slow your
program down because it increases the size of the text segment, placing an added burden on the
memory system (we'll explain this in greater detail shortly). A good rule of thumb is to look
elsewhere for performance when the loop innards exceed three or four statements.
2.6.5 Loops Containing Procedure Calls
As with fat loops, loops containing subroutine or function calls generally aren't good candidates for
unrolling. There are several reasons. First, they often contain a fair number of instructions already.
And if the subroutine being called is fat, it makes the loop that calls it fat as well. The size of the loop
may not be apparent when you look at the loop; the function call can conceal many more instructions.
Second, when the calling routine and the subroutine are compiled separately, it's impossible for the
compiler to intermix instructions. A loop that is unrolled into a series of function calls behaves much
like the original loop, before unrolling. Last, function call overhead is expensive. Registers have to be
saved; argument lists have to be prepared. The time spent calling and returning from a subroutine
can be much greater than that of the loop overhead. Unrolling to amortize the cost of the loop
structure over several calls doesn't buy you enough to be worth the effort. The general rule when
dealing with procedures is to first try to eliminate them in the remove clutter phase, and when this
has been done, check to see if unrolling gives an additional performance improvement.
2.6.6 Loops with Branches
Earlier we showed how to eliminate certain types of branches, but of course, we couldn't get rid of
them all. In cases of iteration-independent branches, there might be some benefit to loop unrolling.
The IF test becomes part of the operations that must be counted to determine the value of loop
unrolling. Below is a doubly nested loop. The inner loop tests the value of B(J,I):
DO I=1,N
DO J=1,N
IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
ENDDO
ENDDO
Each iteration is independent of every other, so unrolling it won't be a problem. We'll just leave the
outer loop undisturbed:
II = MOD(N,4)
DO I=1,N
DO J=1,II
IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
ENDDO
DO J=II+1,N,4
IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
IF (B(J+1,I) .GT. 1.0) A(J+1,I) = A(J+1,I) + B(J+1,I) * C
IF (B(J+2,I) .GT. 1.0) A(J+2,I) = A(J+2,I) + B(J+2,I) * C
IF (B(J+3,I) .GT. 1.0) A(J+3,I) = A(J+3,I) + B(J+3,I) * C
ENDDO
ENDDO
This approach works particularly well if the processor you are using supports conditional execution.
As described earlier, conditional execution can replace a branch and an operation with a single
conditionally executed assignment. On a superscalar processor with conditional execution, this
unrolled loop executes quite nicely.
2.6.7 Nested Loops
When you embed loops within other loops, you create a loop nest. The loop or loops in the center are
called the inner loops. The surrounding loops are called outer loops. Depending on the construction
of the loop nest, we may have some flexibility in the ordering of the loops. At times, we can swap the
outer and inner loops with great benefit. In the next sections we look at some common loop nestings
and the optimizations that can be performed on these loop nests. Often when we are working with
nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays
can lead to non-unit-stride memory access. Many of the optimizations we perform on loop nests are
meant to improve the memory access patterns. First, we examine the computation-related
optimizations followed by the memory optimizations.
2.6.8 Outer Loop Un-Rolling
If you are faced with a loop nest, one simple approach is to unroll the inner loop. Unrolling the
innermost loop in a nest isn't any different from what we saw above. You just pretend the rest of the
loop nest doesn't exist and approach it in the normal way. However, there are times when you want
to apply loop unrolling not just to the inner loop, but to outer loops as well or perhaps only to the
outer loops. Here's a typical loop nest:
for (i=0; i<n; i++)
for (j=0; j<n; j++)
for (k=0; k<n; k++)
a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
To unroll an outer loop, you pick one of the outer loop index variables and replicate the innermost
loop body so that several iterations are performed at the same time. The difference is in the index
variable for which you unroll. In the code below, we have unrolled the middle (j) loop twice:
for (i=0; i<n; i++)
for (j=0; j<n; j+=2)
for (k=0; k<n; k++)
{
a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
a[i][j+1][k] = a[i][j+1][k] + b[i][j+1][k] * c;
}
We left the k loop untouched; however, we could unroll that one, too. That would give us outer and
inner loop unrolling at the same time:
for (i=0; i<n; i++)
for (j=0; j<n; j+=2)
for (k=0; k<n; k+=2)
{
a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
a[i][j+1][k] = a[i][j+1][k] + b[i][j+1][k] * c;
a[i][j][k+1] = a[i][j][k+1] + b[i][j][k+1] * c;
a[i][j+1][k+1] = a[i][j+1][k+1] + b[i][j+1][k+1] * c;
}
We could even unroll the i loop too, leaving eight copies of the loop innards. (Notice that we
completely ignored preconditioning; in a real application, of course, we couldn't.)
2.6.9 Loop Interchange to Move Computations to the Center
When someone writes a program that represents some kind of real-world model, they often structure
the code in terms of the model. This makes perfect sense. The computer is an analysis tool; you aren't
writing the code on the computer's behalf. However, a model expressed naturally often works on one
point in space at a time, which tends to give you insignificant inner loops at least in terms of the trip
count. For performance, you might want to interchange inner and outer loops to pull the activity into
the center, where you can then do some unrolling. Let's illustrate with an example. Here's a loop
where KDIM time-dependent quantities for points in a two-dimensional mesh are being updated:
PARAMETER (IDIM = 1000, JDIM = 1000, KDIM = 3)
...
DO I=1,IDIM
DO J=1,JDIM
DO K=1,KDIM
D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
ENDDO
ENDDO
ENDDO
In practice, KDIM is probably equal to 2 or 3, where J or I, representing the number of points, may be
in the thousands. The way it is written, the inner loop has a very low trip count, making it a poor
candidate for unrolling. By interchanging the loops, you update one quantity at a time, across all of
the points. For tuning purposes, this moves larger trip counts into the inner loop and allows you to
do some strategic un-rolling:
PARAMETER (IDIM = 1000 , JDIM = 1000 , KDIM = 3)
DO K = 1 , KDIM
DO J = 1 , JDIM
DO I = 1 , IDIM
D (K , J , I) = D (K , J , I) + V (K , J , I) * DT
ENDDO
ENDDO
ENDDO
This example is straightforward; it's easy to see that there are no inter-iteration dependencies. But
how can you tell, in general, when two loops can be interchanged? Interchanging loops might violate
some dependency, or worse, only violate it occasionally, meaning you might not catch it when
optimizing. While it is possible to examine the loops by hand and determine the dependencies, it is
much better if the compiler can make the determination. Very few single-processor compilers
automatically perform loop interchange. However, the compilers for high-end vector and parallel
computers generally interchange loops if there is some benefit and if interchanging the loops won't
alter the program results.
2.7 Matrix Multiplication
Matrix multiplication is a common operation we can use to explore the options that are available in
optimizing a loop nest. A programmer who has just finished reading a linear algebra textbook would
probably write matrix multiply as it appears in the example below:
DO I=1,N
DO J=1,N
SUM = 0
DO K=1,N
SUM = SUM + A(I,K) * B(K,J)
ENDDO
C(I,J) = SUM
ENDDO
ENDDO
The problem with this loop is that the A (I,K) will be non-unit stride. Each iteration in the inner loop
consists of two loads (one non-unit stride), a multiplication, and an addition. Given the nature of the
matrix multiplication, it might appear that you can't eliminate the non-unit stride. However, with a
simple rewrite of the loops all the memory accesses can be made unit stride:
DO J = 1 , N
DO I =1 , N
C( I , J) = 0.0
ENDDO
ENDDO
DO K=1,N
DO J=1,N
SCALE = B(K,J)
DO I=1,N
C(I,J) = C(I,J) + A(I,K) * SCALE
ENDDO
ENDDO
ENDDO
Now, the inner loop accesses memory using unit stride. Each iteration performs two loads, one store,
a multiplication, and an addition. When comparing this to the previous loop, the non-unit stride loads
have been eliminated, but there is an additional store operation. Assuming that we are operating on
a cache-based system, and the matrix is larger than the cache, this extra store won't add much to the
execution time. The store is to the location in C(I,J) that was used in the load. In most cases, the store
is to a line that is already in the cache. The B(K,J) becomes a constant scaling factor within the
inner loop.
2.7.1 Matrix Optimization
In the matrix multiplication code, we encountered a non-unit stride and were able to eliminate it with
a quick interchange of the loops. Unfortunately, life is rarely this simple. Often you find some mix of
variables with unit and non-unit strides, in which case interchanging the loops moves the damage
around, but doesn't make it go away. The loop to perform a matrix transpose represents a simple
example of this dilemma:
DO I =1 , N
DO J =1 , M
A (J , I) = B (I , J)
ENDDO
ENDDO
Whichever way you interchange them, you will break the memory access pattern for either A or B.
Even more interesting, you have to make a choice between stride loads vs. stride stores: which will
it be? We really need a general method for improving the memory access patterns for both A and B,
not one or the other. We'll show you such a method later.
2.7.2 Blocking to Ease Memory Access Patterns
Blocking is another kind of memory reference optimization. As with loop interchange, the challenge
is to retrieve as much data as possible with as few cache misses as possible. We'd like to rearrange
the loop nest so that it works on data in little neighborhoods, rather than striding through memory
like a man on stilts. Given the following vector sum, how can we rearrange the loop?
DO I=1,N
DO J=1,N
A(J,I) = A(J,I) + B(I,J)
ENDDO
ENDDO
This loop involves two vectors. One is referenced with unit stride, the other with a stride of N. We
can interchange the loops, but one way or another we still have N-step array references on either A
or B, either of which is undesirable. The trick is to block references so that you grab a few elements
of A, and then a few of B, and then a few of A, and so on, in neighborhoods. We make this happen by
combining inner and outer loop unrolling. The difference is in the way the processor handles updates
of main memory from cache.
DO I=1,N,2
DO J=1,N,2
A(J,I) = A(J,I) + B(I,J)
A(J+1,I) = A(J+1,I) + B(I,J+1)
A(J,I+1) = A(J,I+1) + B(I+1,J)
A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
ENDDO
ENDDO
Use your imagination so we can show why this helps. Usually, when we think of a two-dimensional
array, we think of a rectangle or a square (arrays A and B). Remember, to make programming easier,
the compiler provides the illusion that two-dimensional arrays A and B are rectangular plots of
memory (Arrays A and B). Actually, memory is sequential storage. In FORTRAN, a two-dimensional
array is constructed in memory by logically lining memory strips up against each other, like the
pickets of a cedar fence. (It's the other way around in C: rows are stacked on top of one another.)
Array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of
the next column. Stepping through the array with unit stride traces out the shape of a backwards N
repeated over and over, moving to the right.
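A fully blocked version of the same loop nest makes the neighborhoods explicit with an extra pair of block loops; the block size of 16 used here is illustrative, and in practice it is chosen so that a block of A and a block of B fit in cache together:
DO II=1,N,16
DO JJ=1,N,16
DO I=II,MIN(II+15,N)
DO J=JJ,MIN(JJ+15,N)
A(J,I) = A(J,I) + B(I,J)
ENDDO
ENDDO
ENDDO
ENDDO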
2.8 Shared-Memory Parallel Processors
As we move into programming on multiprocessors, we must increase our understanding of
parallelism in order to understand how to effectively program these systems. In short, as we gain
more parallel resources, we need to find more parallelism in our code. Some examples of parallelism
in order of increasing grain size are:
When performing a 32-bit integer addition, using a carry-lookahead adder, you can partially
add bits 0 and 1 at the same time as bits 2 and 3.
On a pipelined processor, while decoding one instruction, you can fetch the next instruction.
On a two-way superscalar processor, you can execute any combination of an integer and a
floating-point instruction in a single cycle.
On a multiprocessor, you can divide the iterations of a loop among the four processors of the
system.
You can split a large array across four workstations attached to a network. Each workstation
can operate on its local information and then exchange boundary values at the end of each
time step.
We start at instruction-level parallelism (pipelined and superscalar) and move toward thread level
parallelism, which is what we need for multiprocessor systems. It is important to note that the
different levels of parallelism are generally not in conflict. Increasing thread parallelism at a coarser
grain size often exposes more fine-grained parallelism. The following is a loop that has plenty of
parallelism:
DO I = 1 , 16000
A (I) = B (I) * 3.14159
ENDDO
We have expressed the loop in a way that would imply that A (1) must be computed first, followed
by A (2), and so on. However, once the loop was completed, it would not have mattered if A (16000),
were computed first followed by A (15999), and so on. The loop could have computed the even values
of I and then computed the odd values of I. It would not even make a difference if all 16,000 of the
iterations were computed simultaneously using a 16,000-way superscalar processor. If the compiler
has flexibility in the order in which it can execute the instructions that make up your program, it can
execute those instructions simultaneously when parallel hardware is available.
One technique that computer scientists use to formally analyze the potential parallelism in an
algorithm is to characterize how quickly it would execute with an infinite-way superscalar processor.
Not all loops contain as much parallelism as this simple loop. We need to identify the things that limit
the parallelism in our codes and remove them whenever possible. Helping the compiler recognize
parallelism is one of the basic approaches specialists take in tuning code. A slight rewording of a loop
or some supplementary information supplied to the compiler can change a "we don't know" answer
into an opportunity for parallelism. To be certain, there are other facets to tuning as well, such as
optimizing memory access patterns so that they best suit the hardware, or recasting an algorithm.
And there is no single best approach to every problem; any tuning effort has to be a combination of
techniques.
2.8.1 Dependencies
Imagine a symphony orchestra where each musician plays without regard to the conductor or the
other musicians. At the first tap of the conductor's baton, each musician goes through all of his or her
sheet music. Some finish far ahead of others, leave the stage, and go home. The cacophony wouldn't
resemble music (come to think of it, it would resemble experimental jazz) because it would be totally
uncoordinated. Of course this isn't how music is played. A computer program, like a musical piece, is
woven on a fabric that unfolds in time (though perhaps woven more loosely). Certain things must
happen before or along with others, and there is a rate to the whole process. With computer
programs, whenever event A must occur before event B can, we say that B is dependent on A. We call
the relationship between them a dependency. Sometimes dependencies exist because of calculations
or memory operations; we call these data dependencies. Other times, we are waiting for a branch or
do-loop exit to take place; this is called a control dependency. Each is present in every program to
varying degrees. The goal is to eliminate as many dependencies as possible. Rearranging a program
so that two chunks of the computation are less dependent exposes parallelism, or opportunities to
do several things at once.
2.8.1.1 Control Dependencies
Just as variable assignments can depend on other assignments, a variable's value can also depend on
the flow of control within the program. For instance, an assignment within an if-statement can occur
only if the conditional evaluates to true. The same can be said of an assignment within a loop. If the
loop is never entered, no statements inside the loop are executed. Interestingly, this is not as far-
fetched as it might seem; on a single-instruction, multiple-data (SIMD) computer such as the
Connection Machine CM-2, with 16,384 processors, it would take only three instruction cycles to process
such a loop. When calculations occur as a consequence of the flow of control, we say there is a control
dependency, as in the code below and shown graphically in Figure 2.8.1 (a). The assignment located
inside the block-if may or may not be executed, depending on the outcome of the test X .NE. 0. In other
words, the value of Y depends on the flow of control in the code around it. Again, this may sound to
you like a concern for compiler designers, not programmers, and that's mostly true.
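A small fragment of the kind being described, with the test X .NE. 0 guarding an assignment to Y (the particular expression assigned here is only illustrative), is:
IF (X .NE. 0) THEN
Y = 1.0 / X
ENDIF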
But there are times when you might want to move control-dependent instructions around to get
expensive calculations out of the way (provided your compiler isn't smart enough to do it for you).
For example, say that Figure 2.8.1 (b) represents a little section of your program. Flow of control
enters at the top and goes through two branch decisions. Furthermore, say that there is a square root
operation at the entry point, and that the flow of control almost always goes from the top, down to
the leg containing the statement A = 0.0.
This means that the results of the calculation A = SQRT(B) are almost always discarded because A
gets a new value of 0.0 each time through. A square root operation is always "expensive" because it
takes a lot of time to execute. The trouble is that you can't just get rid of it; occasionally it's needed.
However, you could move it out of the way and continue to observe the control dependencies by
making two copies of the square root operation along the less traveled branches, as shown in Figure
2.8.1-(c). This way the SQRT would execute only along those paths where it was actually needed.
This kind of instruction scheduling will be appearing in compilers (and even hardware) more and
more as time goes on. A variation on this technique is to calculate results that might be needed at
times when there is a gap in the instruction stream (because of dependencies), thus using some spare
cycles that might otherwise be wasted.
2.8.1.2 Data Dependencies
A calculation that is in some way bound to a previous calculation is said to be data dependent upon
that calculation. In the code below, the value of B is data dependent on the value of A. That's because
you can't calculate B until the value of A is available:
A = X + Y + COS(Z)
B = A * C
Figure 2.8.1 (a) Control Dependency; (b) A Section of Your Program; (c) Expensive Operation Moved so that it's Rarely Executed
This dependency is easy to recognize, but others are not so simple. At other times, you must be careful
not to rewrite a variable with a new value before every other computation has finished using the old
value. We can group all data dependencies into three categories:
(1) flow dependencies,
(2) anti-dependencies, and
(3) output dependencies.
Figure 2.8.2 contains some simple examples to demonstrate each type of dependency. Here, we use
an arrow that starts at the source of the dependency and ends at the statement that must be delayed
by the dependency.
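In source form, the three kinds of dependency look roughly like the following statement pairs (the variables and expressions are illustrative); in each pair, the second statement must not be moved ahead of the first:
A = B + C
D = A * 2.0 (flow dependency: D reads the A just written)
E = F + 1.0
F = G * 2.0 (anti-dependency: F is rewritten after being read)
H = B + C
H = D + E (output dependency: H is written twice)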
2.8.2 Forming a Flow Graph
To illustrate, suppose that we have the flow graph (flow chart) in Figure 2.8.3. Beside each basic
block we've listed the variables it uses and the variables it defines. What can data flow analysis tell
us? Notice that a value for A is defined in block X but only used in block Y. That means that A is dead
upon exit from block Y or immediately upon taking the right-hand branch leaving X; none of the other
basic blocks uses the value of A. That tells us that any associated resources, such as a register, can be
freed for other uses. Looking at Figure 2.8.3 we can see that D is defined in basic block X, but never
used. This means that the calculations defining D can be discarded. Something interesting is
happening with the variable G. Blocks X and W both use it, but if you look closely you'll see that the
two uses are distinct from one another, meaning that they can be treated as two independent
variables. A compiler featuring advanced instruction scheduling techniques might notice that W is
the only block that uses the value for E, and so move the calculations defining E out of block Y and
into W, where they are needed. In addition to gathering data about variables, the compiler can also
keep information about subexpressions. Examining both together, it can recognize cases where
redundant calculations are being made (across basic blocks), and substitute previously computed
values in their place. If, for instance, the expression H*I appears in blocks X, Y, and W, it could be
calculated just once in block X and propagated to the others that use it.
Figure 2.8.2 Types of Data Dependencies
2.8.2.1 Loop Dependencies
Loops are the center of activity for many applications, so there is often a high payback for simplifying
or moving calculations outside. Early compilers for parallel architectures used pattern matching to
identify the bounds of their loops. This limitation meant that a hand-constructed loop using if-
statements and goto-statements would not be correctly identified as a loop. Because modern
compilers use data flow graphs, it's practical to identify loops as a particular subset of nodes in the
flow graph. To a data flow graph, a hand constructed loop looks the same as a compiler-generated
loop. Optimizations can therefore be applied to either type of loop. Once we have identified the loops,
we can apply the same kinds of data-flow analysis we applied above. Among the things we are looking
for are calculations that are unchanging within the loop and variables that change in a predictable
(linear) fashion from iteration to iteration. How does the compiler identify a loop in the flow graph?
Fundamentally, two conditions have to be met:
A given node has to dominate all other nodes within the suspected loop. This means that all
paths to any node in the loop have to pass through one particular node, the dominator. The
dominator node forms the header at the top of the loop.
There has to be a cycle in the graph. Given a dominator, if we can find a path back to it from
one of the nodes it dominates, we have a loop. This path back is known as the back edge of
the loop.
The flow graph in Figure 2.8.4 contains one loop and one red herring. You can see that node B
dominates every node below it in the subset of the flow graph. Therefore, the nodes B, C, D, and E
form a loop. The loop goes through an array of linked list start pointers and traverses the lists to
determine the total number of nodes in all lists. Letters to the extreme right correspond to the basic
block numbers in the flow graph. At first glance, it appears that the nodes C and D form a loop too.
Figure 2.8.3 Flow Graph for Data Flow Analysis
The problem is that C doesn't dominate D (and vice versa), because entry to either can be made from
B. Generally, the flow graphs that come from code segments written with even the weakest
appreciation for a structured design offer better loop candidates. After identifying a loop, the
compiler can concentrate on that portion of the flow graph, looking for instructions to remove or
push to the outside. Certain types of subexpressions, such as those found in array index expressions,
can be simplified if they change in a predictable fashion from one iteration to the next. In the
continuing quest for parallelism, loops are generally our best sources for large amounts of
parallelism. However, loops also provide new opportunities for those parallelism-killing
dependencies.
2.8.2.2 Loop-Carried Dependencies
The notion of data dependence is particularly important when we look at loops, the hub of activity
inside numerical applications. A well-designed loop can produce millions of operations that can all
be performed in parallel. However, a single misplaced dependency in the loop can force it all to be
run in serial. So the stakes are higher when looking for dependencies in loops. Some constructs are
completely independent, right out of the box. The question we want to ask is: can two different
iterations execute at the same time, or is there a data dependency between them? Consider the
following loop:
DO I=1,N
A(I) = A(I) + B(I)
ENDDO
For any two values of I and K, can we calculate the value of A(I) and A(K) at the same time? Below,
we have manually unrolled several iterations of the previous loop, so they can be executed together:
A(I) = A(I) + B(I)
A(I+1) = A(I+1) + B(I+1)
A(I+2) = A(I+2) + B(I+2)
Figure 2.8.4 Flow Graph Including a Loop
You can see that none of the results are used as an operand for another calculation. For instance, the
calculation for A(I+1) can occur at the same time as the calculation for A(I) because the calculations
are independent; you don't need the results of the first to determine the second. In fact, mixing up
the order of the calculations won't change the results in the least. Relaxing the serial order imposed
on these calculations makes it possible to execute this loop very quickly on parallel hardware.
2.8.2.3 Flow Dependencies
For comparison, look at the next code fragment:
DO I=2,N
A(I) = A(I-1) + B(I)
ENDDO
This loop has the regularity of the previous example, but one of the subscripts is changed. Again, it's
useful to manually unroll the loop and look at several iterations together:
A(I) = A(I-1) + B(I)
A(I+1) = A(I) + B(I+1)
A(I+2) = A(I+1) + B(I+2)
In this case, there is a dependency problem. The value of A (I+1) depends on the value of A (I), the
value of A (I+2) depends on A (I+1), and so on; every iteration depends on the result of a previous
one. Dependencies that extend back to a previous calculation and perhaps a previous iteration (like
this one), are loop carried flow dependencies or backward dependencies. You often see such
dependencies in applications that perform Gaussian elimination on certain types of matrices, or
numerical solutions to systems of differential equations. However, it is impossible to run such a loop
in parallel (as written); the processor must wait for intermediate results before it can proceed. In
some cases, flow dependencies are impossible to fix; calculations are so dependent upon one another
that we have no choice but to wait for previous ones to complete. Other times, dependencies are a
function of the way the calculations are expressed. For instance, the loop above can be changed to
reduce the dependency. By replicating some of the arithmetic, we can make it so that the second and
third iterations depend on the first, but not on one another. The operation count goes up (we have an
extra addition that we didn't have before), but we have reduced the dependency between iterations:
DO I=2,N,2
A(I) = A(I-1) + B(I)
A(I+1) = A(I-1) + B(I) + B(I+1)
ENDDO
The speed increase on a workstation won't be great (most machines run the recast loop more slowly).
However, some parallel computers can trade off additional calculations for reduced dependency and
chalk up a net win.
2.8.2.4 Output Dependencies
The third class of data dependencies, output dependencies, is of particular interest to users of parallel
computers, particularly multiprocessors. Output dependencies involve getting the right values to the
right variables when all calculations have been completed. Otherwise, an output dependency is
violated. The loop below assigns new values to two elements of the vector A with each iteration:
DO I=1,N
A(I) = C(I) * 2.
A(I+2) = D(I) + E
ENDDO
As always, we won't have any problems if we execute the code sequentially. But if several iterations
are performed together, and statements are reordered, then incorrect values can be assigned to the
last elements of A. For example, in the naive vectored equivalent below, A (I+2) takes the wrong value
because the assignments occur out of order:
A(I) = C(I) * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I) + E      ! output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E
Whether or not you have to worry about output dependencies depends on whether you are actually
parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate
legal code and possibly even fast code, if it's clever enough. But output dependencies occasionally
become a problem for programmers.
2.8.2.5 Dependencies Within an Iteration
We have looked at dependencies that cross iteration boundaries but we haven't looked at
dependencies within the same iteration. Consider the following code fragment:
DO I = 1,N
D = B(I) * 17
A(I) = D + 14
ENDDO
When we look at the loop, the variable D has a flow dependency. The second statement cannot start
until the first statement has completed. At first glance this might appear to limit parallelism
significantly. When we look closer and manually unroll several iterations of the loop, the situation
gets worse:
D = B(I) * 17
A(I) = D + 14
D = B(I+1) * 17
A(I+1) = D + 14
D = B(I+2) * 17
A(I+2) = D + 14
Now, the variable D has flow, output, and anti-dependencies. It looks like this loop has no hope of
running in parallel. However, there is a simple solution to this problem at the cost of some extra
memory space, using a technique called promoting a scalar to a vector. We define D as an array with
N elements and rewrite the code as follows:
DO I = 1,N
D(I) = B(I) * 17
A(I) = D(I) + 14
ENDDO
Now the iterations are all independent and can be run in parallel. Within each iteration, the first
statement must run before the second statement.
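On a shared-memory machine, an alternative to promoting the scalar is to keep D scalar and give each thread its own private copy. The fragment below is only a minimal sketch of the same loop, written in C with an OpenMP private clause for illustration; the array names simply mirror the FORTRAN example above.
#include <stdio.h>
#define N 1000

int main(void)
{
    double a[N], b[N], d;
    int i;

    for (i = 0; i < N; i++)
        b[i] = (double) i;

    /* Each thread gets its own copy of d, so the flow, output and
       anti-dependencies on d vanish and the iterations become independent. */
    #pragma omp parallel for private(d)
    for (i = 0; i < N; i++) {
        d = b[i] * 17;
        a[i] = d + 14;
    }

    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}
Compiled with OpenMP enabled the loop runs in parallel; without it the pragma is ignored and the serial result is identical.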
2.9 Pointer Ambiguity in C
FORTRAN compilers depend on programmers to observe aliasing rules. That is, programmers are
not supposed to modify locations through pointers that may be aliases of one another. They can
become aliases in several ways, such as when two dummy arguments receive pointers to the same
storage locations:
CALL BOB (A,A)
...
END
SUBROUTINE BOB (X,Y)    ! X and Y become aliases
C compilers don't appreciate the same restrictions on aliasing. In fact, there are cases where aliasing
could be desirable. Additionally, C is blessed with pointer types, increasing the opportunities for
aliasing to occur. This means that a C compiler has to approach operations through pointers more
conservatively than a FORTRAN compiler would. The following loop nest looks like a FORTRAN loop
cast in C. The arrays are declared or allocated all at once at the top of the routine, and the starting
address and leading dimensions are visible to the compiler. This is important because it means that
the storage relationship between the array elements is well known. Hence, you could expect good
performance:
#define N ...
double a[N][N], c[N][N], d;
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j] = a[i][j] + c[j][i] * d;
Now imagine what happens if you allocate the rows dynamically. This makes the address calculations
more complicated. The loop nest hasn't changed; however, there is no guaranteed stride that can get
you from one row to the next. This is because the storage relationship between the rows is unknown:
#define N ...
double *a[N], *c[N], d;
for (i=0; i<N; i++)
{
a[i] = (double *) malloc (N*sizeof(double));
c[i] = (double *) malloc (N*sizeof(double));
}
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j] = a[i][j] + c[j][i] * d;
In fact, your compiler knows even less than you might expect about the storage relationship. For
instance, how can it be sure that references to a and c aren't aliases? It may be obvious to you that
they're not. You might point out that malloc never overlaps storage. But the compiler isn't free to
assume that. Who knows? You may be substituting your own version of malloc! Let's look at a
different example, where storage is allocated all at once, though the declarations are not visible to all
routines that are using it. The following subroutine bob performs the same computation as our
previous example. However, because the compiler can't see the declarations for a and c (they're in
the main routine), it doesn't have enough information to be able to overlap memory references from
successive iterations; the references could be aliases:
#define N ...
main( )
{
double a[N][N], c[N][N], d;
...
bob (a,c,d,N);
}
bob (double *a,double *c,double d,int n)
{
int i,j;
double *ap, *cp;
for (i=0;i<n;i++)
{
ap = a + (i*n);
cp = c + i;
for (j=0; j<n; j++)
*(ap+j) = *(ap+j) + *(cp+(j*n)) * d;
}
}
To get the best performance, make available to the compiler as many details about the size and shape
of your data structures as possible. Pointers, whether in the form of formal arguments to a subroutine
or explicitly declared, can hide important facts about how you are using memory. The more
information the compiler has, the more it can overlap memory references. This information can come
from compiler directives or from making declarations visible in the routines where performance is
most critical.
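One concrete way to pass that information to a C compiler, assuming it supports the C99 restrict qualifier (or an equivalent extension such as __restrict__), is to promise that the pointer arguments do not alias. The fragment below is an illustrative variant of the subroutine above, not a prescription; with the qualifiers in place, the compiler is free to overlap memory references from successive iterations much as a FORTRAN compiler would.
/* Same computation as bob(), but the restrict qualifiers assert that the
   regions reached through a and c never overlap, so the compiler may
   overlap and reorder loads and stores from successive iterations. */
void bob_restrict(double * restrict a, double * restrict c, double d, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        double * restrict ap = a + (i * n);      /* row i of a    */
        const double * restrict cp = c + i;      /* column i of c */
        for (j = 0; j < n; j++)
            ap[j] = ap[j] + cp[j * n] * d;
    }
}
When restrict is not available, compiler directives (for example an ivdep-style "ignore vector dependencies" pragma offered by several compilers) or simply keeping the declarations visible, as suggested above, achieve a similar effect.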
3 Performance of CFD Codes as Related to Hardware (CPU vs. GPU)
The method used to assess the performance of a parallel CFD solver is becoming a topic for debate.
While some implementations use a fixed number of outer iterations to assess the performance of the
parallel solver regardless of whether a solution has been obtained or not, other implementers use a
fixed value for the residual as a basis for evaluation. Ironically, a large number of implementers do
not mention the method used in their assessment! The reason for this discrepancy is that the first
group (who uses a fixed number of outer iterations) believes that the evaluation of the parallel
performance should be done using exactly the same algorithm which justifies the use of a fixed
number of outer iterations. This can be acceptable from an algorithmic point of view. The other group
(who uses a fixed value for the maximum residual) believes that the evaluation of the parallel
performance should be done using the converged solution of the problem which justifies the use of
the maximum residual as a criterion for performance measurement. This is acceptable from an
engineering point of view and from the user point of view. In all cases, the parallel code will be used
to seek a valid solution! Now if the number of outer iterations is the same as that of the sequential
version, so much the better! The problem becomes more complicated when an algebraic multigrid (AMG) solver is
used. Depending on the method used in implementing the AMG solver, the maximum number of AMG
levels in the parallel version will usually be less than that of the sequential version which raises the
issue that one is not comparing the same algorithm. From an engineering point of view, the main
concern is to obtain a valid solution for a given problem in a reasonable amount of time and thus, a
user will not actually perform a sequential run and then a parallel run; rather, she will require the
code to use as many AMG levels as possible.
3.1 CFD for Next Generation High Performance Computing
High Performance Computing (HPC) is moving towards large-scale parallelism. The Jaguar
supercomputer, which at the time was the fastest computer in the world, has over 200,000 processing
cores. On-chip parallelism has been increasing in regular processors (dual-core, quad-core, etc.) since
2001, but larger scales of parallelism are now being seen on a single chip. The introduction of
Graphics Processing Units (GPUs), which have hundreds of cores on a single chip, into HPC
represents a large change in the architectures being used for scientific computing. The scale of
parallelism and the new architectures require novel numerical solvers to be written and optimized for
solving CFD problems.
3.2 Hardware Consideration and CPU vs. GPU Technology
According to the WEKA blog, a CPU is a generalized processor designed to carry out a wide
variety of tasks, while a GPU (Graphics Processing Unit) is a specialized processing unit with enhanced
mathematical computation capability, ideal for computer graphics and machine-learning tasks. GPUs
have traditionally been used for rendering graphics, in which several relatively simple operations are
performed identically on all parts of the input to produce an output image [6]. The nature of this work
makes graphics rendering a good candidate for parallelization, and GPUs reflect this in their architecture
by having many cores. The differences between CPU and GPU architecture can be summarized in
four points, visualized in Figure 3.2.1:
 CPUs contain few cores but have a relatively large cache (several MBs).
 CPUs have many advanced features, such as predictive branching, out-of-order execution and
deep pipelines, to improve throughput.
 GPUs have hundreds of cores split into groups which share control hardware and high-speed
memory (equivalent to cache). This high-speed memory is very small (a few KBs).
 GPU cores are 'lightweight', i.e. they lack the advanced features of CPU cores.
Figure 3.2.1 Architecture Differences Between CPU and GPU
[6] Mark Mawson, Alistair Revell & Robert Prosser, "Computational Fluid Dynamics Codes For Next Generation
High Performance Computing", Computational Fluid Dynamics Group, University of Manchester.
GPUs are similar in function to CPU: they contain cores, memory, and other components. Instead of
emphasizing context switching to manage multiple tasks, GPU acceleration emphasizes parallel data
processing through a large number of cores [WEKA blog].
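To make the contrast concrete, the sketch below shows the kind of data-parallel CUDA kernel a GPU is built for: one lightweight thread per array element, all executing the same simple operation, with no need for large caches or deep out-of-order pipelines. The kernel and its names are generic illustrations, not part of any of the solvers discussed below.
#include <cuda_runtime.h>

// One lightweight thread per element: the same simple operation applied
// identically across the whole input, which is the workload GPUs are built for.
__global__ void scale_add(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch_scale_add(const float *d_x, float *d_y, float a, int n)
{
    int threads = 256;                          // threads per block
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n elements
    scale_add<<<blocks, threads>>>(d_x, d_y, a, n);
}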
3.2.1 Case Study 1 - 2D Laplace Equation
A V-Cycle Multigrid method was written for GPUs to solve a 2D Laplace problem (Δu = 0). The Multigrid Method solves a system of linear equations by restricting the error of the initial fine-grid solution to increasingly coarse grids and performing smoothing operations on them. It has been shown that higher-frequency errors are more susceptible to smoothing operations than low-frequency errors. By restricting the solution to coarser grids, the relative frequency of the errors increases, making them susceptible to smoothing. Each level of coarseness therefore allows a low-frequency error component to be smoothed; the solution at each grid level is then summed with the level above and further smoothing is carried out to remove any errors introduced by the summation. The result is a solution that is smoothed across all frequencies of error.
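The structure just described, smooth, restrict the residual, correct from the coarse grid, smooth again, is easiest to see in code. The listing below is a self-contained, plain-C sketch of a V-cycle for the 1D model problem -u'' = f with homogeneous boundary conditions; it is meant only to illustrate the cycle and is not the authors' 2D GPU implementation.
#include <stdio.h>
#include <stdlib.h>

/* Minimal 1D V-cycle multigrid for -u'' = f on (0,1), u(0)=u(1)=0. */
static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    /* damped Jacobi: high-frequency error components decay fastest */
    double *tmp = malloc((n + 2) * sizeof(double));
    for (int s = 0; s < sweeps; s++) {
        for (int i = 1; i <= n; i++)
            tmp[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
        for (int i = 1; i <= n; i++)
            u[i] += (2.0 / 3.0) * (tmp[i] - u[i]);
    }
    free(tmp);
}

static void v_cycle(double *u, const double *f, int n, double h)
{
    smooth(u, f, n, h, 3);                          /* pre-smoothing */
    if (n > 1) {
        int nc = (n - 1) / 2;                       /* coarse interior points */
        double *r  = calloc(n + 2,  sizeof(double));
        double *fc = calloc(nc + 2, sizeof(double));
        double *uc = calloc(nc + 2, sizeof(double));
        for (int i = 1; i <= n; i++)                /* residual r = f + u''   */
            r[i] = f[i] + (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h);
        for (int i = 1; i <= nc; i++)               /* full-weighting restriction */
            fc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
        v_cycle(uc, fc, nc, 2.0 * h);               /* coarse-grid correction */
        for (int i = 1; i <= nc; i++) {             /* linear prolongation    */
            u[2 * i]     += uc[i];
            u[2 * i - 1] += 0.5 * (uc[i - 1] + uc[i]);
        }
        u[n] += 0.5 * uc[nc];                       /* last fine point        */
        free(r); free(fc); free(uc);
    }
    smooth(u, f, n, h, 3);                          /* post-smoothing */
}

int main(void)
{
    int n = 255;                                    /* 2^8 - 1 interior points */
    double h = 1.0 / (n + 1);
    double *u = calloc(n + 2, sizeof(double));
    double *f = calloc(n + 2, sizeof(double));
    for (int i = 1; i <= n; i++) f[i] = 1.0;        /* arbitrary right-hand side */
    for (int cycle = 0; cycle < 10; cycle++)
        v_cycle(u, f, n, h);
    printf("u(0.5) = %f (exact 0.125)\n", u[(n + 1) / 2]);
    free(u); free(f);
    return 0;
}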
3.2.1.1 Results
GPU and CPU implementations of the V-Cycle Multigrid Method were tested on grids of size up to 4097×4097 elements (see Figure 3.2.2). The maximum grid size was limited by the size of RAM on the GPU (4 GB); future work will include allowing partitions of larger grids to be moved to and from the GPU. The GPU implementation performed up to 12× faster than the CPU version.
Figure 3.2.2 Result for a V-Cycle Multigrid
3.2.1.2 Future Work - Heterogeneous Computing
In the V-cycle method shown, the CPU is idle while functions run on the GPU. The principle of heterogeneous computing is that the CPU will perform other tasks while the GPU is being used. For CFD applications this could include using the CPU to assist with the solver or, for time-dependent problems, post-processing the previous time step while the GPU calculates the latest time step (see Figure 3.2.3).
Figure 3.2.3 Heterogeneous Computing using CPUs and GPUs
3.2.2 Case Study 2 - Unstructured Grid Based CFD Solvers on Modern Graphics Hardware
The 3D Euler equations for inviscid, compressible flow are considered by [Corrigan et al.] [7]. Effective memory bandwidth is improved by reducing total global memory access and overlapping redundant computation, as well as using an appropriate numbering scheme and data layout. The applicability of per-block shared memory is also considered. The performance of the solver is demonstrated on two benchmark cases: a missile and the NACA 0012 wing. For a variety of mesh sizes, an average speed-up factor of roughly 9.5x is observed over the equivalent parallelized OpenMP code running on a quad-core CPU, and roughly 33x over the equivalent code running in serial.
[7] Andrew Corrigan, Fernando Camelli, Rainald Lohner, and John Wallin, "Running Unstructured Grid Based CFD Solvers on Modern Graphics Hardware", 19th AIAA Computational Fluid Dynamics, 2009.
3.2.2.1 Background and Literature Survey
Recently, GPUs (Graphics Processing Units) have seen a tremendous increase in performance. In
addition to this high computational performance, the latest modern graphics hardware offers
increasing memory capacity, as well as support for 64-bit floating point arithmetic. As massively parallel, multi-core processors, GPUs offer tremendous potential for applications in computational fluid dynamics.
In order to fully exploit the computational power of such hardware, considerable care is required in
the coding and implementation, particularly in the memory access pattern. GPUs have general-
purpose global memory, which is not automatically cached and exhibits high latency in comparison
with the instruction throughput of GPUs. Furthermore, with earlier CUDA-enabled GPUs, there were
stringent requirements for achieving optimal effective memory bandwidth, with a large loss of
performance when these requirements went unmet. With the data-dependent memory access of
unstructured grid based solvers, this loss of performance is almost assured. However, with due care,
structured grid based solvers can meet these requirements due to the regular memory access
patterns of such solvers, as described in the work of (Brandvik & Pullan), and (Tolke). Further work
on regular grid solvers includes that of (Phillips et al.), who have developed a 2D compressible Euler solver on a cluster of GPUs, and [Thibault et al.] [8], who have implemented a 3D incompressible Navier-Stokes solver for multi-GPU systems. So far, the implementation of optimized unstructured grid based solvers for modern graphics hardware has been relatively rare, perhaps due to these stringent requirements. In fact, just prior to its first release, [Owens et al.] [9] comprehensively surveyed the field of general-purpose computation on graphics hardware (GPGPU), which included a number of primarily structured grid based solvers, such as those of [Harris] [10], [Scheidegger et al.] [11], and [Hagen et al.] [12]. However, the architecture has changed substantially and many of the limitations of GPGPU via traditional graphics APIs such as OpenGL are no longer an issue.
The most recent CUDA-enabled GPUs have looser requirements for achieving high effective memory
bandwidth. Roughly speaking, memory no longer needs to be accessed in a specific order by
consecutive threads. Rather, high effective memory bandwidth can be achieved as long as
consecutive threads access nearby locations in memory, which is called coalescing. Thus, if an
appropriate memory access pattern is obtained, one can expect that modern GPUs will be capable of
achieving high effective memory bandwidth and in general high performance for unstructured grid
based CFD solvers. The purpose of this work is to study techniques which achieve this.
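The difference is easiest to see in a pair of toy kernels (illustrative only, not taken from the solver under discussion). In the first, thread i touches element i, so the 32 threads of a warp read 32 consecutive values and the hardware coalesces them into a few wide transactions; in the second, consecutive threads touch addresses a fixed stride apart, and the same warp generates many separate transactions.
// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: consecutive threads read addresses 'stride' elements apart,
// so effective bandwidth drops (in must hold at least n * stride elements).
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride];
}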
3.2.2.2 Implementation on Graphics Hardware
The performance-critical portion of the solver consists of a loop which repeatedly computes the time derivatives of the conserved variables [see Corrigan et al.] [7]. The conserved variables are then updated using an explicit Runge-Kutta time-stepping scheme. The most expensive computation consists of accumulating flux contributions and artificial viscosity across each face when computing the time derivatives. Therefore, the performance of the CUDA kernel which implements this computation is crucial in determining whether or not high performance is achieved, and is the focus of this section.
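As a rough illustration of the gather-with-redundant-computation idea described above, the sketch below assigns one thread per element, which loops over that element's faces and recomputes the face contribution locally so that every write stays in the thread's own entries. The data layout, the fixed four-face (tetrahedral) assumption and the placeholder flux expression are assumptions made here for brevity, not the paper's actual data structures or numerics.
// Schematic gather-style residual kernel: one thread per element accumulates
// its own face contributions. The flux at a shared face is recomputed by both
// neighbouring elements (redundant computation), so no scattered or atomic
// writes are needed. NVAR, NFACE, the layout and the placeholder flux are
// illustrative assumptions only.
__global__ void accumulate_residual(const float *q,        // NVAR values per element
                                    const int   *face_nbr, // neighbour of each face
                                    float       *dqdt,     // NVAR values per element
                                    int n_elem)
{
    const int NVAR = 5, NFACE = 4;
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_elem) return;

    float res[NVAR] = {0.f};
    for (int f = 0; f < NFACE; f++) {
        int nbr = face_nbr[e * NFACE + f];
        for (int v = 0; v < NVAR; v++) {
            // placeholder: a real solver evaluates the Euler flux plus
            // artificial viscosity from q[e], q[nbr] and the face geometry
            float flux_placeholder = 0.5f * (q[nbr * NVAR + v] - q[e * NVAR + v]);
            res[v] += flux_placeholder;
        }
    }
    for (int v = 0; v < NVAR; v++)
        dqdt[e * NVAR + v] = res[v];
}
A real kernel must also choose the numbering scheme and data layout so that neighbouring threads touch neighbouring addresses, in line with the coalescing remarks above.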
3.2.2.3 Test Case
The performance of the GPU code was measured on a prototype NVIDIA Tesla GPU, supporting
compute capability 1.3, with 24 multi-processors. A NACA 0012 wing in supersonic (M = 1.2 , α = 0)
flow was used as a test case; the pressure contours are plotted in Figure 3.2.4 (Left). Timing
measurements taken when running in single precision for a variety of meshes show an average
performance scaling factor of 9.4x in comparison to the OpenMP code running on four cores and
32.6x in comparison to the OpenMP code on one core. Furthermore, the code running on graphics
hardware is faster by a factor of 3.9X using redundant computation in comparison to pre-computed
flux contributions. Timing measurements when running in double-precision are given in Figure
3.2.5-(Top) for a variety of meshes, showing an average performance scaling factor of 1.56X in
comparison to the OpenMP code running on four cores and 4.7X in comparison to the OpenMP code
on one core. Furthermore, the code running on graphics hardware is faster by a factor of 1.1X using
redundant computation in comparison to pre-computed flux contributions.
A missile in supersonic (M = 1.2 , α = 8) flow was used as an additional test case. The pressure
contours are plotted in Figure 3.2.4-(right). Timing measurements when running in double-
precision are given in Figure 3.2.5-(bottom) for a variety of meshes, showing an average
performance scaling factor of 2.5X in comparison to the OpenMP code running on four cores and 7.4X
in comparison to the OpenMP code on one core. Furthermore, the code running on graphics
hardware is faster by a factor 1.63X using redundant computation in comparison to pre-computed
flux contributions. For additional details, consult [Corrigan et al.] [7].
Figure 3.2.4 Pressures at the Surface and Plane for the NACA 0012 (Left) and at the Surface for the Missile (Right)
Figure 3.2.5 Running Times in Double Precision Per Element Per Iteration for the NACA 0012
[8] Thibault, J. and Senocak, I., "CUDA Implementation of a Navier-Stokes Solver on Multi-GPU Desktop Platforms for Incompressible Flows", 47th AIAA Aerospace Sciences Meeting Including The New Horizons Forum and Aerospace Exposition, No. AIAA 2009-758, January 2009.
[9] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., and Purcell, T. J., "A Survey of General-Purpose Computation on Graphics Hardware", Computer Graphics Forum, Vol. 26, No. 1, 2007.
[10] Harris, M., "Fast Fluid Dynamics Simulation on the GPU", GPU Gems, chap. 38, Addison-Wesley, 2004.
[11] C. Scheidegger, J. Comba, R. C., "Practical CFD Simulations on the GPU using SMAC", Computer Graphics Forum, Vol. 24, 2005.
[12] Hagen, T., Lie, K.-A., and Natvig, J., "Solving the Euler Equations on Graphics Processing Units", Proceedings of the 6th International Conference on Computational Science, Vol. 3994 of Lecture Notes in Computer Science, Springer, May 2006.
3.2.3 Case Study 3 - Accelerating CFD Simulation With High Order Finite Difference Method on Curvilinear Coordinates for Modern GPU Clusters
Authors : Chuangchao Ye, Pengjunyi Zhang, Rui Yan, Dejun Sun, Zhenhua Wan
Affiliations : Department of Modern Mechanics, University of Science and Technology of China, Hefei, 230027, China
Title : Accelerating CFD simulation with high order finite difference method on curvilinear coordinates
for modern GPU clusters
Citation : Ye, C., Zhang, P., Yan, R., Sun, D., & Wan, Z. (2020). Accelerating CFD simulation with high
order finite difference method on curvilinear coordinates for modern GPU clusters. arXiv:
Computational Physics.
Source : arXiv:2006.07964v1
High-fidelity flow simulation for complex geometries at high Reynolds number (Re) is still very challenging, and requires more powerful computational capability from HPC systems [Ye et al.] [15]. However, the development of HPC with the traditional CPU architecture suffers bottlenecks due to its high power consumption and technical difficulties. Heterogeneous-architecture computation has been raised as a promising solution to the difficulties of HPC development. GPU accelerating technology has been utilized in low order scheme CFD solvers on structured grids and in high order scheme solvers on unstructured meshes. High order finite difference methods on structured grids possess many advantages, e.g. high efficiency, robustness and low storage; however, the strong dependence among points in a high order finite difference scheme still limits its application on the GPU platform. In the present work, we propose a set of hardware-aware techniques to optimize the efficiency of data transfer between CPU and GPU and the efficiency of communication between GPUs. An in-house multi-block structured CFD solver with high order finite difference methods on curvilinear coordinates is ported onto the GPU platform and obtains satisfying performance, with a maximum speedup of around 2000x over a single CPU core. This work provides an efficient solution for applying GPU computing in CFD simulation with certain high order finite difference methods on current GPU heterogeneous computers. The tests show that significant accelerating effects can be achieved for different GPUs.
[15] Chuangchao Ye, Pengjunyi Zhang, Rui Yan, Dejun Sun, Zhenhua Wan, "Accelerating CFD simulation with high order finite difference method on curvilinear coordinates for modern GPU clusters", arXiv:2006.07964v1, 2020.
Keywords: Hardware-aware, High Order, Finite Difference Methods, Curvilinear Coordinates, GPU,
M=million.
3.2.3.1 Introduction
CFD is one of the most important research methods in fluid mechanics, and it relies heavily on the computational capability of the computer, especially for accurately simulating realistic engineering flows. The deficiency of computational capability has now become one of the biggest obstacles to future CFD development. High order methods have been adopted in complex flow simulations such as turbulence simulation and computational aeroacoustics (CAA), on the merit that they can obtain more accurate results with fewer grid points than low order methods due to their low-dissipation property. However, high order methods are more computationally expensive compared to low order methods. Real flows in nature usually occur at large Reynolds numbers. Chapman estimated in 1979 that the grid-point requirement N for wall-modeled large eddy simulation (LES) scales as N ~ Re_L^(2/5), for wall-resolving LES as N ~ Re_L^(9/5), and for direct numerical simulation (DNS) as N ~ Re_L^(9/4) [1]. In 2012, Choi & Moin [2] revised Chapman's estimates to Re_L, Re_L^(13/7) and Re_L^(37/14), respectively.
Since Gordon E. Moore, the co-founder of Intel, observed that the number of transistors on a microchip doubles roughly every two years while the cost of computers is halved, Moore's law has governed the development of CPUs for decades. However, in the past few years there have been many voices about the gradual failure of this golden law, because the high temperatures of transistors would eventually make it harder to create smaller circuits. In the meantime, many heterogeneous computational architectures have arisen, such as the field-programmable gate array (FPGA), the automata processor (AP) and the graphics processing unit (GPU). Among those heterogeneous architectures, GPU computing has grown to be the most popular one in high performance computing (HPC). The GPU, originally designed for graphics rendering, delivers great computational capability due to its "many cores" architecture.
Compared to a traditional CPU, which contains only several cores, a GPU contains thousands of cores on a chip, which provide more powerful computation capability. In addition, a GPU has lower power consumption and is cheaper than a CPU for equivalent computation capability. In the latest TOP500 list of HPC, half of the 10 fastest HPC systems are equipped with GPU devices. Summit, the fastest supercomputer in the world, located at the Oak Ridge National Laboratory (ORNL) in the USA, consists of 4608 compute nodes, and each node contains 2 IBM POWER9 processors and 6 NVIDIA Volta V100 GPUs. The GPUs deliver over 95% of Summit's performance. Compared to the traditional CPU-architecture supercomputer Tianhe-2A, Summit delivers double the peak performance (200 PetaFLOPS) with nearly half the power consumption (about 10 MW).
The merits of GPU computing have attracted many attempts to port CFD simulations to the GPU platform. Tutkun et al. [3] solve the compressible Navier-Stokes equations with a compact scheme combined with filtering on a Tesla C1060 GPU, and get a speedup of 9x to 16x compared to an AMD Phenom CPU. Esfahanian et al. [4] perform 3rd and 5th order WENO schemes on a two-dimensional uniform mesh solving the Euler equations on a GPU, and get a maximum speedup of 316x. Karantasis et al. [5] perform 5th, 7th and 9th order WENO schemes on a uniform grid with 6.9 M grid points, accelerating with a Tesla S1070 GPU, and get a maximum speedup of 90x. Xu et al. [6] perform the high order finite difference WCNS scheme and the Hybrid cell-edge and cell-node Dissipative Compact Scheme (HDCS) on curvilinear coordinates.
Their GPU-only approach achieves a speedup of about 1.3x on a Tesla M2050 GPU compared with two Xeon X5670 CPUs, while the hybrid CPU and GPU approach achieves a maximum 2.3x speedup. Lai et al. [7, 8] develop a multiple-GPU algorithm for hypersonic flow computations with a second order finite volume method on curvilinear coordinates, and obtain 20x to 40x speedup accelerating with an Nvidia GTX1070 GPU compared to an 8-core Intel Xeon E2670 CPU.
Elsen et al. [9] develop a first order to sixth order finite difference code for complex geometry on the GPU platform, and obtain a speedup of 40x for a simple geometry and 20x for a complex geometry on an Nvidia 8800GTX GPU compared to an Intel Core 2 Duo. Very recently, Lei et al. [10] accelerate a compressible flow solver based on a second order MUSCL/NND code on Cartesian coordinates with a GPU. A Tesla P100 GPU is used for acceleration and gets a speedup of 260x for the MUSCL scheme and 144x for the NND scheme over an E5-2640 v4 CPU core.
In general, for flow solvers in Cartesian coordinates and with low order schemes, the application of GPUs yields conspicuous acceleration. But for a complicated coordinate system and high order schemes, the flow solvers achieve an inferior accelerating effect, which is far from the computational requirement of higher-fidelity and larger-scale simulation.
High order methods based on unstructured meshes, e.g. the flux reconstruction (FR) method and the discontinuous Galerkin (DG) method, are popular for accurately simulating flow with complex geometry. GPU computing achieves great success in this branch of high order methods due to the computational independence of the elements. Crabill et al. [11] simulated the flow over a spinning golf ball at a Reynolds number of 150,000 with a flux reconstruction (FR) solver on GPUs. Witherden et al. [12] developed a solver based on high order flux reconstruction (FR) which has been used for a variety of flows. Although it is easy to improve precision, the numerical stability of this type of method in solving high speed flows is poor, which limits its application. Furthermore, this method requires a large amount of storage, which also limits its application in large scale simulations in GPU computing. In contrast, the finite difference method is more storage-saving and more robust when strong discontinuities exist. In general, the finite difference method is still an important method in high order CFD simulations, and deserves further development.
In the past few years, the hardware performance of GPUs has been greatly improved, and a series of technologies has been developed which makes the heterogeneous computer more effective and mature. The current GPU computer is not simply a traditional CPU computer equipped with one or two GPUs; it is a highly integrated hardware and software system. One computer has multiple GPUs, and the GPUs become the main power of computing. To fully exploit the performance of a current GPU computer, programs must adjust themselves to adapt to the complicated computer architectures. In the present work, based on an architecture analysis of modern GPU servers, we propose a set of hardware-aware techniques to optimize the performance of data transfer between CPU and GPU and the communication efficiency between GPUs. An in-house code is ported onto the GPU with careful memory planning and high-efficiency kernel design. Moreover, a modified alternative formulation of the WENO scheme, which saves computation and is more efficient in GPU computing, is proposed.
3.2.3.2 Numerical Methods
The discussion regarding the RANS formulation can be obtained from the source paper [Ye et al.]. In addition, the derivatives of the inviscid fluxes using the high order Alternative Formulation of the Weighted Essentially Non-Oscillatory scheme (AFWENO) [13] are also covered there, and will not be repeated here. Readers are also advised that the present study is based on an in-house code, "HiResX (High Resolution flow Solver)", which aims to simulate compressible flows with complex geometries to high order accuracy with finite difference methods.
3.2.3.3 Hardware Environment
A traditional CPU-based supercomputing cluster usually consists of hundreds of thousands of computational nodes, and each node contains dozens of cores. These nodes are interconnected with a high speed network. Up to now, the number of cores in a single CPU is still limited. Due to this limitation, a large task must be divided into many partitions and distributed to many nodes. One obvious drawback is that the complex communication between nodes may lead to high latency.
Furthermore, the price/performance ratio of a traditional CPU-based cluster is low, due to the fact that you have to pay for the frame of each computation node when increasing CPU cores, which does not directly contribute to computational capability. The computational capability of a computing device with a specific space occupation can be represented as its "computing density". In the past few years, supercomputing has entered the heterogeneous computing era, and supercomputers with higher computing density have been developed. Those computers are accelerated with devices called many-core processor accelerators, such as GPUs and FPGAs. In the realm of heterogeneous computing, the GPU is the most popular one. Figure 3.2.6 shows the framework of a modern GPU server cluster. Each node contains two CPUs and many more GPUs than before. The GPUs in the same server communicate with each other through PCIe slots, and further communicate with GPUs in other servers over the network. A more sophisticated technology called NVLink has been developed by NVIDIA to enable direct data transfer between GPUs in the same server at a faster speed than PCIe's.
Figure 3.2.6 Framework of a modern GPU cluster. The computational nodes are connected with a high speed network. GPUs deliver the majority of the performance of a modern GPU cluster.
The two GPU servers utilized in this work are equipped with 10 NVIDIA RTX 2080 Ti GPUs and 10 NVIDIA Titan V GPUs, respectively. The RTX 2080 Ti GPU utilizes the newest Turing architecture of NVIDIA GPUs, while the Titan V utilizes the Volta architecture. The RTX 2080 Ti GPU is a consumer-oriented product aimed at video and gaming; though it has up to 4352 CUDA cores, its double precision (DP) performance is only 1/32 of its single precision (SP) performance. However, the GV100 core of the Titan V contains 5120 CUDA cores, and its DP/SP ratio is up to 1/2, which makes it still the most powerful GPU core up to now.
We also utilize the Tesla P100 GPU, based on the much older Pascal architecture. Although it contains fewer CUDA cores than the RTX 2080 Ti, it has much higher theoretical DP performance. The detailed specs of the GPUs mentioned above can be found in Table 3.2.1. The CPU of the server is an Intel Xeon E5-2680 v3, with 12 cores in a single CPU, and its spec is given in Table 3.2.2. The ideal computational performance of the dual CPUs in the server is 960 GFlops (FP64 performance). With GPU acceleration, 10 RTX 2080 Ti GPUs deliver 134.5 TFlops of SP float performance and 4.2 TFlops of DP float performance, while 10 Titan V GPUs deliver an incredible 149.0 TFlops SP and 74.5 TFlops DP performance. In this work, no NVLink is applied.
Table 3.2.1 Present GPU specs. Titan V has the highest double precision performance. RTX 2080 Ti utilizes the newest architecture and provides the highest single precision performance.
Table 3.2.2 Present spec of the CPU. The double precision performance of the E5-2680 v3 is slightly higher than the RTX 2080 Ti's.
3.2.3.4 Programming Implementation and Optimization of HiResX
3.2.3.4.1 Code Introduction
HiResX is originally written in Fortran 90, and aims at simulating various compressible flows in complex domains with high order schemes. The HiResX solver is equipped with various turbulence models, such as Reynolds-Averaged Navier-Stokes (RANS) models, an LES model and a detached eddy simulation (DES) model, to simulate realistic engineering flows with high Reynolds numbers. The Message Passing Interface (MPI) is utilized to distribute computation tasks to different CPU cores inter- or intra-node.
There are several strategies for performing computation on GPU devices. OpenACC offers a user-driven, lightweight solution: the user simply adds some directive-based clauses to the code segments that need to be accelerated on the GPU, with significantly less programming effort than required by a low-level model. CUDA Fortran, based on PGI Fortran, is another solution, provided by the PGI company, which offers programming flexibility using the Fortran language. However, CUDA Fortran relies heavily on the PGI compiler. CUDA C/C++ is a programming language based on standard C/C++ and developed by NVIDIA, and it provides better affinity and flexibility for operating NVIDIA's GPUs through low-level programming.
However, programming with CUDA C/C++ requires much more skill. Porting an existing complex code onto the GPU by completely rewriting it in another programming language is not a wise choice. In order to fully exploit the performance of a current GPU server without rewriting the whole solver, mixed programming of Fortran and C/C++ is applied, and GPU computing is achieved with CUDA C/C++. GPU computing for each process is independent of the other processes when HiResX is running in parallel; thus HiResX supports three parallel modes: CPU-only mode, GPU-only mode and collaborating (or hybrid) CPU/GPU mode. In CPU-only parallel mode, all processes perform computation on the CPU only, whether or not the computation node has GPU devices. This parallel mode is suitable for most traditional supercomputers, which support CPU computing only. In GPU-only parallel mode, all processes run on GPU devices; hence, for a supercomputing cluster, every computational node running the code must be equipped with a GPU device. The hybrid CPU/GPU mode is a parallel mode that can make full use of all CPU and GPU resources in the cluster, as mentioned in Ref. [15]. For this parallel mode, the computational capabilities of the CPU and GPU in one node should be roughly comparable.
Figure 3.2.7 Several typical PCIe root architectures of a GPU server. For type (a), each NUMA node has a PCIe switch attached, and there are two GPUs mounted on each PCIe switch. For type (b), only one PCIe switch is mounted on each NUMA node, and there are 4 GPUs attached to one PCIe switch. For type (c), all GPUs are mounted on one NUMA node by two PCIe switches. For type (d), all GPUs are mounted on one NUMA node.
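A minimal sketch of how such Fortran / CUDA C/C++ mixing can be wired together is given below; the routine name, the argument list and the per-call copies are invented for illustration and are not HiResX's actual interface (HiResX keeps the data resident on the device, as described next, rather than copying on every call). The CUDA side exposes a plain C entry point, which the Fortran side can call through ISO_C_BINDING, as indicated in the comment.
// update_q.cu -- illustrative CUDA C/C++ wrapper callable from Fortran.
// The Fortran side could declare it as:
//   interface
//     subroutine update_q(q, n) bind(C, name="update_q")
//       use iso_c_binding
//       real(c_double) :: q(*)
//       integer(c_int), value :: n
//     end subroutine
//   end interface
#include <cuda_runtime.h>

__global__ void scale_kernel(double *q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        q[i] *= 2.0;              // stand-in for a real update of the solution
}

extern "C" void update_q(double *q_host, int n)
{
    double *q_dev;
    size_t bytes = (size_t)n * sizeof(double);
    cudaMalloc((void **)&q_dev, bytes);
    cudaMemcpy(q_dev, q_host, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(q_dev, n);
    cudaMemcpy(q_host, q_dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(q_dev);
}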
The procedures of GPU computing in HiResX are illustrated in Figure 3.2.10. After the initialization of the program, a process which runs its computation on a GPU uploads the conservative variables Q, the primitive variables Qv, the Jacobian J and the metrics at the cell nodes to the GPU device. To avoid frequent data transfer between CPU and GPU, all these variables in the computational domain are uploaded to the GPU and stored in the GPU's global memory. The Jacobian J and the metrics at the cell edges, which are computed by the CPU at startup, are not uploaded, in consideration of three aspects. Firstly, if all these variables at the cell edges in the three directions were stored in GPU memory, additional memory space for 12 variables would be needed, which would reduce the maximum grid size that a single GPU can deal with. Secondly, if memory space for only one direction were allocated on the GPU, the memory space would be reduced to a third of the former case, but the frequent data transfer during direction switching would be time-consuming. Thirdly, for a GPU, the computational capability is much stronger than the memory bandwidth, hence direct interpolation from the Jacobian and metrics at the cell nodes saves more time. The calculation of the residual dQ on the right hand side, the time advancement and the updating of the primitive variables from the conservative variables run completely on the GPU. When a physical boundary condition is applied, very little data needs to be uploaded to the GPU, such as the ID numbers of the boundary points, which takes so little time that it can be ignored.
Figure 3.2.8 Main structure of HiResX summarized by pseudo code
For a traditional CPU computer, the user may not be concerned with where and how their program is running. When the program runs in parallel, the user needs only to distribute the computation to different CPU processes with MPI, or to different threads with tools such as OpenMP. Users rarely care about the hardware information when they develop the program. However, in current GPU computing, data exchanges at block-connecting boundaries are more complex, due to the complex environment of the heterogeneous computational architecture. Figure 3.2.9 shows the communication structures of our solver. When the solver runs in CPU-only mode, data exchange among blocks in the same process is straightforward: the data is simply copied without communication with any other process, as depicted in the figure with the yellow pentagon pairs.
Figure 3.2.10 A global glance at the execution procedures of HiResX running on GPU. Except for the initialization and data preparation performed on the CPU, all computations are performed on the GPU, while the CPU is used only to schedule GPU kernels. "Data 0" represents the initial data computed on the CPU and uploaded to the GPU. "Data 1" represents data exchanged between processes. "Data 3" represents flow field data to be written; it is downloaded to the CPU and written with a new thread.
Figure 3.2.9 Framework of communication of HiResX without GPU peer-to-peer communication technology support. For communication between processes that are all running on the CPU: if the processes are located in the same node, they exchange data within RAM with the MPI-3 shared memory technique (red path); if the processes are located in different nodes, standard MPI communication is utilized (yellow path connecting the triangle pair). For processes with GPU acceleration, data on the GPU is downloaded back to the CPU, and the processes then communicate in the same way as processes without acceleration.
For blocks in different processes in the same node, data exchange is achieved by means of the shared memory communication technique of the MPI-3 standard, as depicted in the figure with the red five-pointed star pairs. For blocks in different processes in different nodes, data exchange is achieved by standard MPI communication, which is marked with the brown triangle pairs.
In GPU computing, the data must first be uploaded to the GPU from the CPU, and downloaded back to the CPU when the GPU computation finishes. The data has to be transferred between the memories of the CPU and GPU through the PCIe bus. In the early stage of the development of GPU computing, each GPU was installed in a different PCIe slot, worked independently, and communicated with the CPU only. If the program is run in parallel with multiple processes on GPUs, then whenever communication among processes is involved, the data to be sent has to be downloaded from the GPU and the communication is performed entirely on the CPU through MPI. After the data is exchanged, the received data has to be uploaded to the GPU. It is obvious that in this procedure, processes on different GPUs communicate through CPU memory and explicit data copying between CPU and GPU. When the data on the GPU has been downloaded to the CPU, the communication reverts to general communication on the CPU, and all the usual techniques can be applied, as depicted with the four-pointed star pair and the seven-pointed star pair.
However, the Unified Virtual Addressing (UVA) technology and CUDA-aware MPI technology make this procedure simpler and more efficient. There is no need to perform explicit data copying, because the MPI interfaces can recognize the location of the buffer data and then find the optimal path to transfer the data. For blocks in the same process with GPU acceleration, data can be exchanged on the GPU without leaving the device.
When the program runs in hybrid CPU/GPU mode, the communications among processes without GPU acceleration (CPU-only mode) are the same as the communications among processes with GPU acceleration (GPU-only mode). The only thing to deal with is the communication between processes with and without GPU acceleration, as marked with the diamond pair in Figure 3.2.9. For the process with acceleration, the data to be sent should be downloaded to CPU memory explicitly if CUDA-aware MPI is not applied, while the process without acceleration does nothing with the GPU. When the data is downloaded to the CPU, the communication can be performed similarly to that in CPU-only mode. However, with CUDA-aware MPI technology and UVA, the data in GPU and CPU can be exchanged directly.
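The practical difference is easy to show. The fragment below sketches a halo exchange under the assumption that the MPI library is CUDA-aware, in which case the device pointers can be handed straight to MPI; the staging that a non-CUDA-aware build would need is indicated in the comments. The buffer names and the single-neighbour pattern are illustrative only.
#include <mpi.h>
#include <cuda_runtime.h>

// Exchange a halo buffer of 'count' doubles with a neighbouring rank.
// d_send and d_recv are device pointers obtained from cudaMalloc.
void exchange_halo(double *d_send, double *d_recv, int count,
                   int neighbour, MPI_Comm comm)
{
    // With CUDA-aware MPI and UVA, the library recognizes that the buffers
    // live in GPU memory and chooses the best transfer path itself.
    MPI_Sendrecv(d_send, count, MPI_DOUBLE, neighbour, 0,
                 d_recv, count, MPI_DOUBLE, neighbour, 0,
                 comm, MPI_STATUS_IGNORE);

    // Without CUDA-aware MPI the same exchange needs explicit staging:
    //   cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
    //   MPI_Sendrecv(h_send, ..., h_recv, ...);
    //   cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}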
3.2.3.4.2 Domain Decomposition
The HiResX solver is developed to simulate flows in complex domains with multi-block structured grids. Figure 3.2.11 illustrates the domain decomposition strategies of HiResX. The whole computational domain is split into many connected blocks, and these blocks are distributed to different processors. For hybrid CPU/GPU parallel computing, the sizes of the grid blocks can be different. The size of a grid block for a CPU is usually much smaller than that for a GPU, since the latter has higher computational capability. For HiResX, in order to balance the computational loads of the CPU and GPU processes, the whole domain can be split arbitrarily, without the limitation of one-to-one block connection, and the block-connecting faces and physical boundary faces can also be arbitrarily defined on the blocks.
Figure 3.2.11 Domain decomposition strategies of HiResX. The domain is divided into several blocks according to the performance of the CPU and GPU in order to balance the workloads of each process. For the blocks computed by a CPU, the whole block is computed by one CPU process. For the blocks computed by a GPU, each grid cell is computed by one CUDA thread.
3.2.3.4.3 Hardware Technique
In a current GPU server, more GPUs are installed in a single machine. Moreover, a variety of PCIe root architectures have been designed for different performance needs, as demonstrated in Figure 3.2.7. The whole hardware system needs to match the increasing performance of current GPUs, and the solver also needs to adapt to the hardware system for maximum computational performance. A complex PCIe root architecture will affect the efficiency of CPU-GPU memory access and even of processing unit (PU) to RAM access. Figure 3.2.12 (a) and Figure 3.2.12 (b) show the memory access models of GPU to CPU and GPU to GPU. The server has two CPUs interconnected with QPI, making it a dual-root system. In Figure 3.2.12 (a), a process running on processor core PU0 in CPU-0 selects GPU0 for acceleration. The data of this process is allocated in the memory (RAM) in the same root as CPU-0 and GPU0. As a result, the process visits its data in RAM with optimal bandwidth, and GPU0 also transfers data to the CPU with optimal bandwidth. This is the ideal situation. In fact, however, the PU on which the process runs is not fixed, and is managed by the system by default. For the pink path in Figure 3.2.12 (a), the process is running on PU2 in CPU-1, but selects GPU2 as the accelerator. Though it visits data allocated in the closest RAM with optimal bandwidth, GPU2 is not in the same root as CPU-1. Data transfer between CPU and GPU must traverse QPI because direct memory access is not supported, so it is very inefficient. The worst case is the blue path. The data is allocated in RAM close to CPU-0 and GPU1, but the process is running on a PU of CPU-1, resulting in poor efficiency of both GPU memory access and RAM access. The PCIe root architecture will also affect GPU to GPU communication and memory access. Figure 3.2.12 (b) gives several possible models of GPU to GPU memory access in a current GPU server.
Figure 3.2.12 CPU-GPU and GPU-GPU memory access models. In (a), the red path is the optimal one, while the pink path is inferior because CPU to GPU memory access across NUMA nodes is worse than local access. In (b), if GPUs support peer-to-peer (P2P) communication technology, the green path indicates that two GPUs communicate by NVLink, which is the fastest path. The purple path indicates that two GPUs communicate by a PCIe switch (PLX), which is the most efficient communication path without NVLink. The yellow path means that two GPUs communicate via the host (CPU). The red path is the worst one because P2P is not supported across NUMA nodes, and memory access between two GPUs in this way must be transferred by both CPUs.
The red path is the worst case since it traverses QPI, and the topology of GPU2 and GPU3 is usually marked as "SYS". For the brown path, GPU4 and GPU5 are attached to the same IOH chip, which provides better communication efficiency than "SYS"; this topology is usually called "PHB". The purple path connects two GPUs on the same PCIe switch, which provides better efficiency than "PHB", and its topology is marked as "PIX". The green path is the best: the GPUs interconnect directly with NVLink, which is the fastest way of GPU to GPU communication so far, labeled as "NV".
To exploit the optimal efficiency of the hardware system, users should know its topology. The Hardware Locality (HWLOC) software package provides an approach to gather hierarchical topology information about modern, increasingly complex parallel computing platforms, including NUMA (Non-Uniform Memory Access) memory nodes, shared caches, cores and multithreading, as well as I/O devices such as GPUs and InfiniBand HCAs. With the topology information, users can optimize their own programs to obtain the best performance. It should be noted that the optimizations of CPU-GPU communication and GPU-GPU communication are interdependent, as will be introduced next.
3.2.3.4.4 CPU-GPU Communication Optimization
To optimize the performance of memory access between CPU and GPU, the only way is to get them "closer". The word "close" here describes two devices that have a larger data transfer bandwidth between them. As mentioned above, the best strategy is to make the RAM, PU and GPU utilized by a process all lie in the same root. There are many ways to bind a process to a specified PU, including the process binding technique of HWLOC, the APIs of the NVIDIA Management Library (NVML) and the process binding techniques of the Linux system.
Many commercial and open source software packages provide memory binding. MVAPICH allows data allocation when "MPI_Init()" is called, and OpenMPI allocates it after calling "MPI_Init()". The problem is that the size of the memory space to allocate is unknown before the program calls "MPI_Init()" and reads the input files and grids, and we have to run the MPI initialization to get the process ID, so that the process can determine which grid files to read. What is more, to optimize the GPU-GPU communication, we should know the communication data load of each process; only then can we decide which GPU should be selected for acceleration.
After the GPU of a process is chosen, we can bind the process to the CPU that the selected GPU is attached to. Finally, we allocate memory and bind it to that CPU. The steps of the procedure are briefly outlined below (a minimal HWLOC sketch follows the list):
 Run the program first until each process gets its GPU ID. When the solver runs in parallel, each process gets its process ID and accordingly reads its input files. Each process calculates its communication data load according to the block connection information. Based on the communication data load and the topology of the GPUs of the local machine, the GPU-GPU communication optimization selects the optimal GPU for the process.
 Find the NUMA node (CPU) near the given GPU ID, and bind the current process or thread to it. Here we utilize the HWLOC software package to do this. Firstly, set the cpuset of the given GPU ID with the API function hwloc_cudart_get_device_cpuset(). Secondly, bind the current process or thread to this cpuset with hwloc_set_cpubind(). Finally, get the NUMA node of this cpuset with hwloc_get_obj_covering_cpuset().
 Allocate memory in RAM for the data of the process. If memory for the data has already been allocated, free it or migrate it to the CPU selected above. To allocate memory for data in the RAM attached to the selected CPU, the API function hwloc_alloc_membind() in HWLOC is utilized. This API function will allocate memory of a given size in the RAM of the local NUMA node, and return the address of this memory space. For memory that was allocated before the GPU ID was obtained, we can migrate it to the right NUMA node with the API function hwloc_set_area_membind(), given the address and the NUMA node ID.
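The sketch below strings the HWLOC calls named above together for a single process; it assumes hwloc was built with CUDA support (hwloc/cudart.h) and that the topology object has already been initialized and loaded, and it omits all error checking, so it is an outline of the steps rather than production code.
#include <hwloc.h>
#include <hwloc/cudart.h>

// Bind the calling process, and its data, close to the chosen GPU.
// 'topo' must already come from hwloc_topology_init() / hwloc_topology_load().
void bind_near_gpu(hwloc_topology_t topo, int gpu_id, size_t bytes, void **data)
{
    hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();

    // 1) cpuset of the cores closest to the selected GPU
    hwloc_cudart_get_device_cpuset(topo, gpu_id, cpuset);

    // 2) bind the current process to that cpuset
    hwloc_set_cpubind(topo, cpuset, HWLOC_CPUBIND_PROCESS);

    // 3) the NUMA node covering the cpuset (informational)
    hwloc_obj_t numa = hwloc_get_obj_covering_cpuset(topo, cpuset);
    (void)numa;

    // 4) allocate the process data bound to the memory near that cpuset;
    //    data allocated earlier would instead be migrated with
    //    hwloc_set_area_membind()
    *data = hwloc_alloc_membind(topo, bytes, cpuset, HWLOC_MEMBIND_BIND, 0);

    hwloc_bitmap_free(cpuset);
}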
3.2.3.4.5 GPU-GPU Communication Optimization
GPU-GPU communication is relatively easy to optimize, and it is efficient if the GPUDirect Peer-to-Peer (P2P) technology developed by NVIDIA is applied. Many NVIDIA GPUs support GPUDirect P2P, such as the Tesla series and Quadro series GPUs. The GPUDirect P2P technique includes P2P memory access and P2P transfer and synchronization, and it is an optimization technique for GPU-GPU communication within the same system. With it, buffers in the memories of two GPUs in the same computer can be copied directly through the PCIe lanes or NVLink. The efficiency of GPU-GPU communication using the GPUDirect P2P technique varies with the path through which the peers are connected. NVLink offers the maximum bandwidth, thus it is the most efficient path. NVIDIA has developed a new NVLink technology which enables the interconnection of any pair of GPUs in a single system that has multiple GPUs.
This strategy works well. However, it is a highly customized product which needs specific motherboards and only supports the Tesla P100 SXM and Tesla V100 SXM2. Many GPU servers do not support NVLink, and their GPUs are still installed in PCIe slots. Therefore, the GPU-GPU communication optimization is directed at those GPUs communicating through PCIe. To optimize GPU-GPU communication through PCIe, we need the topology of the local system and the sizes of the buffers to be exchanged among the processes located in this server. Then we can assign process groups with dense communication loads to GPU groups whose communication is more efficient, thus improving the global communication efficiency. The details of the optimization are introduced below (a short sketch of the underlying MPI and NVML queries follows the list):
 The solver runs in parallel, and each process reads its input file and gets the sizes of its communication buffers to other processes. Each process may communicate with zero, one, or multiple processes. Those processes may be distributed over one or multiple servers.
 Gather the processes located in the same server and let them know each other. The MPI software package offers APIs to achieve this. Firstly, a new communicator for the local server can be created with the API function MPI_Comm_split_type() with the split type "MPI_COMM_TYPE_SHARED"; for clarity we call it "MPI_SHARED_LOCAL" below. The size of the new communicator and the local rank (or ID) of the current global process in the new communicator can be obtained with the APIs MPI_Comm_size and MPI_Comm_rank, respectively. Secondly, get the groups of the global communicator "MPI_COMM_WORLD" and the local communicator "MPI_SHARED_LOCAL". Thirdly, gather the processes that the current process communicates with, and translate them into the local communicator. A global process ID that is not located in the local communicator will be marked as "MPI_UNDEFINED", so that we filter out the processes to communicate with in the local server.
 Each process in the local communicator shares its information with all other local processes. The information here is the sizes of the communication buffers to the target processes in the local communicator. Then each local process knows the local communication network and the buffer size of each connecting line. This procedure is easy to implement with the MPI API function MPI_Allgather().
 Each local process inquires about the topology of the GPU pairs in the local server. We can obtain the connection type of two GPUs by using the NVML API function nvmlDeviceGetTopologyCommonAncestor(device1, device2, &topo_type) and inspecting the returned variable topo_type. The connection type of two GPUs in the same system should be "NV", "PIX", "PXB", "PHB", or "SYS", listed in order of decreasing efficiency. We will not consider "NV", as mentioned above. Then we organize the GPUs into groups. Firstly, each CPU (NUMA node) gathers the GPUs that are attached to it; any two GPUs in this group communicate with each other with efficiency not worse than "PHB". Secondly, within each CPU, gather the GPUs in the branch in which any two GPUs communicate with each other with efficiency not worse than "PXB". Thirdly, within each "PXB", gather the GPUs that are on the same PCIe switch. GPUs on a single PCIe switch have connection type "PIX" with each other, and there may be multiple "PIX" groups attached to a CPU.
 Filter out the busy GPUs in the groups above, and keep the groups that contain idle GPUs only. The utilization of a GPU can be inquired with the NVML API function nvmlDeviceGetUtilizationRates().
 Partition the local processes according to the number of GPUs of each CPU. If the number of local processes is less than the number of GPUs of any CPU, choose the CPU that can hold all the local processes. If no single CPU can hold all the processes, divide the local processes into partitions according to the number of GPUs in each CPU, and make sure that these partitions have minimal total communication data with each other.
 In each CPU that gets a partition, divide the processes into partitions for the "PXB" groups in the fashion above. The "PIX" groups should be dealt with in a similar fashion too. With the procedure above, each process in the local server is bound to the GPU that makes it communicate efficiently with the GPUs attached to its target communicating processes. It is worth mentioning that, in a modern GPU cluster, servers usually interconnect with an InfiniBand network. With GPUDirect Remote Direct Memory Access (RDMA) technology, buffers can be sent directly from the GPU memory to a network adapter without staging through host memory. In this case, the network adapter should be included in the optimization above, and we can simply regard it as a communication target like a process.
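The two building blocks these steps rely on, a per-node communicator and the NVML topology and utilization queries, are sketched below; the grouping and partitioning heuristics themselves are omitted, and the function is only an illustration of the API calls named above.
#include <mpi.h>
#include <nvml.h>

// Per-node communicator plus NVML topology / utilization queries; the
// solver's actual grouping heuristics are omitted.
void query_local_topology(unsigned int gpu_a, unsigned int gpu_b)
{
    // All ranks sharing a node end up in 'local'
    // (the "MPI_SHARED_LOCAL" communicator of the text).
    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);

    nvmlInit();
    nvmlDevice_t dev_a, dev_b;
    nvmlDeviceGetHandleByIndex(gpu_a, &dev_a);
    nvmlDeviceGetHandleByIndex(gpu_b, &dev_b);

    // Level of the common ancestor of the two GPUs on the PCIe tree
    // (the PIX / PXB / PHB / SYS distinction reported by nvidia-smi).
    nvmlGpuTopologyLevel_t topo_type;
    nvmlDeviceGetTopologyCommonAncestor(dev_a, dev_b, &topo_type);

    // Current utilization, used to skip GPUs that are already busy.
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(dev_a, &util);

    nvmlShutdown();
    MPI_Comm_free(&local);
}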
3.2.3.4.6 Memory Utilization
How to utilize GPU memory is one of the key points in programming a GPU solver. Strategies of GPU memory utilization vary from code to code in different fields, according to the algorithms applied and the scale of the computation. In the early stage of GPU computing, GPUs were mainly designed for image and video processing, which usually involves simple algorithms like matrix operations. Performing simple algorithms on a single dataset does not occupy much GPU memory, hence GPUs of several years ago were equipped with very low memory capacity, usually not higher than 4 GB. For scientific computing involving large scale computation, memory occupation is a challenge even for the CPU, and it is hard to hold all the data in GPU memory. Due to the limitation of memory capacity, the GPU usually works as a local accelerator: most of the program's data stays in CPU memory, a part of the data is uploaded onto the GPU for computational acceleration, and the result is then downloaded back to the CPU. This strategy enables the solver to deal with much more data with the limited memory of the GPU. Data transfer between CPU and GPU through the PCIe lanes is time-consuming, and this cost must be compensated in GPU computing in order to obtain acceleration. However, with the performance of GPUs improving greatly, massive and repetitive data transfer gradually becomes the bottleneck of GPU computing. The bandwidth of PCIe has not increased significantly over the past few years, but the memory capacities of GPUs have increased a lot. In particular, some professional GPUs are designed for scientific computing and are equipped with memory up to 32 GB. Moreover, a current GPU server contains more GPUs, which equips a single server with powerful computational capacity. Hence, the strategy of memory utilization should be adjusted to adapt to current GPU computing. We introduce the details of the memory utilization of our solver in the following.
There are two basic and vital principles of memory utilization in our solver. On the one hand, allocate fixed, permanent memory space for the key datasets. On the other hand, avoid repeated memory allocation and deallocation by allocating a piece of permanent memory space that is reasonably large enough for potential public usage.
In the whole procedure of solving (Eq. (1) - see [Ye et al.]), we select several fundamental and dominant variables from all the variables. A variable is fundamental when it is used throughout the whole computation and many other variables can be obtained from it. The metrics (ξ, η, ζ)x,y,z and the Jacobian J are utilized all over the computation, and they are constants if the grid is stationary, so they are fundamental variables. The conservative flow variable Ũ in (Eq. (2) - see [Ye et al.]) is fundamental as well, because it plays a key role in the time advancement and in calculating the interpolated variables in the WENO interpolation; moreover, it is also the source for computing the primitive variables Qv = (u, v, w, p, T). For convenience in utilization, we do not include the Jacobian J in it, and only store Q = (ρ, ρu, ρv, ρw, e). Although Qv can be obtained from Q, it is used directly throughout the whole computation, so we sacrifice memory to avoid unnecessary computation time. For viscous flow computations, the molecular viscosity μ is unavoidable; it is used in computing the viscous flux and the local time step. In time advancement with the Runge-Kutta scheme, as shown in (Eq. (17) - see [Ye et al.]), Qn is a variable that should be stored. For time advancement with a local time step, the time step Δt at each grid point should also be stored.
R(Q) represents the right hand side term and contains the contributions of the derivatives of the inviscid flux and the viscous flux; it is used in the time advancement and in the residual computation. Thus R(Q) should be stored during the whole computation. In the following, we denote the size of the original grid of the current process by N, and the size of the grid containing ghost cells by M. According to the above analysis, we need to allocate memory for the fundamental and dominant variables with 32 * M elements.
During the computation of the derivatives of the inviscid and viscous fluxes, there are many temporary variables at each grid point. According to the numerical methods applied to the inviscid terms, we need memory space to store the physical flux fi and the numerical flux fi+1/2, as shown in ([Ye et al.] - Eq.(16)). The total size of the temporary variables required is 10 * M. In the viscous term, the derivatives of (u, v, w, T) with respect to (x, y, z) are widely used. Moreover, the viscous flux has to be stored according to our kernel design described in the next section. In total, we need to allocate memory space for temporary variables in the viscous term with a size of 17 * M. However, the inviscid and viscous terms are computed successively, so there is no need to allocate separate memory spaces for the temporary variables of each term. Thus we allocate one memory space of 17 * M elements, which is sufficient for the current public usage and for potential usage elsewhere.
In short, we need permanent memory space with a size of 49 * M elements. The variables mentioned above are all of floating-point type. Integer arrays are widely used in our solver as well, for example in block communication and physical boundary conditions, and we allocate public-usage memory for them in the same way. With this strategy our solver avoids unnecessary data transfers between CPU and GPU, uses GPU memory to the hilt, and increases the maximum grid size the solver can handle. The maximum grid capacities of our solver on different single GPUs are shown in Table 3.2.3; the data is stored in double-precision floating point. The result is satisfying for a single GPU. The computation speed at full load is also acceptable for general applications, because a much larger grid would lead to an unacceptably low computation speed. Therefore, the memory strategy described above is well matched with our programming and algorithm strategy.
Table 3.2.3 Maximum grid capacity of HiResX solver in different GPUs
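For illustration, the memory budget just described can be turned into a simple feasibility check before allocation. This is only a sketch under the stated assumptions (49 double-precision arrays of M elements, plus an integer pool whose size nIntWords is hypothetical); the constants are taken from the text, not from the solver source:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t M = 25000000;          // grid size including ghost cells (example value)
    const size_t nIntWords = 4 * M;     // hypothetical size of the integer pool

    // Permanent budget from the text: 49 double-precision arrays of M elements.
    const size_t needed = 49 * M * sizeof(double) + nIntWords * sizeof(int);

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("need %.2f GB, free %.2f GB of %.2f GB\n",
           needed / 1e9, freeBytes / 1e9, totalBytes / 1e9);
    return (needed <= freeBytes) ? 0 : 1;   // refuse to run if the grid cannot fit
}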
3.2.3.4.7 CUDA Kernels
The efficiency of GPU computing relies heavily on the algorithm and programming skills. To fully
exploit the parallel capability of GPU, one needs deep understanding of hardware of GPU and its
execution mechanism. The GPU architecture consists of a scalable array of Streaming
Multiprocessors (SM). Figure 3.2.13 shows the structure of GPU in software and hardware layers.
Each SM in a GPU is designed to allow concurrent execution of hundreds of threads, and generally
there are multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently
on a single GPU. In CUDA, threads are organized in groups called thread blocks. A thread block can only be scheduled on one SM and remains on that SM until its execution finishes. Hence, how to utilize threads is one of the primary considerations in kernel design. In the CUDA memory model, there are many types of memory that are programmable by the user, as shown in Figure 3.2.13. Global memory is the
largest, highest-latency, and most commonly used memory on a GPU. It is accessible to any SM
throughout the lifetime of the application.
Registers are the fastest memory space on a GPU and are partitioned among the active warps in an SM, so register variables are private to each thread. Registers hold automatic variables declared in a kernel without any other type qualifiers; arrays declared in a kernel whose indices can be determined at compile time are also stored in registers. Local memory is in fact a part of global memory; it is used to store variables in a kernel that are eligible for registers but cannot fit into the register space allocated for that kernel, including arrays referenced with indices that cannot be determined at compile time.
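The distinction between register and local-memory storage can be illustrated with a generic kernel (not taken from the solver): a small array indexed only with compile-time-known indices can stay in registers, while an array indexed by a runtime value is likely to be spilled to local memory.

// Generic illustration: 'stencil' can live in registers after unrolling, while
// 'table', indexed by a runtime value, is likely placed in local memory.
__global__ void register_vs_local(const double *in, double *out, int n, int runtimeIdx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    double stencil[5];                    // compile-time indices -> registers
    #pragma unroll
    for (int k = 0; k < 5; ++k)
        stencil[k] = in[tid] * (k + 1);

    double table[32];                     // runtime index below -> may spill to local memory
    for (int k = 0; k < 32; ++k)
        table[k] = in[tid] + k;

    out[tid] = stencil[2] + table[runtimeIdx & 31];
}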
Shared memory is a type of on-chip memory space that has much higher bandwidth and much lower latency than local or global memory. Registers and shared memory are scarce resources, and these limited resources impose a strict restriction on the number of active warps in an SM, thus affecting the parallel performance within an SM. Therefore, how to utilize these types of memory is another aspect that is vital to performance in kernel design.
Figure 3.2.13 Organization of CUDA threads for NVIDIA’s GPU and its mapping to hardware structure
In GPU computing, the CPU is called the host and the GPU is called the device. A kernel function is a piece of code that runs on the device, and the design of kernel functions is the core of GPU computing. In the current computation, most of the time in solving (see [Ye et al.] - Eq.(1)) is spent on the evaluation of the inviscid and viscous terms, which involve much more complex operations than the other parts of the computation. Therefore we give more details of them in the following.
3.2.3.4.8 Inviscid Fluxes
We consider a three-dimensional grid block with sizes (NI, NJ, NK) in the I, J, K directions. In the CPU source code, the derivatives of the convective fluxes with respect to the ξ, η, ζ directions are evaluated in sequence, and in each direction we perform the procedure of ([Ye et al.], Eq.(7) to Eq.(16)). Taking the derivative in the I direction as an example, on the JK face of the grid block, NJ * NK one-dimensional problems in the I direction are solved. For each pipeline in the I direction, conservative variables at the half-point (edge) locations (i + 1/2) are evaluated with fifth-order WENO interpolation, as shown in ([Ye et al.], Eq.(10) to Eq.(14)). To minimize oscillation, the interpolation is performed in characteristic space. There are two interpolated values, u+i+1/2 and u−i+1/2, at each half-point location, which are approximated with the values of the left-biased point stencil (i−2, i−1, i, i+1, i+2) and the right-biased point stencil (i−1, i, i+1, i+2, i+3), respectively. Then at each half point, the Steger-Warming flux is used to construct the numerical flux fi+1/2. The physical flux at each grid point can be simply obtained with (Eq.(3) - see [Ye et al.]). Finally, the derivative at each point can be approximated with ([Ye et al.], Eq.(16)). Derivatives in the other directions are computed in the same way, and the results are accumulated during the loop over directions to construct the term R(Q) in ([Ye et al.], Eq.(17)) for the Runge-Kutta integration.
Figure 3.2.14 Code Structure For Derivative of Inviscid Flux Computed With Original AFWENO
All of the above computation is done by a single process when no multi-threading technique such as OpenMP is applied. The code running on the GPU device should do all the work it does on the CPU. The structure of the GPU and its execution mechanism have been introduced earlier, and we note that tens of thousands of threads can be launched to complete a given computation.
To port a code onto the GPU, the first task is to decide how many threads to launch and how to arrange their work. In an early version of our GPU code, a thread was assigned to complete one 1D problem, so the number of threads launched equals the size of the sweeping face of the grid block, NJ * NK for the JK face. However, the acceleration performance was poor. The reason is that the number of active threads is severely restricted by the high register occupation of the kernel function. In addition, the number of threads launched is not large enough and is limited by the sweeping faces of the grid block, which usually vary with the sweeping direction.
To solve this problem, two improvements are introduced. The first is the use of atomic-level operation. Rather than assigning each thread to complete a 1D problem along a whole grid line, we assign each thread to complete the work at a single grid point; this is the so-called "atomic level operation". Thus the number of threads launched increases up to the total number of grid points, and it does not vary with the sweeping direction of the derivatives. The CUDA kernel is
therefore configured as
int blocksize = BLOCK_SIZE_INVISC;
int nblock = ((NI * NJ * NK - 1) / blocksize) + 1;
cuda_kernel <<< nblock, blocksize >>> (...);
Eq. 3.2.1
Generally, sizing workloads to fit within the boundaries of a warp, which is a group of 32 threads, leads to more efficient utilization of GPU compute resources. Thus the value of BLOCK_SIZE_INVISC should be evenly divisible by 32. An improper value that is too small or too large may reduce performance. In practical applications, 256 usually proves to give satisfying performance, but the best value should be determined by testing.
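A minimal sketch of this "atomic level operation" thread mapping is given below (the kernel name and body are hypothetical; only the launch shape follows Eq. 3.2.1): one thread handles exactly one grid point, and the last block is guarded because NI * NJ * NK is generally not a multiple of the block size.

#define BLOCK_SIZE_INVISC 256

// Hypothetical per-point kernel: one thread per grid point, independent of the sweep direction.
__global__ void point_kernel(const double *q, double *rhs, int NI, int NJ, int NK)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int nPoints = NI * NJ * NK;
    if (idx >= nPoints) return;                 // guard the partially filled last block

    // Recover (i, j, k) from the flat index when the stencil needs neighbours.
    int i = idx % NI;
    int j = (idx / NI) % NJ;
    int k = idx / (NI * NJ);
    (void)i; (void)j; (void)k;

    rhs[idx] += q[idx];                         // placeholder for the real per-point work
}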
The second improvement is the use of kernel decomposition. By decomposing one big kernel function into several small ones, the number of registers needed in each kernel decreases and the number of active threads increases. Figure 3.2.14 shows the pseudo-code for the computation of the derivatives of the convective flux. The procedure for evaluating the derivative is split into three parts, and each part is computed by its own kernel function. In the kernel cuda_afweno_split_flux(...), the numerical flux fi+1/2 is computed at each half point by one thread, and the result is written into the public memory space; after this kernel finishes, the numerical fluxes at all half-point locations in that direction are known. Then the second kernel function cuda_phys_flux(...) computes the physical flux fi at each point by one thread, again writing the result into a public array in global memory, so that at the end all physical fluxes are known. Finally, in the third kernel function cuda_cflux_df(...), the numerical and physical fluxes are read from global memory to compute the derivative at each point by one thread, and the result is stored in the array R(Q) in global memory. Theoretically, the second kernel could be merged with the first, but the merged kernel would occupy too many registers and the number of active warps would be strictly restricted by the limited register file. By separating the independent part out and computing it with a small kernel, more warps are active, which improves parallel performance.
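A host-side schematic of this three-kernel decomposition is sketched below; the kernel names follow the text, but the argument lists and empty bodies are placeholders rather than the solver's actual signatures:

// Placeholder kernel definitions; the real kernels perform the work described above.
__global__ void cuda_afweno_split_flux(const double *q, double *fnum, int n) { /* WENO + Steger-Warming split flux at half points */ }
__global__ void cuda_phys_flux(const double *q, double *fphys, int n)        { /* physical flux at every grid point */ }
__global__ void cuda_cflux_df(const double *fnum, const double *fphys, double *rhs, int n) { /* derivative, accumulated into R(Q) */ }

// One sweep direction: three small kernels instead of one register-heavy kernel.
void inviscid_derivative_sweep(const double *q, double *fnum, double *fphys,
                               double *rhs, int NI, int NJ, int NK, int blocksize)
{
    int n = NI * NJ * NK;
    int nblock = ((n - 1) / blocksize) + 1;

    cuda_afweno_split_flux<<<nblock, blocksize>>>(q, fnum, n);
    cuda_phys_flux<<<nblock, blocksize>>>(q, fphys, n);
    cuda_cflux_df<<<nblock, blocksize>>>(fnum, fphys, rhs, n);
}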
However, we do not divide the kernel cuda_afweno_split_flux(...) into two parts that calculate the interpolation of the conservative variables and the numerical fluxes separately, as is done in [6], because delivering data from one kernel to another involves access to global memory, which brings high latency. Moreover, these two parts are themselves complex, so the kernels would still occupy too many registers and the number of active threads would not increase appreciably.
Our tests show that the time saved by increasing the number of active threads does not compensate the latency of the additional global-memory accesses. It is worth mentioning that the metrics and Jacobian at the half points, which are needed to compute the numerical flux, are neither stored in global memory nor transferred from the CPU. If we saved them in global memory permanently, more memory would be required for the solver to handle a given grid size. Although the public memory space is large enough to hold them, we do not transfer them from the CPU either, because our tests suggest that recomputing them is much faster than transferring them. Thus they are interpolated from the point values directly when they are needed.
We have seen that on a GPU, as the processor-to-memory latency increases, the capacity of that memory increases. Global memory is the largest and highest-latency memory space on a GPU, so frequent access to global memory increases the execution time of a kernel. Therefore, reducing access to global memory offers an opportunity for optimization during programming and numerical algorithm design. In the current work, we also propose an improved WENO interpolation, which reduces both global-memory access and computational cost.
Figure 3.2.15 shows schematic diagrams of the original and the improved WENO interpolation. In the original method, computing the numerical flux at i + 1/2 requires the point stencil (i−2, i−1, i, i+1, i+2, i+3) and two complete WENO interpolations following (Eq.(10) to Eq.(14) - see [Ye et al.]). The two WENO interpolations share the same characteristic space located at i + 1/2. In the CUDA kernel function cuda_afweno_split_flux(...) mentioned above, for the Steger-Warming flux splitting method, f+i+1/2 and f−i+1/2 are obtained simultaneously in the same thread, so the numerical flux is obtained immediately as fi+1/2 = f+i+1/2 + f−i+1/2. In our improved method, we build the characteristic space at i and use the point stencil (i−2, i−1, i, i+1, i+2) to perform the WENO interpolation for u+i−1/2 and u−i+1/2. We do not need to perform two complete, independent WENO interpolations, because the characteristic variables and the smoothness indicators are shared by the two WENO interpolations. Since we cannot obtain f+i+1/2 and f−i+1/2 in the same thread, we compute the numerical flux fi+1/2 with a new kernel function cuda_sum_split_flux(...). The procedure is shown in Figure 3.2.16.
Figure 3.2.15 Comparison of WENO interpolations. In the original WENO interpolation, the characteristic space is built at i + 1/2, and u±i+1/2 are computed simultaneously in a loop. In the modified WENO interpolation, the characteristic space is built at i, and u−i+1/2 and u+i−1/2 are computed simultaneously in a loop; the point stencils are the same, so the characteristic variables and smoothness indicators are shared, which reduces calculation and global memory access.
3.2.3.4.9 Viscous Fluxes
Evaluating the viscous terms is slightly easier than the convective terms. According to (see [Ye et al.] - Eq.(4) and (5)), we have to compute (u, v, w, T)x,y,z before computing the viscous fluxes. Here we approximate the derivative of the viscous flux from fluxes defined at the point locations, so (u, v, w, T)x,y,z need to be computed only once and are shared by the derivatives in all directions. Computing (u, v, w, T)x,y,z is easy with the chain rule: in a general curvilinear coordinate system, derivatives with respect to (x, y, z) are obtained as
∂/∂x = ξx ∂/∂ξ + ηx ∂/∂η + ζx ∂/∂ζ
∂/∂y = ξy ∂/∂ξ + ηy ∂/∂η + ζy ∂/∂ζ
∂/∂z = ξz ∂/∂ξ + ηz ∂/∂η + ζz ∂/∂ζ
Eq. 3.2.2
Figure 3.2.16 Code structure of derivative of inviscid flux computed with modified AFWENO
(u, v, w, T)ξ,η,ζ can be approximated directly with a sixth-order central scheme. In the CPU code, the computation strategy for the viscous term is similar to that of the inviscid term: after (u, v, w, T)x,y,z are computed, the viscous flux and its derivative are obtained in each direction.
We port the code to the GPU in a similar way to the convective term. Figure 3.2.17 shows the procedure for evaluating the viscous term. There are two parts, and each part involves a loop over directions. In the first part, the contribution of (u, v, w, T)x,y,z from each direction is computed, accumulated, and stored in a public array in global memory. For example, in the ξ direction, ξx * ∂(u, v, w, T)/∂ξ, ξy * ∂(u, v, w, T)/∂ξ and ξz * ∂(u, v, w, T)/∂ξ are computed. This part is completed by a single kernel function, cuda_vars_derv_xyz(...). We have two thread-scheduling strategies for this kernel. In the first strategy, the kernel computes the contribution in each direction for the four variables (u, v, w, T) at one point per thread, so the CUDA kernel cuda_vars_derv_xyz(...) is configured as
int blocksize = BLOCK_SIZE_VISC;
int nblock = ((NI * NJ * NK - 1) / blocksize) + 1;
cuda_vars_derv_xyz <<< nblock, blocksize >>> (...);
In the second strategy, the kernel computes the contribution in each direction for only one variable at one point per thread, so the number of threads launched is four times that of the first strategy, and the CUDA kernel cuda_vars_derv_xyz(...) is configured as
int blocksize = BLOCK_SIZE_VISC;
int nblock = ((NI * NJ * NK * 4 - 1) / blocksize) + 1;
cuda_vars_derv_xyz <<< nblock, blocksize >>> (...);
Each thread must determine the indexes of the grid point and of the variable from its thread ID. The grid-point indexes are needed because a one-sided scheme is used for the derivatives near the boundaries of the grid block; in the convective term we did not need to treat derivatives near block boundaries specially. In the second part, two kernels are applied, similar to those of the convective terms. In the first kernel, the viscous flux at each grid point is computed by one thread. After the first kernel completes, the second kernel cuda_vflux_df(...) computes the derivative at each grid point by one thread, and the result is accumulated into R(Q), which already includes the contribution of the convective terms.
Figure 3.2.17 Code structure of the derivative of viscous flux summarized by a pseudo code
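For the second strategy, each thread has to recover both the grid-point index and the variable index from its flat thread ID. A minimal sketch of that decomposition is shown below (the kernel body is hypothetical; only the launch shape of NI * NJ * NK * 4 threads follows the configuration above):

// Illustrative index decomposition: NI*NJ*NK*4 threads, each handling one of
// the four variables (u, v, w, T) at one grid point.
__global__ void vars_derv_xyz_strategy2(double *derv, int NI, int NJ, int NK)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nPoints = NI * NJ * NK;
    if (tid >= 4 * nPoints) return;

    int var   = tid / nPoints;        // 0..3 -> u, v, w, T
    int point = tid % nPoints;        // flat grid-point index
    int i = point % NI;               // (i, j, k) are needed to select one-sided
    int j = (point / NI) % NJ;        // stencils near the block boundaries
    int k = point / (NI * NJ);
    (void)var; (void)i; (void)j; (void)k;

    derv[tid] = 0.0;                  // placeholder for the real derivative accumulation
}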
3.2.3.4.10 Kernels in Other Sections of the Solver
We have introduced the kernels for the convective and viscous terms, which account for the majority of the total computation and are performed entirely on the GPU. In order to perform the whole computation on the GPU, the other parts of the solver have to be ported onto the GPU as well. Fortunately, these parts involve only simple algorithms and operations that are independent at each grid point, so we apply the "atomic level operation" when designing their kernels. For the computation of the residual, in which a summation over all inner grid points is necessary, we use a "reduced-dimensional operation" rather than CUDA's atomic operations in order to get better efficiency. By "reduced-dimensional operation" we mean that, for a 3D grid block, CUDA threads are mapped onto the grid points of the largest face of the block, and each thread performs the specified operation, such as accumulation, over the points of its grid line. When this "3D → 2D" step is complete, the "2D → 1D" and "1D → single value" steps are performed in a similar fashion. The "reduced-dimensional operation" is also used to find the minimum time step for unsteady simulations controlled by the CFL number.
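A sketch of the "3D → 2D" step of this reduced-dimensional operation is shown below (a generic kernel, not the solver's implementation): one thread is mapped to each (j, k) point of the largest face and accumulates along its i grid line; the "2D → 1D" and "1D → single value" steps follow the same pattern.

// 3D -> 2D reduction: one thread per (j, k) point of the JK face, summing along i.
__global__ void reduce_3d_to_2d(const double *res, double *plane, int NI, int NJ, int NK)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= NJ * NK) return;

    int j = tid % NJ;
    int k = tid / NJ;

    double sum = 0.0;
    for (int i = 0; i < NI; ++i)                  // march along the grid line
        sum += res[(k * NJ + j) * NI + i];        // assumes i-fastest storage (illustrative)

    plane[tid] = sum;                             // one partial result per (j, k)
}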
3.2.3.5 Performance Result
3.2.3.5.1 Speedup Varies with Grid Size
Figure 3.2.18 Global performance varying with grid sizes. The speedup is defined as the ratio of the elapsed time of running on one CPU core to the elapsed time of running with a GPU
In this section we present the performance of HiResX running on a single GPU with different grid sizes, and we compare performance on different GPUs as well. Three GPUs are employed for the test: RTX 2080 Ti, Tesla P100 and Titan V; their specs and characteristics were introduced before. The maximum grid capacity of the solver on these GPUs is also listed in Table 3.2.3. Since the memory capacity of the RTX 2080 Ti is the smallest, the maximum grid size tested here is bounded to 25 M. The CPU in the test is an Intel Xeon E5-2680 v3. For CPU computing, the solver runs on a
single CPU core and all computation is performed on CPU. For GPU computing, because GPU code
cannot run independently of CPU, the present solver uses one CPU core and one GPU device, but
nearly all the computation is performed on GPU device. The speedup here is defined as the ratio of
time of CPU computing to the time of GPU computing for the same case. Global speedups of the optimized GPU version of the code compared to the serial CPU version, for grid sizes ranging from one million to 25 million, are shown in Figure 3.2.18. We obtain significant speedups on all GPUs for all grid sizes. Generally, the speedups of all GPUs increase with the grid size when the grid size is smaller than 15 million, but beyond that the RTX 2080 Ti and the Tesla P100 gradually reach their limits. The maximum speedups for the RTX 2080 Ti and the Tesla P100 are thus about 640 and 1500, respectively. The Titan V has not shown a drop as the grid size increases.
Due to the limitations of memory capacity and grid capacity, the maximum speedup of the Titan V is predicted to be about 1950. The trends of the actual measured performances match the ideal DP performances of these three GPUs, although the magnitudes deviate considerably from the ideal DP ratios. According to the specs of these GPUs, the memory bandwidths of the RTX 2080 Ti and the Titan V are close, but the measured performance of the Titan V is nearly three times that of the RTX 2080 Ti, which benefits from more CUDA cores and the higher ratio of double-precision to single-precision units (1:2) in the Titan V. The Tesla P100 has fewer CUDA cores than the RTX 2080 Ti but a higher ratio of double-precision to single-precision units, 1:2 for the Tesla P100 versus 1:32 for the RTX 2080 Ti, resulting in more double-precision units. Hence the RTX 2080 Ti is mainly bound by computation rather than by memory bandwidth.
According to NVIDIA's performance analysis tools, the large deviation from the ideal performance for these three GPUs comes from the relatively low occupancy in the convective and viscous flux kernels, which account for the majority of the global computation. The low occupancy is caused by the limited number of registers. If there are not enough active threads, the double-precision capability becomes excessive, which causes the large deviation from the ideal performance. Although we have decomposed the big kernels into smaller ones and cut down the number of registers per kernel to increase occupancy and the number of active threads, the high latency introduced by relying on other types of memory (e.g. global memory) cancels the accelerating effect of higher occupancy. There is thus a balance between occupancy and memory efficiency.
3.2.3.5.2 Performance of Kernels
Global performance relies on the performance of the kernels in each part of the solver. Figure 3.2.19 shows the performance of the main parts in which all the computation is located. As shown in the figure, the parts with massive computation, such as the evaluation of the inviscid term, viscous term and time step, and the time advancement and computation of the primitive variables, obtain clear acceleration; in these parts there is a lot of complicated calculation at each grid point. Although the residual evaluation is also performed at each point, its kernel is not designed to run with the "atomic operation" and contains no complicated calculation, so the performance of the residual evaluation is relatively low.
For the Runge-Kutta integration in the time advancement, there is only a little calculation, but it is performed at each grid point independently, so the "atomic operation" strategy brings clear acceleration. The speedup of the Titan V is double that of the RTX 2080 Ti, because the Titan V has more CUDA cores and higher bandwidth. For the convective terms, in which a lot of complicated calculation is involved, the speedup of the Titan V is relatively high, but the speedup of the RTX 2080 Ti is only one fifth of it, because the RTX 2080 Ti has fewer double-precision units and fewer CUDA cores than the Titan V. For the viscous terms, whose computation is second only to the convective terms, the accelerating effect is the best. It is worth mentioning that the gap between the speedups of the two GPUs is much smaller here than for the convective terms; this is the effect of the kernel decomposition used to increase the occupancy and the number of active threads. In each kernel of the viscous term the amount of calculation is moderate, so for the RTX 2080 Ti the number of double-precision units is not that scarce, while for the Titan V it may be excessive. The scenario is similar for the time step.
3.2.3.5.3 Performance of Running with Multiple GPUs
In this section, the performance of our solver on multiple GPUs is investigated. This test aims to show the scalability of our solver when it runs on multiple GPUs. For a traditional CPU code running in parallel, the parallel efficiency is affected by the communication among processes and their computational loads. In early GPU computing codes, besides the communication on the CPU side and the computational load, the data transfer between CPU and GPU affects the parallel efficiency as well. For modern GPU computing, the communication between GPUs is an additional factor.
The grid size of all cases in this test is fixed to 25 M (M = million) due to the limited memory capacity of the RTX 2080 Ti. A single block with 25 million grid points is divided nearly equally into several sub-blocks, and each GPU computes one sub-block. Each GPU is driven by one CPU core, so for the multi-GPU cases the solver actually runs in parallel with MPI.
Figure 3.2.19 Performance of main parts of the solver. Generally, the parts that contain a large amount of computation get clear acceleration. In the time advancement and viscous flux sections, due to the high occupancies of the kernels, both GPUs get higher speedups. For the evaluation of the time step and the inviscid flux, the Titan V gets higher speedups than the RTX 2080 Ti, because the Titan V has more double-precision processing units and hence higher double-precision performance.
As mentioned, for GPU computing nearly all the computation is performed on the GPU, so the contribution of the CPU to the global computation is negligible. The elapsed times of the solver running with 1 to 10 GPUs are shown in Figure 3.2.20. The scalability gradually deviates from the linear value as the number of GPUs increases, but the scaling efficiency is still larger than 75% when 10 GPUs are used. We observe that the scalability of the Titan V drops faster than that of the RTX 2080 Ti. However, it is not the parallel efficiency on the CPU side that causes the drop, but the fact that when the grid size per GPU decreases, the speedup of the GPU drops as well. As illustrated in Figure 3.2.18, the variation of the speedup of the Titan V is greater than that of the RTX 2080 Ti. Performance varies with the computational load, which is a remarkable feature of GPUs that differs from CPUs. Therefore, compared with a CPU, the scalability of a GPU is slightly lower.
Figure 3.2.20 Strong scaling results of different GPUs. Both GPUs' scalabilities deviate from the linear result,
but the efficiencies are all larger than 75% when 10 GPUs are utilized. The drops of scalability come from
the fact that when grid size decreases, the speedup drops too, see Figure 3.2.18. Scalability of Titan V
drops faster than RTX 2080 Ti’s, because the speedup of Titan V drops faster than RTX 2080 Ti’s when
the grid size decreases.
To investigate how the topology of the GPUs in the server influences the GPU communication efficiency, a computation model with 4 blocks is assigned to 4 GPUs in the server. The computation model is shown in Figure 3.2.21 (a); the amount of data exchanged between blocks in the horizontal direction is 1/8 of that exchanged between blocks in the vertical direction. The topology of the GPUs in our test server is similar to type (c) in Figure 3.2.7.
There are 10 Titan V GPUs in our server, and each PCIe switch has 5 GPUs mounted on it. Hence there are two "PIX" GPU groups, GPUs 0-4 and 5-9, and GPUs across the two "PIX" groups are in a "PHB" relation. As introduced before, GPUs in a "PIX" relation communicate faster than GPUs in a "PHB" relation, so placing processes that exchange large amounts of data with each other on GPUs within the same "PIX" group yields higher communication efficiency.
For better comparison, five cases are considered, as shown in Figure 3.2.21 (b). For all cases, blocks with the same color are in the same PCIe switch. Cases I and II test the communication performance of the GPUs within each PCIe switch. In case III, the processes with the largest data exchange are not placed on GPUs in the same PCIe switch. In case IV, the processes with the largest data exchange are placed on GPUs in the same PCIe switch, and this case is expected to be more efficient than case III. Case V is designed to be the worst situation. We only run this test on the server with Titan V GPUs, because the RTX 2080 Ti does not support P2P communication without NVLink, while the Titan V supports P2P over PCIe.
The results are shown in Figure 3.2.22. As expected, the cases running within a single PCIe switch are the best. Case IV also achieves performance comparable to cases I and II thanks to the reasonable distribution of the communication load. Case V gets the worst performance, because each GPU communicates only with GPUs located in the other PCIe switch, which is the worst distribution of the communication load. The results indicate that GPU-to-GPU communication optimization is crucial for programs running on current GPU servers in order to fully exploit their performance.
Figure 3.2.21 Case configuration. In figure (a), there are 100 points in the spanwise direction; the data exchange in the vertical direction (red) is larger than the data exchange in the horizontal direction (blue). In figure (b), five cases are set up to test the communication performance of different strategies for assigning the computations to different GPUs. The number in each block denotes the GPU on which the computation is performed, and blocks with the same color are in the same PCIe switch. In cases I and II, the computations are performed on GPUs in one of the two PCIe switches, respectively. In cases III and IV, the computations are performed on GPUs belonging to both PCIe switches: in case III the largest data transfers are performed across the PCIe switches, while in case IV they are performed within the same PCIe switch. In case V, all data transfers are performed across the PCIe switches.
3.2.3.5.4 Case Study
To further investigate the performance of our solver in a practical application, a planar supersonic jet is simulated with implicit large eddy simulation (ILES). The configuration is similar to Ref. [16]: we simulate an under-expanded planar jet with a pressure ratio pe/p∞ = 2.09. The computational domain is discretized by a Cartesian grid with a total of 67.7 M (M = million) points. The physical domain, which excludes the sponge zone, has dimensions 64h × 30h × 5h, with a nozzle extending over 0.6h inside the domain. The jet height is h = 3.5 mm, corresponding to a Reynolds number, based on the jet height and the acoustic speed, of Re = 8.15 × 104. The velocity profile inside the nozzle is a laminar Blasius solution with a boundary-layer thickness of δ = 0.05h; there are 12 points within the boundary layer and 70 points across the jet height. A constant time step Δt = 2.5 × 10−3 is used.
The non-dimensional time of the whole computation is 360, and only the last 210 is output for analysis. Three Titan V GPUs are used for the computation, and the execution time of the simulation is only about 41 hours.
Figure 3.2.23 shows the shock-cell spacing Ls/h as a function of the fully expanded jet Mach number Mj; various experimental results [18, 19], the LES data of Berland et al. [16] and theoretical data [17] for rectangular jets are plotted for comparison. The present result is clearly closer to the experimental results than that of Berland et al. In Figure 3.2.24, the Strouhal number of the fundamental discrete frequency is plotted against the fully expanded jet Mach number for several experiments on rectangular jets, with the experimental data of Panda [18] and the LES data of Berland et al. [16] shown for comparison. Our result is in good agreement with the experiments. Figure 3.2.25 illustrates the flow structure of the jet and its generated acoustic field, presented by the spanwise vorticity ωz and the dilatation in the plane z/h = 2.5. These results prove the reliability of our solver, and the short execution time demonstrates the high efficiency of our solver running on GPUs.
Figure 3.2.22 Performance of HiResX running on different GPU topologies. Cases in which all GPUs are in the same PCIe switch get the best communication efficiency, see cases I and II. For cases in which GPUs are located in different PCIe switches, if the GPU devices are optimally assigned according to the communication load, the communication efficiency can also be improved, see case IV. Without GPU-to-GPU optimization, the communication efficiency is lower, see cases III and V.
Figure 3.2.23 Shock-cell spacing Ls/h as a function of the fully expanded jet Mach number Mj. Our result is closer to the theoretical [17] and experimental results [18, 19] than the LES results of Berland et al. [16].
Figure 3.2.24 Strouhal number St = fh/Uj of the fundamental screech tone as a function of the fully expanded jet Mach number Mj. Our result is in good agreement with the experimental result of Ref. [18], and better than the LES result of Ref. [16].
Figure 3.2.25 Instantaneous snapshot of the spanwise vorticity ωz and the dilatation in the plane z/h = 2.5 as the background. The isosurface of vorticity is colored by the velocity magnitude.
3.2.3.6 Conclusions
Heterogeneous computing is changing the way we do scientific computing. The boom of GPU computing in the past several years shows the power and potential of GPUs, and it attracts more and more researchers to exploit its application in their fields, including CFD simulation. In the early years of GPU computing, many attempts were made to port CFD codes onto GPUs, but the results were not as satisfying as expected. With the development of GPU computing, more powerful GPUs and related technologies have appeared, which prompt us to further exploit its application in CFD.
In this work we analyze the characteristics of the architectures of current GPU servers and propose a set of techniques to improve the efficiency of data transfer between CPU and GPU and of communication between GPUs. An in-house compressible flow solver based on a high-order finite difference method on curvilinear coordinates is successfully ported to GPU with CUDA C. By careful memory planning, we avoid unnecessary data transfers between CPU and GPU without significantly sacrificing the grid capacity of our solver.
The "atomic operation" technique and the kernel decomposition technique are applied to design high-efficiency kernels. A modified AFWENO scheme is proposed, which saves computation and reduces memory access, and works better on GPU than the original WENO. Tests show that our solver achieves maximum speedups of almost 2000x on a Titan V, 1500x on a Tesla P100 and 650x on an RTX 2080 Ti, compared with one CPU core of an E5-2680 v3. The hardware-aware techniques prove to be more efficient than the unoptimized scenarios. A test case of a supersonic jet demonstrates the practical applicability of the present solver. This work provides a systematic and efficient solution for applying GPU computing to CFD simulation with certain high-order finite difference methods on current heterogeneous GPU computers.
3.2.3.7 References
[1] D. R. Chapman, Computational aerodynamics development and outlook, AIAA Journal 17 (1979).
[2] H. Choi, P. Moin, Grid-point requirements for large eddy simulation: Chapman's estimates revisited, Physics of Fluids 24 (2012) 011702.
[3] B. Tutkun, F. O. Edis, A GPU application for high-order compact finite difference scheme, Computers & Fluids 55 (2012) 29-35.
[4] V. Esfahanian, H. M. Darian, S. I. Gohari, Assessment of WENO schemes for numerical simulation of some hyperbolic equations using GPU, Computers & Fluids 80 (2013) 260-268.
[5] K. I. Karantasis, E. D. Polychronopoulos, J. A. Ekaterinaris, High order accurate simulation of compressible flows on GPU clusters over software distributed shared memory, Computers & Fluids 93 (2014) 18-29.
[6] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang, Z. Wang, et al., Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the Tianhe-1A supercomputer, Journal of Computational Physics 278 (2014) 275-297.
[7] J. Lai, Z. Tian, H. Li, S. Pan, A CFD heterogeneous parallel solver based on collaborating CPU and GPU, in: IOP Conference Series: Materials Science and Engineering, volume 326, IOP Publishing, 2018.
[8] J. Lai, H. Li, Z. Tian, Y. Zhang, A multi-GPU parallel algorithm in hypersonic flow computations, Mathematical Problems in Engineering 2019 (2019).
[9] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, Journal of Computational Physics 227 (2008) 10148-10161.
[10] J. Lei, D.-l. Li, Y.-l. Zhou, W. Liu, Optimization and acceleration of flow simulations for CFD on CPU/GPU architecture, Journal of the Brazilian Society of Mechanical Sciences and Engineering (2019).
[11] J. Crabill, F. D. Witherden, A. Jameson, A parallel direct cut algorithm for high-order overset methods with application to a spinning golf ball, Journal of Computational Physics 374 (2018).
[12] F. D. Witherden, A. M. Farrington, P. E. Vincent, PyFR: An open source framework for solving advection-diffusion type problems on streaming architectures using the flux reconstruction approach, Computer Physics Communications 185 (2014) 3028-3040.
[13] Y. Jiang, C.-W. Shu, M. Zhang, An alternative formulation of finite difference weighted ENO schemes with Lax-Wendroff time discretization for conservation laws, SIAM Journal on Scientific Computing 35 (2013) A1137-A1160.
[14] G.-S. Jiang, C.-W. Shu, Efficient implementation of weighted ENO schemes, Journal of Computational Physics 126 (1996) 202-228.
[15] G. Patnaik, A. Corrigan, K. Obenschain, D. Schwer, D. Fyfe, Efficient Utilization of a CPU-GPU Cluster, American Institute of Aeronautics and Astronautics, 2012. URL: https://arc.aiaa.org/doi/abs/10.2514/6.2012-563. doi:10.2514/6.2012-563.
[16] J. Berland, C. Bogey, C. Bailly, Numerical study of screech generation in a planar supersonic jet, Physics of Fluids 19 (2007) 075105.
[17] C. Tam, The shock-cell structures and screech tone frequencies of rectangular and non-axisymmetric supersonic jets, Journal of Sound and Vibration 121 (1988) 135-147.
[18] J. Panda, G. Raman, K. Zaman, Under expanded screeching jets from circular, rectangular and elliptic nozzles, in: 3rd AIAA/CEAS Aeroacoustics Conference, 1997.
[19] G. Raman, E. J. Rice, Instability modes excited by natural screech tones in a supersonic rectangular jet, Physics of Fluids 6 (1994) 3999-4008.
3.2.4 Case Study 4 - A Matrix-free GMRES Algorithm on GPU Clusters for Implicit Large Eddy
Simulation
Authors : Eduardo Jourdan and Z. J. Wang
Affiliations : Department of Aerospace Engineering, University of Kansas, Lawrence, KS, 66045, USA
Citation : Eduardo Jourdan de Araujo Jorge Filho and Zhi J. Wang. "A Matrix-free GMRES Algorithm on
GPU Clusters for Implicit Large Eddy Simulation," AIAA 2021-1837. AIAA Scitech 2021 Forum. January
2021.
This paper investigates the performance of the matrix-free GMRES algorithm without a
preconditioner on GPU clusters for large eddy simulations (LES) with the high-order FR/CPR method.
This implicit solution algorithm does not need to store the Jacobian matrix, which is often too large
to store on GPUs. In addition to the residual operator, it is important to have an efficient global
reduction operation. The present work considers various solution approximation orders on mixed
unstructured meshes with four different element types: hexahedrons, prisms, pyramids and
tetrahedrons. Issues related to their difference in speedup are discussed and compared with the
explicit three-stage Runge-Kutta scheme on a GPU cluster. A strong scalability study is conducted on
the Summit supercomputer with up to 1200 NVIDIA V100 GPUs. A LES benchmark case is used to
evaluate different parameters and the convergence tolerance of the GMRES solver. Preliminary
performance results are also described for a test case using the High Lift Common Research Model
from the 4th AIAA High-Lift Prediction Workshop. In this case, a reduction of the simulation cost by a
factor of 15 to 35 is obtained compared to an explicit scheme on a GPU cluster.
3.2.4.1 Introduction
A series of International Workshops on High-Order CFD Methods have been organized since 2012 [1]
where benchmark vortex-dominated turbulent flow problems have been proposed and simulated
using high-order methods. One major conclusion from them is that high-order methods are more
efficient than low order ones for scale-resolving simulations such as large eddy simulation (LES) to
achieve the same accuracy. Moreover, NASA’s 2030 CFD vision states that LES will be increasingly
utilized in order to handle separated flow problems such as flow over high lift configurations and
through aircraft engines [2].
The use of high-order methods such as discontinuous Galerkin (DG) [3] and flux reconstruction (FR)
[4] for LES has grown significantly in the last decade [510] because of the capability of these
methods in handling complex geometries with unstructured meshes. Another advantage is the fact
that they achieve high-order accuracy by increasing the number of degrees of freedom (DOFs) inside
each element, remaining a compact scheme and hence improving its parallel performance. However,
some challenges still remain, such as the need to further reduce the cost and turnaround time of these
simulations.
At the same time, modern hardware is shifting towards the use of highly parallel accelerators, like
for instance NVIDIA’s Graphical Processing Units (GPUs). Several world flagship supercomputers
such as Summit (Oak Ridge National Laboratory), Sierra (Lawrence Livermore National Laboratory)
and Piz Daint (Swiss National Supercomputing Centre) are equipped with GPU-powered nodes. GPUs
are capable of achieving a much higher theoretical peak in floating-point operations per second
(FLOPs) by using a different parallelism strategy, executing concurrently the same instruction for a
large set of data. As of October 2020, 6 of the top 10 super-computers on the Top500 list [11] have a
node configuration with GPUs. Most recent GPU developments include the launch of the Nvidia A100
and AMD MI100 cards with increasingly higher memory bandwidth, memory resources and
computational power.
Typical challenges in the use of GPUs for high-order CFD simulations involve [10, 12-20] the different parallel architecture, data-communication and scalability issues, memory access patterns and memory bandwidth limitations.
Nevertheless, high-order methods such as DG and FR are compact with intensive local operations,
with the DOFs inside an element increasing rapidly with the order of accuracy. This leads to
operations like polynomial interpolations and matrix vector multiplications being faster and more
efficient for this particular hardware architecture, making these methods a suitable choice to be used
on GPUs.
On the other hand, higher-order spatial operators are much stiffer than low order ones [21]. As a
result, the stability limit of explicit schemes is more severe for high-order methods, and matrix-based
implicit methods take much more memory than lower order ones [21]. The different memory
architecture of the GPU creates additional difficulties for the use of implicit time solvers with high-
order methods [18]. Implicit solvers also tend to require more communication and may have memory
bound algorithms like global reduction operations.
A typical implicit solver used in CFD is the Newton-Krylov method, in which Newton-type methods
for the nonlinear system are coupled with Krylov subspace methods for linear system solving [22].
Reference [18] shows some of the difficulties in accelerating implicit solvers like the GMRES method
[23] with matrix-based preconditioners like the inherently sequential incomplete lower-upper
factorization (ILU). The speedups between one Nvidia K40 GPU and one CPU core for explicit schemes are on the order of 70, while they are around 5 for a GMRES+ILU method [18]. Some additional
works with acceleration techniques include Ref. [14] with a preconditioned conjugate gradient solver
for the Poisson equation and Ref. [10] with a p-multigrid method and explicit smoothers.
In this context, the objective of the current paper is to investigate the use of a matrix-free implementation of the GMRES method on multi-GPU clusters for unsteady simulations. The matrix-free GMRES method [22] has a much smaller memory footprint; besides the residual operator, it needs an efficient global reduction algorithm and inner product calculation. Since the focus of the current work is on unsteady simulations, no preconditioner is used: because a small time step is needed to capture the turbulent dynamics, the resulting system of equations has a better condition number than in steady problems.
First, the speedup between CPU and GPU implementations of the matrix-free GMRES method is
measured for several orders of accuracy and cell types. Different parts of the algorithm such as the
residual operator and the Gram-Schmidt orthogonalization process and their performance are
investigated. Strong scalability studies from p1 to p5 are done with up to 200 nodes and 1200 GPUs
on the Summit supercomputer, which is the number two computer in the TOP500 list of June 2020
[11].
The LES benchmark T106C turbine case is used to run parametric tests involving the nonlinear
unsteady residual convergence tolerance, the GMRES linear convergence tolerance and the Krylov
subspace size. The simulation accuracy and total turnaround time are compared between a multi-GPU
GMRES implementation, an explicit RK3 time scheme run on multi-GPUs and an implicit LU-SGS
solver on a CPU cluster. Finally, the GPU GMRES algorithm performance is tested on a mixed mesh of
the high-lift configuration version of the Common Research Model intended to be studied on the 4th
AIAA CFD High Lift Prediction Workshop.
The paper is organized as follows. Section 3.2.4.2 presents the numerical formulation of the high-
order FR/CPR method and section 3.2.4.3 describes the matrix-free GMRES algorithm. Finally, Sec.
3.2.4.4 presents the numerical results and Sec. 3.2.4.5 draws some concluding remarks.
3.2.4.2 A Brief Overview of the FR/CPR Method
The FR/CPR method was originally developed by Huynh [4] in 2007 for hyperbolic partial differential
equations, and later it was extended to hybrid unstructured meshes [24]. Further developments on
the FR/CPR method are reviewed in [25,26]. This method belongs to discontinuous finite element
methods, similar to the DG method, but also has some unique advantages. For example, FR contains
a larger family of schemes [4], which may allow larger time steps than the DG method [27]. Many
groups also reported that FR is more efficient than the DG method [28,29]. Here we present a brief
introduction of the FR/CPR method starting from a hyperbolic conservation law
∂U/∂t + ∇ · F(U) = 0
Eq. 3.2.3
with non-overlapping elements, and introducing an arbitrary test function W in each element, the
weighted residual formulation of Eq. 3.2.3 on element Vi can be expressed as
∫Vi [∂U/∂t + ∇ · F(U)] W dV = 0
Eq. 3.2.4
The conservative variables inside one element are assumed to be polynomials, and expressed by
nodal values at certain points called solution points (SPs). After applying integration by parts to the
divergence of flux, replacing the normal flux term with a common Riemann flux Fncom and integrating
back by parts, we obtain
∫Vi [∂U/∂t + ∇ · F(U)] W dV + ∮∂Vi [Fncom − Fn(Ui)] W dS = 0
Eq. 3.2.5
Here, the common Riemann flux is computed with a Riemann solver
Fncom = Fncom(Ui, Ui+, n)
Eq. 3.2.6
where Ui+ stands for the solution outside the current element, and n denotes the outward normal
direction of the interface. For the current work, the Riemann solver used is the Roe scheme [30]. The
normal flux at the interface is
Fn(U) = F(U) · n
Eq. 3.2.7
Note that if the face integral in Eq. 3.2.5 can be transformed into an element integral then the test
function W will be eliminated. In order to do so, a "correction field" δi is defined in each element as
∫Vi δi W dV = ∮∂Vi [Fn] W dS
Eq. 3.2.8
where [Fn] = Fncom − Fn(Ui) is the normal flux jump. Eq. 3.2.5 and Eq. 3.2.8 result in
∫Vi [∂U/∂t + ∇ · F(U) + δi] W dV = 0
Eq. 3.2.9
The final formulation at each solution point is
∂Ui,j/∂t + Πj(∇ · F(Ui)) + δi,j = 0
Eq. 3.2.10
where Πj denotes a projection to the polynomial space, and the subscript j denotes the j-th solution point in a certain element. The formulation used for the viscous flux and the solution gradient is the so-called BR2 scheme [31].
3.2.4.3 Matrix-free GMRES Algorithm
The GMRES algorithm developed by Saad and Schultz [23] is an iterative linear system solver. It is
widely used in the CFD community, including with high-order methods [21]. However, its efficiency
is strongly influenced by the matrix condition number [22], and high-order discretizations typically lead to stiffer problems. Furthermore, the size of the Jacobian matrix arising from high-order methods grows very quickly with the solution polynomial order p; it is proportional to p^6 for 3-D simulations. The matrix-free version of the GMRES algorithm takes advantage of the fact that the Jacobian matrix is used only in matrix-vector multiplications, so one does not need to form the matrix explicitly. Different approximations can be used for this operation [22], and in this work a simple first-order finite difference approach is used, as in Eq. 3.2.15.
A robust and efficient preconditioner that scales well on massive parallel systems is still a research
challenge [21]. This current work does not use any preconditioner. The fact that only unsteady
simulations are performed helps to partially mitigate this issue. Previous works have shown the
importance of preconditioners for steady simulations starting the iterative process from the
freestream solution [21,22,32].
The framework used here is often called the Newton-Krylov method [22]. It has an outer nonlinear loop, the Newton method, which drives the unsteady residual to zero, and an inner iterative linear solver, the GMRES method, which solves the linear system originating from the Newton iterations. The following is the derivation of this framework. Assume that after the spatial discretization the semi-discrete problem can be written as
dŨ/dt = R(Ũ)
Eq. 3.2.11
in which Ũ represents all the degrees of freedom in the problem and R is the residual operator given by the spatial discretization. The problem can be rewritten so that in every physical time step the following nonlinear system is solved:
R*(Ũn+1) = dŨ/dt|n+1 − R(Ũn+1) = 0
Eq. 3.2.12
where R* is the unsteady residual. The second order backward differentiation formula (BDF2)
time scheme is used for the time derivative discretization in this work. A Newton solver can be used
to solve this system, with the Jacobian matrix J defined as
J = ∂R*(Ũ)/∂Ũ
Eq. 3.2.13
The Newton method consists of a sequence of iterations, with the following linear systems to be solved:
J(Ũk) ΔŨk = −R*(Ũk),   Ũk+1 = Ũk + ΔŨk
Eq. 3.2.14
with a given initial condition Ũ0 equal to the current time step solution. The Newton iteration can be
terminated based on a required drop in the norm of the unsteady nonlinear residual defined in Eq.
3.2.12. This nonlinear convergence tolerance here named ’time-eps’ can affect the final solution
accuracy, and its value is studied in this paper. In order to solve the linear system defined in Eq.
3.2.14 we use the matrix-free GMRES method without a preconditioner. This linear system can be
solved up to some tolerance named ’linear-eps’, and this parameter’s effect on the method efficiency
and convergence is studied in this paper too. The matrix vector multiplication in the GMRES solver
is replaced by the following approximation:
J v ≈ [R*(Ũ + εv) − R*(Ũ)] / ε
Eq. 3.2.15
where ε is a small perturbation, here chosen to be ε = |Ũk| · 10-7. The pseudocode of the GMRES method for solving the problem Jx = b is shown in Algorithm 1, following the reference [33]. The size of the Krylov subspace is given by the parameter m ≤ 15, which is very small compared to the mesh size.
It is important to draw attention to some points in this algorithm. First, lines 4 and 5 are the classical
Gram Schmidt (CGS) orthogonalization algorithm. Typically the modified Gram Schmidt process
could be used here to provide robustness for ill-conditioned problems [33], but the CGS algorithm is
chosen since it is more suitable for GPU parallelism. No impact on the robustness and stability of the algorithm was noticed when using the classical method instead of the modified one in the cases
considered in the current work. The CGS algorithm has a better performance on GPUs because it
involves matrix-vector operations instead of just vector-vector operations in the modified algorithm.
Secondly, note that the cost of line 3 is basically one residual evaluation. Therefore, it is important to
have an efficient implementation of the residual calculation [19].
The numerical results in Sec. 3.2.4.4 show that most of the runtime is spent on the residual calculation in line 3 and in the CGS algorithm. Another important aspect of the GMRES solver is that an efficient global reduction and dot-product kernel on the GPU is important for lines 1, 4 and 6 of Algorithm 1. The current implementation uses the warp shuffle reduction proposed by NVIDIA to reduce the local vector inside each GPU to a scalar, and then a CUDA-aware MPI implementation is used to obtain the final result across multiple GPUs.
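A sketch of such a warp-shuffle dot-product reduction is given below (a generic CUDA pattern, not the paper's implementation; it produces one partial sum per block, after which a second reduction on the device or host and an MPI_Allreduce across GPUs would give the global value):

// Reduce a value across the 32 threads of a warp.
__inline__ __device__ double warp_reduce_sum(double val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Block-level partial dot product of the local vectors x and y.
__global__ void dot_partial(const double *x, const double *y, double *partial, int n)
{
    __shared__ double warpSums[32];                  // one slot per warp of the block
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        sum += x[i] * y[i];                          // grid-stride loop over the local DOFs

    sum = warp_reduce_sum(sum);                      // reduce within each warp
    if ((threadIdx.x & 31) == 0) warpSums[threadIdx.x >> 5] = sum;
    __syncthreads();

    if (threadIdx.x < 32) {                          // first warp reduces the per-warp sums
        int nWarps = (blockDim.x + 31) >> 5;
        double v = (threadIdx.x < nWarps) ? warpSums[threadIdx.x] : 0.0;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0) partial[blockIdx.x] = v;
    }
}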
The least-squares problem defined in line 10 has a very small size, of the order of the number m of Krylov vectors used. In the current work a common technique is used to solve it, namely transforming the Hessenberg matrix into upper triangular form with plane rotations [33]. This technique allows checking the convergence of the method at each iteration l without extra work, and the final solution update ym is calculated on the CPU by a backward substitution algorithm. In addition, note that if the linear system is not solved to the requested tolerance, the current implementation uses the GMRES output to update the solution anyway and moves on to the next nonlinear iteration; the restarted version of GMRES is not used to continue solving the linear system.
Following this description of the Newton-Krylov approach, there are three major parameters: the unsteady residual convergence tolerance 'time-eps', the linear system convergence tolerance 'linear-eps' and the maximum size m of the search space. Their impact is numerically investigated in the following section. The parameter 'time-eps' is related to the accuracy of the time discretization, while 'linear-eps' and m are related to the performance of GMRES. In addition, two other methods are used for performance comparison: the LU-SGS solver [34, 35] with the BDF2 time scheme and the third-order explicit strong-stability-preserving Runge-Kutta scheme (RK3) [36]. The LU-SGS solver is implemented only for CPUs, and its performance for the current code is shown in Ref. [8]. The RK3 scheme is also implemented on GPUs, with its performance described in Refs. [8, 19].
3.2.4.4 Numerical Results
3.2.4.4.1 Multi-GPU Performance and Scalability Results
In this section, results related to single- and multi-GPU performance are described and discussed. The goal is to investigate the difference in performance between simulations with the four cell types and different solution orders. Figure 3.2.26 shows the speedup obtained with respect to a single CPU core of an Intel Xeon E5-2620 chip. This Intel chip has 6 cores and hyper-threading capability, but the comparison here is with respect to single-core performance only.
Figure 3.2.26 NVIDIA V100 GPU speedup compared to a single CPU core of an Intel Xeon E5-2620 chip
The GPU speedup for the explicit scheme with hexahedral cells is around 400 for all orders considered. It increases slightly for p3 and p5, reaching roughly 475 and 455 respectively. On the other hand, the speedup results for tetrahedral cells increase with the solution polynomial order, growing from 450 to 1600 for p1 and p5 respectively. This is discussed in Ref. [19]; one of the reasons is that a tetrahedral cell is arithmetically more intense, with a much larger ratio of floating-point operations to global memory accesses as the polynomial degree increases [19]. As also discussed in Ref. [19], a pyramid cell follows the tetrahedron performance closely, and the prism speedup falls between that of a hexahedron and a tetrahedron.
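One common way to quantify this effect (a standard definition, not a formula taken from Ref. [19]) is the arithmetic intensity of the residual kernel,

\[
I = \frac{W}{Q}\ \left[\frac{\text{FLOPs}}{\text{byte}}\right],
\]

where W is the number of floating-point operations performed and Q is the number of bytes moved to and from global memory. Kernels with a larger I, such as the high-order tetrahedral residual, keep the GPU's arithmetic units busy relative to its memory bandwidth and therefore tend to show larger speedups over a single CPU core.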
The GMRES results in Figure 3.2.26 show a similar trend between the different cell types as the explicit scheme, but with a slightly lower speedup. For hexahedral cells, it is 390 and 380 for p3 and p5 respectively, while for tetrahedral cells it is 380 and 1300 for p1 and p5 respectively. These slightly lower values can be understood from the fact that almost all of the cost of an explicit iteration comes from the residual operator, which is arithmetically very intense as discussed before [19] and involves matrix-vector and matrix-matrix operations.
On the other hand, the GMRES solver has additional operations other than the residual calculation. As described in Sec. 3.2.4.3, these are mainly vector-vector operations and global reductions, which have a lower parallel speedup on the GPU architecture than matrix-matrix operations. Table 3.2.4 shows the measured percentage of a typical simulation iteration that is spent in the residual operator (line 3 of Algorithm 1) and in the classical Gram-Schmidt (CGS) algorithm (lines 4 and 5 of Algorithm 1). We can see that the CGS method accounts for 7% to 18% of the iteration cost. Interestingly, this percentage increases with the polynomial order for the hex and prism elements, which have a considerably higher number of degrees of freedom per cell than the pyramid and tet elements at higher orders.

Table 3.2.4 Results of the percentage of a typical simulation iteration that is spent in different parts of the GMRES algorithm, for different solution orders and cell types

Figure 3.2.27 Strong scalability study with GPU simulations on the KU cluster for a mesh with 149k hex elements
Another important aspect investigated is the strong scalability performance of the matrix-free
GMRES solver. Strong scalability tests are done with two different grids: a coarse hexahedral mesh
with 149k cells and a fine hexahedral mesh with 15.6M elements. The coarse mesh is used on the KU
cluster. In this test, 14 nodes are used, and each node has 2 V100 GPUs. Each node also has one Intel(R) Xeon(R) CPU E5-2680 v3 chip with 24 CPU cores counting hyper-threading. The network connection uses InfiniBand, but CUDA-aware MPI is not available. The results are shown in
Figure 3.2.27.
We can see that the parallel efficiency grows with the polynomial order. Moreover, the difference
between the scalability of the explicit scheme and the GMRES solver is not so significant for p3 and
p5 as it is for p1. With 28 GPUs the number of cells per partition is 5.3k and the GMRES parallel
efficiency is 43%, 67% and 87% for p1, p3 and p5 respectively.
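For reference, the strong-scaling parallel efficiency quoted here can be read through the usual definition (assuming, as is common, that it is measured relative to the smallest GPU count N0 used in the study),

\[
E(N) = \frac{N_0\,T(N_0)}{N\,T(N)} \times 100\%,
\]

where T(N) is the wall-clock time per iteration on N GPUs. Efficiency drops at large GPU counts because the 5.3k cells per partition no longer provide enough work to hide communication and kernel-launch overheads, especially at the lower polynomial orders.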
The strong scalability results for the fine hex mesh on Summit are shown in Figure 3.2.28. The Summit supercomputer is a leadership computing facility operated by the Oak Ridge National Laboratory for the United States Department of Energy. From June 2018 to June 2020, Summit was the most powerful supercomputer on the TOP500 list [11]. Each node on Summit has 6 V100 GPUs and 2 IBM POWER9 CPU chips totaling 44 CPU cores. Here, CUDA-aware MPI communication is used.
The results in Figure 3.2.28 indicate again that the GMRES parallel efficiency grows with the solution order. Other than the p1 case, the scalability of the explicit scheme and the GMRES solver seems similar. More than 88% parallel efficiency is achieved with 240 GPUs for all simulations considered. With 1200 GPUs, the number of cells per partition is 13k and the GMRES parallel efficiency is 21%, 45% and 79% for p1, p3 and p5 respectively.

Figure 3.2.28 Strong scalability study with GPU simulations on Summit for a mesh with 15.6M hex elements
3.2.4.4.2 Transitional Flow Over the T106C Low Pressure Turbine Cascade
This case setup is taken from the 4th International Workshop on High-Order CFD Methods [37]. The
blade has a chord of C = 0.093 m, a pitch to chord ratio of 0.95, and a span to chord ratio of 10%. The
inlet condition for the cascade is chosen such that the isentropic exit Mach number is Mis = 0.65 and
the Reynolds number is Re = 80,000 based on the isentropic exit velocity. The inlet flow angle for this
case is 32.7 degrees. As the inlet turbulence is very low, the flow has a laminar separation and a relatively
slow natural transition. A direct numerical simulation (DNS) of this case is established in Ref. [38]
with an extensive hp-refinement study. The solver used in this reference [38] is the CPU version of
the same solver as the current work, namely the hpMusic solver. In addition, the fine mesh used in this reference was made available for the current GPU work. This test case is used to evaluate the effect of the parameters 'time-eps', 'linear-eps' and the Krylov space size, as well as to compare the turnaround time of simulations with three different time solvers: the LU-SGS solver [8, 34, 35] on CPUs, the matrix-free GMRES on GPUs and the explicit RK3 scheme [36] on GPUs. A closeup view of the mesh around the blade is shown in Figure 3.2.30 (a). It has a total of 148,704 hexahedral elements. Periodic boundaries are used in the spanwise and pitchwise directions. For a p2 polynomial order, the total number of degrees of freedom is 4 million per equation. We use the same definition of the characteristic time as Ref. [38], Tc = C/U, with C being the blade chord and U the freestream velocity based on the isentropic exit conditions. Following the analysis in Ref. [38], all simulations are run for 180Tc, with the last 100Tc being used for time averaging. Figure 3.2.30 (b) displays the average pressure contours around the blade. The average values of the equivalent y+, x+ and z+ for all simulations are 0.5, 10 and 11 respectively.

Figure 3.2.29 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic time Tc using different physical time steps

Figure 3.2.30 Mesh provided by Alhawwary et al. [38] and time-averaged solution contours of the p2 simulation with GPU
The first test is related to the physical time step.
The GMRES method is tested with different time
steps, from a factor of 1 to 100 times the
maximum stable explicit RK3 time step. The
performance results are shown as the speedup
relative to the GPU RK3 simulation when
comparing the cost to run one characteristic time
Tc. Figure 3.2.29 shows the results for p1, p3
and p5 simulations using the ’linear-eps’ as 0.2,
the Krylov subspace size as 10 and the ’time-eps’
as 0.01. The results in Figure 3.2.29 indicate
that there is an optimum time step to be used, and
for this case it is around 10 to 15 times the time
step from the RK3 stability limit. For a bigger time
step, the GMRES solver convergence rate starts to
deteriorate as the Jacobian matrix condition
number increases. For this case, the improvement in simulation turnaround time from using GPU-GMRES compared to GPU-RK3 is about 1.5 for p1 and p3, and 2.0 for p5. The second study is related to the linear system convergence tolerance parameter. Simulations are run
varying the ’linear-eps’ between 0.2 and 0.01. In order for the GMRES solver to converge with 0.01
as the convergence tolerance, a Krylov subspace size of 30 is used in these tests. The p2 results with a time step 10 times bigger than the RK3 time step and a 'time-eps' of 0.01 are shown in Figure 3.2.31.

Table 3.2.5 CPU and GPU simulation final settings for the T106C test case

Figure 3.2.31 Results of GPU GMRES speedup relative to the GPU RK3 for running one characteristic time Tc using different GMRES linear convergence tolerances
It is noticeable that for this case the final speedup value does not change significantly when varying the 'linear-eps', at least when there are enough Krylov vectors to converge. Moreover, the unsteady
residual tolerance of 0.01 is relatively high when compared to the convergence tolerance for steady
problems. Since GPU memory is a scarce resource, the ’linear-eps’ of 0.2 is chosen with a Krylov
subspace of 10, which is already enough to converge in this problem.
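A rough storage estimate illustrates why the Krylov subspace size matters on GPUs. Assuming double precision, five conserved variables, and the roughly 4 million degrees of freedom per equation quoted above for this p2 mesh (these assumptions are ours, not figures reported in the paper), the Krylov basis alone requires, in aggregate across all GPUs, approximately

\[
\text{memory} \approx (m+1)\,N_{\text{dof}}\,N_{\text{eq}} \times 8\ \text{bytes}
\approx 31 \times (4\times10^{6}) \times 5 \times 8\ \text{B} \approx 5\ \text{GB} \quad (m = 30),
\]

compared with roughly 1.8 GB for m = 10, and this storage has to fit alongside the solution, residual and metric data in the limited memory of the GPUs.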
Considering the results of the parametric tests, two GPU GMRES simulations are run for the duration
of 180Tc with ’linear-eps’ equal to 0.2 and 10 Krylov vectors. One converges the unsteady residual
tolerance to 0.01 and the other to 0.001. The complete settings are shown in Table 3.2.5 including
the one for the CPU LU-SGS run and the GPU RK3 simulation. The speedup when using the GPU
GMRES algorithm over the GPU RK3 is a factor of 1.7 and 1.2 for unsteady residual tolerances of 0.01
Figure 3.2.32 Comparison of the time- and spanwise-averaged pressure coefficient between the CPU
simulation and different GPU simulations
Figure 3.2.33 Comparison of the time- and spanwise-averaged coefficient of streamwise friction
between the CPU simulation and different GPU simulations
83
and 0.001 respectively. For this test case and mesh, the ratio between the maximum and minimum
cell Jacobian is 104, not so dramatic as the case in the next section. In order to understand the effect
of the parameter ’time-eps’ in the solution accuracy, a comparison is made in terms of time- and
spanwise-averaged coefficients. Four simulations are considered: CPU, GPU-RK3, GPU-GMRES-eps2
and GPU-GMRES-eps3. The one named ’GPU-GMRES-eps2’ has ’time-eps’ equal to 0.01 and ’GPU-
GMRES-eps3’ has ’time-eps’ equal to 0.001. Figure 3.2.32 shows the results for time- and spanwise-
averaged pressure coefficient Cp. The CPU and GPU-RK3 curves are visually on top of each other,
which helps to verify the GPU implementation. When the GMRES time solver is used, the run with
’time-eps’ equal to 0.001 is also on top of the explicit time run and the CPU simulation.
Some differences start to appear in the suction side of the blade, close to the trailing edge, when the
unsteady convergence tolerance for the GMRES is just 0.01. It is at this location that the laminar flow
separation happens with vortex breakdown and transition into turbulent structures. This region is
difficult to accurately resolve. The closer view at the trailing edge better shows the difference
between the results. It is interesting that at least for this case the GMRES solver needs to have a ’time-
eps’ equal to 0.001 to
match the results of the
LU-SGS solver with a
’time-eps’ of 0.01.
Another comparison is made for the time- and spanwise-averaged streamwise friction coefficient Cf in Figure 3.2.33. This coefficient is harder to converge and harder to bring to a statistically steady state [38]. The CPU, GPU-RK3 and GPU-GMRES-eps3 curves hardly show any visual difference in the global view. Similar to the pressure coefficient, the blade section where the difference is large for the GPU-GMRES-eps2 is on the suction side, close to the trailing edge, as shown in the enlarged view. The GPU-GMRES-eps2 slightly under-predicts the Cf peak at 0.95 of the axial chord.

Table 3.2.6 Comparison of mean lift and drag coefficients between the CPU simulation and different GPU simulations. Relative error is calculated with respect to the CPU simulation

Figure 3.2.34 Comparison of the PSD of pressure at wake point (1)
The mean lift coefficient Cl and drag coefficient Cd are also calculated and compared in Table 3.2.6. The relative error with respect to the CPU simulation is calculated and also shown in the table. Note that even the GPU-RK3 and CPU simulations have a difference of 0.1% in Cl and 4 drag counts in Cd. Possible explanations are the differences in force-calculation sampling frequency, time step and time scheme. The GPU-GMRES-eps3 simulation has very similar results to the GPU-RK3 case as well, with a difference of 2 drag counts. Interestingly, the GPU-GMRES-eps2 run has an error of only 5 drag counts when compared with the CPU run, but the Cl relative error
A final comparison is made in Figure 3.2.34 with respect to the power spectral density (PSD) of the
pressure at a point in the wake. Following reference [38], the point is located close to the trailing
edge with coordinates (0.8591, -0.5137)C. The same tool [39] is used in the PSD calculation, with
more details provided in Ref. [38]. We conduct a spanwise average using four locations in the spanwise direction, Z/C ∈ {0.0, 0.025, 0.05, 0.075}. The Strouhal number St = fC/U is calculated based on the chord C, the exit velocity U and the frequency f in Hz. Due to differences in the sampling
frequency, the Strouhal number cutoff is different between the GPU-GMRES simulations and the CPU
run. All simulations are able to capture the first fundamental peak and agree in its frequency and
magnitude. Moreover, the CPU, GPU-RK3 and GPU-GMRES-eps3 curves seem to be on top of each
other. The GPU-GMRES-eps2 appears to capture some of the 2nd and 3rd harmonic frequencies of
this first peak, but it slightly damps their magnitudes. The high-frequency slope decay is also slightly
damped by the GPU-GMRES-eps2 simulation.
3.2.4.4.3 Flow Over a High-Lift Configuration of the Common Research Model
This test case is from the most recent 4th AIAA High-Lift Prediction Workshop
(https://hiliftpw.larc.nasa.gov/). Both the geometry and flow conditions are provided on the
workshop web site. The High Lift Common Research Model (CRM-HL) is an open-source, publicly-
available wing-body high lift configuration being utilized for CFD validation within a broad
international CRM-HL ecosystem. Geometry associated with the NASA 10% scale, semi-span model
configuration tested in the QinetiQ 5-metre wind tunnel is used for this study. This selected problem
is case 2a at the highest angle of attack of 21.46 degrees. The high-order (Q2) computational mesh
was generated by Steve Karman of Pointwise. This mesh contains over 4 million mixed elements.
Selected views of the computational meshes are displayed in Figure 3.2.35. The flow conditions
are: free-stream Mach number of 0.2, and Reynolds number based on the mean aerodynamic chord
of 5.49 million. The far-field boundary is placed 1,000 chords away from the geometry. The smallest and largest element sizes differ by more than 6 orders of magnitude. We report preliminary wall-clock times for both explicit and implicit schemes for the p2 simulations on KU's NVIDIA V100 cluster. For the explicit three-stage SSP Runge-Kutta scheme, the maximum time step for stability is about 1.39e-9 s because of some small cells near the wall boundaries. The simulation used 30 V100 GPUs, and one time step took roughly 0.053 s. For the implicit run, the optimized 2nd-order BDF scheme was used together with a GMRES solver without a preconditioner. We considered two inner convergence tolerances of 0.01 and 0.001. Much bigger time steps of 5e-7 s and 1e-6 s can be used for the implicit simulation. In fact, for dt = 1e-6 s, we employed an inner tolerance of 0.01. One implicit time-marching step costs 1.1 s of wall-clock time. Considering the much bigger time step used by the implicit scheme, the speedup in this case is roughly 35. When we used an inner tolerance of 0.001, it took roughly twice the wall-clock time for the implicit scheme to converge, resulting in a speed-up of 15. We are still in the process of running this case, and generating time-converged mean and Reynolds stresses for the workshop. But the simulation is proceeding as expected.
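As a consistency check on the quoted factor of roughly 35, the speedup per unit of physical time can be estimated from the numbers above as the ratio of time-step sizes divided by the ratio of wall-clock costs per step,

\[
\text{speedup} \approx \frac{\Delta t_{\text{imp}}/\Delta t_{\text{exp}}}{t_{\text{imp}}/t_{\text{exp}}}
= \frac{1\times10^{-6}/1.39\times10^{-9}}{1.1/0.053} \approx \frac{719}{20.8} \approx 35.
\]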
A preliminary flow field is displayed in Figure 3.2.36, which shows the iso-surfaces of the Q-criterion colored by the streamwise velocity. This picture clearly depicts a vortex-dominated turbulent flow field.

Figure 3.2.36 Instantaneous flow field showing iso-surfaces of the Q-criterion colored by the stream-wise velocity for a p2 simulation
3.2.4.5 Concluding Remarks
The present paper evaluates the performance of an implementation of the matrix-free GMRES
algorithm with the high-order FR/CPR method on GPU clusters for industrial large eddy simulation.
Differences in speedup are discussed for four cell types: hexahedrons, prisms, pyramids and
tetrahedrons. The effect of increasing the polynomial order from p1 to p5 is also studied. This implicit
solver does not need to form the Jacobian matrix and uses less memory, a scarce resource on GPUs,
and no matrix preconditioner is considered.
The speedup from one CPU core to a GPU card is measured for NVIDIA V100 GPUs. It is about 380 for hexahedral cells, and it grows from 380 to 1300 for tetrahedral cells when increasing the polynomial order from p1 to p5. Depending on the solution order and cell type, the GMRES solver spends around 70% of its time in the residual operator and 15% in the classical Gram-Schmidt algorithm. This last part is important in order to keep the implicit solver with CPU/GPU speedups and scalability characteristics similar to the explicit RK3 scheme.

Figure 3.2.35 Views of the high-order (Q2) hybrid mesh generated by Pointwise with over 4 million elements
A benchmark case of the flow over the T106C turbine blade is used for parametric tests. A linear convergence tolerance of 0.2 and a Krylov subspace size of 10 seem to give the best performance in this case. Two GMRES simulations are done with different unsteady residual convergence tolerances, and a value of 0.001 seems to be necessary in order to match the explicit simulation results. In this case, the GPU GMRES solver speedup over the GPU RK3 simulation is 1.2.
A test case from the most recent 4th AIAA High-Lift Prediction Workshop is used to measure the
performance of this method on mixed meshes and with a higher Reynolds number. Preliminary
results on a GPU cluster indicate that the GMRES solver, when compared to the RK3 scheme, reduces the simulation cost by a factor of 15 when the unsteady convergence tolerance is 0.001 and by 35 when the tolerance is 0.01.
Acknowledgements
The present research has been supported by GE Global Research and the Army Research Office under
Award Number W911NF-20-1-0065. This research used resources of the Oak Ridge Leadership
Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of
the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
3.2.5 References
[1] Wang, Z. J., Fidkowski, K., Abgrall, R., Bassi, F., Caraeni, D., Cary, A., Deconinck, H., Hartmann, R.,
Hillewaert, K., Huynh, H. T., Kroll, N., May, G., Persson, P., van Leer, B., and Visbal, M., “High-Order CFD
Methods: Current Status and Perspective,”
[2] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD
Vision 2030 study: a path to revolutionary computational aero sciences,” NASA CR-2014-218178,
NASA, 2014.
[3] Cockburn, B., and Shu, C. W., “TVB Runge-Kutta Local Projection Discontinuous Galerkin Finite
Element Method for Conservation Laws V: Multidimensional Systems,” Journal of Computational
Physics, Vol. 141, 1998, pp. 199–224.
[4] Huynh, H. T., “A flux reconstruction approach to high-order schemes including discontinuous
Galerkin methods,” 18th AIAA Computational Fluid Dynamics Conference, Miami, FL, 2007.
[5] Uranga, A., Persson, P.-O., Drela, M., and Peraire, J., “Implicit Large Eddy Simulation of transition
to turbulence at low Reynolds numbers using a Discontinuous Galerkin method,” International
Journal for Numerical Methods in Engineering, Vol. 87, No. 1-5, 2011, pp. 232–261.
doi:10.1002/nme.3036.
[6] Beck, A. D., Bolemann, T., Flad, D., Frank, H., Gassner, G. J., Hindenlang, F., and Munz, C.-D., “High-
order discontinuous Galerkin spectral element methods for transitional and turbulent flow
simulations,” International Journal for Numerical Methods in Fluids, Vol. 76, No. 8, 2014, pp. 522–548.
doi:10.1002/fld.3943.
[7] Wang, Z., Li, Y., Jia, F., Laskowski, G. M., Kopriva, J., Paliath, U., and Bhaskaran, R., “Towards
industrial large eddy simulation using the FR/CPR method,” Computers & Fluids, Vol. 156, 2017.
[8] Jia, F., Wang, Z., Bhaskaran, R., Paliath, U., and Laskowski, G. M., “Accuracy, efficiency and
scalability of explicit and implicit FR/CPR schemes in large eddy simulation,” Computers & Fluids, Vol.
195, 2019, p. 104316. doi:https://doi.org/10.1016/j.compfluid.2019.104316,
URL http://www.sciencedirect.com/science/article/pii/S0045793019302762.
[9] Pazner, W., Franco, M., and Persson, P.-O., “High-order wall-resolved large eddy simulation of
transonic buffet on the OAT15A airfoil,” 2019 AIAA Aerospace Sciences Meeting, AIAA Paper No. 2019-1152, San Diego, CA, 2019.
[10] Loppi, N. A., Witherden, F. D., Jameson, A., and Vincent, P. E., “A high-order cross-platform incompressible Navier–Stokes solver via artificial compressibility with application to a turbulent jet,” Computer Physics Communications, Vol. 223, 2018, pp. 193–205.
[11] Meuer, H., Strohmaier, E., Dongarra, J., Simon, H., and Meuer, M., “TOP500 high performance computing list,” https://www.top500.org/lists/2019/11/, 2019.
[12] Vermeire, B., Witherden, F., and Vincent, P., “On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools,” Journal of Computational Physics, Vol. 334, 2017, pp. 497–521.
[13] Xu, C., Deng, X., Zhang, L., Rang, J., Wang, G., Jiang, Y., Cao, W., Che, Y., Wang, Y., Wang, Z., Liu, W.,
and Cheng, X., “Collaborating CPU and GPU for large-scale high-order CFD simulations with complex
grids on the TianHe-1A supercomputer,”Journal of Computational Physics, Vol. 278, 2014.
[14] Khajeh-Saeed, A., and Perot, J. B., “Direct numerical simulation of turbulence using GPU accelerated supercomputers,” Journal of Computational Physics, Vol. 235, 2013, pp. 241–257.
doi:https://doi.org/10.1016/j.jcp.2012.10.050,
URL http://www.sciencedirect.com/science/article/pii/S0021999112006547.
[15] Romero, J., Crabill, J., Watkins, J., Witherden, F., and Jameson, A., “ZEFR: A GPU-accelerated high-
order solver for compressible viscous flows using the flux reconstruction method,” Computer Physics
Communications, Vol. 250, 2020, p. 107169. doi:https://doi.org/10.1016/j.cpc.2020.107169.
[16] Zolfaghari, H., Becsek, B., Nestola, M. G., Sawyer, W. B., Krause, R., and Obrist, D., “High-order
accurate simulation of incompressible turbulent flows on many parallel GPUs of a hybrid-node
supercomputer,” Computer Physics Communications, Vol. 244, 2019, pp. 132–142.
doi:https://doi.org/10.1016/j.cpc.2019.06.012, URL http://www.sciencedirect.com/
science/article/pii/S0010465519301997.
[17] Oyarzun, G., Chalmoukis, I. A., Leftheriotis, G. A., and Dimas, A. A., “A GPU-based algorithm for
efficient LES of high Reynolds number flows in heterogeneous CPU/GPU supercomputers,” Applied
Mathematical Modelling, Vol. 85, 2020, pp. 141–156.
doi:https://doi.org/10.1016/j.apm.2020.04.010.
[18] Aissa, M., Verstraete, T., and Vuik, C., “Toward a GPU-aware comparison of explicit and implicit
CFD simulations on structured meshes,” Computers & Mathematics with Applications, Vol. 74, No. 1,
2017, pp. 201–217. doi:https://doi.org/10.1016/j.camwa.2017.03.003,
URL http://www.sciencedirect.com/science/article/pii/S0898122117301438, 5th European
Seminar on Computing ESCO 2016.
[19] Jourdan, E., and Wang, Z. J., “Efficient Implementation of the FR/CPR Method on GPU Clusters for
Industrial Large Eddy Simulation,” AIAA Aviation 2020 Forum, Reno, NV, 2020.
[20] Franco, M., Camier, J.-S., Andrej, J., and Pazner, W., “High-order matrix-free incompressible flow
solvers with GPU acceleration and low-order refined preconditioners,” Computers & Fluids, Vol. 203,
2020, p. 104541. doi:https://doi.org/10.1016/j.compfluid.2020.104541.
[21] May, G., and Jameson, A., “Efficient Relaxation Methods for High-Order Discretization of Steady
Problems,” Advances in Computational Fluid Dynamics, Vol. 2, 2011.
doi:10.1142/9789814313193_0013.
[22] Knoll, D., and Keyes, D., “Jacobian-free NewtonKrylov methods: a survey of approaches and
applications,” Journal of Computational Physics, Vol. 193, No. 2, 2004, pp. 357–397. doi:
https://doi.org/10.1016/j.jcp.2003.08.010, URL
http://www.sciencedirect.com/science/article/pii/S0021999103004340.
[23] Saad, Y., and Schultz, M. H., “GMRES: A Generalized Minimal Residual Algorithm for Solving
Nonsymmetric Linear Systems,” SIAM Journal on Scientific and Statistical Computing, Vol. 7, No. 3,
1986, pp. 856–869. doi:10.1137/0907058, URL https://doi.org/10.1137/0907058.
[24] Wang, Z. J., and Gao, H., “A unifying lifting collocation penalty formulation including the
discontinuous Galerkin, spectral volume/difference methods for conservation laws on mixed grids,”
Journal of Computational Physics, Vol. 228, No. 21, 2009, pp. 8161–8186.
[25] Huynh, H. T.,Wang, Z. J., and Vincent, P. E., “High-order methods for computational fluid
dynamics: a brief review of compact differential formulations on unstructured grids,” Computers &
Fluids, Vol. 98, 2014, pp. 209–220.
[26] Wang, Z. J., “A perspective on high-order methods in computational fluid dynamics,” Science
China Physics, Mechanics & Astronomy, Vol. 59, No. 1, 2016, p. 614701.
[27] Vincent, P. E., Castonguay, P., and Jameson, A., “A new class of high-order energy stable flux
reconstruction schemes,” Journal of Scientific Computing, Vol. 47, No. 1, 2011, pp. 50–72.
[28] Park, J. S., You, H., and Kim, C., “Higher-order multi-dimensional limiting process for DG and
FR/CPR methods on tetrahedral meshes,” Computers & Fluids, Vol. 154, 2017, pp. 322–334.
[29] Yu, M., Wang, Z. J., and Liu, Y., “A numerical method for solving incompressible viscous flows
problems,” Journal of Computational Physics, Vol. 259, 2014, pp. 70–95.
[30] Roe, P. L., “Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes,” Journal
of Computational Physics, Vol. 43, No. 2, 1981, pp. 357–372.
[31] Bassi, F., and Rebay, S., “A high order discontinuous Galerkin method for compressible turbulent
flows.” Discontinuous Galerkin Methods. Theory, Computation and Applications, edited by B. Cockburn,
G. Karniadakis, and C. Shu, Springer, 2000, pp. 77–88.
[32] Saad, Y., and van der Vorst, H. A., “Iterative solution of linear systems in the 20th century,” Journal
of Computational and Applied Mathematics, Vol. 123, No. 1, 2000, pp. 1–33. doi:
https://doi.org/10.1016/S0377-0427(00)00412-X, URL
http://www.sciencedirect.com/science/article/pii/S037704270000412X,
numerical Analysis 2000. Vol. III: Linear Algebra.
[33] Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed., Society for Industrial and Applied
Mathematics, 2003.
doi:10.1137/1.9780898718003, URL https://epubs.siam.org/doi/abs/10.1137/1.9780898718003.
[34] Yoon, S., and Jameson, A., “Lower-upper Symmetric-Gauss-Seidel method for the Euler and
Navier-Stokes equations,” AIAA Journal, Vol. 26, No. 9, 1988, pp. 1025–1026. doi:10.2514/3.10007.
[35] Chen, R. F., and Wang, Z. J., “Fast, block lower-upper symmetric Gauss-Seidel scheme for
arbitrary grids.” AIAA Journal, Vol. 38, No. 12, 2000, pp. 2038, 2245.
[36] Gottlieb, S., and Shu, C. W., “Total variation diminishing Runge-Kutta schemes,” Mathematics of
Computation of the American Mathematical Society, Vol. 67, No. 221, 1998, pp. 73–85.
[37] Vincent, P.,Wang, Z. J., Ekaterinaris, J., Huynh, H. T., Kroll, N., and Hillewaert, K., “4th International
Workshop on High-Order CFD Methods,” https://how4.cenearo.be, 2016.
URL https://how4.cenearo.be.
[38] Alhawwary, M., and Wang, Z. J., “On the mesh resolution of industrial LES based on the DNS of
flow over the T106C turbine,” Advances in Aerodynamics, Vol. 1, No. 21, 2019. doi:10.1186/s42774-
019-0023-6.
[39] Alhawwary, M., “A C++ toolbox for computing Discrete and Fast Fourier Transforms (DFT,FFT),
Power Spectral Density (PSD) estimates, and the sound pressure level (SPL) in (dB).”
https://github.com/mhawwary/FFTpsd, 2019. URL https://github.com/mhawwary/FFTpsd.
4 CFD and HPC Trends Forecasted for 2030
CFD codes utilize High Performance Computing (HPC) systems, so understanding where HPC technology might be in the 2030 timeframe is an important component of creating a vision for CFD codes in 2030 [18]. Of course, forecasting where HPC technologies will be in the future requires a significant amount of extrapolation, which is especially hard in such a fast-changing area as HPC. The fastest current systems can perform tens of petaFLOPS [19] (1 petaFLOPS is 10^15 floating-point operations per second), and the HPC community is working toward systems capable of 10^18 FLOPS (exaFLOPS), which are expected sometime between 2018 and 2023. Some work is even looking at 10^21 FLOPS (zettaFLOPS). However, reaching that level of performance is unlikely without radically new technologies. A common, though controversial, measure of HPC systems is the total number of floating-point operations a given system can perform in a second while solving a large linear system of equations using Gaussian elimination; this is the High Performance LINPACK (HPL) benchmark. Twice a year, a list of the top 500 systems in the world, ranked by this benchmark, is published by the Top500 organization. The current list (June 2013) is topped by the Tianhe-2 system, developed by China's National University of Defense Technology, which achieved 33.86 petaFLOPS on the LINPACK benchmark. Here, we will estimate only the peak floating-point performance in terms of the maximum number of operations that can be performed per second. We note that the performance of many applications, including CFD applications, may be more accurately estimated by using sustained memory bandwidth; for present purposes, however, peak performance is a sufficient guide, provided that other aspects of system performance remain the same.
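For orientation, peak floating-point performance is normally estimated by multiplying together the parallel resources of a system,

\[
R_{\text{peak}} = N_{\text{nodes}} \times N_{\text{sockets}} \times N_{\text{cores}} \times f_{\text{clock}} \times N_{\text{FLOP/cycle}}.
\]

As a purely hypothetical example (the numbers are ours, chosen only to show the form of the estimate), a 1,000-node system with 2 sockets per node, 12 cores per socket, a 2.5 GHz clock and 16 double-precision FLOPs per core per cycle would have a peak of about 0.96 petaFLOPS, while the sustained performance of memory-bandwidth-bound CFD codes is typically a small fraction of that peak.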
4.1 Relationship Between Semiconductors, SMT, and Microelectronics
Today’s mainstream electronics manufacturing consists mainly of semiconductor packages and
surface mount technology processes [Powell] [20]. Together, the two make up the vast majority of
readily acknowledged devices from cell phones to PCs, tablets, and laptops. Even smart high-end
toasters leverage these two technologies. The resulting products also end up in automobiles, stereos,
TVs, and remote controls. A lesser known technology, however, is working its way into both of these
markets microelectronics. Below are some abbreviated definitions, meant to gain some common
understanding before diving in further:
SMT (Surface Mount Technology): examples include a motherboard in a PC, or an FR4 board with integrated circuits (ICs), resistors, and capacitors reflow-soldered to make a completed electronic product. The number of products built with this technology is staggeringly large.
Semiconductor: consider lead frame packages with a single IC that is die-attached, wire-bonded, and over-molded. These packages are made by the millions, can be surface mounted, and soldered into through-hole substrates/FR4 boards. The number of products built with the technology is staggeringly large.
Hybrid Microelectronics: multiple ICs/packages incorporated into a space-saving package. This can include SMT and semiconductor technologies and can be very complex, with mixed technologies that allow for powerful and/or small final products. The number of products built with this technology is small compared to SMT and semiconductor packaging, but it is a growing market.
[18] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD Vision 2030 Study: A Path to Revolutionary Computational Aerosciences”, NASA/CR-2014-218178.
[19] Kraft, E. M., “Integrating Computational Science and Engineering with Testing to Re-engineer the Aeronautical Development Process”, AIAA Paper 2010-0139, 48th AIAA Aerospace Sciences Meeting, January 2010, 10.2514/6.2010-139.
[20] PTI Blog, posted by Janine Powell on Tue, Aug 20, 2013.
While SMT technology generally incorporates semiconductor products, semiconductor products do
not typically include SMT components. Microelectronics can incorporate both, and the result is
smaller and usually faster final products.
Through-hole technologies are becoming rare. Microelectronics is a growing packaging method and is working its way into both of these mainstream markets, but in a different way. Both semiconductor and microelectronic packages can utilize the same (or similar) manufacturing techniques. Both can mount bare die with conductive or non-conductive epoxies, or eutectic solders. Both generally employ traditional wire ball bonding for the first-level interconnects. These markets are distinct, but they do mix. Just as SMT technologies incorporate semiconductor packages but rarely the other way around, microelectronics can incorporate both semiconductor and SMT technologies to create complex packages and products (see Figure 4.1.1).
4.2 Comparison of Semiconductor Fabrication Sizes in HPC
A significant measure of a processor is the feature size of its components. The smaller the features,
the more elements can be placed in the same area, and hence the more powerful a processor becomes.
Feature size also has a direct impact on power consumption, and heat generation, with smaller sizes
being better. Thus, forecasting feature sizes of future processors is very important. Unfortunately,
the industry has not always been good at that forecasting, which is one reason why predicting where HPC technology will be in 2030 is particularly hard. For example, in 2005, the International Technology Roadmap for Semiconductors (ITRS) forecasted a 22-nm (1 nm = 10^-9 m) gate length by 2008; that is, the structures in a modern processor were forecast to have features with sizes around 22 nm. However, in 2008 the forecast date moved to 2011, and in 2011 it moved again to 2012. A similar slip occurred for other (smaller) gate lengths (see Figure 4.2.1). Note that the forecasts of the ITRS combine inputs from all major chip manufacturers, equipment suppliers, and research communities, so they represent the combined wisdom of the industry. Nevertheless, as Figure 4.2.1 shows, forecasting a key feature of even this basic component of processors is hard.

Figure 4.1.1 Relationship between Microelectronics, SMT and Semiconductors

Figure 4.2.1 Changing Predictions About Semiconductor Sizes

Another critical component of HPC capability in 2030 is the
advances in software infrastructure and programming methodologies that will be necessary to take
advantage of these future HPC systems. The ultimate purpose for these systems is to solve the most
pressing problems in academia and industry. In particular, industrial users pursue this technology
because of the large impact on future product designs, and the ability to avoid or minimize the use of
other, more costly methods such as wind tunnels or other types of physical tests.
4.3 Current Status of CFD
At present, CFD is used extensively in the aerospace industry for the design and analysis of air and
space vehicles and components. However, the penetration of CFD into aerospace design processes is
not uniform across vehicle types, flight conditions, or across components. CFD often plays a
complementary role to wind tunnel and rig tests, engine certification tests, and flight tests by
reducing the number of test entries and/or reducing testing hours [21, 22]. But, in many circumstances,
CFD provides the only affordable or available source of engineering data to use in product design due
to limitations either with model complexity and/or wind tunnel capability, or due to design
requirements that cannot be addressed with ground-based testing of any kind. As a result, CFD
technology development has been critical in not only minimizing product design costs, but also in
enabling the design of truly novel platforms and systems. Generally, the design process is composed
of three key phases: conceptual design, preliminary and detailed design, and product validation. The
current usage of CFD tools and processes in all three design phases is summarized below.
4.3.1 Conceptual Design
CFD is often used in the early, conceptual design of products where it has been both previously
calibrated for similar applications using data-morphing techniques, as well as for new configurations
where little or no engineering data is available to guide design decisions. Simplified models are
typically used during the conceptual optimization phase to allow reasonably accurate trades to be
made between drag, fuel consumption, weight, payload/range, thrust, or other performance
measures. Use of simplified models is necessary to allow often time consuming optimization
processes to be used in the overall design effort, but inherently places conservatism into the final
design. This conservatism derives from the use of models that are too similar within the existing
product design space, other geometric simplifications, or the use of low-fidelity CFD tools that trade
off flow physics modeling accuracy for execution speed.
4.3.2 Preliminary/Detailed Design
Once a product development program is launched, CFD is a necessary and uniformly present tool in
the detailed configuration design process. For example, CFD is indispensable in the design of cruise
wings in the presence of nacelles for commercial airplanes, and for inlet and nozzle designs; wind
tunnels are used to confirm the final designs [23, 24]. In both military and commercial aircraft design
processes, CFD is the primary source of data for aircraft load distributions and ground effect
estimations. Similarly, gas turbine engine manufacturers rely on CFD to predict component design
performance, having reduced the number of single-component rigs substantially as CFD capability
has become more mature. Increasingly, multi-component and multi-physics simulations are
performed during the design cycle, but the long clock times often associated with these processes
restrict their widespread adoption.

[21] Jameson, A., “Re-engineering the Design Process Through Computation”, AIAA Journal of Aircraft, 1999.
[22] Goldhammer, M. I., “Boeing 787 – Design for Optimal Airplane Performance”, CEAS/KATnet Conference on Key Aerodynamic Technologies, Bremen, Germany, June 2005.
[23] Malik, M. R. and Bushnell, D. M. (eds.), “Role of Computational Fluid Dynamics and Wind Tunnels in Aeronautics R&D”, NASA TP-2012-217602, September 2012.
[24] Goldhammer, M. I., “Boeing 787 – Design for Optimal Airplane Performance”, CEAS/KATnet Conference on Key Aerodynamic Technologies, Bremen, Germany, June 2005.

For space exploration, CFD is often used to gain important
insight into flow physics used to properly locate external components on the surface of launch
vehicles or spacecraft. CFD is also increasingly providing substantial portions of the aero and
propulsion performance database. In many cases, wind tunnel data is used only to anchor the CFD
data at a few test points to provide confidence in the CFD database. CFD is the primary source of data
for the hypersonic flight regime when ground testing is limited or does not exist.
4.3.3 Product Validation and Certification
As the product development process moves into the validation phase and certification testing, CFD is
often used to confirm performance test results, assess the redesign of components that show
potential for improved performance, and to answer any other questions that arise during product
testing. Typically, product configurations evolve over the testing period based on a combination of
measured results and engineering judgment bolstered by the best simulation capability available. In
general, CFD modeling capability grows to capture the required scope and physics to answer the
questions raised during testing. The expense of responding to often unplanned technical surprises, which results in more time on the test stand or in flight test, or changes in hardware, drives conservatism into aerospace designs and is a significant motivation for improving the accuracy and speed of CFD. If CFD is sufficiently accurate and fast, engineers can move from their traditional design space with greater confidence and less potential risk during testing. For each of these design phases, the performance of CFD is of critical importance.
4.3.4 CFD usage of High Performance Computing (HPC)
The effectiveness and impact of CFD on the design and analysis of aerospace products and systems
is largely driven by the power and availability of modern HPC systems. During the last decades, CFD
codes were formulated using message passing (e.g., MPI) and thread (e.g., OpenMP) software models
for expressing parallelism to run as efficiently as possible on current generation systems. However,
with the emergence of truly hierarchical memory architectures having numerous graphical
processing units (GPUs), new CFD algorithms may need to be developed to realize the potential
performance offered by such systems. Government labs, such as Oak Ridge National Lab (ORNL),
Argonne National Lab (ANL), and the NASA Advanced Supercomputing (NAS) facility at NASA
Ames research center, have often led the way with the acquisition and testing of new hardware. Much
research on testing and tailoring of CFD algorithms takes place on these platforms with heavy
participation from academia, national labs and to some extent industry as well. Government
computing resources are also used to tackle large-scale calculations of challenge problems, such as
the detailed direct numerical simulation (DNS) of spray injector atomization or high fidelity
simulations of transition and turbulent separation in turbomachinery. However, because of the high
cost of these leadership-class systems, industry and academia often purchase smaller commodity
clusters utilizing similar types of processors when the latest hardware technology is fully
demonstrated on CFD problems and other important applications.
4.3.5 Turbulence Modeling
Current practices for CFD-based workflows utilize steady Reynolds-averaged Navier-Stokes (RANS) with 1- or 2-equation turbulence models [25, 26], although hybrid unsteady RANS/LES methods are
increasingly common for certain classes of simulations in which swirling and intentionally separated flows are dominant, such as combustors. Techniques to combine near-wall RANS regions and outer flow field large-eddy simulation (LES) regions in these hybrid methods are immature. Many CFD design processes include an estimation of boundary layer transition, using a range of models, from purely empirical to coupled partial-differential equation (PDE) solutions of stability equations [27, 28]. Both approaches involve much empiricism, may be missing some modes of transition, and are evolving. As a result, no generalized transition prediction capability is in widespread use in Navier-Stokes CFD, and the default practice is to run the codes “fully turbulent”. Steady-state CFD accounts for the vast majority of simulations, while unsteady flow predictions are inherently more expensive and not yet uniformly routine in the design process, with some exceptions.

[25] Spalart, P. R. and Allmaras, S. R., “A One-Equation Turbulence Model for Aerodynamic Flows”, La Recherche Aerospatiale, No. 1, 1994, pp. 5-21.
[26] Wilcox, D. C., Turbulence Modeling for CFD, DCW Industries, 3rd edition, November 2006.
4.3.6 Process Automation
Current CFD workflows are often paced by the geometry preprocessing and grid generation phases,
which are significant bottlenecks. In some cases, where the design effort involves components of
similar configurations, specialized, automated processes are built that considerably reduce set-up
time, execution of the CFD solver, and post-processing of results. This process-to-production
capability of the CFD workflow only occurs in areas where the design work is routine and the
investment in automation makes business sense; single prototype designs and novel configurations
continue to suffer the pacing limits of human-in-the-loop workflows because the payoff for
automating is not evident. This issue is not unique to the aerospace industry.
4.3.7 Solution Uncertainty and Robustness
In practice, CFD workflows contain considerable uncertainty that is often not quantified. Numerical
uncertainties in the results come from many sources, including approximations to geometry, grid
resolution, problem setup including flow modeling and boundary conditions, and residual
convergence. Although NASA and professional organizations such as ASME and AIAA have created
standards for the verification and validation of CFD and heat transfer analyses, such techniques are
not widely used in the aerospace industry. With a few notable exceptions, CFD is carried out on fixed
grids that are generated using the best available practices to capture expected flow features, such as
attached boundary layers [29]. Such approaches cannot reliably provide adequate resolution for flow
features when locations are not known a priori, such as shocks, shear layers, and wakes. Although
grid refinement is often seen as a solution to addressing grid resolution issues, it is seldom done in
practice because uniform refinement is impractical in 3D. Adaptive mesh refinement strategies offer
the potential for superior accuracy at reduced cost, but have not seen widespread use due to
robustness, error estimation, and software complexity issues. Achieving consistent and reliable flow
solver or residual convergence remains problematic in many industrial cases. Although many CFD
codes are able to demonstrate convergence for a few simple problems, for many flows involving
difficult flow physics or complex geometries such as an aircraft in high-lift configuration, many
of the current solver techniques are not strong enough to ensure robust convergence.
Engineering judgment is required to interpret results that are not well converged, which introduces
conservatism into decision making. Furthermore, the use of steady-state flow solvers itself is in
question for many flows of engineering interest.
4.3.8 Multidisciplinary Analysis and Optimization (MDAO)
Although the basic concepts of MDAO are fairly well accepted in the community, the routine use of
MDAO methods is not, by any means, universal. At moderate levels of fidelity, it is common industrial
practice to perform coupled multidisciplinary analyses (MDA) of the most tightly integrated
disciplines in a design. Aero-structural analyses, conjugate heat transfer calculations, and aero-acoustic simulations all tend to take place in aircraft, spacecraft, jet engine, and rotorcraft analysis and design processes. High fidelity CFD is not routinely used in such MDAs, although recent years have witnessed a significant rise in the coupling of state-of-the-art CFD with additional disciplines. While frameworks for the coupling of disciplinary analyses are widely available, the ability to couple CFD with other high fidelity descriptions of participating disciplines is limited by the availability of coupling software and, more fundamentally, by a lack of general methodologies for accurate, stable, and conservative MDAs. The application of optimization techniques in industry is mostly limited to single-discipline simulations [30, 31].

[27] Stock, H. W., and Haase, W., “Navier-Stokes Airfoil Computations with eN Transition Prediction Including Transitional Flow Regions”, AIAA Journal, Vol. 38, No. 11, pp. 2059–2066, 2006, 10.2514/2.893.
[28] Langtry, R. B., Menter, F. R., “Correlation-Based Transition Modeling for Unstructured Parallelized Computational Fluid Dynamics Codes”, AIAA Journal, Vol. 47, pp. 2894-2906, 2009, 10.2514/1.42362.
[29] Mavriplis, D. J., Vassberg, J., Tinoco, E., Mani, M., Brodersen, O., Eisfeld, B., Wahls, R., Morrison, J., Zickuhr, T., Levy, D., and Murayama, M., “Grid Quality and Resolution Issues from the Drag Prediction Workshop Series”, AIAA Journal of Aircraft, Vol. 46, No. 3, pp. 935-950, March 2009.
Although conceptual design tools have long benefited from multidisciplinary optimization (MDO)
approaches, high fidelity CFD-based optimizations are still rare. During the past decade, the
development of advanced surrogate modeling techniques and the introduction of adjoint-based
optimal shape design techniques have enabled the use of CFD in aerodynamic optimization of aircraft
and gas turbine components. However, the use of optimization with multiple disciplines treated
using high-fidelity methods is still within the realm of advanced research and is by no means a routine
practice.
4.4 Vision of CFD in 2030 as anticipated by NASA
This is in fact a mirror image of the report done by USDOE, which will be covered later, but with
emphasis on CFD. Given the inherent difficulties of long-term predictions, our vision for CFD in 2030
is grounded on a desired set of capabilities that must be present for a radical improvement in CFD
predictions. Of special interests are critical flow phenomena associated with the key aerospace
application, including commercial/military aircraft, engine propulsion, rotorcraft, space exploration,
launch vehicle programs, air-breathing space-access, and spacecraft entry [32]. This set of capabilities
includes not only the accurate and efficient prediction of fluid flows of interest, but also the usability
of CFD in broader contexts (including uncertainty quantification, optimization, and multidisciplinary
applications) and in streamlined/automated industrial analysis and design processes. To complicate
things further, CFD in 2030 must effectively leverage the uncertain and evolving environment of HPC
platforms that, together with algorithmic improvements, will be responsible for a large portion of the
realized improvements. The basic set of capabilities for CFD must include, at a minimum:
Emphasis on physics-based, predictive modeling. In particular, transition, turbulence,
separation, chemically reacting flows, radiation, heat transfer, and constitutive models must
reflect the underlying physics more closely than ever before.
Management of errors and uncertainties resulting from all possible sources:
1. Physical modeling errors and uncertainties,
2. Numerical errors arising from mesh and discretization inadequacies, and
3. Uncertainties derived from natural variability, as well as epistemic uncertainties due
to lack of knowledge in the parameters of a particular fluid flow problem.
A much higher degree of automation in all steps of the analysis process is needed
including geometry creation, mesh generation and adaptation, the creation of large databases
of simulation results, the extraction and understanding of the vast amounts of information
generated, and the ability to computationally steer the process. Inherent to all these
improvements is the requirement that every step of the solution chain executes with high levels of reliability/robustness to minimize user intervention.
[30] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD Vision 2030 Study: A Path to Revolutionary Computational Aerosciences”, NASA/CR-2014-218178.
[31] Same as above.
[32] Same as above.
Ability to effectively utilize massively parallel, heterogeneous, and fault-tolerant HPC
architecture. For complex physical models with nonlocal interactions, the challenges of
mapping the underlying algorithms onto computers with multiple memory hierarchies,
latencies, and bandwidths must be overcome.
Flexibility to tackle capability- and capacity-computing tasks in both industrial and research environments, so that both individual large-scale simulations and very large ensembles of reasonably-sized solutions (such as those required to populate full-flight envelopes, operating maps, or for parameter studies and design optimization) can be accommodated.
Seamless integration with the multidisciplinary analyses that will be the norm in 2030, without sacrificing accuracy or numerical stability of the resulting coupled simulation, and without requiring so large an amount of effort that only a handful of coupled simulations are possible. Included in this desired set of capabilities is a vision of the interaction between the
engineer/scientist, the CFD software itself, its framework and all the ancillary software
dependencies (databases, modules, visualization, etc.), and the associated HPC environment.
A single engineer/scientist must be able to conceive, create, analyze, and interpret a large
ensemble of related simulations in a time-critical period (e.g., 24 hours), without individually
managing each simulation, to a pre-specified level of accuracy. There should be less emphasis
on the mechanics of running and collecting the information, and more emphasis on
interpreting and understanding the results of the work. At the moment, CFD is not yet
sufficiently predictive and automated to be used in critical/relevant engineering decisions by
the non-expert user, particularly in situations where separated flows are present [33].
Finally, we define a set of Grand Challenge (GC) problems that are bold and in fact may not
be solvable in the 2030 timeframe, but are used as drivers to identify critical technologies in
need of investment, and to serve as benchmarks for continually measuring progress toward
the long term development goals. These GC problems are chosen to embody the requirements
for CFD in 2030, and cover all important application areas of relevance to NASA’s aeronautics
mission, as well as important aspects of NASA’s space exploration mission [34]. They are:
1. LES of aircraft configuration across the full flight envelope.
2. Off-design turbofan engine transient simulation.
3. MDAO of a highly flexible advanced aircraft configuration.
4.4.1 Technology Roadmap to Achieve the Grand Challenge (GC) Problems
The CFD technology roadmap is a complete and concise view of the key research technologies and
capabilities that must be developed and integrated into production CFD. The individual elements on
the roadmap were identified based on the results of the CFD user survey, detailed technical
discussions held during the Vision 2030 CFD workshop, and from interactions among our team
members. Key technology milestones, proposed technology demonstrations, and critical decision
gates are positioned along timelines, which extend to the year 2030. Separate timelines are identified
for each of the major CFD technology elements that comprise the overall CFD process. The key
milestones indicate important advances in CFD technologies or capabilities that are needed within
each technology element. Technology demonstrations are identified to help verify and validate when
technology advances are accomplished, as well as to validate advances toward the simulations of the
GC problems identified above. Specific details of the development plan for each technology element
are given below.
[33] Slotnick, J., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D., “CFD Vision 2030 Study: A Path to Revolutionary Computational Aerosciences”, NASA/CR-2014-218178.
[34] See previous.
4.4.1.1 High Performance Computing (HPC)
As mentioned previously, advances in HPC hardware systems and related computer software are
critically important to the advancement of the state of the art in CFD simulation, particularly for high
Reynolds number turbulent flow simulations. Based on feedback from the user community survey, we
envision HPC technology advancing along two separate paths. Ongoing development of exa-scale
systems, as mentioned earlier, will continue through 2030, and represents the technology that will
most likely provide the large increase in throughput for CFD simulation in the future [35]. However,
novel technologies, such as quantum computing or molecular computing, offer a true paradigm shift
in computing potential and must be carefully considered at strategic points in the overall
development plan, even though the technology is at a very low level today.
In order to properly address the HPC challenge, three specific thrusts must be supported. Firstly,
current simulation software must be ported to evolving and emerging HPC architectures with a view
toward efficiency and software maintainability. Secondly, investments must be made in the
development of new algorithms, discretizations, and solvers that are well suited for the massive levels of parallelism [36, 37]. Finally, increased access to the latest large-scale computer hardware must be
provided and maintained, not only for production runs, but also for algorithmic research and
software development projects, which will be critical for the design and validation of new simulation
tools and techniques [38]. We propose several key milestones that benchmark the advances that we
seek: modification of NASA and related CFD codes to efficiently execute on hierarchical memory
(GPU/co-processor) systems by 2020, initial evaluation of exa-scale performance on a representative
CFD problem, and a demonstration of 30 exa-FLOP performance for one or more of the proposed GC
problems in the 2030 time frame.
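To make the first thrust concrete, the following is a minimal sketch, not taken from any NASA code, of the kind of loop-level porting to hierarchical-memory (GPU/co-processor) systems that the roadmap calls for. It assumes a C compiler with OpenMP 4.5+ target-offload support; the kernel, array names, and sizes are illustrative only, and the same loop falls back to host execution if no accelerator is present.

#include <stdlib.h>

/* Minimal sketch: offloading a cell-wise solution update to an accelerator
 * with OpenMP target directives. The kernel, array names, and sizes are
 * illustrative only, not taken from any specific production CFD code. */
void update_solution(double *q, const double *res, const double *dt,
                     double vol_inv, long ncells)
{
    /* Map the arrays to device memory; on a GPU/co-processor system this is
     * where the hierarchical-memory traffic occurs. */
    #pragma omp target teams distribute parallel for \
            map(tofrom: q[0:ncells]) map(to: res[0:ncells], dt[0:ncells])
    for (long i = 0; i < ncells; ++i)
        q[i] -= dt[i] * vol_inv * res[i];
}

int main(void)
{
    long n = 1 << 20;
    double *q  = malloc(n * sizeof *q);
    double *r  = malloc(n * sizeof *r);
    double *dt = malloc(n * sizeof *dt);
    for (long i = 0; i < n; ++i) { q[i] = 1.0; r[i] = 0.1; dt[i] = 1e-3; }

    update_solution(q, r, dt, 1.0, n);

    free(q); free(r); free(dt);
    return 0;
}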
Concurrently, we stress the importance of closely observing advances in revolutionary HPC technologies, such as superconducting logic, new memory technologies, and other alternatives to current device technology. Because these technologies are in their infancy, we foresee decision gates in 2020, 2025, and 2030 to establish the ability of these systems to solve a relevant model problem. Implicit in this strategy is the need to provide access to experimental hardware on a continual basis and to explore radical new approaches to devising CFD simulation capabilities. If, at any of these decision points, the technology clearly shows its expected potential, we recommend increased investment to accelerate the use of these machines for CFD applications.
4.4.1.2 Physical Modeling
Advances in the physical modeling of turbulence for separated flows, transition, and combustion are
critically needed to achieve the desired state of CFD. For the advancement of turbulent flow
simulation, we propose three separate tracks for research: RANS-based turbulence treatments; hybrid RANS/LES approaches, where the boundary layer is modeled with RANS-based models and the outer flow is resolved with LES; and LES itself, in both wall-modeled and wall-resolved form. Details on each of the three development tracks, and for transition and combustion modeling, are given below. Additionally, a longer term, high-risk effort should investigate radically new approaches to physical modeling.
RANS-based turbulence models continue to be the standard approach used to predict a wide range
35 Kogge, P. (Ed.), "Exa-scale Computing Study: Technology Challenges in Achieving Exa-scale Systems", Contractor report for AFRL Contract No. FA8650-07-C-7724, September 2008.
36 Mavriplis, D., Darmofal, D., Keyes, D., and Turner, M., "Petaflops Opportunities for the NASA Fundamental Aeronautics Program", AIAA Paper 2007-4084, 18th AIAA Computational Fluid Dynamics Conference, June 2007, 10.2514/6.2007-4084.
37 Sarkar, V. (Ed.), "Exa-scale Software Study: Software Challenges in Extreme Scale Systems", DARPA, IPTO, AFRL report under contract FA8650-07-C-7724, September 2009.
38 Biswas, R., Aftosmis, M. J., Kiris, C., and Shen, B. W., "Petascale Computing: Impact on Future NASA Missions", Petascale Computing: Architectures and Algorithms (D. Bader, ed.), Chapman and Hall / CRC Press, 2007.
of flows for very complex configurations across virtually all aerospace product categories. They are
easy to use, computationally efficient, and generally able to capture wall-bounded flows, flows with
shear, flows with streamline curvature and rotation, and flows with mild separation. For these
reasons, as well as the fact that RANS models will remain as an important component in hybrid
RANS/LES methods, their use will continue through 2030. An advanced formulation of the RANS-based approach, where the eddy viscosity formulation is replaced with direct modeling of the Reynolds stresses, known as the Reynolds Stress Transport method, in principle will be able to capture the onset and extent of flow separation for a wider range of flows [39].
Currently, hybrid RANS/LES methods show perhaps the most promise in being able to capture more of the relevant flow physics for complex geometries at an increasingly reasonable computational cost [40]. In the user survey, the majority of participants ranked the continued development of hybrid RANS/LES methods as the top priority in the area of turbulence modeling. However, as mentioned previously, several issues still exist. First, the prediction of any separation in the boundary layer will still require improvements in RANS-based methods. Second, a seamless, automatic RANS-to-LES transition in the boundary layer is needed to enhance the robustness of these methods.
Continued investment in hybrid RANS/LES methods to address these two critical shortcomings will
be required. Additionally, more effective discretizations and solvers designed specifically for LES-type
problems must be sought. When combined with advances in HPC hardware, these three
developments will enable continued reduction in the RANS region as larger resolved LES regions
become more feasible. It is fully anticipated that hybrid RANS/LES methods will become viable in
production mode by the 2030 timeframe for problems typical of the proposed GCs.
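As one concrete example of the RANS-to-LES interface issue, the original detached-eddy simulation (DES) formulation, built on the Spalart-Allmaras model, switches between the two treatments through a single modified length scale:

\tilde{d} = \min\left(d,\; C_{DES}\,\Delta\right), \qquad \Delta = \max(\Delta x,\, \Delta y,\, \Delta z), \qquad C_{DES} \approx 0.65,

so the model behaves as standard RANS near the wall (where d < C_DES Δ) and as an LES subgrid model away from it. The grid-dependent location of this switch is precisely what the seamless, automatic RANS-to-LES transition sought above is meant to improve.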
4.4.1.3 Numerical Algorithms
The development of novel numerical algorithms will be critical to achieving the stated CFD 2030
goals. Indeed, the proposed GCs are sufficiently ambitious that advances in HPC hardware alone
during the next 20 years will not be sufficient to achieve these goals. As demonstrated in Case Study
2, even for LES of relatively simple geometries, leadership class HPC hardware in 2030 will be needed
for 24-hour turnaround if existing algorithms are used. Thus, to tackle the proposed GCs, orders-of-magnitude improvements in simulation capability must be sought from advances in numerical algorithms [41]. The focus of investment must be on discretizations and solvers that scale to massive levels of parallelism, that are well suited for the high-latency, deep memory hierarchies anticipated in future HPC hardware, and that are robust and fault tolerant. A well-balanced research program must provide for incremental advances of current techniques (e.g., extending the scalability of current CFD methods to the exa-scale level whenever possible), while at the same time investing in the fundamental areas of applied mathematics and computer science to develop new approaches with better asymptotic behavior for large-scale problems and better suitability for emerging HPC hardware.
Discretization techniques such as higher-order accurate methods offer the potential for better accuracy and scalability, although robustness and cost considerations remain [42]. Investment must focus on removing these barriers in order to unlock the superior asymptotic properties of these methods, while at the same time pursuing evolutionary improvements in other areas such as
39 Eisfeld, B., "Reynolds Stress Modeling for Complex Aerodynamic Flows", presented at the European Conference on Computational Fluid Dynamics, ECCOMAS CFD 2010, Lisbon, Portugal, June 14-17, 2010.
40 Song, F., Haase, W., Peng, S-H., and Schwamborn, D. (Eds.), Progress in Hybrid RANS-LES Modeling, Springer Press, ISBN 978-3-642-31817-7, Sept. 2011.
41 Mavriplis, D., Darmofal, D., Keyes, D., and Turner, M., "Petaflops Opportunities for the NASA Fundamental Aeronautics Program", AIAA Paper 2007-4084, 18th AIAA Computational Fluid Dynamics Conference, 2007.
42 Kroll, N., Bieler, H., Deconinck, H., Couallier, V., van der Ven, H., and Sorensen, K. (Eds.), "ADIGMA - A European Initiative on the Development of Adaptive High-Order Variational Methods for Aerospace Applications", Notes on Numerical Fluid Mechanics and Multidisciplinary Design, Vol. 11, Springer, 2010.
low-dissipation schemes, flux functions, and limiter formulations. Simultaneously, novel nontraditional
approaches, such as Lattice-Boltzmann methods or other undeveloped schemes, should be
investigated for special applications. Improved linear and nonlinear solvers must be developed, and
here as well, the focus must be on highly scalable methods that are designed to be near optimal for
the large-scale, time-implicit unsteady CFD and MDAO simulations anticipated in the future.
These may include the extension of well-known matrix-based techniques such as Krylov methods [43] and highly parallel multigrid methods [44], or the development of completely novel approaches such as systematic upscaling methods [45]. Furthermore, these methods must be extensible to tightly coupled multidisciplinary problems. Investment in discretizations and solvers must also consider the potential of these methods to operate on dynamically adapting meshes, to enable optimization procedures, and to incorporate advanced uncertainty quantification capabilities.
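As a reference point for what a Krylov solver kernel looks like, here is a minimal, self-contained sketch of an unpreconditioned conjugate gradient iteration in C, applied matrix-free to a 1-D Laplacian. It is illustrative only: production CFD solvers use preconditioned, parallel variants (and often nonsymmetric methods such as GMRES), but the dot products and matrix-vector products shown here are exactly the operations whose communication cost dominates at massive levels of parallelism.

#include <math.h>
#include <stdio.h>

#define N 64

/* Matrix-free product with the 1-D Laplacian (symmetric positive definite),
 * standing in for the sparse operators arising from implicit discretizations. */
static void apply_A(const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void)
{
    double x[N] = {0}, b[N], r[N], p[N], Ap[N];
    for (int i = 0; i < N; ++i) b[i] = 1.0;          /* right-hand side */

    /* Initial residual r = b - A*x with x = 0, so r = b. */
    for (int i = 0; i < N; ++i) { r[i] = b[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int k = 0; k < 500 && sqrt(rr) > 1e-10; ++k) {
        apply_A(p, Ap);
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("final residual norm: %.3e\n", sqrt(rr));
    return 0;
}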
In many cases, adjoint technology [46]-[47] will be required from the outset for all of these capabilities, but the potential of other more advanced technologies, such as second-order gradients (Hessians) [48]-[49], should be investigated as well. Longer term, high-risk research should focus on the development of truly enabling technologies such as monotone or entropy-stable schemes in combination with innovative solvers on large-scale HPC hardware. The technology roadmap envisions the demonstration of improved robust and scalable solvers in the 2015-2017 timeframe, for both second-order and higher-order accurate methods. The demonstration of complete configuration-grid convergence technology in the 2020 time frame relies on the use of robust higher-order discretizations combined with improved scalable solvers and adaptive h-p refinement. Toward the 2030 time frame, it is anticipated that novel entropy-stable formulations will begin to bear fruit for industrial simulations.
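For reference, the discrete adjoint approach referred to above can be summarized as follows. For a steady solution defined implicitly by the residual equations R(u, α) = 0 and an output J(u, α), a single adjoint solve gives the sensitivity of J with respect to all design variables α:

\left(\frac{\partial R}{\partial u}\right)^{T}\lambda = -\left(\frac{\partial J}{\partial u}\right)^{T}, \qquad \frac{dJ}{d\alpha} = \frac{\partial J}{\partial \alpha} + \lambda^{T}\frac{\partial R}{\partial \alpha}.

The cost is essentially independent of the number of design variables, which is why adjoints are the enabling technology for the gradient-based optimization, error estimation, and UQ capabilities mentioned throughout this plan.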
4.4.1.4 Uncertainty Quantification (UQ)
With regard to uncertainty quantification, a new thrust in the area of probabilistic large-scale CFD
for aerospace applications should be initiated. An initial thrust in this area should focus on enabling
current aerospace CFD tools with well-known uncertainty quantification techniques, such as
sensitivity analysis and propagation methods using adjoints and forward linearization, nonintrusive
polynomial chaos methods, and other reduced-order model formulations [50]-[51]. Additionally, a
concerted effort should be made to characterize important aerospace uncertainties and to make
these available to the general research community for enabling relevant UQ research in these areas.
Improved error estimation techniques must be investigated and developed, given the known
deficiencies of current approaches (including adjoint methods). This will require a foundational
program in the mathematics of error estimation and its application to CFD software. Finally, longer
43 Saad, Y., Iterative Methods for Sparse Linear Systems, Second Edition, SIAM, 2003.
44 Baker, A. H., Falgout, R. D., Kolev, Tz. V., and Yang, U. M., "Scaling Hypre's Multigrid Solvers to 100,000 Cores", High Performance Scientific Computing: Algorithms and Applications, M. Berry et al., eds., Springer, 2012.
45 Brandt, A., "Multiscale Solvers and Systematic Upscaling in Computational Physics", Computer Physics Communications, Vol. 169, Issues 1-3, pp. 438-441, July 2005.
46 Jameson, A., "Aerodynamic Design via Control Theory", ICASE Report No. 88-64, November 1988; also, J. of Scientific Computing, Vol. 3, pp. 233-260, 1988.
47 Errico, R. M., "What is an adjoint model?", Bulletin of the American Meteorological Society, pp. 2577-2591, 1997.
48 Taylor, A. C., Putko, M. M., Green, L. L., and Newman, P. A., "Some Advanced Concepts in Discrete Aerodynamic Sensitivity Analysis", AIAA Journal, Vol. 41, pp. 1224-1229, 2003, 10.2514/2.2085.
49 Rumpfkeil, M. P., and Mavriplis, D. J., "Efficient Hessian Calculations Using Automatic Differentiation and the Adjoint Method with Applications", AIAA Journal, Vol. 48, pp. 2406-2417, 10.2514/1.J050451.
50 Shankaran, S., and Jameson, A., "Robust Optimal Control using Polynomial Chaos and Adjoints for Systems with Uncertain Inputs", AIAA Paper 2011-3069, 20th AIAA Computational Fluid Dynamics Conference, 2011.
51 Ng, L. W-T., Huynh, D. B. P., and Willcox, K., "Multifidelity Uncertainty Propagation for Optimization Under Uncertainty", 12th AIAA Aviation Technology, Integration, and Operations (ATIO) Conference and 14th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2012, 10.2514/6.2012-5602.
term research must focus on statistical approaches, such as Bayesian techniques, for more accurately quantifying modeling and other nonlinear error sources [52]. The technology roadmap includes an early
target date of 2015 for the characterization of typical aerospace uncertainties in order to stimulate
work in this area. Improved error estimation techniques will be gradually brought into the simulation
capabilities and the state of these estimates will be assessed in the 2018 time frame. Comprehensive
uncertainty propagation techniques including discretization error, input and parameter
uncertainties in production-level CFD codes should be targeted for 2025, while the development of
more sophisticated stochastic and Bayesian approaches will continue through the 2030 timeframe.
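As a reminder of the machinery involved, a nonintrusive polynomial chaos expansion represents an output u as a series in orthogonal polynomials Ψ_i of the random inputs ξ:

u(\xi) \approx \sum_{i=0}^{P} u_i\,\Psi_i(\xi), \qquad \mathbb{E}[u] \approx u_0, \qquad \mathrm{Var}[u] \approx \sum_{i=1}^{P} u_i^{2}\,\langle \Psi_i^{2}\rangle,

with the coefficients u_i = ⟨u Ψ_i⟩ / ⟨Ψ_i²⟩ evaluated by quadrature or regression over a modest number of deterministic solver runs; this nonintrusiveness is what makes the approach attractive as a first step for existing production CFD codes.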
4.4.1.5 Geometry and Grid Generation
Substantial new investment in geometry and grid generation technology will be required in order to
meet the Vision CFD 2030 goals. In general, this area has seen very little NASA investment during the
last decade, although it remains one of the most important bottlenecks for large-scale complex
simulations. Focused research programs in streamlined CAD access and interfacing, large-scale mesh
generation, and automated optimal adaptive meshing techniques are required. These programs must
concentrate on the particular aspects required to make mesh generation and adaptation less
burdensome and, ultimately, invisible to the CFD process, while developing technologies that enable the capabilities that will be required by Vision 2030 CFD applications, namely very large-scale parallel mesh generation, curved mesh elements for higher-order methods [53]-[54], highly scalable dynamic overset mesh technology [55], and anisotropic adaptive methods for time-dependent
problems. It is important to realize that advances in these areas will require a mix of investments in
incremental software development, combined with advances in fundamental areas such as computational geometry, possibly with smaller components devoted to high-risk disruptive ideas such as anisotropic cut-cell meshes [56], strand mesh ideas [57], and even meshless methods [58].
Additionally, because significant technology currently resides with commercial software vendors,
particularly for CAD interfaces and access, involving these stakeholders in the appropriate focused
research programs will be critical for long-term success. Innovative approaches for achieving such partnerships must be sought out, such as the formation of consortiums for the definition and adoption of standards, or arrangements that address other potential issues such as large-scale parallel licensing of commercial software. The technology development roadmap envisions the demonstration of tight CAD coupling
and production adaptive mesh refinement (AMR) in the 2015-2017 time frame, followed by
maturation of large-scale parallel mesh generation in the 2020-2025 time frame, and leading
ultimately to fully automated in-situ mesh generation and adaptive control for large-scale time-
dependent problems by 2030.
52 Press, S. J., Subjective and Objective Bayesian Statistics: Principles, Methods and Applications, 2nd edition, Wiley, New York, 2003.
53 Wang, L., Anderson, W. K., Erwin, J., and Kapadia, S., "High-order Methods for Solutions of Three-dimensional Turbulent Flows", AIAA Paper 2013-856, 51st AIAA Aerospace Sciences Meeting, Jan 2013.
54 Persson, P-O., Willis, D., and Peraire, J., "The Numerical Simulation of Flapping Wings at Low Reynolds Numbers", AIAA Paper 2010-724, 48th AIAA Aerospace Sciences Meeting, Jan 2010.
55 Pulliam, T. H. and Jespersen, D. C., "Large Scale Aerodynamic Calculation on Pleiades", Proceedings of the 21st International Conference on Parallel Computational Fluid Dynamics, Moffett Field, California, May 18-22, 2009.
56 Modisette, J., and Darmofal, D., "Toward a Robust, Higher-Order Cut-Cell Method for Viscous Flows", AIAA Paper 2010-721, 48th AIAA Aerospace Sciences Meeting, Jan 2010.
57 Katz, A., Wissink, A., Sitaraman, J., and Sankaran, V., "Application of Strand Meshes to Complex Aerodynamic Flow Fields", AIAA Paper 2010-4934, 28th AIAA Applied Aerodynamics Conference, June 2010.
58 Katz, A., and Jameson, A., "Meshless Scheme Based on Alignment Constraints", AIAA Journal, Vol. 48, pp. 2501-2511, 2010.
4.4.1.6 Knowledge Extraction
Peta-scale and exa-scale simulations will generate vast amounts of data, and various government agencies such as the NSF and DOE have instituted major programs in data-driven simulation research. In order to make effective use of large-scale CFD and MDAO simulations in aerospace engineering, a thrust in data knowledge extraction should be initiated. Ideally, this should contain three components: visualization, database management, and variable-fidelity data integration. Methods to process and visualize very large-scale unsteady CFD simulations in real time, including results from higher-order discretizations, are required to support the advanced CFD capabilities envisioned in 2030. Although many of the current efforts in maturing visualization technology are being led by commercial vendors who continue to supply enhanced capabilities in this area, more fundamental research to directly embed visualization capabilities into production CFD tools optimized for emerging HPC platforms is needed to achieve real-time processing [59].
Moreover, the CFD capability in 2030 must provide the analyst with a more intuitive and natural interface into the flow solution to better understand complex flow physics. Foreseeing the capability of generating large databases with increasing computational power, techniques for rapidly integrating these databases and querying them in real time will be required. Finally, integrating high-fidelity simulation data with lower fidelity model data, as well as experimental data from wind tunnel tests, engine test rigs, or flight tests, will provide a powerful approach for reducing overall risk in aerospace system design [60]. Techniques for building large-scale flexible databases are in their infancy, and range from simple software infrastructures that manage large numbers of simulation jobs to more sophisticated reduced-order models [61], surrogate models, and Kriging methods [62]. The objective
of a research thrust in this area should be to apply existing techniques to current CFD simulation
capabilities at a large scale, while simultaneously performing foundational research in the
development of better reduced-order models and variable fidelity models that are applicable to
aerospace problems and can support embedded uncertainty quantification strategies.
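To indicate what a reduced-order model of this kind typically involves, a common construction (proper orthogonal decomposition, used for example in the model-reduction work cited above) proceeds from a matrix of mean-subtracted solution snapshots S = [u(t_1) - ū, ..., u(t_K) - ū]: its singular value decomposition S = Φ Σ V^T supplies an orthonormal basis, and the state is approximated as

u(t) \approx \bar{u} + \sum_{i=1}^{m} a_i(t)\,\phi_i, \qquad m \ll n,

after which Galerkin projection of the governing equations onto the m retained modes yields a small system of ODEs for the coefficients a_i(t) that can be evaluated orders of magnitude faster than the full CFD model.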
The technology roadmap envisions the demonstration of real-time analysis and visualization of a notional 10^10 point unsteady CFD simulation in 2020, and a 10^11 point simulation in 2025. These
technology demonstrations would be an integral part of the GC problems designed to benchmark
advances in other CFD areas. The development of reduced-order models and other variable fidelity
models will entail long term research and will likely remain an active research topic past the 2030
time frame. However, the technology roadmap envisions the periodic assessment of the state-of-the-
art in these areas at 5 to 10 year intervals, with investment directed toward demonstrating promising
approaches on large-scale aerospace applications.
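The raw data volumes behind these visualization targets are easy to estimate; the short C program below does so, assuming (purely for illustration) that five flow variables are stored per grid point in double precision.

#include <stdio.h>

int main(void)
{
    /* Illustrative estimate only: raw size of one snapshot of an unsteady CFD
     * solution, assuming 5 variables per point stored in double precision.
     * Point counts follow the roadmap targets (10^10 in 2020, 10^11 in 2025). */
    const double points[]   = {1e10, 1e11};
    const double vars       = 5.0;
    const double bytes_each = 8.0;

    for (int i = 0; i < 2; ++i) {
        double bytes = points[i] * vars * bytes_each;
        printf("%.0e points -> %.1f TB per snapshot\n", points[i], bytes / 1e12);
    }
    return 0;
}

Even a single snapshot at the 2025 target amounts to several terabytes, which is one reason embedded, in-situ processing rather than post hoc file-based visualization is emphasized above.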
4.4.1.7 Multidisciplinary Design and Optimization
The ability to perform CFD-based multidisciplinary analysis (MDA) and analysis/optimization
(MDAO) relies on the availability of future capabilities that need to be developed between now and
2030. Pervasive and seamless MDAs (that can be routinely exercised in industrial practice for
configuration studies, e.g., full aero-thermo-elastic/aero-acoustic simulations of entire
airframe/propulsion systems including shielding) will require the development of accepted
standards and APIs for disciplinary information and the required multidisciplinary couplings (such
59
Wang, Y., Yu, H., and Ma, K-L, “Scalable Parallel Feature Extraction and Tracking for Large Time-Varying 3D
Volume Data”, Proceedings of EGPGV 2013, May 2013, pp. 55-62.
60
The 1st Workshop on Integration of Experimental Fluid Dynamics (EFD) and Computational Fluid Dynamics
(CFD), JAXA Special Publication SP-09-002, January 2010.
61
Washabaugh, K., Amsallem, D., Zahr, M., and Farhat, C., “Nonlinear Model Reduction for CFD Problems Using
Local Reduced-Order Bases”, AIAA Paper 2012-2686, 42nd AIAA Fluid Dynamics Conference, June 2012.
62
Han, Z-H., and Görtz, S., “Hierarchical Kriging Model for Variable-Fidelity Surrogate Modeling”, AIAA Journal,
Vol.50, pp.1885-1896, 2012, 10.2514/1.J051354.
as with acoustics, combustion, structures, heat transfer, radiation). A concerted effort is envisioned
that results in a set of standards available to the community around 2016. In parallel with this effort,
it will also be necessary to develop high-fidelity coupling techniques that guarantee the accuracy and stability of high-fidelity, tightly coupled MDAs [63], while ensuring that the appropriate conservation principles are satisfied with errors below acceptable thresholds.
This capability, together with the coupling software that includes such information transfers, must be available around 2018. Together, the standards and the coupling techniques/software would enable demonstrations of two-way coupled MDAs with the best and most robust existing CFD solvers of the time, with guaranteed coupling fidelity, by the year 2020. Such demonstrations can focus on multiple
aerospace problems of interest, including aircraft aero-structural/aero-elastic analyses, aircraft
aero-acoustics, rotorcraft aero-structural and aero-acoustic couplings, unsteady combustion, re-
entry aerothermodynamics and material response, and the like. Initially, such routine MDAs would
focus on portions of an entire vehicle (around 2020) and would transition to the treatment of the
entire system around 2025.
A number of capabilities also must be developed in order to enable MDAO with and without the
presence of uncertainties (robust and reliability-based design). A major research component that is likely to span a significant period (2015-2025) is the work needed to endow industrial-strength CFD solvers with both gradient calculation and uncertainty quantification capabilities for use in multidisciplinary optimization. Some of this work has been described in the "Numerical Algorithms" section. For the gradient/sensitivity analysis capability, we envision that the CFD solver will be able
to compute this information for full unsteady flows for the turbulence models available at the time.
Finally, all these new capabilities must come together on a series of MDAO grand-challenge
demonstrations in the 2030 timeframe.
4.4.2 Recommendations
In order to effectively execute the CFD development plan described above and achieve the goals laid
out in the vision of CFD in 2030, a comprehensive research strategy and set of recommendations are
presented. This research strategy calls for the renewed preeminence of NASA in the area of computational sciences and aerodynamics, and calls for NASA to play a leading role in the pursuit of revolutionary simulation-based engineering. Aerospace engineering has had a long history of developing technology that impacts product development well beyond the boundaries of aerospace systems. As such, NASA, simply by fulfilling its mission, is a critical force in driving technology throughout aerospace engineering. Computational methods are a key example of this broad impact, as NASA has historically been a leader in the development of structural finite-element methods, computational fluid dynamics, and applications of HPC to engineering simulations. NASA's effort must be targeted toward research and technology development that can make revolutionary impacts on simulation-based engineering in the aerospace sciences. In particular, the current state of CFD is such that small, incremental improvements in existing capability have not had revolutionary effects.
In an environment of constrained resources, this will require that NASA evaluate its activities with a
critical eye toward supporting those efforts whose impact could be revolutionary. To ensure that the
technology plan and roadmap are as effective as possible, we propose specific recommendations (see
Figure 4.4.1). Naturally, individual research thrusts affect multiple technical areas, which in turn
affect the ability to meet various milestones and progress toward the GC problems.
63 "Multiphysics Simulations: Challenges and Opportunities", Argonne National Lab Report ANL/MCS-TM-321, report from workshop sponsored by the Institute for Computing in Science (ICiS), Park City, Utah, June-August 2011.
4.5 HPC Envisioned by Department of Energy (DOE)
The aim here is to consider whether or not to pursue the main issues raised by going to the exa-scale, and to provide some guidance on the level of risk involved in pursuing [64], and not pursuing, this direction of high performance computing. Going to the exa-scale will mean a radical change in computing architecture: vastly increasing the levels of parallelism to the point of millions of processors working in concert, which will force radical changes in how hardware is designed. It will also dictate how we go about solving problems (e.g., the application codes), and how we marry application codes to the underlying hardware (e.g., the compilers, I/O, middleware, and related software tools). Understanding the advantages to be gained by going to the exa-scale, and evaluating the risks involved in going down this path, requires both an evaluation of past experiences in moving from the megaflop era to the present petaflop era, as well as an assessment of the readiness of advanced applications to take transformative advantage of exa-scale computing. The challenges inherent in developing exa-scale computing as a practical endeavor are considerable, and significant investments will be needed to accomplish this.
64 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Figure 4.4.1 Proposed New Computational Sciences Program Structure, organized by technical area:
• Geometry and Grid Generation: CAD access and interfaces; large-scale parallel mesh generation; adaptive mesh refinement; curved mesh elements for higher order
• MDAO: interfaces and standards; accurate and stable coupling techniques; UQ support and sensitivities (system level)
• Knowledge Management: visualization; database management; variable-fidelity models
• HPC: increasing access to leading-edge HPC hardware; porting of current and future codes to leading-edge HPC; radical emerging HPC technologies
• Physical Modeling: RANS turbulence modeling; hybrid RANS-LES modeling (improved RANS component, seamless interface); LES (wall-modeled and wall-resolved); transition; combustion; radically new modeling approaches
• Numerical Algorithms: advances in current algorithms for HPC; novel discretizations (higher-order methods, low dissipation/dispersion schemes, foundational novel approaches); solvers (linear and non-linear scalable solvers, enhancements for MDAO and UQ)
• UQ: define aerospace uncertainties; leverage known techniques; improved error estimation techniques; statistical approaches
4.5.1 What is Exa-scale Computing?
Exa-scale computing refers to computing systems capable of at least one exa-FLOPS, that is, a billion billion (10^18) calculations per second [65]. Such capacity represents a thousand-fold increase over the first peta-scale computer, which came into operation in 2008 [66]. (One exaflops is a thousand petaflops, or a quintillion, 10^18, floating point operations per second.) At a supercomputing conference in 2009, Computerworld projected exa-scale implementation by 2018. Exa-scale computing would be considered a significant achievement in computer engineering, for it is believed to be on the order of the processing power of the human brain at the neural level (the functional level might be lower). It is, for instance, the target power of the Human Brain Project [67].
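The practical meaning of the definition is easiest to see from a time-to-solution estimate. The short C program below, using a purely hypothetical operation count of 10^21 floating point operations for a large LES-class job, compares sustained peta-scale and exa-scale execution.

#include <stdio.h>

int main(void)
{
    /* Illustrative only: wall-clock time for a job needing 1e21 floating point
     * operations (a hypothetical large LES), assuming the machine actually
     * sustains its quoted rate. */
    const double work = 1e21;                  /* floating point operations */
    const double petaflops = 1e15, exaflops = 1e18;

    printf("at 1 PF/s sustained: %.1f days\n",  work / petaflops / 86400.0);
    printf("at 1 EF/s sustained: %.1f hours\n", work / exaflops  / 3600.0);
    return 0;
}

Under these assumptions the same job drops from roughly eleven days to well under an hour, which is the kind of turnaround change that moves high-fidelity simulation into the design cycle.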
4.5.2 Why Exa-scale?
The most obvious question, the key question really is of course: why go to the exa-scale? This
question is not meant in the trivial sense that one would pose for any expenditure whatsoever in
leading edge computing technologies, but rather is motivated by the fact that the transition from
current peta-scale computing to the exa-scale will involve investments across the board from
hardware to fundamental algorithms, programming models, compilers, and application codes that
will dwarf previous levels of investment made as computer architectures have evolved in the past.
That is, we recognize that the values to society extracted from this change in computing paradigm have to be commensurate with the costs of developing this type of computing and, given the substantial
costs, we need to be sure that the extracted values are similarly substantial. We will make the
argument in the following that the extracted values are in fact very large but will do so in two stages,
first by making some general points about the present frontiers of computing independent of
discipline and then by focusing on a few example disciplines to illustrate the more general point.
4.5.3 Range of Applications that may be Transformed by Going to the Exa-scale
As discussed earlier, a key question to be addressed in considering going to the exa-scale is the
readiness of key applications to take this step, as well as the likelihood that taking this approach will
lead to transformative changes in these application areas. This question is addressed in the present
section, focusing once again on a selection of disciplines to illustrate the breadth of applications that
are ready for this transition.
4.5.3.1 Aerospace, Airframes and Jet Turbines
Computing at an extreme scale will have transformational effects on several key applications in the aerospace industry [68]. The move from RANS to LES as the industry standard and its use in the design
cycle represents a paradigm shift for the aerospace industry. In addition, there are several
outstanding scientific problems in these sectors that can be understood and hopefully controlled
using extreme scale computing. The accuracy achieved with the RANS approach for prediction of
quantities of engineering interest in the airframe industry has reached a plateau owing to the
epistemic uncertainties inherent in such turbulence models. As a result, the design of aircraft and
propulsions systems relies on an iterative process where several expensive prototypes are
constructed and tested in wind tunnels. Hybrid RANS/LES approaches with grounding in first principles can overcome the limitations of RANS and enhance the predictive capability of CFD beyond the present seemingly stagnant state of speculative trial-and-error in design [69]. In addition, building
65 From Wikipedia, the free encyclopedia.
66 National Research Council, "The potential impact of high-end capability computing on four illustrative fields of science and engineering", The National Academies, p. 11, ISBN 978-0-309-12485-0.
67 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
68 See above.
69 Wall-Modeled LES (WM-LES) and hybrid RANS-LES methods provide a clear path to first-principles design of next-generation aircraft as exa-scale computing arrives. Transitioning this technology to future exa-scale platforms will have a transformative impact upon simulation-based engineering design, making possible the design of aerodynamically optimized vehicles including integrated effects of propulsion, structures, and active controls, a "Grand Challenge" of aerodynamic design.
a complete flight-envelope characterization (accounting for irreducible uncertainties, e.g., angle-of-attack, flight conditions, and geometry) will only be possible with computing at the exa-scale and beyond. Such a design framework for aerodynamically optimized vehicles and propulsion systems is a critical resource for the design and construction of next-generation aircraft and propulsion systems. Figure 4.5.2 provides estimates of the computing requirements to meet these design goals for several Grand Challenges in aerospace systems, giving the computer speed and memory requirements for the analysis and design of airfoils, wings, and complete aircraft at three different stages of approximation.
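The reason flight-envelope LES is tied to the exa-scale can be seen from published grid-point estimates for turbulent boundary layers (e.g., the scaling arguments of Choi and Moin); treating these as rough guides rather than exact requirements, the number of grid points N needed grows with the flow Reynolds number Re_L approximately as

N_{\text{wall-resolved LES}} \sim Re_L^{13/7}, \qquad N_{\text{wall-modeled LES}} \sim Re_L, \qquad N_{\text{DNS}} \sim Re_L^{37/14},

so that at full-scale flight Reynolds numbers, even wall-modeled LES of a complete configuration across many flight conditions demands the throughput increases discussed in this section.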
One of the major problems confronting the aircraft industry is the aerodynamic noise generated by
engine exhaust jets and airframes, particularly during take-off and landing approaches. The noise
problem has been a major issue for high-speed commercial aircraft and more recently for military
aircraft, both for impact on communities surrounding airports and military bases, and on the crew
stationed on aircraft carrier decks. It is known that turbulence is a major contributor to aircraft noise.
Unfortunately, modern optical diagnostic techniques are far from adequate in measuring the spatial-
temporal data needed to reveal the mechanics of aerodynamic noise; only high-fidelity simulation
techniques, such as LES, are capable of predicting both the far-field noise as well as details of the
noise generating turbulent eddies. Exa-scale computing would have transformational impact on the
discovery of the mechanics of noise generation, and would be instrumental in designing noise
mitigation strategies. Figure 4.5.3 shows the turbulent flow from a supersonic exhaust jet (M = 1.7)
obtained from a breakthrough state-of-the-art LES computation in 2010. This first-of-a-kind calculation lacks a high-fidelity representation of the flow inside the nozzle, and the agreement with the measured noise data is only fair, presumably due to this inadequate grid resolution. As exa-scale computing tools become available, high-fidelity tools would not only be used to understand and
Figure 4.5.2 Computer speed and memory requirements for the Grand Challenge
predict flow-generated noise, they will be used to learn how to control it. Such demonstration
calculations have been extremely computer intensive, and limited to very simple flows. Exa-scale
computing would be the enabling technology for complex flow control and shape optimization (e.g.,
of aircraft wings and nozzle exits), potentially leading to a major transformational effect on the
aerospace industry.
Another outstanding technical problem in the gas-turbine industry is the migration of hot fluid parcels from the combustor to the turbine. The hot-streak migration is a limiting factor in the design of turbines, as turbine blades, designed based on mean flow temperatures, are damaged severely when encountering the migrating hot spots. High-fidelity simulation of the flow inside the combustor of a jet engine is a daunting task due to the multi-physics phenomena present. Even in modern LES computations of combustors using peta-scale class computers, reduced-order models are used for critical phenomena such as primary atomization of the injected liquid fuel into micron-size droplets, the evaporation process of the droplets, and the chemical mechanisms involved. Exa-scale computing would be the enabling technology for simulation of jet engine combustors based on first principles, which in turn promises to facilitate the discovery of mitigating strategies for the suppression of hot-streak migration [70].
4.5.3.2 Combustion
Reliable prediction requires, for example, the incorporation of heterogeneous kinetics with quantified uncertainties in turbulent combustion simulations for processes such as soot formation/burnout, and increased-fidelity coupling of high-pressure, low-temperature chemistry with turbulent transport; these vital enhanced modeling techniques will only be feasible at exa-scale computing performance levels. In particular, combustion scientists must focus on the science underlying the development of non-petroleum-based fuels, including carbon-neutral biofuels, and their optimal use in transportation. This science intrinsically involves chemistry interacting with transport at
70 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Figure 4.5.3 A supersonic Jet Engine Nozzle Rapidly Accelerates High-Pressure Gas into the Atmosphere
conditions far from equilibrium and at extreme pressures, and requires a coordinated multi-scale approach for understanding and predicting combustion in turbulent environments [71].
.
Combustion in practical devices covers a myriad of time and length scales, from the scale of the
electron to those of the largest scales of turbulence dependent upon the geometry of the device. To
tackle this daunting challenge and complexity, a multi-scale approach is adopted wherein
experiments, theory and direct computation are brought to bear on a limited range of scales (4-5
decades) and fundamental physical insights gained are encapsulated in reduced-order
parameterizations that are used to upscale knowledge to bridge the scales. Several high-fidelity
computational approaches in both the atomistic and continuum regimes utilize peta-scale computing.
Exa-scale computing would greatly facilitate higher fidelity or access to more practically relevant
parameter regimes (e.g., higher pressure, higher turbulence levels, and more complex fuels). In the
continuum regime, where turbulence scales interact with flame, ignition, and mixing scales, turbulence-chemistry interactions are important. Virtually all combustion devices operate in turbulent environments because of the enhanced mixing and greater efficiency. Many of the fundamental turbulence-chemistry interactions are amenable to investigation by first-principles direct numerical simulation (DNS) and high-fidelity large-eddy simulation (LES) of building-block, laboratory-scale flows. Whereas DNS focuses on fully resolving the fine-grained physics, LES resolves the energy-containing end of the turbulence spectrum down to a specified cut-off in the inertial or dissipative end of the spectrum, and the unresolved sub-grid scales are modeled. As such, these methods are
complementary. Both DNS and LES require the horsepower of high-performance supercomputing at
the exa-scale and beyond to resolve all relevant flow and chemical scales. Exa-scale simulations are
required, for example, to understand the coupling between low-temperature ignition kinetics and
turbulent mixing at high pressure that determines lifted flame stabilization, ignition timing, rate of
combustion, and emissions characteristics. Understanding complex low-temperature high pressure
kinetics of alternative fuels and its coupling with turbulent transport at high pressure requires much
greater resolution and the transport of large numbers of reactive scalars only afforded by extreme
scale computing power. Moreover, in-situ reduction strategies for accurate and computationally
affordable inclusion of heterogeneous kinetics with quantified uncertainties in DNS and LES are
required. The insights gained from exa-scale simulations will enable the development of predictive
multi-scale models to optimally design future evolving fuels and engines.
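A standard estimate explains why DNS of even these building-block flows is itself an extreme-scale computation: resolving from the integral scale L down to the Kolmogorov scale η, with L/η ~ Re^{3/4}, requires roughly

N_{\text{points}} \sim \left(\frac{L}{\eta}\right)^{3} \sim Re^{9/4}, \qquad N_{\text{time steps}} \sim Re^{3/4}, \qquad \text{total work} \sim Re^{3},

before any cost for the transport of large numbers of reactive scalars and detailed chemistry is added, which is why higher pressures, higher turbulence levels, and more complex fuels push these simulations beyond the peta-scale.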
Future predictive simulation tools running on exa-scale computing systems will enable deep
understanding of underlying chemical and combustion science processes, enhance combustion
engine design and performance, and ultimately yield a dramatic reduction in engine development
timescales, time to market, and development costs, while ensuring the timely achievement of energy
security and emissions goals, and enhancing the competitiveness of U.S. engine manufacturers and
fuel producers.
4.5.3.3 Climate Modeling
Although substantial uncertainty exists as to the degree and impacts of future climate change,
especially at local and regional scales, it is generally agreed that significant adaptation will be
required. Furthermore, the magnitude of climate change later in the century depends upon the near
and intermediate-term mitigation strategies used to reduce the emission of greenhouse gases. These
strategies also must satisfy an increasing energy demand of a growing global population experiencing
an improvement in its standard of living. Predicting these future climate changes and evaluating the
effects of mitigation strategies require Earth system models (ESMs) that are far more accurate and
comprehensive than those in use today. Integrated assessment models provide the framework for
climate predictions by defining the emissions scenarios and elucidating the relationships among the
natural and human systems that are at the core of climate change studies. In the next decade,
71 See previous.
integrated assessment and comprehensive ESMs will probably be combined into a single system that
could be used to investigate scientific issues and to formulate policy options for adaptation and
mitigation.
The predictions from integrated ESMs will be most credible if the important processes in the climate
system, for example mixing by ocean eddies, are simulated at their native spatial and temporal scales.
Critical organized features in the atmosphere and ocean including clouds and eddies have
characteristic sizes of 1 to 10 km. Some of the major sources of uncertainty in climate predictions
from existing models are associated with the aggregate effects of these phenomena. Experience with
current climate models suggests that simulation of climate change with a model with 10-km grid
resolution is inherently a peta-scale problem. In fact, even higher resolution is required to resolve
these features with sufficient fidelity to the physical principles underlying their formation and
evolution. Since the computational cost increases nonlinearly with higher resolution, it is likely that
predictions of societal and environmental change at 1-km resolution would require truly extreme
scale computers.
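The jump from a peta-scale to an extreme-scale problem follows from a simple (and admittedly rough) scaling argument that ignores vertical resolution and changes in the physics packages: with a CFL-limited time step, refining the horizontal grid spacing Δx increases the cost as

\text{cost} \propto \left(\frac{\Delta x_{\text{old}}}{\Delta x_{\text{new}}}\right)^{3},

so going from 10-km to 1-km resolution multiplies the work by roughly 10^3, turning a peta-scale simulation into an exa-scale one.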
4.5.3.4 Computational Biology
The ultimate goal of exa-scale computing applications to challenges in modern biology is to go from
atoms to organs or from microbes to ecosystems: for example, to enable an understanding of how
the brain works as an energy efficient, biologically-based information system, or to understand
microbial processes and their impact on the geosphere. In the process, these newly enlarged scales
of computing will resolve unfathomably complex research issues in a host of fields as diverse as
neuroscience and microbial metagenomics. At exa-scale, new scalable tools that admit a variety of
time, space and trajectory sampling methods (and fully exploit the hundreds of millions of cores of
an exa-scale machine) will enable long time integrations, implicit solvation conditions, and mixed
molecular mechanics and quantum mechanics models, to allow breakthrough science. For example,
a large biochemical network within a full-scale model of a eukaryotic cell could be modeled in the
span of a few hours.
It is important to note that the first million-atom simulation in biology was conducted just five years ago: an all-atom simulation of the ribosome performed at Los Alamos National Laboratory. This million-particle simulation milestone had already been achieved a decade earlier in materials science and cosmology (computational scientists in both of these fields now perform multibillion-particle simulations). While biology researchers have achieved impressive methodological advances that permit the modeling of the largest assemblies in the cell, it is only for short periods of time. Moreover, these simulations are unlikely to scale to the size of a single cell, even a small bacterium, for relevant times such as minutes or hours, even if researchers can employ computers capable of achieving 1,000 petaflops/s. Today, researchers are limited to the microsecond timescale for protein folding, a restriction imposed by the huge number of intermolecular interaction computations. Scientists also lack
folding required by the huge number of intermolecular, interaction computations. Scientists also lack
rigorous coarse grained models that permit the scaling up of macromolecular pathways and
supramolecular cellular processes. Similarly, systems biology methods lack the dynamic resolution
needed for coupling genomic and other data in order to fully map cellular networks, to predict their
functional states, and to control the time varying responses of living cells. Nor can current kinetics
models adequately analyze the dynamics of complex living systems. Exa-scale computing will be
needed to achieve those capabilities. Within the next decade, scientists expect to have the complete
genome sequence of more than 10,000 bacteria and archaea and other single-celled microbes. Exa-
scale computing platforms will make it possible in principle to systematically reconstruct the
metabolic networks of all sequenced microbes through automated comparative analysis, to
reconstruct their regulatory networks by integrating a variety of data sources, and to combine these
reconstructions into functional models of cellular states. Exa-scale computing will be critical to make
this a routine class of computation such that it can become part of the standard way we analyze
genomes in the future.
4.5.3.5 Materials Science
Materials innovations are central to many of the technological advances responsible for our quality of life and prosperity. In fact, many of the disruptive technological advances since the turn of the last century (modern transportation, medical treatments and prosthetics, space exploration, global communication, computers and the electronics industry) used advances arising from every corner of the materials world: metals, ceramics, semiconductors, polymers, and novel combinations of these.
Materials establish and support entire industries, and tens of millions of manufacturing jobs depend
on the availability of these advanced materials at affordable costs. A quantifiable understanding of
novel materials and their response is central as well to the technological challenges facing our
country. Whether it is ceramics for high-efficiency automobiles, photovoltaics for next-generation
solar power or smart alloys for efficient building construction, the nation requires the development
of advanced materials with superior properties that will drive the next generation of technologies. In
the highly competitive global marketplace in which we find ourselves, minimizing time to solution and
time to market is crucial.
It is instructive to consider two workhorse techniques for materials modeling, hydrodynamics and molecular dynamics, and examine the reasons why a simulation might fail to provide sufficiently useful information. Molecular dynamics simulations are characterized by a force field or potential, involving many adjustable parameters, which describes the interactions between atoms. There are no parameters required to describe the response of the material; rather, all of the constitutive response emerges naturally from the interaction potentials. Such calculations are currently limited in size to fractions of a cubic micron simulated for tens of nanoseconds, even on the largest computers. Hydrodynamics, by comparison, involves many adjustable parameters describing both the interactions and the material response. However, there is no real size or time limit in the simulation. There is a practical lower limit on resolution, as it makes no sense to model an atomically sized region of space using continuum equations. At a given
level of computing, computational scientists using
either method encounter two common barriers to
success: (a) the largest (or most finely resolved)
simulation possible is still too small (or too poorly
resolved) to capture the relevant behavior of
interest, or (b) the most complex, compute-
intensive simulation that can be solved in a
reasonable time is still too simple or approximate
to adequately describe the physics of interest. In
many cases both (a) and (b) are true, which is particularly damning, since it prevents the
investigator from performing the traditional
trade-off between these two constraints: very
often, one makes simplifying approximations to enable a larger simulation or investigates smaller
systems in order to perform a more complicated calculation.
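The cost structure described above is visible even in a toy molecular dynamics code. The sketch below, illustrative only, integrates a short 1-D chain of atoms coupled by harmonic springs with the velocity-Verlet scheme; real force fields and neighbor searches are vastly more expensive, but the cost still scales with (atoms) x (interactions per atom) x (time steps), which is what bounds the system sizes and durations that can be simulated.

#include <stdio.h>

#define NATOMS 8
#define STEPS  1000

/* Minimal molecular-dynamics sketch: velocity-Verlet integration of atoms
 * coupled by harmonic springs along a 1-D chain (rest spacing 1.0). */
static void forces(const double *x, double *f)
{
    const double k = 1.0;                       /* illustrative spring constant */
    for (int i = 0; i < NATOMS; ++i) f[i] = 0.0;
    for (int i = 0; i < NATOMS - 1; ++i) {
        double stretch = x[i + 1] - x[i] - 1.0;
        f[i]     += k * stretch;
        f[i + 1] -= k * stretch;
    }
}

int main(void)
{
    double x[NATOMS], v[NATOMS] = {0}, f[NATOMS], fnew[NATOMS];
    const double dt = 0.01, mass = 1.0;

    for (int i = 0; i < NATOMS; ++i) x[i] = i + 0.1 * (i % 2); /* perturbed chain */
    forces(x, f);

    for (int n = 0; n < STEPS; ++n) {
        for (int i = 0; i < NATOMS; ++i)                /* position update */
            x[i] += v[i] * dt + 0.5 * (f[i] / mass) * dt * dt;
        forces(x, fnew);                                /* new forces */
        for (int i = 0; i < NATOMS; ++i) {              /* velocity update */
            v[i] += 0.5 * (f[i] + fnew[i]) / mass * dt;
            f[i] = fnew[i];
        }
    }
    printf("x[0] after %d steps: %f\n", STEPS, x[0]);
    return 0;
}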
On the other hand, investigating grain formation using molecular dynamics may not be possible, even
in the simplest metals on today’s computers. The availability of an exa-scale platform will move the
location of the constraints, allowing quite generally more detailed calculations of more complex
materials. State-of-the-art calculations involving billions of atoms have been performed that demonstrate the ability to model macroscopic (i.e., continuum) materials behavior with an atomistic model that makes no assumptions about the cooperative response. Figure 4.5.4 shows a detail view of a 9-billion-atom molecular dynamics simulation of a developing Kelvin-Helmholtz instability at the sheared interface between aluminum and copper. With the development of an exa-scale computer, it is possible that such a calculation (which was heroic on a peta-scale computer) could be performed
on demand during a hydrodynamics calculation, determining, for example, the equation of state for a mixed region at precisely the temperature, pressure, and composition that were required. By tabulating this information as it is generated, one can envision that such a simulation would teach itself as it runs, learning only those regions of this three-dimensional phase space that are needed.
4.5.3.6 Nuclear Engineering
Recent studies have reviewed the status and basic science, challenges, opportunities, and research
needs for advanced nuclear energy systems, with specific attention to the role of predictive modeling
and simulations (M&S) in addressing the difficulties posed by the radioactive materials and harsh
environments found in these systems:
• Computational M&S offers the opportunity to accelerate nuclear energy development by simulating complex systems to evaluate options and predict performance, thus narrowing the technology path and optimizing testing requirements.
• Today's high-performance computational systems are capable of modeling complete reactor systems and related technologies; the availability of exa-scale systems will enable high-fidelity M&S that can further improve the performance of existing reactors and have a significant positive impact on both the design and the operation of future reactors.
Simulation has the potential for addressing the critical needs of advanced nuclear energy systems by
providing the tools necessary for safety assessments, design activities, cost, and risk reduction. One
can, for example, imagine virtual prototyping of reactor cores yielding data that leads to more
accurate identification of design margins, allows early experimentation with novel design concepts,
and ultimately significantly reduces plant certification timelines. In other areas, such as advanced
fuel fabrication, atomistic fuel simulations could ultimately make it possible to target a small subset
of promising candidate fuel types for further experimentation, greatly reducing the number of
experiments to be performed. A simulation-based methodology is within reach with exa-scale
computers. The scope of the M&S tools needed to support the design, analysis and engineering of
next-generation nuclear energy systems is daunting:
1. Integrated 3D reactor core simulations with rigorous propagation of uncertainty;
2. Coupled thermal hydraulic and primary loop simulation;
3. Advanced fuel design and performance;
4. Fuel behavior engineering;
5. Advanced secondary loop and balance of plant engineering and analysis;
6. Advanced fuel cycle design;
7. Separations facility engineering optimization;
8. Repository design including seismic, geological, chemical, and thermal modeling and simulation;
9. Overall nuclear energy systems model development suitable for alternative economic analysis.
Spent fuel reprocessing is very complicated: a large number of different materials and multiple pathways must be considered; waste streams must be treated; and improved coupling between computations and experiments must occur. Reprocessing occurs at high temperature, and is in dire need of better multi-scale M&S. The opportunities for impact on reprocessing with exa-scale M&S abound. These include developing new separation agents, full-scale plant simulations using first principles, integrating multiple codes, and separations simulations. Empirical understanding does not lead to appropriate scale-up; it will instead require exa-scale computing. Some of the payoffs for exa-scale computation include: reduced R&D cost and time; improved/accelerated design; process scale-up; reduced facility cost; opportunity for major change; and waste form design. Many challenges
confront viable and useful (predictive) M&S of fuel performance. These include the ability to reduce
fuel development and qualification time, assess life cycle performance, address safety concerns,
predict fuel rod behavior in a design basis accident (DBA), and predict current and advanced (e.g., transuranic) fuel behavior. Important effects and requirements to incorporate include material
properties, swelling, microstructural phase change, thermal properties, crack formation and
mechanical property change. High-fidelity modeling of fuel performance is inherently multiscale, e.g.,
the effects of point defects and fission products must be considered. Exa-scale platform requirements
drivers in fuel performance can be quantified.
Opportunities for exa-scale M&S of existing and future advanced reactors include eliminating unrealistic assumptions that drive designs to be more conservative and thus raise installation cost, helping to achieve higher power efficiencies, reducing the learning curves needed to reach those efficiencies, helping to reduce the required number of repositories, improving safety posture, optimizing the design of the power grid and the fuel cycle, and enabling better (more efficient) operations, including in-line monitoring and operator training. There are numerous issues confronting advanced reactor M&S today. The core is a coupled-physics problem (not currently handled very well) and the full system needs to be analyzed in one tool. Current reactor designs are excessively conservative.
4.5.3.7 Other Disciplines
Other frequently mentioned disciplines which will be greatly impacted by exa-scale computing are:
• Astrophysics
• Fusion Energy
• National Security
Users should consult the report "The Opportunities and Challenges of Exa-scale Computing" by the USDOE for further information.
4.5.4 Challenges in Going to the Exa-scale
Creating an exa-scale computer capable of effectively running the applications just described will require significant R&D breakthroughs. The previous section laid out the case for the wide range of scientific and technical advances that could be made with an exaflop computer. This section discusses the challenges involved in making that three-orders-of-magnitude jump in technology. In this type of discussion, it is often far too easy to talk about that jump as some quantitative steps in an evolutionary process, when in fact the jump implies significant qualitative changes in the way solutions must be approached. Consider Table 4.5.1, which illustrates a change of three orders of magnitude. The analogy to computing challenges is not quite the same, because we do not have to explore totally different technologies to make the leap of three orders of magnitude. However, just as we would not think of asking a marathon runner to explore the solar system, we cannot use current technology to produce an exaflop system. Below we highlight the important steps necessary to take this giant step, while users could consult the report on exa-scale computing [72] for additional information.
72 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Table 4.5.1 Three Orders of Magnitude Jump

Technology          Quantitative Rate    Qualitative Change
Marathon Runner     10 mph               Explore a town
Car                 100 mph              Explore a country
Jet                 1,000 mph            Explore a world
Spacecraft          10,000 mph           Explore the solar system
4.5.4.1 The Hardware Challenges
The architectural challenges for reaching exa-scale are dominated by power, memory, interconnection networks, and resilience. Table 4.5.2 compares current HPC designs with potential exa-scale designs from the DOE [73]. The baseline we need is a factor-of-500 change in peak system performance. The differences in factor changes for the various components show where simple scaling of systems (e.g., buying 500 2-Pf/s systems) will be inadequate. Take, for example, the power line in the table. While the peak speed goes up by a factor of 500, the power cost cannot go up by more than a factor of 3. That means that the power solution for an exaflop system has to be roughly 500/3, or over 150 times, more energy efficient than current technology. That is a huge challenge. Looking through the other entries, the table clearly echoes the sentiments of the IAA, and highlights key features that must be addressed in hardware or downstream in software. Other potential challenges are listed below:
Exaflop hardware needs major R&D progress in:
Power
System Memory
Data Movement
System Resiliency
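As a back-of-the-envelope restatement of the power argument above (the factors of 500 and 3 come
from the text; the arithmetic and rounding are ours):

\[
\text{required efficiency gain} \;\ge\; \frac{\text{peak-performance factor}}{\text{allowed power factor}}
\;=\; \frac{500}{3} \;\approx\; 167 ,
\]

which is where the statement that the power solution must be more than 150 times more efficient
than current technology comes from.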
4.5.4.2 The Applied Mathematics Challenges
The applied mathematics component of an exa-scale program should include attention to activities
with time horizons ranging from medium term to very long term, where both ends of the time scale
are essential. The description “medium-term” is deliberate because experience in adapting to new
computational modalities shows that short-term, one-off strategies are likely to be wasteful. Even
though much remains unknown about the details of exa-scale systems, a clear medium-term priority
is the definition and implementation of algorithms that are scalable at very large levels of parallelism
(such as on million-core machines) and that remain sufficiently fast under different hardware
decisions about bandwidth and latency.
73 See previous footnote.
Table 4.5.2 Potential Exa-scale Computer Design for 2018 and Its Relationship to Current HPC
Designs (DOE)
Scalability should be modeled and analyzed mathematically,
using abstractions that represent key architectural ingredients. Simulations and experiments that
indicate the effects of hardware and software perturbations on algorithmic efficiency can then guide
the definition of methods that retain scalability under a variety of hardware scenarios. In this spirit,
the strategies for applied mathematics in exa-scale science will require sustained support over time
for people-intensive activities, early identification of the hardest (and least straightforward) research
problems, and built-in flexibility to pursue unexpected and promising new directions as they arise.
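One classical example of such an abstraction (our choice of illustration, not a prescription from the
report) is a simple execution-time model that separates serial, parallel, and communication
contributions:

\[
T(p) \;=\; (1 - f)\,T_1 \;+\; \frac{f\,T_1}{p} \;+\; t_{\text{comm}}(p),
\]

where \(f\) is the parallelizable fraction of the work, \(T_1\) is the single-process time, and
\(t_{\text{comm}}(p)\) collects bandwidth- and latency-dependent costs. Perturbing
\(t_{\text{comm}}(p)\) shows directly how sensitive a given algorithm's scalability is to hardware
decisions about the interconnect.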
Some other important points are not discussed here.
4.5.4.3 Mathematical Modeling
It is natural for those developing mathematical models of practical problems to limit themselves to
formulations that can be solved numerically using currently available methods. Although essential
when the problem needs to be solved in the short term, an ab initio focus on feasibility can create a
too-rigid environment in which non-standard or “blue-sky” formulations are either never thought
about or else summarily rejected. For example, a problem formulation that represents many real-
world problems yet tends to be avoided because of its known intractability is constrained nonlinear
optimization with a mixture of continuous, discrete, and categorical variables. But the prospect of
massive increases in computational power means that modeling ideas previously dismissed as
impossible or impractical may well become realistic, and should be carefully examined and analyzed.
Creative rethinking of mathematical models is an essential strategy to address the challenges of exa-
scale science. The highly desired “transformational” changes flowing from exa-scale computing are
most likely to come from new formulations that change the way we think about problems, rather
than from applying more resources to an existing formulation to obtain a more accurate solution or
to solve a larger problem. Mathematical models are inherently an approximation of reality, and an
exa-scale initiative provides an opportunity to loosen the grip of, or even remove, computationally-
imposed simplifications. The major challenge is to devise models that capture the important details
of physical and engineered systems as they really are. This will almost certainly generate much
harder sub-problems and/or much more data, but the gains are likely to be eminently worthwhile.
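To make the problem class named above concrete, a constrained nonlinear optimization problem
with mixed continuous, discrete, and categorical variables can be written (in our own notation, not
the report's) as

\[
\min_{x \in \mathbb{R}^{n},\; z \in \mathbb{Z}^{m},\; c \in \mathcal{C}} \; f(x, z, c)
\quad \text{subject to} \quad g_i(x, z, c) \le 0, \; i = 1, \dots, p,
\]

where \(\mathcal{C}\) is a finite set of categorical choices (for example, alternative materials or model
variants). The difficulty is that the feasible set is both non-convex and partly combinatorial, so each
candidate assignment of \((z, c)\) may require its own expensive nonlinear solve, which is exactly the
kind of cost that exa-scale resources could finally make affordable.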
4.5.4.4 Numerical Algorithms
The need for scalable algorithms in an exa-scale initiative has already been stressed. Another
essential feature, highlighted in a 2009 talk by Kathy Yelick called “Ten Ways to Waste a Parallel
Computer”, is a “back to basics” approach to reformulation. Without careful analysis of both new
models and new numerical methods, there is the risk of significant inaccuracy or large computational
overhead in unexpected parts of the overall solution process, as illustrated in the following two
examples related to numerical methods for partial differential equations:
1. All indications are that memory will become the rate-limiting factor along the path to exa-
scale, and investments should accordingly be made in designing algorithms with reduced
memory requirements. Examples where this work is appropriate include:
i. Algorithmically scalable matrix-free methods (e.g., multigrid) for sparse systems of
equations, where “algorithmically scalable” means that the total resources needed to
solve the problem (flops plus memory) are proportional to the resources needed to
evaluate the associated operator (a minimal illustrative sketch follows this list);
ii. High-order methods that perform more computation to obtain greater accuracy for
each computational degree of freedom;
iii. Adaptive models/methods designed to use the smallest possible number of degrees
of freedom to obtain the needed level of accuracy.
2. Many calculations related to DOE missions involve models that depend on both space and
time. In current methods, obtaining better spatial resolution typically requires a comparable
reduction in the time step. A frequent argument for exa-scale science is that it will allow much
finer spatial resolution in numerous application domains, with (for example) meshes reduced
in size by a factor of ten.
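For the first item above, the following minimal sketch (our own, not taken from the report) shows
what “matrix-free” buys in memory: a 1D Poisson problem is solved with a Jacobi iteration in which
the operator is applied as a three-point stencil, so only a few vectors of length n are stored and no
matrix is ever assembled. A multigrid solver would wrap a hierarchy of such stencil applications, but
the storage argument is the same.

/* Minimal matrix-free sketch: solve -u'' = 1 on (0,1), u(0) = u(1) = 0,
 * with a Jacobi iteration.  The operator is applied as a stencil, so the
 * memory footprint is a few vectors of length n, never an assembled matrix. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* y = A*u for the (h^2-scaled) 1D Laplacian, applied matrix-free. */
static void apply_laplacian(const double *u, double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;   /* Dirichlet boundary */
        double right = (i < n - 1) ? u[i + 1] : 0.0;
        y[i] = 2.0 * u[i] - left - right;
    }
}

int main(void)
{
    const int    n = 63;                 /* interior grid points */
    const double h = 1.0 / (n + 1);
    double *u = calloc(n, sizeof *u);    /* solution iterate */
    double *b = malloc(n * sizeof *b);   /* right-hand side, f = 1 */
    double *r = malloc(n * sizeof *r);   /* work vector for A*u */

    for (int i = 0; i < n; ++i) b[i] = h * h;          /* b = h^2 * f */

    for (int it = 0; it < 100000; ++it) {
        apply_laplacian(u, r, n);        /* the only place A "exists" */
        double res2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = b[i] - r[i];      /* residual component */
            u[i] += d / 2.0;             /* Jacobi: divide by the diagonal of A */
            res2 += d * d;
        }
        if (sqrt(res2) < 1e-10) { printf("converged after %d iterations\n", it + 1); break; }
    }
    printf("u at x = 0.5: %.6f (exact value 0.125)\n", u[n / 2]);
    free(u); free(b); free(r);
    return 0;
}

The same storage argument is what makes multigrid attractive at exa-scale: every level applies its
operator as a stencil or an unassembled element kernel, so memory stays proportional to the number
of unknowns.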
Returning to the second example: unfortunately, simply reducing the mesh spacing by a factor of ten
could lead to a ten-fold increase in the time to solution, even with perfect weak scaling, because the
finer mesh typically forces a comparable increase in the number of time steps, a cost that adding
processors in proportion to the spatial work cannot absorb. Several strategies, all in the spirit of rethinking,
should be explored to avoid this inefficiency. For example, models can be made more implicit to
avoid restrictive time-step conditions arising from stiff processes that rapidly relax to equilibrium
(e.g., in the context of low Mach-number fluid flows). A further strategy is aggressive use of sub-
cycling in time for processes that are fast, but either are localized in physical space or involve only a
small subset of the variables in state space. A motivating example here is advection in the jet stream
in atmospheric modeling.
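The sub-cycling idea can be illustrated with a deliberately tiny sketch (ours; the time scales and step
counts are arbitrary): a slow variable is advanced with a large outer step, while a fast process that
relaxes rapidly toward it is advanced with many small sub-steps inside each outer step, keeping the
explicit scheme stable without forcing the small step onto the whole system.

/* Toy illustration of temporal sub-cycling: a slow variable s is advanced
 * with a large step DT, while a fast process q, which relaxes quickly
 * toward s, takes NSUB smaller sub-steps inside each outer step. */
#include <stdio.h>

int main(void)
{
    const double DT   = 0.01;    /* outer (slow) time step */
    const int    NSUB = 50;      /* sub-steps for the fast process */
    const double dt   = DT / NSUB;
    const double eps  = 1e-3;    /* fast relaxation time scale */

    double s = 1.0;              /* slow variable:  ds/dt = -s */
    double q = 0.0;              /* fast variable:  dq/dt = -(q - s)/eps */

    for (int n = 0; n < 100; ++n) {
        /* one cheap explicit step for the slow physics */
        s += DT * (-s);

        /* sub-cycle the stiff, fast physics with a stable small step */
        for (int k = 0; k < NSUB; ++k)
            q += dt * (-(q - s) / eps);
    }
    printf("t = %.2f  s = %.4f  q = %.4f (q should track s)\n", 100 * DT, s, q);
    return 0;
}

Here the explicit stability limit dt < 2*eps applies only to the fast variable; without sub-cycling, the
outer step DT = 10*eps would violate it and the entire calculation would have to run at the small step.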
Approaches of this flavor across the spectrum of numerical methods will almost certainly lead to
increased algorithmic complexity, in addition to the daunting software-related challenges discussed.
The substantially greater work needed to devise exa-scale numerical methods and software leads us
to observe that, for decades, there has been, roughly speaking, a dichotomy in the wish list for the
mathematical software used to solve scientific and engineering problems. On one hand, many DOE
scientists have neither time nor inclination to become experts in numerical methods and software
techniques, preferring to leave software development to mathematicians and computer scientists.
On the other hand, some scientists and engineers want to become deeply involved in producing
domain-specific methods and software to attain the highest possible efficiency for their particular
problem. An exa-scale science program needs to address the needs of both these groups. For the first,
“professional” mathematical software and libraries (meaning software developed by mathematicians
and computer scientists for relatively generic problems such as solving linear systems or eigenvalue
problems) should be developed for increasingly broad problem categories as we move toward exa-
scale. In this way, domain scientists will be able to use state-of-the-art software components that can
be shared across multiple application domains. Since writing software is universally recognized to
be time consuming and error-prone, scientists and engineers will benefit from availability of
software that they can use off the shelf while experimenting with domain-specific challenges rather
than writing their own sparse matrix package. For the second group, specific scientific case studies
should be identified that require significant involvement of domain scientists, mathematicians, and
computer scientists in end-to-end software development.
Other relevant mathematical topics include:
Mathematics for massive data
Machine learning
Compressive sampling
Symbolic computing
4.5.4.5 The Algorithmic Challenges
Advancing science in key areas requires development of next-generation physical models to satisfy
the accuracy and fidelity needs for targeted simulations. The impact of these simulation fidelity needs
on requirements for computational science is twofold.
First, more complex physical models must be developed to account for more aspects of the
physical phenomena being modeled.
Second, for the physical models being used, increases in resolution for key system variables,
such as numbers of spatial zones, time steps or chemical species, are needed to improve
simulation accuracy, which in turn places higher demands on computational hardware and
software.
Application models represent the functional requirements that drive the need for certain numerical
algorithms and software implementations. Science priorities lead to science models, and models are
implemented in the form of algorithms. Algorithm selection is based on various criteria, such as
115
appropriateness, accuracy, verification, convergence, performance, parallelism and scalability.
Moving forward to exa-scale will put heavier demands on algorithms in at least two areas:
the need for increasing amounts of data locality in order to perform computations efficiently,
the need to obtain much higher factors of fine-grained parallelism as high-end systems
support increasing numbers of compute threads.
As a consequence, parallel algorithms must adapt to this environment, and new algorithms and
implementations must be developed to extract the computational capabilities of the new hardware.
Significant new model development, algorithm re-design and science application code
reimplementation, supported by exa-scale-appropriate programming models, will be required to
exploit efficiently the power of exa-scale architectures. The transition from current sub-peta-scale
and peta-scale computing to exa-scale computing will be at least as disruptive as the transition from
vector to parallel computing in the 1990s. Uncertainty quantification will permeate the exa-scale
science workload. The demand for predictive science results will drive the development of improved
approaches for establishing levels of confidence in computational predictions. Both statistical
techniques involving large ensemble calculations and other statistical analysis tools will have
significantly different dynamic resource allocation requirements than in the past, and the significant
code redesign required for the exa-scale will present an opportunity to embed uncertainty
quantification techniques in exa-scale science applications. Some other points are:
New multicore-friendly and multicore-aware algorithms
Adaptive response to load imbalance
Multiple-precision algorithms/software
Communication-avoiding algorithms
Fast implicit solvers
Auto-tuning
Scheduling and memory management for heterogeneity and scale
Fault tolerance and robustness for large-scale systems
Building energy efficiency into algorithm foundations
Sensitivity analysis
Multiscale/multi-physics modeling
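As a small illustration of the two demands listed at the start of this subsection (data locality and
fine-grained parallelism), the sketch below (our own; the problem size and tile size are arbitrary)
applies cache blocking and OpenMP loop parallelism to a dense kernel. The same
restructure-for-reuse-then-parallelize pattern carries over to the stencil and sparse kernels typical
of CFD.

/* Cache blocking improves data locality; the OpenMP pragma exposes
 * fine-grained, loop-level parallelism.  Kernel: C = A * B, N x N. */
#include <stdio.h>

#define N  512
#define BS 64            /* tile chosen to fit comfortably in cache */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* Threads work on independent tiles of C; within a tile the blocks of
     * A, B and C are reused many times while they are cache-resident. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; ++i)
                    for (int k = kk; k < kk + BS; ++k) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BS; ++j)
                            C[i][j] += a * B[k][j];
                    }

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}

Without the blocking, each element of B would be refetched from main memory on every pass over
a row of A; with it, most accesses are served from cache, which is precisely the data-locality demand
noted above.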
4.5.4.6 Computer Science Challenges
The coming transition in computer architectures as peak capability approaches the exa-scale offers
both challenges and opportunities74. The challenges involve a paradigm shift in programming
methodologies. Existing technologies for writing parallel scientific applications have sustained HPC
application software development for the past decade and have been successful for Peta-scale
computing, but were architected for coarse-grained concurrency largely dominated by bulk
synchronous algorithms. Future hardware constraints and growth in explicit on-chip parallelism will
likely require a mass migration to new algorithms and software architecture that is as broad and
disruptive as the migration from vector to parallel computing systems that occurred 15 years ago. The
applications and algorithms will need to rely increasingly on fine-grained parallelism, strong scaling,
and fault resilience. Addressing these challenges opens up a renewed opportunity to introduce a
higher level of software engineering into current fusion application subsystems that will enhance the
modularity, portability, and performance of codes while extending their capabilities to new levels. At
the same time, past sound investments must be protected, and a migration path from current to
future environments must be elaborated.
74 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced
Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.
Some other themes are:
Programming Models
I/O
Getting There from Here
Tools
Fault Tolerance
4.5.4.7 Educational Challenges
Major challenges in exa-scale science include the building of understanding and awareness among
groups with high prestige in both academia and industry, and the dearth of highly competent young
scientists in this field, two issues that are not entirely unrelated. Many of the reasons for these
problems are reasonably well understood, but not easily dealt with. Application scientists who focus
primarily on building computational tools are sometimes regarded by their scientific community as
not being “real” scientists. This phenomenon is particularly noticeable in both physics and chemistry,
reflecting in part the penetration of “community codes”. From the opposite perspective, high-level
software designers and programmers may not welcome or appreciate the contributions made by
scientific disciplines to building state-of-the-art computational tools. On the bright side, interest in
computational science and engineering worldwide has measurably increased during the past 15
years.
Almost no universities, even those with faculty working on computational science and engineering,
have, or are likely to develop, a curriculum that focuses on issues associated with exa-scale science.
In addition, as our subcommittee has noted already, many of the issues in exa-scale science are not
yet understood, which means that a straightforward program of training in the usual sense is
impossible. Exa-scale hardware and its features will keep changing, so that training people too early
to think about specific hardware configurations is a bad idea. However, it is important to start soon
to lay the foundations for future thinking about exa-scale science. To be successful, an exa-scale
science education and training program needs to be devised and managed with creative flair, not
business as usual75.
75 The Opportunities and Challenges of Exa-scale Computing, Summary Report of the Advanced
Scientific Computing Advisory Committee (ASCAC) Subcommittee, Fall 2010, USDOE.