Figure 1 - uploaded by R. Borrell
An example of the sELL format and its memory layout for CPUs and GPUs.


Source publication
Article
Nowadays, high performance computing (HPC) systems are experiencing a disruptive moment, with a variety of novel architectures and frameworks and no clarity about which one will prevail. In this context, the portability of codes across different architectures is of major importance. This paper presents a portable implementation model based on an...

Context in source publication

Context 1
... are padded when a slice is composed of rows with different numbers of entries. Figure 1 shows the storage of a sparse matrix using the sELL format for both a CPU and a GPU execution. In this case, S=4 and the matrix is divided into two slices with 3 and 4 elements per row, respectively. ...
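
For concreteness, the following C++ sketch shows how a sliced-ELLPACK container of the kind pictured in Figure 1 could be assembled from CSR-style input, with the row-major (CPU) and slice-transposed (GPU) layouts selected by a flag. The type and function names (SellMatrix, build_sell) are illustrative assumptions and are not taken from the paper's code.

// Minimal sketch of a sliced-ELLPACK (sELL) container built from CSR input.
// SellMatrix and build_sell are hypothetical names, not the paper's API.
#include <algorithm>
#include <vector>

struct SellMatrix {
    int S;                          // rows per slice (S = 4 in Figure 1)
    std::vector<int>    slice_ptr;  // start offset of each slice in val/col
    std::vector<int>    width;      // padded row length of each slice (3 and 4 in the example)
    std::vector<double> val;        // padded values, zero-filled
    std::vector<int>    col;        // padded column indices
};

// gpu_layout = false: row-major inside a slice (CPU-friendly, contiguous rows).
// gpu_layout = true : column-major inside a slice, so consecutive rows (threads)
//                     read consecutive addresses and GPU accesses coalesce.
SellMatrix build_sell(const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& values,
                      int n_rows, int S, bool gpu_layout) {
    SellMatrix A;
    A.S = S;
    const int n_slices = (n_rows + S - 1) / S;
    A.slice_ptr.assign(n_slices + 1, 0);
    A.width.assign(n_slices, 0);

    // The width of a slice is the length of its longest row; shorter rows are padded.
    for (int s = 0; s < n_slices; ++s) {
        for (int r = s * S; r < std::min((s + 1) * S, n_rows); ++r)
            A.width[s] = std::max(A.width[s], row_ptr[r + 1] - row_ptr[r]);
        A.slice_ptr[s + 1] = A.slice_ptr[s] + A.width[s] * S;
    }
    A.val.assign(A.slice_ptr[n_slices], 0.0);
    A.col.assign(A.slice_ptr[n_slices], 0);

    for (int s = 0; s < n_slices; ++s)
        for (int r = s * S; r < std::min((s + 1) * S, n_rows); ++r) {
            const int lane = r - s * S;
            for (int j = 0; j < row_ptr[r + 1] - row_ptr[r]; ++j) {
                const int src = row_ptr[r] + j;
                const int dst = gpu_layout
                    ? A.slice_ptr[s] + j * S + lane           // transposed slice
                    : A.slice_ptr[s] + lane * A.width[s] + j; // row-major slice
                A.val[dst] = values[src];
                A.col[dst] = col_idx[src];
            }
        }
    return A;
}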

Similar publications

Article
Workload characterization is an integral part of performance analysis of high performance computing (HPC) systems. An understanding of workload properties sheds light on resource utilization and can be used to inform performance optimization both at the software and system configuration levels. It can provide information on how computational scienc...

Citations

... A significant challenge with these simulations is that they are very computationally expensive, which is why they often run on supercomputers. In particular, supercomputers containing Graphics Processing Units (GPUs) have become increasingly popular in recent years [7,8,9], as they offer unprecedented computing power. In such multiphysics simulations, the various combined methodologies may exhibit distinctly contrasting computational properties, e.g., problem sizes, parallel and sequential portions, frequency of conditions, and branching. ...
Preprint
In recent years, it has become increasingly popular to accelerate numerical simulations using Graphics Processing Units (GPUs). In multiphysics simulations, the various combined methodologies may have distinctly different computational characteristics. Therefore, the best-suited hardware architecture can differ between the simulation components. Furthermore, not all coupled software frameworks may support all hardware. These issues predestine or even force hybrid implementations, i.e., different simulation components running on different hardware. We introduce a hybrid coupled fluid-particle implementation with geometrically resolved particles. The simulation utilizes GPUs for the fluid dynamics, whereas the particle simulation runs on Central Processing Units (CPUs). We examine the performance of two contrasting cases of a fluidized bed simulation on a heterogeneous supercomputer. The hybrid overhead (i.e., the CPU-GPU communication) is negligible. The fluid simulation shows good performance, utilizing nearly the entire memory bandwidth. Still, the GPU run time accounts for most of the total time. The parallel efficiency in a weak scaling benchmark for 1024 A100 GPUs is up to 71%. Frequent CPU-CPU communications occurring in the particle simulation are the leading cause of the decrease in parallel efficiency. The results show that hybrid implementations are promising for large-scale multiphysics simulations on heterogeneous supercomputers.
... However, due to difficulties in load balance, CPUs are only used for managing GPUs in large-scale simulations [22]. Oyarzun et al. [23] developed a portable MPI-CUDA parallel CFD model for CPU/GPU heterogeneous supercomputers. It is found that the scalability of CPU-only computing is better than that of GPU-only computing, while the performance on multiple GPUs is higher than that on multiple CPUs. ...
Article
Porting unstructured Computational Fluid Dynamics (CFD) analysis of compressible flow to Graphics Processing Units (GPUs) confronts two difficulties. Firstly, non-coalescing access to the GPU's global memory is induced by indirect data access, leading to performance loss. Secondly, data exchange among multiple GPUs is complex due to data communication between processes and transfers between host and device, which degrades scalability. To increase data locality in unstructured finite volume GPU simulations of compressible flow, we perform several optimizations, including cell and face renumbering, data dependence resolving, nested loop splitting, and loop mode adjustment. Then, a hybrid MPI-CUDA parallel framework that packs and unpacks exchange data on the GPU is established for multi-GPU computing. Finally, after these optimizations, the performance of the whole application on a GPU is increased by around 50%. Simulations of ONERA M6 cases on a single GPU (Nvidia Tesla V100) achieve an average speedup of 13.4 compared to those on 28 CPU cores (Intel Xeon Gold 6132). With 2 GPUs as the baseline, strong scaling results show a parallel efficiency of 42% on 200 GPUs, while weak scaling tests give a parallel efficiency of 82.4% up to 200 GPUs.
... Nonetheless, the computational burden of the high-fidelity tool makes it impractical for real-time simulation. Refs. [9][10][11][12] propose CPU-GPU (Central Processing Unit and Graphics Processing Unit) architecture-based acceleration schemes for high-fidelity simulation. However, they focus on computing efficiency and neglect the super-real-time characteristic. ...
Article
This paper proposes an efficient and extensible simulation framework for a mid-fidelity wind field model based on a heterogeneous architecture. The wind field model consists of 3-D wind inflow, wake dynamics, and wind turbine models. The CPU-GPU heterogeneous platform accelerates the 3-D wind inflow model simulation by performing large-scale parallel computing on the GPU. The CPU is utilized to dispatch the simulation and perform serial computing. Compared with the conventional simulation tool, the simulation speed of the 3-D wind inflow model is highly improved, which fulfils the requirement of real-time wind field simulation. The hardware architecture of the framework is designed to be highly extensible via multiple computing nodes for the 3-D wind inflow, wake dynamics, and wind turbine models. This extensibility enables efficient large-scale wind field simulation. A 5 km × 2 km × 0.35 km wind field consisting of three wind turbines is used as the test case. The case study was conducted and analyzed, proving that the framework can capture the detailed wake dynamics, wind turbine dynamics, and wind inflow dynamics. Meanwhile, the calculation speed of the disturbed wind speed increased by 264 times.
... By casting discrete operators and mesh functions into sparse matrices and vectors, it has been shown that nearly 90% of the calculations in a typical CFD algorithm for the direct numerical simulation (DNS) and large eddy simulation (LES) of incompressible turbulent flows boil down to the following basic linear algebra subroutines: sparse matrix-vector product (SpMV), linear combination of vectors (axpy) and dot product (dot) [18]. Moreover, after the generalizations detailed in Section 3.2, this value will be raised to 100%. ...
... In previous works of Oyarzun et al. [18] and Álvarez et al. [23], an algebra-based implementation model was proposed for the DNS and LES of incompressible turbulent flows such that the algorithm of the time-integration phase reduces to a set of only three algebraic kernels: SpMV, axpy and dot. However, a close look at Equations 17 and 18, for instance, reveals that this set is insufficient to fulfill the implementation of the flux limiter because it comprises non-linear operations. ...
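
As a rough illustration of the minimalist kernel set referred to above, the sketch below shows sequential versions of the three algebraic kernels, here over plain CSR storage; the function names and signatures are illustrative assumptions, not the interface of the cited codes.

// Sequential sketches of the three algebraic kernels (illustrative only).
#include <cstddef>
#include <vector>

// y = A * x  (sparse matrix-vector product, CSR storage)
void spmv(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
          const std::vector<double>& val, const std::vector<double>& x,
          std::vector<double>& y) {
    for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

// y = a * x + y  (linear combination of vectors)
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

// dot product (in a distributed run the local result is reduced globally)
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
    return sum;
}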
Preprint
Flux limiters are widely used within the scientific computing community to capture shock discontinuities and are of paramount importance for the temporal integration of high-speed aerodynamics, multiphase flows, and hyperbolic equations in general. Meanwhile, the breakthrough of new computing architectures and the hybridization of supercomputer systems pose a huge portability challenge, particularly for legacy codes, since the computing subroutines that form the algorithms, the so-called kernels, must be adapted to various complex parallel programming paradigms. From this perspective, the development of innovative implementations relying on a minimalist set of kernels simplifies the deployment of scientific computing software on state-of-the-art supercomputers, while it requires the reformulation of algorithms, such as the aforementioned flux limiters. Equipped with basic algebraic topology and graph theory underlying the classical mesh concept, a new flux limiter formulation is presented based on the adoption of algebraic data structures and kernels. As a result, traditional flux limiters are cast into a stream of only two types of computing kernels: sparse matrix-vector multiplication and generalized pointwise binary operators. The newly proposed formulation eases the deployment of such a numerical technique in massively parallel, potentially hybrid, computing systems and is demonstrated for a canonical advection problem.
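
To give a sense of what a "generalized pointwise binary operator" can look like, the sketch below combines two vectors entry by entry through an arbitrary binary function; using a limiter function as that binary function turns the flux-limiting step into pure vector algebra. The minmod limiter shown here is only an example and is not necessarily the formulation adopted in the preprint.

// Illustrative generalized pointwise binary operator: z_i = f(x_i, y_i).
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

void pointwise_binary(const std::vector<double>& x, const std::vector<double>& y,
                      std::vector<double>& z,
                      const std::function<double(double, double)>& f) {
    for (std::size_t i = 0; i < z.size(); ++i) z[i] = f(x[i], y[i]);
}

// Example binary function: minmod of two slopes (an assumption, not the paper's choice).
double minmod(double a, double b) {
    if (a * b <= 0.0) return 0.0;
    return (std::fabs(a) < std::fabs(b)) ? a : b;
}

// Usage: the upwind and downwind differences can each be obtained with an SpMV
// against a suitable difference matrix, then limited pointwise:
//   pointwise_binary(d_upwind, d_downwind, limited_slopes, minmod);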
... Therefore, the majority of rows contain seven non-zero coefficients. The ELLPACK sparse storage format and its block-transposed version are used for CPU and GPU computations, respectively (see [33] for details). This format provides uniform, aligned memory access with coalesced memory transactions on GPUs. ...
Article
The quest for new portable implementations of simulation algorithms is motivated by the increasing variety of computing architectures. Moreover, the hybridization of high-performance computing systems imposes additional constraints, since heterogeneous computations are needed to efficiently engage processors and massively-parallel accelerators. This, in turn, involves different parallel paradigms and computing frameworks and requires complex data exchanges between computing units. Typically, simulation codes rely on sophisticated data structures and computing subroutines, so-called kernels, which makes portability terribly cumbersome. Thus, a natural way to achieve portability is to dramatically reduce the complexity of both data structures and computing kernels. In our algebra-based approach, the scale-resolving simulation of incompressible turbulent flows on unstructured meshes relies on three fundamental kernels: the sparse matrix-vector product, the linear combination of vectors and the dot product. It is noteworthy that this approach is not limited to a particular kind of numerical method or a set of governing equations. In our code, an auto-balanced multilevel partitioning distributes workload among computing devices of various architectures. The overlap of computations and multistage communications efficiently hides the data exchange overhead in large-scale supercomputer simulations. In addition to computing on accelerators, special attention is paid to efficiency on manycore processors in multiprocessor nodes with a significant non-uniform memory access factor. Parallel efficiency and performance are studied in detail for different execution modes on various supercomputers using up to 9,600 processor cores and up to 256 graphics processor units. The heterogeneous implementation model described in this work is a general-purpose approach that is well suited for various subroutines in numerical simulation codes.
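
The overlap of computations and communications mentioned in this abstract typically follows the classical interior/interface splitting pattern sketched below: a non-blocking halo exchange is started, the rows that need only local data are processed while the messages are in flight, and the rows coupled to remote data are finished afterwards. The function names, the single-neighbour setup and the CSR storage are illustrative assumptions, not the paper's actual implementation.

// Hedged sketch of overlapping a halo exchange with the local part of an SpMV.
#include <mpi.h>
#include <algorithm>
#include <vector>

// y[i] = sum_k val[k] * x[col_idx[k]], restricted to the rows listed in `rows`.
static void spmv_rows(const std::vector<int>& rows,
                      const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& val,
                      const std::vector<double>& x, std::vector<double>& y) {
    for (int i : rows) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

// x holds owned entries first and halo entries from `halo_offset` onwards.
// A single neighbour is shown for brevity; real codes loop over all halo neighbours.
void overlapped_spmv(const std::vector<int>& interior_rows,   // rows touching only local entries
                     const std::vector<int>& interface_rows,  // rows coupled to halo entries
                     const std::vector<int>& row_ptr,
                     const std::vector<int>& col_idx,
                     const std::vector<double>& val,
                     std::vector<double>& send_buf, std::vector<double>& recv_buf,
                     std::vector<double>& x, std::vector<double>& y,
                     std::size_t halo_offset, int neighbour, MPI_Comm comm) {
    MPI_Request reqs[2];
    // 1. Start the non-blocking halo exchange.
    MPI_Irecv(recv_buf.data(), (int)recv_buf.size(), MPI_DOUBLE,
              neighbour, 0, comm, &reqs[0]);
    MPI_Isend(send_buf.data(), (int)send_buf.size(), MPI_DOUBLE,
              neighbour, 0, comm, &reqs[1]);
    // 2. Overlap: process the rows that need no remote data.
    spmv_rows(interior_rows, row_ptr, col_idx, val, x, y);
    // 3. Wait for the halo, unpack it into x, then finish the interface rows.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    std::copy(recv_buf.begin(), recv_buf.end(), x.begin() + halo_offset);
    spmv_rows(interface_rows, row_ptr, col_idx, val, x, y);
}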
... As future work, the authors are interested in extending the implementation of the framework to a hybrid MPI-OpenMP paradigm and studying its effect on the computational efficiency. The code also has a multi-threading capability with CUDA or OpenCL to use GPUs as co-processors on a hybrid machine [50][51][52]. This configuration was not used in this work, as we focused on CPU-only clusters. ...
Article
In this work, we present a scalable and efficient parallel solver for the partitioned solution of fluid–structure interaction problems through multi-code coupling. Two instances of an in-house parallel software, TermoFluids, are used to solve the fluid and the structural sub-problems, coupled together on the interface via the preCICE coupling library. For fluid flow, the Arbitrary Lagrangian–Eulerian form of the Navier–Stokes equations is solved on an unstructured conforming grid using a second-order finite-volume discretization. A parallel dynamic mesh method for unstructured meshes is used to track the moving boundary. For the structural problem, the nonlinear elastodynamics equations are solved on an unstructured grid using a second-order finite-volume method. A semi-implicit FSI coupling method is used which segregates the fluid pressure term and couples it strongly to the structure, while the remaining fluid terms and the geometrical nonlinearities are only loosely coupled. A robust and advanced multi-vector quasi-Newton method is used for the coupling iterations between the solvers. Both the fluid and the structural solver use distributed-memory parallelism. The intra-solver communication required for data update in the solution process is carried out using non-blocking point-to-point communicators. The inter-code communication is fully parallel and point-to-point, avoiding any central communication unit. Inside each single-physics solver, the load is balanced by dividing the computational domain into fairly equal blocks for each process. Additionally, a load balancing model is used at the inter-code level to minimize the overall idle time of the processes. Two practical test cases in the context of hemodynamics are studied, demonstrating the accuracy and computational efficiency of the coupled solver. Strong scalability test results show a parallel efficiency of 83% on 10,080 CPU cores.
... The final speedup of the GPUs with respect to the CPUs of the node is 5.44 and 3.48 times for the sphere and airplane meshes, respectively. These results match those of other unstructured CFD codes running on GPUs [33,34]. Fig. 14 (right) shows the speedup of the best GPU implementation with respect to the optimal CPU version at the node level. ...
Article
High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The assessment of the performance of all the proposed strategies has been carried out for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.
... The final speedup of the GPUs with respect to the CPUs of the node is 5.44 and 3.48 times for the homogeneous and non-homogeneous meshes, respectively. These results match those of other unstructured CFD codes running on GPUs [35]. Figure 17 shows the speedup of the best GPU implementation with respect to the optimal CPU version at the node level. ...
Preprint
One of the main challenges of civil aviation is the construction of more efficient airplanes in terms of fuel consumption and noise emissions. Research and development on the aerodynamics of the full airplane are one of the priorities established by the Advisory Council for Aeronautics Research in Europe. In this context, high fidelity simulations are one of the main tools for the design of innovative solutions. Such simulations are based on accurate numerical algorithms, as well as advanced LES turbulence models, which have high computational requirements. Consequently, significant research efforts on the computational and algorithmic aspects are required to unlock the computing power of leading-edge pre-Exascale systems. In this paper, we explain the approach implemented into a CFD simulation code, Alya, to address these physical, numerical and computational challenges. We present a global parallelization strategy to fully exploit the different levels of parallelism proposed by modern architectures, together with a novel co-execution model for the concurrent exploitation of both the CPU and GPU, targeting the maximum efficiency at the node level. The latter is based on a multi-code co-execution approach together with a dynamic load balancing mechanism. Assessment of the performance of all the proposed strategies has been carried out on the cutting-edge POWER9 architecture.
... By combining the Message Passing Interface (MPI) and CUDA, multiple GPUs can be applied to break through the limitation of memory size. Up to 128 GPUs were tested on hybrid CPU/GPU supercomputers to understand the challenges of implementing CFD codes on new architectures (Oyarzun et al., 2017). Besides CUDA, emerging programming models like Open Accelerators (OpenACC) (Wienke et al., 2012; Lou et al., 2016; Ma et al., 2017) are also explored to accelerate the CFD process. ...
Article
Purpose Adopting large eddy simulation (LES) to simulate the complex flow in turbomachinery is appropriate to overcome the limitations of current Reynolds-Averaged Navier–Stokes modelling, and it provides a deeper understanding of the complicated transitional and turbulent flow mechanisms; however, the large computational cost limits its application to high Reynolds number flows. This study aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation. Design/methodology/approach Compared to central processing units (CPUs), graphics processing units (GPUs) can provide higher computational speed. This work aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation. A set of low-dissipation schemes designed for unstructured mesh is implemented with the compute unified device architecture programming model. Several key parameters affecting the performance of the GPU code are discussed, and further speed-up can be obtained by analysing the underlying finite volume-based numerical scheme. Findings The results show that an acceleration ratio of approximately 84 (on a single GPU) for the double precision algorithm can be achieved with this unstructured GPU code. The transitional flow inside a compressor is simulated and the computational efficiency has been improved greatly. The transition process is discussed and the role that the K-H instability plays in the transition mechanism is verified. Practical implications The speed-up gained from the GPU-enabled solver reaches 84 compared to the original code running on the CPU, and the vast speed-up enables fast-turnaround high-fidelity LES simulations. Originality/value The GPU-enabled flow solver is implemented and optimized according to the features of the finite volume scheme. The solving time is reduced remarkably and detailed structures including vortices are captured.
... By combining the Message Passing Interface (MPI) and CUDA, multiple GPUs can be applied to break through the limitation of memory size. Up to 128 GPUs were tested on hybrid CPU/GPU supercomputers to understand the challenges of implementing CFD codes on new architectures (Oyarzun et al., 2017). Besides CUDA, emerging programming models like Open Accelerators (OpenACC) (Wienke et al., 2012; Lou et al., 2016; Ma et al., 2017) are also explored to accelerate the CFD process. ...
Article
Speed and accuracy are the key points in the evaluation of solvers for steady and unsteady flows in computational fluid dynamics (CFD), and high-precision turbulent flow simulations generally require much time. The aim of this paper is to develop a GPU-enabled parallel solver to speed up the solution. The flow solver developed by our group is introduced; then the original solver is implemented and optimized according to the features of the CUDA model. Finally, the optimized GPU-enabled solver is applied to the unsteady flow simulation in a compressor cascade, and the transition process and mechanism are analyzed, with a speed-up of 53 reported.