Figure 1 - uploaded by R. Borrell
An example of the sELL format and its memory layout for CPUs and GPUs.


Source publication
Article
Nowadays, high performance computing (HPC) systems are experiencing a disruptive moment, with a variety of novel architectures and frameworks and no clarity about which one will prevail. In this context, the portability of codes across different architectures is of major importance. This paper presents a portable implementation model based on an...

Context in source publication

Context 1
... are padded when a slice is composed of rows with different numbers of entries. Figure 1 shows the storage of a sparse matrix using the sELL format for both a CPU and a GPU execution. In this case, S=4 and the matrix is divided into two slices with 3 and 4 elements per row, respectively. ...
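
For concreteness, the following C++ sketch shows how a sliced-ELLPACK container of the kind pictured in Figure 1 could be assembled from CSR-style input, with the row-major (CPU) and slice-transposed (GPU) layouts selected by a flag. The type and function names (SellMatrix, build_sell) are illustrative assumptions and are not taken from the paper's code.

// Minimal sketch of a sliced-ELLPACK (sELL) container built from CSR input.
// SellMatrix and build_sell are hypothetical names, not the paper's API.
#include <algorithm>
#include <vector>

struct SellMatrix {
    int S;                          // rows per slice (S = 4 in Figure 1)
    std::vector<int>    slice_ptr;  // start offset of each slice in val/col
    std::vector<int>    width;      // padded row length of each slice (3 and 4 in the example)
    std::vector<double> val;        // padded values, zero-filled
    std::vector<int>    col;        // padded column indices
};

// gpu_layout = false: row-major inside a slice (CPU-friendly, contiguous rows).
// gpu_layout = true : column-major inside a slice, so consecutive rows (threads)
//                     read consecutive addresses and GPU accesses coalesce.
SellMatrix build_sell(const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& values,
                      int n_rows, int S, bool gpu_layout) {
    SellMatrix A;
    A.S = S;
    const int n_slices = (n_rows + S - 1) / S;
    A.slice_ptr.assign(n_slices + 1, 0);
    A.width.assign(n_slices, 0);

    // The width of a slice is the length of its longest row; shorter rows are padded.
    for (int s = 0; s < n_slices; ++s) {
        for (int r = s * S; r < std::min((s + 1) * S, n_rows); ++r)
            A.width[s] = std::max(A.width[s], row_ptr[r + 1] - row_ptr[r]);
        A.slice_ptr[s + 1] = A.slice_ptr[s] + A.width[s] * S;
    }
    A.val.assign(A.slice_ptr[n_slices], 0.0);
    A.col.assign(A.slice_ptr[n_slices], 0);

    for (int s = 0; s < n_slices; ++s)
        for (int r = s * S; r < std::min((s + 1) * S, n_rows); ++r) {
            const int lane = r - s * S;
            for (int j = 0; j < row_ptr[r + 1] - row_ptr[r]; ++j) {
                const int src = row_ptr[r] + j;
                const int dst = gpu_layout
                    ? A.slice_ptr[s] + j * S + lane           // transposed slice
                    : A.slice_ptr[s] + lane * A.width[s] + j; // row-major slice
                A.val[dst] = values[src];
                A.col[dst] = col_idx[src];
            }
        }
    return A;
}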

Similar publications

Article
Workload characterization is an integral part of performance analysis of high performance computing (HPC) systems. An understanding of workload properties sheds light on resource utilization and can be used to inform performance optimization both at the software and system configuration levels. It can provide information on how computational scienc...

Citations

... A significant challenge with these simulations is that they are very computationally expensive, which is why they often run on supercomputers. In particular, supercomputers containing Graphics Processing Units (GPUs) have become increasingly popular in recent years [7,8,9], as they offer unprecedented computing power. In such multiphysics simulations, the various combined methodologies may exhibit distinctly contrasting computational properties, e.g., problem sizes, parallel and sequential portions, frequency of conditions, and branching. ...
Preprint
In recent years, it has become increasingly popular to accelerate numerical simulations using Graphics Processing Units (GPUs). In multiphysics simulations, the various combined methodologies may have distinctly different computational characteristics. Therefore, the best-suited hardware architecture can differ between the simulation components. Furthermore, not all coupled software frameworks may support all hardware. These issues predestine or even force hybrid implementations, i.e., different simulation components running on different hardware. We introduce a hybrid coupled fluid-particle implementation with geometrically resolved particles. The simulation utilizes GPUs for the fluid dynamics, whereas the particle simulation runs on Central Processing Units (CPUs). We examine the performance of two contrasting cases of a fluidized bed simulation on a heterogeneous supercomputer. The hybrid overhead (i.e., the CPU-GPU communication) is negligible. The fluid simulation shows good performance, utilizing nearly the entire memory bandwidth. Still, the GPU run time accounts for most of the total time. The parallel efficiency in a weak scaling benchmark for 1024 A100 GPUs is up to 71%. Frequent CPU-CPU communications occurring in the particle simulation are the leading cause of the decrease in parallel efficiency. The results show that hybrid implementations are promising for large-scale multiphysics simulations on heterogeneous supercomputers.
... However, due to difficulties in load balance, CPUs are only used for managing GPUs in large-scale simulations [22]. Oyarzun et al. [23] developed a portable MPI-CUDA parallel CFD model for CPU/GPU heterogeneous supercomputers. It is found that the scalability of CPU-only computing is better than that of GPU-only computing, while the performance on multiple GPUs is higher than that on multiple CPUs. ...
Article
Porting unstructured Computational Fluid Dynamics (CFD) analysis of compressible flow to Graphics Processing Units (GPUs) confronts two difficulties. Firstly, non-coalescing access to the GPU's global memory is induced by indirect data access, leading to performance loss. Secondly, data exchange among multiple GPUs is complex due to data communication between processes and transfers between host and device, which degrades scalability. To increase data locality in unstructured finite volume GPU simulations of compressible flow, we perform several optimizations, including cell and face renumbering, data dependence resolving, nested loop splitting, and loop mode adjustment. Then, a hybrid MPI-CUDA parallel framework that packs and unpacks exchange data on the GPU is established for multi-GPU computing. Finally, after these optimizations, the performance of the whole application on a GPU is increased by around 50%. Simulations of ONERA M6 cases on a single GPU (Nvidia Tesla V100) achieve an average speedup of 13.4 compared to those on 28 CPU cores (Intel Xeon Gold 6132). With 2 GPUs as the baseline, strong scaling results show a parallel efficiency of 42% on 200 GPUs, while weak scaling tests give a parallel efficiency of 82.4% up to 200 GPUs.
... Nonetheless, the computational burden of the high-fidelity tool makes it impractical for real-time simulation. Refs. [9][10][11][12] propose CPU-GPU (Central Processing Unit and Graphics Processing Unit) architecture-based acceleration schemes for high-fidelity simulation. However, they focus on computing efficiency and neglect the super-real-time characteristic. ...
Article
This paper proposes an efficient and extensible simulation framework for a mid-fidelity wind field model based on a heterogeneous architecture. The wind field model consists of 3-D wind inflow, wake dynamics, and wind turbine models. The CPU-GPU heterogeneous platform accelerates the 3-D wind inflow model simulation by performing large-scale parallel computing on the GPU. The CPU is utilized to dispatch the simulation and perform serial computing. Compared with the conventional simulation tool, the simulation speed of the 3-D wind inflow model is highly improved, which fulfils the requirement of real-time wind field simulation. The hardware architecture of the framework is designed to be highly extensible via multiple computing nodes for the 3-D wind inflow, wake dynamics, and wind turbine models. This extensibility enables efficient large-scale wind field simulation. A 5 km × 2 km × 0.35 km wind field consisting of three wind turbines is used as the test case. The case study was conducted and analyzed, proving that the framework can capture the detailed wake dynamics, wind turbine dynamics, and wind inflow dynamics. Meanwhile, the calculation speed of the disturbed wind speed increased by 264 times.
... By casting discrete operators and mesh functions into sparse matrices and vectors, it has been shown that nearly 90% of the calculations in a typical CFD algorithm for the direct numerical simulation (DNS) and large eddy simulation (LES) of incompressible turbulent flows boil down to the following basic linear algebra subroutines: sparse matrix-vector product (SpMV), linear combination of vectors (axpy) and dot product (dot) [18]. Moreover, after the generalizations detailed in Section 3.2, this value will be raised to 100%. ...
... In previous works of Oyarzun et al. [18] and Álvarez et al. [23], an algebra-based implementation model was proposed for the DNS and LES of incompressible turbulent flows such that the algorithm of the time-integration phase reduces to a set of only three algebraic kernels: SpMV, axpy and dot. However, a close look at Equations 17 and 18, for instance, reveals that this set is insufficient to fulfill the implementation of the flux limiter because it comprises non-linear operations. ...
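
As a rough illustration of the minimalist kernel set referred to above, the sketch below shows sequential versions of the three algebraic kernels, here over plain CSR storage; the function names and signatures are illustrative assumptions, not the interface of the cited codes.

// Sequential sketches of the three algebraic kernels (illustrative only).
#include <cstddef>
#include <vector>

// y = A * x  (sparse matrix-vector product, CSR storage)
void spmv(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
          const std::vector<double>& val, const std::vector<double>& x,
          std::vector<double>& y) {
    for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

// y = a * x + y  (linear combination of vectors)
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

// dot product (in a distributed run the local result is reduced globally)
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
    return sum;
}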
Preprint
Flux limiters are widely used within the scientific computing community to capture shock discontinuities and are of paramount importance for the temporal integration of high-speed aerodynamics, multiphase flows, and hyperbolic equations in general. Meanwhile, the breakthrough of new computing architectures and the hybridization of supercomputer systems pose a huge portability challenge, particularly for legacy codes, since the computing subroutines that form the algorithms, the so-called kernels, must be adapted to various complex parallel programming paradigms. From this perspective, the development of innovative implementations relying on a minimalist set of kernels simplifies the deployment of scientific computing software on state-of-the-art supercomputers, while it requires the reformulation of algorithms, such as the aforementioned flux limiters. Equipped with basic algebraic topology and graph theory underlying the classical mesh concept, a new flux limiter formulation is presented based on the adoption of algebraic data structures and kernels. As a result, traditional flux limiters are cast into a stream of only two types of computing kernels: sparse matrix-vector multiplication and generalized pointwise binary operators. The newly proposed formulation eases the deployment of such a numerical technique in massively parallel, potentially hybrid, computing systems and is demonstrated for a canonical advection problem.
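
To give a sense of what a "generalized pointwise binary operator" can look like, the sketch below combines two vectors entry by entry through an arbitrary binary function; using a limiter function as that binary function turns the flux-limiting step into pure vector algebra. The minmod limiter shown here is only an example and is not necessarily the formulation adopted in the preprint.

// Illustrative generalized pointwise binary operator: z_i = f(x_i, y_i).
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

void pointwise_binary(const std::vector<double>& x, const std::vector<double>& y,
                      std::vector<double>& z,
                      const std::function<double(double, double)>& f) {
    for (std::size_t i = 0; i < z.size(); ++i) z[i] = f(x[i], y[i]);
}

// Example binary function: minmod of two slopes (an assumption, not the paper's choice).
double minmod(double a, double b) {
    if (a * b <= 0.0) return 0.0;
    return (std::fabs(a) < std::fabs(b)) ? a : b;
}

// Usage: the upwind and downwind differences can each be obtained with an SpMV
// against a suitable difference matrix, then limited pointwise:
//   pointwise_binary(d_upwind, d_downwind, limited_slopes, minmod);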
... Therefore, the majority of rows contain seven non-zero coefficients. The ELLPACK sparse storage format and its block-transposed version are used for CPU and GPU computations, respectively (see [33] for details). This format provides uniform, aligned memory access with coalesced memory transactions on GPUs. ...
Article
The quest for new portable implementations of simulation algorithms is motivated by the increasing variety of computing architectures. Moreover, the hybridization of high-performance computing systems imposes additional constraints, since heterogeneous computations are needed to efficiently engage processors and massively-parallel accelerators. This, in turn, involves different parallel paradigms and computing frameworks and requires complex data exchanges between computing units. Typically, simulation codes rely on sophisticated data structures and computing subroutines, so-called kernels, which makes portability terribly cumbersome. Thus, a natural way to achieve portability is to dramatically reduce the complexity of both data structures and computing kernels. In our algebra-based approach, the scale-resolving simulation of incompressible turbulent flows on unstructured meshes relies on three fundamental kernels: the sparse matrix-vector product, the linear combination of vectors and the dot product. It is noteworthy that this approach is not limited to a particular kind of numerical method or a set of governing equations. In our code, an auto-balanced multilevel partitioning distributes workload among computing devices of various architectures. The overlap of computations and multistage communications efficiently hides the data exchange overhead in large-scale supercomputer simulations. In addition to computing on accelerators, special attention is paid to efficiency on manycore processors in multiprocessor nodes with a significant non-uniform memory access factor. Parallel efficiency and performance are studied in detail for different execution modes on various supercomputers using up to 9,600 processor cores and up to 256 graphics processor units. The heterogeneous implementation model described in this work is a general-purpose approach that is well suited for various subroutines in numerical simulation codes.
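
The overlap of computations and communications mentioned in this abstract typically follows the classical interior/interface splitting pattern sketched below: a non-blocking halo exchange is started, the rows that need only local data are processed while the messages are in flight, and the rows coupled to remote data are finished afterwards. The function names, the single-neighbour setup and the CSR storage are illustrative assumptions, not the paper's actual implementation.

// Hedged sketch of overlapping a halo exchange with the local part of an SpMV.
#include <mpi.h>
#include <algorithm>
#include <vector>

// y[i] = sum_k val[k] * x[col_idx[k]], restricted to the rows listed in `rows`.
static void spmv_rows(const std::vector<int>& rows,
                      const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& val,
                      const std::vector<double>& x, std::vector<double>& y) {
    for (int i : rows) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

// x holds owned entries first and halo entries from `halo_offset` onwards.
// A single neighbour is shown for brevity; real codes loop over all halo neighbours.
void overlapped_spmv(const std::vector<int>& interior_rows,   // rows touching only local entries
                     const std::vector<int>& interface_rows,  // rows coupled to halo entries
                     const std::vector<int>& row_ptr,
                     const std::vector<int>& col_idx,
                     const std::vector<double>& val,
                     std::vector<double>& send_buf, std::vector<double>& recv_buf,
                     std::vector<double>& x, std::vector<double>& y,
                     std::size_t halo_offset, int neighbour, MPI_Comm comm) {
    MPI_Request reqs[2];
    // 1. Start the non-blocking halo exchange.
    MPI_Irecv(recv_buf.data(), (int)recv_buf.size(), MPI_DOUBLE,
              neighbour, 0, comm, &reqs[0]);
    MPI_Isend(send_buf.data(), (int)send_buf.size(), MPI_DOUBLE,
              neighbour, 0, comm, &reqs[1]);
    // 2. Overlap: process the rows that need no remote data.
    spmv_rows(interior_rows, row_ptr, col_idx, val, x, y);
    // 3. Wait for the halo, unpack it into x, then finish the interface rows.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    std::copy(recv_buf.begin(), recv_buf.end(), x.begin() + halo_offset);
    spmv_rows(interface_rows, row_ptr, col_idx, val, x, y);
}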
... As future work, the authors are interested in extending the implementation of the framework to a hybrid MPI-OpenMP paradigm and studying its effect on the computational efficiency. The code also has a multi-threading capability with CUDA or OpenCL to use GPUs as co-processors on a hybrid machine [50][51][52]. This configuration was not used in this work, as we focused on CPU-only clusters. ...
Article
In this work, we present a scalable and efficient parallel solver for the partitioned solution of fluid–structure interaction problems through multi-code coupling. Two instances of an in-house parallel software, TermoFluids, are used to solve the fluid and the structural sub-problems, coupled together on the interface via the preCICE coupling library. For fluid flow, the Arbitrary Lagrangian–Eulerian form of the Navier–Stokes equations is solved on an unstructured conforming grid using a second-order finite-volume discretization. A parallel dynamic mesh method for unstructured meshes is used to track the moving boundary. For the structural problem, the nonlinear elastodynamics equations are solved on an unstructured grid using a second-order finite-volume method. A semi-implicit FSI coupling method is used which segregates the fluid pressure term and couples it strongly to the structure, while the remaining fluid terms and the geometrical nonlinearities are only loosely coupled. A robust and advanced multi-vector quasi-Newton method is used for the coupling iterations between the solvers. Both the fluid and the structural solver use distributed-memory parallelism. The intra-solver communication required for data update in the solution process is carried out using non-blocking point-to-point communicators. The inter-code communication is fully parallel and point-to-point, avoiding any central communication unit. Inside each single-physics solver, the load is balanced by dividing the computational domain into fairly equal blocks for each process. Additionally, a load balancing model is used at the inter-code level to minimize the overall idle time of the processes. Two practical test cases in the context of hemodynamics are studied, demonstrating the accuracy and computational efficiency of the coupled solver. Strong scalability test results show a parallel efficiency of 83% on 10,080 CPU cores.
... The final speedup of the GPUs with respect to the CPUs of the node is 5.44 and 3.48 times for the sphere and airplane meshes, respectively. These results match those of other unstructured CFD codes running on GPUs [33,34]. Fig. 14 (right) shows the speedup of the best GPU implementation with respect to the optimal CPU version at the node level. ...
Article
High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The assessment of the performance of all the proposed strategies has been carried out for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.
... The final speedup of the GPUs with respect to the CPUs of the node is 5.44 and 3.48 times for the homogeneous and non-homogeneous meshes, respectively. These results match those of other unstructured CFD codes running on GPUs [35]. Figure 17 shows the speedup of the best GPU implementation with respect to the optimal CPU version at the node level. ...
Preprint
One of the main challenges of civil aviation is the construction of more efficient airplanes in terms of fuel consumption and noise emissions. Research and development on the aerodynamics of the full airplane are one of the priorities established by the Advisory Council for Aeronautics Research in Europe. In this context, high fidelity simulations are one of the main tools for the design of innovative solutions. Such simulations are based on accurate numerical algorithms, as well as advanced LES turbulence models, which have high computational requirements. Consequently, significant research efforts on the computational and algorithmic aspects are required to unlock the computing power of leading-edge pre-Exascale systems. In this paper, we explain the approach implemented into a CFD simulation code, Alya, to address these physical, numerical and computational challenges. We present a global parallelization strategy to fully exploit the different levels of parallelism proposed by modern architectures, together with a novel co-execution model for the concurrent exploitation of both the CPU and GPU, targeting the maximum efficiency at the node level. The latter is based on a multi-code co-execution approach together with a dynamic load balancing mechanism. Assessment of the performance of all the proposed strategies has been carried out on the cutting-edge POWER9 architecture.
... By combining the Message Passing Interface (MPI) and CUDA, multiple GPUs can be applied to break through the limitation of memory size. Up to 128 GPUs were tested on hybrid CPU/GPU supercomputers to understand the challenges of implementing CFD codes on new architectures (Oyarzun et al., 2017). Besides CUDA, emerging programming models like Open Accelerators (OpenACC) (Wienke et al., 2012; Lou et al., 2016; Ma et al., 2017) are also explored to accelerate the CFD process. ...
Article
Purpose Adopting large eddy simulation (LES) to simulate the complex flow in turbomachinery is appropriate to overcome the limitations of current Reynolds-Averaged Navier–Stokes modelling, and it provides a deeper understanding of the complicated transitional and turbulent flow mechanisms; however, the large computational cost limits its application to high Reynolds number flows. This study aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation. Design/methodology/approach Compared to central processing units (CPUs), graphics processing units (GPUs) can provide higher computational speed. This work aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation. A set of low-dissipation schemes designed for unstructured mesh is implemented with the compute unified device architecture programming model. Several key parameters affecting the performance of the GPU code are discussed, and further speed-up can be obtained by analysing the underlying finite volume-based numerical scheme. Findings The results show that an acceleration ratio of approximately 84 (on a single GPU) for the double precision algorithm can be achieved with this unstructured GPU code. The transitional flow inside a compressor is simulated and the computational efficiency has been improved greatly. The transition process is discussed and the role that the K-H instability plays in the transition mechanism is verified. Practical implications The speed-up gained from the GPU-enabled solver reaches 84 compared to the original code running on the CPU, and the vast speed-up enables fast-turnaround high-fidelity LES simulations. Originality/value The GPU-enabled flow solver is implemented and optimized according to the features of the finite volume scheme. The solving time is reduced remarkably and detailed structures including vortices are captured.
... By combining the Message Passing Interface (MPI) and CUDA, multiple GPUs can be applied to break through the limitation of memory size. Up to 128 GPUs were tested on hybrid CPU/GPU supercomputers to understand the challenges of implementing CFD codes on new architectures (Oyarzun et al., 2017). Besides CUDA, emerging programming models like Open Accelerators (OpenACC) (Wienke et al., 2012; Lou et al., 2016; Ma et al., 2017) are also explored to accelerate the CFD process. ...
Article
Speed and accuracy are the key points in the evaluation of solvers for steady and unsteady flows in computational fluid dynamics (CFD), and high-precision turbulent flow simulations generally require much time. The aim of this paper is to develop a GPU-enabled parallel solver to speed up the solution. The flow solver developed by our group is introduced; then the original solver is implemented and optimized according to the features of the CUDA model. Finally, the optimized GPU-enabled solver is applied to the unsteady flow simulation in a compressor cascade, and the transition process and mechanism are analyzed, with a speed-up of 53 reported.