Figure 1. Schematic illustration of GPU architecture and programming elements.

Source publication
Article
With the availability of user-oriented software tools, dedicated architectures, such as the parallel computing platform and programming model CUDA (Compute Unified Device Architecture) released by NVIDIA, one of the main producers of graphics cards, and of improved, highly performing GPU (Graphics Processing Unit) boards, GPGPU (General Purpose pro...

Contexts in source publication

Context 1
... addition, the solver currently runs on single-GPU machines, even though extending the current implementation to a multi-GPU environment is straightforward. A basic graphical representation of the different elements of the GPU architecture, together with the relevant terminology used in GPU programming, is provided in figure 1. A generic CUDA application is mainly divided into two parts: host code, running serially or in parallel on CPU, and device code, running on one (or more) GPUs. ...
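As a minimal, hypothetical sketch of this host/device split (kernel name and sizes are illustrative, not taken from the solver), a CUDA program launches __global__ device kernels from host code running on the CPU:

#include <cuda_runtime.h>

// Device code: each thread updates one degree of freedom (illustrative only).
__global__ void scaleDofs(float* u, float alpha, int nDofs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < nDofs) u[i] *= alpha;                   // guard against out-of-range threads
}

int main()
{
    const int nDofs = 1 << 20;                      // hypothetical problem size
    float* d_u = nullptr;
    cudaMalloc(&d_u, nDofs * sizeof(float));        // allocate in device (GPU) memory
    cudaMemset(d_u, 0, nDofs * sizeof(float));

    // Host code runs on the CPU and launches the kernel on the device:
    // a grid of thread blocks, each holding 256 threads.
    const int block = 256;
    const int grid  = (nDofs + block - 1) / block;
    scaleDofs<<<grid, block>>>(d_u, 0.5f, nDofs);
    cudaDeviceSynchronize();                        // wait for the device to finish

    cudaFree(d_u);
    return 0;
}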
Context 2
... the code concerning each DOF and node is executed independently in steps 3 and 5, the assembly of elemental forces in step 4 requires the accumulation of values in the nodes, which is essentially a reduction process. Communication and synchronization between threads belonging to different blocks (and therefore using non-shared memory, see figure 1) in GPU is a difficult task, since CUDA supports native thread synchronization only at thread-block level. ...
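One common way around this restriction, shown below as a hypothetical sketch rather than the authors' actual scheme, is to scatter elemental contributions into the global nodal array with atomic additions, which avoids any inter-block synchronization at the cost of serializing conflicting updates; mesh coloring or a gather-based assembly are frequently used alternatives.

// Hypothetical scatter-based assembly: one thread per element.
// atomicAdd resolves races when elements sharing a node run in different blocks.
__global__ void assembleNodalForces(const int*   elemNodes,  // [nElems * nodesPerElem] connectivity
                                    const float* elemForces, // [nElems * nodesPerElem] elemental forces
                                    float*       nodalForce, // [nNodes], zero-initialized before launch
                                    int nElems, int nodesPerElem)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElems) return;
    for (int a = 0; a < nodesPerElem; ++a) {
        int n = elemNodes[e * nodesPerElem + a];
        atomicAdd(&nodalForce[n], elemForces[e * nodesPerElem + a]);
    }
}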
Context 3
... is probably due to the fact that the regular geometry and discretization make it possible to exploit the whole potential of the GPU implementation, taking advantage of the performance of the parallelized algorithm. This is also clear from the speedup graphs in figure 10: the software seems to reach a saturation point, where the capabilities of the GPU are used at their maximum and the speedup cannot increase any further. Obviously, these saturation curves depend heavily on the GPU hardware: with computationally-oriented GPU boards they are expected to reach higher peak values. ...
Context 4
... two opposite concentrated loads have a magnitude of 500 MPa. A snapshot of the deformed structure is shown, and Figure 12 compares the load-displacement curve with the results of [27]. With this kind of geometry it is simple to refine the mesh and thus analyze the solver performance with a regularly increasing number of degrees of freedom. ...
Context 5
... column Speed lists the number of time steps processed per second, which multiplied by the time step size gives the physical time span which can be simulated in one second of computation. The performance of the proposed implementation can be analyzed in figure 13, where the computational time is reported over the total number of degrees of freedom. The relation is almost perfectly linear: this means that the proposed implementation has not reached a saturation point of the computational power, even using the finest mesh considered. ...
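As a worked example with hypothetical figures, the simulated physical time per second of computation is

  t_phys = Speed × Δt,

so a Speed of 10^4 steps/s combined with a time step Δt = 10^-6 s corresponds to 10^-2 s of physical time simulated per second of wall-clock computation.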
Context 6
... table 5, in the column Memory, theoretical memory consumption estimates are reported (they differ from actual consumption measures by a few megabytes). Therefore, the trend shown in figure 13 is not affected even though the last mesh considered fills almost completely the available GPU memory. ...

Citations

... This paper focuses on distributed explicit time integrators, where time updates are computed through matrix-vector products and are therefore highly scalable and amenable to efficient GPU implementation. Highly scalable GPU solvers for physics-based modelling are already available in the literature [3,20,29,34,53] with GPU-based accelerated explicit finite element structural simulations of soft tissues discussed, for example, in [21,34,53,54]. Unlike implicit time integration, explicit schemes typically do not need element-level quantities to be assembled in a global matrix, leading to memory and runtime savings. ...
... Unlike implicit time integration, explicit schemes typically do not need element-level quantities to be assembled in a global matrix, leading to memory and runtime savings. However, explicit schemes are only conditionally stable [3,4,11,19], unlike their implicit counterpart. This difference becomes less pronounced for the structural analysis of biological soft tissue, where explicit approaches have the potential to be competitive with respect to implicit time integration, for example in the context of cardiovascular modeling. ...
... The pseudo-code in Algorithm 1 illustrates how our displacement-based parallel finite element elastodynamics solver is implemented based on element-level computation and communication operations [3,4,19]. We consider a computational mesh partitioned and distributed over n c processors, labeled as i = 1, . . . ...
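For reference (the exact form of Algorithm 1 is not reproduced in this excerpt), a standard lumped-mass central-difference update consistent with the matrix-vector structure described above reads, at step n:

  a^n = M^{-1} (f_ext^n − f_int(u^n)),
  v^{n+1/2} = v^{n−1/2} + Δt a^n,
  u^{n+1} = u^n + Δt v^{n+1/2},

where M is the diagonal (lumped) mass matrix, so the update reduces to scalable vector operations; only the internal-force evaluation f_int requires element-level computation and the exchange of shared-node values between the n_c processors.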
Article
We propose a data-driven framework to increase the computational efficiency of the explicit finite element method in the structural analysis of soft tissue. An encoder–decoder long short-term memory deep neural network is trained based on the data produced by an explicit, distributed finite element solver. We leverage this network to predict synchronized displacements at shared nodes, minimizing the amount of communication between processors. We perform extensive numerical experiments to quantify the accuracy and stability of the proposed synchronization-avoiding algorithm.
... Cai et al. [43] implemented GPU computing of shell elements for sheet-forming simulation, improving computational efficiency by a factor of 27. Bartezzaghi et al. [44] developed a GPU structural solver for thin shell elements with a speedup of more than 40. Ma's group [45][46][47] developed a GPU-based FEM code and carried out a series of investigations on joining and structural safety analysis. ...
Article
Single-particle tests are normally used to evaluate the impact resistance performance of automotive coatings. Impact-induced debonding is one of the major failure patterns for coating–substrate structures of a vehicle. Currently, it remains a challenging task to accurately and efficiently simulate progressive debonding of automotive coating–substrate interfaces under single particle impact. The main purpose of this work is to develop a graphics processing unit (GPU)-based parallel computational framework to achieve this end. An efficient coating finite element model in three dimensions is established, where solid elements with good aspect ratios and solid-shell elements are respectively used to discretize the impact contact region and the rest region of a coating. An intrinsic cohesive zone model coupled with a mortar-based contact algorithm is used to accurately describe the progressive debonding behavior of coating–substrate interfaces. The computational efficiency of the developed method is dramatically enhanced by recourse to the GPU parallel computing technique. Three benchmark tests are carried out to validate the effectiveness and efficiency of the developed computational framework. Finally, the novel computational framework is successfully applied to the debonding analysis of an organic coating bonded with an aluminum substrate under single-particle impact. Results show that the GPU parallel computational framework can achieve a total speedup of 136.5, which provides a powerful tool for coating debonding analysis.
... There have been incremental improvements, of course [17], and radical departures such as triangular shells with discrete Kirchhoff constraints [18], or quadrilaterals with membrane response incorporating drilling degrees of freedom [19]. The shell finite element technology also keeps developing to incorporate modern computing devices [20,16]. There is a creative tension between two approaches: perform a lot of numerical calculations with limited data movement [21], or perform very simple numerical operations on a large number of simple entities [22]. ...
Article
The use of a flat‐facet finite element with three nodes and six degrees of freedom per node for modeling of linear wave‐propagation events is discussed. The triangular shell finite element was presented in a preceding publication. The main novelty of the current approach is the treatment of the drilling rotations that decouples the drilling degrees of freedom on the global level, and hence makes it possible to eliminate negative effects of these degrees of freedom on the explicit time‐stepping algorithms employed to compute the dynamic response, such as deterioration in accuracy or an artificially reduced time step. The element is shown to be robust for wave propagation simulations, and its performance is illustrated with examples including guided‐wave scenarios.
... Implementations of explicit structural solvers on GPUs are discussed in various studies in the literature. In [3] the authors describe in detail an application involving thin shells, while an overview of applications in biomechanics is discussed in [33]. Additionally, ensemble methods for fluid problems have been recently proposed in [18][19][20][21][34] in the context of the Navier-Stokes equations with distinct initial conditions and forcing terms. ...
Article
Computational models are increasingly used for diagnosis and treatment of cardiovascular disease. To provide a quantitative hemodynamic understanding that can be effectively used in the clinic, it is crucial to quantify the variability in the outputs from these models due to multiple sources of uncertainty. To quantify this variability, the analyst invariably needs to generate a large collection of high-fidelity model solutions, typically requiring a substantial computational effort. In this paper, we show how an explicit-in-time ensemble cardiovascular solver offers superior performance with respect to the embarrassingly parallel solution with implicit-in-time algorithms, typical of an inner-outer loop paradigm for non-intrusive uncertainty propagation. We discuss in detail the numerics and efficient distributed implementation of a segregated FSI cardiovascular solver on both CPU and GPU systems, and demonstrate its applicability to idealized and patient-specific cardiovascular models, analyzed under steady and pulsatile flow conditions.
... In the field of FE analysis, GPUs have been applied for the parallel computation of mesh generation [295], stiffness matrix assembly [296,297], matrix-free methods [298], etc. The parallel processing architectures of GPUs have been exploited to perform explicit and implicit FE analysis of linear and non-linear dynamic problems, such as the analysis of building models and elastic shell problems [299][300][301][302]. Further, simulations of non-linear contact-impact problems, including sheet metal forming simulations and crash simulations [296,303,304], have been performed by exploiting the fine-grained parallel computing power of GPUs. ...
Thesis
Impact-induced fracture analysis has a wide range of engineering and defence applications, including aerospace, manufacturing and construction. An accurate simulation of impact events often requires modelling large-scale complex geometries along with dynamic stress waves and damage propagation. To perform such simulations in a timely manner, a highly efficient and scalable computational framework is necessary. This thesis aims to develop a high-performance computational framework for analysing large-scale structural problems pertaining to impact-induced fracture events. A hierarchical grid-based mesh containing octree cells is utilised for discretising the problem domain. The scaled boundary finite element method (SBFEM) is employed, which can efficiently handle the octree cells by eliminating hanging-node issues. The octree mesh is used in balanced form with a limited number of octree cell patterns. The master element matrices of each pattern are pre-computed, while the storage of the individual element matrices is avoided, leading to a significant reduction in memory requirements, especially for large-scale models. Further, the advantages of octree cells are leveraged by an automatic mesh generation and local refinement process, which enables efficient pre-processing of models with complex geometries. To handle the matrix operations associated with large-scale simulation, a pattern-by-pattern (PBP) approach is proposed. In this technique, the octree patterns are exploited to recast a majority of the computational work into pattern-level dense matrix operations. This avoids global matrix assembly, allows better cache utilisation, and aids the associated memory-bandwidth-limited computations, resulting in significant performance gains in matrix operations. The PBP approach also supports large-scale parallelism. In this work, the parallel computation is carried out using a mesh-partitioning strategy and implemented using the message passing technique. It is shown that the developed solvers can simulate large-scale and complex structural problems, e.g. delamination/fracture in sandwich panels with approximately a billion unknowns (or DOFs). Massive scaling can be achieved with more than ten thousand cores in a distributed computing environment, which reduces the computation time from months (on a single core) to a few minutes.
... Implementations of explicit structural solvers on GPUs are discussed in various studies in the literature. In [8] the authors describe in detail an application involving thin shells, while an overview of applications in biomechanics is discussed in [9]. Additionally, ensemble methods for fluid problems have been recently proposed in [10,11,12,13,14] in the context of the Navier-Stokes equations with distinct initial conditions and forcing terms. ...
Preprint
Computational models are increasingly used for diagnosis and treatment of cardiovascular disease. To provide a quantitative hemodynamic understanding that can be effectively used in the clinic, it is crucial to quantify the variability in the outputs from these models due to multiple sources of uncertainty. To quantify this variability, the analyst invariably needs to generate a large collection of high-fidelity model solutions, typically requiring a substantial computational effort. In this paper, we show how an explicit-in-time ensemble cardiovascular solver offers superior performance with respect to the embarrassingly parallel solution with implicit-in-time algorithms, typical of an inner-outer loop paradigm for non-intrusive uncertainty propagation. We discuss in detail the numerics and efficient distributed implementation of a segregated FSI cardiovascular solver on both CPU and GPU systems, and demonstrate its applicability to idealized and patient-specific cardiovascular models, analyzed under steady and pulsatile flow conditions.
... However, the hardware costs of CPU-based parallel architectures are very high, and the limited number of CPU cores in personal computers makes it difficult to process a large number of procedures concurrently. Recently, thanks to the development of graphics processing units (GPUs) with unified architectures and the introduction of specialized programming models, such as the Compute Unified Device Architecture (CUDA) designed by NVIDIA, general-purpose computing on GPUs (GPGPU) has become an increasingly adopted computing technique in engineering simulations [24][25][26]. While a CPU is designed to excel at executing a single thread as fast as possible, a GPU is designed to excel at executing thousands of threads in parallel [27]. ...
... The global memory can be accessed by all threads using aligned memory transactions. Global memory throughput can be optimized if the memory access patterns are suitable for coalescing [25], which requires adjacent threads to access successive memory addresses. To achieve this coalescing of memory accesses, an adaptation of the structure-of-arrays (SoA) storage pattern [42] is adopted to manage all data buffers in global memory. ...
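A minimal sketch of why the SoA layout helps (field names are illustrative, not the paper's): with an array-of-structures layout, consecutive threads reading the same field touch strided addresses, whereas with a structure-of-arrays layout thread i reads element i of a contiguous buffer, so the accesses of a warp coalesce into a few memory transactions.

// Array-of-structures (AoS): fields of one particle are adjacent in memory,
// so threads reading only x access addresses strided by sizeof(Particle).
struct Particle { float x, y, vx, vy; };

__global__ void updateAoS(Particle* p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x += dt * p[i].vx; p[i].y += dt * p[i].vy; }
}

// Structure-of-arrays (SoA): each field is a contiguous buffer, so adjacent
// threads access successive addresses and their loads/stores coalesce.
struct ParticlesSoA { float *x, *y, *vx, *vy; };

__global__ void updateSoA(ParticlesSoA p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] += dt * p.vx[i]; p.y[i] += dt * p.vy[i]; }
}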
Article
Large deformation contact problems generally involve highly nonlinear behaviors, which are very time-consuming and may lead to convergence issues. The finite particle method (FPM) effectively separates pure deformation from total motion in large deformation problems. In addition, the decoupled procedures of the FPM make it suitable for parallel computing, which may provide an approach to solve time-consuming issues. In this study, a graphics processing unit (GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems. The fundamentals of the FPM for planar solids are first briefly introduced, including the equations of motion of particles and the internal forces of quadrilateral elements. Subsequently, a linked-list data structure suitable for parallel processing is built, and parallel global and local search algorithms are presented for contact detection. The contact forces are then derived and directly exerted on particles. The proposed method is implemented with main solution procedures executed in parallel on a GPU. Two verification problems comprising large deformation frictional contacts are presented, and the accuracy of the proposed algorithm is validated. Furthermore, the algorithm's performance is investigated via a large-scale contact problem, and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4, respectively, relative to commercial finite element software Abaqus/Explicit running on a single-core central processing unit (CPU). The contact calculation time percentage of the total calculation time is only 18% with the FPM, much smaller than that (50%) with Abaqus/Explicit, demonstrating the efficiency of the proposed method.
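The linked-list structure itself is not detailed in this excerpt; a common GPU variant, sketched below with hypothetical names, is a uniform-grid cell list in which head[c] stores the index of the last particle inserted into cell c and next[i] chains the remaining particles, built in parallel with an atomic exchange:

// Hypothetical broad-phase search structure: head[] must be filled with -1
// before launch; next[] receives the previous head of the particle's cell.
__global__ void buildCellList(const float* x, const float* y,
                              int* head, int* next,
                              float cellSize, int nx, int ny, int nParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int cx = min(max((int)(x[i] / cellSize), 0), nx - 1);  // clamp to grid bounds
    int cy = min(max((int)(y[i] / cellSize), 0), ny - 1);
    int c  = cy * nx + cx;                                  // flattened cell index
    next[i] = atomicExch(&head[c], i);   // push particle i onto cell c's list
}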
... Graphics processing units (GPUs) began to be designed with general-purpose computing in mind and, in addition, programming languages emerged that allowed them to be used for this purpose (Bartezzaghi et al., 2015), thus making it feasible to obtain higher-quality results for large-scale simulations. ...
... In this section, the proposed FPM platform is compared with the Abaqus software for code validation and efficiency assessment. While this kind of efficiency comparison is not always fair and meaningful since equivalence cannot be fully guaranteed, it is nevertheless useful for estimating the achievable performance relative to a common and widely known software tool [Bartezzaghi, Cremonesi, Parolini et al. (2015)]. Three numerical examples are presented to test the performance of the proposed GPU-accelerated FPM solvers: two for the shell solver and one for the solid solver. ...
... To assess the performance of the FPM solver in more challenging cases, a pinched cylinder adapted from Bartezzaghi et al. [Bartezzaghi, Cremonesi, Parolini et al. (2015)] was numerically modeled with much larger numbers of elements. The cylinder, with a radius of 1.016 m, a length of 3.048 m and a thickness of 0.03 m, is clamped at one end and pinched under two opposing forces at the other end. ...
... As a result, the comparisons presented here are between the results of the GPU-accelerated FPM solver and the CPU-accelerated explicit FEM solver in Abaqus, and the latter can be regarded only as a reference. A similar treatment has previously been presented in Bartezzaghi et al. [Bartezzaghi, Cremonesi, Parolini et al. (2015)]. In Abaqus, each model was separately analyzed with 1, 4 and 8 CPU cores. ...
... Since NVIDIA officially released the CUDA (Compute Unified Device Architecture) parallel computing technology in 2007, the development threshold has been lowered by exposing the API as extensions to the standard C language [12], [13]. In recent years, CUDA has been widely used in explicit dynamics [14], fluid mechanics [15], [16], structural mechanics [17], and molecular dynamics [18], and has become an important branch in the field of HPC. Compared with CPU cluster computing, GPUs have higher bandwidth, lower latency, and lower power consumption per unit floating-point operation [19]. ...
... Once the assembly is completed, the displacement, velocity, and acceleration of each node under distributed aerodynamic forces are calculated by explicit dynamics [34]. Referring to the integration algorithm in reference [14], matrix addition and matrix-vector multiplication are performed in each time step, which does not take much time on the GPU [35], [36]. According to Amdahl's law, the calculation of element matrices is the bottleneck of real-time simulation performance. ...
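The Amdahl's-law argument can be stated explicitly: if the element-matrix computation takes a fraction p of the total runtime and is accelerated by a factor s while the remaining operations (such as the per-step matrix additions and matrix-vector products) are left unchanged, the overall speedup is bounded by

  S = 1 / ((1 − p) + p / s) ≤ 1 / (1 − p),

so when the cheap per-step algebra is already fast, the achievable real-time performance is governed almost entirely by how quickly the element matrices can be computed.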
Article
For modern aircraft such as missiles and rockets, owing to the large slenderness ratio of slender-body vehicles, the influence of elastic deformation and vibration on the navigation, guidance, and engine modules cannot be ignored in simulation. To address the slow calculation speed and the lack of real-time capability of time-domain simulation, the time proportion of each calculation step is analyzed under different computing scales, and a dynamically and in-parallel constructed octree is used to represent the aerodynamic parameter table in both single- and multi-GPU environments. Meanwhile, an innovative parallel algorithm for the element stiffness matrix based on the finite element model is designed for the GPU architecture. Performance is further enhanced through adaptive use of hardware resources and rational use of shared memory. Furthermore, a multi-threaded asynchronous framework based on a task queue and a thread pool is proposed to realize parallel task calculation with different granularities. The numerical results show that an acceleration ratio of about 20 times can be obtained on a single GPU, and an acceleration ratio of at least 30 times can be obtained by parallel computing on dual GPUs, enabling the real-time simulation of a flexible aircraft with 1200 elements within 20 ms.