Figure 1. Schematic illustration of GPU architecture and programming elements.

Source publication
Article
With the availability of user-oriented software tools, dedicated architectures, such as the parallel computing platform and programming model CUDA (Compute Unified Device Architecture) released by NVIDIA, one of the main producers of graphics cards, and of improved, highly performing GPU (Graphics Processing Unit) boards, GPGPU (General Purpose pro...

Contexts in source publication

Context 1
... addition, the solver currently runs on single-GPU machines, even though extending the current implementation to a multi-GPU environment is straightforward. A basic graphical representation of the different elements of the GPU architecture, together with the relevant terminology used in GPU programming, is provided in figure 1. A generic CUDA application is mainly divided into two parts: host code, running serially or in parallel on CPU, and device code, running on one (or more) GPUs. ...
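As a minimal, hypothetical sketch of this host/device split (kernel name and sizes are illustrative, not taken from the solver), a CUDA program launches __global__ device kernels from host code running on the CPU:

#include <cuda_runtime.h>

// Device code: each thread updates one degree of freedom (illustrative only).
__global__ void scaleDofs(float* u, float alpha, int nDofs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < nDofs) u[i] *= alpha;                   // guard against out-of-range threads
}

int main()
{
    const int nDofs = 1 << 20;                      // hypothetical problem size
    float* d_u = nullptr;
    cudaMalloc(&d_u, nDofs * sizeof(float));        // allocate in device (GPU) memory
    cudaMemset(d_u, 0, nDofs * sizeof(float));

    // Host code runs on the CPU and launches the kernel on the device:
    // a grid of thread blocks, each holding 256 threads.
    const int block = 256;
    const int grid  = (nDofs + block - 1) / block;
    scaleDofs<<<grid, block>>>(d_u, 0.5f, nDofs);
    cudaDeviceSynchronize();                        // wait for the device to finish

    cudaFree(d_u);
    return 0;
}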
Context 2
... the code concerning each DOF and node is executed independently in steps 3 and 5, the assembly of elemental forces in step 4 requires the accumulation of values in the nodes, which is essentially a reduction process. Communication and synchronization between threads belonging to different blocks (and therefore using non-shared memory, see figure 1) in GPU is a difficult task, since CUDA supports native thread synchronization only at thread-block level. ...
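One common way around this restriction, shown below as a hypothetical sketch rather than the authors' actual scheme, is to scatter elemental contributions into the global nodal array with atomic additions, which avoids any inter-block synchronization at the cost of serializing conflicting updates; mesh coloring or a gather-based assembly are frequently used alternatives.

// Hypothetical scatter-based assembly: one thread per element.
// atomicAdd resolves races when elements sharing a node run in different blocks.
__global__ void assembleNodalForces(const int*   elemNodes,  // [nElems * nodesPerElem] connectivity
                                    const float* elemForces, // [nElems * nodesPerElem] elemental forces
                                    float*       nodalForce, // [nNodes], zero-initialized before launch
                                    int nElems, int nodesPerElem)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElems) return;
    for (int a = 0; a < nodesPerElem; ++a) {
        int n = elemNodes[e * nodesPerElem + a];
        atomicAdd(&nodalForce[n], elemForces[e * nodesPerElem + a]);
    }
}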
Context 3
... is probably due to the fact that the regular geometry and discretization make it possible to exploit the whole potential of the GPU implementation, taking advantage of the performance of the parallelized algorithm. This is also clear from the speedup graphs in figure 10: the software seems to reach a saturation point, where the capabilities of the GPU are used at their maximum and the speedup cannot increase any further. Obviously, these saturation curves depend heavily on the GPU hardware: with computationally-oriented GPU boards they are expected to reach higher peak values. ...
Context 4
... two opposite concentrated loads have a magnitude of 500 MPa. A snapshot of the deformed structure is shown, and Figure 12 compares the load-displacement curve with the results of [27]. With this kind of geometry it is simple to refine the mesh and thus analyze the solver performance with a regularly increasing number of degrees of freedom. ...
Context 5
... column Speed lists the number of time steps processed per second, which multiplied by the time step size gives the physical time span which can be simulated in one second of computation. The performance of the proposed implementation can be analyzed in figure 13, where the computational time is reported over the total number of degrees of freedom. The relation is almost perfectly linear: this means that the proposed implementation has not reached a saturation point of the computational power, even using the finest mesh considered. ...
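As a worked example with hypothetical figures, the simulated physical time per second of computation is

  t_phys = Speed × Δt,

so a Speed of 10^4 steps/s combined with a time step Δt = 10^-6 s corresponds to 10^-2 s of physical time simulated per second of wall-clock computation.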
Context 6
... table 5, in the column Memory, theoretical memory consumption estimates are reported (they differ from actual consumption measures by a few megabytes). Therefore, the trend shown in figure 13 is not affected even though the last mesh considered fills almost completely the available GPU memory. ...

Citations

... This paper focuses on distributed explicit time integrators, where time updates are computed through matrix-vector products and are therefore highly scalable and amenable to efficient GPU implementation. Highly scalable GPU solvers for physics-based modelling are already available in the literature [3,20,29,34,53] with GPU-based accelerated explicit finite element structural simulations of soft tissues discussed, for example, in [21,34,53,54]. Unlike implicit time integration, explicit schemes typically do not need element-level quantities to be assembled in a global matrix, leading to memory and runtime savings. ...
... Unlike implicit time integration, explicit schemes typically do not need element-level quantities to be assembled in a global matrix, leading to memory and runtime savings. However, explicit schemes are only conditionally stable [3,4,11,19], unlike their implicit counterpart. This difference becomes less pronounced for the structural analysis of biological soft tissue, where explicit approaches have the potential to be competitive with respect to implicit time integration, for example in the context of cardiovascular modeling. ...
... The pseudo-code in Algorithm 1 illustrates how our displacement-based parallel finite element elastodynamics solver is implemented based on element-level computation and communication operations [3,4,19]. We consider a computational mesh partitioned and distributed over n c processors, labeled as i = 1, . . . ...
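For reference (the exact form of Algorithm 1 is not reproduced in this excerpt), a standard lumped-mass central-difference update consistent with the matrix-vector structure described above reads, at step n:

  a^n = M^{-1} (f_ext^n − f_int(u^n)),
  v^{n+1/2} = v^{n−1/2} + Δt a^n,
  u^{n+1} = u^n + Δt v^{n+1/2},

where M is the diagonal (lumped) mass matrix, so the update reduces to scalable vector operations; only the internal-force evaluation f_int requires element-level computation and the exchange of shared-node values between the n_c processors.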
Article
We propose a data-driven framework to increase the computational efficiency of the explicit finite element method in the structural analysis of soft tissue. An encoder–decoder long short-term memory deep neural network is trained based on the data produced by an explicit, distributed finite element solver. We leverage this network to predict synchronized displacements at shared nodes, minimizing the amount of communication between processors. We perform extensive numerical experiments to quantify the accuracy and stability of the proposed synchronization-avoiding algorithm.
... Cai et al. [43] implemented GPU computing of shell elements for sheet-forming simulation, improving computational efficiency by a factor of 27. Bartezzaghi et al. [44] developed a GPU structural solver for thin shell elements with a speedup of more than 40. Ma's group [45][46][47] developed a GPU-based FEM code and carried out a series of investigations on joining and structural safety analysis. ...
Article
Single-particle tests are normally used to evaluate the impact resistance performance of automotive coatings. Impact-induced debonding is one of the major failure patterns for coating–substrate structures of a vehicle. Currently, it remains a challenging task to accurately and efficiently simulate progressive debonding of automotive coating–substrate interfaces under single particle impact. The main purpose of this work is to develop a graphics processing unit (GPU)-based parallel computational framework to achieve this end. An efficient coating finite element model in three dimensions is established, where solid elements with good aspect ratios and solid-shell elements are respectively used to discretize the impact contact region and the rest region of a coating. An intrinsic cohesive zone model coupled with a mortar-based contact algorithm is used to accurately describe the progressive debonding behavior of coating–substrate interfaces. The computational efficiency of the developed method is dramatically enhanced by recourse to the GPU parallel computing technique. Three benchmark tests are carried out to validate the effectiveness and efficiency of the developed computational framework. Finally, the novel computational framework is successfully applied to the debonding analysis of an organic coating bonded with an aluminum substrate under single-particle impact. Results show that the GPU parallel computational framework can achieve a total speedup of 136.5, which provides a powerful tool for coating debonding analysis.
... There have been incremental improvements, of course [17], and radical departures such as triangular shells with discrete Kirchhoff constraints [18], or quadrilaterals with membrane response incorporating drilling degrees of freedom [19]. The shell finite element technology also keeps developing to incorporate modern computing devices [20,16]. There is a creative tension between two approaches: perform a lot of numerical calculations with limited data movement [21], or perform very simple numerical operations on a large number of simple entities [22]. ...
Article
The use of a flat‐facet finite element with three nodes and six degrees of freedom per node for modeling of linear wave‐propagation events is discussed. The triangular shell finite element was presented in a preceding publication. The main novelty of the current approach is the treatment of the drilling rotations that decouples the drilling degrees of freedom on the global level, and hence makes it possible to eliminate negative effects of these degrees of freedom on the explicit time‐stepping algorithms employed to compute the dynamic response, such as deterioration in accuracy or an artificially reduced time step. The element is shown to be robust for wave propagation simulations, and its performance is illustrated with examples including guided‐wave scenarios.
... Implementations of explicit structural solvers on GPUs are discussed in various studies in the literature. In [3] the authors describe in detail an application involving thin shells, while an overview of applications in biomechanics is discussed in [33]. Additionally, ensemble methods for fluid problems have been recently proposed in [18][19][20][21][34] in the context of the Navier-Stokes equations with distinct initial conditions and forcing terms. ...
Article
Computational models are increasingly used for diagnosis and treatment of cardiovascular disease. To provide a quantitative hemodynamic understanding that can be effectively used in the clinic, it is crucial to quantify the variability in the outputs from these models due to multiple sources of uncertainty. To quantify this variability, the analyst invariably needs to generate a large collection of high-fidelity model solutions, typically requiring a substantial computational effort. In this paper, we show how an explicit-in-time ensemble cardiovascular solver offers superior performance with respect to the embarrassingly parallel solution with implicit-in-time algorithms, typical of an inner-outer loop paradigm for non-intrusive uncertainty propagation. We discuss in detail the numerics and efficient distributed implementation of a segregated FSI cardiovascular solver on both CPU and GPU systems, and demonstrate its applicability to idealized and patient-specific cardiovascular models, analyzed under steady and pulsatile flow conditions.
... In the field of FE analysis, GPUs have been applied for the parallel computation of mesh generation [295], stiffness matrix assembly [296,297], matrix-free methods [298], etc. The parallel processing architectures of GPUs have been exploited to perform explicit and implicit FE analysis of linear and non-linear dynamic problems, such as the analysis of building models and elastic shell problems [299][300][301][302]. Further, simulations of non-linear contact-impact problems, including sheet metal forming simulations and crash simulations [296,303,304], have been performed by exploiting the fine-grained parallel computing power of GPUs. ...
Thesis
Impact-induced fracture analysis has a wide range of engineering and defence applications, including aerospace, manufacturing and construction. An accurate simulation of impact events often requires modelling large-scale complex geometries along with dynamic stress waves and damage propagation. To perform such simulations in a timely manner, a highly efficient and scalable computational framework is necessary. This thesis aims to develop a high-performance computational framework for analysing large-scale structural problems pertaining to impact-induced fracture events. A hierarchical grid-based mesh containing octree cells is utilised for discretising the problem domain. The scaled boundary finite element method (SBFEM) is employed, which can efficiently handle the octree cells by eliminating hanging-node issues. The octree mesh is used in balanced form with a limited number of octree cell patterns. The master element matrices of each pattern are pre-computed, while the storage of the individual element matrices is avoided, leading to a significant reduction in memory requirements, especially for large-scale models. Further, the advantages of octree cells are leveraged by an automatic mesh generation and local refinement process, which enables efficient pre-processing of models with complex geometries. To handle the matrix operations associated with large-scale simulation, a pattern-by-pattern (PBP) approach is proposed. In this technique, the octree patterns are exploited to recast a majority of the computational work into pattern-level dense matrix operations. This avoids global matrix assembly, allows better cache utilisation, and aids the associated memory-bandwidth-limited computations, resulting in significant performance gains in matrix operations. The PBP approach also supports large-scale parallelism. In this work, the parallel computation is carried out using a mesh-partitioning strategy and implemented using the message passing technique. It is shown that the developed solvers can simulate large-scale and complex structural problems, e.g. delamination/fracture in sandwich panels with approximately a billion unknowns (or DOFs). Massive scaling can be achieved with more than ten thousand cores in a distributed computing environment, which reduces the computation time from months (on a single core) to a few minutes.
... Implementations of explicit structural solvers on GPUs are discussed in various studies in the literature. In [8] the authors describe in detail an application involving thin shells, while an overview of applications in biomechanics is discussed in [9]. Additionally, ensemble methods for fluid problems have been recently proposed in [10,11,12,13,14] in the context of the Navier-Stokes equations with distinct initial conditions and forcing terms. ...
Preprint
Computational models are increasingly used for diagnosis and treatment of cardiovascular disease. To provide a quantitative hemodynamic understanding that can be effectively used in the clinic, it is crucial to quantify the variability in the outputs from these models due to multiple sources of uncertainty. To quantify this variability, the analyst invariably needs to generate a large collection of high-fidelity model solutions, typically requiring a substantial computational effort. In this paper, we show how an explicit-in-time ensemble cardiovascular solver offers superior performance with respect to the embarrassingly parallel solution with implicit-in-time algorithms, typical of an inner-outer loop paradigm for non-intrusive uncertainty propagation. We discuss in detail the numerics and efficient distributed implementation of a segregated FSI cardiovascular solver on both CPU and GPU systems, and demonstrate its applicability to idealized and patient-specific cardiovascular models, analyzed under steady and pulsatile flow conditions.
... However, the hardware costs of CPU-based parallel architectures are very high, and the limited number of CPU cores in personal computers makes it difficult to process a large number of procedures concurrently. Recently, thanks to the development of graphics processing units (GPUs) with unified architectures and the introduction of specialized programming models, such as the Compute Unified Device Architecture (CUDA) designed by NVIDIA, general-purpose computing on GPUs (GPGPU) has become an increasingly adopted computing technique in engineering simulations [24][25][26]. While a CPU is designed to excel at executing a single thread as fast as possible, a GPU is designed to excel at executing thousands of threads in parallel [27]. ...
... The global memory can be accessed by all threads using aligned memory transactions. Global memory throughput can be optimized if the memory access patterns are suitable for coalescing [25], which requires adjacent threads to access successive memory addresses. To achieve this coalescing of memory accesses, an adaptation of the structure-of-arrays (SoA) storage pattern [42] is adopted to manage all data buffers in global memory. ...
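A minimal sketch of why the SoA layout helps (field names are illustrative, not the paper's): with an array-of-structures layout, consecutive threads reading the same field touch strided addresses, whereas with a structure-of-arrays layout thread i reads element i of a contiguous buffer, so the accesses of a warp coalesce into a few memory transactions.

// Array-of-structures (AoS): fields of one particle are adjacent in memory,
// so threads reading only x access addresses strided by sizeof(Particle).
struct Particle { float x, y, vx, vy; };

__global__ void updateAoS(Particle* p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x += dt * p[i].vx; p[i].y += dt * p[i].vy; }
}

// Structure-of-arrays (SoA): each field is a contiguous buffer, so adjacent
// threads access successive addresses and their loads/stores coalesce.
struct ParticlesSoA { float *x, *y, *vx, *vy; };

__global__ void updateSoA(ParticlesSoA p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] += dt * p.vx[i]; p.y[i] += dt * p.vy[i]; }
}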
Article
Large deformation contact problems generally involve highly nonlinear behaviors, which are very time-consuming and may lead to convergence issues. The finite particle method (FPM) effectively separates pure deformation from total motion in large deformation problems. In addition, the decoupled procedures of the FPM make it suitable for parallel computing, which may provide an approach to solve time-consuming issues. In this study, a graphics processing unit (GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems. The fundamentals of the FPM for planar solids are first briefly introduced, including the equations of motion of particles and the internal forces of quadrilateral elements. Subsequently, a linked-list data structure suitable for parallel processing is built, and parallel global and local search algorithms are presented for contact detection. The contact forces are then derived and directly exerted on particles. The proposed method is implemented with main solution procedures executed in parallel on a GPU. Two verification problems comprising large deformation frictional contacts are presented, and the accuracy of the proposed algorithm is validated. Furthermore, the algorithm's performance is investigated via a large-scale contact problem, and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4, respectively, relative to commercial finite element software Abaqus/Explicit running on a single-core central processing unit (CPU). The contact calculation time percentage of the total calculation time is only 18% with the FPM, much smaller than that (50%) with Abaqus/Explicit, demonstrating the efficiency of the proposed method.
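The linked-list structure itself is not detailed in this excerpt; a common GPU variant, sketched below with hypothetical names, is a uniform-grid cell list in which head[c] stores the index of the last particle inserted into cell c and next[i] chains the remaining particles, built in parallel with an atomic exchange:

// Hypothetical broad-phase search structure: head[] must be filled with -1
// before launch; next[] receives the previous head of the particle's cell.
__global__ void buildCellList(const float* x, const float* y,
                              int* head, int* next,
                              float cellSize, int nx, int ny, int nParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int cx = min(max((int)(x[i] / cellSize), 0), nx - 1);  // clamp to grid bounds
    int cy = min(max((int)(y[i] / cellSize), 0), ny - 1);
    int c  = cy * nx + cx;                                  // flattened cell index
    next[i] = atomicExch(&head[c], i);   // push particle i onto cell c's list
}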
... Graphics processing units (GPUs) began to be designed with general-purpose computing in mind and, in addition, programming languages emerged that allowed them to be used for this purpose (Bartezzaghi et al., 2015), thus making it feasible to obtain higher-quality results for large-scale simulations. ...
... In this section, the proposed FPM platform is compared with the Abaqus software for code validation and efficiency assessment. While this kind of efficiency comparison is not always fair and meaningful since equivalence cannot be fully guaranteed, it is nevertheless useful for estimating the achievable performance relative to a common and widely known software tool [Bartezzaghi, Cremonesi, Parolini et al. (2015)]. Three numerical examples are presented to test the performance of the proposed GPU-accelerated FPM solvers: two for the shell solver and one for the solid solver. ...
... To assess the performance of the FPM solver in more challenging cases, a pinched cylinder adapted from Bartezzaghi et al. [Bartezzaghi, Cremonesi, Parolini et al. (2015)] was numerically modeled with much larger numbers of elements. The cylinder, with a radius of 1.016 m, a length of 3.048 m and a thickness of 0.03 m, is clamped at one end and pinched under two opposing forces at the other end. ...
... As a result, the comparisons presented here are between the results of the GPU-accelerated FPM solver and the CPU-accelerated explicit FEM solver in Abaqus, and the latter can be regarded only as a reference. A similar treatment has previously been presented in Bartezzaghi et al. [Bartezzaghi, Cremonesi, Parolini et al. (2015)]. In Abaqus, each model was separately analyzed with 1, 4 and 8 CPU cores. ...
... Since NVIDIA officially released the CUDA (Compute Unified Device Architecture) parallel computing technology in 2007, the development threshold has been lowered by exposing the API as extensions to the standard C language [12], [13]. In recent years, CUDA has been widely used in explicit dynamics [14], fluid mechanics [15], [16], structural mechanics [17], and molecular dynamics [18], and has become an important branch in the field of HPC. Compared with CPU cluster computing, GPUs have higher bandwidth, lower latency, and lower power consumption per unit floating-point operation [19]. ...
... Once the assembly is completed, the displacement, velocity, and acceleration of each node under distributed aerodynamic forces are calculated by explicit dynamics [34]. Referring to the integration algorithm in reference [14], matrix addition and matrix-vector multiplication are performed in each time step, which does not take much time on the GPU [35], [36]. According to Amdahl's law, the calculation of element matrices is the bottleneck of real-time simulation performance. ...
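The Amdahl's-law argument can be stated explicitly: if the element-matrix computation takes a fraction p of the total runtime and is accelerated by a factor s while the remaining operations (such as the per-step matrix additions and matrix-vector products) are left unchanged, the overall speedup is bounded by

  S = 1 / ((1 − p) + p / s) ≤ 1 / (1 − p),

so when the cheap per-step algebra is already fast, the achievable real-time performance is governed almost entirely by how quickly the element matrices can be computed.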
Article
For modern aircraft such as missiles and rockets, owing to the large slenderness ratio of slender-body vehicles, the influence of elastic deformation and vibration on the navigation, guidance, and engine modules cannot be ignored in simulation. To address the slow calculation speed and the lack of real-time capability of time-domain simulation, the time proportion of each calculation step is analyzed under different computing scales, and a dynamically and in-parallel constructed octree is used to represent the aerodynamic parameter table in both single- and multi-GPU environments. Meanwhile, an innovative parallel algorithm for the element stiffness matrix based on the finite element model is designed for the GPU architecture. Performance is further enhanced through adaptive use of hardware resources and rational use of shared memory. Furthermore, a multi-threaded asynchronous framework based on a task queue and a thread pool is proposed to realize parallel task calculation with different granularities. The numerical results show that an acceleration ratio of about 20 times can be obtained on a single GPU, and an acceleration ratio of at least 30 times can be obtained by parallel computing on dual GPUs, enabling the real-time simulation of a flexible aircraft with 1200 elements within 20 ms.