Figure - available from: The Journal of Supercomputing
Example of the non-zero pattern of a sparse matrix A. (a) Example matrix A, nnz first non-zero on the row. (b) Non-zero pattern of matrix A

Source publication
Article
Full-text available
In this paper, an original Jacobi implementation is considered for the solution of sparse linear systems of equations. The proposed algorithm helps to optimize the parallel implementation on GPU. The performance of the GPU-based (CUDA) implementation of this algorithm is analyzed and compared to the corresponding serial CPU-based alg...

Similar publications

Article
Full-text available
This paper introduces a new fast algorithm for the 8-point discrete cosine transform (DCT) based on the summation-by-parts formula. The proposed method converts the DCT matrix into an alternative transformation matrix that can be decomposed into sparse matrices of low multiplicative complexity. The method is capable of scaled and exact DCT computat...
Article
Full-text available
The paper describes the current state of research on a new iterative method for solving systems of linear algebraic equations. The method is suitable for extremely large systems with sparse matrices. In addition to its own characteristics, it also has a feature of generality, as many iterative methods are only special cases of this approach. The algorithm was dev...
Article
Full-text available
Finite Element Method (FEM) is a well-developed method to solve real-world problems that can be modeled with differential equations. As the available computational power increases, complex and large size problems can be solved using FEM which typically involves multiple degrees of freedom (DOF) per node, high order of elements and an iterative solv...
Article
Cross-view data are collected from two different views or sources about the same subjects. As the information from these views often consolidate and/or complement each other, cross-view data analysis can gain more insights for decision making. A main challenge of cross-view data analysis is how to effectively explore the inherently correlated and h...

Citations

... Sparse matrix-vector multiplication (SpMV) is a commonly used operation in computer science and numerical computation with a wide range of applications in many fields [1,2], such as large-scale linear systems [3], graph analytics [4,5], machine learning [6], and so on. SpMV is often the performance bottleneck of these applications, so it is important to study how to improve the performance of SpMV. ...
Article
Full-text available
Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms is widely studied, where the access behavior of GPU is often the performance bottleneck. The Ampere GPU architecture, recently introduced by NVIDIA, provides a new asynchronous memory copy instruction, memcpy_async, for more efficient data movement in shared memory. Leveraging the capability of this new memcpy_async instruction, we first propose the CSR-Partial-Overlap to carefully overlap the data copy from global memory to shared memory and computation, allowing us to take full advantage of the data transfer time. In addition, we design the dynamic batch partition and the dynamic threads distribution to achieve effective load balancing, avoid the overhead of fixing up partial sums, and improve thread utilization. Furthermore, we propose the CSR-Full-Overlap based on the CSR-Partial-Overlap, which takes the overlap of data transfer from host to device and SpMV kernel execution into account as well. The CSR-Full-Overlap unifies the two major overlaps in SpMV and hides the computation as much as possible in the two important access behaviors of the GPU. This allows CSR-Full-Overlap to achieve the best performance gains from both overlaps. As far as we know, this paper is the first in-depth study of how memcpy_async can be potentially applied to help accelerate SpMV computation on GPU platforms. We compare CSR-Full-Overlap to the current state-of-the-art cuSPARSE, where our experimental results show an average 2.03x performance gain and up to 2.67x performance gain.
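For orientation, the sketch below shows a minimal scalar-CSR SpMV kernel (one thread per row) that illustrates the CSR access pattern this work builds on; it is a baseline illustration only, not the CSR-Partial-Overlap/CSR-Full-Overlap kernels or their memcpy_async staging, and all identifiers are chosen for the example.

```cuda
// Baseline scalar-CSR SpMV: y = A * x, one thread per matrix row.
// row_ptr has n_rows + 1 entries; the non-zeros of row r live in
// values[row_ptr[r] .. row_ptr[r+1]) with column indices in col_idx.
__global__ void spmv_csr_scalar(int n_rows,
                                const int* __restrict__ row_ptr,
                                const int* __restrict__ col_idx,
                                const float* __restrict__ values,
                                const float* __restrict__ x,
                                float* __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += values[k] * x[col_idx[k]];   // gather from x via column index
        y[row] = sum;
    }
}
```

The irregular, column-index-driven gathers from x are what make the GPU memory access behavior the bottleneck that the overlapping schemes above target.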
... The presence of several valleys inside the Si band structure, plus the confinement due to the oxide layers, means that we have as many Boltzmann Transport equations (1) as valley-subband pairs. The scattering operator Q_{ν,p}[·] describes the electron-phonon interactions and s(w) is a given function due to the change of variables in the impulsion space. Refer to [27,38] for the details about these terms. ...
... A classical Jacobi iteration can be rewritten in vector form [1] in order to exploit the matrix-vector product operation: x^{k+1} = x^{k} + D^{-1}(b - A x^{k}), where D denotes the diagonal of A. A significant acceleration of the Jacobi algorithm can be obtained by applying the Scheduled Relaxation Jacobi (SRJ) method. The SRJ method extends the classical Jacobi method by introducing P different relaxation factors ω_i > 0, i = 1, …, P. ...
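As a concrete reading of the vector form quoted above, here is a minimal sketch of one relaxed Jacobi sweep on a CSR matrix, x_new = x + ω D^{-1}(b - A x); in an SRJ run the host would cycle ω through the P scheduled factors ω_1, …, ω_P between sweeps. The kernel and its parameter names are illustrative assumptions, not the cited authors' code.

```cuda
// One relaxed (weighted) Jacobi sweep on a CSR matrix:
//   x_new = x + omega * D^{-1} * (b - A x)
// diag holds the diagonal entries of A; omega is the relaxation factor
// for this sweep (omega = 1 recovers the classical Jacobi iteration).
__global__ void jacobi_relaxed_sweep(int n_rows,
                                     const int* __restrict__ row_ptr,
                                     const int* __restrict__ col_idx,
                                     const float* __restrict__ values,
                                     const float* __restrict__ diag,
                                     const float* __restrict__ b,
                                     const float* __restrict__ x,
                                     float* __restrict__ x_new,
                                     float omega)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float ax = 0.0f;                              // (A x)_row
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            ax += values[k] * x[col_idx[k]];
        float r = b[row] - ax;                        // residual component
        x_new[row] = x[row] + omega * r / diag[row];  // relaxed update
    }
}
```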
Article
Full-text available
A previous study by Mantas and Vecil (Int J High Perform Comput Appl 34(1): 81–102, 2019) describes an efficient and accurate solver for nanoscale DG MOSFETs through a deterministic Boltzmann-Schrödinger-Poisson model with seven electron–phonon scattering mechanisms on a hybrid parallel CPU/GPU platform. The transport computational phase, i.e. the time integration of the Boltzmann equations, was ported to the GPU using CUDA extensions, but the computation of the system’s eigenstates, i.e. the solution of the Schrödinger-Poisson block, was parallelized only using OpenMP due to its complexity. This work fills the gap by describing a port to GPU for the solver of the Schrödinger-Poisson block. This new proposal implements on GPU a Scheduled Relaxation Jacobi method to solve the sparse linear systems which arise in the 2D Poisson equation. The 1D Schrödinger equation is solved on GPU by adapting a multi-section iteration and the Newton-Raphson algorithm to approximate the energy levels, and the Inverse Power Iterative Method is used to approximate the wave vectors. We want to stress that this solver for the Schrödinger-Poisson block can be thought of as a module independent of the transport phase (Boltzmann) and can be used for solvers using different levels of description for the electrons; therefore, it is of particular interest because it can be adapted to other macroscopic, hence faster, solvers for confined devices exploited at the industrial level.
... The problem is discretized with P1 finite elements with a stabilization parameter obtained from [5], and the associated linear system of equations is solved with the Alinea library [16]. The problem is reformulated with a domain decomposition method [17] and is solved on a Graphics Processing Unit (GPU) [15], [13]. The local solver inside each subdomain is performed with the conjugate gradient method [14], involving an auto-tuning of the GPU memory [18]. ...
Preprint
Full-text available
In this paper, we explore point-cloud based deep learning models to analyze numerical simulations arising from finite element analysis. The objective is to automatically classify the results of the simulations without tedious human intervention. Two models are presented here: the Point-Net classification model and the Dynamic Graph Convolutional Neural Net model. Both trained point-cloud deep learning models performed well on experiments with finite element analyses arising from the automotive industry. The proposed models show promise in automating the analysis process of finite element simulations. Accuracies of 79.17% and 94.5% are obtained for the Point-Net and the Dynamic Graph Convolutional Neural Net models, respectively.
... The linear system of equations is solved with the Alinea library [12]. The problem is reformulated in parallel with a domain decomposition method [13] and is solved on GPU [11], [9]. The local solver inside each subdomain is performed with the conjugate gradient method [10], involving an auto-tuning of the GPU memory [14]. ...
Preprint
Full-text available
Many Partial Differential Equations (PDEs) do not have an analytical solution and can only be solved by numerical methods. In this context, Physics-Informed Neural Networks (PINN) have become important in the last decades, since they use a neural network together with physical conditions to approximate any function. This paper focuses on the hypertuning of a PINN used to solve a PDE. The behavior of the approximated solution when we change the learning rate or the activation function (sigmoid, hyperbolic tangent, GELU, ReLU and ELU) is analyzed here. A comparative study is done to determine the best characteristics for the problem, as well as to find a learning rate that allows fast and satisfactory learning. The GELU and hyperbolic tangent activation functions exhibit better performance than the other activation functions. A suitable choice of the learning rate results in higher accuracy and faster convergence.
... x is the solution vector [24]. ...
Article
Full-text available
The Jacobi iterative algorithm has the characteristic of low computational load, and multiple components of the solution can be solved independently. This paper applies these characteristics to the ternary optical computer, which can be used for parallel optimization because it has a large number of data bits and reconfigurable processor bits. Therefore, a new parallel design scheme is constructed to solve the problem of slow efficiency in solving large linear equations, and elaborate experiments are used to verify it. The experimental method is to simulate the calculation on the ternary optical computer experimental platform. Then, the resource consumption is numerically calculated and summarized to measure the feasibility of the parallel design. Eventually, the results show that the parallel design has obvious advantages in computing speed. The Jacobi iterative algorithm is optimized in parallel on a ternary optical processor for the first time. There are two parallel highlights of the scheme. First, the n components are calculated in full parallel. Second, the modified signed-digit (MSD) multiplier based on the minimum module and the one-step MSD adder are used to calculate each component to eliminate the impact of large amounts of data on calculation time. The research provides a new method for the fast solution of large linear equations.
... More than half the systems on the Top500 [1] list include discrete GPUs and seven of the systems in the top ten are GPU-accelerated (November 2021 list). As a result, extensive efforts went into optimizing iterative methods for GPUs, for instance: iterative stencils [2]- [5] used widely in numerical solvers for PDEs, iterative stationary methods for solving systems of linear equations (ex: Jacobi [6], [7], Gauss-Seidel method [7]- [9]), iterative Krylov subspace methods for solving systems of linear equations (ex: conjugate gradient [10], [11], BiCG [11], [12], and GMRES [11], [13]). ...
... T_gm(D) = A_gm(D) · S(type) / B_gm (6) In the case when the kernel is bounded by the shared memory bandwidth, i.e., the volume of data cached in shared memory moves the bottleneck to the shared memory bandwidth, the total shared memory (in bytes) accessed, A_sm, becomes: ...
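A minimal host-side sketch of the memory-bound time model in Eq. (6) above; the function and argument names are assumptions made for illustration (A_gm counts elements accessed in global memory, S is the element size in bytes, B_gm is the global memory bandwidth in bytes per second).

```cuda
// T_gm(D) = A_gm(D) * S(type) / B_gm : time a kernel needs if it is limited
// purely by global memory traffic over the domain D.
double time_global_memory_bound(double accessed_elements_gm,     // A_gm(D)
                                double element_size_bytes,       // S(type)
                                double bandwidth_gm_bytes_per_s) // B_gm
{
    return accessed_elements_gm * element_size_bytes / bandwidth_gm_bytes_per_s;
}
```

The analogous expression with A_sm and the shared memory bandwidth presumably gives the shared-memory-bound time referred to at the end of the excerpt.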
... Accordingly, we do not use those new features in our PERKS implementations. [16], [35] Benchmarks (stencil order, FLOPs/cell): 2d5pt (1, 10), 2ds9pt (2, 18), 2d13pt (3, 26), 2d17pt (4, 34), 2d21pt (5, 42), 2ds25pt (6, 59), 2d9pt (1, 18), 2d25pt (2, 50), 3d7pt (1, 14), 3d13pt (2, 26), 3d17pt (1, 34), 3d27pt (1, 54), poisson (1, 38). The experimental results presented here are evaluated on the two latest generations of Nvidia GPUs: Volta V100 and Ampere A100 with CUDA 11.5 and driver version 495.29.05. ...
Preprint
Full-text available
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this scheme the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output in each time step in registers and shared memory to be used as input for the following time step. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geometric mean speedup of 2.29x in small domains and 1.53x in large domains), and a Krylov subspace solver (geometric mean speedup of 4.67x in smaller SpMV datasets from SuiteSparse and 1.39x in larger SpMV datasets, for conjugate gradient).
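A hedged sketch of the PERKS idea, assuming a 1D Jacobi sweep for the Poisson problem as the iterative body: the time loop lives inside a single persistent kernel and a cooperative-groups device-wide barrier replaces per-step kernel relaunches. The kernel must be launched with cudaLaunchCooperativeKernel; all names are illustrative, not the authors' implementation (which additionally caches part of the solution in registers/shared memory between steps).

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Persistent-kernel time loop: launched once, iterates num_steps times,
// using grid.sync() as the device-wide barrier between Jacobi sweeps.
// Solves -u'' = f on a 1D grid with fixed (Dirichlet) end values.
__global__ void persistent_jacobi_1d(const float* __restrict__ f,
                                     float* x, float* x_new,
                                     int n, float h2, int num_steps)
{
    cg::grid_group grid = cg::this_grid();
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Carry the fixed boundary values into the second buffer so the
    // pointer swap below keeps them intact.
    if (tid == 0) { x_new[0] = x[0]; x_new[n - 1] = x[n - 1]; }
    grid.sync();

    for (int step = 0; step < num_steps; ++step) {
        for (int i = tid + 1; i < n - 1; i += gridDim.x * blockDim.x)
            x_new[i] = 0.5f * (x[i - 1] + x[i + 1] + h2 * f[i]);  // Jacobi update
        grid.sync();                        // device-wide barrier instead of relaunch
        float* tmp = x; x = x_new; x_new = tmp;   // swap roles for the next sweep
    }
    // After an even num_steps the latest iterate is in the buffer the host
    // passed as x; after an odd num_steps it is in the buffer passed as x_new.
}
```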
... as described by the authors in [13]. Algorithm 1 presents the vectorial version of the Jacobi algorithm [13]. The parallel version consists in computing all operations locally and then exchanging the local residuals between cooperating processors in order to compute the global convergence. ...
... The authors presented in [13] an original Jacobi implementation that helps optimize the solution of sparse linear systems on GPU. ...
Preprint
Full-text available
In this paper, we present, evaluate and analyse the performance of parallel synchronous Jacobi algorithms with different partitioning procedures, including band-row splitting, band-row sparsity pattern splitting and substructuring splitting, when solving large sparse linear systems. Numerical experiments performed on a set of academic 3D Laplace equations and on real gravity matrices arising from the Chicxulub crater are exhibited, and show the impact of the splitting on parallel synchronous iterations when solving large sparse linear systems. The numerical results clearly show the interest of substructuring methods compared to band-row splitting strategies.
... sparse matrix-vector multiplication (SpMV) is fundamental to many scientific, engineering, and other applications [1][2][3][4][5][6][7][8][9]. These include web ranking [10,11], communication and networked systems [12], finding steady-state and transient solutions of Markov chains [13,14], and many others [1,2,15]. ...
Article
Full-text available
Graphics processing units (GPUs) have delivered a remarkable performance for a variety of high performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector multiplication (SpMV), which is central to many scientific, engineering, and other applications including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices due to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, npr variance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Subsequently, based on the deeper insights gained through the detailed performance analysis, we propose a technique called the heterogeneous CPU-GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and provides better performance over the HYB format by an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work where the SpMV performance on GPUs has been discussed in such depth. We believe that this work on SpMV performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field in the future.
... To further improve performance, they develop asynchronous implementations which remove the need for complete synchronization each step, leading to a 1.2 to 2.5 times speedup over their previously optimal implementation [12]. Ahamed and Magoulès develop a Krylov-like version of Jacobi iteration and apply it to several three dimensional problems (Laplace equation, gravitational potential equation, heat equation) on the GPU, observing a 23 times speedup using this new version compared to the serial CPU-based algorithm [13]. ...
Preprint
High fidelity scientific simulations modeling physical phenomena typically require solving large linear systems of equations which result from discretization of a partial differential equation (PDE) by some numerical method. This step often takes a vast amount of computational time to complete, and therefore presents a bottleneck in simulation work. Solving these linear systems efficiently requires the use of massively parallel hardware with high computational throughput, as well as the development of algorithms which respect the memory hierarchy of these hardware architectures to achieve high memory bandwidth. In this paper, we present an algorithm to accelerate Jacobi iteration for solving structured problems on graphics processing units (GPUs) using a hierarchical approach in which multiple iterations are performed within on-chip shared memory every cycle. A domain decomposition style procedure is adopted in which the problem domain is partitioned into subdomains whose data is copied to the shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory, avoiding the need to perform expensive global memory accesses every step. We test our algorithm on the linear systems arising from discretization of Poisson's equation in 1D and 2D, and observe speedup in convergence using our shared memory approach compared to a traditional Jacobi implementation which only uses global memory on the GPU. We observe an 8x speedup in convergence in the 1D problem and a nearly 6x speedup in the 2D case from the use of shared memory compared to a conventional GPU approach.
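A hedged 1D sketch of the block-local idea described above, assuming each block stages its subdomain plus a one-cell halo in shared memory and runs several Jacobi sweeps there with the halo held fixed before writing back; names and details are illustrative, not the authors' code.

```cuda
// Block-local Jacobi for -u'' = f in 1D: stage the block's points plus a
// one-cell halo in shared memory, run inner_iters sweeps locally, write back.
// Launch with dynamic shared memory of (blockDim.x + 2) * sizeof(float).
__global__ void jacobi_1d_shared(const float* __restrict__ f,
                                 const float* __restrict__ x_in,
                                 float* __restrict__ x_out,
                                 int n, float h2, int inner_iters)
{
    extern __shared__ float s[];                       // blockDim.x + 2 cells
    int i  = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int li = threadIdx.x + 1;                          // shifted local index

    // Stage this block's subdomain and its halo.
    s[li] = (i < n) ? x_in[i] : 0.0f;
    if (threadIdx.x == 0)
        s[0] = (i > 0) ? x_in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        s[li + 1] = (i + 1 < n) ? x_in[i + 1] : 0.0f;
    __syncthreads();

    // Several Jacobi sweeps entirely in shared memory (halo values frozen).
    float xi = s[li];
    for (int it = 0; it < inner_iters; ++it) {
        float xn = xi;
        if (i > 0 && i < n - 1)
            xn = 0.5f * (s[li - 1] + s[li + 1] + h2 * f[i]);
        __syncthreads();                               // finish reads of s
        s[li] = xn;
        xi = xn;
        __syncthreads();                               // publish the update
    }
    if (i < n) x_out[i] = xi;                          // one global write per point
}
```

Because the halo is frozen during the inner sweeps, information still crosses block boundaries only once per outer cycle, which is the usual trade-off of this kind of hierarchical scheme.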
... Ahamed and Magoulès [37] proposed the first GPU-based algorithm of the Jacobi solver for the solution of sparse linear systems of equations. They performed experiments on 3D problems such as the 3D Laplace equation and the 3D heat equation, and achieved 23 times speedups with an NVIDIA Tesla K20c GPU. ...
Article
Full-text available
In this manuscript, variants of Jacobi solver implementation on general purpose graphical processing units (GPGPU) have been purposed and compared. During this work, parallel implementation of finite element method (FEM) using Poisson’s equation on shared memory architecture as well as on GPGPUs has been observed to identify computationally most expensive part of FEM software, which is linear algebra Jacobi solver. Sparse matrices were used for system of linear equations. Nine implementations of Jacobi solver have been developed and compared using various synchronization and computation methods like atomicAdd, atomicAdd_block, butterfly communication, grid synchronization, hybrid and whole GPU based computation methods, respectively. Experiments have showed that Jacobi implementations based on our implemented Butterfly communication method have outperformed CUDA 10.0 provided critical execution methods like atomicAdd, atomicAdd_block and grid methods. The GPU has achieved a max speedup of 46 times using GTX 1060 and 60 times using Quadro P4000 with double precision computations when compared with sequential implementation on Core-i7 8750H. All the developments were performed using C/C++ GNU compiler 7.3.0 on Ubuntu 18.04 and CUDA 10.0.