Figure - available from: The Journal of Supercomputing
Example of the non-zero pattern of a sparse matrix A. (a) Example matrix A, nnz first non-zero on the row. (b) Non-zero pattern of matrix A

Source publication
Article
Full-text available
In this paper, an original Jacobi implementation is considered for the solution of sparse linear systems of equations. The proposed algorithm helps to optimize the parallel implementation on GPU. The performance of the GPU-based (CUDA) implementation of this algorithm is analyzed and compared to the corresponding serial CPU-based alg...

Similar publications

Article
Full-text available
This paper introduces a new fast algorithm for the 8-point discrete cosine transform (DCT) based on the summation-by-parts formula. The proposed method converts the DCT matrix into an alternative transformation matrix that can be decomposed into sparse matrices of low multiplicative complexity. The method is capable of scaled and exact DCT computat...
Article
Full-text available
The paper describes the current state of research on a new iterative method for solving systems of linear algebraic equations. The method is suitable for extremely large systems with sparse matrices. In addition to its own characteristics, it also has a feature of generality, as many iterative methods are only special cases of this approach. The algorithm was dev...
Article
Full-text available
Finite Element Method (FEM) is a well-developed method to solve real-world problems that can be modeled with differential equations. As the available computational power increases, complex and large size problems can be solved using FEM which typically involves multiple degrees of freedom (DOF) per node, high order of elements and an iterative solv...
Article
Cross-view data are collected from two different views or sources about the same subjects. As the information from these views often consolidate and/or complement each other, cross-view data analysis can gain more insights for decision making. A main challenge of cross-view data analysis is how to effectively explore the inherently correlated and h...

Citations

... Sparse matrix-vector multiplication (SpMV) is a commonly used operation in computer science and numerical computation with a wide range of applications in many fields [1,2], such as large-scale linear systems [3], graph analytics [4,5], machine learning [6], and so on. SpMV is often the performance bottleneck of these applications, so it is important to study how to improve the performance of SpMV. ...
Article
Full-text available
Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms is widely studied, where the access behavior of GPU is often the performance bottleneck. The Ampere GPU architecture, recently introduced by NVIDIA, provides a new asynchronous memory copy instruction, memcpy_async, for more efficient data movement in shared memory. Leveraging the capability of this new memcpy_async instruction, we first propose the CSR-Partial-Overlap to carefully overlap the data copy from global memory to shared memory and computation, allowing us to take full advantage of the data transfer time. In addition, we design the dynamic batch partition and the dynamic threads distribution to achieve effective load balancing, avoid the overhead of fixing up partial sums, and improve thread utilization. Furthermore, we propose the CSR-Full-Overlap based on the CSR-Partial-Overlap, which takes the overlap of data transfer from host to device and SpMV kernel execution into account as well. The CSR-Full-Overlap unifies the two major overlaps in SpMV and hides the computation as much as possible in the two important access behaviors of the GPU. This allows CSR-Full-Overlap to achieve the best performance gains from both overlaps. As far as we know, this paper is the first in-depth study of how memcpy_async can be potentially applied to help accelerate SpMV computation on GPU platforms. We compare CSR-Full-Overlap to the current state-of-the-art cuSPARSE, where our experimental results show an average 2.03x performance gain and up to 2.67x performance gain.
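For orientation, the sketch below shows a minimal scalar-CSR SpMV kernel (one thread per row) that illustrates the CSR access pattern this work builds on; it is a baseline illustration only, not the CSR-Partial-Overlap/CSR-Full-Overlap kernels or their memcpy_async staging, and all identifiers are chosen for the example.

```cuda
// Baseline scalar-CSR SpMV: y = A * x, one thread per matrix row.
// row_ptr has n_rows + 1 entries; the non-zeros of row r live in
// values[row_ptr[r] .. row_ptr[r+1]) with column indices in col_idx.
__global__ void spmv_csr_scalar(int n_rows,
                                const int* __restrict__ row_ptr,
                                const int* __restrict__ col_idx,
                                const float* __restrict__ values,
                                const float* __restrict__ x,
                                float* __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += values[k] * x[col_idx[k]];   // gather from x via column index
        y[row] = sum;
    }
}
```

The irregular, column-index-driven gathers from x are what make the GPU memory access behavior the bottleneck that the overlapping schemes above target.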
... The presence of several valleys inside the Si band structure, plus the confinement due to the oxide layers, means that we have as many Boltzmann Transport equations (1) as valley-subband pairs. The scattering operator Q_{ν,p}[·] describes the electron-phonon interactions and s(w) is a given function due to the change of variables in the impulsion space. Refer to [27,38] for the details about these terms. ...
... A classical Jacobi iteration can be rewritten in vector form [1] in order to exploit the matrix-vector product operation: x^{k+1} = x^{k} + D^{-1}(b - A x^{k}), where D denotes the diagonal of A. A significant acceleration of the Jacobi algorithm can be obtained by applying the Scheduled Relaxation Jacobi (SRJ) method. The SRJ method extends the classical Jacobi method by introducing P different relaxation factors ω_i > 0, i = 1, …, P. ...
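As a concrete reading of the vector form quoted above, here is a minimal sketch of one relaxed Jacobi sweep on a CSR matrix, x_new = x + ω D^{-1}(b - A x); in an SRJ run the host would cycle ω through the P scheduled factors ω_1, …, ω_P between sweeps. The kernel and its parameter names are illustrative assumptions, not the cited authors' code.

```cuda
// One relaxed (weighted) Jacobi sweep on a CSR matrix:
//   x_new = x + omega * D^{-1} * (b - A x)
// diag holds the diagonal entries of A; omega is the relaxation factor
// for this sweep (omega = 1 recovers the classical Jacobi iteration).
__global__ void jacobi_relaxed_sweep(int n_rows,
                                     const int* __restrict__ row_ptr,
                                     const int* __restrict__ col_idx,
                                     const float* __restrict__ values,
                                     const float* __restrict__ diag,
                                     const float* __restrict__ b,
                                     const float* __restrict__ x,
                                     float* __restrict__ x_new,
                                     float omega)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float ax = 0.0f;                              // (A x)_row
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            ax += values[k] * x[col_idx[k]];
        float r = b[row] - ax;                        // residual component
        x_new[row] = x[row] + omega * r / diag[row];  // relaxed update
    }
}
```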
Article
Full-text available
A previous study by Mantas and Vecil (Int J High Perform Comput Appl 34(1): 81–102, 2019) describes an efficient and accurate solver for nanoscale DG MOSFETs through a deterministic Boltzmann-Schrödinger-Poisson model with seven electron–phonon scattering mechanisms on a hybrid parallel CPU/GPU platform. The transport computational phase, i.e. the time integration of the Boltzmann equations, was ported to the GPU using CUDA extensions, but the computation of the system’s eigenstates, i.e. the solution of the Schrödinger-Poisson block, was parallelized only using OpenMP due to its complexity. This work fills the gap by describing a port to GPU for the solver of the Schrödinger-Poisson block. This new proposal implements on GPU a Scheduled Relaxation Jacobi method to solve the sparse linear systems which arise in the 2D Poisson equation. The 1D Schrödinger equation is solved on GPU by adapting a multi-section iteration and the Newton-Raphson algorithm to approximate the energy levels, and the Inverse Power Iterative Method is used to approximate the wave vectors. We want to stress that this solver for the Schrödinger-Poisson block can be thought of as a module independent of the transport phase (Boltzmann) and can be used for solvers using different levels of description for the electrons; therefore, it is of particular interest because it can be adapted to other macroscopic, hence faster, solvers for confined devices exploited at the industrial level.
... The problem is discretized with P1 finite elements with a stabilization parameter obtained from [5], and the associated linear system of equations is solved with the Alinea library [16]. The problem is reformulated with a domain decomposition method [17] and is solved on a Graphics Processing Unit (GPU) [15], [13]. The local solver inside each subdomain is performed with the conjugate gradient method [14], involving an auto-tuning of the GPU memory [18]. ...
Preprint
Full-text available
In this paper, we explore point-cloud based deep learning models to analyze numerical simulations arising from finite element analysis. The objective is to automatically classify the results of the simulations without tedious human intervention. Two models are presented here: the Point-Net classification model and the Dynamic Graph Convolutional Neural Net model. Both trained point-cloud deep learning models performed well on experiments with finite element analyses arising from the automotive industry. The proposed models show promise in automating the analysis process of finite element simulations. Accuracies of 79.17% and 94.5% are obtained for the Point-Net and the Dynamic Graph Convolutional Neural Net models, respectively.
... The linear system of equations is solved with the Alinea library [12]. The problem is reformulated in parallel with a domain decomposition method [13] and is solved on GPU [11], [9]. The local solver inside each subdomain is performed with the conjugate gradient method [10], involving an auto-tuning of the GPU memory [14]. ...
Preprint
Full-text available
Many Partial Differential Equations (PDEs) do not have an analytical solution and can only be solved by numerical methods. In this context, Physics-Informed Neural Networks (PINN) have become important in the last decades, since they use a neural network together with physical conditions to approximate any function. This paper focuses on the hypertuning of a PINN used to solve a PDE. The behavior of the approximated solution when we change the learning rate or the activation function (sigmoid, hyperbolic tangent, GELU, ReLU and ELU) is analyzed here. A comparative study is done to determine the best characteristics for the problem, as well as to find a learning rate that allows fast and satisfactory learning. The GELU and hyperbolic tangent activation functions exhibit better performance than the other activation functions. A suitable choice of the learning rate results in higher accuracy and faster convergence.
... x is the solution vector [24]. ...
Article
Full-text available
The Jacobi iterative algorithm has the characteristic of low computational load, and multiple components of the solution can be solved independently. This paper applies these characteristics to the ternary optical computer, which can be used for parallel optimization because it has a large number of data bits and reconfigurable processor bits. Therefore, a new parallel design scheme is constructed to solve the problem of slow efficiency in solving large linear equations, and elaborate experiments are used to verify it. The experimental method is to simulate the calculation on the ternary optical computer experimental platform. Then, the resource consumption is numerically calculated and summarized to measure the feasibility of the parallel design. Eventually, the results show that the parallel design has obvious advantages in computing speed. The Jacobi iterative algorithm is optimized in parallel on a ternary optical processor for the first time. There are two parallel highlights of the scheme. First, the n components are calculated in full parallel. Second, the modified signed-digit (MSD) multiplier based on the minimum module and the one-step MSD adder are used to calculate each component to eliminate the impact of large amounts of data on calculation time. The research provides a new method for the fast solution of large linear equations.
... More than half the systems on the Top500 [1] list include discrete GPUs and seven of the systems in the top ten are GPU-accelerated (November 2021 list). As a result, extensive efforts went into optimizing iterative methods for GPUs, for instance: iterative stencils [2]- [5] used widely in numerical solvers for PDEs, iterative stationary methods for solving systems of linear equations (ex: Jacobi [6], [7], Gauss-Seidel method [7]- [9]), iterative Krylov subspace methods for solving systems of linear equations (ex: conjugate gradient [10], [11], BiCG [11], [12], and GMRES [11], [13]). ...
... T_gm(D) = A_gm(D) · S(type) / B_gm (6) In the case when the kernel is bounded by the shared memory bandwidth, i.e., the volume of data cached in shared memory moves the bottleneck to the shared memory bandwidth, the total shared memory (in bytes) accessed, A_sm, becomes: ...
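A minimal host-side sketch of the memory-bound time model in Eq. (6) above; the function and argument names are assumptions made for illustration (A_gm counts elements accessed in global memory, S is the element size in bytes, B_gm is the global memory bandwidth in bytes per second).

```cuda
// T_gm(D) = A_gm(D) * S(type) / B_gm : time a kernel needs if it is limited
// purely by global memory traffic over the domain D.
double time_global_memory_bound(double accessed_elements_gm,     // A_gm(D)
                                double element_size_bytes,       // S(type)
                                double bandwidth_gm_bytes_per_s) // B_gm
{
    return accessed_elements_gm * element_size_bytes / bandwidth_gm_bytes_per_s;
}
```

The analogous expression with A_sm and the shared memory bandwidth presumably gives the shared-memory-bound time referred to at the end of the excerpt.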
... Accordingly, we do not use those new features in our PERKS implementations. [16], [35] Benchmarks (stencil order, FLOPs/cell): 2d5pt (1, 10), 2ds9pt (2, 18), 2d13pt (3, 26), 2d17pt (4, 34), 2d21pt (5, 42), 2ds25pt (6, 59), 2d9pt (1, 18), 2d25pt (2, 50), 3d7pt (1, 14), 3d13pt (2, 26), 3d17pt (1, 34), 3d27pt (1, 54), poisson (1, 38). The experimental results presented here are evaluated on the two latest generations of Nvidia GPUs: Volta V100 and Ampere A100 with CUDA 11.5 and driver version 495.29.05. ...
Preprint
Full-text available
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this scheme the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output in each time step in registers and shared memory to be used as input for the following time step. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geometric mean speedup of 2.29x in small domains and 1.53x in large domains), and a Krylov subspace solver (geometric mean speedup of 4.67x in smaller SpMV datasets from SuiteSparse and 1.39x in larger SpMV datasets, for conjugate gradient).
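A hedged sketch of the PERKS idea, assuming a 1D Jacobi sweep for the Poisson problem as the iterative body: the time loop lives inside a single persistent kernel and a cooperative-groups device-wide barrier replaces per-step kernel relaunches. The kernel must be launched with cudaLaunchCooperativeKernel; all names are illustrative, not the authors' implementation (which additionally caches part of the solution in registers/shared memory between steps).

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Persistent-kernel time loop: launched once, iterates num_steps times,
// using grid.sync() as the device-wide barrier between Jacobi sweeps.
// Solves -u'' = f on a 1D grid with fixed (Dirichlet) end values.
__global__ void persistent_jacobi_1d(const float* __restrict__ f,
                                     float* x, float* x_new,
                                     int n, float h2, int num_steps)
{
    cg::grid_group grid = cg::this_grid();
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Carry the fixed boundary values into the second buffer so the
    // pointer swap below keeps them intact.
    if (tid == 0) { x_new[0] = x[0]; x_new[n - 1] = x[n - 1]; }
    grid.sync();

    for (int step = 0; step < num_steps; ++step) {
        for (int i = tid + 1; i < n - 1; i += gridDim.x * blockDim.x)
            x_new[i] = 0.5f * (x[i - 1] + x[i + 1] + h2 * f[i]);  // Jacobi update
        grid.sync();                        // device-wide barrier instead of relaunch
        float* tmp = x; x = x_new; x_new = tmp;   // swap roles for the next sweep
    }
    // After an even num_steps the latest iterate is in the buffer the host
    // passed as x; after an odd num_steps it is in the buffer passed as x_new.
}
```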
... as described by the authors in [13]. Algorithm 1 presents the vectorial version of the Jacobi algorithm [13]. The parallel version consists in computing all operations locally and then exchanging the local residuals between cooperating processors in order to compute the global convergence. ...
... The authors presented in [13] an original Jacobi implementation that helps optimize the solution of sparse linear systems on GPU. ...
Preprint
Full-text available
In this paper, we present, evaluate and analyse the performance of parallel synchronous Jacobi algorithms with different partitioning procedures, including band-row splitting, band-row sparsity pattern splitting and substructuring splitting, when solving large sparse linear systems. Numerical experiments performed on a set of academic 3D Laplace equations and on real gravity matrices arising from the Chicxulub crater are exhibited, and show the impact of the splitting on parallel synchronous iterations when solving large sparse linear systems. The numerical results clearly show the interest of substructuring methods compared to band-row splitting strategies.
... sparse matrix-vector multiplication (SpMV) is fundamental to many scientific, engineering, and other applications [1][2][3][4][5][6][7][8][9]. These include web ranking [10,11], communication and networked systems [12], finding steady-state and transient solutions of Markov chains [13,14], and many others [1,2,15]. ...
Article
Full-text available
Graphics processing units (GPUs) have delivered a remarkable performance for a variety of high performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector multiplication (SpMV), which is central to many scientific, engineering, and other applications including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices due to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, npr variance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Subsequently, based on the deeper insights gained through the detailed performance analysis, we propose a technique called the heterogeneous CPU-GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and provides better performance over the HYB format by an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work where the SpMV performance on GPUs has been discussed in such depth. We believe that this work on SpMV performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field in the future.
... To further improve performance, they develop asynchronous implementations which remove the need for complete synchronization each step, leading to a 1.2 to 2.5 times speedup over their previously optimal implementation [12]. Ahamed and Magoulès develop a Krylov-like version of Jacobi iteration and apply it to several three dimensional problems (Laplace equation, gravitational potential equation, heat equation) on the GPU, observing a 23 times speedup using this new version compared to the serial CPU-based algorithm [13]. ...
Preprint
High fidelity scientific simulations modeling physical phenomena typically require solving large linear systems of equations which result from discretization of a partial differential equation (PDE) by some numerical method. This step often takes a vast amount of computational time to complete, and therefore presents a bottleneck in simulation work. Solving these linear systems efficiently requires the use of massively parallel hardware with high computational throughput, as well as the development of algorithms which respect the memory hierarchy of these hardware architectures to achieve high memory bandwidth. In this paper, we present an algorithm to accelerate Jacobi iteration for solving structured problems on graphics processing units (GPUs) using a hierarchical approach in which multiple iterations are performed within on-chip shared memory every cycle. A domain decomposition style procedure is adopted in which the problem domain is partitioned into subdomains whose data is copied to the shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory, avoiding the need to perform expensive global memory accesses every step. We test our algorithm on the linear systems arising from discretization of Poisson's equation in 1D and 2D, and observe speedup in convergence using our shared memory approach compared to a traditional Jacobi implementation which only uses global memory on the GPU. We observe an 8x speedup in convergence in the 1D problem and a nearly 6x speedup in the 2D case from the use of shared memory compared to a conventional GPU approach.
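A hedged 1D sketch of the block-local idea described above, assuming each block stages its subdomain plus a one-cell halo in shared memory and runs several Jacobi sweeps there with the halo held fixed before writing back; names and details are illustrative, not the authors' code.

```cuda
// Block-local Jacobi for -u'' = f in 1D: stage the block's points plus a
// one-cell halo in shared memory, run inner_iters sweeps locally, write back.
// Launch with dynamic shared memory of (blockDim.x + 2) * sizeof(float).
__global__ void jacobi_1d_shared(const float* __restrict__ f,
                                 const float* __restrict__ x_in,
                                 float* __restrict__ x_out,
                                 int n, float h2, int inner_iters)
{
    extern __shared__ float s[];                       // blockDim.x + 2 cells
    int i  = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int li = threadIdx.x + 1;                          // shifted local index

    // Stage this block's subdomain and its halo.
    s[li] = (i < n) ? x_in[i] : 0.0f;
    if (threadIdx.x == 0)
        s[0] = (i > 0) ? x_in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        s[li + 1] = (i + 1 < n) ? x_in[i + 1] : 0.0f;
    __syncthreads();

    // Several Jacobi sweeps entirely in shared memory (halo values frozen).
    float xi = s[li];
    for (int it = 0; it < inner_iters; ++it) {
        float xn = xi;
        if (i > 0 && i < n - 1)
            xn = 0.5f * (s[li - 1] + s[li + 1] + h2 * f[i]);
        __syncthreads();                               // finish reads of s
        s[li] = xn;
        xi = xn;
        __syncthreads();                               // publish the update
    }
    if (i < n) x_out[i] = xi;                          // one global write per point
}
```

Because the halo is frozen during the inner sweeps, information still crosses block boundaries only once per outer cycle, which is the usual trade-off of this kind of hierarchical scheme.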
... Ahamed and Magoulès [37] proposed the first GPU-based algorithm of the Jacobi solver for the solution of sparse linear systems of equations. They performed experiments on 3D problems such as the 3D Laplace equation and the 3D heat equation, and achieved 23 times speedups with an NVIDIA Tesla K20c GPU. ...
Article
Full-text available
In this manuscript, variants of Jacobi solver implementation on general purpose graphical processing units (GPGPU) have been purposed and compared. During this work, parallel implementation of finite element method (FEM) using Poisson’s equation on shared memory architecture as well as on GPGPUs has been observed to identify computationally most expensive part of FEM software, which is linear algebra Jacobi solver. Sparse matrices were used for system of linear equations. Nine implementations of Jacobi solver have been developed and compared using various synchronization and computation methods like atomicAdd, atomicAdd_block, butterfly communication, grid synchronization, hybrid and whole GPU based computation methods, respectively. Experiments have showed that Jacobi implementations based on our implemented Butterfly communication method have outperformed CUDA 10.0 provided critical execution methods like atomicAdd, atomicAdd_block and grid methods. The GPU has achieved a max speedup of 46 times using GTX 1060 and 60 times using Quadro P4000 with double precision computations when compared with sequential implementation on Core-i7 8750H. All the developments were performed using C/C++ GNU compiler 7.3.0 on Ubuntu 18.04 and CUDA 10.0.