Figure 3
Graphical representation with dependencies of one repetition of the outer loop in Algorithm 2 with $N_T = 3$.


Source publication
Conference Paper
Full-text available
We describe an efficient and innovative parallel tiled algorithm for solving symmetric indefinite systems on multicore architectures. This solver avoids pivoting by using a multiplicative preconditioning based on symmetric randomization. This randomization prevents the communication overhead due to pivoting, is computationally inexpensive and requi...

Contexts in source publication

Context 1
... a symmetric matrix $A$ of size $N \times N$, $N_T$ as the number of tiles, such as in Equation (9), and making the assumption that $N = N_T \times N_B$ (for simplicity), where $N_B \times N_B$ is the size of each tile $A_{ij}$, then the tiled $LDL^T$ algorithm can be described as in Algorithm 2. A graphical representation of Algorithm 2 is depicted in Figure 3. ...
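To make the loop structure concrete, here is a minimal NumPy sketch of a tiled $LDL^T$ factorization of this form. It is an illustrative reconstruction, not the PLASMA code: the tile kernels are dense stand-ins for the factor/solve/update tasks whose dependencies Figure 3 shows, and `dense_ldlt` assumes every pivot is nonzero (which the randomized preconditioning is designed to make likely).

```python
import numpy as np

def dense_ldlt(S):
    """Unblocked LDL^T of one tile, no pivoting (assumes nonzero pivots)."""
    n = S.shape[0]
    L, d, S = np.eye(n), np.zeros(n), S.astype(float)
    for j in range(n):
        d[j] = S[j, j]
        L[j+1:, j] = S[j+1:, j] / d[j]
        S[j+1:, j+1:] -= d[j] * np.outer(L[j+1:, j], L[j+1:, j])
    return L, d

def tiled_ldlt(A, NB):
    """In-place tiled LDL^T: on return the lower tiles of A hold L and
    D holds the diagonal of D. Assumes N = NT * NB, as in the text;
    only the lower tiles of A are read and written."""
    N = A.shape[0]
    NT = N // NB
    tile = lambda i, j: A[i*NB:(i+1)*NB, j*NB:(j+1)*NB]   # view of A_ij
    D = np.zeros(N)
    for k in range(NT):                     # outer loop of Algorithm 2
        Lkk, Dk = dense_ldlt(tile(k, k))    # factor diagonal tile A_kk
        tile(k, k)[:] = np.tril(Lkk)
        D[k*NB:(k+1)*NB] = Dk
        for i in range(k+1, NT):            # panel: L_ik = A_ik (D_k L_kk^T)^{-1}
            tile(i, k)[:] = np.linalg.solve(Lkk * Dk, tile(i, k).T).T
        for i in range(k+1, NT):            # trailing update, lower tiles only:
            for j in range(k+1, i+1):       # A_ij -= L_ik D_k L_jk^T
                tile(i, j)[:] -= (tile(i, k) * Dk) @ tile(j, k).T
    return A, D
```

For $N_T = 3$, the first outer iteration consists of one diagonal-tile factorization, two panel solves, and three trailing-tile updates; these are the three task types whose dependency pattern Figure 3 depicts.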
Context 2
... row in the execution flow shows which tasks are performed, and each task is executed by one of the threads involved in the factorization. The trace follows the same color code as Figure 3. Figure 5(b) shows the trace of Algorithm 2 using static scheduling, which means that each core's workload is predetermined. ...

Similar publications

Conference Paper
Full-text available
The abundant availability of multi-core computers makes "parallel computers" commonplace, and teaching Computer Science students to design and develop parallel algorithms is an urgent task. Most students recognize the need to develop skills in parallel programming. However, since their Computer Science curricula are mostly ta...

Citations

... We recall here some of the standard algorithms from the BLAS3 [7] and LAPACK [2] interfaces and generalizations thereof [3], which will be used to define the main block-recursive symmetric elimination algorithm. In addition, we need to introduce the TRSSYR2K routine solving Problem 1. ...
Article
We present a novel recursive algorithm for reducing a symmetric matrix to a triangular factorization which reveals the rank profile matrix. That is, the algorithm computes a factorization $\mathbf{P}^T\mathbf{A}\mathbf{P} = \mathbf{L}\mathbf{D}\mathbf{L}^T$ where $\mathbf{P}$ is a permutation matrix, $\mathbf{L}$ is lower triangular with a unit diagonal and $\mathbf{D}$ is symmetric block diagonal with $1{\times}1$ and $2{\times}2$ antidiagonal blocks. The novel algorithm requires $O(n^2r^{\omega-2})$ arithmetic operations. Furthermore, experimental results demonstrate that our algorithm can even be slightly more than twice as fast as state-of-the-art unsymmetric Gaussian elimination in most cases; that is, accounting for the roughly halved operation count of the symmetric case, it achieves approximately the same computational speed. By adapting the pivoting strategy developed in the unsymmetric case, we show how to recover the rank profile matrix from the permutation matrix and the support of the block-diagonal matrix. There is an obstruction in characteristic $2$ to revealing the rank profile matrix, which requires relaxing the shape of the block diagonal by allowing the 2-dimensional blocks to have a non-zero bottom-right coefficient. This relaxed decomposition can then be transformed into a standard $\mathbf{P}\mathbf{L}\mathbf{D}\mathbf{L}^T\mathbf{P}^T$ decomposition at a negligible cost.
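For concreteness, the shape of the block-diagonal factor described here can be illustrated on a small instance (hypothetical values, not taken from the paper): a $1{\times}1$ block followed by a $2{\times}2$ antidiagonal block and, on the right, the relaxed characteristic-2 variant whose bottom-right coefficient may be non-zero:

$$\mathbf{D} = \begin{pmatrix} d & 0 & 0 \\ 0 & 0 & a \\ 0 & a & 0 \end{pmatrix}, \qquad \mathbf{D}_{\text{relaxed}} = \begin{pmatrix} d & 0 & 0 \\ 0 & 0 & a \\ 0 & a & b \end{pmatrix}, \quad a \neq 0.$$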
... Though the focus of this paper is on deterministic algorithms with theoretical error bounds, there is growing interest in randomized algorithms [13]. When combined with iterative refinement, these randomized algorithms may compute a solution of the desired accuracy without pivoting, while obtaining high performance on modern computers [14,15,1]. ...
Article
We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of data movement into the memory: one needed for selecting and applying pivots, and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance from such an algorithm when pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry which ensures stability may need to be selected as a pivot among the remaining diagonals and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called the panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would require updating each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization than an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires swapping the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization as in the symmetric factorization). To reduce the amount of data transfer, in this paper we use the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require loading the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization. Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms.
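The iterative refinement mentioned in the citing passage above is straightforward to sketch. The following minimal Python version (an illustration, not code from any of the cited papers) wraps an arbitrary `solve` callback, e.g. a pivoting-free factorization of a randomized matrix, and corrects the iterate using residuals computed in working precision:

```python
import numpy as np

def refine(A, b, solve, tol=1e-12, max_iter=5):
    """Classical iterative refinement: repeatedly solve A*dx = r for the
    current residual r = b - A*x and update x. `solve` wraps a factorization
    (e.g. a pivoting-free LDL^T of a randomized matrix); residuals are
    accumulated in float64 here for simplicity."""
    x = solve(b)
    for _ in range(max_iter):
        r = b - A @ x                  # residual of the current iterate
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                      # converged to the desired accuracy
        x = x + solve(r)               # correct x with the residual solve
    return x
```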
... RBT can be combined with an $LDL^T$ factorization to probabilistically improve the stability of the factorization without pivoting. The performance of RBT has been studied on multicores [8] and distributed-memory systems [6], but its performance has not been investigated on the GPU. This paper is organized as follows. ...
... It is also shown that random butterfly matrices are cheap to store and apply ($O(nd)$ and $O(dn^2)$, respectively). An implementation for the multicore library PLASMA was described in [8]. ...
Conference Paper
We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen's algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there had not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of a symmetric indefinite factorization with no pivoting combined with a preprocessing technique based on Random Butterfly Transformations. Though such transformations provide only probabilistic guarantees of numerical stability, they avoid pivoting and achieve high performance on the GPU.
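A compact sketch of the Random Butterfly Transformation itself may help here. The construction below follows one common convention (random diagonal entries $e^{r/10}$ with $r$ uniform in $[-1/2, 1/2]$, and a depth-$d$ recursive product); it builds $W$ densely for clarity, whereas real implementations store only the random diagonals ($O(nd)$) and apply the transform implicitly in $O(dn^2)$ flops, as noted in the citing text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def butterfly(n):
    """One n x n random butterfly B = (1/sqrt(2)) [[R0, R1], [R0, -R1]]
    with R0, R1 random diagonal (entries e^{r/10}, r in [-1/2, 1/2]);
    n must be even."""
    h = n // 2
    R0 = np.diag(np.exp(rng.uniform(-0.5, 0.5, h) / 10))
    R1 = np.diag(np.exp(rng.uniform(-0.5, 0.5, h) / 10))
    return np.block([[R0, R1], [R0, -R1]]) / np.sqrt(2)

def recursive_butterfly(n, d):
    """Depth-d recursive butterfly W = diag(W1, W2) * B, built densely
    for clarity; real codes keep only the random diagonals (O(nd) storage)
    and apply the product implicitly in O(d n^2) flops."""
    if d == 1:
        return butterfly(n)
    h = n // 2
    Z = np.zeros((h, h))
    D = np.block([[recursive_butterfly(h, d - 1), Z],
                  [Z, recursive_butterfly(h, d - 1)]])
    return D @ butterfly(n)

# Two-sided randomization of a symmetric system A x = b:
#   factor A_r = W^T A W without pivoting, solve A_r y = W^T b, set x = W y.
n, d = 8, 2                              # n must be divisible by 2^d
A = rng.standard_normal((n, n)); A = A + A.T
W = recursive_butterfly(n, d)
A_r = W.T @ A @ W                        # randomized matrix, factor w/o pivoting
```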
... For matrix multiplications like $J^T B_d J$ and the Cholesky factorization needed to solve eqs (11) and (15), we use the PLASMA library (Buttari et al. 2009; Baboulin et al. 2011). PLASMA is a linear algebra library for dense matrices, parallelized for shared-memory machines (such as the SMP unit we use). ...
Article
Following the creation described in Part I of a deformable edge finite-element simulator for 3-D magnetotelluric (MT) responses using direct solvers, in Part II we develop an algorithm named HexMT for 3-D regularized inversion of MT data including topography. Direct solvers parallelized on large-RAM, symmetric multiprocessor (SMP) workstations are used also for the Gauss-Newton model update. By exploiting the data-space approach, the computational cost of the model update becomes much less in both time and computer memory than the cost of the forward simulation. In order to regularize using the second norm of the gradient, we factor the matrix related to the regularization term and apply its inverse to the Jacobian, which is done using the MKL PARDISO library. For dense matrix multiplication and factorization related to the model update, we use the PLASMA library, which shows very good scalability across processor cores. A synthetic test inversion using a simple hill model shows that including topography can be important; in this case depression of the electric field by the hill can cause false conductors at depth or mask the presence of resistive structure. With a simple model of two buried bricks, a uniform spatial weighting for the norm of model smoothing recovered more accurate locations for the tomographic images compared to weightings which were a function of parameter Jacobians. We implement joint inversion for static distortion matrices tested using the Dublin secret model 2, for which we are able to reduce nRMS to ~1.1 while avoiding oscillatory convergence. Finally we test the code on field data by inverting full impedance and tipper MT responses collected around Mount St Helens in the Cascade volcanic chain. Among several prominent structures, the north-south trending, eruption-controlling shear zone is clearly imaged in the inversion.
... Future work includes support for more architectures like the Intel Xeon Phi, along with work on new algorithms that provide good performance but are not yet available in numerical libraries, such as randomized algorithms [53], [54] or communication-avoiding algorithms [55] for dense linear systems. Moving to sparse problems is also a possibility, where libraries like Cusp [56] or VexCL [57] provide an interesting approach. ...
Conference Paper
Full-text available
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries and combining them with architectural features, we developed a generic approach to solve dense linear systems on various architectures including CPU and GPU. We have then extended our work with an example of a least squares solver based on semi-normal equations in mixed precision that cannot be found in current libraries. For the automatically generated solvers, we report performance comparisons with state-of-the-art codes, and show that it is possible to obtain a generic code with a high-level interface (similar to MATLAB) which runs either on CPU or GPU without generating a significant overhead.
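As an illustration of the semi-normal-equations solver mentioned in this abstract, here is a minimal sketch (hypothetical code, not NT2): it keeps only the triangular factor $R$ of $A = QR$, solves $R^T R x = A^T b$, and performs one correction step, with the mixed-precision aspect reduced to a float32 factorization and float64 residuals:

```python
import numpy as np
from scipy.linalg import solve_triangular

def seminormal_lstsq(A, b):
    """Least squares min ||Ax - b|| via corrected semi-normal equations:
    solve R^T R x = A^T b with R from A = QR (Q is never formed), then do
    one refinement step. The float32 factor / float64 residual split is a
    toy stand-in for the mixed-precision scheme described above."""
    R = np.linalg.qr(A.astype(np.float32), mode='r')   # keep only R

    def solve_semi(rhs):
        y = solve_triangular(R.T, rhs, lower=True)     # R^T y = rhs
        return solve_triangular(R, y, lower=False)     # R x = y

    x = solve_semi(A.T @ b)
    r = b - A @ x                                      # residual in float64
    return x + solve_semi(A.T @ r)                     # CSNE correction step
```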
... To avoid the cost of pivoting, and therefore improve the performance of the factorization, the Random Butterfly Transformation (RBT) was proposed. This method, first described in [28,29], was recently developed for general systems in [30] and for symmetric indefinite systems in [94,95,96]. Tests performed in [30] on a collection of test matrices showed that in practice two recursions are sufficient to obtain satisfactory accuracy. ...
Article
Full-text available
In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems using hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization, and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first is based on a communication-avoiding pivoting strategy (CALU), while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPU architectures. We also present new solvers based on randomization for hybrid architectures with an Nvidia GPU or an Intel Xeon Phi coprocessor. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allows us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally, we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve the performance when applied to hybrid multicore/GPU solvers.
... Future work includes support for more architectures like the Intel Xeon Phi, along with work on new algorithms that provide good performance but are not yet available in numerical libraries, such as randomized algorithms [5,9] or communication-avoiding algorithms [7] for dense linear systems. Moving to sparse problems is also a possibility, where libraries like Cusp [10] or VexCL [21] provide an interesting approach. ...
Article
Full-text available
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries and combining them with architectural features, we developed a generic approach to solve dense linear systems on various architectures including CPU and GPU. We have then extended our work with an example of a least squares solver based on semi-normal equations in mixed precision that cannot be found in current libraries. For the automatically generated solvers, we report performance comparisons with state-of-the-art codes, showing that it is possible to obtain a generic code with a high-level interface (similar to Matlab) that can run either on CPU or GPU and that does not generate significant overhead.
... This motivates the need for adaptive scheduling, since it may be beneficial to run a kernel on either the CPU or the GPU depending on data-set parameters (to amortize the cost of data transfers to the GPU). However, as we show in [22], [27], [28], the final decision tree is architecture- and kernel-dependent and requires both off-line kernel cloning and some on-line, automatic and ad-hoc modeling of application behavior. ...
Article
Full-text available
Empirical auto-tuning and machine learning techniques have been showing high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving the above problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure makes it possible to preserve and share over the Internet whole auto-tuning setups, with all related artifacts and their software and hardware dependencies, rather than just performance data. It also allows researchers to gradually structure, systematize and describe all available research material, including tools, benchmarks, data sets, search strategies and machine learning models. Researchers can take advantage of shared components and data with extensible meta-descriptions to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques or prototype new ones. The community can now gradually learn and improve the complex behavior of existing computer systems while exposing behavior anomalies or model mispredictions to an interdisciplinary community, in a reproducible way, for further analysis. We present several practical, collaborative and model-driven auto-tuning scenarios. We have also released all material at http://c-mind.org/repo to set an example for collaborative and reproducible research, as well as for our new publication model in computer engineering where experimental results are continuously shared and validated by the community.
... They also showed that random butterfly matrices are cheap to store and to apply ($O(nd)$ and $O(dn^2)$, respectively), and they proposed implementations on hybrid multicore/GPU systems for the unsymmetric case [2]. For the symmetric case, they proposed a tiled algorithm for multicore architectures [3] and, more recently, a distributed solver [4] combined with a runtime system [5]. As was demonstrated, the preprocessing by RBT can be easily parallelized with good scalability. ...
Conference Paper
We consider the solution of sparse linear systems using direct methods via LU factorization. Unless the matrix is positive definite, numerical pivoting is usually needed to ensure stability, which is costly to implement especially in the sparse case. The Random Butterfly Transformations (RBT) technique provides an alternative to pivoting and is easily parallelizable. The RBT transforms the original matrix into another one that can be factorized without pivoting with probability one. This approach has been successful for dense matrices; in this work, we investigate the sparse case. In particular, we address the issue of fill-in in the transformed system.
... An effective approach to reducing the memory-access cost by increasing data locality is the technique called blocksize calculation, or tiling [1,16,17,3], where the loops are split into smaller ranges so that data remain in the cache while they are needed. The difficulty lies in identifying the block size that will lead to the smallest execution time. ...
Article
Full-text available
Although modern supercomputers are composed of multicore machines, scientists still execute legacy applications that were developed for single-core clusters, where the memory hierarchy was dedicated to a single core. The main objective of this paper is to propose and evaluate an algorithm that identifies an efficient blocksize to be applied to MPI stencil computations on multicore machines. In the light of an extensive experimental analysis, this work shows the benefits of identifying blocksizes that divide the data across the various cores, and suggests a methodology that exploits the memory hierarchy available in modern machines.
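The blocking idea can be illustrated with a toy stencil sweep (hypothetical code, not the paper's algorithm; in Python the tiling is for exposition only, since the cache benefit materializes in compiled code). `NB` plays the role of the blocksize the proposed methodology tries to identify:

```python
import numpy as np

def stencil_untiled(u):
    """One Jacobi-style 4-point stencil sweep over the interior."""
    v = u.copy()
    n, m = u.shape
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
    return v

def stencil_tiled(u, NB):
    """Same sweep with the loops split into NB x NB blocks so each block's
    working set stays in cache; NB is the blocksize to be tuned."""
    v = u.copy()
    n, m = u.shape
    for ii in range(1, n - 1, NB):          # block loops
        for jj in range(1, m - 1, NB):
            for i in range(ii, min(ii + NB, n - 1)):   # intra-block loops
                for j in range(jj, min(jj + NB, m - 1)):
                    v[i, j] = 0.25 * (u[i-1, j] + u[i+1, j]
                                      + u[i, j-1] + u[i, j+1])
    return v
```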