Figure 3
Graphical representation with dependencies of one repetition of the outer loop in Algorithm 2 with $N_T = 3$.


Source publication
Conference Paper
Full-text available
We describe an efficient and innovative parallel tiled algorithm for solving symmetric indefinite systems on multicore architectures. This solver avoids pivoting by using a multiplicative preconditioning based on symmetric randomization. This randomization prevents the communication overhead due to pivoting, is computationally inexpensive and requi...

Contexts in source publication

Context 1
... a symmetric matrix $A$ of size $N \times N$, $N_T$ as the number of tiles, such as in Equation (9), and making the assumption that $N = N_T \times N_B$ (for simplicity), where $N_B \times N_B$ is the size of each tile $A_{ij}$, then the tiled $LDL^T$ algorithm can be described as in Algorithm 2. A graphical representation of Algorithm 2 is depicted in Figure 3. ...
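To make the loop structure concrete, here is a minimal NumPy sketch of a tiled $LDL^T$ factorization of this form. It is an illustrative reconstruction, not the PLASMA code: the tile kernels are dense stand-ins for the factor/solve/update tasks whose dependencies Figure 3 shows, and `dense_ldlt` assumes every pivot is nonzero (which the randomized preconditioning is designed to make likely).

```python
import numpy as np

def dense_ldlt(S):
    """Unblocked LDL^T of one tile, no pivoting (assumes nonzero pivots)."""
    n = S.shape[0]
    L, d, S = np.eye(n), np.zeros(n), S.astype(float)
    for j in range(n):
        d[j] = S[j, j]
        L[j+1:, j] = S[j+1:, j] / d[j]
        S[j+1:, j+1:] -= d[j] * np.outer(L[j+1:, j], L[j+1:, j])
    return L, d

def tiled_ldlt(A, NB):
    """In-place tiled LDL^T: on return the lower tiles of A hold L and
    D holds the diagonal of D. Assumes N = NT * NB, as in the text;
    only the lower tiles of A are read and written."""
    N = A.shape[0]
    NT = N // NB
    tile = lambda i, j: A[i*NB:(i+1)*NB, j*NB:(j+1)*NB]   # view of A_ij
    D = np.zeros(N)
    for k in range(NT):                     # outer loop of Algorithm 2
        Lkk, Dk = dense_ldlt(tile(k, k))    # factor diagonal tile A_kk
        tile(k, k)[:] = np.tril(Lkk)
        D[k*NB:(k+1)*NB] = Dk
        for i in range(k+1, NT):            # panel: L_ik = A_ik (D_k L_kk^T)^{-1}
            tile(i, k)[:] = np.linalg.solve(Lkk * Dk, tile(i, k).T).T
        for i in range(k+1, NT):            # trailing update, lower tiles only:
            for j in range(k+1, i+1):       # A_ij -= L_ik D_k L_jk^T
                tile(i, j)[:] -= (tile(i, k) * Dk) @ tile(j, k).T
    return A, D
```

For $N_T = 3$, the first outer iteration consists of one diagonal-tile factorization, two panel solves, and three trailing-tile updates; these are the three task types whose dependency pattern Figure 3 depicts.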
Context 2
... row in the execution flow shows which tasks are performed, and each task is executed by one of the threads involved in the factorization. The trace follows the same color code as Figure 3. Figure 5(b) shows the trace of Algorithm 2 using static scheduling, which means that each core's workload is predetermined. ...

Similar publications

Conference Paper
Full-text available
The abundant availability of multi-core computers makes "parallel computers" commonplace, and teaching Computer Science students to design and develop parallel algorithms is an urgent task. Most students recognize the need to develop skills in parallel programming. However, since their Computer Science curricula are mostly ta...

Citations

... We recall here some of the standard algorithms from the BLAS3 [7] and LAPACK [2] interfaces and generalizations thereof [3], which will be used to define the main block-recursive symmetric elimination algorithm. In addition, we need to introduce the TRSSYR2K routine solving Problem 1. ...
Article
We present a novel recursive algorithm for reducing a symmetric matrix to a triangular factorization which reveals the rank profile matrix. That is, the algorithm computes a factorization $\mathbf{P}^T\mathbf{A}\mathbf{P} = \mathbf{L}\mathbf{D}\mathbf{L}^T$ where $\mathbf{P}$ is a permutation matrix, $\mathbf{L}$ is lower triangular with a unit diagonal and $\mathbf{D}$ is symmetric block diagonal with $1{\times}1$ and $2{\times}2$ antidiagonal blocks. The novel algorithm requires $O(n^2r^{\omega-2})$ arithmetic operations. Furthermore, experimental results demonstrate that our algorithm can even be slightly more than twice as fast as state-of-the-art unsymmetric Gaussian elimination in most cases; that is, accounting for the roughly halved operation count of the symmetric case, it achieves approximately the same computational speed. By adapting the pivoting strategy developed in the unsymmetric case, we show how to recover the rank profile matrix from the permutation matrix and the support of the block-diagonal matrix. There is an obstruction in characteristic $2$ to revealing the rank profile matrix, which requires relaxing the shape of the block diagonal by allowing the 2-dimensional blocks to have a non-zero bottom-right coefficient. This relaxed decomposition can then be transformed into a standard $\mathbf{P}\mathbf{L}\mathbf{D}\mathbf{L}^T\mathbf{P}^T$ decomposition at a negligible cost.
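For concreteness, the shape of the block-diagonal factor described here can be illustrated on a small instance (hypothetical values, not taken from the paper): a $1{\times}1$ block followed by a $2{\times}2$ antidiagonal block and, on the right, the relaxed characteristic-2 variant whose bottom-right coefficient may be non-zero:

$$\mathbf{D} = \begin{pmatrix} d & 0 & 0 \\ 0 & 0 & a \\ 0 & a & 0 \end{pmatrix}, \qquad \mathbf{D}_{\text{relaxed}} = \begin{pmatrix} d & 0 & 0 \\ 0 & 0 & a \\ 0 & a & b \end{pmatrix}, \quad a \neq 0.$$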
... Though the focus of this paper is on deterministic algorithms with theoretical error bounds, there is growing interest in randomized algorithms [13]. When combined with iterative refinement, these randomized algorithms may compute a solution of the desired accuracy without pivoting, while obtaining high performance on modern computers [14,15,1]. ...
Article
We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of data movement into the memory: one needed for selecting and applying pivots, and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance from such an algorithm when pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry which ensures stability may need to be selected as a pivot among the remaining diagonals and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called the panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would require updating each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization than an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires swapping the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization as in the symmetric factorization). To reduce the amount of data transfer, in this paper we use the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require loading the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization. Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms.
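The iterative refinement mentioned in the citing passage above is straightforward to sketch. The following minimal Python version (an illustration, not code from any of the cited papers) wraps an arbitrary `solve` callback, e.g. a pivoting-free factorization of a randomized matrix, and corrects the iterate using residuals computed in working precision:

```python
import numpy as np

def refine(A, b, solve, tol=1e-12, max_iter=5):
    """Classical iterative refinement: repeatedly solve A*dx = r for the
    current residual r = b - A*x and update x. `solve` wraps a factorization
    (e.g. a pivoting-free LDL^T of a randomized matrix); residuals are
    accumulated in float64 here for simplicity."""
    x = solve(b)
    for _ in range(max_iter):
        r = b - A @ x                  # residual of the current iterate
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                      # converged to the desired accuracy
        x = x + solve(r)               # correct x with the residual solve
    return x
```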
... RBT can be combined with an $LDL^T$ factorization to probabilistically improve the stability of the factorization without pivoting. The performance of RBT has been studied on multicores [8] and distributed-memory systems [6], but its performance has not been investigated on the GPU. This paper is organized as follows. ...
... It is also shown that random butterfly matrices are cheap to store and apply ($O(nd)$ and $O(dn^2)$, respectively). An implementation for the multicore library PLASMA was described in [8]. ...
Conference Paper
We study the performance of dense symmetric indefinite factorizations (Bunch-Kaufman and Aasen's algorithms) on multicore CPUs with a Graphics Processing Unit (GPU). Though such algorithms are needed in many scientific and engineering simulations, obtaining high performance of the factorization on the GPU is difficult because the pivoting that is required to ensure the numerical stability of the factorization leads to frequent synchronizations and irregular data accesses. As a result, until recently, there had not been any implementation of these algorithms on hybrid CPU/GPU architectures. To improve their performance on the hybrid architecture, we explore different techniques to reduce the expensive communication and synchronization between the CPU and GPU, or on the GPU. We also study the performance of a symmetric indefinite factorization with no pivoting combined with a preprocessing technique based on Random Butterfly Transformations. Though such transformations provide only probabilistic guarantees of numerical stability, they avoid pivoting and achieve high performance on the GPU.
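A compact sketch of the Random Butterfly Transformation itself may help here. The construction below follows one common convention (random diagonal entries $e^{r/10}$ with $r$ uniform in $[-1/2, 1/2]$, and a depth-$d$ recursive product); it builds $W$ densely for clarity, whereas real implementations store only the random diagonals ($O(nd)$) and apply the transform implicitly in $O(dn^2)$ flops, as noted in the citing text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def butterfly(n):
    """One n x n random butterfly B = (1/sqrt(2)) [[R0, R1], [R0, -R1]]
    with R0, R1 random diagonal (entries e^{r/10}, r in [-1/2, 1/2]);
    n must be even."""
    h = n // 2
    R0 = np.diag(np.exp(rng.uniform(-0.5, 0.5, h) / 10))
    R1 = np.diag(np.exp(rng.uniform(-0.5, 0.5, h) / 10))
    return np.block([[R0, R1], [R0, -R1]]) / np.sqrt(2)

def recursive_butterfly(n, d):
    """Depth-d recursive butterfly W = diag(W1, W2) * B, built densely
    for clarity; real codes keep only the random diagonals (O(nd) storage)
    and apply the product implicitly in O(d n^2) flops."""
    if d == 1:
        return butterfly(n)
    h = n // 2
    Z = np.zeros((h, h))
    D = np.block([[recursive_butterfly(h, d - 1), Z],
                  [Z, recursive_butterfly(h, d - 1)]])
    return D @ butterfly(n)

# Two-sided randomization of a symmetric system A x = b:
#   factor A_r = W^T A W without pivoting, solve A_r y = W^T b, set x = W y.
n, d = 8, 2                              # n must be divisible by 2^d
A = rng.standard_normal((n, n)); A = A + A.T
W = recursive_butterfly(n, d)
A_r = W.T @ A @ W                        # randomized matrix, factor w/o pivoting
```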
... For matrix multiplications like $J^T B_d J$ and the Cholesky factorization needed to solve eqs (11) and (15), we use the PLASMA library (Buttari et al. 2009; Baboulin et al. 2011). PLASMA is a linear algebra library for dense matrices, parallelized for shared-memory machines (such as the SMP unit we use). ...
Article
Following the creation described in Part I of a deformable edge finite-element simulator for 3-D magnetotelluric (MT) responses using direct solvers, in Part II we develop an algorithm named HexMT for 3-D regularized inversion of MT data including topography. Direct solvers parallelized on large-RAM, symmetric multiprocessor (SMP) workstations are used also for the Gauss-Newton model update. By exploiting the data-space approach, the computational cost of the model update becomes much less in both time and computer memory than the cost of the forward simulation. In order to regularize using the second norm of the gradient, we factor the matrix related to the regularization term and apply its inverse to the Jacobian, which is done using the MKL PARDISO library. For dense matrix multiplication and factorization related to the model update, we use the PLASMA library, which shows very good scalability across processor cores. A synthetic test inversion using a simple hill model shows that including topography can be important; in this case depression of the electric field by the hill can cause false conductors at depth or mask the presence of resistive structure. With a simple model of two buried bricks, a uniform spatial weighting for the norm of model smoothing recovered more accurate locations for the tomographic images compared to weightings which were a function of parameter Jacobians. We implement joint inversion for static distortion matrices tested using the Dublin secret model 2, for which we are able to reduce nRMS to ~1.1 while avoiding oscillatory convergence. Finally we test the code on field data by inverting full impedance and tipper MT responses collected around Mount St Helens in the Cascade volcanic chain. Among several prominent structures, the north-south trending, eruption-controlling shear zone is clearly imaged in the inversion.
... Future work includes support for more architectures like the Intel Xeon Phi, along with work on new algorithms that provide good performance but are not yet available in numerical libraries, such as randomized algorithms [53], [54] or communication-avoiding algorithms [55] for dense linear systems. Moving to sparse problems is also a possibility, where libraries like Cusp [56] or VexCL [57] provide an interesting approach. ...
Conference Paper
Full-text available
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries and combining them with architectural features, we developed a generic approach to solve dense linear systems on various architectures including CPU and GPU. We have then extended our work with an example of a least squares solver based on semi-normal equations in mixed precision that cannot be found in current libraries. For the automatically generated solvers, we report performance comparisons with state-of-the-art codes, and show that it is possible to obtain a generic code with a high-level interface (similar to MATLAB) which runs either on CPU or GPU without generating a significant overhead.
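As an illustration of the semi-normal-equations solver mentioned in this abstract, here is a minimal sketch (hypothetical code, not NT2): it keeps only the triangular factor $R$ of $A = QR$, solves $R^T R x = A^T b$, and performs one correction step, with the mixed-precision aspect reduced to a float32 factorization and float64 residuals:

```python
import numpy as np
from scipy.linalg import solve_triangular

def seminormal_lstsq(A, b):
    """Least squares min ||Ax - b|| via corrected semi-normal equations:
    solve R^T R x = A^T b with R from A = QR (Q is never formed), then do
    one refinement step. The float32 factor / float64 residual split is a
    toy stand-in for the mixed-precision scheme described above."""
    R = np.linalg.qr(A.astype(np.float32), mode='r')   # keep only R

    def solve_semi(rhs):
        y = solve_triangular(R.T, rhs, lower=True)     # R^T y = rhs
        return solve_triangular(R, y, lower=False)     # R x = y

    x = solve_semi(A.T @ b)
    r = b - A @ x                                      # residual in float64
    return x + solve_semi(A.T @ r)                     # CSNE correction step
```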
... To avoid the cost of pivoting, and therefore improve the performance of the factorization, the Random Butterfly Transformation (RBT) was proposed. This method, first described in [28,29], was recently developed for general systems in [30] and for symmetric indefinite systems in [94,95,96]. Tests performed in [30] on a collection of test matrices showed that in practice two recursions are sufficient to obtain satisfactory accuracy. ...
Article
Full-text available
In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems using hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization, and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first is based on a communication-avoiding pivoting strategy (CALU), while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPU architectures. We also present new solvers based on randomization for hybrid architectures with an Nvidia GPU or an Intel Xeon Phi coprocessor. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allows us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally, we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve the performance when applied to hybrid multicore/GPU solvers.
... Future work includes support for more architectures like the Intel Xeon Phi, along with work on new algorithms that provide good performance but are not yet available in numerical libraries, such as randomized algorithms [5,9] or communication-avoiding algorithms [7] for dense linear systems. Moving to sparse problems is also a possibility, where libraries like Cusp [10] or VexCL [21] provide an interesting approach. ...
Article
Full-text available
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries and combining them with architectural features, we developed a generic approach to solve dense linear systems on various architectures including CPU and GPU. We have then extended our work with an example of a least squares solver based on semi-normal equations in mixed precision that cannot be found in current libraries. For the automatically generated solvers, we report performance comparisons with state-of-the-art codes, showing that it is possible to obtain a generic code with a high-level interface (similar to Matlab) that can run either on CPU or GPU and that does not generate significant overhead.
... This motivates the need for adaptive scheduling, since it may be beneficial to run a kernel on either the CPU or the GPU depending on data-set parameters (to amortize the cost of data transfers to the GPU). However, as we show in [22], [27], [28], the final decision tree is architecture- and kernel-dependent and requires both off-line kernel cloning and some on-line, automatic and ad-hoc modeling of application behavior. ...
Article
Full-text available
Empirical auto-tuning and machine learning techniques have been showing high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving the above problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure makes it possible to preserve and share over the Internet whole auto-tuning setups, with all related artifacts and their software and hardware dependencies, rather than just performance data. It also allows researchers to gradually structure, systematize and describe all available research material, including tools, benchmarks, data sets, search strategies and machine learning models. Researchers can take advantage of shared components and data with extensible meta-descriptions to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques or prototype new ones. The community can now gradually learn and improve the complex behavior of existing computer systems while exposing behavior anomalies or model mispredictions to an interdisciplinary community, in a reproducible way, for further analysis. We present several practical, collaborative and model-driven auto-tuning scenarios. We have also released all material at http://c-mind.org/repo to set an example for collaborative and reproducible research, as well as for our new publication model in computer engineering where experimental results are continuously shared and validated by the community.
... They also showed that random butterfly matrices are cheap to store and to apply ($O(nd)$ and $O(dn^2)$, respectively), and they proposed implementations on hybrid multicore/GPU systems for the unsymmetric case [2]. For the symmetric case, they proposed a tiled algorithm for multicore architectures [3] and, more recently, a distributed solver [4] combined with a runtime system [5]. As was demonstrated, the preprocessing by RBT can be easily parallelized with good scalability. ...
Conference Paper
We consider the solution of sparse linear systems using direct methods via LU factorization. Unless the matrix is positive definite, numerical pivoting is usually needed to ensure stability, which is costly to implement especially in the sparse case. The Random Butterfly Transformations (RBT) technique provides an alternative to pivoting and is easily parallelizable. The RBT transforms the original matrix into another one that can be factorized without pivoting with probability one. This approach has been successful for dense matrices; in this work, we investigate the sparse case. In particular, we address the issue of fill-in in the transformed system.
... An effective approach to reducing the memory-access cost by increasing data locality is the technique called blocksize calculation, or tiling [1,16,17,3], where the loops are split into smaller ranges so that data remain in the cache while they are needed. The difficulty lies in identifying the block size that will lead to the smallest execution time. ...
Article
Full-text available
Although modern supercomputers are composed of multicore machines, scientists still execute legacy applications that were developed for single-core clusters, where the memory hierarchy was dedicated to a single core. The main objective of this paper is to propose and evaluate an algorithm that identifies an efficient blocksize to be applied to MPI stencil computations on multicore machines. In the light of an extensive experimental analysis, this work shows the benefits of identifying blocksizes that divide the data across the various cores, and suggests a methodology that exploits the memory hierarchy available in modern machines.
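The blocking idea can be illustrated with a toy stencil sweep (hypothetical code, not the paper's algorithm; in Python the tiling is for exposition only, since the cache benefit materializes in compiled code). `NB` plays the role of the blocksize the proposed methodology tries to identify:

```python
import numpy as np

def stencil_untiled(u):
    """One Jacobi-style 4-point stencil sweep over the interior."""
    v = u.copy()
    n, m = u.shape
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
    return v

def stencil_tiled(u, NB):
    """Same sweep with the loops split into NB x NB blocks so each block's
    working set stays in cache; NB is the blocksize to be tuned."""
    v = u.copy()
    n, m = u.shape
    for ii in range(1, n - 1, NB):          # block loops
        for jj in range(1, m - 1, NB):
            for i in range(ii, min(ii + NB, n - 1)):   # intra-block loops
                for j in range(jj, min(jj + NB, m - 1)):
                    v[i, j] = 0.25 * (u[i-1, j] + u[i+1, j]
                                      + u[i, j-1] + u[i, j+1])
    return v
```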