European Journal of Scientific Research
ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266
http://www.europeanjournalofscientificresearch.com
Analysis of Matrix Multiplication Computational Methods
Khaled Matrouk
Corresponding Author, Department of Computer Engineering, Faculty of Engineering
Al-Hussein Bin Talal University, P. O. Box (20), Ma'an, Zip Code 71111, Jordan
E-mail: khaled.matrouk@ahu.edu.jo
Tel: +962-3-2179000 (ext. 8503), Fax: +962-3-2179050
Abdullah Al-Hasanat
Department of Computer Engineering, Faculty of Engineering
Al-Hussein Bin Talal University, P. O. Box (20), Ma'an, Zip Code 71111, Jordan
Haitham Alasha'ary
Department of Computer Engineering, Faculty of Engineering
Al-Hussein Bin Talal University, P. O. Box (20), Ma'an, Zip Code 71111, Jordan
Ziad Al-Qadi
Prof, Department of Computer Engineering, Faculty of Engineering
Al-Hussein Bin Talal University, P. O. Box (20), Ma'an, Zip Code 71111, Jordan
Hasan Al-Shalabi
Prof, Department of Computer Engineering, Faculty of Engineering
Al-Hussein Bin Talal University, P. O. Box (20), Ma'an, Zip Code 71111, Jordan
Abstract
Matrix multiplication is a basic concept that is used in engineering applications such
as digital image processing, digital signal processing and graph problem solving.
Multiplication of huge matrices requires a lot of computation time, as its complexity is
O(n^3). Because most engineering applications require higher computational throughput
with minimum time, many sequential and parallel algorithms have been developed. In this paper,
methods of matrix multiplication are chosen, implemented, and analyzed. A performance
analysis is carried out, and some recommendations are given for using the OpenMP and MPI
methods of parallel computing.
Keywords: OpenMP, MPI, Processing Time, Speedup, Efficiency
1. Introduction
With the advent of parallel hardware and software technologies, users are faced with the challenge of
choosing a programming paradigm best suited to the underlying computer architecture (Alqadi and
Abu-Jazzar, 2005a; Alqadi and Abu-Jazzar, 2005b; Alqadi et al, 2008). With the current trend in
parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP),
parallel programming techniques have evolved to support parallelism beyond a single level (Choi et al,
1994).
Parallel programming within one SMP node can take advantage of the globally shared address
space. Compilers for shared memory architectures usually support multi-threaded execution of a
program. Loop level parallelism can be exploited by using compiler directives such as those defined in
the OpenMP standard (Dongarra et al, 1994; Alpatov et al, 1997).
OpenMP provides a fork-and-join execution model in which a program begins execution as a single
thread. This thread executes sequentially until a parallelization directive for a parallel region is found
(Alpatov et al, 1997; Anderson et al, 1987). At this time, the thread creates a team of threads and becomes
the master thread of the new team (Chtchelkanova et al, 1995; Barnett et al, 1994; Choi et al, 1992).
All threads execute the statements until the end of the parallel region. Work-sharing directives
are provided to divide the execution of the enclosed code region among the threads. All threads need to
synchronize at the end of parallel constructs. The advantage of OpenMP (web ref.) is that existing
code can be easily parallelized by placing OpenMP directives around time-consuming loops that do
not contain data dependencies, leaving the rest of the source code unchanged. The disadvantage is that
it is not easy for the user to optimize workflow and memory access.
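As an illustration of the fork-and-join model described above, the following minimal C sketch (illustrative code, not part of the paper; the printed messages are arbitrary) shows a single master thread forking a team of threads at a parallel directive and joining it again at the end of the region:

#include <stdio.h>
#include <omp.h>

/* The program runs as a single (master) thread until the parallel
   directive, where a team of threads is forked; the team synchronizes
   at the implicit barrier and joins at the end of the region. */
int main(void)
{
    printf("Before the parallel region: one thread\n");

    #pragma omp parallel   /* fork: the master thread creates a team */
    {
        printf("Thread %d of %d inside the parallel region\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                      /* implicit barrier, then join */

    printf("After the parallel region: back to one thread\n");
    return 0;
}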
On an SMP cluster the message passing programming paradigm can be employed within and
across several nodes. MPI (web ref.) is a widely accepted standard for writing message passing
programs (web ref.; Rabenseifner, 2003).
MPI provides the user with a programming model where processes communicate with other
processes by calling library routines to send and receive messages. The advantage of the MPI programming
model is that the user has complete control over data distribution and process synchronization, permitting
the optimization of data locality and workflow distribution. The disadvantage is that existing sequential
applications require a fair amount of restructuring for a parallelization based on MPI.
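To make the message-passing model concrete, the following minimal C sketch (illustrative code, not part of the paper; the ranks and the transferred value are arbitrary) shows one process sending a single integer to another through library calls:

#include <stdio.h>
#include <mpi.h>

/* Process 0 sends one integer to process 1, which receives and prints it.
   Assumes the program is launched with at least two processes
   (e.g. mpiexec -n 2). */
int main(int argc, char *argv[])
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}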
1.1. Serial Matrix Multiplication
Matrix multiplication involves two matrices A and B such that the number of columns of A and the
number of rows of B are equal. When carried out sequentially, it takes time O(n^3). The algorithm for
ordinary matrix multiplication is:
for i = 1 to n
    for j = 1 to n
        c(i,j) = 0
        for k = 1 to n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end
    end
end
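For reference, a direct C translation of this triple loop is given below (an illustrative sketch, not the authors' code; the matrices are assumed to be stored as row-major n x n arrays of doubles):

/* Serial triple-loop multiplication, C = A * B, for n x n matrices
   stored in row-major order. The operation count is O(n^3). */
void matmul_serial(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}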
1.2. Parallel Matrix Multiplication Using OpenMP
The master thread forks the outer loop among the slave threads, so each thread multiplies a block of
rows of the first matrix by the second matrix. When the threads have finished their partial
multiplications, the master thread joins the partial results into the total product matrix.
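A minimal C sketch of this scheme is given below (illustrative code, not the authors' implementation; the same row-major layout as in the serial sketch is assumed). The work-sharing directive divides the iterations of the outer row loop among the threads:

/* Row-parallel multiplication: the outer loop over the rows of A is
   divided among the OpenMP threads; each thread writes a disjoint set
   of rows of C, so no locking is needed inside the loop. */
void matmul_openmp(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}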
1.3. Parallel Matrix Multiplication Using MPI
The procedure for implementing the sequential algorithm in parallel using MPI can be divided into the
following steps:
The master processor splits the first matrix row-wise and distributes the parts to the other processors.
The second matrix is broadcast to all processors.
Each processor multiplies its part of the first matrix by the second matrix.
Each processor sends its partial product back to the master processor.
Implementation:
The master (processor 0) reads the data.
The master sends the size of the data to the slaves.
The slaves allocate memory.
The master broadcasts the second matrix to all other processors.
The master sends the respective parts of the first matrix to all other processors.
Every processor performs its local multiplication.
All slave processors send their results back to the master. A C sketch of this procedure is given below.
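The sketch below is an illustrative implementation of these steps, not the authors' code: it uses the collective operations MPI_Scatter, MPI_Bcast and MPI_Gather in place of explicit point-to-point sends, and for simplicity it assumes that n is divisible by the number of processors and that the buffer for the second matrix is allocated on every process.

#include <stdlib.h>
#include <mpi.h>

/* Row-distributed multiplication following the steps above. A and C need
   to be valid only on the master (rank 0); B must be an n x n buffer on
   every process so that it can receive the broadcast. */
void matmul_mpi(int n, double *a, double *b, double *c)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = n / size;  /* rows of A handled by each processor */
    double *a_part = malloc((size_t)rows * n * sizeof(double));
    double *c_part = malloc((size_t)rows * n * sizeof(double));

    /* Master splits the first matrix row-wise among the processors. */
    MPI_Scatter(a, rows * n, MPI_DOUBLE,
                a_part, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* The second matrix is broadcast to all processors. */
    MPI_Bcast(b, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each processor multiplies its block of rows by B. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a_part[i * n + k] * b[k * n + j];
            c_part[i * n + j] = sum;
        }

    /* Partial products are gathered back on the master processor. */
    MPI_Gather(c_part, rows * n, MPI_DOUBLE,
               c, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(a_part);
    free(c_part);
}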
2. Methods and Tools
One station with a Pentium i5 processor running at 2.5 GHz with 4 GB of memory is used to implement
serial matrix multiplication. Visual Studio 2008 with the OpenMP library is used as the environment for
building, executing and testing the matrix multiplication program; the program is tested on the same
2.5 GHz Pentium i5 processor with 4 GB of memory. For the MPI experiments, a distributed processing
system with a varying number of processors is used; each processor is a 4-core, 2.5 GHz processor with
4 GB of memory, and the processors are connected through Visual Studio 2008 with the MPI environment.
3. Experimental Part
Different sets of two matrices are chosen (differing in size and data type), and each pair of matrices is
multiplied serially and in parallel using both the OpenMP and MPI environments; the average
multiplication time is reported.
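The paper does not list its timing code; the following is a minimal sketch of how such an average could be measured with the OpenMP wall-clock timer (the run count, the interface, and the matmul_serial routine from the earlier sketch are illustrative assumptions):

#include <omp.h>

void matmul_serial(int n, const double *a, const double *b, double *c);

/* Returns the average wall-clock time, in seconds, of 'runs' repetitions
   of an n x n matrix multiplication. */
double average_matmul_time(int n, const double *a, const double *b,
                           double *c, int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        double start = omp_get_wtime();
        matmul_serial(n, a, b, c);
        total += omp_get_wtime() - start;
    }
    return total / runs;
}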
3.1. Experiment 1
The sequential matrix multiplication program is tested using matrices of different sizes. Matrices with
different data types (integer, float, double, and complex) are chosen; 100 matrices of different data
types and sizes are multiplied. Table 1 shows the average results obtained in this experiment.
Table 1: Experiment 1 Results

Matrix size      Multiplication time (seconds)
10x10            0.00641199
20x20            0.00735038
40x40            0.0063971
100x100          0.0142716
200x200          0.0386879
1000x1000        6.75335
1200x1200        11.889
5000x5000        2007
10000x10000      13000
3.2. Experiment 2
The matrix multiplication program is tested using small matrices. Matrices with different data types
(integer, float, double, and complex) are chosen; 200 matrices of different data types and sizes are
multiplied using the OpenMP environment. Table 2 shows the average results obtained in this
experiment.
Table 2: Experiment 2 Results (multiplication time in seconds)

# of threads    10x10        20x20        40x40       100x100     200x200
1               0.00641199   0.00735038   0.0063971   0.0142716   0.0386879
2               0.03675360   0.06866570   0.0370589   0.0373609   0.0615986
3               0.06271470   0.06311360   0.0701701   0.0978940   0.0787245
4               0.07273020   0.06979990   0.0710032   0.0706766   0.079643
5               0.06772930   0.07232620   0.0673493   0.0699920   0.051531
6               0.06918620   0.07037430   0.0707350   0.0724863   0.0837632
7               0.07124480   0.07204210   0.0727263   0.0727355   0.0820219
8               0.74631600   0.07348000   0.0677064   0.0762404   0.0820226
3.3. Experiment 3
The matrix multiplication program is tested using large matrices. Matrices with different data types
(integer, float, double, and complex) are chosen; 200 matrices of different data types and sizes are
multiplied using the OpenMP environment with 8 threads. Table 3 shows the average results obtained
in this experiment.
Table 3: Experiment 3 Results

Matrix size      Multiplication time (seconds)
1000x1000        1.8377
1200x1200        3.19091
2000x2000        18.0225
5000x5000        508.1
10000x10000      333.3
3.4. Experiment 4
The matrix multiplication program is tested using large matrices. Matrices with different data types
(integer, float, double, and complex) are chosen; 200 matrices of different data types and sizes are
multiplied using the MPI environment with different numbers of processors. Table 4 shows the average
results obtained in this experiment.
Table 4: Experiment 4 Results (multiplication time in seconds)

Number of processors    1000x1000 matrices    5000x5000 matrices    10000x10000 matrices
1                       6.96                  2007                  13000
2                       5.9                   1055                  7090
4                       3.3                   525                   3290
5                       2.8                   431                   2965
8                       2.1                   260                   1920
10                      1.5                   235                   1600
20                      0.8                   119                   900
25                      0.6                   91                    830
50                      0.55                  52                    292
4. Results Discussion
From the results obtained in the previous section we can categorize the matrices into three groups:
Small matrices, with size less than 1000x1000
Mid-size matrices, with size between 1000x1000 and 5000x5000
Huge matrices, with size larger than 5000x5000
The following recommendations can be made:
For small matrices, it is preferable to use sequential matrix multiplication.
For mid-size matrices, it is preferable to use parallel matrix multiplication with OpenMP.
For huge matrices, it is preferable to use parallel matrix multiplication with MPI.
It is also recommended to use hybrid parallel systems (MPI with OpenMP) to multiply matrices of huge sizes.
From the results obtained in Table 2 we can see that the speedup of using OpenMP is limited by the
number of physical cores actually available in the computer system, as shown in Table 5 and Fig. 1.
Speedup = (time of execution with 1 thread) / (parallel execution time)
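For example, for the 1000x1000 matrices in Table 5, the speedup with 8 threads is 6.69186 / 1.8377 ≈ 3.64, consistent with the saturation near the number of physical cores shown in Fig. 3.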
Table 5: Speedup results of using OpenMP

Matrix size    Time with 1 thread (s)    Time with 8 threads (s)    Speedup
300x300        0.110188                  0.097704                   1.1278
400x400        0.314468                  0.170006                   1.8497
500x500        0.601031                  0.277821                   2.1634
600x600        1.14773                   0.64882                    1.7689
700x700        2.17295                   0.704228                   3.0856
800x800        3.16512                   0.963983                   3.2834
900x900        4.93736                   1.37456                    3.5920
1000x1000      6.69186                   1.8377                     3.6414
1024x1024      7.18151                   1.97027                    3.6449
1200x1200      12.0819                   3.19091                    3.7863
2000x2000      72.8571                   18.0996                    4.0253
2048x2048      74.7383                   18.8406                    3.9669
Figure 1: Maximum performance of using OpenMP
From the results obtained in Tables 1 and 2 we can see that the matrix multiplication time
increases rapidly as the matrix size increases, as shown in Figs. 2 and 3.
Figure 2: Comparison between the 1-thread and 8-thread results (x-axis: matrix size n for n x n matrices, 200 to 2200; y-axis: time in seconds)
Figure 3: Relationship between the speedup and the matrix size (x-axis: matrix size n x n, 200 to 2200; y-axis: speedup; annotated with the maximum number of cores)
From the results obtained in Table 4 we can calculate the speedup of using MPI and the system
efficiency:
Efficiency = speedup / number of processors
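For example, with 8 processors on the 5000x5000 matrices, Table 4 gives 260 seconds against 2007 seconds on a single processor, so the speedup is 2007 / 260 ≈ 7.72 and the efficiency is 7.72 / 8 ≈ 0.96, matching the corresponding entry in Table 6.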
The calculation results are shown in Table 6:
Table 6: Speedup and efficiency of using MPI

Number of     1000x1000 matrices        5000x5000 matrices        10000x10000 matrices
processors    Speedup    Efficiency     Speedup    Efficiency     Speedup    Efficiency
1             1          1              1          1              1          1
2             1.17       0.585          1.9        0.38           1.83       0.92
4             2.9        0.725          3.8        0.95           3.9        0.98
5             2.46       0.492          4.66       0.93           4.4        0.88
8             3.29       0.411          7.72       0.96           6.77       0.85
10            4.6        0.46           8.54       0.85           8.13       0.81
20            8.63       0.43           16.87      0.84           14.44      0.72
25            11.5       0.45           22.05      0.88           15.66      0.63
50            12.55      0.251          38.6       0.77           44.52      0.89
From Table 6 we can see that increasing the number of processors in an MPI environment
enhances the speedup of matrix multiplication, but it also leads to poor system efficiency, as
shown in Figs. 4, 5 and 6.
Figure 4: Multiplication time for 10000x10000 matrices (x-axis: number of processors; y-axis: time in seconds)
Figure 5: Speedup of multiplication for 10000x10000 matrices (x-axis: number of processors; y-axis: speedup)
Figure 6: System efficiency of matrix multiplication (x-axis: number of processors; y-axis: efficiency; curves for 1000x1000, 5000x5000 and 10000x10000 matrices)
5. Conclusions
Based on the results obtained and shown above, the following conclusions can be drawn:
Sequential matrix multiplication is preferable for matrices with small sizes.
OpenMP is a good environment for parallel multiplication of mid-size matrices; here the speedup is
limited by the number of available physical cores.
MPI is a good environment for parallel multiplication of huge matrices; here the speedup can be
increased further, but at the cost of system efficiency.
To avoid the limitations noted in the two previous conclusions, a hybrid parallel system (MPI with
OpenMP) is recommended.
References
[1] Alqadi Z., and Abu-Jazzar A., 2005. "Program Methods Used for Optimizing Matrix
Multiplication", CNRS-NSF Workshop on Environments and Tools for Parallel Scientific
Computing, Saint Hilaire du Touvet, France, Sept. 7-8, Elsevier Sci. Publishers, Journal of
Engineering 15(1), pp. 73-78.
[2] Alqadi Z., and Abu-Jazzar A., 2005. "Analysis of Program Methods Used for Optimizing
Matrix Multiplication", Journal of Engineering 15(1), pp. 73-78.
[3] Alqadi Z., Aqel M., and El Emary I. M. M., 2008. "Performance Analysis and Evaluation of
Parallel Matrix Multiplication Algorithms", World Applied Sciences Journal 5(2).
[4] Dongarra, J. J., R.A. Van de Geijn, and D.W. Walker, 1994. "Scalability Issues Affecting the
Design of a Linear Algebra Library, Parallel Linear Algebra Package Design", Distributed
Computing 22(3), Proceedings of SC 97, pp. 523-537.
[5] Alpatov, P., G. Baker, C. Edwards, J. Gunnels, and P. Geng, 1997. "Parallel Matrix
Distributions: Parallel Linear Algebra Package", Tech. Report TR-95-40, Proceedings of the
SIAM Parallel Processing Computer Sciences Conference, The University of Texas, Austin.
[6] Choi, J., J. J. Dongarra and D.W. Walker, 1994. "A High-Performance Matrix Multiplication
Algorithm Pumma: Parallel Universal Matrix Multiplication on a Distributed Memory Parallel
Computer Using Algorithms on Distributed Memory Concurrent Overlapped Communication",
IBM J. Res. Develop., Computers, Concurrency: Practice and Experience 6(7), pp. 543-570.
[7] Chtchelkanova, A., J. Gunnels, and G. Morrow, 1986. "IEEE Implementation of BLAS:
General Techniques for Level 3 BLAS", Proceedings of the 1986 International Conference on
Parallel Processing, pp. 640-648, TR-95-40, Department of Computer Sciences, University of
Texas.
[8] Barnett, M., S. Gupta, D. Payne, and L. Shuler, 1994. "Using MPI: Communication Library
(InterCom), Scalable High Portable Programming with the Message-Passing Performance,
Computing Conference, pp. 17-31.
[9] Anderson E., Z. Bai, C. Bischof, and J. Demmel, 1987. "Solving Problems on Concurrent
Processors", Proceedings of Matrix Algorithms Supercomputing '90, IEEE 1, pp. 1-10.
[10] Choi J., J. J. Dongarra, R. Pozo and D.W. Walker, 1992. "Scalapack: A Scalable Linear
Algebra Library for Distributed Memory Concurrent Computers", Proceedings of the Fourth
Symposium on the Frontiers of Massively Parallel Computation. IEEE Comput. Soc. Press, pp.
120-127.
[11] MPI 1.1 Standard, http://www-unix.mcs.anl.gov/mpi/mpich.
[12] OpenMP Fortran Application Program Interface, http://www.openmp.org/.
[13] Rabenseifner, R., 2003. “Hybrid Parallel Programming: Performance Problems and Chances”,
Proceedings of the 45th Cray User Group Conference, Ohio, May 12-16, 2003.