Pattern Matching of Collective MPI Operations
Dieter Kranzlmüller1,2, Andreas Knüpfer1, Wolfgang E. Nagel1
1Center for High Performance Computing
Dresden University of Technology, Germany
{knuepfer, nagel}@zhr.tu-dresden.de
2GUP - Institute of Graphics and Parallel Processing
Joh. Kepler University Linz, Austria/Europe
kranzlmueller@gup.jku.at
Abstract
Programming message passing systems can be tedious and error-prone, especially for the inexperienced user facing the sheer amount of functionality available in today's message passing libraries. Instead of choosing the best-suited communication function, many users tend to apply only a small set of well-known standard operations. This paper describes a pattern matching approach based on execution traces which detects connected groups of point-to-point communication operations that may resemble existing collective operations. After highlighting the detected patterns, users are able to improve their codes by replacing the point-to-point operations with more appropriate and efficient collective alternatives.
Keywords: parallel programming, message pass-
ing, pattern detection, collective operations
1 Motivation
The Message Passing Interface standard MPI [5] is probably the most widely used parallel programming paradigm today [4]. This popularity is based on the characteristics of MPI, which is intended to provide a "practical, portable, efficient, and flexible standard" for writing message-passing programs¹. Among these characteristics, efficiency of MPI implementations is achieved by allowing vendors to optimize MPI for their particular hardware, or even with dedicated hardware support, as long as the same functionality is provided.

¹ http://www-unix.mcs.anl.gov/mpi/
One example area for optimization is collective routines, such as broadcast, scatter, or gather operations. While it is possible to implement the complete set of collective operations using MPI point-to-point communication and some auxiliary functions, most vendors already provide optimized collective operations, which utilize hardware specifics (such as network topology) to improve the runtime performance of these routines [14].
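To illustrate why built-in collectives can be faster, consider a simple cost model (our own sketch, not taken from the paper): a root that sends to each of the other P-1 processes in turn needs P-1 sequential communication steps, while a binomial-tree scheme, in which every process that already holds the data forwards it concurrently, finishes in ceil(log2 P) rounds.

```python
import math

def linear_rounds(p):
    # The root sends to each of the other p-1 processes one after another.
    return p - 1

def tree_rounds(p):
    # Each round doubles the number of processes holding the data.
    return math.ceil(math.log2(p))

# e.g. with 32 processes: 31 sequential steps vs. 5 tree rounds
```

Real implementations choose among several such algorithms depending on message size, process count, and topology; the point here is only the asymptotic gap a manual point-to-point emulation gives up.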
Unfortunately, the rich functionality of MPI represents a substantial burden for the inexperienced programmer. While experts have a good idea which operations fit the intended program behavior on the underlying hardware architecture, the learning curve for the novice user approaching this level is quite steep. As a consequence, many beginners tend to apply well-known but suboptimal operations, or implement collective functionality by manually grouping point-to-point communication statements.
The approach described in this paper represents a solution to this problem based on the idea of pattern matching in program traces. If a tool is able to detect communication patterns which resemble collective operations, it may be possible to inform the user about more optimal alternatives when implementing message passing programs.

[Figure 1: Scatter Communication Pattern in Vampir Space Time Diagram - space-time view of 6 processes over a 0.162 s excerpt of the trace scatter06.vpt]
The first results following this idea are presented in this paper, which is organized as follows: Section 2 discusses the idea and its relation to pattern matching for program analysis activities. Afterwards, the current implementation is described together with a simple example, before the current state of the implementation and an outlook on future goals of this project are summarized.
2 General Idea
The situation described above has often been observed during our work on the program analysis tools Vampir [1, 2] and MAD [11]. While Vampir is well-known for its usage in performance analysis, MAD's major application is the area of program debugging. Nevertheless, both program analysis tools operate on traces of the program's execution, which are then visualized as space-time diagrams. The experience gained from observing different program executions, for different groups of users, in different real-world application domains has shown that point-to-point communication is often used where collective operations would be much more beneficial.
An example of a space-time diagram as
produced by Vampir is shown in Figure 1.
The 6 processes are arranged vertically, while
the horizontal axis represents the observation
time. Different colors and shades indicate dif-
ferent states during the execution of the pro-
cesses. Arcs connecting corresponding states
on distinct processes represent communication
or synchronization operations.
In the example of Figure 1, a scatter communication pattern is clearly visible. However, instead of using a collective operation (such as MPI_Scatter), the user has implemented the same behavior based on point-to-point operations (such as MPI_Send and MPI_Recv). As a result, optimizations that may be available within the MPI library cannot be exploited by the program.
The idea of our approach is to detect groups of communication patterns that may resemble collective operations. If such groups are found, the users can be informed and may be able to obtain improved performance by replacing these groups with corresponding collective operations.
The feasibility of using patterns for paral-
lel program development has been shown in
a series of related work. In [8] and [3], the
authors describe tools for automatic or semi-
automatic parallelization based on a limited
set of patterns that repeatedly occur in parallel
programs. With Patterntool [6], the design and
implementation of parallel programs is even
based on the specification of communication
patterns.
The pattern-oriented parallel debugger
Belvedere [7] facilitates the description,
manipulation, and animation of logically
structured patterns of process interactions for
highly parallel programs. Pattern matching on
program traces with the tool POET in order to
simplify the visual representation is described
in [13], while the Event Analysis and Recog-
nition Language EARL is described in [15].
These ideas are further extended to a general
communication pattern analysis approach for
parallel and distributed program analysis as
described in [12], while [10] utilizes pattern
matching for reducing the amount of event
data required during program analysis of large
scale program executions.
3 Pattern Matching in Program
Traces
In contrast to related event-based pattern matching approaches, which operate on streams of events and relations, our approach utilizes the so-called Complete Call Graph (CCG) [10]. While ordinary call graphs summarize the function call hierarchy, a CCG contains every instance of every function as a corresponding node, and one CCG is generated for every process. All properties of the function calls which are needed for program analysis, such as performance measurements, are attached to the graph nodes. The CCG for each process is generated by reading the trace files and mapping the events onto the CCG data structure.
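As a sketch of this data structure (the node layout and names below are our own simplification, not the authors' implementation), a CCG can be built by replaying the per-process trace and attaching one node per function instance, with analysis properties stored at the nodes:

```python
from dataclasses import dataclass, field

@dataclass
class CCGNode:
    name: str                                   # e.g. "main", "foo", "MPI_Send"
    props: dict = field(default_factory=dict)   # e.g. duration, bytes, peer
    children: list = field(default_factory=list)

def build_ccg(events):
    """Map a flat trace of (call_depth, name, props) records onto a CCG.
    Depth 1 denotes a direct callee of main."""
    root = CCGNode("main")
    stack = [root]
    for depth, name, props in events:
        del stack[depth:]                 # unwind to the caller at this depth
        node = CCGNode(name, props)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

The real CCG additionally compresses repeated sub-trees [10]; this sketch only shows the mapping from a linear event trace to the graph.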
An example CCG for a selected process 0 is
shown in Figure 2. The execution of the pro-
gram is mapped onto this graph from left to
right. Each layer of the CCG represents the
call sequence for one particular unit. The top-
most node represents the execution of the main
program, which consists of several parts, each
displayed with their particular execution time.
Whenever needed, finer levels of granularity
are chosen to provide more details for program
analysis activities. At the leaves of the graph
are atomic events, generated by e.g. message-
passing or input/output operations.
In the example of Figure 2, five MPI operations have been performed. The origin of the send events is always process 0 (as indicated by msg 0 → x). Additionally, Figure 2 shows that each send operation transfers the same number of bytes in the messages (100 bytes), and that the send events occur immediately after one another, without any other communication event disturbing the sequence.
Based on this CCG, the pattern matching
approach for detecting a scatter communica-
tion pattern operates as follows:
The CCG is traversed recursively, propagating the communication relation R: p → q (i.e. process p sends to process q ≠ p) upwards. With this information the smallest sub-trees containing a one-to-all relation are identified, which is defined as

    ∃ p ∈ P with ∀ q ∈ P, q ≠ p: p → q.    (1)
If such a sub-tree is detected, propagating information upwards is stopped and the actual pattern matching is performed among the messages in that sub-tree only. By this means, pattern matching is never applied to the global scope, which usually contains a very large number of messages.
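Condition (1) can be evaluated directly on the set of send relations propagated up from a sub-tree. A minimal sketch (function name and data layout are our own assumptions):

```python
def one_to_all_root(relations, processes):
    """relations: set of (sender, receiver) pairs collected in a sub-tree.
    Return a process p with p -> q for every q != p, or None if the
    sub-tree contains no one-to-all relation."""
    for p in processes:
        if all((p, q) in relations for q in processes if q != p):
            return p
    return None
```

As soon as such a root is found for a sub-tree, upward propagation can stop there and the detailed matching is restricted to that sub-tree's messages.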
For every selected sub-tree of the CCG the sequence S of message operations in temporal order is examined more closely: the goal is to find a sub-sequence s ⊆ S of messages sent by one process to every other process exactly once, without any other message passing operation interfering. For example, any intermediate receive operation would destroy such a sub-sequence, just as two messages to the same recipient would.
This is achieved by growing an initially empty sub-sequence s through appending successive messages from S, until either a complete scatter pattern is found or the intended pattern is violated. In both cases the pattern matching is continued from appropriate positions until the end of S.
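The scan described above can be sketched as follows (a simplified illustration; the event encoding and function name are our own, not the paper's implementation). Each candidate sub-sequence s grows from a send event; an interfering operation or a duplicate recipient aborts the candidate, and a match is reported once the root has reached every other process exactly once:

```python
def find_scatter(events, processes):
    """events: temporally ordered list of ('send'|'recv', src, dst).
    Return (start, end) indices of a scatter sub-sequence, or None."""
    n = len(events)
    i = 0
    while i < n:
        kind, src, dst = events[i]
        if kind != 'send':
            i += 1
            continue
        root, seen = src, set()
        j = i
        while j < n:
            k, s, d = events[j]
            # an interfering operation or duplicate target breaks the pattern
            if k != 'send' or s != root or d in seen:
                break
            seen.add(d)
            if seen == processes - {root}:
                return (i, j)
            j += 1
        i += 1   # restart the scan from the next position
    return None
```

For simplicity this sketch restarts one event later after a failed candidate; smarter restart positions, as hinted at in the text, avoid rescanning events that cannot begin a pattern.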
[Figure 2 content: a CCG for process 0 with root node main, child node foo, and below it node bar containing five MPI_Send leaves for msg 0 → 1 through msg 0 → 5, 100 bytes each.]
Figure 2: Complete Call Graph including Scatter Communication Pattern
For gather communication patterns this works analogously: simply exchange send and receive operations and swap sender and receiver processes.
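The gather case mirrors the scatter scan, with receives at a single root instead of sends from it. A self-contained sketch under the same assumptions (event encoding and name are our own):

```python
def find_gather(events, processes):
    """events: temporally ordered list of ('send'|'recv', src, dst).
    Return (start, end) of a gather sub-sequence: one root receives
    exactly one message from every other process, uninterrupted."""
    n = len(events)
    for i in range(n):
        kind, src, dst = events[i]
        if kind != 'recv':
            continue
        root, seen = dst, set()
        for j in range(i, n):
            k, s, d = events[j]
            # anything but a fresh receive at the root breaks the pattern
            if k != 'recv' or d != root or s in seen:
                break
            seen.add(s)
            if seen == processes - {root}:
                return (i, j)
    return None
```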
The CCG of Figure 2 corresponds to the space-time diagram of Figure 1 for process 0. With the algorithm described above, the pattern matcher is able to identify the group of MPI_Send operations in the sub-tree below node bar. Based on this knowledge, the user can be informed about the possibility of replacing these sends with a more efficient collective MPI_Scatter. MPI_Scatter implements a global operation which distributes data from one process to all other available processes, much like the detected group of send events does. However, in contrast to the individual send events, MPI_Scatter may be able to achieve better performance if it is optimized by the provider of the underlying MPI implementation.
The current version of the CCG pattern matcher is able to detect groups of send or receive events resembling MPI_Scatter and MPI_Gather operations, respectively. However, more research needs to be done to distinguish between communication patterns which are similar in their space-time diagram representation but differ in their semantics. For example, MPI_Scatter and MPI_Bcast (for broadcasting a message) are similar in shape, but while the former distributes different data to each process, the latter sends a copy of the exact same data. In order to detect these differences, further analysis of the event traces is required. In this context, it is also necessary to check for matching MPI communicators and tags in order to correct the number of matched patterns.
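One conceivable payload check for telling the two shapes apart (our own illustration; the paper leaves this distinction to future work): if all messages of a matched one-to-all group carry identical contents, the group behaves like MPI_Bcast, otherwise like MPI_Scatter.

```python
def classify_one_to_all(messages):
    """messages: list of (receiver, payload) pairs from one matched group.
    Identical payloads suggest a broadcast, distinct ones a scatter."""
    payloads = {payload for _, payload in messages}
    return "bcast" if len(payloads) == 1 else "scatter"
```

In practice one would compare checksums recorded in the trace rather than full payloads, which trace formats typically do not store.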
4 Benchmarks
In order to provide an impression of the
achievable performance improvements, a num-
ber of benchmarks have been conducted on
various platforms. The collective operations
known as gather, scatter, broadcast, and all-
to-all have been emulated using point to point
messages only. This was done in a straight-
forward manner without optimizations like
tree hierarchies, for example.
Figures 3 and 4 show the results of experiments on JUMP [16], a cluster of IBM p690 nodes at the John von Neumann Institute for Computing (NIC) of the Research Center Jülich, Germany. For every type of collective operation, the run-time ratio time_self-made / time_built-in vs. the message length has been plotted for several numbers of participating MPI processes.
For gather and scatter the performance advantage of built-in operations is relatively small, but nevertheless ranges up to 200 % and 300 %, respectively. As expected, broadcast profits most (up to 750 %) from using built-in operations instead of self-made ones, because it spreads a single message to all participants. This applies particularly to large messages and higher process counts. All-to-all achieves an acceleration of 200 %, and even over 400 % with 32 processes.
To our surprise, there were several cases among the experiments² that revealed built-in operations being slower than their self-made counterparts! Similar effects occurred on the SGI O3K platforms as well.
As the built-in operations are free to do the same as straightforward self-made implementations, this is an unnecessary drawback. One might consider a minor performance disadvantage for very small messages acceptable. However, notable losses for medium-sized messages are definitely not what a user wants to see! After all, it is remarkable that vendor-provided MPI implementations show such effects in the first place. Further measurements will be needed to identify the reasons for this behavior.
5 Conclusions and Future
Work
Pattern matching in event traces seems to be a promising approach for program analysis in parallel and distributed programs. Based on this assumption, this paper demonstrates a possibility of using pattern matching for performance tuning. The idea of the CCG pattern matcher is to detect groups of communication events which resemble available collective operations. Detected groups may then be replaced by more efficient, optimized function calls, which should improve the performance of the code.

² All experiments have been run multiple times; the run-time values were taken as the minima over all repetitions.
This initial version of the pattern matcher is already able to detect simple groups of point-to-point operations that may resemble MPI_Scatter or MPI_Gather functions. Our next goal in this project is to distinguish between similar patterns, such as MPI_Scatter and MPI_Bcast, in order to increase the accuracy of the pattern detection mechanism. In addition, we intend to study more complex patterns, such as MPI_Allgather or MPI_Alltoall, which promise an even higher potential for performance optimization.
Acknowledgments Several persons contributed to this work through their ideas and comments, notably Bernhard Aichinger, Christian Schaubschläger, and Prof. Jens Volkert from GUP Linz, Axel Rimnac from the University of Linz, and Holger Brunst from ZHR TU Dresden, as well as Beniamino Di Martino from UNINA, Italy, to whom we are most thankful.

We would also like to thank the John von Neumann Institute for Computing (NIC) at the Research Center Jülich for access to their IBM p690 machine JUMP under project number #k2720000 to perform our measurements as reported in this paper.
References
[1] H. Brunst, W. E. Nagel, and S. Seidl.
Performance Tuning on Parallel Systems:
All Problems Solved? In Proceedings of
PARA2000 - Workshop on Applied Par-
allel Computing, volume 1947 of LNCS,
pages 279–287. Springer-Verlag Berlin
Heidelberg New York, June 2000.
[2] H. Brunst, H.-Ch. Hoppe, W. E. Nagel, and M. Winkler. Performance Optimization for Large Scale Computing: The Scalable VAMPIR Approach. In Proceedings of ICCS 2001, San Francisco, USA, volume 2074 of LNCS, page 751ff. Springer-Verlag, May 2001.

[Figure 3: Comparison of built-in vs. self-made gather (left) and scatter (right) operations: run-time ratio self-made/built-in [%] over message volume [bytes] on JUMP, for 2x2 up to 2x16 processes.]
[3] B. Di Martino and B. Chapman. Pro-
gram Comprehension Techniques to Im-
prove Automatic Parallelization. In Pro-
ceedings of the Workshop on Automatic
Data Layout and Performance Predic-
tion, Center for Research on Parallel
Computation, Rice University, 1995.
[4] I. Foster. Designing and Building Parallel
Programs. Addison-Wesley, 1995.
[5] W. Gropp, E. Lusk, and A. Skjellum. Using MPI, 2nd Edition. MIT Press, November 1999.
[6] B. Gruber, G. Haring, D. Kranzlmüller, and J. Volkert. Parallel Programming with CAPSE - A Case Study. In Proceedings of PDP'96, 4th EUROMICRO Workshop on Parallel and Distributed Processing, Braga, Portugal, pages 130–137, January 1996.
[7] A.E. Hough and J.E. Cuny. Initial Ex-
periences with a Pattern-Oriented Paral-
lel Debugger. In Proceedings of the ACM
SIGPLAN/SIGOPS Workshop on Paral-
lel and Distributed Debugging, Madi-
son, Wisconsin, USA, SIGPLAN No-
tices, Vol. 24, No. 1, pp. 195–205, Jan-
uary 1989.
[8] Ch. Kessler. Pattern-driven Automatic
Parallelization. Scientific Programming,
Vol. 5, pages 251–274, 1996.
[9] A. Knüpfer. A New Data Compression Technique for Event Based Program Traces. In Proceedings of ICCS 2003, Melbourne, Australia, Springer LNCS Vol. 2659, pages 956–965, June 2003.

[Figure 4: Comparison of built-in vs. self-made broadcast (left) and all-to-all (right) operations: run-time ratio self-made/built-in [%] over message volume [bytes] on JUMP, for 2x2 up to 2x16 processes.]
[10] A. Knüpfer and W. E. Nagel. Compressible Memory Data Structures for Event Based Trace Data. Future Generation Computer Systems, Elsevier, January 2004. [submitted]
[11] D. Kranzlmüller, S. Grabner, and J. Volkert. Debugging with the MAD Environment. Parallel Computing, Vol. 23, No. 1–2, pages 199–217, April 1997.
[12] D. Kranzlmüller. Communication Pattern
Programs. In Proceedings of the
20th IASTED Intl. Multi-Conference Ap-
plied Informatics (AI 2002), International
Association of Science and Technol-
ogy for Development (IASTED), ACTA
Press, Innsbruck, Austria, pages 153–
158, February 2002.
[13] T. Kunz and M. Seuren. Fast Detection of
Communication Patterns in Distributed
Executions. In Proceedings of the 1997
Conference of The Centre for Advanced
Studies on Collaborative Research, IBM
Press, Toronto, Canada, 1997.
[14] M. Snir, S. Otto, S. Huss-Lederman,
D. Walker, J. Dongarra. MPI: The Com-
plete Reference. MIT Press, September
1998.
[15] F. Wolf and B. Mohr. EARL - A
Programmable and Extensible Toolkit
for Analyzing Event Traces of Message
Passing Programs. Technical report,
Forschungszentrum Jülich GmbH, April
1998. FZJ-ZAM-IB-9803.
[16] U. Detert. Introduction to the
Jump Architecture. Presentation,
Forschungszentrum Jülich GmbH, 2004.
http://jumpdoc.fz-juelich.de/