Pattern Matching of Collective MPI Operations
Dieter Kranzlmüller1,2, Andreas Knüpfer1, Wolfgang E. Nagel1
1Center for High Performance Computing
Dresden University of Technology, Germany
{knuepfer, nagel}@zhr.tu-dresden.de
2GUP - Institute of Graphics and Parallel Processing
Joh. Kepler University Linz, Austria/Europe
kranzlmueller@gup.jku.at
Abstract
Programming message passing systems can be tedious and error-prone, especially for the inexperienced user facing the sheer amount of functionality available in today's message passing libraries. Instead of choosing the best-suited communication function, many users tend to apply only a small set of well-known standard operations. This paper describes a pattern matching approach based on execution traces which detects connected groups of point-to-point communication operations that may resemble existing collective operations. After highlighting the detected patterns, users are able to improve their codes by replacing the point-to-point operations with more appropriate and efficient collective alternatives.
Keywords: parallel programming, message pass-
ing, pattern detection, collective operations
1 Motivation
The Message Passing Interface standard MPI [5] is probably the most widely used parallel programming paradigm today [4]. This popularity is based on the characteristics of MPI, which is intended to provide a "practical, portable, efficient, and flexible standard" for writing message-passing programs¹. Among these characteristics, efficiency of MPI implementations is achieved by allowing vendors to optimize MPI for their particular hardware, or even with dedicated hardware support, as long as the same functionality is provided.

¹ http://www-unix.mcs.anl.gov/mpi/
One example area for optimization is collective routines, such as broadcast, scatter, or gather operations. While it is possible to implement the complete set of collective operations using MPI point-to-point communication and some auxiliary functions, most vendors already provide optimized collective operations, which utilize hardware specifics (such as network topology) to improve the runtime performance of these routines [14].
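To illustrate why built-in collectives can be faster, consider a simple cost model (our own sketch, not taken from the paper): a root that sends to each of the other P-1 processes in turn needs P-1 sequential communication steps, while a binomial-tree scheme, in which every process that already holds the data forwards it concurrently, finishes in ceil(log2 P) rounds.

```python
import math

def linear_rounds(p):
    # The root sends to each of the other p-1 processes one after another.
    return p - 1

def tree_rounds(p):
    # Each round doubles the number of processes holding the data.
    return math.ceil(math.log2(p))

# e.g. with 32 processes: 31 sequential steps vs. 5 tree rounds
```

Real implementations choose among several such algorithms depending on message size, process count, and topology; the point here is only the asymptotic gap a manual point-to-point emulation gives up.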
Unfortunately, the rich functionality of MPI represents a substantial burden for the inexperienced programmer. While experts have a good idea which operations fit the intended program behavior on the underlying hardware architecture, the learning curve for the novice user approaching this level is quite steep. As a consequence, many beginners tend to apply well-known but suboptimal operations, or implement collective functionality by manually grouping point-to-point communication statements.
The approach described in this paper represents a solution to this problem based on the idea of pattern matching in program traces. If a tool is able to detect communication patterns which resemble collective operations, it may be possible to inform the user about more optimal alternatives when implementing message passing programs.

[Figure 1: Scatter Communication Pattern in Vampir Space Time Diagram - space-time view of 6 processes over a 0.162 s excerpt of the trace scatter06.vpt]
The first results following this idea are presented in this paper, which is organized as follows: Section 2 discusses the idea and its relation to pattern matching for program analysis activities. Afterwards, the current implementation is described together with a simple example, before the current state of the implementation and an outlook on future goals of this project are summarized.
2 General Idea
The situation described above has often been observed during our work on the program analysis tools Vampir [1, 2] and MAD [11]. While Vampir is well-known for its usage in performance analysis, MAD's major application is the area of program debugging. Nevertheless, both program analysis tools operate on traces of the program's execution, which are then visualized as space-time diagrams. The experience gained from observing different program executions, for different groups of users, in different real-world application domains has shown that point-to-point communication is often used where collective operations would be much more beneficial.
An example of a space-time diagram as
produced by Vampir is shown in Figure 1.
The 6 processes are arranged vertically, while
the horizontal axis represents the observation
time. Different colors and shades indicate dif-
ferent states during the execution of the pro-
cesses. Arcs connecting corresponding states
on distinct processes represent communication
or synchronization operations.
In the example of Figure 1, a scatter communication pattern is clearly visible. However, instead of using a collective operation (such as MPI_Scatter), the user has implemented the same behavior based on point-to-point operations (such as MPI_Send and MPI_Recv). As a result, optimizations that may be available within the MPI library cannot be exploited by the program.
The idea of our approach is to detect groups of communication patterns that may resemble collective operations. If such groups are found, the users can be informed and may be able to obtain improved performance by replacing these groups with corresponding collective operations.
The feasibility of using patterns for paral-
lel program development has been shown in
a series of related work. In [8] and [3], the
authors describe tools for automatic or semi-
automatic parallelization based on a limited
set of patterns that repeatedly occur in parallel
programs. With Patterntool [6], the design and
implementation of parallel programs is even
based on the specification of communication
patterns.
The pattern-oriented parallel debugger
Belvedere [7] facilitates the description,
manipulation, and animation of logically
structured patterns of process interactions for
highly parallel programs. Pattern matching on
program traces with the tool POET in order to
simplify the visual representation is described
in [13], while the Event Analysis and Recog-
nition Language EARL is described in [15].
These ideas are further extended to a general
communication pattern analysis approach for
parallel and distributed program analysis as
described in [12], while [10] utilizes pattern
matching for reducing the amount of event
data required during program analysis of large
scale program executions.
3 Pattern Matching in Program
Traces
In contrast to related event-based pattern matching approaches, which operate on streams of events and relations, our approach utilizes the so-called Complete Call Graph (CCG) [10]. While ordinary call graphs summarize the function call hierarchy, a CCG contains every instance of every function as a corresponding node, and one CCG is generated for every process. All properties of the function calls which are needed for program analysis, such as performance measurements, are attached to the graph nodes. The CCG for each process is generated by reading the trace files and mapping the events onto the CCG data structure.
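As a sketch of this data structure (the node layout and names below are our own simplification, not the authors' implementation), a CCG can be built by replaying the per-process trace and attaching one node per function instance, with analysis properties stored at the nodes:

```python
from dataclasses import dataclass, field

@dataclass
class CCGNode:
    name: str                                   # e.g. "main", "foo", "MPI_Send"
    props: dict = field(default_factory=dict)   # e.g. duration, bytes, peer
    children: list = field(default_factory=list)

def build_ccg(events):
    """Map a flat trace of (call_depth, name, props) records onto a CCG.
    Depth 1 denotes a direct callee of main."""
    root = CCGNode("main")
    stack = [root]
    for depth, name, props in events:
        del stack[depth:]                 # unwind to the caller at this depth
        node = CCGNode(name, props)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

The real CCG additionally compresses repeated sub-trees [10]; this sketch only shows the mapping from a linear event trace to the graph.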
An example CCG for a selected process 0 is
shown in Figure 2. The execution of the pro-
gram is mapped onto this graph from left to
right. Each layer of the CCG represents the
call sequence for one particular unit. The top-
most node represents the execution of the main
program, which consists of several parts, each
displayed with their particular execution time.
Whenever needed, finer levels of granularity
are chosen to provide more details for program
analysis activities. At the leaves of the graph
are atomic events, generated by e.g. message-
passing or input/output operations.
In the example of Figure 2, five MPI operations have been performed. The origin of the send events is always process 0 (as indicated by msg 0 → x). Additionally, Figure 2 shows that each send operation transfers the same number of bytes in the messages (100 bytes), and that the send events occur immediately after one another, without any other communication event disturbing the sequence.
Based on this CCG, the pattern matching
approach for detecting a scatter communica-
tion pattern operates as follows:
The CCG is traversed recursively, propagating the communication relation R: p → q (i.e. process p sends to process q ≠ p) upwards. With this information the smallest sub-trees containing a one-to-all relation are identified, which is defined as

    ∃ p ∈ P with ∀ q ∈ P, q ≠ p: p → q.    (1)
If such a sub-tree is detected, propagating information upwards is stopped and the actual pattern matching is performed among the messages in that sub-tree only. By this means, pattern matching is never applied to the global scope, which usually contains a very large number of messages.
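Condition (1) can be evaluated directly on the set of send relations propagated up from a sub-tree. A minimal sketch (function name and data layout are our own assumptions):

```python
def one_to_all_root(relations, processes):
    """relations: set of (sender, receiver) pairs collected in a sub-tree.
    Return a process p with p -> q for every q != p, or None if the
    sub-tree contains no one-to-all relation."""
    for p in processes:
        if all((p, q) in relations for q in processes if q != p):
            return p
    return None
```

As soon as such a root is found for a sub-tree, upward propagation can stop there and the detailed matching is restricted to that sub-tree's messages.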
For every selected sub-tree of the CCG the sequence S of message operations in temporal order is examined more closely: the goal is to find a sub-sequence s ⊆ S of messages sent by one process to every other process exactly once, without any other message passing operation interfering. For example, any intermediate receive operation would destroy such a sub-sequence, just as two messages to the same recipient would.
This is achieved by growing an initially empty sub-sequence s through appending successive messages from S, until either a complete scatter pattern is found or the intended pattern is violated. In both cases the pattern matching is continued from appropriate positions until the end of S.
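The scan described above can be sketched as follows (a simplified illustration; the event encoding and function name are our own, not the paper's implementation). Each candidate sub-sequence s grows from a send event; an interfering operation or a duplicate recipient aborts the candidate, and a match is reported once the root has reached every other process exactly once:

```python
def find_scatter(events, processes):
    """events: temporally ordered list of ('send'|'recv', src, dst).
    Return (start, end) indices of a scatter sub-sequence, or None."""
    n = len(events)
    i = 0
    while i < n:
        kind, src, dst = events[i]
        if kind != 'send':
            i += 1
            continue
        root, seen = src, set()
        j = i
        while j < n:
            k, s, d = events[j]
            # an interfering operation or duplicate target breaks the pattern
            if k != 'send' or s != root or d in seen:
                break
            seen.add(d)
            if seen == processes - {root}:
                return (i, j)
            j += 1
        i += 1   # restart the scan from the next position
    return None
```

For simplicity this sketch restarts one event later after a failed candidate; smarter restart positions, as hinted at in the text, avoid rescanning events that cannot begin a pattern.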
[Figure 2 content: a CCG for process 0 with root node main, child node foo, and below it node bar containing five MPI_Send leaves for msg 0 → 1 through msg 0 → 5, 100 bytes each.]
Figure 2: Complete Call Graph including Scatter Communication Pattern
For gather communication patterns this works analogously: simply exchange send and receive operations and swap sender and receiver processes.
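The gather case mirrors the scatter scan, with receives at a single root instead of sends from it. A self-contained sketch under the same assumptions (event encoding and name are our own):

```python
def find_gather(events, processes):
    """events: temporally ordered list of ('send'|'recv', src, dst).
    Return (start, end) of a gather sub-sequence: one root receives
    exactly one message from every other process, uninterrupted."""
    n = len(events)
    for i in range(n):
        kind, src, dst = events[i]
        if kind != 'recv':
            continue
        root, seen = dst, set()
        for j in range(i, n):
            k, s, d = events[j]
            # anything but a fresh receive at the root breaks the pattern
            if k != 'recv' or d != root or s in seen:
                break
            seen.add(s)
            if seen == processes - {root}:
                return (i, j)
    return None
```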
The CCG of Figure 2 corresponds to the space-time diagram of Figure 1 for process 0. With the algorithm described above, the pattern matcher is able to identify the group of MPI_Send operations in the sub-tree below node bar. Based on this knowledge, the user can be informed about the possibility of replacing these sends with a more efficient collective MPI_Scatter. MPI_Scatter implements a global operation which distributes data from one process to all other available processes, much like the detected group of send events does. However, in contrast to the individual send events, MPI_Scatter may be able to achieve better performance if it is optimized by the provider of the underlying MPI implementation.
The current version of the CCG pattern matcher is able to detect groups of send or receive events resembling MPI_Scatter and MPI_Gather operations, respectively. However, more research needs to be done to distinguish between communication patterns which are similar in their space-time diagram representation but differ in their semantics. For example, MPI_Scatter and MPI_Bcast (for broadcasting a message) are similar in shape, but while the former distributes different data to each process, the latter sends a copy of the exact same data. In order to detect these differences, further analysis of the event traces is required. In this context, it is also necessary to check for matching MPI communicators and tags in order to correct the number of matched patterns.
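One conceivable payload check for telling the two shapes apart (our own illustration; the paper leaves this distinction to future work): if all messages of a matched one-to-all group carry identical contents, the group behaves like MPI_Bcast, otherwise like MPI_Scatter.

```python
def classify_one_to_all(messages):
    """messages: list of (receiver, payload) pairs from one matched group.
    Identical payloads suggest a broadcast, distinct ones a scatter."""
    payloads = {payload for _, payload in messages}
    return "bcast" if len(payloads) == 1 else "scatter"
```

In practice one would compare checksums recorded in the trace rather than full payloads, which trace formats typically do not store.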
4 Benchmarks
In order to provide an impression of the
achievable performance improvements, a num-
ber of benchmarks have been conducted on
various platforms. The collective operations
known as gather, scatter, broadcast, and all-
to-all have been emulated using point to point
messages only. This was done in a straight-
forward manner without optimizations like
tree hierarchies, for example.
Figures 3 and 4 show the results of experiments on JUMP [16], a cluster of IBM p690 nodes at the John von Neumann Institute for Computing (NIC) of the Research Center Jülich, Germany. For every type of collective operation, the run-time ratio time_self-made / time_built-in vs. the message length has been plotted for several numbers of participating MPI processes.
For gather and scatter the performance advantage of built-in operations is relatively small, but nevertheless ranges up to 200 % and 300 %, respectively. As expected, broadcast profits most (up to 750 %) from using built-in operations instead of self-made ones, because it spreads a single message to all participants. This applies particularly to large messages and higher process counts. All-to-all achieves an acceleration of 200 %, and even over 400 % with 32 processes.
To our surprise, there were several cases among the experiments² that revealed built-in operations being slower than their self-made counterparts! Similar effects occurred on the SGI O3K platforms as well.
As the built-in operations are free to do the same as straightforward self-made implementations, this is an unnecessary drawback. One might consider a minor performance disadvantage for very small messages acceptable. However, notable losses for medium-sized messages are definitely not what a user wants to see! After all, it is remarkable that vendor-provided MPI implementations show such effects in the first place. Further measurements will be needed to identify the reasons for this behavior.
5 Conclusions and Future
Work
Pattern matching in event traces seems to be a promising approach for program analysis in parallel and distributed programs. Based on this assumption, this paper demonstrates a possibility of using pattern matching for performance tuning. The idea of the CCG pattern matcher is to detect groups of communication events which resemble available collective operations. Detected groups may then be replaced by more efficient, optimized function calls, which should improve the performance of the code.

² All experiments have been run multiple times; the run-time values were taken as the minima over all repetitions.
This initial version of the pattern matcher is already able to detect simple groups of point-to-point operations that may resemble MPI_Scatter or MPI_Gather functions. Our next goal in this project is to distinguish between similar patterns, such as MPI_Scatter and MPI_Bcast, in order to increase the accuracy of the pattern detection mechanism. In addition, we intend to study more complex patterns, such as MPI_Allgather or MPI_Alltoall, which promise an even higher potential for performance optimization.
Acknowledgments Several persons contributed to this work through their ideas and comments, notably Bernhard Aichinger, Christian Schaubschläger, and Prof. Jens Volkert from GUP Linz, Axel Rimnac from the University of Linz, and Holger Brunst from ZHR TU Dresden, as well as Beniamino Di Martino from UNINA, Italy, to whom we are most thankful.

We would also like to thank the John von Neumann Institute for Computing (NIC) at the Research Center Jülich for access to their IBM p690 machine JUMP under project number #k2720000 to perform our measurements as reported in this paper.
References
[1] H. Brunst, W. E. Nagel, and S. Seidl.
Performance Tuning on Parallel Systems:
All Problems Solved? In Proceedings of
PARA2000 - Workshop on Applied Par-
allel Computing, volume 1947 of LNCS,
pages 279–287. Springer-Verlag Berlin
Heidelberg New York, June 2000.
[2] H. Brunst, H.-Ch. Hoppe, W. E. Nagel, and M. Winkler. Performance Optimization for Large Scale Computing: The Scalable VAMPIR Approach. In Proceedings of ICCS 2001, San Francisco, USA, volume 2074 of LNCS, page 751ff. Springer-Verlag, May 2001.

[Figure 3: Comparison of built-in vs. self-made gather (left) and scatter (right) operations: run-time ratio self-made/built-in [%] over message volume [bytes] on JUMP, for 2x2 up to 2x16 processes.]
[3] B. Di Martino and B. Chapman. Pro-
gram Comprehension Techniques to Im-
prove Automatic Parallelization. In Pro-
ceedings of the Workshop on Automatic
Data Layout and Performance Predic-
tion, Center for Research on Parallel
Computation, Rice University, 1995.
[4] I. Foster. Designing and Building Parallel
Programs. Addison-Wesley, 1995.
[5] W. Gropp, E. Lusk, and A. Skjellum. Using MPI, 2nd Edition. MIT Press, November 1999.
[6] B. Gruber, G. Haring, D. Kranzlmüller, and J. Volkert. Parallel Programming with CAPSE - A Case Study. In Proceedings of PDP'96, 4th EUROMICRO Workshop on Parallel and Distributed Processing, Braga, Portugal, pages 130–137, January 1996.
[7] A.E. Hough and J.E. Cuny. Initial Ex-
periences with a Pattern-Oriented Paral-
lel Debugger. In Proceedings of the ACM
SIGPLAN/SIGOPS Workshop on Paral-
lel and Distributed Debugging, Madi-
son, Wisconsin, USA, SIGPLAN No-
tices, Vol. 24, No. 1, pp. 195–205, Jan-
uary 1989.
[8] Ch. Kessler. Pattern-driven Automatic
Parallelization. Scientific Programming,
Vol. 5, pages 251–274, 1996.
[9] A. Knüpfer. A New Data Compression Technique for Event Based Program Traces. In Proceedings of ICCS 2003, Melbourne, Australia, Springer LNCS Vol. 2659, pages 956–965, June 2003.

[Figure 4: Comparison of built-in vs. self-made broadcast (left) and all-to-all (right) operations: run-time ratio self-made/built-in [%] over message volume [bytes] on JUMP, for 2x2 up to 2x16 processes.]
[10] A. Knüpfer and W. E. Nagel. Compressible Memory Data Structures for Event Based Trace Data. Future Generation Computer Systems, Elsevier, January 2004. [submitted]
[11] D. Kranzlmüller, S. Grabner, and J. Volkert. Debugging with the MAD Environment. Parallel Computing, Vol. 23, No. 1–2, pages 199–217, April 1997.
[12] D. Kranzlmüller. Communication Pattern
Programs. In Proceedings of the
20th IASTED Intl. Multi-Conference Ap-
plied Informatics (AI 2002), International
Association of Science and Technol-
ogy for Development (IASTED), ACTA
Press, Innsbruck, Austria, pages 153–
158, February 2002.
[13] T. Kunz and M. Seuren. Fast Detection of
Communication Patterns in Distributed
Executions. In Proceedings of the 1997
Conference of The Centre for Advanced
Studies on Collaborative Research, IBM
Press, Toronto, Canada, 1997.
[14] M. Snir, S. Otto, S. Huss-Lederman,
D. Walker, J. Dongarra. MPI: The Com-
plete Reference. MIT Press, September
1998.
[15] F. Wolf and B. Mohr. EARL - A
Programmable and Extensible Toolkit
for Analyzing Event Traces of Message
Passing Programs. Technical report,
Forschungszentrum Jülich GmbH, April
1998. FZJ-ZAM-IB-9803.
[16] U. Detert. Introduction to the
Jump Architecture. Presentation,
Forschungszentrum Jülich GmbH, 2004.
http://jumpdoc.fz-juelich.de/