Pattern Matching of Collective MPI Operations
Dieter Kranzlmüller1,2, Andreas Knüpfer1, Wolfgang E. Nagel1
1Center for High Performance Computing
Dresden University of Technology, Germany
{knuepfer, nagel}@zhr.tu-dresden.de
2GUP - Institute of Graphics and Parallel Processing
Joh. Kepler University Linz, Austria/Europe
kranzlmueller@gup.jku.at
Abstract
Programming message passing systems can be te-
dious and error-prone, especially for the inexperi-
enced user facing the sheer amount of available
functionality in today's message passing libraries.
Instead of choosing the most optimal communica-
tion function, many users tend to apply only a small
set of well-known standard operations. This paper
describes a pattern matching approach based on
execution traces which detects connected groups of
point-to-point communication operations that may
resemble existing collective operations. After high-
lighting the detected patterns, users are able to im-
prove their codes by replacing the point-to-point
operations with more appropriate and efficient col-
lective alternatives.
Keywords: parallel programming, message pass-
ing, pattern detection, collective operations
1 Motivation
The Message Passing Interface standard
MPI [5] is probably the most used parallel
programming paradigm today [4]. This fact
is based on the characteristics of MPI, which should provide a “practical, portable, efficient, and flexible standard” for writing message-passing programs¹. Among these characteristics, efficiency of MPI implementations is achieved by allowing vendors the possibility to optimize MPI on their particular hardware or even with dedicated hardware support, as long as the same functionality is provided.

¹ http://www-unix.mcs.anl.gov/mpi/
One example area for optimization is collective routines, such as broadcast, scatter, or gather operations. While it is possible to im-
plement the complete library of collective op-
erations using the MPI point-to-point commu-
nication and some auxiliary functions, most
vendors already provide optimized collective
operations, which utilize hardware specifics
(such as network topology) for improving the
runtime performance of the routines [14].
Unfortunately, the rich functionality of MPI
represents a substantial burden for the in-
experienced programmer. While experts have
a good idea which operations fit the intended
program behavior on the underlying hardware
architecture, the learning curve for the novice
user approaching this level is quite steep. As
a consequence, many beginners tend to apply
well-known but non-optimal operations or im-
plement collective functionality by manually
grouping point-to-point communication state-
ments.
The approach described in this paper repre-
sents a solution for this problem based on the
idea of pattern matching in program traces. If
a tool is able to detect communication patterns
which resemble collective operations, it may
be possible to inform the user about more optimal alternative options when implementing message passing programs.

Figure 1: Scatter Communication Pattern in Vampir Space-Time Diagram (six processes on the vertical axis, observation time on the horizontal axis)
The first results following this idea are presented in this paper, which is organized as follows: Section 2 discusses the idea and its relation to pattern matching for program analysis activities. Afterwards, the current implementation is described together with a simple example, before the current state of the implementation and an outlook on future goals in this project are summarized.
2 General Idea
The situation described above has often been
observed during our work on the program anal-
ysis tools Vampir [1, 2] and MAD [11]. While
Vampir is well-known for its usage in perfor-
mance analysis, MAD’s major application is
the area of program debugging. Nevertheless,
both program analysis tools operate on traces
of the program’s execution, which are then vi-
sualized as space-time diagrams. The experi-
ence based on the observation of different pro-
gram executions for different groups of users
in different real-world application domains has
shown that point-to-point communication is often used where collective operations would be much more beneficial.
An example of a space-time diagram as
produced by Vampir is shown in Figure 1.
The 6 processes are arranged vertically, while
the horizontal axis represents the observation
time. Different colors and shades indicate dif-
ferent states during the execution of the pro-
cesses. Arcs connecting corresponding states
on distinct processes represent communication
or synchronization operations.
In the example of Figure 1, a scatter com-
munication pattern is clearly visible. However,
instead of using a collective operation (such as MPI_Scatter), the user has implemented the same behavior based on point-to-point operations (such as MPI_Send and MPI_Recv).
As a result, the optimization that may be avail-
able within the MPI library cannot be exploited
by the program.
The idea of our approach is to detect groups of communication patterns that may resemble
collective operations. If such groups are found,
the users can be informed and may be able
to obtain improved performance by replacing
these groups with corresponding collective op-
erations.
The feasibility of using patterns for paral-
lel program development has been shown in
a series of related work. In [8] and [3], the
authors describe tools for automatic or semi-
automatic parallelization based on a limited
set of patterns that repeatedly occur in parallel
programs. With Patterntool [6], the design and
implementation of parallel programs is even
based on the specification of communication
patterns.
The pattern-oriented parallel debugger
Belvedere [7] facilitates the description,
manipulation, and animation of logically
structured patterns of process interactions for
highly parallel programs. Pattern matching on
program traces with the tool POET in order to
simplify the visual representation is described
in [13], while the Event Analysis and Recog-
nition Language EARL is described in [15].
These ideas are further extended to a general
communication pattern analysis approach for
parallel and distributed program analysis as
described in [12], while [10] utilizes pattern
matching for reducing the amount of event
data required during program analysis of large
scale program executions.
3 Pattern Matching in Program
Traces
In contrast to related event-based pattern matching approaches, which operate on streams of events and relations, our approach utilizes the so-called Complete Call Graph (CCG) [10]. While ordinary call graphs sum-
marize the function call hierarchy, a CCG con-
tains every instance of every function as a cor-
responding node, and one CCG is generated
for every process. All properties of the func-
tion calls which are needed for program analy-
sis, such as performance measurements, are at-
tached to the graph nodes. The CCG for each
process is generated by reading the trace files
and mapping the events onto the CCG data
structure.
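To make this more concrete, the following is a minimal C++ sketch of what a CCG and its leaf events could look like in memory. It is an illustrative assumption only, not the actual data structure from [10]; all type and field names are invented for this sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Atomic leaf event, e.g. one MPI_Send or MPI_Recv (illustrative sketch).
struct MessageEvent {
    bool     is_send;    // true for a send, false for a receive
    int      peer;       // rank of the communication partner
    uint64_t bytes;      // message size
    uint64_t timestamp;  // trace timestamp
};

// One node per instance of a function call, with the per-call properties
// needed for analysis attached directly to the node.
struct CCGNode {
    std::string               name;      // e.g. "main", "foo", "MPI_Send"
    uint64_t                  duration;  // attached performance measurement
    std::vector<CCGNode*>     children;  // call sequence, left to right = temporal order
    std::vector<MessageEvent> events;    // message events generated by this call (leaves)
};

// One such graph is built per process by replaying its trace file and
// mapping every function enter/leave and message event onto the structure.
```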
An example CCG for a selected process 0 is
shown in Figure 2. The execution of the pro-
gram is mapped onto this graph from left to
right. Each layer of the CCG represents the
call sequence for one particular unit. The top-
most node represents the execution of the main
program, which consists of several parts, each
displayed with their particular execution time.
Whenever needed, finer levels of granularity
are chosen to provide more details for program
analysis activities. At the leaves of the graph
are atomic events, generated by e.g. message-
passing or input/output operations.
In the example of Figure 2, five MPI operations have been performed. The origin of the send events is always process 0 (as indicated by "msg 0 -> x"). Additionally, Figure 2 shows that each send operation transfers the same number of bytes in the messages (100 bytes), and that the send events occur immediately after one another without any other communication event disturbing the sequence.
Based on this CCG, the pattern matching
approach for detecting a scatter communica-
tion pattern operates as follows:
The CCG is traversed recursively, propagating the communication relation $R: p \to q$ (i.e., process $p$ sends to process $q \neq p$) upwards. With this information the smallest sub-trees containing a one-to-all relation are identified, which is defined as

$$\exists\, p \in P \text{ with } \forall\, q \in P,\ q \neq p:\ p \to q. \qquad (1)$$
If such a sub-tree is detected, the upward propagation is stopped and the actual pattern matching is performed among the messages in that sub-tree only. By this means, pattern matching is never applied to the global scope, which usually contains a very large number of messages.
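A possible realization of this recursive propagation is sketched below, building on the illustrative CCGNode structure given above. The function names and the reporting policy are assumptions for this sketch, not the tool's actual interface: every sub-tree returns the set of destination ranks of its send events, and a node is reported as a candidate only if it covers all other ranks while none of its children does.

```cpp
#include <set>
#include <vector>
// reuses the illustrative CCGNode / MessageEvent types sketched above

static bool covers_all_others(const std::set<int>& dests, int my_rank, int num_ranks) {
    for (int q = 0; q < num_ranks; ++q)
        if (q != my_rank && dests.count(q) == 0)
            return false;                      // some other rank is never reached
    return true;                               // one-to-all relation (1) holds
}

// Propagate the send destinations upwards and collect the smallest sub-trees
// that satisfy the one-to-all relation in this process's CCG.
static std::set<int> propagate(const CCGNode* node, int my_rank, int num_ranks,
                               std::vector<const CCGNode*>& candidates) {
    std::set<int> dests;
    for (const MessageEvent& ev : node->events)
        if (ev.is_send)
            dests.insert(ev.peer);

    bool child_covers = false;
    for (const CCGNode* child : node->children) {
        std::set<int> d = propagate(child, my_rank, num_ranks, candidates);
        if (covers_all_others(d, my_rank, num_ranks))
            child_covers = true;               // a smaller sub-tree already suffices
        dests.insert(d.begin(), d.end());
    }

    // Report this node only if it is the smallest covering sub-tree: it
    // satisfies relation (1), but none of its children does on its own.
    if (!child_covers && covers_all_others(dests, my_rank, num_ranks))
        candidates.push_back(node);

    return dests;
}
```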
For every selected sub-tree of the CCG, the sequence $S$ of message operations in temporal order is examined more closely: The goal is to find a sub-sequence $s \subseteq S$ of messages sent by one process to every other process exactly once, without any other message passing operation interfering. For example, any intermediate receive operation would destroy such a sub-sequence, just as two messages to the same recipient would.

This is achieved by growing an initially empty sub-sequence $s$ through appending successive messages from $S$ until a complete scatter pattern is found or the intended pattern is violated. In both cases the pattern matching is continued from appropriate positions until the end of $S$.
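The scan over the message sequence can be sketched as follows; the MsgOp record and the simple restart policy after a violation are simplifying assumptions (a more careful implementation would resume from the best possible earlier position, as indicated above).

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// One entry of the temporally ordered message sequence S of a selected
// sub-tree (illustrative type, not the tool's actual interface).
struct MsgOp {
    bool is_send;   // false = receive, which interrupts a scatter candidate
    int  peer;      // destination rank of a send
};

// Find sub-sequences in which the local process sends to every other rank
// exactly once, with no interfering message operation in between.
std::vector<std::pair<std::size_t, std::size_t>>   // [begin, end) ranges in S
find_scatter_subsequences(const std::vector<MsgOp>& S, int my_rank, int num_ranks) {
    std::vector<std::pair<std::size_t, std::size_t>> found;
    std::set<int> recipients;          // grows with the candidate sub-sequence s
    std::size_t begin = 0;

    for (std::size_t i = 0; i < S.size(); ++i) {
        const MsgOp& op = S[i];
        if (!op.is_send || recipients.count(op.peer)) {
            // a receive or a repeated recipient violates the intended pattern;
            // restart the candidate (simplified restart position)
            recipients.clear();
            begin = op.is_send ? i : i + 1;
            if (op.is_send)
                recipients.insert(op.peer);
            continue;
        }
        recipients.insert(op.peer);
        if (recipients.size() == static_cast<std::size_t>(num_ranks - 1) &&
            recipients.count(my_rank) == 0) {
            found.emplace_back(begin, i + 1);  // complete scatter pattern found
            recipients.clear();
            begin = i + 1;
        }
    }
    return found;
}
```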
Figure 2: Complete Call Graph including Scatter Communication Pattern (process 0: nodes main, foo, and bar, with five MPI_Send leaves "msg 0 -> 1" through "msg 0 -> 5" of 100 bytes each)
For gather communication patterns this
works analogously: simply exchange send and
receive operations and swap sender and re-
ceiver processes.
The CCG of Figure 2 corresponds to the
space-time diagram of Figure 1 for process 0.
With the algorithm described above, the pattern matcher is able to identify the group of MPI_Send operations in the sub-tree below node bar. Based on this knowledge, the user can be informed about the possibility of replacing these sends with a more efficient collective MPI_Scatter. MPI_Scatter implements a global operation which distributes data from one process to all other available processes, much like the detected group of send events does. However, in contrast to the send events, MPI_Scatter may be able to achieve better performance if it is optimized by the provider of the underlying MPI implementation.
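As an illustration of the suggested replacement, the following hedged sketch contrasts the detected point-to-point pattern with its collective counterpart. Buffer and function names are invented for this example; in Figure 2, root 0 sends 100-byte blocks. Note one small semantic difference: MPI_Scatter also delivers a block to the root itself, so the point-to-point variant copies the root's own block explicitly to keep the two versions equivalent.

```cpp
#include <cstring>
#include <mpi.h>

// Detected pattern: root 0 distributes one block per rank with individual sends.
void distribute_selfmade(char* blocks, char* myblock, int blocksize,
                         int rank, int nprocs) {
    if (rank == 0) {
        std::memcpy(myblock, blocks, blocksize);      // root keeps block 0 itself
        for (int q = 1; q < nprocs; ++q)              // one send per other rank
            MPI_Send(blocks + q * blocksize, blocksize, MPI_CHAR,
                     q, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(myblock, blocksize, MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

// Suggested replacement: a single collective call with the same data layout.
void distribute_collective(char* blocks, char* myblock, int blocksize,
                           int /*rank*/, int /*nprocs*/) {
    MPI_Scatter(blocks, blocksize, MPI_CHAR,
                myblock, blocksize, MPI_CHAR, 0, MPI_COMM_WORLD);
}
```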
The current version of the CCG pattern matcher is able to detect groups of send or receive events resembling MPI_Scatter and MPI_Gather operations, respectively. However, more research needs to be done to distinguish between communication patterns which are similar in their space-time diagram representation but different in their semantics. For example, MPI_Scatter and MPI_Bcast (for broadcasting a message) are similar in shape, but while the former distributes different data to each process, the latter sends a copy of the exact same data. In order to detect these differences, some more analysis of the event traces is required. In this context, it is also necessary to check for similarity in MPI communicators and tags in order to correct the number of matched patterns.
4 Benchmarks
In order to provide an impression of the
achievable performance improvements, a num-
ber of benchmarks have been conducted on
various platforms. The collective operations
known as gather, scatter, broadcast, and all-
to-all have been emulated using point-to-point
messages only. This was done in a straight-
forward manner without optimizations like
tree hierarchies, for example.
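To illustrate how such a comparison can be measured, the following is a sketch (not the benchmark code actually used) that times a straightforward self-made broadcast against MPI_Bcast for one message size and reports the run-time ratio in percent, taking the minimum over repetitions as done for the reported run-time values. Function names and the repetition count are assumptions.

```cpp
#include <algorithm>
#include <vector>
#include <mpi.h>

// Straightforward emulation: the root sends the whole buffer to every other
// rank individually, without any tree hierarchy.
static void bcast_selfmade(char* buf, int bytes, int rank, int nprocs) {
    if (rank == 0) {
        for (int q = 1; q < nprocs; ++q)
            MPI_Send(buf, bytes, MPI_CHAR, q, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

// Run-time ratio self-made / built-in in percent, as seen by the calling rank;
// values above 100 % mean the built-in MPI_Bcast is faster.
double broadcast_ratio(int bytes, int rank, int nprocs, int repetitions = 10) {
    std::vector<char> buf(bytes);
    double t_self = 1e30, t_builtin = 1e30;

    for (int r = 0; r < repetitions; ++r) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        bcast_selfmade(buf.data(), bytes, rank, nprocs);
        t_self = std::min(t_self, MPI_Wtime() - t0);       // keep the minimum

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Bcast(buf.data(), bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
        t_builtin = std::min(t_builtin, MPI_Wtime() - t0);
    }
    return 100.0 * t_self / t_builtin;
}
```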
Figures 3 and 4 show the results of experiments on JUMP [16], a cluster of IBM p690 nodes at the John von Neumann Institute for Computing (NIC) of the Research Center Jülich, Germany. For every type of collective operation, the run-time ratio $t_{\text{self-made}} / t_{\text{built-in}}$ has been plotted against the message length for several numbers of participating MPI processes.
For gather and scatter the performance advantage of built-in operations is relatively small, but nevertheless ranges up to 200 % and 300 %, respectively. As expected, the broadcast can profit most (up to 750 %) from using built-in operations instead of self-made ones, because it spreads a single message to all participants. This applies particularly to large messages and higher process counts. All-to-all achieves an acceleration of 200 % and even over 400 % with 32 processes.
To our surprise there are several cases among the experiments² that revealed built-in operations being slower than the self-made counterparts! Similar effects occurred on the SGI O3K platforms as well.

As the built-in operations are free to do the same as a straightforward self-made implementation, this is an unnecessary drawback. One might consider a minor performance disadvantage for very small messages acceptable. However, medium-sized messages with notable losses are definitely not what a user wants to see! After all, it is interesting to note that MPI implementations provided by vendors show such effects in the first place. Some more measurements will be needed to identify the reasons for this behavior.

² All experiments have been run multiple times; the run-time values were taken as the minima over all repetitions.
5 Conclusions and Future
Work
Pattern matching in event traces seems a
promising approach for program analysis in
parallel and distributed programs. Based on
this assumption, this paper demonstrates a pos-
sibility of using pattern matching for perfor-
2All experiments have been run multiple times, the
run-time values were taken as the minima over all repe-
titions.
mance tuning. The idea of the CCG pattern
matcher is to detect groups of communication
events, which resemble available collective op-
erations. Detected groups may then be re-
placed by more efcient, optimized function
calls, which should improve the performance
of the code.
This initial version of the pattern matcher
is already able to detect simple collective op-
erations that may resemble MPI Scatter
or MPI Gather functions. Our next goal
in this project is to distinguish between sim-
ilar patterns, such as MPI Scatter and
MPI Bcast, in order to increase the ac-
curacy of the pattern detection mechanism.
In addition, we intend to study more com-
plex patterns, such as MPI Allgather,or
MPI Alltoall, which promise an even
higher potential for performance optimization.
Acknowledgments Several persons con-
tributed to this work through their ideas
and comments, notably Bernhard Aichinger,
Christian Schaubschläger, and Prof. Jens
Volkert from GUP Linz, Axel Rimnac from
the University Linz, and Holger Brunst from
ZHR TU Dresden, as well as Beniamino Di
Martino from UNINA, Italy, to whom we are
most thankful.
We would also like to thank the John von
Neumann Institute for Computing (NIC) at the
Research Center Juelich for access to their
IBM p690 machine Jump under project num-
ber #k2720000 to perform our measurements
as reported in this paper.
References
[1] H. Brunst, W. E. Nagel, and S. Seidl.
Performance Tuning on Parallel Systems:
All Problems Solved? In Proceedings of
PARA2000 - Workshop on Applied Par-
allel Computing, volume 1947 of LNCS,
pages 279–287. Springer-Verlag Berlin
Heidelberg New York, June 2000.
[2] H. Brunst, H.-Ch. Hoppe, W.E. Nagel,
and M. Winkler. Performance Optimization for Large Scale Computing: The Scalable VAMPIR Approach. In Proceedings of ICCS 2001, San Francisco, USA, volume 2074 of LNCS, page 751ff. Springer-Verlag, May 2001.

Figure 3: Comparison of built-in vs. self-made gather (left) and scatter (right) operations on JUMP: run-time ratio self-made / built-in [%] over message volume [bytes], for 2x2 up to 2x16 processes.
[3] B. Di Martino and B. Chapman. Pro-
gram Comprehension Techniques to Im-
prove Automatic Parallelization. In Pro-
ceedings of the Workshop on Automatic
Data Layout and Performance Predic-
tion, Center for Research on Parallel
Computation, Rice University, 1995.
[4] I. Foster. Designing and Building Parallel
Programs. Addison-Wesley, 1995.
[5] W. Gropp, E. Lusk, and A. Skjellum.
Using MPI, 2nd Edition. MIT Press,
November 1999.
[6] B. Gruber, G. Haring, D. Kranzlmüller,
and J. Volkert. Parallel Programming
with CAPSE - A Case Study. in Proceed-
ings PDP’96, 4th EUROMICRO Work-
shop on Parallel and Distributed Pro-
cessing, Braga, Portugal, pages 130–137,
January 1996.
[7] A.E. Hough and J.E. Cuny. Initial Ex-
periences with a Pattern-Oriented Paral-
lel Debugger. In Proceedings of the ACM
SIGPLAN/SIGOPS Workshop on Paral-
lel and Distributed Debugging, Madi-
son, Wisconsin, USA, SIGPLAN No-
tices, Vol. 24, No. 1, pp. 195–205, Jan-
uary 1989.
[8] Ch. Kessler. Pattern-driven Automatic
Parallelization. Scientific Programming,
Vol. 5, pages 251–274, 1996.
[9] A. Knüpfer. A New Data Compression Technique for Event Based Program Traces. In Proceedings of ICCS 2003, Melbourne, Australia, volume 2659 of LNCS, pages 956–965. Springer-Verlag, June 2003.

Figure 4: Comparison of built-in vs. self-made broadcast (left) and all-to-all (right) operations on JUMP: run-time ratio self-made / built-in [%] over message volume [bytes], for 2x2 up to 2x16 processes.
[10] A. Knüpfer and Wolfgang E. Nagel.
Compressible Memory Data Structures
for Event Based Trace Data. Future Gen-
eration Computer Systems by Elsevier,
January 2004. [submitted]
[11] D. Kranzlmüller, S. Grabner, and
J. Volkert. Debugging with the MAD
Environment. Parallel Computing, Vol.
23, No. 1–2, pages 199–217, April 1997.
[12] D. Kranzlmüller. Communication Pat-
tern Analysis in Parallel and Distributed
Programs. In Proceedings of the
20th IASTED Intl. Multi-Conference Ap-
plied Informatics (AI 2002), International
Association of Science and Technol-
ogy for Development (IASTED), ACTA
Press, Innsbruck, Austria, pages 153–
158, February 2002.
[13] T. Kunz and M. Seuren. Fast Detection of
Communication Patterns in Distributed
Executions. In Proceedings of the 1997
Conference of The Centre for Advanced
Studies on Collaborative Research, IBM
Press, Toronto, Canada, 1997.
[14] M. Snir, S. Otto, S. Huss-Lederman,
D. Walker, J. Dongarra. MPI: The Com-
plete Reference. MIT Press, September
1998.
[15] F. Wolf and B. Mohr. EARL - A
Programmable and Extensible Toolkit
for Analyzing Event Traces of Message
Passing Programs. Technical report,
Forschungszentrum Jülich GmbH, April
1998. FZJ-ZAM-IB-9803.
[16] U. Detert. Introduction to the
Jump Architecture. Presentation,
Forschungszentrum Jülich GmbH, 2004.
http://jumpdoc.fz-juelich.de/
D. Kranzlmüller. Communication Pattern Analysis in Parallel and Distributed Programs. In Proceedings of the 20th IASTED Intl. Multi-Conference Applied Informatics (AI 2002), International Association of Science and Technology for Development (IASTED), ACTA Press, Innsbruck, Austria, pages 153– 158, February 2002. [13] T. Kunz and M. Seuren. Fast Detection of Communication Patterns in Distributed Executions. In Proceedings of the 1997 Conference of The Centre for Advanced Studies on Collaborative Research, IBM Press, Toronto, Canada, 1997.