William Gropp
University of Illinois, Urbana-Champaign | UIUC · Department of Computer Science

About

375

Publications

53,497

Reads

25,899

Citations

Skills and Expertise

Domain Decomposition

Scientific Computation

Scientific Computing

Scientific Programming

High Performance Computing

Message Passing

Supercomputing

Publications

Preparing Algebraic Multigrid for Exascale

Article

Full-text available

Nov 2015

Algebraic Multigrid (AMG) solvers are an essential component of many large-scale scientific simulation codes. Their continued numerical scalability and efficient implementation are critical for preparing these codes for exascale. Our experiences on modern multi-core machines show that significant challenges must be addressed for AMG to perform well...

Series Foreword

Chapter

Nov 2015

An overview of the most prominent contemporary parallel processing programming models, written in a unique tutorial style. With the coming of the parallel computing era, computer scientists have turned their attention to designing programming models that are suited for high-performance parallel computing and supercomputing systems. Programming para...

Message Passing Interface

Chapter

Nov 2015

Rethinking Key-Value Store for Parallel I/O Optimization

Article

Apr 2015

Key-Value Stores (KVStore) are being widely used as the storage system for large-scale Internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems (PFS) are the dominant storage systems. In this study, we carefully examine the architecture difference and performance characteristics of PFS a...

Collective Algorithms for Multiported Torus Networks

Article

Feb 2015

Modern supercomputers with torus networks allow each node to simultaneously pass messages on all of its links. However, most collective algorithms are designed to only use one link at a time. In this work, we present novel multiported algorithms for the scatter, gather, all-gather, and reduce-scatter operations. Our algorithms can be combined to cr...

Locality-Optimized Mixed Static/Dynamic Scheduling for Improving Load Balancing on SMPs

Article

Sep 2014

Application performance can be degraded significantly due to node-local load imbalances during application execution. Prior work suggested using a mixed static/dynamic scheduling approach for handling this problem, specifically in the context of fine-grained, transient load imbalances. Here, we consider an alternate strategy for more general load i...

Decoupled I/O for Data-Intensive High Performance Computing

Conference Paper

Full-text available

Sep 2014

The I/O bottleneck issue has been acknowledged as one of the main performance issues of high performance computing (HPC) systems for data-intensive scientific applications, and has attracted intensive studies in recent years. With the enlarging gap between the computing bandwidth and I/O bandwidth in projected next-generation HPC systems, this issue w...

Enabling the environmentally clean air transportation of the future: A vision of computational fluid dynamics in 2030

Article

Full-text available

Aug 2014

As global air travel expands rapidly to meet demand generated by economic growth, it is essential to continue to improve the efficiency of air transportation to reduce its carbon emissions and address concerns about climate change. Future transports must be 'cleaner' and designed to include technologies that will continue to lower engine emissions...

Toward Exascale Resilience: 2014 update

Article

Mar 2014

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various...

Applications of the streamed storage format for sparse matrix operations

Article

Feb 2014

The streamed storage format for sparse matrices showed good performance improvement for sparse matrix and vector multiply (SpMV) compared with compressed sparse row (CSR) and block CSR (BCSR) formats, particularly on IBM Power processors. We extend the format to exploit single instruction multiple data (SIMD) instructions in order to utilize the ve...

Special Issue: SC13 – The International Conference for High Performance Computing, Networking, Storage and Analysis

Article

Full-text available

Jan 2014

The technical papers program for SC13 received 449 submissions, of which 90 were selected for the program, giving an acceptance rate of 20%. A rigorous peer review process, including author rebuttals and a 1.5 day face-to-face program committee meeting, ensured that selected papers were the very best in our field. One of the tasks at the face-to-face...

Toward exascale resilience: 2014 update. Supercomput. Front

Article

Jan 2014

PETSc web page

Article

Jan 2014

MPI + MPI: A new hybrid approach to parallel programming with MPI plus shared memory

Article

Full-text available

Dec 2013

Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmer...

Optimization Strategies for MPI-Interoperable Active Messages

Conference Paper

Dec 2013

Data-intensive applications, such as those in bioinformatics and social network analysis, differ from traditional scientific applications in that they often involve data-driven and irregular computation/communication patterns, making them ill-suited for traditional data movement approaches. Active Messages (AM) is an alternative programming model t...

MPI-Interoperable Generalized Active Messages

Conference Paper

Dec 2013

Data-intensive applications have become increasingly important in recent years, yet traditional data movement approaches for scientific computation are not well suited for such applications. The Active Message (AM) model is an alternative communication paradigm that is better suited for such applications by allowing computation to be dynamically mo...

Programming for Exascale Computers

Article

Nov 2013

Exascale systems will present programmers with many challenges. The authors review the parallel programming models that are appropriate for such systems and the challenges that implementations of those models face in an exascale system. They also discuss the feasibility of using existing programming systems, thus preserving the investment in legacy...

Analysis of topology-dependent MPI performance on Gemini networks

Conference Paper

Sep 2013

Current HPC systems utilize a variety of interconnection networks, with varying features and communication characteristics. MPI normalizes these interconnects with a common interface used by most HPC applications. However, network properties can have a significant impact on application performance. We explore the impact of the interconnect on appli...

Runtime system design of decoupled execution paradigm for data-intensive high-end computing

Conference Paper

Sep 2013

High performance computing is widely used for scientific discoveries by running scientific computation programs. Many of these applications are getting more and more data intensive [1]. They generate or access huge amounts of data during some execution phases. However, traditional supercomputers are designed for computing-intensive tasks. They usua...

Parallel Adaptive Deflated GMRES

Conference Paper

May 2013

Many scientific libraries are currently based on the GMRES method as a Krylov subspace iterative method for solving large linear systems. The restarted formulation known as GMRES(m) has been extensively studied and several approaches have been proposed to reduce the negative effects due to the restarting procedure. A common effect in GMRES(m) is a...

Performance Analysis of the Lattice Boltzmann Model Beyond Navier-Stokes

Conference Paper

Full-text available

May 2013

The lattice Boltzmann method is increasingly important in facilitating large-scale fluid dynamics simulations. To date, these simulations have been built on discretized velocity models of up to 27 neighbors. Recent work has shown that higher order approximations of the continuum Boltzmann equation enable not only recovery of the Navier-Stokes hydro...

Toward Asynchronous and MPI-Interoperable Active Messages

Conference Paper

May 2013

Many new large-scale applications have emerged recently and become important in areas such as bioinformatics and social networks. These applications are often data-intensive and involve irregular communication patterns and complex operations on remote processes. Active messages have proven effective for parallelizing such nontraditional application...

Systematic Reduction of Data Movement in Algebraic Multigrid Solvers

Conference Paper

May 2013

Algebraic Multigrid (AMG) solvers find wide use in scientific simulation codes. Their ideal computational complexity makes them especially attractive for solving large problems on parallel machines. However, they also involve a substantial amount of data movement, posing challenges to performance and scalability. In this paper, we present an algori...

Multiphysics Simulations: Challenges and Opportunities

Article

Full-text available

Feb 2013

We consider multiphysics applications from algorithmic and architectural perspectives, where “algorithmic” includes both mathematical analysis and computational complexity, and “architectural” includes both software and hardware environments. Many diverse multiphysics applications can be reduced, en route to their computational simulation, to a...

Abstract: Slack-Conscious Lightweight Loop Scheduling for Improving Scalability of Bulk-synchronous MPI Applications

Conference Paper

Full-text available

Nov 2012

Due to the strict communication dependences in the global collective communication of MPI applications, noise that delays one process can amplify across processes in a large run. The amount of overhead that noise amplification causes can increase dramatically as we scale the application to a very large number of processes (10,000 or more). For hyb...

A Case for Optimistic Coordination in HPC Storage Systems

Conference Paper

Nov 2012

High-performance computing (HPC) storage systems rely on access coordination to ensure that concurrent updates do not produce incoherent results. HPC storage systems typically employ pessimistic distributed locking to provide this functionality in cases where applications cannot perform their own coordination. This approach, however, introduces sig...

Performance Modeling of Algebraic Multigrid on Blue Gene/Q: Lessons Learned

Conference Paper

Nov 2012

The IBM Blue Gene/Q represents a large step in the evolution of massively parallel machines. It features 16-core compute nodes, with additional parallelism in the form of four simultaneous hardware threads per core, connected together by a five-dimensional torus network. Machines are being built with core counts in the hundreds of thousands, with t...

Advanced MPI including new MPI-3 features

Conference Paper

Sep 2012

This tutorial will cover several advanced topics in MPI. We will cover one-sided communication, dynamic processes, multithreaded communication and hybrid programming, and parallel I/O. We will also discuss new features in the newest version of MPI, MPI-3, which is expected to be officially released a few days before this tutorial. The tutorial will...

Efficient Multithreaded Context ID Allocation in MPI

Conference Paper

Sep 2012

An important aspect of support for multithreaded MPI executions is the management of communication context identifiers (IDs), which are used to associate MPI communication operations with a communicator. New communicator creation functionality in MPI 3.0 adds complexity to this core resource management problem. We present an efficient algorithm for...

Leveraging MPI’s One-Sided Communication Interface for Shared-Memory Programming

Conference Paper

Full-text available

Sep 2012

Hybrid parallel programming with MPI for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizi...

Adaptive Strategy for One-Sided Communication in MPICH2

Conference Paper

Sep 2012

The one-sided communication model supported by MPI-2 can be more convenient to use than the regular two-sided communication model and has potential to provide better performance. The MPI-2 standard gives flexibility about when RMA operations can be issued and completed. The current MPICH2 implementation employs a lazy approach, in which operations...

Lecture Notes in Computer Science

Conference Paper

Sep 2012

William Gropp

The Message Passing Interface (MPI) was developed over eighteen years ago and continues to be the preferred programming model for scientific computing. Contributing to that success was a combination of forward-looking features, precise definition, and judgment based on the experience of developers, vendors and users. Today, MPI continues to adapt t...

Faster Topology-aware Collective Algorithms Through Non-minimal Communication

Conference Paper

Sep 2012

Known algorithms for two important collective communication operations, allgather and reduce-scatter, are minimal-communication algorithms; no process sends or receives more than the minimum amount of data. This, combined with the data-ordering semantics of the operations, limits the flexibility and performance of these algorithms. Our novel non-mi...

Modeling the Performance of an Algebraic Multigrid Cycle Using Hybrid MPI/OpenMP

Conference Paper

Sep 2012

The rise of multicore cluster architectures has led to intense interest in using a combination of MPI and OpenMP to more effectively program these machines. We present a performance model for hybrid implementation of the solve cycle of algebraic multigrid (AMG), a popular iterative solver for large sparse linear systems and a key component of many...

A Decoupled Execution Paradigm for Data-Intensive High-End Computing

Conference Paper

Sep 2012

High-end computing (HEC) applications in critical areas of science and technology tend to be more and more data intensive. I/O has become a vital performance bottleneck of modern HEC practice. Conventional HEC execution paradigms, however, are computing-centric for computation intensive applications. They are designed to utilize memory and CPU perf...

Adaptive thread distributions for SpMV on a GPU

Conference Paper

Jul 2012

We present a simple auto-tuning method to improve the performance of sparse matrix-vector multiply (SpMV) on a GPU. The sparse matrix, stored in CSR format, is sorted in increasing order of the number of nonzero elements per row and partitioned into several ranges. The number of GPU threads per row (TPR) is then assigned for different ranges of the...

Analytical Performance Prediction for Evaluation and Tuning of GPGPU Applications

Article

Full-text available

May 2012

In this paper we present an analytical model to predict the performance of general purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and help it narrow the search to the more promising implementations. This work is based on the NVIDIA GPUs using CUDA (Compute Unified...

Best Algorithms plus Best Computers = Powerful Match

Article

May 2012

William Gropp

The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts. Follow us on Twitter at http://twitter.com/blogCACM ...

Multiphysics Simulations: Challenges and Opportunities

Book

Full-text available

Jan 2012

We consider multiphysics applications from algorithmic and architectural perspectives, where “algorithmic” includes both mathematical analysis and computational complexity and “architectural” includes both software and hardware environments. Many diverse multiphysics applications can be reduced, en route to their computational simulation, to a comm...

Multiphysics simulations: Challenges and opportunities

Article

Jan 2012

PETSc users manual

Article

Full-text available

Jan 2012

Performance Expectations and Guidelines for MPI Derived Datatypes

Conference Paper

Full-text available

Dec 2011

MPI’s derived datatypes provide a powerful mechanism for concisely describing arbitrary, noncontiguous layouts of user data for use in MPI communication. This paper formulates self-consistent performance guidelines for derived datatypes. Such guidelines make performance expectations for derived datatypes explicit and suggest relevant optimizations...

Formal Analysis of MPI-based Parallel Programs

Article

Full-text available

Dec 2011

Most parallel computing applications in high-performance computing use the Message Passing Interface (MPI) API. Given the fundamental importance of parallel computing to science and engineering research, application correctness is paramount. MPI was originally developed around 1993 by the MPI Forum, a group of vendors, parallel programming researche...

Weighted locality-sensitive scheduling for mitigating noise on multi-core clusters

Conference Paper

Full-text available

Dec 2011

Recent studies have shown that operating system (OS) interference, popularly called OS noise, can be a significant problem as we scale to a large number of processors. One solution for mitigating noise is to turn off certain OS services on the machine. However, this is typically infeasible because full-scale OS services may be required for some appl...

Avoiding hot-spots on two-level direct networks

Conference Paper

Nov 2011

A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of...

Performance modeling for systematic performance tuning

Article

Full-text available

Nov 2011

The performance of parallel scientific applications depends on many factors which are determined by the execution environment and the parallel application. Especially on large parallel systems, it is too expensive to explore the solution space with a series of experiments. Deriving analytical models for applications and platforms allows estimating and...

Hybrid Static/dynamic Scheduling for Already Optimized Dense Matrix Factorization

Article

Full-text available

Oct 2011

We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to signific...

Multi-core and Network Aware MPI Topology Functions

Conference Paper

Full-text available

Sep 2011

The MPI standard offers a set of topology-aware interfaces that can be used to construct graph and Cartesian topologies for MPI applications. These interfaces have been mostly used for topology construction and not for performance improvement. To optimize the performance, in this paper we use graph embedding and node/network architecture discovery modu...

Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes

Conference Paper

Full-text available

Jun 2011

The first Teraflop/s computer, the ASCI Red, became operational in 1997, and it took more than 11 years for a Petaflop/s performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts have begun to study the hardware and software challenges for building an exascale machine. It is important to understand and meet these challenges in...

HadoopJitter: The Ghost in the Cloud and How to Tame It

Article

Full-text available

Jun 2011

The small performance variation within each node of a cloud computing infrastructure (i.e. cloud) can be a fundamental impediment to scalability of a high-performance application. This performance variation (referred to as jitter) particularly impacts overall performance of scientific workloads running on a cloud. Studies show that the primary sour...

Performance modeling as the key to extreme scale computing

Conference Paper

May 2011

William Gropp

Parallel computing is primarily about achieving greater performance than is possible without using parallelism. Especially for the high-end, where systems cost tens to hundreds of millions of dollars, making the best use of these valuable and scarce systems is important. Yet few applications really understand how well they are performing with respe...

Modeling the performance of an algebraic multigrid cycle on HPC platforms

Conference Paper

May 2011

Now that the performance of individual cores has plateaued, future supercomputers will depend upon increasing parallelism for performance. Processor counts are now in the hundreds of thousands for the largest machines and will soon be in the millions. There is an urgent need to model application performance at these scales and to understand what ch...

LACIO: A new collective I/O strategy for parallel I/O systems

Conference Paper

May 2011

Parallel applications benefit considerably from the rapid advance of processor architectures and the available massive computational capability, but their performance suffers from the large latency of I/O accesses. The poor I/O performance has been identified as a critical cause of the low sustained performance of parallel systems. Collective I/O is...

EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing

Article

Mar 2011

Researchers built the EcoG GPU-based cluster to show that a system can be designed around GPU computing and still be power efficient.

MPI on millions of cores

Article

Full-text available

Mar 2011

Petascale parallel computers with more than a million processing cores are expected to be available in a couple of years. Although MPI is the dominant programming interface today for large-scale systems that at the highest end already have close to 300,000 processors, a challenging question to both researchers and users is whether MPI will scale to...

Optimizing Sparse Data Structures for Matrix-vector Multiply

Article

Feb 2011

Sparse matrix-vector multiply is an important operation in a wide range of problems. One of the key factors determining the performance of this operation is sustained memory bandwidth. In the IBM POWER architecture, there is a hardware component called a prefetch data stream that can significantly increase sustained memory bandwidth. We have develo...

Scalable Memory Use in MPI: A Case Study with

Conference Paper

Jan 2011

One of the factors that can limit the scalability of MPI to exascale is the amount of memory consumed by the MPI implementation. In fact, some researchers believe that existing MPI implementations, if used unchanged, will themselves consume a large fraction of the available system memory at exascale. To investigate and address this issue, we undert...

PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems

Conference Paper

Full-text available

Oct 2010

Parallel programming models on large-scale systems require a scalable system for managing the processes that make up the execution of a parallel program. The process-management system must be able to launch millions of processes quickly when starting a parallel program and must provide mechanisms for the processes to exchange the information needed...

Load Balancing for Regular Meshes on SMPs with MPI

Conference Paper

Full-text available

Sep 2010

Domain decomposition for regular meshes on parallel computers has traditionally been performed by attempting to exactly partition the work among the available processors (now cores). However, these strategies often do not consider the inherent system noise which can hinder MPI application scalability to emerging peta-scale machines with 10000+ node...

Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems

Conference Paper

Sep 2010

With the ever-increasing numbers of cores per node on HPC systems, applications are increasingly using threads to exploit the shared memory within a node, combined with MPI across nodes. Achieving high performance when a large number of concurrent threads make MPI calls is a challenging task for an MPI implementation. We describe the design and imp...

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues

Conference Paper

Full-text available

Sep 2010

Designing and tuning parallel applications with MPI, particularly at large scale, requires understanding the performance implications of different choices of algorithms and implementation options. Which algorithm is better depends in part on the performance of the different possible communication approaches, which in turn can depend on both the sys...

A Scalable MPI_Comm_split Algorithm for Exascale Computing

Conference Paper

Sep 2010

Existing algorithms for creating communicators in MPI programs will not scale well to future exascale supercomputers containing millions of cores. In this work, we present a novel communicator-creation algorithm that does scale well into millions of processes using three techniques: replacing the sorting at the end of MPI_Comm_split with merging as...

Teaching parallel programming: a roundtable discussion

Article

Sep 2010

In this roundtable, three professors of parallel programming share their perspective on teaching and learning the computing technique.

Minimizing MPI Resource Contention in Multithreaded Multicore Environments

Conference Paper

Sep 2010

With the ever-increasing numbers of cores per node in high-performance computing systems, a growing number of applications are using threads to exploit shared memory within a node and MPI across nodes. This hybrid programming model needs efficient support for multithreaded MPI communication. In this paper, we describe the optimization of one aspect...

Keynote

Conference Paper

Jun 2010

William Gropp

Self-Consistent MPI Performance Guidelines

Article

Jun 2010

Message passing using the Message-Passing Interface (MPI) is at present the most widely adopted framework for programming parallel applications for distributed memory and clustered parallel systems. For reasons of (universal) implementability, the MPI standard does not state any specific performance guarantees, but users expect MPI implementations...

An introductory exascale feasibility study for FFTs and multigrid

Conference Paper

Full-text available

May 2010

The coming decade is going to see a push towards exascale computing. Assuming gigahertz cores, this means exascale systems will have between 100 million and 1 billion of them to achieve this level of performance. At this scale, some important questions need to be answered on the applications end. What applications are feasible at this scale? What n...

An adaptive performance modeling tool for GPU architectures

Conference Paper

May 2010

This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers be...

The Importance of Non-Data-Communication Overheads in MPI

Article

Full-text available

Feb 2010

With processor speeds no longer doubling every 18-24 months owing to the exponential increase in power consumption and heat dissipation, modern HEC systems tend to rely less on the performance of single processing units. Instead, they rely on achieving high performance by using the parallelism of a massive number of low-frequency/low-power proces...

A Pipelined Algorithm for Large, Irregular All-Gather Problems

Article

Feb 2010

We describe and evaluate a new pipelined algorithm for large, irregular all-gather problems. In the irregular allgather problem each process in a set of processes contributes individual data of possibly different size, and all processes have to collect all data from all processes. The pipelined algorithm is useful for the implementation of the MPI_...

Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming

Article

Feb 2010

As high-end computing systems continue to grow in scale, recent advances in multi- and many-core architectures have pushed such growth toward more dense architectures, that is, more processing elements per physical node, rather than more physical nodes themselves. Although a large number of scientific applications have relied so far on an MPI-every...

An Adaptive Performance Modeling Tool for GPU Architectures

Conference Paper

Jan 2010

Enabling the Next Generation of Scalable Clusters

Conference Paper

Jan 2010

William Gropp

Data-intensive parallel applications on clouds need to deploy large data sets from the cloud's storage facility to all compute nodes as fast as possible. Many multicast algorithms have been proposed for clusters and grid environments. The most common ...

MPI at Exascale

Article

Full-text available

Jan 2010

With petascale systems already available, researchers are devoting their attention to the issues needed to reach the next major level in performance, namely, exascale. Explicit message passing using the Message Passing Interface (MPI) is the most commonly used model for programming petascale systems today. In this paper, we investigate what is need...

PETSc Users Manual Revision 3.1

Article

Full-text available

Jan 2010

Performance Evaluation and Enhancement of Dendro

Article

Full-text available

Jan 2010

DENDRO is a collection of tools for solving Finite Element problems in parallel. This package is written in C++ using the standard template library (STL) and uses the Message Passing Interface (MPI). Dendro uses an octree data-structure to solve image-registration problems using finite element techniques. For analyzing the behavior of the package in terms of...

Test Suite for Evaluating Performance of Multithreaded MPI Communication

Article

Dec 2009

As parallel systems are commonly being built out of increasingly large multicore chips, application programmers are exploring the use of hybrid programming models combining MPI across nodes and multithreading within a node. Many MPI implementations, however, are just starting to support multithreaded MPI communication, often focussing on correctnes...

Software for Petascale Computing Systems

Article

Nov 2009

William Gropp

Developing software for highly scalable systems with nearly a million processors or cores raises unique challenges. To succeed, application developers must reconsider both their code's structure and the tools they use to develop, tune, and run that code. Petascale systems aren't just bigger versions of the current terascale systems. The degree of c...

On the Need for a Consortium of Capability Centers

Article

Oct 2009

Users of high-performance computing systems face many challenges, particularly as they design and develop their software to run at multiple facilities. This can lead to a “greatest common denominator” strategy that slows innovation and the adoption of newer techniques. In addition, these systems typically push the limits — leading to problems with...

Processing MPI datatypes outside MPI

Conference Paper

Full-text available

Sep 2009

The MPI datatype functionality provides a powerful tool for describing structured memory and file regions in parallel applications, enabling noncontiguous data to be operated on by MPI communication and I/O routines. However, no facilities are provided by the MPI standard to allow users to efficiently manipulate MPI datatypes in their own codes. W...

MPI on a Million Processors

Conference Paper

Full-text available

Sep 2009

Petascale machines with close to a million processors will soon be available. Although MPI is the dominant programming model today, some researchers and users wonder (and perhaps even doubt) whether MPI will scale to such large processor counts. In this paper, we examine this issue of how scalable is MPI. We first examine the MPI specification itse...

MPI at Exascale: Challenges for Data Structures and Algorithms

Conference Paper

Sep 2009

William Gropp

Petascale computing is here and MPI continues to succeed as an effective, scalable programming model, despite previous predictions that MPI could not work at the Petascale. As the high-performance computing community considers Exascale systems, will MPI continue to be suitable? Already, a number of studies have looked at the behavior of MPI as the n...

Hierarchical Collectives in MPICH2

Conference Paper

Sep 2009

Most parallel systems on which MPI is used are now hierarchical: some processors are much closer to others in terms of interconnect performance. One of the most common such examples are systems whose nodes are symmetric multiprocessors (including "multicore" processors). A number of papers have developed algorithms and implementations that exploi...

Investigating High Performance RMA Interfaces for the MPI-3 Standard

Conference Paper

Full-text available

Sep 2009

The MPI-2 Standard, released in 1997, defined an interface for one-sided communication, also known as remote memory access (RMA). It was designed with the goal that it should permit efficient implementations on multiple platforms and networking technologies, and also in heterogeneous environments and non-cache-coherent systems. Nonetheless, even...

Toward message passing for a million processes: Characterizing MPI on a massive scale Blue Gene/P

Article

Full-text available

Sep 2009

Upcoming exascale capable systems are expected to comprise more than a million processing elements. As researchers continue to work toward architecting these systems, it is becoming increasingly clear that these systems will utilize a significant amount of shared hardware between processing units; this includes shared caches, memory and network compo...

PETSc Developers Manual

Article

Full-text available

Feb 2009

Formal methods applied to high-performance computing software design: A case study of MPI one-sided communication-based locking

Article

Dec 2009

There is a growing need to address the complexity of verifying the numerous concurrent protocols employed in the high-performance computing software. Today's approaches for verification consist of testing detailed implementations of these protocols. Unfortunately, this approach can seldom show the absence of bugs, and often results in serious bugs...

Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand

Article

Full-text available

Jan 2009

As high-end computing systems continue to grow in scale, the performance that applications can achieve on such large scale systems depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system. Accordingly, several modern and legacy parallel programming models (such as MPI, UPC, Global Arrays) h...

A Pipelined Algorithm for Large, Irregular Allgather Problems

Article

Jan 2009

Improving the Performance of Tensor Matrix Vector Multiplication in Cumulative Reaction Probability Based Quantum Chemistry Codes

Conference Paper

Full-text available

Dec 2008

Cumulative reaction probability (CRP) calculations provide a viable computational approach to estimate reaction rate coefficients. However, in order to give meaningful results these calculations should be done in many dimensions (ten to fifteen). This makes CRP codes memory intensive. For this reason, these codes use iterative methods to solve the linea...

Communication Analysis of Parallel 3D FFT for Flat Cartesian Meshes on Large Blue Gene Systems

Conference Paper

Full-text available

Dec 2008

Parallel 3D FFT is a commonly used numerical method in scientific computing. P3DFFT is a recently proposed implementation of parallel 3D FFT that is designed to allow scalability to massively large systems such as Blue Gene. While there has been recent work that demonstrates such scalability on regular cartesian meshes (equal length in each dimensi...

Parallel I/O prefetching using MPI file caching and I/O signatures

Conference Paper

Full-text available

Nov 2008

Parallel I/O prefetching is considered to be effective in improving I/O performance. However, the effectiveness depends on determining patterns among future I/O accesses swiftly and fetching data in time, which is difficult to achieve in general. In this study, we propose an I/O signature-based prefetching strategy. The idea is to use a predetermin...

Hiding I/O latency with pre-execution prefetching for parallel applications

Conference Paper

Full-text available

Nov 2008

Parallel applications are usually able to achieve high computational performance but suffer from large latency in I/O accesses. I/O prefetching is an effective solution for masking the latency. Most of existing I/O prefetching techniques, however, are conservative and their effectiveness is limited by low accuracy and coverage. As the processor-I/O...

A Formal Approach to Detect Functionally Irrelevant Barriers in MPI Programs

Conference Paper

Full-text available

Sep 2008

We examine the unsolved problem of automatically and efficiently detecting functionally irrelevant barriers in MPI programs. A functionally irrelevant barrier is a set of MPI_Barrier calls, one per MPI process, such that their removal does not alter the overall MPI communication structure of the program. Static analysis methods are incapable of...

Non-data-communication Overheads in MPI: Analysis on Blue Gene/P

Conference Paper

Full-text available

Sep 2008

Modern HEC systems, such as Blue Gene/P, rely on achieving high-performance by using the parallelism of a massive number of low-frequency/low-power processing cores. This means that the local pre- and post-communication processing required by the MPI stack might not be very fast, owing to the slow processing cores. Similarly, small amounts of ser...

Self-consistent MPI-IO Performance Requirements and Expectations

Conference Paper

Full-text available

Sep 2008

We recently introduced the idea of self-consistent performance requirements for MPI communication. Such requirements provide a means to ensure consistent behavior of an MPI library, thereby ensuring a degree of performance portability by making it unnecessary for a user to perform implementation-dependent optimizations by hand. For the collective...

Implementing Efficient Dynamic Formal Verification Methods for MPI Programs

Conference Paper

Full-text available

Sep 2008

We examine the problem of verifying MPI programs for the absence of deadlocks and local assertion violations through dynamic (runtime) formal verification in which the processes of a given MPI program are executed under the control of an interleaving scheduler. The development of such an algorithm requires several challenges to be overcome in ens...

A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems

Conference Paper

Full-text available

Sep 2008

We present and evaluate a new, simple, pipelined algorithm for large, irregular all-gather problems, useful for the implementation of the MPI_Allgatherv collective operation of MPI. The algorithm can be viewed as an adaptation of a linear ring algorithm for regular all-gather problems for single-ported, clustered multiprocessors to the irregular pr...

Toward Efficient Support for Multithreaded MPI Communication

Conference Paper

Sep 2008

To make the most effective use of parallel machines that are being built out of increasingly large multicore chips, researchers are exploring the use of programming models comprising a mixture of MPI and threads. Such hybrid models require efficient support from an MPI implementation for MPI messages sent from multiple threads simultaneously. In...

MPI and Hybrid Programming Models for Petascale Computing

Conference Paper

Sep 2008

William Gropp

In 2011, the National Center for Supercomputing Applications at the University of Illinois will begin operation of the Blue Waters petascale computing system. This system, funded by the National Science Foundation, will deliver a sustained performance of one to two petaflops for many applications in science and engineering. Blue Waters will support...