Article

MPI: The Complete Reference, 2nd Edition

... Still, as most BSML primitives are higher-order functions, and we need to use functions such as remove_option, a work-around was needed. Our solution is shown in module Abstract (lines [19][20][21][22][23][24]). Instead of writing a concrete implementation of remove_option, we declare a function remove_option without defining it, and we only give its semantics (with an axiom) when the pre-condition is met. ...
... To be able to specify and write BSML programs, we need BSML primitives in WhyML. BSML primitives are implemented in parallel on top of MPI [23], called through OCaml's Foreign Function Interface (FFI). Therefore, we cannot provide BSML in WhyML as an implementation. ...
... The axiomatization of BSML primitives can be found in Figure 5. The semantics of the functions mkpar, apply, proj and put are expressed in their contracts (lines [12][13][14][15][16][17][18][19][20][21][22][23][24]), while the strict positivity condition on bsp_p is given as an axiom on line 4. The type of parallel vector is abstract. ...
Preprint
BSML is a pure functional library for the multi-paradigm language OCaml. BSML embodies the principles of the Bulk Synchronous Parallel (BSP) model, a model of scalable parallel computing. We propose a formalization of BSML primitives with WhyML, the specification language of Why3 and specify and prove the correctness of most of the BSML standard library. Finally, we develop and verify the correctness of a small BSML application.
... The existing code, from which the models presented in this section are derived, is written in Fortran 90 and uses the Message Passing Interface (MPI) standard [56] for expressing parallelism. The rise of the abstraction level when (retro)modeling was performed could only be achieved with participation from mathematicians who passed on their knowledge during more or less formalized discussions. ...
... However, none of these languages are capable of exploiting massively parallel architectures directly, and they must rely on external components. The most widely used of these complementary solutions is certainly MPI (Message Passing Interface) [56] which is a standardized message-passing system. It is mainly used to exploit distributed memory architectures. ...
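The message-passing style this excerpt refers to can be illustrated with a minimal sketch (not taken from any of the cited works; the rank count, tag and payload are arbitrary):

```c
/* Minimal MPI point-to-point sketch: two ranks exchange one message over
 * distributed memory. Compile with an MPI wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        int value = 42;
        if (rank == 0) {
            /* Rank 0 sends a single integer to rank 1. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Rank 1 receives it; memory is distributed, so data moves only
             * through explicit messages. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```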
Article
Full-text available
This paper reports on a four-year project that aims to raise the abstraction level through the use of model-driven engineering (MDE) techniques in the development of scientific applications relying on high-performance computing. The development and maintenance of high-performance scientific computing software is reputedly a complex task. This complexity results from the frequent evolutions of supercomputers and the tight coupling between software and hardware aspects. Moreover, current parallel programming approaches result in a mixing of concerns within the source code. Our approach relies on the use of MDE and consists in defining domain-specific modeling languages targeting various domain experts involved in the development of HPC applications, allowing each of them to handle their dedicated model in a both user-friendly and hardware-independent way. The different concerns are separated thanks to the use of several models as well as several modeling viewpoints on these models. Depending on the targeted execution platforms, these abstract models are translated into executable implementations by means of model transformations. To make all of these effective, we have developed a tool chain that is also presented in this paper. The approach is assessed through a multi-dimensional validation that focuses on its applicability, its expressiveness and its efficiency. To capitalize on the gained experience, we analyze some lessons learned during this project.
... This thesis is about using the Message Passing Interface (MPI) [45] in a P2P computing framework called F2F Computing [43]. ...
... Its goal was to make a usable standard for distributed and parallel systems, taking ideas from existing systems, but not selecting one of them for standardization. [45] The process of standardization began in a workshop in Williamsburg, Virginia at the end of April 1992. After just one year of work, the first draft of MPI-1 was finished in 1993. ...
... This effort was successful, because the first draft had the main features, and the first official version of MPI-1 was presented in June 1994. [45] In 1995 an updated version of MPI was released, MPI-1.1. The changes from version 1.0 are minor. ...
... In our setting, parallel-in-time integration is considered as a possibility for additional fine-grain parallelism on top of an existing coarse-grain spatial decomposition. In a preliminary phase, we have decided to simulate the parallelization in time, whereas the parallelization in space is truly implemented on a distributed memory system using the Message Passing Interface (MPI) [19]. This allows us to predict at a very moderate cost if the time parallelization can be relevant in our study. ...
... In a first phase, we have decided to simulate the parallelization in time, whereas the parallelization in space is truly implemented on a distributed memory system. In Hybrid, the computation of the right-hand side of the Navier-Stokes equations involves 39 gradient evaluations (3 for (14), 18 for (15), 6 for (16), 9 for (18) and 3 for (19), respectively). Since the RK4 time integration method requires 4 evaluations of right-hand sides, the total number of gradient evaluations for one time step of the coarse solver is then 156. ...
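As a worked check of the count quoted in the excerpt above (using only the per-equation breakdown it gives):

```latex
% Gradient evaluations per right-hand side, then per RK4 time step.
\begin{align*}
  \text{gradients per RHS}      &= 3 + 18 + 6 + 9 + 3 = 39,\\
  \text{gradients per RK4 step} &= 4 \times 39 = 156.
\end{align*}
```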
Article
Full-text available
Direct Numerical Simulation of turbulent flows is a computationally demanding problem that requires efficient parallel algorithms. We investigate the applicability of the time-parallel Parareal algorithm to an instructional case study related to the simulation of the decay of homogeneous isotropic turbulence in three dimensions. We combine a Parareal variant based on explicit time integrators and spatial coarsening with the space-parallel Hybrid Navier–Stokes solver. We analyse the performance of this space–time parallel solver with respect to speedup and quality of the solution. The results are compared with reference data obtained with a classical explicit integration, using an error analysis which relies on the energetic content of the solution. We show that a single Parareal iteration is able to reproduce with high fidelity the main statistical quantities characterizing the turbulent flow field.
... The most popular message passing technology is the Message Passing Interface (MPI) [4] [258], a message passing library for C and Fortran. MPI is an industry standard and is implemented on a wide range of parallel computers. ...
... For performance reasons, a BSP library can send messages during the computation phase of a super-step, but this is hidden from programmers. BSP libraries are generally implemented using MPI [258] or low-level routines of the given specific architectures. ...
Article
This thesis takes part in the formal verification of parallel programs. The aim of formal verification is to ensure that a program will run as it should, without making mistakes, blocking, or terminating abnormally. This is even more important in the parallel computation field, where the cost of calculations can be very high. The BSP model (Bulk Synchronous Parallelism) is a model of parallelism well suited for the use of formal methods. It guarantees a structure in the parallel program, by organising it into super-steps, each of them consisting of a phase of computations, then communications between the processes. In this thesis, we chose to extend an existing tool to adapt it for the proof of BSP programs. We based ourselves on Why, a VCG (verification condition generator) that has the advantage of being able to interface with several automatic provers and proof assistants to discharge the proof obligations. There are multiple contributions in this thesis. In a first part, we present a comparison of the existing BSP libraries, in order to show the most used BSP primitives, which are the most interesting to formalise. We then present BSP-Why, our tool for the proof of BSP programs. This tool generates a sequential program to simulate the input parallel program, thus allowing the use of Why and the numerous associated provers to prove the proof obligations. We then show how BSP-Why can be used to prove the correctness of some basic BSP algorithms, and also of a more complex example: the generation of the state-space (model-checking) of systems, especially for security protocols. Finally, in order to ensure the greatest confidence in the BSP-Why tool, we give a formalisation of the language semantics in the Coq proof assistant. We also prove the correctness of the transformation used to go from a parallel program to a sequential program.
... The target platform for our experimental study is a cluster of 16 [...]. MPI is software for message passing, proposed as a standard by a broad committee of vendors, implementers, and users [39]. MPI is portable and flexible software that is widely used. ...
... In our experiments we implemented PD1 and PD2 in the C programming language using the MPI library [35,39]. These two implementations were tested using vectors of orders ranging from 512 up to 2048 with block sizes of 2, 4 and 8 elements. ...
Article
Full-text available
This paper is focused on designing two parallel dot product implementations for heterogeneous master-worker platforms. These implementations are based on data allocation and dynamic load balancing strategies. The first implementation is the dynamic master-worker with allocation of vectors, where the master distributes vectors (data) and computations to the workers, whereas the second one is the dynamic master-worker with allocation of vector pointers, where the vectors are supposed to be replicated among participating resources beforehand and the master distributes computations to the workers. We also report a set of numerical experiments on a heterogeneous platform where computational resources have different computing powers and the workers are connected to the master by links of the same capacity. The obtained experimental results demonstrate that the dynamic allocation of vector pointers achieves better performance than the original implementation for computing the dot product. The paper also presents and verifies an accurate timing model to predict the performance of the proposed implementations on clusters of heterogeneous workstations. Through this model the viability of the proposed implementations can be assessed without the extra effort that would be needed to carry out real testing.
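A hedged sketch of a dynamic master-worker dot product in the spirit of the implementations described above; this is not the authors' PD1/PD2 code, and the vector size, block size and message tags are illustrative assumptions:

```c
/* Dynamic master-worker dot product sketch: the master hands out block start
 * indices on demand, workers return partial sums. Vectors are assumed to be
 * replicated on every process, as in the "vector pointers" variant. */
#include <mpi.h>
#include <stdio.h>

#define N        2048
#define BLOCK    8
#define TAG_WORK 1
#define TAG_DONE 2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    if (rank == 0) {                          /* master */
        double total = 0.0, partial;
        int next = 0, active = size - 1;
        MPI_Status st;
        while (active > 0) {
            /* Any incoming message is both a partial result and a work request. */
            MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += partial;
            if (next < N) {                   /* hand out the next block */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next += BLOCK;
            } else {                          /* no work left: release the worker */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("dot = %f\n", total);
    } else {                                  /* worker */
        double partial = 0.0;                 /* first send is just a request */
        int start;
        MPI_Status st;
        for (;;) {
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&start, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            partial = 0.0;
            for (int i = start; i < start + BLOCK && i < N; i++)
                partial += x[i] * y[i];
        }
    }
    MPI_Finalize();
    return 0;
}
```

Because workers pull blocks on demand, a faster worker simply processes more blocks, which is the load-balancing effect the abstract describes.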
... Thus we will need encapsulated communications between different architectures and subset synchronization [35]. Version 0.2 of the BSML library is hence based on MPI [37]. It also has a smaller number of primitives, which are closer to the BSλ-calculus than the primitives of version 0.1. ...
... This improves on the earlier design DPML/Caml Flight [16,12] in which the global parallel control structure sync had to be prevented dynamically from nesting. This is very different from SPMD programming (Single Program Multiple Data) where the programmer must use a sequential language and a communication library (like MPI [37]). A parallel program is then the multiple copies of a sequential program, which exchange messages using the communication library. ...
... The Message Passing Interface (MPI) library [154][155][156] was used to distribute ROIs across computing nodes at the coarse scale. At the finer scale, ROIs were delegated to Open Multi-Processing (OpenMP) [102] threads. ...
Preprint
Volumetric crystal structure indexing and orientation mapping are key data processing steps for virtually any quantitative study of spatial correlations between the local chemistry and the microstructure of a material. For electron and X-ray diffraction methods it is possible to develop indexing tools which compare measured and analytically computed patterns to decode the structure and relative orientation within local regions of interest. Consequently, a number of numerically efficient and automated software tools exist to solve the above characterisation tasks. For atom probe tomography (APT) experiments, however, the strategy of making comparisons between measured and analytically computed patterns is less robust because many APT datasets may contain substantial noise. Given that general enough predictive models for such noise remain elusive, crystallography tools for APT face several limitations: their robustness to noise, and therefore their capability to identify and distinguish different crystal structures and orientations, is limited. In addition, the tools are sequential and demand substantial manual interaction. In combination, this makes robust uncertainty quantification in automated high-throughput studies of the latent crystallographic information a difficult task with APT data. To improve the situation, we review the existing methods and discuss how they link to those in the diffraction communities. With this we modify some of the APT methods to yield more robust descriptors of the atomic arrangement. We report how this enables the development of an open-source software tool for strong-scaling, automated identification of crystal structure and mapping of crystal orientation in nanocrystalline APT datasets with multiple phases.
... Hence, we used the Open Multi-Processing (OpenMP) framework, which is a parallel programming framework dedicated to systems with shared memory [6]. According to a comparative study, OpenMP has shown better results in terms of scalability than the Message Passing Interface (MPI) [22] and MapReduce [8]. Moreover, the authors of [15] concluded that the latter is a good option when the problem requires intensive computation and the amount of data is moderate. ...
... Moreover, reliability mechanisms are greatly hampered by the shared state: for example, a lock becomes permanently unavailable if the thread holding it fails. The High Performance Computing (HPC) community builds large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [9]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately MPI is not suitable for producing general purpose concurrent software as it is too low level, with explicit, synchronous message passing. ...
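The MPI+OpenMP combination mentioned in this excerpt typically follows the pattern below; a minimal sketch, not code from the cited work, with a placeholder computation:

```c
/* Hybrid sketch: OpenMP threads reduce locally inside each MPI process,
 * then MPI_Allreduce combines the per-process results across nodes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* FUNNELED is sufficient here: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000000;
    double local = 0.0;

    /* Shared-memory parallelism inside one process. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / n;                     /* each rank contributes ~1.0 */

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (= number of MPI ranks)\n", global);

    MPI_Finalize();
    return 0;
}
```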
Article
Full-text available
Large scale servers with hundreds of hosts and tens of thousands of cores are becoming common. To exploit these platforms software must be both scalable and reliable, and distributed actor languages like Erlang are a proven technology in this area. While distributed Erlang conceptually supports the engineering of large scale reliable systems, in practice it has some scalability limits that force developers to depart from the standard language mechanisms at scale. In earlier work we have explored these scalability limitations, and addressed them by providing a Scalable Distributed (SD) Erlang library that partitions the network of Erlang Virtual Machines (VMs) into scalable groups (s groups). This paper presents the first systematic evaluation of SD Erlang s groups and associated tools, and how they can be used. We present a comprehensive evaluation of the scalability and reliability of SD Erlang using three typical benchmarks and a case study. We demonstrate that s groups improve the scalability of reliable and unreliable Erlang applications on up to 256 hosts (6144 cores). We show that SD Erlang preserves the class-leading distributed Erlang reliability model, but scales far better than the standard model. We present a novel, systematic, and tool-supported approach for refactoring distributed Erlang applications into SD Erlang. We outline the new and improved monitoring, debugging and deployment tools for large scale SD Erlang applications. We demonstrate the scaling characteristics of key tools on systems comprising up to 10K Erlang VMs.
... Many applications running on HPC systems use message passing to take advantage of the huge amount of resources available in those centers. To efficiently develop these parallel applications, the preferable framework has been the well-known Message Passing Interface (MPI) [23]. There exist several approaches that model the communication generated by MPI applications, and depending on how these communications are used at simulation time, two types of simulation can be distinguished: on-line simulation [19,21,30] and off-line simulation [5,16,18,25]. ...
Article
Full-text available
Simulation is often used in order to evaluate the behavior and the performance of computing systems. Specifically, in the field of high-performance interconnection networks for HPC clusters the simulation has been extensively considered to verify and validate network operation models and to evaluate their performance. Nevertheless, experiments conducted to evaluate network performance using simulation tools should be fed with realistic network traffic from real benchmarks and/or applications. This approach has grown in popularity because it allows to evaluate the simulation model under realistic traffic situations. In this paper, we propose a family of tools for modeling realistic workloads which capture the behavior of MPI applications into self-related traces called VEF traces. The main novelty of this approach is that it replays the MPI collective operations with their corresponding messages, offering an MPI message-based task simulation framework. The proposed framework neither provides a network simulator nor depends on any specific simulation platform. Besides, this framework allows us to use the generated traces by any third-party network simulator working at message level.
... This level of parallelization is implemented via the Message Passing Interface (MPI) mechanism (Gropp et al., 2000). The initial data is broadcast to the workers. ...
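The broadcast step mentioned here can be sketched as follows (a minimal illustration, not the cited simulation code; the parameter array is a placeholder):

```c
/* Rank 0 holds the initial data; MPI_Bcast copies it to every worker. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double params[4] = {0};
    if (rank == 0) {                  /* root prepares the initial data */
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0; params[3] = 4.0;
    }
    /* After this call every rank holds the same four values. */
    MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d: params[3] = %f\n", rank, params[3]);
    MPI_Finalize();
    return 0;
}
```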
Chapter
This chapter describes an implementation of an online dynamic security assessment system based on time-domain simulation, in which dynamic models are the same as those used offline for planning studies with all necessary details. It contains a review of the adopted methods and algorithms (power flow, continuation power flow, time-domain simulation, energy functions, single machine equivalent, and Prony spectral decomposition) focusing on the main issues related to their numerical and computational performance. The utilization of these methods to execute complex security tasks such as dynamic contingency analysis and security region computations is described. Aspects of high-performance computation (fine- and coarse-grain parallelization) are discussed. Some practical results obtained online and comparisons of online with offline planning cases are shown.
... One possible solution is to return a specific return value, with a different value for each possible error scenario. This is the solution adopted in C, for example, with the integer return value of the main function, or with the return value of certain MPI functions [103]
Article
Parallel architectures have now reached every computing device, but software developers generally lack the skills to program them through explicit models such as MPI or the Pthreads. There is a need for more abstract models such as algorithmic skeletons, which are a structured approach. They can be viewed as higher-order functions that represent the behaviour of common parallel algorithms, and these are combined by the programmer to generate parallel programs. Programmers want to obtain better performance through the use of parallelism, but the development time involved is also an important factor. Algorithmic skeletons provide interesting results in both those respects. The Orléans Skeleton Library, or OSL, provides a set of algorithmic skeletons for data parallelism within the bulk synchronous parallel model for the C++ language. It uses advanced metaprogramming techniques to obtain good performance. We improved OSL in order to obtain better performance from its generated programs, and extended its expressivity. We wanted to analyse the ratio between the performance of programs and the development effort needed within OSL and other parallel programming models. The comparison between parallel programs written within OSL and their equivalents in low-level parallel models shows a better productivity for high-level models: they are easy to use for the programmers while providing decent performance.
... days or weeks). The jobs are tightly coupled, and use the message-passing interface (MPI) programming model [3] for communication and synchronization among all the processors. Although it is necessary to support HPC applications that demand the computing capacity of an exascale machine, it is also important to enable ensemble runs of applications that have uncertainty in high-dimension parameter space. ...
Conference Paper
Full-text available
One way to efficiently utilize the coming exascale machines is to support a mixture of applications in various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). Delivering high performance in resource allocation, scheduling and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager directly extended from the Slurm centralized production system. Slurm++ employs multiple controllers, with each one managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs, and we expect the performance gap to grow as the job sizes and system scales increase in future high-end computing systems.
... State-of-the-art parallel FFTs such as the FFTW-MPI [33] are designed according to the computational scheme shown in Fig. 1(a). The FFTW-MPI is the parallel version of the FFTW, designed for running on parallel distributed systems using the Message Passing Interface (MPI) library [36]. Nevertheless, the astonishing performance of the FFTW for single processor machines is unfortunately not found in the one-dimensional FFTW-MPI designed for distributed memory systems. ...
Article
An efficient parallel implementation of a nonparaxial beam propagation method for the numerical study of the nonlinear Helmholtz equation is presented. Our solution focuses on minimizing the communication and computational demands of the method, which depend on a nonparaxiality parameter. Performance tests carried out on different types of parallel systems behave according to theoretical predictions and show that our proposal exhibits better behaviour than solutions based on the use of conventional parallel fast Fourier transform implementations. The application of our design is illustrated in a particularly demanding scenario: the study of dark solitons at interfaces separating two defocusing Kerr media, where it is shown to play a key role.
... An interpolation algorithm has been used to reduce time and space cost for massive data. All of these techniques are parallelized to process large data on multiple compute nodes, using MapReduce [16], iterative MapReduce [17] and/or MPI [18] frameworks. We improved the parallel efficiency of DACIDR by developing a hybrid workflow model on high performance computers (HPC) [19]. ...
Conference Paper
Full-text available
The recent advance in next generation sequencing (NGS) techniques has enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing of individual microbial species in the lab. One can profile the marker genes of 16S rRNA encoded in the sample through the amplification of highly variable regions in the genes and sequencing of them using Roche/454 sequencers to generate half a million to a few million 16S rRNA fragments of about 400 base pairs. The main computational challenge of analyzing such data is to group these sequences into operational taxonomic units (OTUs). Common clustering algorithms (such as hierarchical clustering) require quadratic space and time complexity, which makes them unsuitable for large datasets with millions of sequences. An alternative is to use greedy heuristic clustering methods (such as CD-HIT and UCLUST); although these enable fast sequence analysis, the hard-cutoff similarity threshold set for them and the random starting seeds can result in reduced accuracy and overestimation (too many clusters). In this paper, we propose DACIDR: a parallel sequence clustering and visualization pipeline, which can address the overestimation problem along with space and time complexity issues as well as giving robust results. The pipeline starts with a parallel pairwise sequence alignment analysis followed by a deterministic annealing method for both clustering and dimension reduction. No explicit similarity threshold is needed during clustering. Experiments with our system also proved that the quadratic time and space complexity issue could be solved with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirements. Furthermore, SSP-Tree can enhance the speed of fine-tuning on the existing result, which makes recursive clustering possible to achieve accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods.
... The developed applications should work efficiently on multicore and multiprocessor systems. An implementation of MPI (message passing interface) is the traditional means of creating such applications [13]. MPI provides a highly flexible ability to develop applications that address specific requirements of the algorithm structure and the used computing infrastructure. ...
Article
Full-text available
Design and analysis of complex nanophotonic and nanoelectronic structures require significant computing resources. Cloud computing infrastructure allows distributed parallel applications to achieve greater scalability and fault tolerance. The problems of effective use of high-performance computing systems for modeling and simulation of subwavelength diffraction gratings are considered. Rigorous coupled-wave analysis (RCWA) is adapted to cloud computing environment. In order to accomplish this, data flow of the RCWA is analyzed and CPU-intensive operations are converted to data-intensive operations. The generated data sets are structured in accordance with the requirements of MapReduce technology.
... Each such type of machine instance is typically suited to a particular type of parallelization paradigm. Virtualized setups mimicking desktop computing configurations are available with a range of core count and memory and are suitably used for multicore or shared memory parallelization using OpenMP [3]. Cluster compute instances, which are specially-hosted on machines in close proximity and interconnected with high bandwidth cables to reduce latency, are suited to parallelization using the Message Passing Interface (MPI) protocol [4]. ...
Article
Full-text available
Cloud computing is a potential paradigm-shifter for system-level electronic design automation tools for chip-package-board design. However, exploiting the true power of on-demand scalable computing is as yet an unmet challenge. We examine electromagnetic (EM) field simulation on cloud platforms.
... Because of the popularity of linking independent PCs in Beowulf-type clusters, message passing has become the dominant style of parallel computing. Gropp and Lusk [2], Gropp et al. [3], and Snir et al. [4] are excellent references for parallel computation using message passing. Dongarra et al. [5] and Foster [6] are useful references for general parallel programming. ...
... For multithreaded programming [1], low-level libraries such as pthreads [2] and Windows Threads [3] and high-level libraries like OpenMP [4] are the most popular options available for writing parallel software. Additionally, communication mechanisms and APIs such as MPI [5], PVM [6], and CRL [7] are widely used in distributed application development, offering a variety of primitives for point-to-point and collective operations, as well as process control, startup and shutdown. ...
Article
Full-text available
Current high-performance application development is increasing due to breakthrough advances in microprocessor and power management technologies, and in network speed and reliability. In this context, distributed parallel applications make use of message-passing interface and multithreaded programming libraries. Nevertheless, drawbacks in message-passing implementations limit the use of thread-safe network communication. This paper presents a thread-safe message-passing interface based on the MPI Standard, assuring correct message ordering and sender/receiver synchronization.
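A minimal sketch of the usual starting point for such a thread-safe layer, assuming nothing about the paper's actual interface: request full multithreading support from MPI and check what the library actually grants.

```c
/* Request MPI_THREAD_MULTIPLE and verify the provided thread level. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* The implementation will not let several threads call MPI
         * concurrently; a wrapper library must serialize the calls itself. */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("only thread level %d provided; falling back to serialized MPI calls\n",
                   provided);
    }

    MPI_Finalize();
    return 0;
}
```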
... MPICH-G2) with MPICH-2.1.0 and MPICH-1.2.7, which are cluster implementations of MPI [8]. To compare the performance of the different MPI implementations, jobs are submitted to the HPC cluster. ...
Article
Full-text available
This paper presents the application of a parallel high accuracy simulation code for incompressible Navier-Stokes solution with free surface flow around an ACV (Air Cushion Vehicle) on the ShirazUCFD Grid environment. The parallel finite volume code is developed for an incompressible Navier-Stokes solver on a general curvilinear coordinate system for modeling free surface flows. A single set of dimensionless equations is derived to handle both liquid and air phases in viscous incompressible free surface flow in general curvilinear coordinates. The volume of fluid (VOF) method with Lagrangian propagation in the computational domain is implemented for modeling the free surface flow. The parallelization approach uses a domain decomposition method for the subdivision of the numerical grid, the SPMD program model, and MPICH-G2 as the message passing environment to obtain a portable application.
... The structure of MPI is very well-known and hence we simply refer the reader to one of the books on MPI [3] [29]. For MPI-2 extensions, not considered herein, refer to [30]. ...
Article
A Linux cluster with Gigabit Ethernet interconnect is a local and accessible resource for solving scientific problems, including finite-element simulations. As general-purpose protocol stacks are not designed for parallel computing, the delivered throughput and latency may be significantly below that suggested by the hardware specification of the interconnect. The paper evaluates communication software across the protocol stack from high-level communication harnesses, such as MPI, Charm++ and MOSIX, through intermediate communication primitives, such as sockets, streams, and pipes, to underlying protocols such as TCP and TIPC, as well as low-level User-level Network Interfaces such as GAMMA. The evaluation is not only in terms of short message latency and throughput but in terms of the efficiency of the solution, that is the throughput per computation overhead. There are a number of findings, such as the relative efficiency of TIPC with Open MPI; confirmation of the advantage of GAMMA with MPI; the weaker performance of the single system image software under test; and the enhancement from adding application-level load-balancing to the system-level load-balancing available in some communication harnesses.
... C is the implementation language used by the High Performance Linpack code. A reference implementation of a scalable version of the Linpack benchmark [20, 22] requires an implementation of the MPI library [19, 21] for communicating between the processors of a parallel computer. The linker and the runtime of the C programming language are usually used by all other languages, and C is often used as the compilation target for many of the languages used in this study. ...
Article
Full-text available
The research team explores a rich feature set, large algorithmic variety, and detailed implementation considerations for one of the most fundamental computational kernels of computational science: LU factorization of a dense matrix by Gaussian elimination with partial pivoting. For the target implementation platforms and systems, they analyze and compare established shared and distributed memory environments as well as relatively new Partitioned Global Address Space programming languages, including those coming from the High Productivity Computing Systems (HPCS) project. To give quantitative measures for each hardware platform, combined with implementation characteristics, they compare scalability and raw and relative performance, as well as source code features, functionality, and absolute size breakdown as measured by Source Lines of Code (SLOC).
... Zhuang and Zhu (1994) demonstrated parallelization in a reservoir simulator using MPI. Beyond the basic send and receive capabilities, some MPI features, such as communicators, topologies, communication modes and single-call collective operations (Snir et al. 1998, Gropp et al. 1999), were employed to parallelize the serial-MARS algorithm. ...
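A hedged illustration of the MPI features named in this excerpt (communicators, topologies, single-call collectives); this is not the serial-MARS parallelization itself, just a minimal ring topology combined with a collective reduction:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cartesian topology: arrange all ranks in one periodic ring,
     * which yields a new communicator. */
    int dims[1] = {size}, periods[1] = {1};
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring);

    int left, right;
    MPI_Cart_shift(ring, 0, 1, &left, &right);   /* neighbours in the ring */

    /* Single-call collective: sum one value contributed by every rank. */
    int local = rank, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, ring);

    if (rank == 0)
        printf("sum of ranks = %d (left=%d, right=%d)\n", total, left, right);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```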
Article
Full-text available
In this paper, a parallelized version of multivariate adaptive regression splines (MARS, Friedman 1991) is developed and utilized within a decision-making framework (DMF) based on an OA/MARS continuous-state stochastic dynamic programming (SDP) method (Chen et al. 1999). The DMF is used to evaluate current and emerging technologies for the multi-level liquid line of a wastewater treatment system, involving up to eleven levels of treatment and monitoring ten pollutants moving through the levels of treatment. At each level, one technology unit is selected out of a set of options which includes the empty unit. The parallel-MARS algorithm provides the computational efficiency needed to solve this ten-dimensional SDP problem using a new solution method which employs orthogonal array-based Latin hypercube designs and a much higher number of eligible knots.
... For good reasons MPI (the Message Passing Interface) [4,9] comes without a performance model and, apart from some "advice to implementers," without any requirements or recommendations as to what a good implementation should satisfy regarding performance. The main reasons are, of course, that the implementability of the MPI standard should not be restricted to systems with specific interconnect capabilities and that implementers should be given maximum freedom in how to realize the various MPI constructs. ...
Conference Paper
Full-text available
The MPI Standard does not make any performance guarantees, but users expect (and like) MPI implementations to deliver good performance. A common-sense expectation of performance is that an MPI function should perform no worse than a combination of other MPI functions that can implement the same functionality. In this paper, we formulate some performance requirements and conditions that good MPI implementations can be expected to fulfill by relating aspects of the MPI standard to each other. Such a performance formulation could be used by benchmarks and tools, such as SKaMPI and Perfbase, to automatically verify whether a given MPI implementation fulfills basic performance requirements. We present examples where some of these requirements are not satisfied, demonstrating that there remains room for improvement in MPI implementations.
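One self-consistency rule of the kind the abstract describes is that MPI_Allreduce should perform no worse than the semantically equivalent MPI_Reduce followed by MPI_Bcast. A deliberately simplistic timing sketch of that comparison (not the SKaMPI/Perfbase tooling mentioned above) could look like this:

```c
/* Time MPI_Allreduce against the equivalent Reduce+Bcast combination. */
#include <mpi.h>
#include <stdio.h>

#define N    1024
#define REPS 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++)
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t_allreduce = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        MPI_Reduce(in, out, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Bcast(out, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }
    double t_composed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("Allreduce: %.6f s   Reduce+Bcast: %.6f s\n",
               t_allreduce, t_composed);

    MPI_Finalize();
    return 0;
}
```

If the composed version were consistently faster, the implementation would violate the expectation formulated in the paper.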
Article
Full-text available
Gene expression programming (GEP) is a data-driven evolutionary technique that is well suited to correlation mining of system components. With the rapid development of Industry 4.0, the number of components in a complex industrial system has increased significantly, with a high complexity of correlations. As a result, a major challenge in employing GEP to solve system engineering problems lies in the computational efficiency of the evolution process. To address this challenge, this paper presents EGEP, an event-tracker-enhanced GEP, which filters irrelevant system components to ensure the evolution process converges quickly. Furthermore, we introduce three theorems to mathematically validate the effectiveness of EGEP based on a GEP schema theory. Experimental results also confirm that EGEP outperforms GEP with a shorter computation time in evolution.
Thesis
Full-text available
http://www.mathematik.uni-leipzig.de/Media/DissAbstracts/abstract.khaleghabadi.pdf https://ul.qucosa.de/api/qucosa%3A11798/attachment/ATT-0/
Experiment Findings
Full-text available
A load balancing algorithm that distributes the load evenly across the cluster increases the speed of a high-performance clustered system, thanks to the parallel computation capabilities of its compute nodes. The most attractive feature of a load balancing algorithm is the redistribution of load from heavily loaded compute nodes to lightly loaded ones during execution, which is called process migration. Process migration time can be reduced with the new method in this algorithm; hence, some policies must be considered when making load-transfer decisions as well as during process migration. Constructing a dynamic load balancing algorithm requires parallel programming with MPI. Parallel programming on the cluster can be carried out using the message passing interface (MPI) or an application programming interface (API); this paper uses only the MPI library to build the new load balancing algorithm. Because the workload of a cluster system is highly variable, balancing the load across its compute nodes is increasingly difficult. This paper proposes a new approach to an existing dynamic load balancing algorithm, implemented on a Rock cluster, which gives better performance most of the time. The algorithm uses demand-driven load collection methods, so its speed increases. The paper also compares the previous dynamic load balancing algorithm with the new one and reports the performance of the new algorithm.
Article
We describe a general framework for adding the values of two approximate counters to produce a new approximate counter value whose expected estimated value is equal to the sum of the expected estimated values of the given approximate counters. (To the best of our knowledge, this is the first published description of any algorithm for adding two approximate counters.) We then work out implementation details for five different kinds of approximate counter and provide optimized pseudocode. For three of them, we present proofs that the variance of a counter value produced by adding two counter values in this way is bounded, and in fact is no worse, or not much worse, than the variance of the value of a single counter to which the same total number of increment operations have been applied. Addition of approximate counters is useful in massively parallel divide-and-conquer algorithms that use a distributed representation for large arrays of counters. We describe two machine-learning algorithms for topic modeling that use millions of integer counters and confirm that replacing the integer counters with approximate counters is effective, speeding up a GPU-based implementation by over 65% and a CPU-based implementation by nearly 50%, as well as reducing memory requirements, without degrading their statistical effectiveness.
Article
Full-text available
Fast execution of applications is achieved through parallel execution of processes. This is easily achieved by a high-performance cluster (HPC) through concurrent processing on its compute nodes. The HPC cluster provides supercomputing power by executing a dynamic load balancing algorithm on its compute nodes. The main objective of a dynamic load balancing algorithm is to distribute the workload evenly among the compute nodes to increase the overall efficiency of the clustered system. The logic of a dynamic load balancing algorithm requires parallel programming, which can be achieved on the HPC cluster through the message passing interface in C. The MPI library plays a very important role in building the new load balancing algorithm. The workload on an HPC cluster system can be highly variable, increasing the difficulty of balancing the load across its compute nodes. This paper proposes a new variant of an existing dynamic load balancing algorithm that mixes centralized and decentralized approaches; it is implemented on a Rock cluster and gives better performance most of the time. The paper also compares the previous dynamic load balancing algorithm with the new one.
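A hedged sketch of the load-collection step such an algorithm relies on; this is not the paper's algorithm, and the per-node load values are faked for illustration:

```c
/* Demand-driven load collection: the coordinator gathers per-node load only
 * when a transfer decision is needed and pairs the most and least loaded nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each compute node reports its current load (here: a fake queue length). */
    int my_load = (rank * 7 + 3) % 20;

    int *loads = NULL;
    if (rank == 0) loads = malloc(size * sizeof(int));
    MPI_Gather(&my_load, 1, MPI_INT, loads, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int hi = 0, lo = 0;
        for (int i = 1; i < size; i++) {
            if (loads[i] > loads[hi]) hi = i;
            if (loads[i] < loads[lo]) lo = i;
        }
        /* A real algorithm would now trigger process migration from hi to lo. */
        printf("move work from rank %d (load %d) to rank %d (load %d)\n",
               hi, loads[hi], lo, loads[lo]);
        free(loads);
    }

    MPI_Finalize();
    return 0;
}
```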
Conference Paper
HPC systems are widely used for accelerating calculation-intensive irregular applications, e.g., molecular dynamics (MD) simulations, astrophysics applications, and irregular grid applications. As the scalability and complexity of current HPC systems keep growing, it is difficult to parallelize these applications in an efficient fashion due to irregular communication patterns, load imbalance issues, dynamic characteristics, and more. This paper presents a fine granular programming scheme, with which programmers are able to implement parallel scientific applications in a fine granular and SPMD (single program multiple data) fashion. Different from current programming models that start from the global data structure, this programming scheme provides a high-level and object-oriented programming interface that supports writing applications by focusing on the finest granular elements and their interactions. Its implementation framework takes care of the implementation details, e.g., the data partition, automatic EP aggregation, memory management, and data communication. The experimental results on SuperMUC show that the OOP implementations of multi-body and irregular applications have little overhead compared to the manual implementations using C++ with OpenMP or MPI. However, the scheme improves programming productivity in terms of the source code size, the coding method, and the implementation difficulty.
Chapter
In 1991, CERFACS was commissioned by the French climate modelling community to perform the technical assembling of an ocean General Circulation Model (GCM), Océan Parallélisé (OPA) developed by the Laboratoire d’Océanographie Dynamique et de Climatologie (LODYC), and two different atmospheric GCMs, Action de Recherche Petite Echelle Grande Echelle (ARPEGE) and the Laboratoire de Météorologie Dynamique zoom (LMDz) model developed respectively by Météo-France and the Laboratoire de Météorologie Dynamique (LMD).
Chapter
For the distributed-memory models we again distinguish, as with the shared-memory models, between concurrent and cooperative models. The cooperative models are subdivided into message-based models, in which one process can send a message (send) to another process, which then receives it (receive), the invocation of a service being packed into a message; and remote invocations, in which the service is called directly. The service can be a procedure, a method of an object, a method specified by its interface and thus a method of a component, or a service.
Article
In this paper we present a generic framework for parallel branch and bound algorithms which has these features. We discuss and motivate the choices we have made, and present test results from an object-oriented implementation.
Article
This paper discusses an efficient implementation of finite difference method (FDM) and finite element method (FEM) computations for a partial differential equation (the Poisson equation) on a message passing cluster with Intel Xeon Phi coprocessors [6,15]. We have performed computations on PARAM YUVA-II [9], which is a message passing cluster with Xeon multi-core processors and Xeon Phi coprocessors as compute nodes [6,15,17-19]. A combination of OpenMP [4] and MPI [5,19,20] is used for structured grid FDM computations. Unstructured triangular and hexahedral meshes and the graph partitioning software METIS [10] are used in the FEM computations. The Jacobi iterative method is used to solve the resulting system of linear equations. A detailed performance analysis of optimizations on the Xeon Phi coprocessor using the OpenMP and MPI framework is presented. Our experiments indicate that the MPI-OpenMP codes for FDM computations achieve 2X to 3X speed-ups for large mesh sizes. The FEM implementation has shown marginal improvement in speed-up on the Xeon Phi cluster.
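The MPI side of a one-dimensionally decomposed Jacobi iteration like the FDM computation described above can be sketched as follows; the grid size and iteration count are illustrative, and the OpenMP layer is omitted for brevity:

```c
/* 1-D domain decomposition: each rank owns NLOC interior points plus two
 * ghost cells, exchanged with its neighbours before every Jacobi sweep. */
#include <mpi.h>
#include <stdio.h>

#define NLOC 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double u[NLOC + 2] = {0}, unew[NLOC + 2] = {0};
    if (rank == 0) u[0] = 1.0;                    /* left boundary condition */

    for (int it = 0; it < 1000; it++) {
        /* Exchange ghost cells with both neighbours; sends and receives to
         * MPI_PROC_NULL at the domain ends are harmless no-ops. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, up, 0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, down, 1,
                     &u[0], 1, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Jacobi update of the interior (an OpenMP pragma could go here). */
        for (int i = 1; i <= NLOC; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        for (int i = 1; i <= NLOC; i++)
            u[i] = unew[i];
    }

    if (rank == 0) printf("u[1] = %f after 1000 iterations\n", u[1]);
    MPI_Finalize();
    return 0;
}
```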
Conference Paper
Full-text available
A method for solving the Discrete/Continuous Algebraic Riccati Equation in sequential and parallel/distributed forms, which modifies and proposes a parallelization for the Schur method of [1], is presented. To transform the symplectic/Hamiltonian matrix into a simple form, Elementary Stabilized Similarity Transformations are utilized. A sequential implementation of the proposed algorithm for dense matrices is made, and a parallel implementation on a distributed memory system with an asynchronous parallelization strategy over a network of workstations is proposed.
Article
Full-text available
Synchronisation mechanisms are essential in distributed simulation. Some systems rely on central units to control the simulation, but central units are known to be bottlenecks. If we want to avoid using a central unit to optimise the simulation speed, we lose the capacity to act on the simulation at a global scale. Being able to act on the entire simulation is an important feature which allows a distributed simulation to be dynamically load-balanced. While some local partitioning algorithms exist, their lack of a global view reduces their efficiency. Running a global partitioning algorithm without a central unit requires a synchronisation of all logical processes (LPs) at the same step. Two algorithms are proposed: the first requires the knowledge of some topological properties of the network, while the second works without any requirement. The algorithms are detailed and compared against each other. An evaluation shows the benefits of using global dynamic load-balancing for distributed simulations.
Conference Paper
In this paper, we propose simple protocols for enabling two communicating agents that may have never met before to extract common knowledge out of any initial knowledge that each of them possesses. The initial knowledge from which the agents start may even be independent of each other, implying that the two agents need not have had previous access to common information sources. In addition, the common knowledge extracted upon the termination of the protocols depends, in a fair way, on the (possibly independent) information items initially known, separately, by the two agents. It is fair in the sense that there is a negotiation between the two agents instead of one agent forcing the other to conform to its own knowledge. These protocols may be extended in order to support security applications where the establishment of common knowledge is required. Moreover, the implementation of the protocols leads to reasonably small code that can also fit within resource-limited devices involved in any communication network while, at the same time, being efficient, as simulation results demonstrate.
Article
This paper compares the performance and scalability of SHMEM and MPI-2 one-sided routines on different communication patterns for a SGI Origin 2000 and a Cray T3E-600. The communication tests were chosen to represent commonly used communication patterns with low contention (accessing distant messages, a circular right shift, a binary tree broadcast) to communication patterns with high contention (a 'naive' broadcast and an all-to-all). For all the tests and for small message sizes, the SHMEM implementation significantly outperformed the MPI-2 implementation for both the SGI Origin 2000 and Cray T3E-600. Copyright © 2004 John Wiley & Sons, Ltd.
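A hedged sketch of the kind of MPI-2 one-sided access pattern the paper compares against SHMEM (a single remote put between fences; not the paper's benchmark code):

```c
/* Each rank exposes one int through an RMA window; rank 0 writes into
 * rank 1's window without rank 1 posting a receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int window_buf = -1;
    MPI_Win win;
    MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        int value = 123;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);            /* completes the one-sided transfer */

    if (rank == 1)
        printf("rank 1 window now holds %d\n", window_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```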
Article
Moore's Law has driven the semiconductor revolution enabling over four decades of scaling in frequency, size, complexity, and power. However, the limits of physics are preventing further scaling of speed, forcing a paradigm shift towards multicore computing and parallelization. In effect, the system is taking over the role that the single CPU was playing: high-speed signals running through chips but also packages and boards connect ever more complex systems.
Article
Full-text available
Languages are being designed that simplify the tasks of creating, extending, and maintaining scientific applications specifically for use on parallel computing architectures. Widespread adoption of any language by the high performance computing (HPC) community is strongly dependent upon achieved performance of applications. A common presumption is that performance is adversely affected as the level of abstraction increases. In this paper we report on our investigations into the potential of one such language, Chapel, to deliver performance while adhering to its code development and maintenance goals. In particular, we explore how the unconstrained memory model presented by Chapel may be exploited by the compiler and runtime system in order to efficiently execute computations common to numerous scientific application programs. Experiments, executed on a Cray X1E, AMD dual-core, and Intel quad-core processor based systems, reveal that with the appropriate architecture and runtime support, the Chapel model can achieve performance equal to the best Fortran/MPI, Co-Array Fortran, and OpenMP implementations, while substantially easing the burden on the application code developer.
Chapter
The linear algebra problems are an important part of many algorithms, such as numerical solution of PDE systems. In fact, up to 80% or even more of computing time in this kind of algorithms is spent for linear algebra tasks. The parallelization of such solvers is the key for parallelization of many advanced algorithms. The mathematical objects library ParSol not only implements some important linear algebra objects in C++, but also allows for semiautomatic parallelization of data parallel and linear algebra algorithms, similar to High Performance Fortran (HPF). ParSol library is applied to implement the finite difference scheme used to solve numerically a system of PDEs describing a nonlinear interaction of two counterpropagating laser waves. Results of computational experiments are presented.
Conference Paper
The task of articulating some computer programs aimed at calculating reaction probabilities and reactive cross sections of elementary atom diatom reactions as concurrent computational processes is discussed. Various parallelization issues concerned with memory and CPU requirements of the different parallel models when applied to two classes of approach to the integration of the Schrödinger equation describing atom diatom elementary reactions are addressed. Particular attention is given to the importance of computational granularity for the choice of the model.
Conference Paper
PHCpack implements numerical algorithms for solving polynomial systems using homotopy continuation methods. In this paper we describe two types of interfaces to PHCpack. The first interface PHCmaple originally follows OpenXM, in the sense that the program (in our case Maple) that uses PHCpack needs only the executable version phc built by the package PHCpack. Following the recent development of PHCpack, PHCmaple has been extended with functions that deal with singular polynomial systems, in particular, the deflation procedures that guarantee the ability to refine approximations to an isolated solution even if it is multiple. The second interface to PHCpack was developed in conjunction with MPI (Message Passing Interface), needed to run the path trackers on parallel machines. This interface gives access to the functionality of PHCpack as a conventional software library.
Conference Paper
Full-text available
The Cell Broadband Engine™ is a heterogeneous multi-core architecture developed by IBM, Sony and Toshiba. It has eight computation intensive cores (SPEs) with a small local memory, and a single PowerPC core. The SPEs have a total peak single precision performance of 204.8 Gflops/s, and 14.64 Gflops/s in double precision. Therefore, the Cell has a good potential for high performance computing. But the unconventional architecture makes it difficult to program. We propose an implementation of the core features of MPI as a solution to this problem. This can enable a large class of existing applications to be ported to the Cell. Our MPI implementation attains bandwidth up to 6.01 GB/s, and latency as small as 0.41 μs. The significance of our work is in demonstrating the effectiveness of intra-Cell MPI, consequently enabling the porting of MPI applications to the Cell with minimal effort.