Article

MPI: The Complete Reference, 2nd Edition

... Still, as most BSML primitives are higher-order functions, and we need to use functions such as remove_option, a work-around was needed. Our solution is shown in module Abstract (lines [19][20][21][22][23][24]). Instead of writing a concrete implementation of remove_option, we declare a function remove_option without defining it, and we only give its semantics (with an axiom) when the pre-condition is met. ...
... To be able to specify and write BSML programs, we need BSML primitives in WhyML. BSML primitives are implemented in parallel on top of MPI [23], called through OCaml's Foreign Function Interface (FFI). Therefore, we cannot provide BSML in WhyML as an implementation. ...
... The axiomatization of BSML primitives can be found in Figure 5. The semantics of the functions mkpar, apply, proj and put are expressed in their contracts (lines [12][13][14][15][16][17][18][19][20][21][22][23][24]), while the strict positivity condition on bsp_p is given as an axiom on line 4. The type of parallel vector is abstract. ...
Preprint
BSML is a pure functional library for the multi-paradigm language OCaml. BSML embodies the principles of the Bulk Synchronous Parallel (BSP) model, a model of scalable parallel computing. We propose a formalization of BSML primitives with WhyML, the specification language of Why3 and specify and prove the correctness of most of the BSML standard library. Finally, we develop and verify the correctness of a small BSML application.
... The existing code, from which the models presented in this section are derived, is written in Fortran 90 and uses the Message Passing Interface (MPI) standard [56] for expressing parallelism. The rise of the abstraction level when (retro)modeling was performed could only be achieved with participation from mathematicians who passed on their knowledge during more or less formalized discussions. ...
... However, none of these languages are capable of exploiting massively parallel architectures directly, and they must rely on external components. The most widely used of these complementary solutions is certainly MPI (Message Passing Interface) [56] which is a standardized message-passing system. It is mainly used to exploit distributed memory architectures. ...
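The message-passing style this excerpt refers to can be illustrated with a minimal sketch (not taken from any of the cited works; the rank count, tag and payload are arbitrary):

```c
/* Minimal MPI point-to-point sketch: two ranks exchange one message over
 * distributed memory. Compile with an MPI wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        int value = 42;
        if (rank == 0) {
            /* Rank 0 sends a single integer to rank 1. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Rank 1 receives it; memory is distributed, so data moves only
             * through explicit messages. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```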
Article
Full-text available
This paper reports on a four-year project that aims to raise the abstraction level through the use of model-driven engineering (MDE) techniques in the development of scientific applications relying on high-performance computing. The development and maintenance of high-performance scientific computing software is reputedly a complex task. This complexity results from the frequent evolutions of supercomputers and the tight coupling between software and hardware aspects. Moreover, current parallel programming approaches result in a mixing of concerns within the source code. Our approach relies on the use of MDE and consists in defining domain-specific modeling languages targeting various domain experts involved in the development of HPC applications, allowing each of them to handle their dedicated model in a both user-friendly and hardware-independent way. The different concerns are separated thanks to the use of several models as well as several modeling viewpoints on these models. Depending on the targeted execution platforms, these abstract models are translated into executable implementations by means of model transformations. To make all of these effective, we have developed a tool chain that is also presented in this paper. The approach is assessed through a multi-dimensional validation that focuses on its applicability, its expressiveness and its efficiency. To capitalize on the gained experience, we analyze some lessons learned during this project.
... This thesis is about using the Message Passing Interface (MPI) [45] in a P2P computing framework called F2F Computing [43]. ...
... Its goal was to make a usable standard for distributed and parallel systems, taking ideas from existing systems, but not selecting one of them for standardization. [45] The process of standardization began in a workshop in Williamsburg, Virginia at the end of April 1992. After just one year of work, the first draft of MPI-1 was finished in 1993. ...
... This effort was successful, because the first draft had the main features, and the first official version of MPI-1 was presented in June 1994. [45] In 1995 an updated version of MPI was released, MPI-1.1. The changes from version 1.0 are minor. ...
... In our setting, parallel-in-time integration is considered as a possibility for additional fine-grain parallelism on top of an existing coarse-grain spatial decomposition. In a preliminary phase, we have decided to simulate the parallelization in time, whereas the parallelization in space is truly implemented on a distributed memory system using the Message Passing Interface (MPI) [19]. This allows us to predict at a very moderate cost if the time parallelization can be relevant in our study. ...
... In a first phase, we have decided to simulate the parallelization in time, whereas the parallelization in space is truly implemented on a distributed memory system. In Hybrid, the computation of the right-hand side of the Navier-Stokes equations involves 39 gradient evaluations (3 for (14), 18 for (15), 6 for (16), 9 for (18) and 3 for (19), respectively). Since the RK4 time integration method requires 4 evaluations of right-hand sides, the total number of gradient evaluations for one time step of the coarse solver is then 156. ...
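As a worked check of the count quoted in the excerpt above (using only the per-equation breakdown it gives):

```latex
% Gradient evaluations per right-hand side, then per RK4 time step.
\begin{align*}
  \text{gradients per RHS}      &= 3 + 18 + 6 + 9 + 3 = 39,\\
  \text{gradients per RK4 step} &= 4 \times 39 = 156.
\end{align*}
```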
Article
Full-text available
Direct Numerical Simulation of turbulent flows is a computationally demanding problem that requires efficient parallel algorithms. We investigate the applicability of the time-parallel Parareal algorithm to an instructional case study related to the simulation of the decay of homogeneous isotropic turbulence in three dimensions. We combine a Parareal variant based on explicit time integrators and spatial coarsening with the space-parallel Hybrid Navier–Stokes solver. We analyse the performance of this space–time parallel solver with respect to speedup and quality of the solution. The results are compared with reference data obtained with a classical explicit integration, using an error analysis which relies on the energetic content of the solution. We show that a single Parareal iteration is able to reproduce with high fidelity the main statistical quantities characterizing the turbulent flow field.
... The most popular message passing technology is the Message Passing Interface (MPI) [4] [258], a message passing library for C and Fortran. MPI is an industry standard and is implemented on a wide range of parallel computers. ...
... For performance reasons, a BSP library can send messages during the computation phase of a super-step, but this is hidden from programmers. BSP libraries are generally implemented using MPI [258] or low-level routines of the given specific architectures. ...
Article
This thesis takes part in the formal verification of parallel programs. The aim of formal verification is to ensure that a program will run as it should, without making mistakes, blocking, or terminating abnormally. This is even more important in the parallel computation field, where the cost of calculations can be very high. The BSP model (Bulk Synchronous Parallelism) is a model of parallelism well suited for the use of formal methods. It guarantees a structure in the parallel program, by organising it into super-steps, each of them consisting of a phase of computations, then communications between the processes. In this thesis, we chose to extend an existing tool to adapt it for the proof of BSP programs. We based ourselves on Why, a VCG (verification condition generator) that has the advantage of being able to interface with several automatic provers and proof assistants to discharge the proof obligations. There are multiple contributions in this thesis. In a first part, we present a comparison of the existing BSP libraries, in order to show the most used BSP primitives, which are the most interesting to formalise. We then present BSP-Why, our tool for the proof of BSP programs. This tool generates a sequential program to simulate the input parallel program, thus allowing the use of Why and the numerous associated provers to prove the proof obligations. We then show how BSP-Why can be used to prove the correctness of some basic BSP algorithms, and also of a more complex example: the generation of the state-space (model-checking) of systems, especially for security protocols. Finally, in order to ensure the greatest confidence in the BSP-Why tool, we give a formalisation of the language semantics in the Coq proof assistant. We also prove the correctness of the transformation used to go from a parallel program to a sequential program.
... The target platform for our experimental study is a cluster of 16 [...]. MPI is software for message passing, proposed as a standard by a broad committee of vendors, implementers, and users [39]. MPI is portable and flexible software that is widely used. ...
... In our experiments we implemented PD1 and PD2 in the C programming language using the MPI library [35,39]. These two implementations were tested using vectors of orders ranging from 512 up to 2048 with block sizes of 2, 4 and 8 elements. ...
Article
Full-text available
This paper is focused on designing two parallel dot product implementations for heterogeneous master-worker platforms. These implementations are based on data allocation and dynamic load balancing strategies. The first implementation is the dynamic master-worker with allocation of vectors, where the master distributes vectors (data) and computations to the workers, whereas the second one is the dynamic master-worker with allocation of vector pointers, where the vectors are supposed to be replicated among participating resources beforehand and the master distributes computations to the workers. We also report a set of numerical experiments on a heterogeneous platform where computational resources have different computing powers and the workers are connected to the master by links of the same capacity. The obtained experimental results demonstrate that the dynamic allocation of vector pointers achieves better performance than the original implementation for computing the dot product. The paper also presents and verifies an accurate timing model to predict the performance of the proposed implementations on clusters of heterogeneous workstations. Through this model the viability of the proposed implementations can be assessed without the extra effort that would be needed to carry out real testing.
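A hedged sketch of a dynamic master-worker dot product in the spirit of the implementations described above; this is not the authors' PD1/PD2 code, and the vector size, block size and message tags are illustrative assumptions:

```c
/* Dynamic master-worker dot product sketch: the master hands out block start
 * indices on demand, workers return partial sums. Vectors are assumed to be
 * replicated on every process, as in the "vector pointers" variant. */
#include <mpi.h>
#include <stdio.h>

#define N        2048
#define BLOCK    8
#define TAG_WORK 1
#define TAG_DONE 2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    if (rank == 0) {                          /* master */
        double total = 0.0, partial;
        int next = 0, active = size - 1;
        MPI_Status st;
        while (active > 0) {
            /* Any incoming message is both a partial result and a work request. */
            MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += partial;
            if (next < N) {                   /* hand out the next block */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next += BLOCK;
            } else {                          /* no work left: release the worker */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("dot = %f\n", total);
    } else {                                  /* worker */
        double partial = 0.0;                 /* first send is just a request */
        int start;
        MPI_Status st;
        for (;;) {
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&start, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            partial = 0.0;
            for (int i = start; i < start + BLOCK && i < N; i++)
                partial += x[i] * y[i];
        }
    }
    MPI_Finalize();
    return 0;
}
```

Because workers pull blocks on demand, a faster worker simply processes more blocks, which is the load-balancing effect the abstract describes.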
... Thus we will need encapsulated communications between different architectures and subset synchronization [35]. Version 0.2 of the BSML library is hence based on MPI [37]. It also has a smaller number of primitives, which are closer to the BSλ-calculus than the primitives of version 0.1. ...
... This improves on the earlier design DPML/Caml Flight [16,12] in which the global parallel control structure sync had to be prevented dynamically from nesting. This is very different from SPMD programming (Single Program Multiple Data) where the programmer must use a sequential language and a communication library (like MPI [37]). A parallel program is then the multiple copies of a sequential program, which exchange messages using the communication library. ...
... The Message Passing Interface (MPI) library [154][155][156] was used to distribute ROIs across computing nodes at the coarse scale. At the finer scale, ROIs were delegated to Open Multi-Processing (OpenMP) [102] threads. ...
Preprint
Volumetric crystal structure indexing and orientation mapping are key data processing steps for virtually any quantitative study of spatial correlations between the local chemistry and the microstructure of a material. For electron and X-ray diffraction methods it is possible to develop indexing tools which compare measured and analytically computed patterns to decode the structure and relative orientation within local regions of interest. Consequently, a number of numerically efficient and automated software tools exist to solve the above characterisation tasks. For atom probe tomography (APT) experiments, however, the strategy of making comparisons between measured and analytically computed patterns is less robust because many APT datasets may contain substantial noise. Given that general enough predictive models for such noise remain elusive, crystallography tools for APT face several limitations: their robustness to noise, and therefore their capability to identify and distinguish different crystal structures and orientations, is limited. In addition, the tools are sequential and demand substantial manual interaction. In combination, this makes robust uncertainty quantification in automated high-throughput studies of the latent crystallographic information a difficult task with APT data. To improve the situation, we review the existing methods and discuss how they link to those in the diffraction communities. With this we modify some of the APT methods to yield more robust descriptors of the atomic arrangement. We report how this enables the development of an open-source software tool for strong-scaling, automated identification of crystal structure and mapping of crystal orientation in nanocrystalline APT datasets with multiple phases.
... Hence, we used the Open Multi-Processing (OpenMP) framework, which is a parallel programming framework dedicated to systems with shared memory [6]. According to a comparative study, OpenMP has shown better results in terms of scalability than the Message Passing Interface (MPI) [22] and MapReduce [8]. Moreover, the authors of [15] concluded that the latter is a good option when the problem requires intensive computation and the amount of data is moderate. ...
... Moreover, reliability mechanisms are greatly hampered by the shared state: for example, a lock becomes permanently unavailable if the thread holding it fails. The High Performance Computing (HPC) community builds large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [9]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately MPI is not suitable for producing general purpose concurrent software as it is too low level, with explicit, synchronous message passing. ...
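The MPI+OpenMP combination mentioned in this excerpt typically follows the pattern below; a minimal sketch, not code from the cited work, with a placeholder computation:

```c
/* Hybrid sketch: OpenMP threads reduce locally inside each MPI process,
 * then MPI_Allreduce combines the per-process results across nodes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* FUNNELED is sufficient here: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000000;
    double local = 0.0;

    /* Shared-memory parallelism inside one process. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / n;                     /* each rank contributes ~1.0 */

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (= number of MPI ranks)\n", global);

    MPI_Finalize();
    return 0;
}
```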
Article
Full-text available
Large scale servers with hundreds of hosts and tens of thousands of cores are becoming common. To exploit these platforms software must be both scalable and reliable, and distributed actor languages like Erlang are a proven technology in this area. While distributed Erlang conceptually supports the engineering of large scale reliable systems, in practice it has some scalability limits that force developers to depart from the standard language mechanisms at scale. In earlier work we have explored these scalability limitations, and addressed them by providing a Scalable Distributed (SD) Erlang library that partitions the network of Erlang Virtual Machines (VMs) into scalable groups (s groups). This paper presents the first systematic evaluation of SD Erlang s groups and associated tools, and how they can be used. We present a comprehensive evaluation of the scalability and reliability of SD Erlang using three typical benchmarks and a case study. We demonstrate that s groups improve the scalability of reliable and unreliable Erlang applications on up to 256 hosts (6144 cores). We show that SD Erlang preserves the class-leading distributed Erlang reliability model, but scales far better than the standard model. We present a novel, systematic, and tool-supported approach for refactoring distributed Erlang applications into SD Erlang. We outline the new and improved monitoring, debugging and deployment tools for large scale SD Erlang applications. We demonstrate the scaling characteristics of key tools on systems comprising up to 10K Erlang VMs.
... Many applications running on HPC systems use message passing to take advantage of the huge amount of resources available in those centers. To efficiently develop these parallel applications, the preferable framework has been the well-known Message Passing Interface (MPI) [23]. There exist several approaches that model the communication generated by MPI applications, and depending on how these communications are used at simulation time, two types of simulation can be distinguished: on-line simulation [19,21,30] and off-line simulation [5,16,18,25]. ...
Article
Full-text available
Simulation is often used in order to evaluate the behavior and the performance of computing systems. Specifically, in the field of high-performance interconnection networks for HPC clusters the simulation has been extensively considered to verify and validate network operation models and to evaluate their performance. Nevertheless, experiments conducted to evaluate network performance using simulation tools should be fed with realistic network traffic from real benchmarks and/or applications. This approach has grown in popularity because it allows to evaluate the simulation model under realistic traffic situations. In this paper, we propose a family of tools for modeling realistic workloads which capture the behavior of MPI applications into self-related traces called VEF traces. The main novelty of this approach is that it replays the MPI collective operations with their corresponding messages, offering an MPI message-based task simulation framework. The proposed framework neither provides a network simulator nor depends on any specific simulation platform. Besides, this framework allows us to use the generated traces by any third-party network simulator working at message level.
... This level of parallelization is implemented via the Message Passing Interface (MPI) mechanism (Gropp et al., 2000). The initial data is broadcast to the workers. ...
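The broadcast step mentioned here can be sketched as follows (a minimal illustration, not the cited simulation code; the parameter array is a placeholder):

```c
/* Rank 0 holds the initial data; MPI_Bcast copies it to every worker. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double params[4] = {0};
    if (rank == 0) {                  /* root prepares the initial data */
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0; params[3] = 4.0;
    }
    /* After this call every rank holds the same four values. */
    MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d: params[3] = %f\n", rank, params[3]);
    MPI_Finalize();
    return 0;
}
```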
Chapter
This chapter describes an implementation of an online dynamic security assessment system based on time-domain simulation, in which dynamic models are the same as those used offline for planning studies with all necessary details. It contains a review of the adopted methods and algorithms (power flow, continuation power flow, time-domain simulation, energy functions, single machine equivalent, and Prony spectral decomposition) focusing on the main issues related to their numerical and computational performance. The utilization of these methods to execute complex security tasks such as dynamic contingency analysis and security region computations is described. Aspects of high-performance computation (fine- and coarse-grain parallelization) are discussed. Some practical results obtained online and comparisons of online with offline planning cases are shown.
... One possible solution is to return a specific return value, with a different value for each possible error scenario. This is the solution adopted in C, for example, with the integer return value of the main function, or with the return value of certain MPI functions [103]
Article
Parallel architectures have now reached every computing device, but software developers generally lack the skills to program them through explicit models such as MPI or the Pthreads. There is a need for more abstract models such as algorithmic skeletons, which are a structured approach. They can be viewed as higher-order functions that represent the behaviour of common parallel algorithms, and these are combined by the programmer to generate parallel programs. Programmers want to obtain better performance through the use of parallelism, but the development time involved is also an important factor. Algorithmic skeletons provide interesting results in both those respects. The Orléans Skeleton Library, or OSL, provides a set of algorithmic skeletons for data parallelism within the bulk synchronous parallel model for the C++ language. It uses advanced metaprogramming techniques to obtain good performance. We improved OSL in order to obtain better performance from its generated programs, and extended its expressivity. We wanted to analyse the ratio between the performance of programs and the development effort needed within OSL and other parallel programming models. The comparison between parallel programs written within OSL and their equivalents in low-level parallel models shows a better productivity for high-level models: they are easy to use for the programmers while providing decent performance.
... days or weeks). The jobs are tightly coupled, and use the message-passing interface (MPI) programming model [3] for communication and synchronization among all the processors. Although it is necessary to support HPC applications that demand the computing capacity of an exascale machine, it is also important to enable ensemble runs of applications that have uncertainty in high-dimension parameter space. ...
Conference Paper
Full-text available
One way to efficiently utilize the coming exascale machines is to support a mixture of applications in various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). Delivering high performance in resource allocation, scheduling and launching for all types of jobs has driven us to develop Slurm++, a distributed workload manager directly extended from the Slurm centralized production system. Slurm++ employs multiple controllers, with each one managing a partition of compute nodes and participating in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs, and we expect the performance gap to grow as the job sizes and system scales increase in future high-end computing systems.
... State-of-the-art parallel FFTs such as the FFTW-MPI [33] are designed according to the computational scheme shown in Fig. 1(a). The FFTW-MPI is the parallel version of the FFTW, designed for running on parallel distributed systems using the Message Passing Interface (MPI) library [36]. Nevertheless, the astonishing performance of the FFTW for single processor machines is unfortunately not found in the one-dimensional FFTW-MPI designed for distributed memory systems. ...
Article
An efficient parallel implementation of a nonparaxial beam propagation method for the numerical study of the nonlinear Helmholtz equation is presented. Our solution focuses on minimizing the communication and computational demands of the method, which depend on a nonparaxiality parameter. Performance tests carried out on different types of parallel systems behave according to theoretical predictions and show that our proposal exhibits better behaviour than solutions based on the use of conventional parallel fast Fourier transform implementations. The application of our design is illustrated in a particularly demanding scenario: the study of dark solitons at interfaces separating two defocusing Kerr media, where it is shown to play a key role.
... An interpolation algorithm has been used to reduce time and space cost for massive data. All of these techniques are parallelized to process large data on multiple compute nodes, using MapReduce [16], iterative MapReduce [17] and/or MPI [18] frameworks. We improved the parallel efficiency of DACIDR by developing a hybrid workflow model on high performance computers (HPC) [19]. ...
Conference Paper
Full-text available
The recent advance in next generation sequencing (NGS) techniques has enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing of individual microbial species in the lab. One can profile the marker genes of 16S rRNA encoded in the sample through the amplification of highly variable regions in the genes and sequencing of them using Roche/454 sequencers to generate half a million to a few million 16S rRNA fragments of about 400 base pairs. The main computational challenge of analyzing such data is to group these sequences into operational taxonomic units (OTUs). Common clustering algorithms (such as hierarchical clustering) require quadratic space and time complexity, which makes them unsuitable for large datasets with millions of sequences. An alternative is to use greedy heuristic clustering methods (such as CD-HIT and UCLUST); although these enable fast sequence analysis, the hard-cutoff similarity threshold set for them and the random starting seeds can result in reduced accuracy and overestimation (too many clusters). In this paper, we propose DACIDR: a parallel sequence clustering and visualization pipeline, which can address the overestimation problem along with space and time complexity issues as well as giving robust results. The pipeline starts with a parallel pairwise sequence alignment analysis followed by a deterministic annealing method for both clustering and dimension reduction. No explicit similarity threshold is needed during clustering. Experiments with our system also proved that the quadratic time and space complexity issue could be solved with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirements. Furthermore, SSP-Tree can enhance the speed of fine-tuning on the existing result, which makes recursive clustering possible to achieve accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods.
... The developed applications should work efficiently on multicore and multiprocessor systems. An implementation of MPI (message passing interface) is the traditional means of creating such applications [13]. MPI provides a highly flexible ability to develop applications that address specific requirements of the algorithm structure and the used computing infrastructure. ...
Article
Full-text available
Design and analysis of complex nanophotonic and nanoelectronic structures require significant computing resources. Cloud computing infrastructure allows distributed parallel applications to achieve greater scalability and fault tolerance. The problems of effective use of high-performance computing systems for modeling and simulation of subwavelength diffraction gratings are considered. Rigorous coupled-wave analysis (RCWA) is adapted to cloud computing environment. In order to accomplish this, data flow of the RCWA is analyzed and CPU-intensive operations are converted to data-intensive operations. The generated data sets are structured in accordance with the requirements of MapReduce technology.
... Each such type of machine instance is typically suited to a particular type of parallelization paradigm. Virtualized setups mimicking desktop computing configurations are available with a range of core count and memory and are suitably used for multicore or shared memory parallelization using OpenMP [3]. Cluster compute instances, which are specially-hosted on machines in close proximity and interconnected with high bandwidth cables to reduce latency, are suited to parallelization using the Message Passing Interface (MPI) protocol [4]. ...
Article
Full-text available
Cloud computing is a potential paradigm-shifter for system-level electronic design automation tools for chip-package-board design. However, exploiting the true power of on-demand scalable computing is as yet an unmet challenge. We examine electromagnetic (EM) field simulation on cloud platforms.
... Because of the popularity of linking independent PCs in Beowulf-type clusters, message passing has become the dominant style of parallel computing. Gropp and Lusk [2], Gropp et al. [3], and Snir et al. [4] are excellent references for parallel computation using message passing. Dongarra et al. [5] and Foster [6] are useful references for general parallel programming. ...
... For multithreaded programming [1], low-level libraries such as pthreads [2] and Windows Threads [3] and high-level libraries like OpenMP [4] are the most popular options available for writing parallel software. Additionally, communication mechanisms and APIs such as MPI [5], PVM [6], and CRL [7] are widely used in distributed application development, offering a variety of primitives for point-to-point and collective operations, as well as process control, startup and shutdown. ...
Article
Full-text available
Current high-performance application development is increasing due to breakthrough advances in microprocessor and power management technologies, and in network speed and reliability. In this context, distributed parallel applications make use of message-passing interface and multithreaded programming libraries. Nevertheless, drawbacks in message-passing implementations limit the use of thread-safe network communication. This paper presents a thread-safe message-passing interface based on the MPI Standard, assuring correct message ordering and sender/receiver synchronization.
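A minimal sketch of the usual starting point for such a thread-safe layer, assuming nothing about the paper's actual interface: request full multithreading support from MPI and check what the library actually grants.

```c
/* Request MPI_THREAD_MULTIPLE and verify the provided thread level. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* The implementation will not let several threads call MPI
         * concurrently; a wrapper library must serialize the calls itself. */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("only thread level %d provided; falling back to serialized MPI calls\n",
                   provided);
    }

    MPI_Finalize();
    return 0;
}
```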
... MPICH-G2) with MPICH-2.1.0 and MPICH-1.2.7, which are cluster implementations of MPI [8]. To compare the performance of the different MPI implementations, jobs are submitted to the HPC cluster. ...
Article
Full-text available
This paper presents the application of a parallel high accuracy simulation code for incompressible Navier-Stokes solution with free surface flow around an ACV (Air Cushion Vehicle) on the ShirazUCFD Grid environment. The parallel finite volume code is developed for an incompressible Navier-Stokes solver on a general curvilinear coordinate system for modeling free surface flows. A single set of dimensionless equations is derived to handle both liquid and air phases in viscous incompressible free surface flow in general curvilinear coordinates. The volume of fluid (VOF) method with Lagrangian propagation in the computational domain is implemented for modeling the free surface flow. The parallelization approach uses a domain decomposition method for the subdivision of the numerical grid, the SPMD program model, and MPICH-G2 as the message passing environment to obtain a portable application.
... The structure of MPI is very well-known and hence we simply refer the reader to one of the books on MPI [3] [29]. For MPI-2 extensions, not considered herein, refer to [30]. ...
Article
A Linux cluster with Gigabit Ethernet interconnect is a local and accessible resource for solving scientific problems, including finite-element simulations. As general-purpose protocol stacks are not designed for parallel computing, the delivered throughput and latency may be significantly below that suggested by the hardware specification of the interconnect. The paper evaluates communication software across the protocol stack from high-level communication harnesses, such as MPI, Charm++ and MOSIX, through intermediate communication primitives, such as sockets, streams, and pipes, to underlying protocols such as TCP and TIPC, as well as low-level User-level Network Interfaces such as GAMMA. The evaluation is not only in terms of short message latency and throughput but in terms of the efficiency of the solution, that is the throughput per computation overhead. There are a number of findings, such as the relative efficiency of TIPC with Open MPI; confirmation of the advantage of GAMMA with MPI; the weaker performance of the single system image software under test; and the enhancement from adding application-level load-balancing to the system-level load-balancing available in some communication harnesses.
... C is the implementation language used by the High Performance Linpack code. A reference implementation of a scalable version of the Linpack benchmark [20, 22] requires an implementation of the MPI library [19, 21] for communicating between the processors of a parallel computer. The linker and the runtime of the C programming language are usually used by all other languages, and C is often used as the compilation target for many of the languages used in this study. ...
Article
Full-text available
The research team explores a rich feature set, large algorithmic variety, and detailed implementation considerations for one of the most fundamental computational kernels of computational science: LU factorization of a dense matrix by Gaussian elimination with partial pivoting. For the target implementation platforms and systems, they analyze and compare established shared and distributed memory environments as well as relatively new Partitioned Global Address Space programming languages, including those coming from the High Productivity Computing Systems (HPCS) project. To give quantitative measures for each hardware platform, combined with implementation characteristics, they compare scalability and raw and relative performance, as well as source code features, functionality, and absolute size breakdown as measured by Source Lines of Code (SLOC).
... Zhuang and Zhu (1994) demonstrated parallelization in a reservoir simulator using MPI. Beyond the basic send and receive capabilities, some MPI features, such as communicators, topologies, communication modes and single-call collective operations (Snir et al. 1998, Gropp et al. 1999), were employed to parallelize the serial-MARS algorithm. ...
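A hedged illustration of the MPI features named in this excerpt (communicators, topologies, single-call collectives); this is not the serial-MARS parallelization itself, just a minimal ring topology combined with a collective reduction:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cartesian topology: arrange all ranks in one periodic ring,
     * which yields a new communicator. */
    int dims[1] = {size}, periods[1] = {1};
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring);

    int left, right;
    MPI_Cart_shift(ring, 0, 1, &left, &right);   /* neighbours in the ring */

    /* Single-call collective: sum one value contributed by every rank. */
    int local = rank, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, ring);

    if (rank == 0)
        printf("sum of ranks = %d (left=%d, right=%d)\n", total, left, right);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```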
Article
Full-text available
In this paper, a parallelized version of multivariate adaptive regression splines (MARS, Friedman 1991) is developed and utilized within a decision-making framework (DMF) based on an OA/MARS continuous-state stochastic dynamic programming (SDP) method (Chen et al. 1999). The DMF is used to evaluate current and emerging technologies for the multi-level liquid line of a wastewater treatment system, involving up to eleven levels of treatment and monitoring ten pollutants moving through the levels of treatment. At each level, one technology unit is selected out of a set of options which includes the empty unit. The parallel-MARS algorithm provides the computational efficiency needed to solve this ten-dimensional SDP problem using a new solution method which employs orthogonal array-based Latin hypercube designs and a much higher number of eligible knots.
... For good reasons MPI (the Message Passing Interface) [4,9] comes without a performance model and, apart from some "advice to implementers," without any requirements or recommendations as to what a good implementation should satisfy regarding performance. The main reasons are, of course, that the implementability of the MPI standard should not be restricted to systems with specific interconnect capabilities and that implementers should be given maximum freedom in how to realize the various MPI constructs. ...
Conference Paper
Full-text available
The MPI Standard does not make any performance guarantees, but users expect (and like) MPI implementations to deliver good performance. A common-sense expectation of performance is that an MPI function should perform no worse than a combination of other MPI functions that can implement the same functionality. In this paper, we formulate some performance requirements and conditions that good MPI implementations can be expected to fulfill by relating aspects of the MPI standard to each other. Such a performance formulation could be used by benchmarks and tools, such as SKaMPI and Perfbase, to automatically verify whether a given MPI implementation fulfills basic performance requirements. We present examples where some of these requirements are not satisfied, demonstrating that there remains room for improvement in MPI implementations.
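One self-consistency rule of the kind the abstract describes is that MPI_Allreduce should perform no worse than the semantically equivalent MPI_Reduce followed by MPI_Bcast. A deliberately simplistic timing sketch of that comparison (not the SKaMPI/Perfbase tooling mentioned above) could look like this:

```c
/* Time MPI_Allreduce against the equivalent Reduce+Bcast combination. */
#include <mpi.h>
#include <stdio.h>

#define N    1024
#define REPS 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++)
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t_allreduce = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        MPI_Reduce(in, out, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Bcast(out, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }
    double t_composed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("Allreduce: %.6f s   Reduce+Bcast: %.6f s\n",
               t_allreduce, t_composed);

    MPI_Finalize();
    return 0;
}
```

If the composed version were consistently faster, the implementation would violate the expectation formulated in the paper.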
Article
Full-text available
Gene expression programming (GEP) is a data-driven evolutionary technique that is well suited to correlation mining of system components. With the rapid development of Industry 4.0, the number of components in a complex industrial system has increased significantly, with a high complexity of correlations. As a result, a major challenge in employing GEP to solve system engineering problems lies in the computational efficiency of the evolution process. To address this challenge, this paper presents EGEP, an event-tracker-enhanced GEP, which filters irrelevant system components to ensure the evolution process converges quickly. Furthermore, we introduce three theorems to mathematically validate the effectiveness of EGEP based on a GEP schema theory. Experimental results also confirm that EGEP outperforms GEP with a shorter computation time in evolution.
Thesis
Full-text available
http://www.mathematik.uni-leipzig.de/Media/DissAbstracts/abstract.khaleghabadi.pdf https://ul.qucosa.de/api/qucosa%3A11798/attachment/ATT-0/
Experiment Findings
Full-text available
A load balancing algorithm that distributes the load evenly across the cluster increases the speed of a high-performance clustered system, thanks to the parallel computation capabilities of its compute nodes. The most attractive feature of a load balancing algorithm is the redistribution of load from heavily loaded compute nodes to lightly loaded ones during execution, which is called process migration. Process migration time can be reduced with the new method in this algorithm; hence, some policies must be considered when making load-transfer decisions as well as during process migration. Constructing a dynamic load balancing algorithm requires parallel programming with MPI. Parallel programming on the cluster can be carried out using the message passing interface (MPI) or an application programming interface (API); this paper uses only the MPI library to build the new load balancing algorithm. Because the workload of a cluster system is highly variable, balancing the load across its compute nodes is increasingly difficult. This paper proposes a new approach to an existing dynamic load balancing algorithm, implemented on a Rock cluster, which gives better performance most of the time. The algorithm uses demand-driven load collection methods, so its speed increases. The paper also compares the previous dynamic load balancing algorithm with the new one and reports the performance of the new algorithm.
Article
We describe a general framework for adding the values of two approximate counters to produce a new approximate counter value whose expected estimated value is equal to the sum of the expected estimated values of the given approximate counters. (To the best of our knowledge, this is the first published description of any algorithm for adding two approximate counters.) We then work out implementation details for five different kinds of approximate counter and provide optimized pseudocode. For three of them, we present proofs that the variance of a counter value produced by adding two counter values in this way is bounded, and in fact is no worse, or not much worse, than the variance of the value of a single counter to which the same total number of increment operations have been applied. Addition of approximate counters is useful in massively parallel divide-and-conquer algorithms that use a distributed representation for large arrays of counters. We describe two machine-learning algorithms for topic modeling that use millions of integer counters and confirm that replacing the integer counters with approximate counters is effective, speeding up a GPU-based implementation by over 65% and a CPU-based implementation by nearly 50%, as well as reducing memory requirements, without degrading their statistical effectiveness.
Article
Full-text available
Fast execution of applications is achieved through parallel execution of processes. This is easily achieved by a high-performance cluster (HPC) through concurrent processing on its compute nodes. The HPC cluster provides supercomputing power by executing a dynamic load balancing algorithm on its compute nodes. The main objective of a dynamic load balancing algorithm is to distribute the workload evenly among the compute nodes to increase the overall efficiency of the clustered system. The logic of a dynamic load balancing algorithm requires parallel programming, which can be achieved on the HPC cluster through the message passing interface in C. The MPI library plays a very important role in building the new load balancing algorithm. The workload on an HPC cluster system can be highly variable, increasing the difficulty of balancing the load across its compute nodes. This paper proposes a new variant of an existing dynamic load balancing algorithm that mixes centralized and decentralized approaches; it is implemented on a Rock cluster and gives better performance most of the time. The paper also compares the previous dynamic load balancing algorithm with the new one.
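A hedged sketch of the load-collection step such an algorithm relies on; this is not the paper's algorithm, and the per-node load values are faked for illustration:

```c
/* Demand-driven load collection: the coordinator gathers per-node load only
 * when a transfer decision is needed and pairs the most and least loaded nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each compute node reports its current load (here: a fake queue length). */
    int my_load = (rank * 7 + 3) % 20;

    int *loads = NULL;
    if (rank == 0) loads = malloc(size * sizeof(int));
    MPI_Gather(&my_load, 1, MPI_INT, loads, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int hi = 0, lo = 0;
        for (int i = 1; i < size; i++) {
            if (loads[i] > loads[hi]) hi = i;
            if (loads[i] < loads[lo]) lo = i;
        }
        /* A real algorithm would now trigger process migration from hi to lo. */
        printf("move work from rank %d (load %d) to rank %d (load %d)\n",
               hi, loads[hi], lo, loads[lo]);
        free(loads);
    }

    MPI_Finalize();
    return 0;
}
```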
Conference Paper
HPC systems are widely used for accelerating calculation-intensive irregular applications, e.g., molecular dynamics (MD) simulations, astrophysics applications, and irregular grid applications. As the scalability and complexity of current HPC systems keep growing, it is difficult to parallelize these applications in an efficient fashion due to irregular communication patterns, load imbalance issues, dynamic characteristics, and more. This paper presents a fine granular programming scheme, with which programmers are able to implement parallel scientific applications in a fine granular and SPMD (single program multiple data) fashion. Different from current programming models that start from the global data structure, this programming scheme provides a high-level and object-oriented programming interface that supports writing applications by focusing on the finest granular elements and their interactions. Its implementation framework takes care of the implementation details, e.g., the data partition, automatic EP aggregation, memory management, and data communication. The experimental results on SuperMUC show that the OOP implementations of multi-body and irregular applications have little overhead compared to the manual implementations using C++ with OpenMP or MPI. However, the scheme improves programming productivity in terms of the source code size, the coding method, and the implementation difficulty.
Chapter
In 1991, CERFACS was commissioned by the French climate modelling community to perform the technical assembling of an ocean General Circulation Model (GCM), Océan Parallélisé (OPA) developed by the Laboratoire d’Océanographie Dynamique et de Climatologie (LODYC), and two different atmospheric GCMs, Action de Recherche Petite Echelle Grande Echelle (ARPEGE) and the Laboratoire de Météorologie Dynamique zoom (LMDz) model developed respectively by Météo-France and the Laboratoire de Météorologie Dynamique (LMD).
Chapter
For the distributed-memory models we again distinguish, as with the shared-memory models, between concurrent and cooperative models. The cooperative models are subdivided into message-based models, in which one process can send a message (send) to another process, which then receives it (receive), the invocation of a service being packed into a message; and remote invocations, in which the service is called directly. The service can be a procedure, a method of an object, a method specified by its interface and thus a method of a component, or a service.
Article
In this paper we present a generic framework for parallel branch and bound algorithms which has these features. We discuss and motivate the choices we have made, and present test results from an object-oriented implementation.
Article
This paper discusses an efficient implementation of finite difference method (FDM) and finite element method (FEM) computations for a partial differential equation (the Poisson equation) on a message passing cluster with Intel Xeon Phi coprocessors [6,15]. We have performed computations on PARAM YUVA-II [9], which is a message passing cluster with Xeon multi-core processors and Xeon Phi coprocessors as compute nodes [6,15,17-19]. A combination of OpenMP [4] and MPI [5,19,20] is used for structured grid FDM computations. Unstructured triangular and hexahedral meshes and the graph partitioning software METIS [10] are used in the FEM computations. The Jacobi iterative method is used to solve the resulting system of linear equations. A detailed performance analysis of optimizations on the Xeon Phi coprocessor using the OpenMP and MPI framework is presented. Our experiments indicate that the MPI-OpenMP codes for FDM computations achieve 2X to 3X speed-ups for large mesh sizes. The FEM implementation has shown marginal improvement in speed-up on the Xeon Phi cluster.
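The MPI side of a one-dimensionally decomposed Jacobi iteration like the FDM computation described above can be sketched as follows; the grid size and iteration count are illustrative, and the OpenMP layer is omitted for brevity:

```c
/* 1-D domain decomposition: each rank owns NLOC interior points plus two
 * ghost cells, exchanged with its neighbours before every Jacobi sweep. */
#include <mpi.h>
#include <stdio.h>

#define NLOC 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double u[NLOC + 2] = {0}, unew[NLOC + 2] = {0};
    if (rank == 0) u[0] = 1.0;                    /* left boundary condition */

    for (int it = 0; it < 1000; it++) {
        /* Exchange ghost cells with both neighbours; sends and receives to
         * MPI_PROC_NULL at the domain ends are harmless no-ops. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, up, 0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, down, 1,
                     &u[0], 1, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Jacobi update of the interior (an OpenMP pragma could go here). */
        for (int i = 1; i <= NLOC; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        for (int i = 1; i <= NLOC; i++)
            u[i] = unew[i];
    }

    if (rank == 0) printf("u[1] = %f after 1000 iterations\n", u[1]);
    MPI_Finalize();
    return 0;
}
```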
Conference Paper
Full-text available
A method for solving the Discrete/Continuous Algebraic Riccati Equation in sequential and parallel/distributed forms, which modifies and proposes a parallelization for the Schur method of [1], is presented. To transform the symplectic/Hamiltonian matrix into a simple form, Elementary Stabilized Similarity Transformations are utilized. A sequential implementation of the proposed algorithm for dense matrices is made, and a parallel implementation on a distributed memory system with an asynchronous parallelization strategy over a network of workstations is proposed.
Article
Full-text available
Synchronisation mechanisms are essential in distributed simulation. Some systems rely on central units to control the simulation, but central units are known to be bottlenecks. If we want to avoid using a central unit to optimise the simulation speed, we lose the capacity to act on the simulation at a global scale. Being able to act on the entire simulation is an important feature which allows a distributed simulation to be dynamically load-balanced. While some local partitioning algorithms exist, their lack of a global view reduces their efficiency. Running a global partitioning algorithm without a central unit requires a synchronisation of all logical processes (LPs) at the same step. Two algorithms are proposed: the first requires the knowledge of some topological properties of the network, while the second works without any requirement. The algorithms are detailed and compared against each other. An evaluation shows the benefits of using global dynamic load-balancing for distributed simulations.
Conference Paper
In this paper, we propose simple protocols for enabling two communicating agents that may have never met before to extract common knowledge out of any initial knowledge that each of them possesses. The initial knowledge from which the agents start may even be independent of each other, implying that the two agents need not have had previous access to common information sources. In addition, the common knowledge extracted upon the termination of the protocols depends, in a fair way, on the (possibly independent) information items initially known, separately, by the two agents. It is fair in the sense that there is a negotiation between the two agents instead of one agent forcing the other to conform to its own knowledge. These protocols may be extended in order to support security applications where the establishment of common knowledge is required. Moreover, the implementation of the protocols leads to reasonably small code that can also fit within resource-limited devices involved in any communication network while, at the same time, being efficient, as simulation results demonstrate.
Article
This paper compares the performance and scalability of SHMEM and MPI-2 one-sided routines on different communication patterns for a SGI Origin 2000 and a Cray T3E-600. The communication tests were chosen to represent commonly used communication patterns with low contention (accessing distant messages, a circular right shift, a binary tree broadcast) to communication patterns with high contention (a 'naive' broadcast and an all-to-all). For all the tests and for small message sizes, the SHMEM implementation significantly outperformed the MPI-2 implementation for both the SGI Origin 2000 and Cray T3E-600. Copyright © 2004 John Wiley & Sons, Ltd.
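A hedged sketch of the kind of MPI-2 one-sided access pattern the paper compares against SHMEM (a single remote put between fences; not the paper's benchmark code):

```c
/* Each rank exposes one int through an RMA window; rank 0 writes into
 * rank 1's window without rank 1 posting a receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int window_buf = -1;
    MPI_Win win;
    MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        int value = 123;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);            /* completes the one-sided transfer */

    if (rank == 1)
        printf("rank 1 window now holds %d\n", window_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```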
Article
Moore's Law has driven the semiconductor revolution enabling over four decades of scaling in frequency, size, complexity, and power. However, the limits of physics are preventing further scaling of speed, forcing a paradigm shift towards multicore computing and parallelization. In effect, the system is taking over the role that the single CPU was playing: high-speed signals running through chips but also packages and boards connect ever more complex systems.
Article
Full-text available
Languages are being designed that simplify the tasks of creating, extending, and maintaining scientific applications specifically for use on parallel computing architectures. Widespread adoption of any language by the high performance computing (HPC) community is strongly dependent upon achieved performance of applications. A common presumption is that performance is adversely affected as the level of abstraction increases. In this paper we report on our investigations into the potential of one such language, Chapel, to deliver performance while adhering to its code development and maintenance goals. In particular, we explore how the unconstrained memory model presented by Chapel may be exploited by the compiler and runtime system in order to efficiently execute computations common to numerous scientific application programs. Experiments, executed on a Cray X1E, AMD dual-core, and Intel quad-core processor based systems, reveal that with the appropriate architecture and runtime support, the Chapel model can achieve performance equal to the best Fortran/MPI, Co-Array Fortran, and OpenMP implementations, while substantially easing the burden on the application code developer.
Chapter
The linear algebra problems are an important part of many algorithms, such as numerical solution of PDE systems. In fact, up to 80% or even more of computing time in this kind of algorithms is spent for linear algebra tasks. The parallelization of such solvers is the key for parallelization of many advanced algorithms. The mathematical objects library ParSol not only implements some important linear algebra objects in C++, but also allows for semiautomatic parallelization of data parallel and linear algebra algorithms, similar to High Performance Fortran (HPF). ParSol library is applied to implement the finite difference scheme used to solve numerically a system of PDEs describing a nonlinear interaction of two counterpropagating laser waves. Results of computational experiments are presented.
Conference Paper
The task of articulating some computer programs aimed at calculating reaction probabilities and reactive cross sections of elementary atom diatom reactions as concurrent computational processes is discussed. Various parallelization issues concerned with memory and CPU requirements of the different parallel models when applied to two classes of approach to the integration of the Schrödinger equation describing atom diatom elementary reactions are addressed. Particular attention is given to the importance of computational granularity for the choice of the model.
Conference Paper
PHCpack implements numerical algorithms for solving polynomial systems using homotopy continuation methods. In this paper we describe two types of interfaces to PHCpack. The first interface PHCmaple originally follows OpenXM, in the sense that the program (in our case Maple) that uses PHCpack needs only the executable version phc built by the package PHCpack. Following the recent development of PHCpack, PHCmaple has been extended with functions that deal with singular polynomial systems, in particular, the deflation procedures that guarantee the ability to refine approximations to an isolated solution even if it is multiple. The second interface to PHCpack was developed in conjunction with MPI (Message Passing Interface), needed to run the path trackers on parallel machines. This interface gives access to the functionality of PHCpack as a conventional software library.
Conference Paper
Full-text available
The Cell Broadband Engine™ is a heterogeneous multi-core architecture developed by IBM, Sony and Toshiba. It has eight computation intensive cores (SPEs) with a small local memory, and a single PowerPC core. The SPEs have a total peak single precision performance of 204.8 Gflops/s, and 14.64 Gflops/s in double precision. Therefore, the Cell has a good potential for high performance computing. But the unconventional architecture makes it difficult to program. We propose an implementation of the core features of MPI as a solution to this problem. This can enable a large class of existing applications to be ported to the Cell. Our MPI implementation attains bandwidth up to 6.01 GB/s, and latency as small as 0.41 μs. The significance of our work is in demonstrating the effectiveness of intra-Cell MPI, consequently enabling the porting of MPI applications to the Cell with minimal effort.