Article

How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, 2nd Printing

... Extraordinary technological improvements over the past few years in areas such as microprocessors, memory, buses, networks, and software have made it possible to assemble groups of inexpensive personal computers and/or workstations into a cost-effective system that functions in concert and possesses tremendous processing power. Cluster computing is not new, but in company with other technical capabilities, particularly in the area of networking, this class of machines is becoming a high-performance platform for parallel and distributed applications [1,2,6,7,8,9,10,11,12]. ...
... The concept of Beowulf clusters originated at the Center of Excellence in Space Data and Information Sciences (CESDIS), located at the NASA Goddard Space Flight Center in Maryland [6]. The goal of building a Beowulf cluster is to create a cost-effective parallel computing system from commodity components to satisfy specific computational requirements for the earth and space sciences community. ...
... Nodes are connected using Fast Ethernet with a maximum bandwidth of 300 Mbit/s, through three 24-port switches with a channel bonding technique. Channel bonding is a method in which the data in each message is striped across the multiple network cards installed in each machine [1,2,6]. The THPTB is operated as a single system to share networking, file servers, and other peripherals. ...
Article
The use of supercomputers for high-performance computing has been growing. Supercomputers, which are single large and expensive machines with shared memory and one or more processors, meet professional needs. What users expect, however, is a large-scale processing and storage system that provides high bandwidth at low cost. A cluster is a collection of independent and cheap machines, used together as a supercomputer to provide a solution. In this paper, an SMP-based PC cluster (36 processors), called THPTB (TungHai Parallel TestBed), with a channel bonding technique, is proposed and built in CSIE. The system architecture and benchmark performance of the cluster are also presented in this paper. To take advantage of the parallelism of SMP cluster systems using message-passing libraries, the HPL benchmark is used to demonstrate the performance. The experimental results show that our cluster can obtain 17.38 GFlops/s with channel bonding when the total number of processors used is 36.
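For orientation, the sustained rate reported above can be expressed per processor and related to the usual HPL quantities; the definitions below are the standard ones, and since the cluster's theoretical peak is not stated in the abstract, no efficiency figure is claimed here:

\[
\frac{R_{\max}}{N_{\mathrm{proc}}} = \frac{17.38\ \mathrm{GFlop/s}}{36} \approx 0.48\ \mathrm{GFlop/s\ per\ processor},
\qquad
E = \frac{R_{\max}}{R_{\mathrm{peak}}},
\qquad
R_{\mathrm{peak}} = N_{\mathrm{proc}}\, f_{\mathrm{clock}}\, n_{\mathrm{flop/cycle}}.
\]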
... In conventional approaches using the Hadoop ecosystem, the Hadoop MapReduce framework usually runs over the Hadoop Distributed File System (HDFS), which has the advantage that multiple local disks on a compute node provide better data locality [26]. However, the majority of HPC clusters [27,28] follow the traditional Beowulf architecture [29,30]. In such systems, the compute nodes are provided with a very lightweight operating system, or sometimes with a limited capacity of local storage [31]. ...
... These inconsistencies degrade the performance of MapReduce running on HPC clusters. Moreover, recent studies also confirm that MapReduce does not provide significant benefits when combined with an HPC cluster [10,30,31]. These limitations lead us to the question of whether a Lustre-based storage system can provide the local-storage capabilities that MapReduce needs in order to deliver efficient results on HPC clusters. ...
Article
The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity. It becomes computationally inefficient to analyze and select features from such a massive volume of data. Moreover, advancements in the field of Big Data applications and data science pose additional challenges, where the selection of appropriate features and of a High-Performance Computing (HPC) solution has become a key issue and has attracted attention in recent years. Keeping in view the needs above, there is a requirement for a system that can efficiently select features and analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that selects features by using the Artificial Bee Colony (ABC) algorithm. Moreover, a Kalman filter is used in the Hadoop ecosystem to remove noise. Furthermore, traditional MapReduce is combined with ABC to enhance processing efficiency. A complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed Hadoop-based ABC algorithm. To check the efficiency of the algorithms exploited in the proposed system architecture, we have implemented our system using Hadoop and MapReduce with the ABC algorithm. The ABC algorithm is used to select features, whereas MapReduce is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using the MapReduce tool on top of the Hadoop parallel nodes in near real time. Moreover, the proposed system is compared with swarm approaches and is evaluated regarding efficiency, accuracy, and throughput by using ten different data sets. The results show that the proposed system is more scalable and efficient in selecting features.
... There are several implementations of computer clusters for different operating systems and environments. For GNU/Linux there are various cluster software packages that support application clustering, such as Beowulf [1] [2], distcc [3] [4], MPICH, and Linux Virtual Server [5]. MOSIX [6] [7], openMosix [8] [9], Kerrighed [10] [11], and OpenSSI [12] are clusters implemented in the kernel that provide automatic process migration between homogeneous nodes. ...
... Matrix multiplication is one of the most commonly used algorithms in computing, so we decided to use this algorithm to micro-benchmark our HPC cluster. The product matrix multiplication algorithm takes two matrices A (of size m×p) and B (of size p×n) as input data, and produces an output matrix C = A * B, as defined in (1). It is important to note that the two input matrices can be multiplied only when the number of columns of the first matrix equals the number of rows of the second matrix, p. ...
Conference Paper
Full-text available
In almost any educational institution there are computer labs that are available to students during work hours. During off hours, however, these labs are underutilized machines that contain significant computing performance. In this paper we present the cluster infrastructure that we implemented to utilize one of the computer labs at FON University. Additionally, we implemented a simple micro-benchmark in order to present the computing capabilities of our cluster.
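The row-distributed matrix multiplication described in the excerpt above maps naturally onto MPI. The following is a minimal, hypothetical sketch (square matrices of an assumed size N, with N divisible by the number of ranks, timed with MPI_Wtime), not the benchmark code used at FON University:

/*
 * Minimal sketch of a row-distributed matrix-multiplication micro-benchmark,
 * similar in spirit to the C = A * B benchmark described above. The matrix
 * size N and the use of MPI_Scatter/MPI_Gather are illustrative assumptions.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512   /* assumed square matrices; N must be divisible by the number of ranks */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                      /* rows of A handled by each rank */
    double *A = NULL, *C = NULL;
    double *B     = malloc((size_t)N * N * sizeof(double));
    double *Apart = malloc((size_t)rows * N * sizeof(double));
    double *Cpart = malloc((size_t)rows * N * sizeof(double));

    if (rank == 0) {                          /* master fills A and B */
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    double t0 = MPI_Wtime();
    /* distribute rows of A and broadcast all of B */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Apart, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* compute the local block of C = A * B */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += Apart[i * N + k] * B[k * N + j];
            Cpart[i * N + j] = s;
        }

    MPI_Gather(Cpart, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d ranks, N=%d: %.3f s, %.2f MFLOP/s\n",
               size, N, t1 - t0, 2.0 * N * N * (double)N / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}

A typical invocation would be something like mpicc mm_bench.c -o mm_bench followed by mpirun -np 4 ./mm_bench, with the rank count matched to the available cluster nodes.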
... PC configuration is the process in which the components of a PC, such as the processor, motherboard, GPU, RAM, ROM, power supply unit, and case, are assembled into a working system. This process is the assembly stage of the system build [1]. A PC can be assembled using different configuration methods, i.e., pre-built, self-assembled, and custom-built. ...
Article
A custom-built PC is assembled to cater to specific user needs. Internationally, there are many e-stores available that allow users to build and order their systems as per their needs. In 2020, Pakistan observed a COVID-19 lockdown, and most international shipments were cancelled due to the pandemic. This real-world situation highlighted the inability of the local market to fulfill the demands of custom-built PC users. In this paper, we propose and design a generic architecture for an e-store (Nice PC Maker (NPM)) that offers users a custom-build PC option along with other features such as pre-built system purchasing and PC component purchasing. WooCommerce product builders (WordPress) are used to implement the architectural design. We also configured this generic architecture for a local client, which helps them increase their business and allows users to get their desired systems.
... One of the parallel hardware topics covered in the course was the Beowulf cluster, created by Thomas Sterling and Donald Becker at NASA [13]. Their Beowulf cluster was a dedicated multiprocessor workstation, built from commodity off-the-shelf PC hardware, connected with a standard network (e.g., Ethernet), and running open source software (e.g., Linux, MPI, etc.). ...
Article
Much has changed about parallel and distributed computing (PDC) since the author began teaching the topic in the late 1990s. This paper reviews some of the key changes to the field and describes their impacts on his work as a PDC educator. Such changes include: the availability of free implementations of the message passing interface (MPI) for distributed-memory multiprocessors; the development of the Beowulf cluster; the advent of multicore architectures; the development of free multithreading languages and libraries such as OpenMP; the availability of (relatively) inexpensive manycore accelerator devices (e.g., GPUs); the availability of free software platforms like CUDA, OpenACC, OpenCL, and OpenMP for using accelerators; the development of inexpensive single board computers (SBCs) like the Raspberry Pi, and other changes. The paper details the evolution of PDC education at the author's institution in response to these changes, including curriculum changes, seven different Beowulf cluster designs, and the development of pedagogical tools and techniques specifically for PDC education. The paper also surveys many of the hardware and software infrastructure options available to PDC educators, provides a strategy for choosing among them, and provides practical advice for PDC pedagogy. Through these discussions, the reader may see how much PDC education has changed over the past two decades, identify some areas of PDC that have remained stable during this same time period, and so gain new insight into how to efficiently invest one's time as a PDC educator.
... One of the parallel hardware topics covered in the course was the Beowulf cluster, created by Thomas Sterling and Donald Becker at NASA [10]. A Beowulf cluster is a dedicated multiprocessor built from commodity off-the-shelf PCs, connected with standard network fabric (e.g., Ethernet), and running open source software (e.g., Linux, MPI, etc.). ...
... The recent drastic reduction in computer hardware costs coupled with explosive increases in computing capabilities motivate the development of a Cluster computing architecture that is customized to the problem being addressed and can approach supercomputing capabilities. We propose the use of a Beowulf Cluster (Sterling, et al., 1999) that has several advantages in the operational context of on-line systems. ...
Article
Recent advances in information technology provide opportunities for secure, efficient and economical data transfer vis-à-vis real time operations for large-scale traffic systems equipped with sensor systems. This paper describes an Internet based control architecture that provides an automated mechanism for real-time route guidance using a high performance computing environment located at a remote site. A distributed computing paradigm called the Beowulf Cluster is proposed to achieve a high performance computing environment for the on-line architecture.
... Although the idea of using clusters to improve system reliability dates back to the 60s [23], its usage to improve system performance is more recent. The Beowulf project [3] built a $40,000 cluster with 16 personal computers that was able to reach 1 GFLOPs. Since then, computer clusters have become a low cost alternative to build high-performance systems. ...
Article
Computer clusters are today a cost-effective way of providing either high-performance and/or high-availability. The flexibility of their configuration aims to fit the needs of multiple environments, from small servers to SME and large Internet servers. For these reasons, their usage has expanded not only in academia but also in many companies. However, each environment needs a different “cluster flavour”. High-performance and high-throughput computing are required in universities and research centres while high-performance service and high-availability are usually reserved to use in companies. Despite this fact, most university cluster computing courses continue to cover only high-performance computing, usually ignoring other possibilities. In this paper, a master-level course which attempts to fill this gap is discussed.
... In this work, we develop a hybrid programming model to efficiently run the FFEM on a cluster of PCs with multi-core technology. Clusters of PCs are basically distributed-memory machines that have gained a lot of popularity in the scientific community due to their relatively high computational capacity at low cost [15]. Taking advantage of multi-core technology, a new mainstream in the processor industry [6], a cluster with a certain computational capacity can nowadays be built with fewer processors. ...
Article
Full-text available
It is well known that probabilistic and non-probabilistic methods for uncertainty analysis require great computational capacity to perform hundreds or even thousands of simulations. In this work we develop a hybrid computational model, a combination of shared-memory and distributed-memory models, to take advantage of the newest multi-core processor technology that is becoming more frequently used in clusters of PCs. To evaluate the model, numerical experiments are conducted using Intel Core2Duo processors.
... More recently, however, computer scientists have begun to take advantage of the phenomenal increase in the computing power of personal computers (PCs) to turn them into networked parallel computing machines. Such machines are known as 'PC clusters' or 'Beowulfs', and often use standard computer components in order to reduce the cost of construction (Taubes 1996; Sterling et al. 1999). The advantage of these machines is that they make high-resolution 3-D computation possible at low cost; the disadvantage is that existing computational algorithms must be explicitly modified, because each PC has its own processor and its own memory. ...
Article
We validate the spectral-element method for seismic wave propagation in spherically-symmetric global Earth models. We also include the full complexity of 3-D Earth models, i.e., lateral variations in compressional-wave velocity, shear-wave velocity and density, a 3-D crustal model, ellipticity, as well as topography and bathymetry. We also include the effects of the oceans, rotation, and self-gravitation. For the oceans we introduce a formulation based upon an equivalent load, in which the oceans do not need to be meshed explicitly. Some of these effects, which are often considered negligible in global seismology, can in fact play a significant role for certain source-receiver configurations. Anisotropy and attenuation are also incorporated in this study. The complex phenomena that are taken into account are introduced in such a way that we preserve the main advantages of the spectral-element method, which are an exactly diagonal mass matrix and very high computational efficiency on parallel computers. For self-gravitation and the oceans we benchmark the spectral-element synthetic seismograms against normal-mode synthetic seismograms for spherically-symmetric reference model PREM. The two methods are in excellent agreement for all body- and surface-wave arrivals with periods greater than about 20 s in the case of self-gravitation and 25 s in the case of the oceans. We subsequently present results of simulations for a real earthquake in a fully 3-D Earth model for which the fit to the data is significantly improved compared to classical normal-mode calculations based upon PREM.
... The Beowulf project [168,167] consisted of building a cluster of PCs running LINUX, interconnected by an Ethernet network and using the standard UNIX communication protocols. To increase performance, several network interfaces are used simultaneously. ...
Article
Full-text available
Recent work in computer vision has produced complex algorithms and applications that require more and more computing power, despite the fact that they need to run at real-time frequency (usually around 25 to 30 images per second). This thesis presents a novel software solution for developing such applications and running them in real time on cluster machines. To do so, we developed efficient, high-level object-oriented libraries while making the usual parallel programming models easier to use for computer vision developers. The main difficulty is to find a good trade-off between efficiency and readability. We used advanced software engineering techniques to implement two object-oriented libraries: E.V.E., which provides a MATLAB®-like interface to take care of SIMD parallelism, and QUAFF, which is an object-oriented implementation of the parallel algorithmic skeletons model. This work has been validated with two realistic computer vision applications - a real-time 3D reconstruction and a particle-filter-based pedestrian tracker - that were developed and run on a cluster - the BABYLON machine - using two communication buses and fourteen computing nodes with two processors each, exhibiting three levels of parallelism: MIMD, SMP and SIMD. On this architecture, speed-ups of 30 to 100, compared to a basic implementation, have been measured.
... With arrangements like a PC cluster, running Linux and using software like MPI or PVM, we have an alternative special-purpose approach that is essentially software based [91,92,93]. There are two key reasons why this kind of approach has become so prevalent: first, a PC cluster requires modest hardware investment and can be expected to run in dedicated mode; secondly, since the software used is a de facto standard, programs can be expected to have a longer lifetime, and the methods and algorithms used in the programs can be gradually evolved for best performance. ...
... When supporting design work based on the FEM, and when handling an unsteady or nonlinear problem in particular, computation takes much time even when GA is used for optimization. To solve such a problem, grid computing with multiple computers connected to one another via a network [5] has recently been gathering attention. As an example, educational personal computers in a network were used for auto crash analysis while they were not in operation during the night, in a joint study by Hiroshima University, MAZDA Motor Corporation and FUJITSU Corporation [6]. ...
Conference Paper
Full-text available
When seeking optimal parameters by numerical analysis in performance-based design, the number of combinations may increase to an extreme level, or objective functions or restrictions may not be accurately formulated. In such cases, genetic algorithms (GAs) are sometimes used. For certain problems, however, computation takes much time or no effective solutions are obtained even with the use of GAs. To compensate for such shortcomings, this study examines the applicability of the island genetic algorithm (island-GA), a type of distributed genetic algorithm, to design problems. Using a distributed GA is expected to lead to efficient application of grid computing. The time of computation may then be reduced and networks of computers may be used effectively. The effectiveness of island-GA is verified in optimal impact-resistance design for reinforced concrete slabs as an example of a design problem.

1 Introduction
In structural design, the finite element method (FEM) and other numerical analysis techniques have been used widely, and design proposals based on such analysis methods are examined on a daily basis. In engineering design, however, numerous parameters exist and a wide variety of restrictions should be taken into consideration. Using numerical analysis techniques including the FEM for checking design proposals therefore involves handling a large number of combinations of design parameters, and obtaining solutions within a practical time frame may sometimes be difficult. One effective solution to such a problem is the genetic algorithm (GA), which simulates the hereditary and evolutionary processes of species on computers. GA has been applied to
... Computer architecture based on a uniprocessor, whose performance mainly depends on its clock speed, has shifted to multiprocessor forms. Also, as commodity computers with powerful microprocessors became available, relatively inexpensive machines such as personal computers and workstations came to be exploited for high-performance computing by combining these resources cooperatively through software to create the computing power required for computation-intensive tasks; for example, the Beowulf clustering system [101] created the computing power needed by connecting personal computers via widely available networking technology, running one of several open-source operating systems such as Linux. Nowadays, high-performance computing seems to continue to favor such integration of existing computing resources, rather than the development of new, faster processors, influenced by the fact that electronic processing speeds have begun to approach limitations imposed by the laws of physics. ...
... We relied primarily on [Swendson 2004], with the following distinguishing features/differences: 1. We opted for SUSE Linux [OpenSUSE 2005]. At the time of writing, the Warthog runs SUSE Linux release 9.3. ...
Article
Full-text available
Next generation volunteer-based distributed computing projects are working to embrace a wide range of distributed computing environments. In this paper we report on our early experiences with the ChessBrain II project, an established collaboration between researchers in a number of countries, investigating the feasibility of inhomogeneous speed-critical distributed computation.
... The Network File System (CALLAGHAN; PAWLOWSKI; STAUBACH, 1995), or NFS, is the de facto standard for distributed file sharing in the Unix world, and consequently has been naturally absorbed by the Beowulf cluster model since its first steps (STERLING, 1999). NFS was developed by Sun Microsystems and first came out in 1985 with the release of SunOS 2.0. ...
Article
Full-text available
This paper presents a comparison of current file systems with optimised performance dedicated to cluster computing. It presents the main features of three such systems — xFS, PVFS and NFSP — and establishes comparisons between them, in relation to standard NFS, in terms of performance, fault tolerance and adequacy to the Beowulf model.
... The machine was an instant success, and their idea of providing COTS 1 base systems to satisfy specific computational requirements quickly spread through NASA and into the academic and research communities. The development effort for this first machine quickly grew into what we now call the Beowulf Project [41]. Some of the major accomplishments of the Beowulf Project will be chronicled below, but a non-technical measure of success is the observation that researchers within the High Performance Computing community are now referring to such machines as "Beowulf Class Cluster Computers". ...
... This was when laboratories started to build computational clusters from commodity components. The idea and phenomenal success of the Beowulf cluster [5] shows that scientists (i) prefer to have a solution that is under their direct control, (ii) are quite willing to use existing proven and successful templates, and (iii) generally want a 'do-it-yourself' inexpensive solution. As an alternative to 'building your own cluster', bringing computations to free computer resources became a successful paradigm, Grid Computing [6]. ...
... With this architecture, the available number of processors was boosted to more than 100, resulting in a very large improvement in the price performance of supercomputers. However, this move was counteracted by the birth of "Beowulf-class" machines [2,10], which are really simple clusters of commodity PCs with Intel x86 CPUs connected by standard Ethernet. This architecture typically offers one order of magnitude better price performance than other computer architectures, simply because the production cost is low due to mass production. ...
Article
We overview our GRAvity PipE (GRAPE) project to develop special-purpose computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation of gravitational interaction between particles to a general-purpose programmable computer. By this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. Our newest machine, GRAPE-6, achieved the peak speed of , and sustained performance of , for the total budget of about 4 million USD. We also discuss relative advantages of special-purpose and general-purpose computers and the future of high-performance computing for science and technology.
... Especially for some companies, PC clusters can be used to replace mainframe systems or supercomputers and save much hardware cost. In terms of efficiency and cost, the use of parallel software and cluster systems is a good approach, and it will become more and more popular [11] in the near future. As we know, bioinformatics tools can speed up the analysis of large-scale sequence data, especially for sequence alignment. ...
Article
In addition to traditional massively parallel computers, distributed workstation clusters now play an important role in scientific computing, largely due to the advent of commodity high-performance processors, low-latency/high-bandwidth networks and powerful development tools. As we know, bioinformatics tools can speed up the analysis of large-scale sequence data, especially for sequence alignment. To fully utilize the relatively inexpensive CPU cycles available to today's scientists, a PC cluster consisting of one master node and seven slave nodes (16 processors in total) is proposed and built for bioinformatics applications. We use mpiBLAST and HMMer on this parallel computer to speed up the process of sequence alignment. The mpiBLAST software uses a message-passing library called MPI (Message Passing Interface) and the HMMer software uses a software package called PVM (Parallel Virtual Machine), respectively. The system architecture and performance of the cluster are also presented in this paper.
... In order to measure the performance of our cluster, the parallel ray-tracing problem is illustrated and the experimental result is demonstrated on our Linux SMP cluster. The experimental results show that the highest speedup is 15.22 for PVMPOV [5,7] when the total number of processors is 16 on the SMP cluster. Also, the LU benchmark from NPB is used to demonstrate the performance of our cluster, tested using the LAM/MPI library [4]. ...
Article
This document describes how to set up a diskless Linux box. As technology advances rapidly, network cards are becoming cheaper and much faster - 100 Mbit/s Ethernet is common now, and in about a year 1000 Mbit/s, i.e., Gigabit Ethernet cards will become standard. With high-speed network cards, remote access will become as fast as local disk access, which will make diskless nodes a viable alternative to workstations on a local LAN. Diskless nodes also eliminate the cost of software upgrades and system administration costs such as backup and recovery, which are centralized on the server side. Diskless nodes also enable "sharing/optimization" of centralized server CPU, memory, hard-disk, tape and CDROM resources. Diskless nodes provide mobility for users, i.e., users can log on from any of the diskless nodes and are not tied to one workstation. In this paper, an SMP-based PC cluster consisting of one master node and eight diskless slave nodes (16 processors) is proposed and built. The system architecture and benchmark performance of the cluster are also presented in this paper.
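The speedup quoted in the excerpt above (15.22 for PVMPOV on 16 processors) corresponds to the usual definitions of speedup and parallel efficiency; the reported numbers imply an efficiency of roughly 95%:

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad E(16) \approx \frac{15.22}{16} \approx 0.95.
\]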
... From the mid-1990s onward, with the emergence of high-performance communication networks such as Myrinet (BODEN et al., 1995) and SCI (INSTITUTE OF ELECTRICAL AND ELECTRONIC ENGINEERS, 1992; HELLWAGNER; REINEFELD, 1999), many teaching and research institutions began to use platforms based on ordinary computers connected by these networks. These platforms were called clusters, and their use became known worldwide as cluster computing (STERLING et al., 1999; BUYYA, 1999a,b; STERLING, 2002). The use of clusters drastically changed the landscape of parallel programming, since the multiprogramming and communication libraries used until then had to be adapted to protocols specific to these high-performance networks. ...
... If we can utilize PCs interconnected by Ethernet/Fast Ethernet for distributed computing, the networked environment is called a PC cluster (Sterling, Salmon, Becker & Savarese, 1999). To implement parallel programming in a PC cluster, the major issue is the distribution of information among the PCs. ...
Article
In a competitive environment, how to fully and efficiently utilize computational resources while keeping cost in mind has become an important concern for most practitioners. In recent years, the PC cluster architecture has been applied to many applications, especially those demanding computational resources. In this study, we introduce a case study from Taiwan that addresses this issue using a PC cluster architecture with MPI techniques. This case is an application of clustering techniques to DNA analysis. Not only was the hardware/software architecture of the PC cluster constructed, but a clustering algorithm based on this cluster architecture was also proposed in this study. The soundness of the proposed architecture and algorithm is demonstrated in this study.
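As a concrete illustration of "the distribution of information among the PCs" mentioned in the excerpt above, the sketch below distributes chunks of data from a master rank to worker ranks and collects partial results with MPI point-to-point calls. It is a generic, hypothetical example (the chunk size, tags, and the summing "work" are placeholders), not the DNA-clustering code of the study:

/*
 * Minimal master/worker sketch of distributing work among the PCs of a
 * cluster with MPI point-to-point messages. The "work" (summing a chunk of
 * an array) is a placeholder for real processing.
 */
#include <mpi.h>
#include <stdio.h>

#define CHUNK 1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double chunk[CHUNK];

    if (rank == 0) {
        /* master: send one chunk of data to every worker ... */
        for (int w = 1; w < size; w++) {
            for (int i = 0; i < CHUNK; i++)
                chunk[i] = w + i * 0.001;            /* stand-in for real data */
            MPI_Send(chunk, CHUNK, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
        }
        /* ... then collect one partial result from each of them */
        for (int w = 1; w < size; w++) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, w, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("worker %d returned %f\n", w, partial);
        }
    } else {
        /* worker: receive its chunk, process it, return the result */
        double sum = 0.0;
        MPI_Recv(chunk, CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < CHUNK; i++)
            sum += chunk[i];
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}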
Article
Full-text available
The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity. It becomes computationally inefficient to analyze such a massive volume of data. Moreover, advancements in the field of Big Data applications and data science pose additional challenges, where a High-Performance Computing solution has become a key issue and has attracted attention in recent years. However, these systems are either memoryless or computationally inefficient. Therefore, keeping in view the aforementioned needs, there is a requirement for a system that can efficiently analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that enhances the working of traditional MapReduce by incorporating a parallel processing algorithm. Moreover, a complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm. The proposed system architecture supports both read and write operations that enhance the efficiency of Input/Output operations. To check the efficiency of the algorithms exploited in the proposed system architecture, we have implemented our system using Hadoop and MapReduce. MapReduce is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using the MapReduce tool on top of the Hadoop parallel nodes to generate and process graphs in near real time. Moreover, the system is evaluated in terms of efficiency by considering the system throughput and processing time. The results show that the proposed system is more scalable and efficient.
Article
With high-performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study on different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters by utilizing Lustre as the additional storage provider for intermediate data. With a deployment architecture where both local disks and Lustre are utilized for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage during runtime by on-line profiling. Our results indicate that we can achieve a 44 percent performance benefit for shuffle-intensive workloads in leadership-class HPC systems. Our priority directory selection scheme can improve the job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs. To the best of our knowledge, this is the first such comprehensive study for YARN MapReduce with Lustre and RDMA.
Conference Paper
In the last decades, supercomputers have become a necessity in science and industry. Huge data centers consume enormous amounts of electricity, and we are at a point where newer, faster computers must no longer drain more power than their predecessors. The fact that user demand for compute capabilities has not declined in any way has led to studies of the feasibility of exaflop systems. Heterogeneous clusters with highly-efficient accelerators such as GPUs are one approach to higher efficiency. We present the new L-CSC cluster, a commodity hardware compute cluster dedicated to Lattice QCD simulations at the GSI research facility. L-CSC features a multi-GPU design with four FirePro S9150 GPUs per node, providing 320 GB/s memory bandwidth and 2.6 TFLOPS peak performance each. The high bandwidth makes it ideally suited for memory-bound LQCD computations, while the multi-GPU design ensures superior power efficiency. The November 2014 Green500 list named L-CSC the most power-efficient supercomputer in the world, with 5270 MFLOPS/W in the Linpack benchmark. This paper presents optimizations to our Linpack implementation HPL-GPU and other power efficiency improvements which helped L-CSC reach this benchmark. It describes our approach for an accurate Green500 power measurement and unveils some problems with the current measurement methodology. Finally, it gives an overview of the Lattice QCD application on L-CSC.
Conference Paper
This paper reviews the technical and social events that stimulated early deployments of large-scale Beowulf-style clusters for production scientific and engineering use at the National Center for Supercomputing Applications (NCSA) and the subsequent development of the NSF TeraGrid. Insights and lessons from these experiences have shaped further development of high-performance computing environments and exposed a set of research challenges for creation of exascale computing systems.
Article
Introduction; Interconnection Networks; Optical Interconnection Networks: Technologies, Architectures, and Systems; Case Studies; Comparison of Basic Approaches; Prospects for Realization (Barriers, Technology, or Economic); Conclusion and Areas to Watch; Glossary; Cross References; References; Further Reading
Conference Paper
Recently, MapReduce is getting deployed over many High Performance Computing (HPC) clusters. Different studies reveal that by leveraging the benefits of high-performance interconnects like InfiniBand in these clusters, faster MapReduce job execution can be obtained by using additional performance enhancing features. Although RDMA-enhanced MapReduce has been proven to provide faster solutions over Hadoop distributed file system, efficiencies over parallel file systems used in HPC clusters are yet to be discovered. In this paper, we present a complete methodology for evaluating MapReduce over Lustre file system to provide insights about the interactions of different system components in HPC clusters. Our performance evaluation shows that RDMA-enhanced MapReduce can achieve significant benefits in terms of execution time (49% in a 128-node HPC cluster) and resource utilization, compared to the default architecture. To the best of our knowledge, this is the first attempt to evaluate RDMA-enhanced MapReduce over Lustre file system on HPC clusters.
Conference Paper
HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to the tri-replicated data blocks and cannot efficiently utilize the available storage devices in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g. RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design integrated with parallel file system (e.g. Lustre) can lead to significant storage space savings and guarantee fault-tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the Cloudburst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA over both default HDFS and Lustre.
Article
Nonuniformity is a common characteristic of contemporary computer systems, mainly because of physical distances in computer designs. In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some nonuniform memory access (NUMA) architectures, depending on if the memory is close to the requesting processor or not. Much research has been devoted to optimizing such systems. This thesis identifies another important property of computer designs, nonuniform communication architecture (NUCA). High-end hardware-coherent machines built from a few large nodes or from chip multiprocessors, are typical NUCA systems that have a lower penalty for reading recently written data from a neighbor’s cache than from a remote cache. The first part of the thesis identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations that exploit NUCAs are presented and investigated. This type of lock is shown to be almost twice as fast
Article
Full-text available
The growth of computer clusters, and in particular of multicluster systems, increases the number of potential points of failure, requiring the use of fault-tolerance schemes that provide the ability to finish the processing. The general objective of fault-tolerance systems is that the total work be executed correctly even when some element of the system fails, losing as little completed work as possible, bearing in mind that performance decreases because of the overhead introduced to tolerate faults and the loss of part of the system. This thesis presents a fault-tolerance model for geographically distributed computer clusters using Data Replication, called FTDR (Fault Tolerant Data Replication). It is based on an initial replication of the processes and dynamic data replication during execution, with the aim of preserving the critical results. It is oriented toward applications with a Master/Worker execution model and runs transparently to the user. The fault-tolerance system designed is configurable and meets the scalability requirement. A functional model has been designed and a Middleware implemented. A methodology is proposed for incorporating it into the design of parallel applications. The model is based on detecting faults in any of the functional elements of the system (compute nodes and interconnection networks) and tolerating these faults through the replication of programs and data, guaranteeing the completion of the work and preserving most of the computation performed before the failure; to this end, when a failure occurs, it is necessary to recover the consistency of the system and reconfigure the multicluster transparently to the user. The Middleware developed to incorporate fault tolerance into the multicluster environment yields a more reliable system without adding extra hardware resources: starting from the unreliable elements of the cluster, it protects the computation performed by the application against failures, so that if one computer fails another takes over its work, and the computation already performed is protected by Data Replication. This Middleware can be configured to support more than one simultaneous failure and to select a centralized or distributed scheme; parameters can also be configured relating to aspects that trade the overhead introduced against how much completed computation may be lost. To validate the system, a fault-injection system has been designed. Although adding fault-tolerance functionality implies a loss of performance, it has been verified experimentally that the overhead introduced by this system in the absence of failures is below 3%, and that when a failure occurs after some execution time, the runtime when tolerating the failure is better than relaunching the application.
Conference Paper
A new approach to computing a low-dimensional invariant manifold of chemical reaction mechanisms is presented in this paper. An n-dimensional chemical reaction state space described by a system of autonomous differential equations dc/dt = f(c) can be described by a lower, m-dimensional space. The reduced-dimensional space is developed by constructing an invariant manifold of the system. For a nonlinear system, the various low-dimensional manifolds can be constructed by linearizing the governing differential equations about a fixed point c* where f(c*) = 0. At this point, the Jacobian matrix of the resulting linear system is efficiently computed using complex variables and analyzed in order to identify the various invariant manifolds. Construction of the reduced reaction spaces proceeds by transformation of the system species variables c to new variables z. The resulting system of new ODEs expressed in terms of the new variables z is uncoupled in the linear terms, but coupled in the nonlinear terms. Next, a functional relationship between the (n-m) new fast variables and the m new slow variables of the required manifold is assumed. When this functional relationship is substituted into the transformed equations (which are in terms of the new variables z), a set of partial differential equations (PDEs) results (essentially a projection of the n-dimensional state space onto the reduced m-dimensional invariant manifold). In this paper, details of this dynamical-systems approach to constructing reduced chemical kinetic mechanisms are described. The Genetic Algorithm (GA) is used to optimize a Neural Network model which is then used to construct the invariant manifold. This approach may be used for the reduction of large-scale systems involving hundreds of species. © 2001 by the American Institute of Aeronautics and Astronautics. All rights reserved.
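A compact restatement of the construction described above, in notation chosen here for illustration (the paper's own symbols and details may differ):

\[
\frac{dc}{dt} = f(c), \qquad f(c^*) = 0, \qquad J = \left.\frac{\partial f}{\partial c}\right|_{c = c^*}, \qquad z = T^{-1}(c - c^*),
\]
\[
\frac{dz}{dt} = \Lambda z + g(z), \qquad \Lambda = T^{-1} J T,
\]
where T (block-)diagonalizes J so that the linear terms decouple and g collects the coupled nonlinear terms. Splitting z into m slow variables z_s and (n-m) fast variables z_f and assuming z_f = h(z_s) on the manifold, the invariance condition

\[
\frac{\partial h}{\partial z_s}\,\dot{z}_s\bigl(z_s, h(z_s)\bigr) = \dot{z}_f\bigl(z_s, h(z_s)\bigr)
\]
is the set of partial differential equations whose solution h defines the m-dimensional invariant manifold.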
Conference Paper
We present a comparative study of the parallel performance of a discontinuous Galerkin compressible flow solver based on five different numerical fluxes on distributed-memory parallel computers, specifically a compute cluster and two many-core machines. Compute clusters, composed of commodity hardware components, are recognized as the most cost-effective parallel computers, while many-core processors for personal computers are quite common nowadays, offering another parallel computing platform. The parallel flow solver uses a discontinuous Galerkin (DG) method based on a Taylor series basis for the compressible Euler equations. The solution is marched explicitly in time using a three-stage, third-order Runge-Kutta scheme to attain a steady state. The performance of the parallel solver, in terms of speedup, for five different numerical fluxes is computed on the said parallel computers. The parallel code follows the computational domain partitioning strategy and uses the MPI (Message Passing Interface) library for parallelization. A number of strategies for making the parallel solution communication-efficient are also discussed.
Article
We overview our GRAPE (GRAvity PipE) and GRAPE-DR project to develop dedicated computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation of gravitational interaction between particles to a general-purpose programmable computer. By this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. GRAPE-6, completed in 2002, achieved the peak speed of 64 Tflops. The next machine, GRAPE-DR, will have a peak speed of 2 Pflops and will be completed in 2008. We discuss the physics of stellar systems, the evolution of general-purpose high-performance computers, our GRAPE and GRAPE-DR projects, and issues of numerical algorithms.
Article
The goal of this study is to implement a graphical display of 3-dimensional underwater appearance, a fishery measurement information display, sonar data representation and display, and 3-dimensional underwater animation, based on coefficient data of chaotic behavior and fishing-gear modeling from a PC cluster system. In order to accomplish these goals, it is essential to compose the user interface and realistic descriptions of image scenes in the towing-net fishery simulator, and techniques to describe sand cloud effects under water using particle systems are necessary. In this study, we implemented the graphical representations and animations of the simulator by using OpenGL together with C routines.
Article
Full-text available
Parallel computing has emerged as a necessity in the modelling of the atmosphere and ocean. In the accompanying note, a computational model is designed and experiments are carried out for communication overheads, including latency, in relation to InfiniBand and customized Floswitch configurations. The data clearly demonstrate that Ethernet-based communication fares poorly, and the implication is that for such systems to be competitive for the tightly coupled class of problems, parallelization strategies where computation and communication overlap need to be looked into.

Introduction
Parallel computing has become indispensable in solving large-scale computing problems. The literature on parallel computing is very extensive, and references 1–9 present a cross section of it. It is seen that as the power of microprocessor-based CPUs keeps growing, the tendency to put them together for building bigger and bigger parallel computers does not decline; this gave rise to the development of a host of interconnection networks ranging from shared-memory devices to crossbar switches, Ethernet, and InfiniBand-type connectivity. The supporting software also grew in functionality and ease of operation. MPI 10
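Communication-overhead studies of the kind described above usually start from a point-to-point ping-pong measurement. The sketch below shows the common pattern for estimating one-way latency between two ranks; the message size and repetition count are illustrative assumptions, not values from the paper:

/*
 * Minimal two-rank ping-pong sketch of the kind commonly used to measure
 * point-to-point latency on a cluster interconnect. Run with at least two
 * MPI ranks; ranks beyond the first two simply idle at the barrier.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    const int bytes = 8;                /* small message to expose latency */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* one-way latency is half the average round-trip time */
    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / reps / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}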
Conference Paper
This paper presents the case that education in the 21st Century can only measure up to national needs if technologies developed in the simulation community, further enhanced by the power of high performance computing, are harnessed to supplant traditional didactic instruction. The authors cite their professional experiences in simulation, high performance computing and pedagogical studies to support their thesis that this implementation is not only required, it is feasible, supportable and affordable. Surveying and reporting on work in computer-aided education, this paper discusses the pedagogical imperatives for group learning, risk management and "hero teacher" surrogates, all being optimally delivered with entity-level simulations of varying types. Further, experience and research are adduced to support the thesis that effective implementation of this level of simulation is enabled only by, and is largely dependent upon, high performance computing, especially by the ready utility and acceptable costs of Linux clusters.
Article
Full-text available
This paper presents the Netuno supercomputer, a large-scale cluster installed at the Federal University of Rio de Janeiro in Brazil. A detailed performance evaluation of Netuno is presented, depicting its computational and I/O performance, as well as the results for two real-world applications. Since building a high-performance cluster for running a wide range of applications is a non-trivial task, some lessons learned from assembling and operating this cluster, such as the excellent performance of the OpenMPI library and the relevance of employing an efficient parallel file system over the traditional NFS system, can be useful knowledge to support the design of new systems. Currently, Netuno is being heavily used to run large-scale simulations in the areas of ocean modeling, meteorology, engineering, physics, and geophysics.
Article
Aerodynamic problems involving moving objects have many applications, including store separation, fluid–structure interaction, takeoff and landing, and fast maneuverability. While wind tunnel and flight tests remain important, time-accurate computational fluid dynamics (CFD) offers the option of calculating these procedures from first principles. However, such computations are complicated and time consuming. Parallel computing offers a very effective way to improve our productivity in doing CFD analysis. In this article, we review recent progress made in parallel computing in this area. The store separation problem will be used to offer a physical focus and to help motivate the research effort. The chimera grid technique will be emphasized due to its flexibility and wide use in the technical community. In the chimera grid scheme, a set of independent, overlapping, structured grids is used to decompose the domain of interest. This allows the use of efficient structured grid flow solvers and associated boundary conditions, and allows for grid motion without stretching or regridding. However, these advantages are gained in exchange for the requirement to establish communication links between the overlapping grids via a process referred to as "grid assembly." Logical, incremental steps are presented in the parallel implementation of the grid assembly function. Issues related to data structure, processor-to-processor communication, parallel efficiency, and assessment of run time improvement are addressed in detail. In a practical application, the current practice allows the CPU time to be reduced from 6.5 days on a single-processor computer to about 4 h on a parallel computer.
Article
A cluster is a collection of independent and cheap machines, used together as a supercomputer to provide a solution. In traditional scalable computing clusters, there are high-powered Intel- or AMD-based PCs, each with several gigabytes of hard disk space. As far as a cluster administrator is concerned, the installer may have to install an operating system and related software for each cluster node. Here, every node (excluding the server) has no hard disk and is booted from the network. Once it has booted successfully, it works the same as a fully equipped node. We explain how to set up a diskless cluster for computing purposes. In this paper, an SMP-based PC cluster consisting of one master node and eight diskless slave nodes (16 processors) is proposed and built. The system architecture and benchmark performance of the cluster are also presented in this paper.
Article
Full-text available
Many potential inventions are never discovered because the thought processes of scientists and engineers are channeled along well-traveled paths. In contrast, the evolutionary process tends to opportunistically solve problems without considering whether the evolved solution comports with human preconceptions about whether the goal is impossible. This paper demonstrates how genetic programming can be used to automate the process of exploring queries, conjectures, and challenges concerning the existence of seemingly impossible entities. The paper suggests a way by which genetic programming can be used to automate the invention process. We illustrate the concept using a challenge posed by a leading analog electrical engineer concerning whether it is possible to design a circuit composed of only resistors and capacitors that delivers a gain of greater than one. The paper contains a circuit evolved by genetic programming that satisfies the requirement of this challenge as well as a related, more difficult challenge. The original challenge was motivated by a circuit patented in 1956 for preprocessing inputs to oscilloscopes. The paper also contains an evolved circuit satisfying (and exceeding) the original design requirements of the circuit patented in 1956. This evolved circuit is another example of a result produced by genetic programming that is competitive with a human-produced result that was considered to be creative and inventive at the time it was first discovered.
Article
Full-text available
Previously reported applications of genetic programming to the automatic synthesis of computational circuits have employed simulations based on DC sweeps. DC sweeps have the advantage of being considerably less time-consuming than time-domain simulations. However, this type of simulation does not necessarily lead to robust circuits that correctly perform the desired mathematical function over time. This paper addresses the problem of automatically synthesizing computational circuits using multiple time-domain simulations and presents results involving the synthesis of both the topology and sizing for a squaring, square root, and multiplier computational circuit and a lag circuit (from the field of control).
Article
We present the parallelization of a discontinuous Galerkin (DG) code on distributed-memory parallel computers for compressible, inviscid fluid flow computations on unstructured meshes. The parallel DG code is based on a Taylor series basis and uses the LU-SGS (Lower-Upper Symmetric Gauss-Seidel) method for solving the linear system obtained by implicit time integration. A cost-effective Ethernet-based compute cluster composed of commonly available PC components is considered as the parallel computing platform. The parallel solution is based on computational domain partitioning and the MPI (Message Passing Interface) programming paradigm. We demonstrate the scalability of the developed parallel code on a cluster using two efficient MPI communication procedures. We also demonstrate that the memory-bound nature of the parallel solution dominates over its network-bound nature.
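Domain-partitioned solvers of the kind described above typically overlap computation with communication by exchanging ghost (halo) data with non-blocking MPI calls. The sketch below illustrates that pattern for a periodic 1-D decomposition; the array sizes and ring topology are illustrative assumptions, not the partitioning used by the DG solver:

/*
 * Sketch of a non-blocking halo (ghost-cell) exchange for a 1-D ring of
 * subdomains, allowing interior work to overlap with communication.
 */
#include <mpi.h>
#include <stdio.h>

#define NLOC 1024   /* local cells per rank, plus one ghost cell on each side */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOC + 2];                     /* u[0] and u[NLOC+1] are ghosts */
    for (int i = 1; i <= NLOC; i++)
        u[i] = rank + i * 1e-6;

    int left  = (rank - 1 + size) % size;   /* periodic 1-D decomposition */
    int right = (rank + 1) % size;

    MPI_Request req[4];
    /* post receives for the ghost cells and sends of the boundary cells */
    MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[NLOC + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[NLOC],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);

    /* interior work that needs no ghost data proceeds here, overlapping the
       messages in flight (placeholder loop) */
    double interior = 0.0;
    for (int i = 2; i <= NLOC - 1; i++)
        interior += u[i];

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* boundary work that depends on the freshly received ghost cells */
    double boundary = u[0] + u[NLOC + 1];

    if (rank == 0)
        printf("rank 0: interior=%f boundary=%f\n", interior, boundary);

    MPI_Finalize();
    return 0;
}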