Article

How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, 2nd Printing

... Extraordinary technological improvements over the past few years in areas such as microprocessors, memory, buses, networks, and software have made it possible to assemble groups of inexpensive personal computers and/or workstations into a cost-effective system that functions in concert and possesses tremendous processing power. Cluster computing is not new, but in company with other technical capabilities, particularly in the area of networking, this class of machines is becoming a high-performance platform for parallel and distributed applications [1,2,6,7,8,9,10,11,12]. ...
... The concept of Beowulf clusters originated at the Center of Excellence in Space Data and Information Sciences (CESDIS), located at the NASA Goddard Space Flight Center in Maryland [6]. The goal of building a Beowulf cluster is to create a cost-effective parallel computing system from commodity components to satisfy specific computational requirements for the earth and space sciences community. ...
... Nodes are connected using Fast Ethernet with a maximum bandwidth of 300 Mbit/s, through three 24-port switches with a channel bonding technique. Channel bonding is a method in which the data in each message is striped across the multiple network cards installed in each machine [1,2,6]. The THPTB is operated as a single system to share networking, file servers, and other peripherals. ...
Article
The use of supercomputers for high-performance computing has been growing. Supercomputers, which are single large and expensive machines with shared memory and one or more processors, meet professional needs. What users expect, however, is a large-scale processing and storage system that provides high bandwidth at low cost. A cluster is a collection of independent and cheap machines, used together as a supercomputer to provide a solution. In this paper, an SMP-based PC cluster (36 processors), called THPTB (TungHai Parallel TestBed), with a channel bonding technique, is proposed and built in CSIE. The system architecture and benchmark performance of the cluster are also presented in this paper. To take advantage of the parallelism of SMP cluster systems using message-passing libraries, the HPL benchmark is used to demonstrate the performance. The experimental results show that our cluster can obtain 17.38 GFlops/s with channel bonding when the total number of processors used is 36.
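For orientation, the sustained rate reported above can be expressed per processor and related to the usual HPL quantities; the definitions below are the standard ones, and since the cluster's theoretical peak is not stated in the abstract, no efficiency figure is claimed here:

\[
\frac{R_{\max}}{N_{\mathrm{proc}}} = \frac{17.38\ \mathrm{GFlop/s}}{36} \approx 0.48\ \mathrm{GFlop/s\ per\ processor},
\qquad
E = \frac{R_{\max}}{R_{\mathrm{peak}}},
\qquad
R_{\mathrm{peak}} = N_{\mathrm{proc}}\, f_{\mathrm{clock}}\, n_{\mathrm{flop/cycle}}.
\]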
... In conventional approaches using the Hadoop ecosystem, the Hadoop MapReduce framework usually runs over the Hadoop Distributed File System (HDFS), which has the advantage that multiple local disks on a compute node provide better data locality [26]. However, the majority of HPC clusters [27,28] follow the traditional Beowulf architecture [29,30]. In such systems, the compute nodes are provided with a very lightweight operating system, or sometimes with a limited capacity of local storage [31]. ...
... These inconsistencies degrade the performance of MapReduce running on HPC clusters. Moreover, recent studies also confirm that MapReduce does not provide significant benefits when combined with an HPC cluster [10,30,31]. These limitations lead us to the question of whether a Lustre-based storage system can provide the local-storage capabilities that MapReduce needs in order to deliver efficient results on HPC clusters. ...
Article
The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity. It becomes computationally inefficient to analyze and select features from such a massive volume of data. Moreover, advancements in the field of Big Data applications and data science pose additional challenges, where the selection of appropriate features and of a High-Performance Computing (HPC) solution has become a key issue and has attracted attention in recent years. Keeping in view the needs above, there is a requirement for a system that can efficiently select features and analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that selects features by using the Artificial Bee Colony (ABC) algorithm. Moreover, a Kalman filter is used in the Hadoop ecosystem to remove noise. Furthermore, traditional MapReduce is combined with ABC to enhance processing efficiency. A complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed Hadoop-based ABC algorithm. To check the efficiency of the algorithms exploited in the proposed system architecture, we have implemented our system using Hadoop and MapReduce with the ABC algorithm. The ABC algorithm is used to select features, whereas MapReduce is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using the MapReduce tool on top of the Hadoop parallel nodes in near real time. Moreover, the proposed system is compared with swarm approaches and is evaluated regarding efficiency, accuracy, and throughput by using ten different data sets. The results show that the proposed system is more scalable and efficient in selecting features.
... There are several implementations of computer clusters for different operating systems and environments. For GNU/Linux there are various cluster software packages that support application clustering, such as Beowulf [1] [2], distcc [3] [4], MPICH, and Linux Virtual Server [5]. MOSIX [6] [7], openMosix [8] [9], Kerrighed [10] [11], and OpenSSI [12] are clusters implemented in the kernel that provide automatic process migration between homogeneous nodes. ...
... Matrix multiplication is one of the most commonly used algorithms in computing, so we decided to use this algorithm to micro-benchmark our HPC cluster. The product matrix multiplication algorithm takes two matrices A (of size m×p) and B (of size p×n) as input data, and produces an output matrix C = A * B, as defined in (1). It is important to note that the two input matrices can be multiplied only when the number of columns of the first matrix equals the number of rows of the second matrix, p. ...
Conference Paper
Full-text available
In almost any educational institution there are computer labs that are available to students during work hours. During off hours, however, these labs are underutilized machines that contain significant computing performance. In this paper we present the cluster infrastructure that we implemented to utilize one of the computer labs at FON University. Additionally, we implemented a simple micro-benchmark in order to present the computing capabilities of our cluster.
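The row-distributed matrix multiplication described in the excerpt above maps naturally onto MPI. The following is a minimal, hypothetical sketch (square matrices of an assumed size N, with N divisible by the number of ranks, timed with MPI_Wtime), not the benchmark code used at FON University:

/*
 * Minimal sketch of a row-distributed matrix-multiplication micro-benchmark,
 * similar in spirit to the C = A * B benchmark described above. The matrix
 * size N and the use of MPI_Scatter/MPI_Gather are illustrative assumptions.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512   /* assumed square matrices; N must be divisible by the number of ranks */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                      /* rows of A handled by each rank */
    double *A = NULL, *C = NULL;
    double *B     = malloc((size_t)N * N * sizeof(double));
    double *Apart = malloc((size_t)rows * N * sizeof(double));
    double *Cpart = malloc((size_t)rows * N * sizeof(double));

    if (rank == 0) {                          /* master fills A and B */
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    double t0 = MPI_Wtime();
    /* distribute rows of A and broadcast all of B */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Apart, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* compute the local block of C = A * B */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += Apart[i * N + k] * B[k * N + j];
            Cpart[i * N + j] = s;
        }

    MPI_Gather(Cpart, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d ranks, N=%d: %.3f s, %.2f MFLOP/s\n",
               size, N, t1 - t0, 2.0 * N * N * (double)N / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}

A typical invocation would be something like mpicc mm_bench.c -o mm_bench followed by mpirun -np 4 ./mm_bench, with the rank count matched to the available cluster nodes.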
... PC configuration is the process in which the components of a PC, such as the processor, motherboard, GPU, RAM, ROM, power supply unit, and case, are assembled into a working system. This process is the assembly stage of the system build [1]. A PC can be assembled using different configuration methods, i.e., pre-built, self-assembled, and custom-built. ...
Article
A custom-built PC is assembled to cater to specific user needs. Internationally, there are many e-stores available that allow users to build and order their systems as per their needs. In 2020, Pakistan observed a COVID-19 lockdown, and most international shipments were cancelled due to the pandemic. This real-world situation highlighted the inability of the local market to fulfill the demands of custom-built PC users. In this paper, we propose and design a generic architecture for an e-store (Nice PC Maker (NPM)) that offers users a custom-build PC option along with other features such as pre-built system purchasing and PC component purchasing. WooCommerce product builders (WordPress) are used to implement the architectural design. We also configured this generic architecture for a local client, which helps them increase their business and allows users to get their desired systems.
... One of the parallel hardware topics covered in the course was the Beowulf cluster, created by Thomas Sterling and Donald Becker at NASA [13]. Their Beowulf cluster was a dedicated multiprocessor workstation, built from commodity off-the-shelf PC hardware, connected with a standard network (e.g., Ethernet), and running open source software (e.g., Linux, MPI, etc.). ...
Article
Much has changed about parallel and distributed computing (PDC) since the author began teaching the topic in the late 1990s. This paper reviews some of the key changes to the field and describes their impacts on his work as a PDC educator. Such changes include: the availability of free implementations of the message passing interface (MPI) for distributed-memory multiprocessors; the development of the Beowulf cluster; the advent of multicore architectures; the development of free multithreading languages and libraries such as OpenMP; the availability of (relatively) inexpensive manycore accelerator devices (e.g., GPUs); the availability of free software platforms like CUDA, OpenACC, OpenCL, and OpenMP for using accelerators; the development of inexpensive single board computers (SBCs) like the Raspberry Pi, and other changes. The paper details the evolution of PDC education at the author's institution in response to these changes, including curriculum changes, seven different Beowulf cluster designs, and the development of pedagogical tools and techniques specifically for PDC education. The paper also surveys many of the hardware and software infrastructure options available to PDC educators, provides a strategy for choosing among them, and provides practical advice for PDC pedagogy. Through these discussions, the reader may see how much PDC education has changed over the past two decades, identify some areas of PDC that have remained stable during this same time period, and so gain new insight into how to efficiently invest one's time as a PDC educator.
... One of the parallel hardware topics covered in the course was the Beowulf cluster, created by Thomas Sterling and Donald Becker at NASA [10]. A Beowulf cluster is a dedicated multiprocessor built from commodity off-the-shelf PCs, connected with standard network fabric (e.g., Ethernet), and running open source software (e.g., Linux, MPI, etc.). ...
... The recent drastic reduction in computer hardware costs coupled with explosive increases in computing capabilities motivate the development of a Cluster computing architecture that is customized to the problem being addressed and can approach supercomputing capabilities. We propose the use of a Beowulf Cluster (Sterling, et al., 1999) that has several advantages in the operational context of on-line systems. ...
Article
Recent advances in information technology provide opportunities for secure, efficient and economical data transfer vis-à-vis real time operations for large-scale traffic systems equipped with sensor systems. This paper describes an Internet based control architecture that provides an automated mechanism for real-time route guidance using a high performance computing environment located at a remote site. A distributed computing paradigm called the Beowulf Cluster is proposed to achieve a high performance computing environment for the on-line architecture.
... Although the idea of using clusters to improve system reliability dates back to the 60s [23], its usage to improve system performance is more recent. The Beowulf project [3] built a $40,000 cluster with 16 personal computers that was able to reach 1 GFLOPs. Since then, computer clusters have become a low cost alternative to build high-performance systems. ...
Article
Computer clusters are today a cost-effective way of providing either high-performance and/or high-availability. The flexibility of their configuration aims to fit the needs of multiple environments, from small servers to SME and large Internet servers. For these reasons, their usage has expanded not only in academia but also in many companies. However, each environment needs a different “cluster flavour”. High-performance and high-throughput computing are required in universities and research centres while high-performance service and high-availability are usually reserved to use in companies. Despite this fact, most university cluster computing courses continue to cover only high-performance computing, usually ignoring other possibilities. In this paper, a master-level course which attempts to fill this gap is discussed.
... In this work, we develop a hybrid programming model to efficiently run the FFEM on a cluster of PCs with multi-core technology. Clusters of PCs are basically distributed-memory machines that have gained a lot of popularity in the scientific community due to their relatively high computational capacity at low cost [15]. Taking advantage of multi-core technology, a new mainstream in the processor industry [6], a cluster with a certain computational capacity can nowadays be built with fewer processors. ...
Article
Full-text available
It is well known that probabilistic and non-probabilistic methods for uncertainty analysis require great computational capacity to perform hundreds or even thousands of simulations. In this work we develop a hybrid computational model, a combination of shared-memory and distributed-memory models, to take advantage of the newest multi-core processor technology that is becoming more frequently used in clusters of PCs. To evaluate the model, numerical experiments are conducted using Intel Core2Duo processors.
... More recently, however, computer scientists have begun to take advantage of the phenomenal increase in the computing power of personal computers (PCs) to turn them into networked parallel computing machines. Such machines are known as 'PC clusters' or 'Beowulfs', and often use standard computer components in order to reduce the cost of construction (Taubes 1996; Sterling et al. 1999). The advantage of these machines is that they make high-resolution 3-D computation possible at low cost; the disadvantage is that existing computational algorithms must be explicitly modified, because each PC has its own processor and its own memory. ...
Article
We validate the spectral-element method for seismic wave propagation in spherically-symmetric global Earth models. We also include the full complexity of 3-D Earth models, i.e., lateral variations in compressional-wave velocity, shear-wave velocity and density, a 3-D crustal model, ellipticity, as well as topography and bathymetry. We also include the effects of the oceans, rotation, and self-gravitation. For the oceans we introduce a formulation based upon an equivalent load, in which the oceans do not need to be meshed explicitly. Some of these effects, which are often considered negligible in global seismology, can in fact play a significant role for certain source-receiver configurations. Anisotropy and attenuation are also incorporated in this study. The complex phenomena that are taken into account are introduced in such a way that we preserve the main advantages of the spectral-element method, which are an exactly diagonal mass matrix and very high computational efficiency on parallel computers. For self-gravitation and the oceans we benchmark the spectral-element synthetic seismograms against normal-mode synthetic seismograms for spherically-symmetric reference model PREM. The two methods are in excellent agreement for all body- and surface-wave arrivals with periods greater than about 20 s in the case of self-gravitation and 25 s in the case of the oceans. We subsequently present results of simulations for a real earthquake in a fully 3-D Earth model for which the fit to the data is significantly improved compared to classical normal-mode calculations based upon PREM.
... The Beowulf project [168,167] consisted of building a cluster of PCs running LINUX, interconnected by an Ethernet network and using the standard UNIX communication protocols. To increase performance, several network interfaces are used simultaneously. ...
Article
Full-text available
Recent work in computer vision has produced complex algorithms and applications that require more and more computing power, despite the fact that they need to run at real-time frequency (usually around 25 to 30 images per second). This thesis presents a novel software solution for developing such applications and running them in real time on cluster machines. To do so, we developed efficient, high-level object-oriented libraries while making the usual parallel programming models easier to use for computer vision developers. The main difficulty is to find a good trade-off between efficiency and readability. We used advanced software engineering techniques to implement two object-oriented libraries: E.V.E., which provides a MATLAB®-like interface to take care of SIMD parallelism, and QUAFF, which is an object-oriented implementation of the parallel algorithmic skeletons model. This work has been validated with two realistic computer vision applications - a real-time 3D reconstruction and a particle-filter-based pedestrian tracker - that were developed and run on a cluster - the BABYLON machine - using two communication buses and fourteen computing nodes with two processors each, exhibiting three levels of parallelism: MIMD, SMP and SIMD. On this architecture, speed-ups of 30 to 100, compared to a basic implementation, have been measured.
... With arrangements like a PC cluster, running Linux and using software like MPI or PVM, we have an alternative special-purpose approach that is essentially software based [91,92,93]. There are two key reasons why this kind of approach has become so prevalent: first, a PC cluster requires modest hardware investment and can be expected to run in dedicated mode; secondly, since the software used is a de facto standard, programs can be expected to have a longer lifetime, and the methods and algorithms used in the programs can be gradually evolved for best performance. ...
... When supporting design work based on the FEM, and when handling an unsteady or nonlinear problem in particular, computation takes much time even when GA is used for optimization. To solve such a problem, grid computing with multiple computers connected to one another via a network [5] has recently been gathering attention. As an example, educational personal computers in a network were used for auto crash analysis while they were not in operation during the night, in a joint study by Hiroshima University, MAZDA Motor Corporation and FUJITSU Corporation [6]. ...
Conference Paper
Full-text available
When seeking optimal parameters by numerical analysis in performance-based design, the number of combinations may increase to an extreme level, or objective functions or restrictions may not be accurately formulated. In such cases, genetic algorithms (GAs) are sometimes used. For certain problems, however, computation takes much time or no effective solutions are obtained even with the use of GAs. To compensate for such shortcomings, this study examines the applicability of the island genetic algorithm (island-GA), a type of distributed genetic algorithm, to design problems. Using a distributed GA is expected to lead to efficient application of grid computing. The time of computation may then be reduced and networks of computers may be used effectively. The effectiveness of island-GA is verified in optimal impact-resistance design for reinforced concrete slabs as an example of a design problem.

1 Introduction
In structural design, the finite element method (FEM) and other numerical analysis techniques have been used widely, and design proposals based on such analysis methods are examined on a daily basis. In engineering design, however, numerous parameters exist and a wide variety of restrictions should be taken into consideration. Using numerical analysis techniques including the FEM for checking design proposals therefore involves handling a large number of combinations of design parameters, and obtaining solutions within a practical time frame may sometimes be difficult. One effective solution to such a problem is the genetic algorithm (GA), which simulates the hereditary and evolutionary processes of species on computers. GA has been applied to
... Computer architecture based on a uniprocessor, whose performance mainly depends on its clock speed, has shifted to multiprocessor forms. Also, as commodity computers with powerful microprocessors became available, relatively inexpensive machines such as personal computers and workstations came to be exploited for high-performance computing by combining these resources cooperatively through software to create the computing power required for computation-intensive tasks; for example, the Beowulf clustering system [101] created the computing power needed by connecting personal computers via widely available networking technology, running one of several open-source operating systems such as Linux. Nowadays, high-performance computing seems to continue to favor such integration of existing computing resources, rather than the development of new, faster processors, influenced by the fact that electronic processing speeds have begun to approach limitations imposed by the laws of physics. ...
... We relied primarily on [Swendson 2004], with the following distinguishing features/differences: 1. We opted for SUSE Linux [OpenSUSE 2005]. At the time of writing, the Warthog runs SUSE Linux release 9.3. ...
Article
Full-text available
Next generation volunteer-based distributed computing projects are working to embrace a wide range of distributed computing environments. In this paper we report on our early experiences with the ChessBrain II project, an established collaboration between researchers in a number of countries, investigating the feasibility of inhomogeneous speed-critical distributed computation.
... The Network File System (CALLAGHAN; PAWLOWSKI; STAUBACH, 1995), or NFS, is the de facto standard for distributed file sharing in the Unix world, and consequently has been naturally absorbed by the Beowulf cluster model since its first steps (STERLING, 1999). NFS was developed by Sun Microsystems and first came out in 1985 with the release of SunOS 2.0. ...
Article
Full-text available
This paper presents a comparison of current file systems with optimised performance dedicated to cluster computing. It presents the main features of three such systems — xFS, PVFS and NFSP — and establishes comparisons between them, in relation to standard NFS, in terms of performance, fault tolerance and adequacy to the Beowulf model.
... The machine was an instant success, and their idea of providing COTS 1 base systems to satisfy specific computational requirements quickly spread through NASA and into the academic and research communities. The development effort for this first machine quickly grew into what we now call the Beowulf Project [41]. Some of the major accomplishments of the Beowulf Project will be chronicled below, but a non-technical measure of success is the observation that researchers within the High Performance Computing community are now referring to such machines as "Beowulf Class Cluster Computers". ...
... This was when laboratories started to build computational clusters from commodity components. The idea and phenomenal success of the Beowulf cluster [5] shows that scientists (i) prefer to have a solution that is under their direct control, (ii) are quite willing to use existing proven and successful templates, and (iii) generally want a 'do-it-yourself' inexpensive solution. As an alternative to 'building your own cluster', bringing computations to free computer resources became a successful paradigm, Grid Computing [6]. ...
... With this architecture, the available number of processors was boosted to more than 100, resulting in a very large improvement in the price performance of supercomputers. However, this move was counteracted by the birth of "Beowulf-class" machines [2,10], which are really simple clusters of commodity PCs with Intel x86 CPUs connected by standard Ethernet. This architecture typically offers one order of magnitude better price performance than other computer architectures, simply because the production cost is low due to mass production. ...
Article
We overview our GRAvity PipE (GRAPE) project to develop special-purpose computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation of gravitational interaction between particles to a general-purpose programmable computer. By this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. Our newest machine, GRAPE-6, achieved the peak speed of , and sustained performance of , for the total budget of about 4 million USD. We also discuss relative advantages of special-purpose and general-purpose computers and the future of high-performance computing for science and technology.
... Especially for some companies, PC clusters can be used to replace mainframe systems or supercomputers and save much hardware cost. In terms of efficiency and cost, the use of parallel software and cluster systems is a good approach, and it will become more and more popular [11] in the near future. As we know, bioinformatics tools can speed up the analysis of large-scale sequence data, especially for sequence alignment. ...
Article
In addition to traditional massively parallel computers, distributed workstation clusters now play an important role in scientific computing, largely due to the advent of commodity high-performance processors, low-latency/high-bandwidth networks and powerful development tools. As we know, bioinformatics tools can speed up the analysis of large-scale sequence data, especially for sequence alignment. To fully utilize the relatively inexpensive CPU cycles available to today's scientists, a PC cluster consisting of one master node and seven slave nodes (16 processors in total) is proposed and built for bioinformatics applications. We use mpiBLAST and HMMer on this parallel computer to speed up the process of sequence alignment. The mpiBLAST software uses a message-passing library called MPI (Message Passing Interface) and the HMMer software uses a software package called PVM (Parallel Virtual Machine), respectively. The system architecture and performance of the cluster are also presented in this paper.
... In order to measure the performance of our cluster, the parallel ray-tracing problem is illustrated and the experimental result is demonstrated on our Linux SMP cluster. The experimental results show that the highest speedup is 15.22 for PVMPOV [5,7] when the total number of processors is 16 on the SMP cluster. Also, the LU benchmark from NPB is used to demonstrate the performance of our cluster, tested using the LAM/MPI library [4]. ...
Article
This document describes how to set up a diskless Linux box. As technology advances rapidly, network cards are becoming cheaper and much faster - 100 Mbit/s Ethernet is common now, and in about a year 1000 Mbit/s, i.e., Gigabit Ethernet cards will become standard. With high-speed network cards, remote access will become as fast as local disk access, which will make diskless nodes a viable alternative to workstations on a local LAN. Diskless nodes also eliminate the cost of software upgrades and system administration costs such as backup and recovery, which are centralized on the server side. Diskless nodes also enable "sharing/optimization" of centralized server CPU, memory, hard-disk, tape and CDROM resources. Diskless nodes provide mobility for users, i.e., users can log on from any of the diskless nodes and are not tied to one workstation. In this paper, an SMP-based PC cluster consisting of one master node and eight diskless slave nodes (16 processors) is proposed and built. The system architecture and benchmark performance of the cluster are also presented in this paper.
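The speedup quoted in the excerpt above (15.22 for PVMPOV on 16 processors) corresponds to the usual definitions of speedup and parallel efficiency; the reported numbers imply an efficiency of roughly 95%:

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad E(16) \approx \frac{15.22}{16} \approx 0.95.
\]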
... From the mid-1990s onward, with the emergence of high-performance communication networks such as Myrinet (BODEN et al., 1995) and SCI (INSTITUTE OF ELECTRICAL AND ELECTRONIC ENGINEERS, 1992; HELLWAGNER; REINEFELD, 1999), many teaching and research institutions began to use platforms based on ordinary computers connected by these networks. These platforms were called clusters, and their use became known worldwide as cluster computing (STERLING et al., 1999; BUYYA, 1999a,b; STERLING, 2002). The use of clusters drastically changed the landscape of parallel programming, since the multiprogramming and communication libraries used until then had to be adapted to protocols specific to these high-performance networks. ...
... If we can utilize PCs interconnected by Ethernet/Fast Ethernet for distributed computing, the networked environment is called a PC cluster (Sterling, Salmon, Becker & Savarese, 1999). To implement parallel programming in a PC cluster, the major issue is the distribution of information among the PCs. ...
Article
In a competitive environment, how to fully and efficiently utilize computational resources while keeping cost in mind has become an important concern for most practitioners. In recent years, the PC cluster architecture has been applied to many applications, especially those demanding computational resources. In this study, we introduce a case study from Taiwan that addresses this issue using a PC cluster architecture with MPI techniques. This case is an application of clustering techniques to DNA analysis. Not only was the hardware/software architecture of the PC cluster constructed, but a clustering algorithm based on this cluster architecture was also proposed in this study. The soundness of the proposed architecture and algorithm is demonstrated in this study.
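As a concrete illustration of "the distribution of information among the PCs" mentioned in the excerpt above, the sketch below distributes chunks of data from a master rank to worker ranks and collects partial results with MPI point-to-point calls. It is a generic, hypothetical example (the chunk size, tags, and the summing "work" are placeholders), not the DNA-clustering code of the study:

/*
 * Minimal master/worker sketch of distributing work among the PCs of a
 * cluster with MPI point-to-point messages. The "work" (summing a chunk of
 * an array) is a placeholder for real processing.
 */
#include <mpi.h>
#include <stdio.h>

#define CHUNK 1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double chunk[CHUNK];

    if (rank == 0) {
        /* master: send one chunk of data to every worker ... */
        for (int w = 1; w < size; w++) {
            for (int i = 0; i < CHUNK; i++)
                chunk[i] = w + i * 0.001;            /* stand-in for real data */
            MPI_Send(chunk, CHUNK, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
        }
        /* ... then collect one partial result from each of them */
        for (int w = 1; w < size; w++) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, w, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("worker %d returned %f\n", w, partial);
        }
    } else {
        /* worker: receive its chunk, process it, return the result */
        double sum = 0.0;
        MPI_Recv(chunk, CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < CHUNK; i++)
            sum += chunk[i];
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}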
Article
Full-text available
The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity. It becomes computationally inefficient to analyze such a massive volume of data. Moreover, advancements in the field of Big Data applications and data science pose additional challenges, where a High-Performance Computing solution has become a key issue and has attracted attention in recent years. However, these systems are either memoryless or computationally inefficient. Therefore, keeping in view the aforementioned needs, there is a requirement for a system that can efficiently analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that enhances the working of traditional MapReduce by incorporating a parallel processing algorithm. Moreover, a complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm. The proposed system architecture supports both read and write operations that enhance the efficiency of Input/Output operations. To check the efficiency of the algorithms exploited in the proposed system architecture, we have implemented our system using Hadoop and MapReduce. MapReduce is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using the MapReduce tool on top of the Hadoop parallel nodes to generate and process graphs in near real time. Moreover, the system is evaluated in terms of efficiency by considering the system throughput and processing time. The results show that the proposed system is more scalable and efficient.
Article
With high-performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention due to its uniqueness of solving data analytics problems with a combination of Big Data and HPC technologies. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study on different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters by utilizing Lustre as the additional storage provider for intermediate data. With a deployment architecture where both local disks and Lustre are utilized for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage during runtime by on-line profiling. Our results indicate that we can achieve a 44 percent performance benefit for shuffle-intensive workloads in leadership-class HPC systems. Our priority directory selection scheme can improve the job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs. To the best of our knowledge, this is the first such comprehensive study for YARN MapReduce with Lustre and RDMA.
Conference Paper
In the last decades, supercomputers have become a necessity in science and industry. Huge data centers consume enormous amounts of electricity, and we are at a point where newer, faster computers must no longer drain more power than their predecessors. The fact that user demand for compute capabilities has not declined in any way has led to studies of the feasibility of exaflop systems. Heterogeneous clusters with highly-efficient accelerators such as GPUs are one approach to higher efficiency. We present the new L-CSC cluster, a commodity hardware compute cluster dedicated to Lattice QCD simulations at the GSI research facility. L-CSC features a multi-GPU design with four FirePro S9150 GPUs per node, providing 320 GB/s memory bandwidth and 2.6 TFLOPS peak performance each. The high bandwidth makes it ideally suited for memory-bound LQCD computations, while the multi-GPU design ensures superior power efficiency. The November 2014 Green500 list named L-CSC the most power-efficient supercomputer in the world, with 5270 MFLOPS/W in the Linpack benchmark. This paper presents optimizations to our Linpack implementation HPL-GPU and other power efficiency improvements which helped L-CSC reach this benchmark. It describes our approach for an accurate Green500 power measurement and unveils some problems with the current measurement methodology. Finally, it gives an overview of the Lattice QCD application on L-CSC.
Conference Paper
This paper reviews the technical and social events that stimulated early deployments of large-scale Beowulf-style clusters for production scientific and engineering use at the National Center for Supercomputing Applications (NCSA) and the subsequent development of the NSF TeraGrid. Insights and lessons from these experiences have shaped further development of high-performance computing environments and exposed a set of research challenges for creation of exascale computing systems.
Article
Introduction; Interconnection Networks; Optical Interconnection Networks: Technologies, Architectures, and Systems; Case Studies; Comparison of Basic Approaches; Prospects for Realization (Barriers, Technology, or Economic); Conclusion and Areas to Watch; Glossary; Cross References; References; Further Reading
Conference Paper
Recently, MapReduce is getting deployed over many High Performance Computing (HPC) clusters. Different studies reveal that by leveraging the benefits of high-performance interconnects like InfiniBand in these clusters, faster MapReduce job execution can be obtained by using additional performance enhancing features. Although RDMA-enhanced MapReduce has been proven to provide faster solutions over Hadoop distributed file system, efficiencies over parallel file systems used in HPC clusters are yet to be discovered. In this paper, we present a complete methodology for evaluating MapReduce over Lustre file system to provide insights about the interactions of different system components in HPC clusters. Our performance evaluation shows that RDMA-enhanced MapReduce can achieve significant benefits in terms of execution time (49% in a 128-node HPC cluster) and resource utilization, compared to the default architecture. To the best of our knowledge, this is the first attempt to evaluate RDMA-enhanced MapReduce over Lustre file system on HPC clusters.
Conference Paper
HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to the tri-replicated data blocks and cannot efficiently utilize the available storage devices in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g. RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design integrated with parallel file system (e.g. Lustre) can lead to significant storage space savings and guarantee fault-tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the Cloudburst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA over both default HDFS and Lustre.
Article
Nonuniformity is a common characteristic of contemporary computer systems, mainly because of physical distances in computer designs. In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some nonuniform memory access (NUMA) architectures, depending on if the memory is close to the requesting processor or not. Much research has been devoted to optimizing such systems. This thesis identifies another important property of computer designs, nonuniform communication architecture (NUCA). High-end hardware-coherent machines built from a few large nodes or from chip multiprocessors, are typical NUCA systems that have a lower penalty for reading recently written data from a neighbor’s cache than from a remote cache. The first part of the thesis identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations that exploit NUCAs are presented and investigated. This type of lock is shown to be almost twice as fast
Article
Full-text available
The growth of computer clusters, and in particular of multicluster systems, increases the number of potential points of failure, requiring the use of fault-tolerance schemes that provide the ability to finish the processing. The general objective of fault-tolerance systems is that the total work be executed correctly even when some element of the system fails, losing as little completed work as possible, bearing in mind that performance decreases because of the overhead introduced to tolerate faults and the loss of part of the system. This thesis presents a fault-tolerance model for geographically distributed computer clusters using Data Replication, called FTDR (Fault Tolerant Data Replication). It is based on an initial replication of the processes and dynamic data replication during execution, with the aim of preserving the critical results. It is oriented toward applications with a Master/Worker execution model and runs transparently to the user. The fault-tolerance system designed is configurable and meets the scalability requirement. A functional model has been designed and a Middleware implemented. A methodology is proposed for incorporating it into the design of parallel applications. The model is based on detecting faults in any of the functional elements of the system (compute nodes and interconnection networks) and tolerating these faults through the replication of programs and data, guaranteeing the completion of the work and preserving most of the computation performed before the failure; to this end, when a failure occurs, it is necessary to recover the consistency of the system and reconfigure the multicluster transparently to the user. The Middleware developed to incorporate fault tolerance into the multicluster environment yields a more reliable system without adding extra hardware resources: starting from the unreliable elements of the cluster, it protects the computation performed by the application against failures, so that if one computer fails another takes over its work, and the computation already performed is protected by Data Replication. This Middleware can be configured to support more than one simultaneous failure and to select a centralized or distributed scheme; parameters can also be configured relating to aspects that trade the overhead introduced against how much completed computation may be lost. To validate the system, a fault-injection system has been designed. Although adding fault-tolerance functionality implies a loss of performance, it has been verified experimentally that the overhead introduced by this system in the absence of failures is below 3%, and that when a failure occurs after some execution time, the runtime when tolerating the failure is better than relaunching the application.
Conference Paper
A new approach to computing a low-dimensional invariant manifold of chemical reaction mechanisms is presented in this paper. An n-dimensional chemical reaction state space described by a system of autonomous differential equations dc/dt = f(c) can be described by a lower, m-dimensional space. The reduced-dimensional space is developed by constructing an invariant manifold of the system. For a nonlinear system, the various low-dimensional manifolds can be constructed by linearizing the governing differential equations about a fixed point c* where f(c*) = 0. At this point, the Jacobian matrix of the resulting linear system is efficiently computed using complex variables and analyzed in order to identify the various invariant manifolds. Construction of the reduced reaction spaces proceeds by transformation of the system species variables c to new variables z. The resulting system of new ODEs expressed in terms of the new variables z is uncoupled in the linear terms, but coupled in the nonlinear terms. Next, a functional relationship between the (n-m) new fast variables and the m new slow variables of the required manifold is assumed. When this functional relationship is substituted into the transformed equations (which are in terms of the new variables z), a set of partial differential equations (PDEs) results (essentially a projection of the n-dimensional state space onto the reduced m-dimensional invariant manifold). In this paper, details of this dynamical-systems approach to constructing reduced chemical kinetic mechanisms are described. The Genetic Algorithm (GA) is used to optimize a Neural Network model which is then used to construct the invariant manifold. This approach may be used for the reduction of large-scale systems involving hundreds of species. © 2001 by the American Institute of Aeronautics and Astronautics. All rights reserved.
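A compact restatement of the construction described above, in notation chosen here for illustration (the paper's own symbols and details may differ):

\[
\frac{dc}{dt} = f(c), \qquad f(c^*) = 0, \qquad J = \left.\frac{\partial f}{\partial c}\right|_{c = c^*}, \qquad z = T^{-1}(c - c^*),
\]
\[
\frac{dz}{dt} = \Lambda z + g(z), \qquad \Lambda = T^{-1} J T,
\]
where T (block-)diagonalizes J so that the linear terms decouple and g collects the coupled nonlinear terms. Splitting z into m slow variables z_s and (n-m) fast variables z_f and assuming z_f = h(z_s) on the manifold, the invariance condition

\[
\frac{\partial h}{\partial z_s}\,\dot{z}_s\bigl(z_s, h(z_s)\bigr) = \dot{z}_f\bigl(z_s, h(z_s)\bigr)
\]
is the set of partial differential equations whose solution h defines the m-dimensional invariant manifold.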
Conference Paper
We present a comparative study of the parallel performance of a discontinuous Galerkin compressible flow solver based on five different numerical fluxes on distributed-memory parallel computers, specifically a compute cluster and two many-core machines. Compute clusters, composed of commodity hardware components, are recognized as the most cost-effective parallel computers, while many-core processors for personal computers are quite common nowadays, offering another parallel computing platform. The parallel flow solver uses a discontinuous Galerkin (DG) method based on a Taylor series basis for the compressible Euler equations. The solution is marched explicitly in time using a three-stage, third-order Runge-Kutta scheme to attain a steady state. The performance of the parallel solver, in terms of speedup, for five different numerical fluxes is computed on the said parallel computers. The parallel code follows the computational domain partitioning strategy and uses the MPI (Message Passing Interface) library for parallelization. A number of strategies for making the parallel solution communication-efficient are also discussed.
Article
We overview our GRAPE (GRAvity PipE) and GRAPE-DR project to develop dedicated computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation of gravitational interaction between particles to a general-purpose programmable computer. By this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. GRAPE-6, completed in 2002, achieved the peak speed of 64 Tflops. The next machine, GRAPE-DR, will have a peak speed of 2 Pflops and will be completed in 2008. We discuss the physics of stellar systems, the evolution of general-purpose high-performance computers, our GRAPE and GRAPE-DR projects, and issues of numerical algorithms.
Article
The goal of this study is to implement a graphical display of 3-dimensional underwater appearance, a fishery measurement information display, sonar data representation and display, and 3-dimensional underwater animation, based on coefficient data of chaotic behavior and fishing-gear modeling from a PC cluster system. In order to accomplish these goals, it is essential to compose the user interface and realistic descriptions of image scenes in the towing-net fishery simulator, and techniques to describe sand cloud effects under water using particle systems are necessary. In this study, we implemented the graphical representations and animations of the simulator by using OpenGL together with C routines.
Article
Full-text available
Parallel computing has emerged as a necessity in the modelling of the atmosphere and ocean. In the accompanying note, a computational model is designed and experiments are carried out for communication overheads, including latency, in relation to InfiniBand and customized Floswitch configurations. The data clearly demonstrate that Ethernet-based communication fares poorly, and the implication is that for such systems to be competitive for the tightly coupled class of problems, parallelization strategies where computation and communication overlap need to be looked into.

Introduction
Parallel computing has become indispensable in solving large-scale computing problems. The literature on parallel computing is very extensive, and references 1–9 present a cross section of it. It is seen that as the power of microprocessor-based CPUs keeps growing, the tendency to put them together for building bigger and bigger parallel computers does not decline; this gave rise to the development of a host of interconnection networks ranging from shared-memory devices to crossbar switches, Ethernet, and InfiniBand-type connectivity. The supporting software also grew in functionality and ease of operation. MPI 10
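Communication-overhead studies of the kind described above usually start from a point-to-point ping-pong measurement. The sketch below shows the common pattern for estimating one-way latency between two ranks; the message size and repetition count are illustrative assumptions, not values from the paper:

/*
 * Minimal two-rank ping-pong sketch of the kind commonly used to measure
 * point-to-point latency on a cluster interconnect. Run with at least two
 * MPI ranks; ranks beyond the first two simply idle at the barrier.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    const int bytes = 8;                /* small message to expose latency */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* one-way latency is half the average round-trip time */
    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / reps / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}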
Conference Paper
This paper presents the case that education in the 21st Century can only measure up to national needs if technologies developed in the simulation community, further enhanced by the power of high performance computing, are harnessed to supplant traditional didactic instruction. The authors cite their professional experiences in simulation, high performance computing and pedagogical studies to support their thesis that this implementation is not only required, it is feasible, supportable and affordable. Surveying and reporting on work in computer-aided education, this paper discusses the pedagogical imperatives for group learning, risk management and "hero teacher" surrogates, all being optimally delivered with entity-level simulations of varying types. Further, experience and research are adduced to support the thesis that effective implementation of this level of simulation is enabled only by, and is largely dependent upon, high performance computing, especially by the ready utility and acceptable costs of Linux clusters.
Article
Full-text available
This paper presents the Netuno supercomputer, a large-scale cluster installed at the Federal University of Rio de Janeiro in Brazil. A detailed performance evaluation of Netuno is presented, depicting its computational and I/O performance, as well as the results for two real-world applications. Since building a high-performance cluster for running a wide range of applications is a non-trivial task, some lessons learned from assembling and operating this cluster, such as the excellent performance of the OpenMPI library and the relevance of employing an efficient parallel file system over the traditional NFS system, can be useful knowledge to support the design of new systems. Currently, Netuno is being heavily used to run large-scale simulations in the areas of ocean modeling, meteorology, engineering, physics, and geophysics.
Article
Aerodynamic problems involving moving objects have many applications, including store separation, fluid–structure interaction, takeoff and landing, and fast maneuverability. While wind tunnel and flight tests remain important, time-accurate computational fluid dynamics (CFD) offers the option of calculating these procedures from first principles. However, such computations are complicated and time consuming. Parallel computing offers a very effective way to improve our productivity in doing CFD analysis. In this article, we review recent progress made in parallel computing in this area. The store separation problem will be used to offer a physical focus and to help motivate the research effort. The chimera grid technique will be emphasized due to its flexibility and wide use in the technical community. In the chimera grid scheme, a set of independent, overlapping, structured grids is used to decompose the domain of interest. This allows the use of efficient structured grid flow solvers and associated boundary conditions, and allows for grid motion without stretching or regridding. However, these advantages are gained in exchange for the requirement to establish communication links between the overlapping grids via a process referred to as "grid assembly." Logical, incremental steps are presented in the parallel implementation of the grid assembly function. Issues related to data structure, processor-to-processor communication, parallel efficiency, and assessment of run time improvement are addressed in detail. In a practical application, the current practice allows the CPU time to be reduced from 6.5 days on a single-processor computer to about 4 h on a parallel computer.
Article
A cluster is a collection of independent and cheap machines, used together as a supercomputer to provide a solution. In traditional scalable computing clusters, there are high-powered Intel- or AMD-based PCs, each with several gigabytes of hard disk space. As far as a cluster administrator is concerned, the installer may have to install an operating system and related software for each cluster node. Here, every node (excluding the server) has no hard disk and is booted from the network. Once it has booted successfully, it works the same as a fully equipped node. We explain how to set up a diskless cluster for computing purposes. In this paper, an SMP-based PC cluster consisting of one master node and eight diskless slave nodes (16 processors) is proposed and built. The system architecture and benchmark performance of the cluster are also presented in this paper.
Article
Full-text available
Many potential inventions are never discovered because the thought processes of scientists and engineers are channeled along well-traveled paths. In contrast, the evolutionary process tends to opportunistically solve problems without considering whether the evolved solution comports with human preconceptions about whether the goal is impossible. This paper demonstrates how genetic programming can be used to automate the process of exploring queries, conjectures, and challenges concerning the existence of seemingly impossible entities. The paper suggests a way by which genetic programming can be used to automate the invention process. We illustrate the concept using a challenge posed by a leading analog electrical engineer concerning whether it is possible to design a circuit composed of only resistors and capacitors that delivers a gain of greater than one. The paper contains a circuit evolved by genetic programming that satisfies the requirement of this challenge as well as a related, more difficult challenge. The original challenge was motivated by a circuit patented in 1956 for preprocessing inputs to oscilloscopes. The paper also contains an evolved circuit satisfying (and exceeding) the original design requirements of the circuit patented in 1956. This evolved circuit is another example of a result produced by genetic programming that is competitive with a human-produced result that was considered to be creative and inventive at the time it was first discovered.
Article
Full-text available
Previously reported applications of genetic programming to the automatic synthesis of computational circuits have employed simulations based on DC sweeps. DC sweeps have the advantage of being considerably less time-consuming than time-domain simulations. However, this type of simulation does not necessarily lead to robust circuits that correctly perform the desired mathematical function over time. This paper addresses the problem of automatically synthesizing computational circuits using multiple time-domain simulations and presents results involving the synthesis of both the topology and sizing for a squaring, square root, and multiplier computational circuit and a lag circuit (from the field of control).
Article
We present the parallelization of a discontinuous Galerkin (DG) code on distributed-memory parallel computers for compressible, inviscid fluid flow computations on unstructured meshes. The parallel DG code is based on a Taylor series basis and uses the LU-SGS (Lower-Upper Symmetric Gauss-Seidel) method for solving the linear system obtained by implicit time integration. A cost-effective Ethernet-based compute cluster composed of commonly available PC components is considered as the parallel computing platform. The parallel solution is based on computational domain partitioning and the MPI (Message Passing Interface) programming paradigm. We demonstrate the scalability of the developed parallel code on a cluster using two efficient MPI communication procedures. We also demonstrate that the memory-bound nature of the parallel solution dominates over its network-bound nature.
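Domain-partitioned solvers of the kind described above typically overlap computation with communication by exchanging ghost (halo) data with non-blocking MPI calls. The sketch below illustrates that pattern for a periodic 1-D decomposition; the array sizes and ring topology are illustrative assumptions, not the partitioning used by the DG solver:

/*
 * Sketch of a non-blocking halo (ghost-cell) exchange for a 1-D ring of
 * subdomains, allowing interior work to overlap with communication.
 */
#include <mpi.h>
#include <stdio.h>

#define NLOC 1024   /* local cells per rank, plus one ghost cell on each side */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOC + 2];                     /* u[0] and u[NLOC+1] are ghosts */
    for (int i = 1; i <= NLOC; i++)
        u[i] = rank + i * 1e-6;

    int left  = (rank - 1 + size) % size;   /* periodic 1-D decomposition */
    int right = (rank + 1) % size;

    MPI_Request req[4];
    /* post receives for the ghost cells and sends of the boundary cells */
    MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[NLOC + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[NLOC],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);

    /* interior work that needs no ghost data proceeds here, overlapping the
       messages in flight (placeholder loop) */
    double interior = 0.0;
    for (int i = 2; i <= NLOC - 1; i++)
        interior += u[i];

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* boundary work that depends on the freshly received ghost cells */
    double boundary = u[0] + u[NLOC + 1];

    if (rank == 0)
        printf("rank 0: interior=%f boundary=%f\n", interior, boundary);

    MPI_Finalize();
    return 0;
}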