Article

Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with Grid

Authors:
  • iExec Blockchain Tech

Abstract

Global Computing systems belong to the class of large-scale distributed systems. Their properties (high computational, storage and communication performance potential, and high resilience) make them attractive in academia and industry as computing infrastructures complementing more classical infrastructures such as clusters or supercomputers. However, generalizing the use of these systems in a multi-user, multi-parallel-programming context requires solutions and mechanisms for many issues, such as programming bag-of-tasks and message-passing parallel applications; securing the applications, the system itself and the computing nodes; and deploying the system to harness resources managed in different ways. In this paper, we present our research, often driven by user demands, towards a computational peer-to-peer system called XtremWeb. We describe (a) the architecture of the system and its motivations, (b) the parallel programming paradigms available in XtremWeb and how they are implemented, (c) the deployment issues and the mechanisms used to simultaneously harness uncoordinated sets of resources and resources managed by batch schedulers, and (d) the security issue and how, inside XtremWeb, we address the protection of the computing resources. We present two multi-parametric applications intended for production use: Aires, which belongs to the high-energy physics (HEP) Auger project, and a protein conformation predictor based on a molecular dynamics simulator. To evaluate performance and volatility tolerance, we present experimental results for bag-of-tasks and message-passing applications. We show that the system can tolerate massive failures, and we discuss the performance of the node protection mechanism. Based on the developments and evolution of the XtremWeb project, we discuss the convergence between Global Computing systems and the Grid.
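As a rough illustration of the coordinator/worker pull model the abstract describes (workers fetch tasks from a coordinator; tasks lost to volatile nodes are re-queued so the bag of tasks eventually completes), the following Python sketch mimics a bag-of-tasks run with unreliable workers. The class names, crash probability and task payload are invented for the example and do not reflect XtremWeb's actual interfaces.

```python
# Minimal pull-model bag-of-tasks sketch in the spirit of a coordinator/worker
# architecture. Illustrative only; not the real XtremWeb API.
import queue
import random
import threading

class Coordinator:
    def __init__(self, tasks):
        self.pending = queue.Queue()
        for t in tasks:
            self.pending.put(t)
        self.results = {}
        self.lock = threading.Lock()

    def request_task(self):
        """Called by a worker to pull the next pending task (None if drained)."""
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def report_result(self, task_id, value):
        with self.lock:
            self.results[task_id] = value

    def report_failure(self, task):
        # Volatile node: re-queue the task so another worker can retry it.
        self.pending.put(task)

def worker(coord, crash_prob=0.2):
    while True:
        task = coord.request_task()
        if task is None:
            return
        task_id, x = task
        if random.random() < crash_prob:     # simulate a volatile worker
            coord.report_failure(task)
            return
        coord.report_result(task_id, x * x)  # the "computation"

if __name__ == "__main__":
    coord = Coordinator([(i, i) for i in range(20)])
    # Keep launching waves of workers until every task has a result.
    while len(coord.results) < 20:
        threads = [threading.Thread(target=worker, args=(coord,)) for _ in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
    print(len(coord.results), "results collected despite simulated worker failures")
```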


... Among the desired features for the algorithms under consideration (which will potentially be run on non-dedicated local computers, remote devices, grid systems, cloud systems, ubiquitous systems, among others [3,4,5]) we look for ephemerality-awareness, which is related to the self-capability to understand the underlying systems where the algorithm is run, as well as to decide how to proceed taking into account the non-reliable nature of the system. ...
... Think, for example, of the pervasive abundance of networked handheld devices, tablets and, lately, wearables (not to mention more classical devices such as desktop computers) whose computational capabilities are often underexploited. Hence, the concept of Eph-C partially overlaps with ubiquitous computing [4], pervasive computing [6], and volunteer and distributed computing [5,7], but exhibits its own distinctive features, mainly in terms of the extreme dynamism of the underlying resources and the ephemerality-aware nature of the computation, which autonomously adapts to the ever-changing computational landscape, not just trying to fit the inherent volatility of the latter but even trying to turn it to profit. ...
Article
Full-text available
The concept of Ephemeral Computing is an emergent topic that is currently consolidating among the research community. It includes computing systems where the nodes or the connectivity have an ephemeral and thus unpredictable nature. Although the capacity and computing power of small and medium devices (such as smartphones or tablets) are increasing swiftly, their computing capacities are usually underexploited. The availability of highly volatile heterogeneous computing resources capable of running software agents requires suitable algorithms to make proper use of the available resources while circumventing the potential problems that such non-reliable systems may produce. Due to the non-reliable nature of the systems where the algorithms under consideration should run, they have to be ephemerality-aware, having the self-capability to understand this kind of environment and adapt to it by means of flexibility, plasticity and robustness. Because of their decentralized functioning, intrinsic parallelism, resilience, and adaptiveness, bioinspired algorithms are well suited to this endeavour. The papers in this special issue address a variety of issues and concerns in ephemeral and complex domains, including signal reconstruction, large-scale social network analysis, disease detection and prevention, unit deployment, and collaborative hyper-heuristics.
... Currently, the distribution of a computation over a grid by a central server is a well-mastered task, implemented in several infrastructures such as Globus [60] or eXtremWeb [55,20]. Java technology, after having been strongly contested in the distributed computing community, now seems to be recognized for its portability, ease of programming and extensibility. There are currently three proposals based on this technology [161,61]. ...
... This means that all other strategies necessarily lead to a loss of profit and that, under these conditions, the strategy (σ0, δ2) is the best for an agent, whatever the other strategies may be. On the other hand, adding an incentive mechanism changes the equilibrium of the system. ...
Thesis
The peer-to-peer (P2P) model is now used in constrained environments. The decentralization induced by this model pushes back the limits of the client-server model. Nevertheless, in order to guarantee a level of service, it requires the integration of an adapted management infrastructure; this last point constitutes the framework of our work. Concerning the modelling of management information, we designed an extension of CIM for the P2P model. To validate it, we implemented it on Jxta. We then specialized our information model for distributed hash tables (DHTs): we abstracted the operation of DHTs, proposed a set of metrics that characterize their performance, and derived an information model that integrates them. Finally, concerning the organization of the management plane, we proposed a hierarchical model that allows peers to organize themselves into a tree of managers and agents. This proposal was implemented on top of a Pastry implementation.
... Utilizing unused computing resources from users' desktop/notebook machines to solve large computing tasks, known as volunteer computing, is introduced in [10,14]. For example, SETI@home [49] aggregates the computing power of thousands of anonymous volunteer users, residing in different countries around the world, to search for radio signals from extraterrestrial intelligence. ...
... When a job is executed, the BOINC client is exposed such that it starts retrieving and executing workunits from BOINC until the running time of the encapsulating grid job is over. Similarly, Urbah et al. [54] provide a bridge from EGEE grids to XtremWeb desktop grids [14], another volunteer computing platform. For cluster computing, BOINC has been integrated to create a scalable cluster for processing large batch jobs [37]. ...
Article
Full-text available
Deep learning is a very computing-intensive and time-consuming task. Training a sophisticated model within a reasonable time requires an amount of computing resources far greater than a single machine can afford. Normally, GPU clusters are required to reduce the training time of a deep learning model from days to hours. However, building large dedicated GPU clusters is not always feasible, or even effective, for most organizations, due to the cost of purchase, operation and maintenance while such systems are not fully utilized all the time. In this regard, volunteer computing can address this problem as it provides additional computing resources at little or no cost. This work presents a hybrid cluster and volunteer computing platform that scales out GPU clusters into volunteer computing for distributed deep learning. The owners of the machines contribute unused computing resources on their computers to extend the capability of the GPU cluster. The challenge is to seamlessly align the differences between the GPU cluster and the volunteer computing system so as to ensure scalability transparency, while performance is another major concern. We validate the proposed work with two well-known sample cases. The results show an efficient use of our hybrid platform at sub-linear speedup.
... The system was designed with fault tolerance in mind. This allows the mobility of clients, the volatility of workers and the failure of the coordination service [23]. In [24], Abdennadher and Boesch presented an upgraded version called XtremWeb-CH, which supports direct communication between workers to build effective peer-to-peer systems. ...
... A similar situation arises with the job management (R5) requirement. As described in [23], the coordinator component manages and supervises task execution, which includes the functionality described in requirement (R5). Similar functionality is provided by the task consistency component from CometCloud and the job management server from Analytics Cloud, respectively. ...
Conference Paper
Cloud computing has emerged as a new technology that provides on-demand access to a large amount of computing resources. This makes it an ideal environment for executing metaheuristic optimization experiments. In this paper, we investigate the use of cloud computing for metaheuristic optimization. This is done by analyzing job characteristics from our production system and conducting a performance comparison between different execution environments. Additionally, a cost analysis is done to incorporate expenses of using virtual resources.
... The system was designed with fault tolerance in mind. This allows the mobility of clients, the volatility of workers and the failure of the coordination service [23]. In [24], Abdennadher and Boesch presented an upgraded version called XtremWeb-CH, which supports direct communication between workers to build effective peer-to-peer systems. ...
... A similar situation arises with the job management (R5) requirement. As described in [23], the coordinator component manages and supervises task execution, which includes the functionality described in requirement (R5). Similar functionality is provided by the task consistency component from CometCloud and the job management server from Analytics Cloud, respectively. ...
Conference Paper
Cloud computing has gained widespread acceptance in both the scientific and commercial communities. Mathematical optimization is one of the domains that benefit from cloud computing by using additional computing power to reduce the calculation time of optimization problems. Of course this is also true for our field of metaheuristic optimization. Metaheuristics provide powerful methods to solve a wide range of optimization problems and may be used as a foundation for a data analysis service. Due to the lack of an agreed-upon reference architecture, it is quite cumbersome to compare existing solutions with regard to different kinds of aspects (e.g. scalability, custom extensions, workflow, etc.). Besides the usual user working with an optimization service, we also have those who are responsible for architecting and implementing these systems. The lack of a list of requirements and of any formal reference architecture makes it even harder to improve those systems. For that reason we have raised the following questions: i) what are the requirements, ii) what are the commonalities of existing optimization software, and iii) can we deduce a reference architecture for a cloud-based optimization service? This paper presents a comprehensive analysis of current research projects and important requirements in the context of optimization services, which then leads to the definition of a reference architecture and forms the base of any further evaluation. We also present our own hybrid cloud-based optimization service (OaaS), which is built upon the PaaS approach of Windows Azure. OaaS defines a generic and extensible service which can be adapted to support custom optimization scenarios.
... So far this field has experienced little integration with the area of distributed and peer-to-peer data mining. The main reason for this is the centralized nature of the popular volunteer computing platforms available today, such as BOINC [4] and XtremWeb [5,6], which requires all data to be served by a group of centrally maintained servers. However, the centralized approach can generate bottlenecks and single points of failure in the system. ...
... The volunteer computing [3] paradigm has been exploited in several scientific applications (i.e., Seti@home, Folding@home, Einstein@home), but its adoption for mining applications is more challenging. The two most popular volunteer computing platforms available today, BOINC [4] and XtremWeb [5,6], are especially well suited for CPU-intensive applications but are somewhat inappropriate for data-intensive tasks, for two main reasons. First, the centralized nature of such systems requires all data to be served by a group of centrally maintained servers. ...
... In this manner, remote sensing is defined as a big data problem, following the 5Vs definition of Big Data (volume, variety, velocity, veracity, and value) [4], bringing new challenges in data storage, data management, and data processing. In order to overcome the challenges related to data storage and management, researchers have proposed parallel and distributed techniques using supercomputers [5], [6], [7], [8], [9]. However, cloud computing technology has gained much more attention due to the advantage of commodity compute and storage devices. ...
Preprint
Full-text available
Given the high availability of data collected by different remote sensing instruments, the fusion of multispectral and hyperspectral images (HSI) is an important topic in remote sensing. In particular, super-resolution as a data fusion application using the spatial and spectral domains is highly investigated, because the fused images are used to improve classification and object-tracking accuracy. On the other hand, the huge amount of data obtained by remote sensing instruments represents a key concern in terms of data storage, management and pre-processing. This paper proposes a Big Data cloud platform using Hadoop and Spark to store, manage, and process remote sensing data. A study of the chunk size parameter is also presented to suggest an appropriate value for downloading imagery data from Hadoop into a Spark application, based on the format of our data. We also developed an alternative approach based on Long Short-Term Memory networks trained with different patch sizes for image super-resolution. This approach fuses hyperspectral and multispectral images. As a result, we obtain images with high spatial and high spectral resolution. The experimental results show that for a chunk size of 64k, an average of 3.5 s was required to download data from Hadoop into a Spark application. The proposed model for super-resolution provides a structural similarity index of 0.98 and 0.907 for the used dataset.
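The chunk-size study mentioned in this abstract can be illustrated with a much simpler, self-contained experiment: time how long it takes to stream a file using different chunk sizes. The sketch below uses plain local file I/O rather than the paper's Hadoop/Spark stack, and the file name and sizes are arbitrary assumptions made for the example.

```python
# Hypothetical chunk-size timing sketch: measure elapsed time to stream a file
# for several chunk sizes. Local I/O stands in for a remote HDFS transfer.
import os
import time

def read_in_chunks(path, chunk_size):
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            total += len(block)
    return total

if __name__ == "__main__":
    path = "sample.bin"                       # stand-in for a remote image tile
    with open(path, "wb") as f:
        f.write(os.urandom(8 * 1024 * 1024))  # 8 MiB of dummy data

    for chunk_size in (4 * 1024, 64 * 1024, 1024 * 1024):
        start = time.perf_counter()
        read_in_chunks(path, chunk_size)
        elapsed = time.perf_counter() - start
        print(f"chunk={chunk_size // 1024:>5} KiB  elapsed={elapsed:.4f}s")
```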
... However, this solution was designed with a specific task in mind and was not universal. A good example of a more universal solution is the XtremWeb project [6], a peer-to-peer computation system capable of solving different computation problems. Such solutions enable large-scale computations in a distributed environment but still require extensive knowledge and experience to successfully develop a computation application. ...
... performance parameters), describe their technical solution and assume the implementation of certain user expectations that are not explicitly formulated as functional requirements. A good example of such practice is the XtremWeb project [7]. Its goal was to create a peer-to-peer computation system that would enable parallel computing in a distributed large-scale system. ...
Chapter
High Performance Computing (HPC) consists in the development and execution of sophisticated computation applications, developed by highly skilled IT personnel. Several past studies report significant problems with applying HPC in industrial practice, caused by a lack of the IT skills necessary to develop highly parallelised and distributed computation software. This calls for new methods to reduce software development effort when constructing new computation applications. In this paper we propose a generic requirements model consisting of a conceptual domain specification, a unified domain vocabulary and use-case-based functional requirements. The vocabulary definition provides detailed clarifications of the fundamental HPC component elements and their role in the system. Further, we address security issues by providing transparency principles for HPC. We also propose a research agenda that leads to the creation of a model-based software development system dedicated to building distributed HPC applications at a high level of abstraction, with the aim of making HPC more accessible to smaller institutions.
... Encryption is used to keep the confidentiality and integrity of the tasks/data at rest on the volunteer hosts [Fedak et al. 2001;Mengotti 2004] as well as during transit in a network. Sandboxing and virtualization technologies can be used for deployment of guest tasks so that they will be executed in a controlled environment [Cappello et al. 2005;Cappos et al. 2009;Chien et al. 2003;Zhou and Lo 2004]. This will prevent intentional as well as unintentional security attacks from a project server. ...
Article
Full-text available
Volunteer Computing is a kind of distributed computing that harnesses the aggregated spare computing resources of volunteer devices. It provides a cheaper and greener alternative computing infrastructure that can complement dedicated, centralized, and expensive data centres. The aggregated idle computing resources of devices ranging from desktop computers to routers and smart TVs are being utilized to provide the much-needed computing infrastructure for compute-intensive tasks such as scientific simulations and big data analysis. However, the use of Volunteer Computing is still dominated by scientific applications, and only a very small fraction of the potential volunteer nodes are participating. This paper provides a comprehensive survey of Volunteer Computing, covering key technical and operational issues such as security, task distribution, resource management, and incentive models. The paper also presents a taxonomy of Volunteer Computing systems, together with discussions of the characteristics of specific systems in each category. In order to harness the full potential of Volunteer Computing and make it a reliable alternative computing infrastructure for general applications, we need to improve the existing techniques and devise new mechanisms. Thus, this paper also sheds light on important issues regarding the future research and development of Volunteer Computing systems, with the aim of making them a viable alternative computing infrastructure.
... All of these computers contribute part of their resources and, by means of an application, take part in forming a computing grid. Such systems have been set up, for example, by Université Paris-Sud Orsay with the XtremWeb platform [24], or with the BOINC platform [5,21] of the University of Berkeley. The implementations of these platforms highlight the possibility of contributing to various research fields (often medical), by allowing users to exploit the idle or low-activity periods of their machines. ...
Thesis
The work presented in this thesis deals with the scheduling of linear multi-task workflow applications on distributed platforms. The particularity of the system under study is that the number of machines in the platform is smaller than the number of tasks to be performed. In this case the machines are assumed to be able to perform all the tasks of the application after a reconfiguration, with every reconfiguration taking a given amount of time that may or may not depend on the tasks. The problem is to maximize the throughput of the application, i.e. the average number of outputs per unit of time, or to minimize the period, i.e. the average time between two outputs. The problem therefore decomposes into two sub-problems: assigning the tasks to the machines of the platform (one or more tasks per machine), and scheduling these tasks within a given machine, taking the reconfiguration times into account. To this end the platform provides spaces called buffers, either allocatable or imposed, to store temporary production results and thus avoid reconfiguring the machines after each task. If the buffers are not pre-assigned, we must also solve the problem of allocating the available space to buffers in order to optimize the execution of the schedule within each machine. This document is an exhaustive study of the different problems associated with the heterogeneity of the application; indeed, while solving these problems is trivial with homogeneous reconfiguration times and buffers, it becomes much more complex when they are heterogeneous. We therefore study our three major problems for different degrees of heterogeneity of the application, and we propose heuristics to handle these problems when an optimal algorithmic solution cannot be found.
... controlling units [9][10][11]. In DCS, controllers are connected to different field devices such as actuators and sensors; they continuously receive data from them and send the data to other controllers in the hierarchy through a communication bus. Various communication channels are used for this purpose, some of them being Profibus, HART, ARCNET, Modbus, etc. DCS is employed in many different areas, including agriculture and chemical plants. ...
Article
Full-text available
As a disruptive technology, blockchain, particularly in its original form of bitcoin as a type of digital currency, has attracted great attention. Its innovative distributed decision-making and security mechanisms lay the technical foundation for its success, leading us to consider bringing the power of blockchain technology to distributed control and cooperative robotics, where distributed and secure mechanisms are also highly demanded. Indeed, security and distributed communication have long been unsolved problems in the field of distributed control and cooperative robotics, and network failures and intruder attacks on distributed control and multi-robotic systems have been reported. Blockchain technology promises to remedy this situation thoroughly. This work is intended to create a global picture of blockchain technology, its working principles and key elements, in the language of control and robotics, providing a shortcut for beginners stepping into this research field.
... In DCS, data acquisition and the control tasks are performed by microprocessors located near the control area. These controllers can communicate with each other as well as with other controlling units [10]-[12]. In DCS, controllers are connected to different field devices such as actuators and sensors; they continuously receive data from them and send the data to other controllers in the hierarchy through a communication bus. ...
Preprint
Full-text available
As a disruptive technology, blockchain, particularly in its original form of bitcoin as a type of digital currency, has attracted great attention. Its innovative distributed decision-making and security mechanisms lay the technical foundation for its success, leading us to consider bringing the power of blockchain technology to distributed control and cooperative robotics, where distributed and secure mechanisms are also highly demanded. Indeed, security and distributed communication have long been unsolved problems in the field of distributed control and cooperative robotics, and network failures and intruder attacks on distributed control and multi-robotic systems have been reported. Blockchain technology promises to remedy this situation thoroughly. This work is intended to create a global picture of blockchain technology, its working principles and key elements, in the language of control and robotics, providing a shortcut for beginners stepping into this research field.
... Organizations worry about cloud computing service availability [8]. Not only do Desktop Grids such as BOINC or XtremWeb [13] have centralized architectures, causing a potential bottleneck in the continuing evolution of volunteer computing systems, but there are also worrying signs of stagnation in active users and projects. This causes problems related to data storage and distribution [3,16,17]. ...
Article
Public distributed computing is a type of distributed computing in which so-called volunteers provide computing resources to projects. Research shows that public distributed computing has the potential and capabilities required to handle big data mining tasks. Considering that one of the biggest advantages of such a computational model is its low computational resource cost, this raises the question of why the method is not widely used for solving today's computational challenges such as big data mining. The purpose of this paper is to review the capabilities of public distributed computing for big data mining tasks. The outcome of this paper provides the foundation for future research required to bring attention back to this low-cost public distributed computing method and make it a suitable platform for big data analysis.
... Two types of grids are distinguished: computing grids and data grids. In grid computing solutions, such as SETI@home (Anderson, 2002), BOINC (Anderson, 2004), XtremWEB (Cappello, 2004), Diet (Caron, 2006), Globus (Allcock, 2001), and CONFIIT (Flauzac, 2010), resources are associated with computing (processor, memory, etc.). In data grids, such as OceanStore (Kubiatowicz, 2000) and Freenet (Freenet, 2011), resources are associated with data storage. ...
... Developed from the SETI@Home [2] project, this platform allows a researcher to run the required computations on BOINC grid clients, which can be any PCs with Internet connections. Other grid frameworks are XtremWeb [3], UnaCloud [4], and OurGrid [5]. Each of these has a suite of tools to address problems related to the characteristics of a grid environment. ...
Article
Full-text available
In scientific computing, more computational power generally implies faster and possibly more detailed results. The goal of this study was to develop a framework to submit computational jobs to powerful workstations that are underused because they run only non-intensive tasks. This is achieved by using a virtual machine in each of these workstations, where the computations are done. This group of virtual machines is called the Gridlan. The Gridlan framework is intermediate between the cluster and grid computing paradigms. The Gridlan is able to profit from existing cluster software tools, such as resource managers like Torque, so a user with previous experience in cluster operation can dispatch jobs seamlessly. A benchmark test of the Gridlan implementation shows the system's suitability for computational tasks, principally in embarrassingly parallel computations.
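Because the Gridlan reuses standard cluster tooling such as Torque, job dispatch from the user's point of view looks like ordinary PBS submission. The sketch below is a hypothetical example of such a submission: the script contents, resource requests and the compute.py payload are assumptions made for the example, and it only runs where a Torque/PBS installation provides the qsub command.

```python
# Illustrative dispatch of a job through a Torque/PBS resource manager.
# Everything in the script body is an assumption for the example.
import subprocess
import tempfile

PBS_SCRIPT = """#!/bin/bash
#PBS -N gridlan_demo
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
python3 compute.py > result.txt
"""

def submit_job(script_text):
    """Write a PBS script to disk and hand it to qsub; returns the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script_text)
        path = f.name
    out = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    print("submitted:", submit_job(PBS_SCRIPT))
```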
... This will allow the product to earn money for support and further development. Today, if a researcher's main goal is to run calculations with minimum effort on their part, the most appropriate solution would be to use the GridUNAM platform to organize their calculations (Taufer, 2004) (Franck Cappello, 2005). However, if the researchers constantly have upcoming tasks requiring calculations in fairly large volumes, it is better to choose the original BOINC as the platform for organizing a volunteer computing project, and to actively work on involving participants in the project. ...
Article
Full-text available
The article is based on experience of running BOINC projects. We interviewed the developers of projects on the BOINC platform in order to learn from their experience with it: the issues they were confronted with, how they solved them, what changes they made in BOINC, their opinion of the BOINC platform, and what should be improved to make it better. We then studied materials about experience of using the BOINC platform and about BOINC issues. Finally, we drew conclusions about the actions to be taken for the development of BOINC: increase the number of crunchers; rewrite the platform using modern architectural solutions and the latest technologies; and initiate the creation of services providing access to the computing resources of crunchers.
... Finally, decentralized infrastructures do not rely on any central point of reference for device discovery and task allocation; instead, peers are in charge of discovering other peers, handling task allocation and distribution, and collecting the results. This strategy is taken by peer-to-peer infrastructures like OurGrid [22], XtremWeb [15], and the Mini-Grid [6]. For a particular infrastructure, the chosen organization schema does not restrict what each participating device can do in the infrastructure (whether it is symmetric or asymmetric); however, it does constrain the information each machine has access to, a fundamental condition for fulfilling the data acquisition requirement. ...
Thesis
Full-text available
The Mini-Grid is a volunteer computing infrastructure that gathers computational power from multiple participants and uses it to execute bio-informatics algorithms. The Mini-Grid is an instance of a larger set of systems that I call participative computational infrastructures (PCI). PCIs depend on their participants to provide a service, with every instance of the system executing similar tasks and collaborating with others. Participants in these infrastructures come together to contribute resources like computational power, storage capacity, network connectivity and human reasoning skills. While plenty of research has focused on the technical aspects of these infrastructures (task parallelization, distribution, robustness, and security), the participative aspect, which deals with how to recruit and retain participants, has been largely overlooked. Despite the multiple experiences with volunteer computing projects, only a few researchers have looked into the motivational factors affecting the enrolment and permanence of participants. This dissertation studies participation from the broader context of the relationship between users and infrastructures in the field of Human-Computer Interaction (HCI), and argues that participative computational infrastructures face a fundamental recruitment challenge derived from their being "invisible" computational systems. To counter this challenge, this dissertation proposes the notion of Infrastructure Awareness: a feedback mechanism on the state of, and changes in, the properties of computational infrastructures, provided in the periphery of the user's attention, and supporting gradual disclosure of detailed information on the user's request. Working with users of the Mini-Grid, this thesis shows the design process of two infrastructure awareness systems aimed at supporting the recruitment of participants, the implementation of one possible technical strategy, and an in-the-wild evaluation. The thesis concludes with a discussion of the results and implications of infrastructure awareness for participative and other computational infrastructures.
... Another well-known framework is the XtremWeb [12] research and development project. It was designed to create light and flexible distributed computing networks locally within universities, companies or any other local networks. ...
Article
Full-text available
Existing solutions to the problem of finding valuable information on the Web suffer from several limitations, such as simplified query languages, out-of-date information or arbitrary sorting of results. In this paper a different approach to this problem is described. It is based on the idea of distributed processing of Web page content. To provide sufficient performance, the idea of browser-based volunteer computing is utilized, which requires the implementation of text processing algorithms in JavaScript. In this paper the architecture of the Web page content analysis system is presented, details concerning the implementation of the system and of the text processing algorithms are described, and test results are provided.
... The Cloud@Home goal is to use "domestic" computing resources to build desktop Clouds made of voluntarily contributed resources. Therefore, following the volunteer computing wave [1], across Grid computing and desktop Grids [2,5], we think about desktop Cloud platforms able to engage and retain contributors for providing virtual (processing, storage, networking, sensing) resources as a service, in the Infrastructure as a Service (IaaS) fashion. This novel, revised view of Cloud computing could perfectly fit with private and community needs, but our real, long-term challenge is to exploit it in hybrid and especially in business contexts towards public deployment models. ...
Conference Paper
Recent developments in Cloud computing technology provide capabilities for an extensible, reliable, effective and dynamic infrastructure to technology-enabled enterprises, in order to efficiently leverage (or even monetize) their on-premise equipment. Furthermore, the virtualization technologies powering the Cloud revolution expand their reach by the day, and are nowadays commonly available, nearly household, capabilities. In this light, the intersection between volunteering and Cloud computing may bring massive and ubiquitous compute power for IaaS users. For instance, scientists and researchers, as a category of very demanding users, may benefit from such an enlargement of the pool of resources to tap into for high complexity computational workloads and big data problems without concern for the setup and maintenance of the underlying infrastructure. We have investigated this concept in the past under the Cloud@Home project, aimed at implementing a desktop-powered Cloud. In this paper we propose a blueprint of a Cloud@Home implementation starting from OpenStack, a well-known platform for Cloud solutions, a de-facto standard with variety of features, high interoperability and Open Source support. The reference, layered architecture and the preliminary implementation of a Cloud@Home framework based on OpenStack are discussed in the paper.
... There are several centralized desktop grid systems, such as BOINC [44], Condor [45] and XtremWeb [46], and decentralized ones such as CCOF (Cluster Computing On the Fly) [47]. However, none of these systems address the issue of providing incentives for the donation of resources. ...
Article
Full-text available
Desktop grids (DG) offer large amounts of computing power coming from Internet-based volunteer networks. They suffer from the free-riding phenomenon: it may be possible for users to free-ride, consuming resources donated by others while not donating any of their own. In this paper, we present PGTrust, our decentralized free-riding prevention model designed for PastryGrid. PastryGrid is a decentralized DG system which manages resources over a decentralized P2P network. PGTrust relies on the notion of a score, a reputation metric used to evaluate the level of QoS of a peer. We have conducted our experiments on the Grid'5000 testbed. The obtained results demonstrate the benefits of our free-riding prevention model. PGTrust is able to improve application running time by discouraging free-riders and motivating selfish peers to contribute. It offers a considerable speedup for distributed applications.
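The score-based reputation idea behind PGTrust can be sketched very compactly: each peer's score is updated from observed behaviour, and tasks are preferentially delegated to high-score peers, so free-riders gradually lose work. The update rule and constants below are illustrative assumptions, not PGTrust's actual formulas.

```python
# Toy reputation-score sketch: contribution raises a peer's score, free-riding
# lowers it, and the best-scored peer is chosen when delegating work.
class Peer:
    def __init__(self, name, score=0.5):
        self.name = name
        self.score = score

    def record(self, contributed, alpha=0.1):
        target = 1.0 if contributed else 0.0
        # Exponential moving average toward the observed behaviour.
        self.score = (1 - alpha) * self.score + alpha * target

def pick_executor(peers):
    """Prefer the peer with the best reputation when delegating a task."""
    return max(peers, key=lambda p: p.score)

if __name__ == "__main__":
    peers = [Peer("contributor"), Peer("free_rider")]
    for _ in range(20):
        peers[0].record(contributed=True)
        peers[1].record(contributed=False)
    best = pick_executor(peers)
    print({p.name: round(p.score, 2) for p in peers}, "->", best.name)
```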
... On the other hand, the problems of effectively exploiting the existing high-performance resources, taking their heterogeneity into account, have not been completely solved either (see (Fougère et al., 2005), (Díaz et al., 2012) and (Cappello et al., 2005)). One evident way forward is to employ technologies for the virtualization of computer systems and to integrate them with parallel computing technologies. ...
Conference Paper
Full-text available
The paper presents an advanced iterative MapReduce solution that employs Hadoop and MPI technologies. First, we present an overview of working implementations that make use of the same technologies. Then we define an academic example of a numerical problem with an emphasis on its computational features. This definition is used to justify the design of the proposed solution.
... The volunteer computing [1] paradigm has been exploited in several scientific applications (i.e., Seti@home, Folding@home, Einstein@home), but its adoption for mining applications is more challenging. The two most popular volunteer computing platforms available today, BOINC [2] and XtremWeb [6,8], are especially well suited for CPU-intensive applications but are somewhat inappropriate for data-intensive tasks, for two main reasons. First, the centralized nature of such systems requires all data to be served by a group of centrally maintained servers. ...
Conference Paper
Full-text available
Mining@Home was recently designed as a distributed architecture for running data mining applications according to the “volunteer computing” paradigm. Mining@Home already proved its efficiency and scalability when used for the discovery of frequent itemsets from a transactional database. However, it can also be adopted in several different scenarios, especially in those where the overall application can be divided into distinct jobs that may be executed in parallel, and input data can be reused, which naturally leads to the use of data cachers. This paper describes the architecture and implementation of the Mining@Home system and evaluates its performance for the execution of ensemble learning applications. In this scenario, multiple learners are used to compute models from the same input data, so as to extract a final model with stronger statistical accuracy. Performance evaluation on a real network, reported in the paper, confirms the efficiency and scalability of the framework.
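The data-cacher idea that Mining@Home relies on (input partitions are fetched once and reused by many independent jobs, such as the learners of an ensemble) can be sketched as follows; the class name and the fake fetch function are invented for the example and do not mirror the Mining@Home implementation.

```python
# Minimal data-cacher sketch: a partition is downloaded on first request and
# served from the local cache afterwards, so several learners share one fetch.
class DataCacher:
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn   # downloads a partition from the data source
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, partition_id):
        if partition_id in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[partition_id] = self.fetch_fn(partition_id)
        return self.cache[partition_id]

def fetch_from_server(partition_id):
    # Stand-in for a (slow) remote transfer of one data partition.
    return [partition_id * 10 + i for i in range(5)]

if __name__ == "__main__":
    cacher = DataCacher(fetch_from_server)
    # Three "learners" of an ensemble all read the same two partitions.
    for learner in range(3):
        for pid in (0, 1):
            cacher.get(pid)
    print("hits:", cacher.hits, "misses:", cacher.misses)  # hits: 4, misses: 2
```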
... Indeed, these often require deploying the same application several times while varying parameters each time. Among deployment tools we can cite DeployWare [53], OpenCCM [90,154], Pegasus [44,43], SmartFrog [65], Concerto [88], TAKTUK [89], APST [35], and XtremWeb [31]. In this section, we present in more detail the tools related to the deployment of applications on grids: JDF [11], GODIET [32], ADAGE [81] and KADEPLOY [62]. ...
Article
Federating physical resources located in different universities, institutes and companies leads to the concept of grid computing. These infrastructures are particularly well suited to supporting the heavy computing demand coming from scientific distributed applications. Unfortunately, both applications and infrastructures are complex to use, especially when dealing with the very first deployment step. This requires the user to select physical resources, transfer programs and monitor the execution of the application. As of today, a large number of systems can automate these operations in very simple static cases. Unfortunately, only a few of them can handle complex deployments such as the re-deployment of some additional parts of the application or the coordinated deployment of multiple applications. In this thesis we propose a model that helps in dynamically deploying applications over computing grids. This model offers two main functionalities. First, it translates high-level application-specific actions into low-level generic operations to manage resources. Second, it performs a pre-planning of deployments, as well as re-deployments and co-deployments. This model satisfies three properties: 1) resource management is made transparent to the application and the user; 2) actions are specific to each application type; 3) applying the model is as little intrusive as possible with regard to the application programming model and source code. CORDAGE is an architecture that has been proposed to illustrate this model. It has been developed on top of the OAR job scheduler and the ADAGE deployment tool. CORDAGE has been validated using the JXTA peer-to-peer framework, the JUXMEM data-sharing service and the GFARM distributed file system. Our approach has been tested within the GRID'5000 experimental testbed. http://cordage.gforge.inria.fr/
Article
Full-text available
Over the past six decades, the computing systems field has experienced significant transformations, profoundly impacting society with transformational developments, such as the Internet and the commodification of computing. Underpinned by technological advancements, computer systems, far from being static, have been continuously evolving and adapting to cover multifaceted societal niches. This has led to new paradigms such as cloud, fog, edge computing, and the Internet of Things (IoT), which offer fresh economic and creative opportunities. Nevertheless, this rapid change poses complex research challenges, especially in maximizing potential and enhancing functionality. As such, to maintain an economical level of performance that meets ever-tighter requirements, one must understand the drivers of new model emergence and expansion, and how contemporary challenges differ from past ones. To that end, this article investigates and assesses the factors influencing the evolution of computing systems, covering established systems and architectures as well as newer developments, such as serverless computing, quantum computing, and on-device AI on edge devices. Trends emerge when one traces technological trajectory, which includes the rapid obsolescence of frameworks due to business and technical constraints, a move towards specialized systems and models, and varying approaches to centralized and decentralized control. This comprehensive review of modern computing systems looks ahead to the future of research in the field, highlighting key challenges and emerging trends, and underscoring their importance in cost-effectively driving technological progress.
Article
Hybrid Cloud environments allow the utilization of local resources in private Clouds together with resources from public Clouds when needed. Such environments represent systems with high failure rates because they feature heterogeneous components and a large number of servers under intensive workload, built as complex architectures. For these reasons, the availability of such systems can easily be compromised if the failure of these heterogeneous components is not handled correctly, which may cause request rejection and frequent performance degradation. Providing highly reliable Cloud applications, in particular in a hybrid Cloud environment, is a challenging and critical research problem. Therefore, the question we address in this paper is how to provision resources for user requests in the presence of failures in a hybrid Cloud environment. To this end, we propose a reconfigurable formal model of the hybrid Cloud architecture; we then utilize instantiations of this model, simulation and real-time execution runs to estimate different performance metrics related to fault detection and self-recovery strategies in the hybrid Cloud. Our approach is based on the combination of model-based and probabilistic approaches.
Chapter
A desktop grid system is one of the most common types of distributed systems. Its distinctive features are the high heterogeneity and unreliability of computing nodes. Desktop grid systems deployed on the BOINC platform are considered. To simulate the functioning of the desktop grid, a modified ComBos simulator based on SimGrid is used. The modified ComBos simulator adds support for applications with a limited number of tasks, asynchronous execution of multiple applications, and various computing resources. Data from existing voluntary distributed computing projects were used to simulate the functioning of the desktop grid. The paper deals with the modification of the scheduling system for a desktop grid. The algorithms FS, FCFS, SRPT, and SWRPT were selected from existing heuristic algorithms for comparison, and two heuristic task scheduling algorithms, MSF and MPSF, were proposed. A simulation of the desktop grid was performed based on data from existing voluntary distributed computing projects, taking into account the asynchronous execution of five different computing applications on several types of computing resources. A comparative analysis of the results of the various scheduling algorithms in the desktop grid is carried out. The analysis showed that the proposed MPSF algorithm gives the best results among the compared algorithms. The proposed heuristic scheduling algorithm can be applied to umbrella distributed computing projects and to desktop grids in general.
Chapter
On volunteer computing platforms, inter-task dependency leads to serious performance degradation when failed tasks must be re-executed because of volatile peers. This paper discusses a performance-oriented task dispatch policy based on failure probability estimation. The tasks with the highest failure probabilities are selected for dispatch when multiple task enquiries arrive at the dispatcher, and the estimated failure probabilities are used to find the optimized task assignment that minimizes the overall failure probability of these tasks. This performance-oriented task dispatch policy is evaluated with two real-world trace data sets on a simulator. The evaluation results demonstrate the effectiveness of the policy.
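A toy version of the dispatch policy described above might look like this: when several workers enquire at once, the tasks with the highest estimated failure probabilities are selected, and the task-to-worker assignment minimizing their combined failure probability is chosen by simple enumeration. The probability values and model below are made up for the example, and are not the chapter's estimator.

```python
# Illustrative failure-probability-aware dispatch: pick the most failure-prone
# tasks and assign them to enquiring workers to minimize the product of
# per-assignment failure probabilities.
from itertools import permutations

def best_assignment(task_fail, fail_prob):
    """task_fail: {task: current failure prob}.
    fail_prob[task][worker]: probability of failing if dispatched to that worker."""
    workers = list(next(iter(fail_prob.values())).keys())
    # Select as many of the currently most failure-prone tasks as there are workers.
    tasks = sorted(task_fail, key=task_fail.get, reverse=True)[:len(workers)]
    best, best_score = None, float("inf")
    for perm in permutations(workers, len(tasks)):
        score = 1.0
        for task, worker in zip(tasks, perm):
            score *= fail_prob[task][worker]
        if score < best_score:
            best, best_score = dict(zip(tasks, perm)), score
    return best, best_score

if __name__ == "__main__":
    task_fail = {"t1": 0.9, "t2": 0.7, "t3": 0.2}
    fail_prob = {
        "t1": {"w1": 0.5, "w2": 0.1},
        "t2": {"w1": 0.3, "w2": 0.4},
        "t3": {"w1": 0.2, "w2": 0.2},
    }
    print(best_assignment(task_fail, fail_prob))  # ({'t1': 'w2', 't2': 'w1'}, 0.03)
```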
Chapter
This paper shows how to parallelize a compute-intensive application in mathematics (group theory) for an institutional Desktop Grid platform coordinated by a meta-grid middleware named BonjourGrid. The paper is twofold: it shows how to parallelize a sequential program for a multicore CPU that participates in the computation, and it demonstrates the effort required to launch multiple instances of the solution to the mathematical problem with the BonjourGrid middleware. BonjourGrid is a fully decentralized Desktop Grid middleware. The main results of the paper are: a) an efficient multi-threaded version of a sequential program to compute Littlewood-Richardson coefficients, namely the Multi-LR program, and b) a proof of concept, centered on user needs, for the BonjourGrid middleware dedicated to coordinating multiple instances of programs for Desktop Grids, with the help of Multi-LR. In this paper, the scientific work consists in starting from a model for the solution of a compute-intensive problem in mathematics, incorporating this concrete model into a middleware, and running it on a commodity PC platform managed by an innovative meta Desktop Grid middleware.
Chapter
Cloud Computing (CC) offers simple and cost-effective outsourcing in dynamic service environments and allows the construction of service-based applications extensible with the latest achievements of diverse research areas. CC is built using dedicated and reliable resources and provides uniform, seemingly unlimited capacities. Volunteer Computing (VC), on the other hand, uses volatile, heterogeneous and unreliable resources. This chapter, per the authors, makes an attempt, starting from a definition of Cloud Computing, to identify the required steps and formulate a definition of what can be considered the next evolutionary stage of Volunteer Computing: Volunteer Clouds (VCl). There are many idiosyncrasies of VC to overcome (e.g., volatility, heterogeneity, reliability, responsiveness, scalability, etc.). Heterogeneity exists in VC at different levels, while the vision of CC promises to provide a homogeneous environment. The goal of this chapter, per the authors, is to identify methods and propose solutions that tackle these heterogeneities and thus make a step towards Volunteer Clouds.
Chapter
This article proposes an adaptive fuzzy-logic-based decentralized scheduling mechanism suitable for dynamic computing environments, in which matchmaking is achieved between the resource requirements of outstanding tasks and the resource capabilities of available workers. The feasibility of the proposed method is demonstrated on a real-time system. Experimental results show that implementing the proposed fuzzy matchmaking based scheduling mechanism maximized the resource utilization of executing workers without exceeding the maximum execution time of the task. It is concluded that the efficiency of FMA-based decentralized scheduling, in the case of parallel execution, is reduced by increasing the number of subtasks.
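One possible reading of fuzzy matchmaking between task requirements and worker capabilities is sketched below: each resource dimension yields a degree of satisfaction in [0, 1], the degrees are aggregated with a fuzzy AND (min), and the task goes to the best-scoring worker. The membership function, the aggregation rule and the sample resource figures are assumptions for the example, not the chapter's exact design.

```python
# Toy fuzzy matchmaking: per-resource satisfaction degrees aggregated with min.
def degree_of_satisfaction(required, offered):
    """1.0 when the offer meets or exceeds the requirement, decaying linearly to 0."""
    if offered >= required:
        return 1.0
    return max(0.0, offered / required)

def match_score(task_req, worker_cap):
    # Fuzzy AND over all required resource dimensions.
    return min(degree_of_satisfaction(task_req[k], worker_cap.get(k, 0.0))
               for k in task_req)

def pick_worker(task_req, workers):
    scored = {name: match_score(task_req, cap) for name, cap in workers.items()}
    return max(scored, key=scored.get), scored

if __name__ == "__main__":
    task = {"cpu_ghz": 2.0, "ram_gb": 4.0, "disk_gb": 10.0}
    workers = {
        "laptop":  {"cpu_ghz": 1.6, "ram_gb": 8.0, "disk_gb": 50.0},
        "desktop": {"cpu_ghz": 3.2, "ram_gb": 4.0, "disk_gb": 20.0},
    }
    print(pick_worker(task, workers))   # desktop matches fully, laptop partially
```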
Chapter
In this chapter we introduce the key concepts of SOA, grid, and cloud computing and the relations between them. The chapter illustrates the paradigm shift in technological services due to the incorporation of these models and how we can combine them to develop highly scalable application systems such as petascale computing. We also cover some concepts of Web 2.0 and why it needs grid computing and the on-demand enterprise model. Finally, we discuss some standardization efforts on these models as a further step in developing interoperable grid systems.
Chapter
Network security is a constantly evolving domain. Every day, new attacks, viruses, and intrusion techniques are released. Hence, network devices, enterprise servers, and personal computers are potential targets of these attacks. Current security solutions like firewalls, intrusion detection systems (IDS), and virtual private networks (VPN) are centralized solutions, which rely mostly on the analysis of inbound network connections. This approach notably forgets the effects of a rogue station, whose communications cannot be easily controlled unless the administrators establish a global authentication policy using methods like 802.1x to control all network communications among each device. To the best of the authors' knowledge, a distributed and easily manageable solution for the global security of an enterprise network does not exist. In this chapter, they present a new approach to deploying a distributed security solution where communication between devices can be controlled in a collaborative manner. Indeed, each device has its own security rules, which can be shared and improved through exchanges with other devices. With this new approach, called the grid of security, a community of devices ensures that a device is trustworthy and that communications between devices proceed in accordance with the system's security policies. To support this approach, the authors present a new communication model that helps structure the distribution of security services among the devices. This can secure ad-hoc, local-area or enterprise networks in a decentralized manner, preventing the risk of a security breach in the case of a failure.
Article
Full-text available
Volunteer computing resembles private desktop grids, whereas desktop grids are not fully equivalent to volunteer computing. There are several attempts to distinguish and categorize them using informal and formal methods. However, most formal approaches model a particular middleware and do not focus on the general notion of volunteer or desktop grid computing. This work makes an attempt to formalize their characteristics and relationship. To this end, formal modeling is applied that tries to grasp the semantics of their functionalities, as opposed to comparisons based on properties, features, etc. We apply this modeling method to formalize the Berkeley Open Infrastructure for Network Computing (BOINC) [Anderson D. P., 2004] volunteer computing system.
Chapter
Different forms of parallel computing have been proposed to address the high computational requirements of many applications. Building on advances in parallel computing, volunteer computing has been shown to be an efficient way to exploit the computational resources of under utilized devices that are available around the world. The idea of including mobile devices, such as smartphones and tablets, in existing volunteer computing systems has recently been investigated. In this chapter, we present the current state of the art in the mobile volunteer computing research field, where personal mobile devices are the elements that perform the computation. Starting from the motivations and challenges behind the adoption of personal mobile devices as computational resources, we then provide a literature review of the different architectures that have been proposed to support parallel computing on mobile devices. Finally, we present some open issues that need to be investigated in order to extend user participation and improve the overall system performance for mobile volunteer computing.
Article
This paper presents a comprehensive survey on filtering-based defense mechanisms against distributed denial of service (DDoS) attacks. Several filtering techniques are analyzed and their advantages and disadvantages are presented. In order to help network security analysts choose the most appropriate mechanism according to their security requirements, a comparative classification of these methods is provided. The relevant research efforts are identified and discussed for rendering the current state of the art in the literature. This classification will also serve researchers to address weaknesses of these filtering methods, and thus mitigate DDoS attacks using more effective defense mechanisms.
Article
MapReduce offers an ease-of-use programming paradigm for processing large datasets. In our previous work, we designed a MapReduce framework called BitDew-MapReduce for desktop grid and volunteer computing environments, which allows non-expert users to run data-intensive MapReduce jobs on top of volunteer resources over the Internet. However, network distance and resource availability have a great impact on MapReduce applications running over the Internet. To address this, an availability- and network-aware MapReduce framework over the Internet is proposed. Simulation results show that the MapReduce job response time can be decreased by 40.05%, thanks to Weighted Naive Bayes Classifier-based availability prediction and landmark-based network estimation. The effectiveness of the new MapReduce framework is further proved by a performance evaluation in a real distributed environment.
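The availability-prediction ingredient mentioned above (a Weighted Naive Bayes classifier deciding whether a volunteer node is likely to be up) can be illustrated with a tiny classifier over categorical features. The features, weights, history data and Laplace smoothing below are illustrative choices, not the paper's actual model.

```python
# Small weighted Naive Bayes sketch for predicting node availability from
# categorical features; feature log-likelihoods are scaled by per-feature weights.
import math
from collections import defaultdict

class WeightedNaiveBayes:
    def __init__(self, weights):
        self.weights = weights                       # per-feature importance
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def fit(self, samples):
        for features, label in samples:
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feat_counts[(label, name)][value] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for name, value in features.items():
                seen = self.feat_counts[(label, name)]
                p = (seen[value] + 1) / (count + 2)  # Laplace smoothing
                score += self.weights.get(name, 1.0) * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

if __name__ == "__main__":
    # (hour-of-day bucket, weekday?) -> was the volunteer node available?
    history = [({"hour": "night", "weekday": True}, "up"),
               ({"hour": "night", "weekday": False}, "up"),
               ({"hour": "day", "weekday": True}, "down"),
               ({"hour": "day", "weekday": False}, "up")]
    model = WeightedNaiveBayes(weights={"hour": 2.0, "weekday": 0.5})
    model.fit(history)
    print(model.predict({"hour": "day", "weekday": True}))
```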
Article
This paper presents the Virtual EZ Grid project, based on the XtremWeb-CH (XWCH) volunteer computing platform. The goal of the project is to introduce a flexible distributed computing system, with (i) an infrastructure with a non-trivial amount of computing resources from various institutes, (ii) a stable platform that manages these computing resources and provides advanced interfaces for applications, and (iii) a set of applications that benefit from the platform. This paper concentrates on the application support of the new version of XWCH, and describes how two medical applications, MedGIFT and NeuroWeb, utilise it.
Conference Paper
This paper attempts to decentralize volunteer computing (VC) coordination with the goal of reducing the reliance on a central coordination server, which has been criticized as a performance bottleneck and single point of failure. After analyzing the roles and functions that the VC components play in the centralized master/worker coordination model, this paper proposes a decentralized VC coordination framework based on a distributed hash table (DHT) and a peer-to-peer (P2P) overlay, and successfully maps the centralized VC coordination onto distributed VC coordination. The proposed framework has been implemented on the performance-proven DHT P2P overlay Chord. Initial verification has demonstrated the effectiveness of the framework when working in distributed environments.
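The core of DHT-based coordination is the mapping from task identifiers to responsible peers. The sketch below shows the Chord-style idea of hashing peers and task ids onto a ring and routing each task to its successor peer; the 16-bit ring, peer names and task ids are assumptions for the example, not the paper's implementation.

```python
# Consistent-hashing sketch: tasks are routed to the first peer whose position
# on the ring follows the task's hash (Chord-style successor lookup).
import hashlib
from bisect import bisect_right

RING_BITS = 16

def ring_hash(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** RING_BITS)

class Ring:
    def __init__(self, peers):
        self.points = sorted((ring_hash(p), p) for p in peers)

    def coordinator_for(self, task_id):
        h = ring_hash(task_id)
        idx = bisect_right([pt for pt, _ in self.points], h)
        return self.points[idx % len(self.points)][1]   # wrap around the ring

if __name__ == "__main__":
    ring = Ring([f"peer-{i}" for i in range(5)])
    for task in ("task-42", "task-43", "task-44"):
        print(task, "->", ring.coordinator_for(task))
```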
Conference Paper
Desktop Grids are composed of several thousands of resources. They are characterized by high volatility of resources, due to voluntary disconnections or failures, which can affect the proper termination of application execution. PastryGrid is a decentralized system that manages desktop grid resources and user applications over a fully decentralized P2P network. In this paper we present PastryGridCP, our checkpoint-based rollback-recovery protocol designed for the decentralized Desktop Grid system PastryGrid. It provides fault tolerance for grid applications and ensures the termination of application execution in a way that is transparent to users. We have conducted experiments on 110 nodes of Grid’5000. The results validate our protocol and show improved application performance.
Article
Desktop grids are a form of grid computing that incorporates desktop resources into the grid infrastructure. In desktop grids, it is important to guarantee fast turnaround time in the presence of dynamic properties such as volatility and heterogeneity. In this paper, we propose a nearest neighbor (NN)-based task scheduling that can selectively allocate tasks to the resources that are suitable for the current situation of a desktop grid environment. The experimental results show that our scheduling is more efficient than existing scheduling approaches with respect to reducing both turnaround time and the number of resources consumed.
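A minimal sketch of the nearest-neighbor idea follows; the feature set, metric and all names are assumptions for illustration, not the paper's actual attributes. Each resource advertises a small numeric profile, and a task is allocated to the resource whose profile is closest to the task's desired profile.

import java.util.List;

// Sketch: nearest-neighbour task allocation in a desktop grid.
public class NearestNeighbourScheduler {

    // Hypothetical resource descriptor: e.g. normalised CPU speed, availability, free memory.
    public record Resource(String name, double[] features) {}

    private static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Returns the resource whose advertised features are nearest to the task profile.
    public static Resource select(double[] taskProfile, List<Resource> resources) {
        Resource best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Resource r : resources) {
            double d = distance(taskProfile, r.features());
            if (d < bestDist) { bestDist = d; best = r; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Resource> pool = List.of(
            new Resource("desktop-1", new double[]{0.9, 0.4, 0.5}),   // fast but often interrupted
            new Resource("desktop-2", new double[]{0.5, 0.9, 0.6}),   // slower but highly available
            new Resource("desktop-3", new double[]{0.7, 0.7, 0.8}));
        double[] task = {0.6, 0.8, 0.6};   // this task values availability over raw speed
        System.out.println("allocate to: " + select(task, pool).name());
    }
}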
Article
Service Oriented Architecture (SOA) and Web Services play an invaluable role in grid and cloud computing models and are widely seen as a base for new models of distributed applications and system management tools. SOA, grid and cloud computing models share core behavioral features and characteristics, creating a synergy for developing and implementing new services that facilitate the on-demand computing model. In this chapter we introduce the key concepts of SOA, grid, and cloud computing and the relations between them. The chapter illustrates the paradigm shift in technological services due to the incorporation of these models and how they can be combined to develop highly scalable application systems such as petascale computing. We also cover some concepts of Web 2.0 and why it needs grid computing and the on-demand enterprise model. Finally, we discuss some standardization efforts on these models as a further step towards developing interoperable grid systems.
Article
Contents: Introduction to peer-to-peer systems; The peer-to-peer paradigms; Services on structured overlays; Building trust in P2P systems; Conclusion; Bibliography.
Article
It is important to reduce turnaround time for all tasks in the presence of execution failures in desktop grids. To achieve this objective, this paper proposes a checkpoint-sharing-based replication scheme where each task is allocated to multiple desktop resources under a hybrid P2P desktop grid architecture and intermediate execution results (i.e., checkpoints) can be transferred to other resources for its successive execution. To further reduce turnaround time, sequential task distribution based on checkpoints is applied in the scheme. Performance evaluation shows that our scheme is superior to the existing scheme with respect to reducing both turnaround time and total execution time, regardless of the failure rate.
Chapter
This paper reports on the activities of the IAG Working Group 1.1.1 on combination and comparison of precise orbits based on different space geodetic techniques. It will focus on the Dancer project which implements a distributed parameter estimation process that is scalable in the number of GPS receivers, so that an arbitrarily large number of receivers can be processed in a single reference frame realization. The background of this project will be summarized and its mathematical principles will be explained, as well as the essential aspects of the involved internet communication. It will show that the workload for data processing at a single participating receiver remains independent of the network size, while the data traffic only grows as a logarithmic function of the network size.
Article
In this paper we follow a simple approach that allows machine learning (ML) techniques to be applied to large data sets. More specifically, we study the case of on-demand dynamic creation of a local model in the neighborhood of a target datum instead of creating a global one on the whole training data set. This approach exploits the advanced data structures and algorithms embedded in modern relational databases to rapidly identify the neighborhood of a target datum. Preliminary experimental results from a large-scale classification problem (HIGGS dataset) show that typical machine learning techniques are applicable to large data sets through this approach, under particular conditions. We highlight some restrictions of the method and some issues that arise when implementing it.
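The approach can be pictured with a short, hypothetical JDBC sketch: the database is asked only for the k training rows nearest to the query point, and a throwaway local model (here a simple majority vote) is built over that neighborhood. The table name, column names and connection URL below are assumptions, not the paper's setup.

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

// Sketch: on-demand local model built over a SQL-selected neighbourhood.
public class LocalModelOnDemand {

    public static int classify(Connection db, double f1, double f2, int k) throws SQLException {
        // The database's indexes and executor do the neighbourhood search.
        String sql =
            "SELECT label FROM training_data " +
            "ORDER BY (feature1 - ?) * (feature1 - ?) + (feature2 - ?) * (feature2 - ?) " +
            "LIMIT ?";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setDouble(1, f1); ps.setDouble(2, f1);
            ps.setDouble(3, f2); ps.setDouble(4, f2);
            ps.setInt(5, k);
            Map<Integer, Integer> votes = new HashMap<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) votes.merge(rs.getInt("label"), 1, Integer::sum);
            }
            // The local "model" is just the majority label of the neighbourhood.
            return votes.entrySet().stream()
                        .max(Map.Entry.comparingByValue())
                        .map(Map.Entry::getKey)
                        .orElseThrow();
        }
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection; any JDBC-accessible relational database would do.
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/ml")) {
            System.out.println("predicted label: " + classify(db, 0.31, -1.2, 50));
        }
    }
}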
Article
Full-text available
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve. Keywords: Fault-tolerance, Scientific Computing, Computational Servers, Checkpointing, Migration.
Conference Paper
Full-text available
We describe a theory of authentication and a system that implements it. Our theory is based on the notion of principal and a "speaks for" relation between principals. A simple principal either has a name or is a communication channel; a compound principal can express an adopted role or delegation of authority. The theory explains how to reason about a principal's authority by deducing the other principals that it can speak for; authenticating a channel is one important application. We use the theory to explain many existing and proposed mechanisms for security. In particular, we describe the system we have built. It passes principals efficiently as arguments or results of remote procedure calls, and it handles public and shared key encryption, name lookup in a large name space, groups of principals, loading programs, delegation, access control, and revocation.
Conference Paper
Full-text available
Omni remote procedure call facility, OmniRPC, is a thread-safe grid RPC facility for cluster and global computing environments. The remote libraries are implemented as executable programs in each remote computer, and OmniRPC automatically allocates remote library calls dynamically on appropriate remote computers to facilitate location transparency. We propose to use OpenMP as an easy-to-use and simple programming environment for the multi-threaded client of OmniRPC. We use the POSIX thread implementation of the Omni OpenMP compiler which allows multi-threaded execution of OpenMP programs by POSIX threads even on a single processor. Multiple outstanding requests of OmniRPC calls in an OpenMP work-sharing construct are dispatched to different remote computers to exploit network-wide parallelism.
Conference Paper
Full-text available
This paper discusses preliminary work on standardizing and implementing a remote procedure call (RPC) mechanism for grid computing. The GridRPC API is designed to address the lack of a standardized, portable, and simple programming interface. Our initial work on GridRPC shows that client access to existing grid computing systems such as NetSolve and Ninf can be unified via a common API, a task that has proven to be problematic in the past.
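The GridRPC API itself is specified in C; the Java sketch below is only a hypothetical analogue of the calling pattern it standardizes (bind a function handle to a remote service, then mix synchronous and asynchronous calls), and none of the class or method names belong to the real API.

import java.util.concurrent.*;

// Hypothetical illustration of the handle/call/wait pattern of a GridRPC-style client.
public class GridRpcPatternSketch {

    // A "function handle": the binding of a remote routine name to a server.
    public record FunctionHandle(String server, String routine) {}

    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Synchronous call: blocks until the remote routine returns.
    public double call(FunctionHandle h, double[] args) {
        return invokeRemote(h, args);
    }

    // Asynchronous call: returns immediately with a session the caller can wait on.
    public Future<Double> callAsync(FunctionHandle h, double[] args) {
        return pool.submit(() -> invokeRemote(h, args));
    }

    // Stand-in for the real remote invocation (marshalling + network + remote execution).
    private double invokeRemote(FunctionHandle h, double[] args) {
        double s = 0;
        for (double a : args) s += a * a;   // pretend the named server computed this
        return s;
    }

    public static void main(String[] args) throws Exception {
        GridRpcPatternSketch rpc = new GridRpcPatternSketch();
        FunctionHandle norm = new FunctionHandle("server-a.example.org", "squared_norm");
        Future<Double> async = rpc.callAsync(norm, new double[]{3, 4});   // overlap with local work
        double sync = rpc.call(norm, new double[]{1, 2, 2});
        System.out.println("sync=" + sync + " async=" + async.get());
        rpc.pool.shutdown();
    }
}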
Conference Paper
Full-text available
Ninf is an ongoing global network-wide computing infrastructure project which allows users to access computational resources including hardware, software and scientific data distributed across a wide area network. Ninf is intended not only to exploit high performance in network parallel computing, but also to provide high-quality numerical computation services and access to scientific databases published by other researchers. Computational resources are shared as Ninf remote libraries executable on a remote Ninf server. Users can build an application by calling the libraries with the Ninf Remote Procedure Call, which is designed to provide a programming interface similar to conventional function calls in existing languages, and is tailored for scientific computation. In order to facilitate location transparency and network-wide parallelism, the Ninf metaserver maintains global resource information regarding computational servers and databases, allocating and scheduling coarse-grained computation for global load balancing. Ninf also interfaces with WWW browsers for easy accessibility.
Conference Paper
Full-text available
Omni remote procedure call facility, OmniRPC, is a thread-safe grid RPC facility for cluster and global computing environments. The remote libraries are implemented as executable programs in each remote computer, and OmniRPC automatically allocates remote library calls dynamically on appropriate remote computers to facilitate location transparency. We propose to use OpenMP as an easy-to-use and simple programming environment for the multi-threaded client of OmniRPC. We use the POSIX thread implementation of the Omni OpenMP compiler which allows multi-threaded execution of OpenMP programs by POSIX threads even in a single processor. Multiple outstanding requests of OmniRPC calls in OpenMP work-sharing construct are dispatched to different remote computers to exploit network-wide parallelism.
Conference Paper
Full-text available
It has been reported [25] that life holds but two certainties, death and taxes. And indeed, it does appear that any society, and in the context of this article, any large-scale distributed system, must address both death (failure) and the establishment and maintenance of infrastructure (which we assert is a major motivation for taxes, so as to justify our title!). Two supposedly new approaches to distributed computing have emerged in the past few years, both claiming to address the problem of organizing large-scale computational societies: peer-to-peer (P2P) [15, 36, 49] and Grid computing [21]. Both approaches have seen rapid evolution, widespread deployment, successful application, considerable hype, and a certain amount of (sometimes warranted) criticism. The two technologies appear to have the same final objective, the pooling and coordinated use of large sets of distributed resources, but are based in different communities and, at least in their current designs, focus on different requirements.
Conference Paper
Full-text available
We describe a theory of authentication and a system that implements it. Our theory is based on the notion of principal and a ‘speaks for’ relation between principals. A simple principal either has a name or is a communication channel; a compound principal can express an adopted role or delegated authority. The theory shows how to reason about a principal’s authority by deducing the other principals that it can speak for; authenticating a channel is one important application. We use the theory to explain many existing and proposed security mechanisms. In particular, we describe the system we have built. It passes principals efficiently as arguments or results of remote procedure calls, and it handles public and shared key encryption, name lookup in a large name space, groups of principals, program loading, delegation, access control, and revocation.
Conference Paper
Full-text available
Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.
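The sender-based message-logging idea can be sketched as follows (an illustration under simplifying assumptions, not MPICH-V2 code): every outgoing message is kept in the sender's volatile memory together with a logical clock, so that after a receiver restarts from its last uncoordinated checkpoint the senders can replay exactly the messages it had already consumed; only the small event records (sender, logical clock) would be pushed to a reliable remote logger.

import java.util.*;

// Sketch of sender-based message logging with logical clocks.
public class SenderBasedMessageLog {

    public record LoggedMessage(int destRank, long logicalClock, byte[] payload) {}

    private long clock = 0;
    private final List<LoggedMessage> log = new ArrayList<>();

    // Called instead of a bare send: log the payload locally, then transmit.
    public LoggedMessage send(int destRank, byte[] payload) {
        LoggedMessage m = new LoggedMessage(destRank, ++clock, payload.clone());
        log.add(m);                 // payload stays in the sender's memory, not on a server
        transmit(m);                // normal delivery path
        return m;
    }

    // Replay everything a restarted receiver consumed after its checkpointed clock value.
    public void replayFor(int destRank, long lastCheckpointedClock) {
        log.stream()
           .filter(m -> m.destRank() == destRank && m.logicalClock() > lastCheckpointedClock)
           .sorted(Comparator.comparingLong(LoggedMessage::logicalClock))
           .forEach(this::transmit);
    }

    private void transmit(LoggedMessage m) {
        System.out.printf("deliver clock=%d to rank %d (%d bytes)%n",
                          m.logicalClock(), m.destRank(), m.payload().length);
    }

    public static void main(String[] args) {
        SenderBasedMessageLog rank0 = new SenderBasedMessageLog();
        rank0.send(1, "a".getBytes());
        rank0.send(1, "b".getBytes());
        rank0.send(2, "c".getBytes());
        rank0.replayFor(1, 1);      // rank 1 restarts from a checkpoint taken after clock 1
    }
}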
Article
Full-text available
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Article
Full-text available
The consensus problem involves an asynchronous system of processes, some of which may be unreliable. The problem is for the reliable processes to agree on a binary value. In this paper, it is shown that every protocol for this problem has the possibility of nontermination, even with only one faulty process. By way of contrast, solutions are known for the synchronous case, the “Byzantine Generals” problem.
Conference Paper
Full-text available
Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new, SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers can be completely configured, as well as the node volatility. We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
Conference Paper
Full-text available
Global computing achieves high throughput computing by harvesting a very large number of unused computing resources connected to the Internet. This parallel computing model targets a parallel architecture defined by a very high number of nodes, poor communication performance and continuously varying resources. The unprecedented scale of the global computing architecture paradigm requires us to revisit many basic issues related to parallel architecture programming models, performance models, and class of applications or algorithms suitable for this architecture. XtremWeb is an experimental global computing platform dedicated to provide a tool for such studies. The paper presents the design of XtremWeb. Two essential features of this design are multi-applications and high-performance. Accepting multiple applications allows institutions or enterprises to set up their own global computing applications or experiments. High-performance is ensured by scalability, fault tolerance, efficient scheduling and a large base of volunteer PCs. We also present an implementation of the first global application running on XtremWeb
Conference Paper
Full-text available
The design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, are presented. The system aims to maximize the utilization of workstations with as little interference as possible between the jobs it schedules and the activities of the people who own workstations. It identifies idle workstations and schedules background jobs on them. When the owner of a workstation resumes activity at a station, Condor checkpoints the remote job running on the station and transfers it to another workstation. The system guarantees that the job will eventually complete, and that very little, if any, work will be performed more than once. A performance profile of the system is presented that is based on data accumulated from 23 stations during one month
Article
Full-text available
The popularity of peer-to-peer multimedia file sharing applications such as Gnutella and Napster has created a flurry of recent research activity into peer-to-peer architectures. We believe that the proper evaluation of a peer-to-peer system must take into account the characteristics of the peers that choose to participate. Surprisingly, however, few of the peer-to-peer architectures currently being developed are evaluated with respect to such considerations. In this paper, we remedy this situation by performing a detailed measurement study of the two popular peer-to-peer file sharing systems, namely Napster and Gnutella. In particular, our measurement study seeks to precisely characterize the population of end-user hosts that participate in these two systems. This characterization includes the bottleneck bandwidths between these hosts and the Internet at large, IP-level latencies to send packets to these hosts, how often hosts connect and disconnect from the system, how many files hosts share and download, the degree of cooperation between the hosts, and several correlations between these characteristics. Our measurements show that there is significant heterogeneity and lack of cooperation across peers participating in these systems.
Conference Paper
Grid technologies enable large-scale sharing of resources within formal or informal consortia of individuals and/or institutions: what are sometimes called virtual organizations. In these settings, the discovery, characterization, and monitoring of resources, services, and computations are challenging problems due to the considerable diversity, large numbers, dynamic behavior, and geographical distribution of the entities in which a user might be interested. Consequently, information services are a vital part of any Grid software infrastructure, providing fundamental mechanisms for discovery and monitoring, and hence for planning and adapting application behavior. We present here an information services architecture that addresses performance, security, scalability, and robustness requirements. Our architecture defines simple low-level enquiry and registration protocols that make it easy to incorporate individual entities into various information structures, such as aggregate directories that support a variety of different query languages and discovery strategies. These protocols can also be combined with other Grid protocols to construct additional higher-level services and capabilities such as brokering, monitoring, fault detection, and troubleshooting. Our architecture has been implemented as MDS-2, which forms part of the Globus Grid toolkit and has been widely deployed and applied.
Conference Paper
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java.
Article
Initial versions of MPI were designed to work efficiently on multi-processors which had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model would have affected their performance. As current HPC systems increase in size with greater potential levels of individual node failure, the need arises for new fault tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues, and also discuss the experimental HARNESS core (G_HCORE) implementation that FT-MPI is built to operate upon.
Article
In this paper, we address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results, by presenting sabotage-tolerance mechanisms that work without depending on checksums or cryptographic techniques. We first analyze the traditional technique of voting and show how it reduces error rates exponentially with redundancy, but requires all work to be done several times, and does not work well when there are many saboteurs. We then present a new technique called spot-checking which reduces the error rate linearly (i.e. inversely) with the amount of work to be done, while only costing an extra fraction of the original time. Integrating these mechanisms, we then present the new idea of credibility-based fault-tolerance, wherein we estimate the conditional probability of results and workers being correct, based on the results of using voting, spot-checking and other techniques, and then use these probability estimates to direct the use of further redundancy. Using this technique, we are able to attain mathematically guaranteeable levels of correctness, and do so with much smaller slowdown than possible with voting or spot-checking alone. Finally, we validate these new ideas with Monte Carlo simulations, and discuss other possible variations of these techniques.
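A minimal way to see the exponential claim for voting, under the simplifying assumption that each returned result is erroneous independently with probability f < 1/2 and every work unit is replicated m times (m odd) with majority voting, is the standard Chernoff-type bound:

\Pr[\text{wrong result accepted}] \;=\; \sum_{k=\lceil m/2 \rceil}^{m} \binom{m}{k} f^{k}(1-f)^{m-k} \;\le\; \bigl(4f(1-f)\bigr)^{m/2}.

Since 4f(1-f) < 1 for f < 1/2, the bound decays exponentially in the redundancy m, but the base approaches 1 as f approaches 1/2, which is exactly the many-saboteurs regime where the abstract notes voting performs poorly; spot-checking and credibility-based fault tolerance target that regime while requiring far less redundant work.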
Article
MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
Conference Paper
This paper presents the design and evaluation of Pastry, a scalable, distributed object location and routing substrate for wide-area peer-to-peer applications. Pastry performs application-level routing and object location in a potentially very large overlay network of nodes connected via the Internet. It can be used to support a variety of peer-to-peer applications, including global data storage, data sharing, group communication and naming. Each node in the Pastry network has a unique identifier (nodeId). When presented with a message and a key, a Pastry node efficiently routes the message to the node with a nodeId that is numerically closest to the key, among all currently live Pastry nodes. Each Pastry node keeps track of its immediate neighbors in the nodeId space, and notifies applications of new node arrivals, node failures and recoveries. Pastry takes into account network locality; it seeks to minimize the distance messages travel, according to a scalar proximity metric such as the number of IP routing hops. Pastry is completely decentralized, scalable, and self-organizing; it automatically adapts to the arrival, departure and failure of nodes. Experimental results obtained with a prototype implementation on an emulated network of up to 100,000 nodes confirm Pastry’s scalability and efficiency, its ability to self-organize and adapt to node failures, and its good network locality properties.
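A much-simplified routing sketch follows; it is an assumption-laden illustration rather than Pastry's algorithm, which resolves one identifier digit per hop using a prefix-organized routing table and a leaf set, and whose identifier space is circular. Here each node simply forwards a message to whichever node it knows is numerically closest to the key, and delivers it when the node itself is closest, which preserves the delivery invariant stated in the abstract.

import java.math.BigInteger;
import java.util.Set;

// Sketch: greedy "numerically closest nodeId" forwarding.
public class PastryLikeRouting {

    private static BigInteger distance(BigInteger a, BigInteger b) {
        return a.subtract(b).abs();
    }

    // One routing step on node `self`, which knows `neighbours` (its leaf set + table entries).
    // Returns the nodeId to forward to, or `self` when the message should be delivered here.
    public static BigInteger nextHop(BigInteger self, Set<BigInteger> neighbours, BigInteger key) {
        BigInteger best = self;
        for (BigInteger n : neighbours) {
            if (distance(n, key).compareTo(distance(best, key)) < 0) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        BigInteger self = new BigInteger("1200");
        Set<BigInteger> known = Set.of(new BigInteger("800"),
                                       new BigInteger("1500"),
                                       new BigInteger("4000"));
        BigInteger key = new BigInteger("1450");
        BigInteger hop = nextHop(self, known, key);
        System.out.println(hop.equals(self) ? "deliver locally" : "forward to node " + hop);
    }
}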
Conference Paper
Global Computing harvests the idle time of Internet-connected computers to run very large distributed applications. The unprecedented scale of the Global Computing System (GCS) paradigm requires revisiting the basic issues of distributed systems: performance models, security, fault-tolerance and scalability. The first parts of this paper review recent work in Global Computing, with particular interest in Peer-to-Peer systems. In the last section, we present XtremWeb, the Global Computing System we are currently developing.
Article
The “worm” programs were an experiment in the development of distributed computations: programs that span machine boundaries and also replicate themselves in idle machines. A “worm” is composed of multiple “segments,” each running on a different machine. The underlying worm maintenance mechanisms are responsible for maintaining the worm—finding free machines when needed and replicating the program for each additional segment. These techniques were successfully used to support several real applications, ranging from a simple multimachine test program to a more sophisticated real-time animation system harnessing multiple machines.
Article
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.
Article
One of the most interesting challenges for part of the high-performance community is to exploit existing computing resources for executing long-running number-crunching applications. Several important issues have to be addressed, such as portability, robustness, security, heterogeneity, load balancing and fault tolerance. Java is an emerging language that is receiving extraordinary enthusiasm and acceptance from several fields of programming. Interestingly, it presents some nice characteristics that partially solve some of those problems. This paper briefly describes JET, a parallel library implemented in Java that supports the execution of parallel applications over the Web. It is oriented to Master/Worker applications which present a coarse-grain task distribution. The library provides a high-level programming interface, support for fault-tolerance and some schemes to mask the latency of the communication. It can be used to execute massively distributed applications usi...
Article
Java offers the basic infrastructure needed to integrate computers connected to the Internet into a seamless parallel computational resource: a flexible, easily-installed infrastructure for running coarse-grained parallel applications on numerous, anonymous machines. Ease of participation is seen as a key property for such a resource to realize the vision of a multiprocessing environment comprising thousands of computers. We present Javelin, a Java-based infrastructure for global computing. The system is based on Internet software technology that is essentially ubiquitous: Web technology. Its architecture and implementation require participants to have access only to a Java-enabled Web browser. The security constraints implied by this, the resulting architecture, and current implementation are presented. The Javelin architecture is intended to be a substrate on which various programming models may be implemented. Several such models are presented: A Linda Tuple Space, an SPMD ...
Conference Paper
This paper presents the design and implementation of a remote procedure call (RPC) API for programming applications in Peer-to-Peer environments. The P2P-RPC API is designed to address one neglected aspect of Peer-to-Peer computing: the lack of a simple programming interface. In this paper we examine one concrete implementation of the P2P-RPC API derived from OmniRPC (an existing RPC API for the Grid based on the Ninf system). This new API is implemented on top of the low-level functionalities of the XtremWeb Peer-to-Peer Computing System. The minimal API defined in this paper provides a basic mechanism for migrating a wide variety of applications that use an RPC mechanism to Peer-to-Peer systems. We evaluate P2P-RPC on a numerical application (the NAS EP Benchmark) and demonstrate its performance and fault tolerance properties.
Conference Paper
Grid technologies enable large-scale sharing of resources within formal or informal consortia of individuals and/or institutions: what are sometimes called virtual organizations. In these settings, the discovery, characterization, and monitoring of resources, services, and computations are challenging problems due to the considerable diversity, large numbers, dynamic behavior, and geographical distribution of the entities in which a user might be interested. Consequently, information services are a vital part of any Grid software infrastructure, providing fundamental mechanisms for discovery and monitoring, and hence for planning and adapting application behavior. We present an information services architecture that addresses performance, security, scalability, and robustness requirements. Our architecture defines simple low-level enquiry and registration protocols that make it easy to incorporate individual entities into various information structures, such as aggregate directories that support a variety of different query languages and discovery strategies. These protocols can also be combined with other Grid protocols to construct additional higher-level services and capabilities such as brokering, monitoring, fault detection, and troubleshooting. Our architecture has been implemented as MDS-2, which forms part of the Globus Grid toolkit and has been widely deployed and applied.
Conference Paper
We address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results by presenting sabotage-tolerance mechanisms that work without depending on checksums or cryptographic techniques. We first analyze the traditional technique of voting, and show how it reduces error rates exponentially with redundancy, but requires all work to be done at least twice, and does not work well when there are many saboteurs. We then present a new technique called spot-checking which reduces the error rate linearly (i.e., inversely) with the amount of work to be done, while only costing an extra fraction of the original time. We then integrate these mechanisms by presenting the new idea of credibility-based fault-tolerance, which uses probability estimates to efficiently limit and direct the use of redundancy. By using voting and spot-checking together credibility-based fault-tolerance effectively allows us to exponentially shrink an already linearly-reduced error rate, and thus achieve error-rates that are orders-of-magnitude smaller than those offered by voting or spot-checking alone. We validate this new idea with Monte Carlo simulations, and discuss how credibility-based fault tolerance can be used with other mechanisms and in other applications.
Conference Paper
This paper describes MW (Master-Worker), a software framework that allows users to quickly and easily parallelize scientific computations using the master-worker paradigm on the Computational Grid. MW provides both a “top-level” interface to application software and a “bottom-level” interface to existing Grid computing toolkits. Both interfaces are briefly described. We conclude with a case study, where the necessary Grid services are provided by the Condor high-throughput computing system, and the MW-enabled application code is used to solve a combinatorial optimization problem of unprecedented complexity.
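The master-worker paradigm that frameworks like MW package can be sketched in a few lines of ordinary Java; this is a toy illustration, not MW's interface, and threads stand in for remote, volatile volunteer nodes. The master owns a queue of independent work units and collects results; workers are interchangeable, and a unit whose worker vanishes would simply be re-queued.

import java.util.concurrent.*;

// Sketch: master-worker execution of independent work units.
public class MasterWorkerSketch {

    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        BlockingQueue<long[]> results = new LinkedBlockingQueue<>();
        int nTasks = 20;
        for (int t = 0; t < nTasks; t++) tasks.add(t);

        int nWorkers = 4;                              // stand-ins for volunteer nodes
        ExecutorService workers = Executors.newFixedThreadPool(nWorkers);
        for (int w = 0; w < nWorkers; w++) {
            workers.submit(() -> {
                Integer task;
                while ((task = tasks.poll()) != null) {
                    long value = compute(task);        // the application-specific kernel
                    results.add(new long[]{task.intValue(), value});
                    // in a real deployment, a crash here would be detected by the master
                    // (timeout/heartbeat) and the lost unit pushed back onto the queue
                }
            });
        }

        long total = 0;                                // master-side result collection
        for (int i = 0; i < nTasks; i++) total += results.take()[1];
        System.out.println("aggregate result = " + total);
        workers.shutdown();
    }

    private static long compute(int task) {            // toy work unit: sum of squares
        long s = 0;
        for (int i = 0; i <= task; i++) s += (long) i * i;
        return s;
    }
}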
Conference Paper
The POPCORN project provides an infrastructure for globally distributed computation over the whole Internet. It provides any programmer connected to the Internet with a single huge virtual parallel computer composed of all processors on the Internet which care to participate at any given moment. The system provides a market-based mechanism of trade in CPU time to motivate processors to provide their CPU cycles for other people's computations. Selling CPU time is as easy as visiting a certain Web site with a Java-enabled browser. Buying CPU time is done by writing a parallel program, using our programming paradigm (and libraries). This paradigm was designed to fit the situation of global computation. A third entity in our system is a market for CPU time, which is where buyers and sellers meet and trade. The system has been implemented and may be visited and used on our Web site: http://www.cs.huji.ac.il/-popcorn
Conference Paper
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java. We developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment
Article
Increasingly, computing addresses collaboration, data sharing, and interaction modes that involve distributed resources, resulting in an increased focus on the interconnection of systems both within and across enterprises. These evolutionary pressures have led to the development of Grid technologies. The authors' work focuses on the nature of the services that respond to protocol messages. Grid provides an extensible set of services that can be aggregated in various ways to meet the needs of virtual organizations, which themselves can be defined in part by the services they operate and share
Article
Hash tables -- which map "keys" onto "values" -- are an essential building block in modern software systems. We believe a similar functionality would be equally valuable to large distributed systems. In this paper, we introduce the concept of a Content-Addressable Network (CAN) as a distributed infrastructure that provides hash table-like functionality on Internet-like scales. The CAN design is scalable, fault-tolerant and completely self-organizing, and we demonstrate its scalability, robustness and low-latency properties through simulation.
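The hash-table-on-a-coordinate-space idea can be sketched as follows; this is illustrative only, with zone splitting, neighbor maintenance and greedy routing omitted, and all names hypothetical. A key is hashed to a point in the d-dimensional unit space, and the (key, value) pair lives at the node whose zone contains that point.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Sketch: CAN-style placement of keys onto zones of a d-dimensional coordinate space.
public class CanStyleKeyPlacement {

    // A zone is a half-open axis-aligned box [lo, hi) owned by one node.
    public record Zone(String owner, double[] lo, double[] hi) {
        boolean contains(double[] p) {
            for (int i = 0; i < p.length; i++)
                if (p[i] < lo[i] || p[i] >= hi[i]) return false;
            return true;
        }
    }

    // Deterministically hash a key to a point in [0,1)^d, one coordinate per 4-byte hash slice.
    public static double[] toPoint(String key, int d) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256").digest(key.getBytes(StandardCharsets.UTF_8));
            double[] p = new double[d];
            for (int i = 0; i < d; i++) {
                byte[] slice = Arrays.copyOfRange(h, 4 * i, 4 * i + 4);
                p[i] = new BigInteger(1, slice).doubleValue() / 4294967296.0;   // divide by 2^32
            }
            return p;
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        // A 2-D space split into two zones owned by two hypothetical nodes.
        List<Zone> zones = List.of(
            new Zone("node-A", new double[]{0.0, 0.0}, new double[]{0.5, 1.0}),
            new Zone("node-B", new double[]{0.5, 0.0}, new double[]{1.0, 1.0}));
        double[] p = toPoint("some-file-name", 2);
        String owner = zones.stream().filter(z -> z.contains(p))
                            .findFirst().map(Zone::owner).orElseThrow();
        System.out.printf("key maps to point (%.3f, %.3f), stored on %s%n", p[0], p[1], owner);
    }
}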
Article
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java. We developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment.
Article
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. The development of NetSolve was motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which require no programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the computational resources available as efficiently as possible. NetSolve offers the ability to look for computational resources on a network, choose the best one available, solve a problem (with retry for fault-tolerance), and return the answer to the user.