Article

Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with Grid

Authors:
  • iExec Blockchain Tech

Abstract

Global Computing systems belong to the class of large-scale distributed systems. Their properties (high computational, storage and communication performance potential, and high resilience) make them attractive in academia and industry as computing infrastructures complementing more classical infrastructures such as clusters or supercomputers. However, generalizing the use of these systems in a multi-user, multi-parallel-programming context requires solutions and mechanisms for many issues, such as programming bag-of-tasks and message-passing parallel applications; securing the applications, the system itself and the computing nodes; and deploying the system to harness resources managed in different ways. In this paper, we present our research, often driven by user demands, towards a computational peer-to-peer system called XtremWeb. We describe (a) the architecture of the system and its motivations, (b) the parallel programming paradigms available in XtremWeb and how they are implemented, (c) the deployment issues and the mechanisms used to simultaneously harness uncoordinated sets of resources and resources managed by batch schedulers, and (d) the security issue and how, inside XtremWeb, we address the protection of the computing resources. We present two multi-parametric applications intended for production use: Aires, which belongs to the high-energy physics (HEP) Auger project, and a protein conformation predictor based on a molecular dynamics simulator. To evaluate performance and volatility tolerance, we present experimental results for bag-of-tasks and message-passing applications. We show that the system can tolerate massive failures, and we discuss the performance of the node protection mechanism. Based on the developments and evolution of the XtremWeb project, we discuss the convergence between Global Computing systems and the Grid.
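As a rough illustration of the coordinator/worker pull model the abstract describes (workers fetch tasks from a coordinator; tasks lost to volatile nodes are re-queued so the bag of tasks eventually completes), the following Python sketch mimics a bag-of-tasks run with unreliable workers. The class names, crash probability and task payload are invented for the example and do not reflect XtremWeb's actual interfaces.

```python
# Minimal pull-model bag-of-tasks sketch in the spirit of a coordinator/worker
# architecture. Illustrative only; not the real XtremWeb API.
import queue
import random
import threading

class Coordinator:
    def __init__(self, tasks):
        self.pending = queue.Queue()
        for t in tasks:
            self.pending.put(t)
        self.results = {}
        self.lock = threading.Lock()

    def request_task(self):
        """Called by a worker to pull the next pending task (None if drained)."""
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def report_result(self, task_id, value):
        with self.lock:
            self.results[task_id] = value

    def report_failure(self, task):
        # Volatile node: re-queue the task so another worker can retry it.
        self.pending.put(task)

def worker(coord, crash_prob=0.2):
    while True:
        task = coord.request_task()
        if task is None:
            return
        task_id, x = task
        if random.random() < crash_prob:     # simulate a volatile worker
            coord.report_failure(task)
            return
        coord.report_result(task_id, x * x)  # the "computation"

if __name__ == "__main__":
    coord = Coordinator([(i, i) for i in range(20)])
    # Keep launching waves of workers until every task has a result.
    while len(coord.results) < 20:
        threads = [threading.Thread(target=worker, args=(coord,)) for _ in range(4)]
        for t in threads: t.start()
        for t in threads: t.join()
    print(len(coord.results), "results collected despite simulated worker failures")
```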


... Among the desired features for the algorithms under consideration (which will potentially be run on non-dedicated local computers, remote devices, grid systems, cloud systems, ubiquitous systems, among others [3,4,5]) we look for ephemerality-awareness, which is related to the self-capability to understand the underlying systems where the algorithm is run, as well as to decide how to proceed taking into account the non-reliable nature of the system. ...
... Think, for example, of the pervasive abundance of networked handheld devices, tablets and, lately, wearables (not to mention more classical devices such as desktop computers) whose computational capabilities are often underexploited. Hence, the concept of Eph-C partially overlaps with ubiquitous computing [4], pervasive computing [6], and volunteer and distributed computing [5,7], but exhibits its own distinctive features, mainly in terms of the extreme dynamism of the underlying resources and the ephemerality-aware nature of the computation, which autonomously adapts to the ever-changing computational landscape, not just trying to fit the inherent volatility of the latter but even trying to turn it to profit. ...
Article
Full-text available
The concept of Ephemeral Computing is an emergent topic that is currently consolidating among the research community. It includes computing systems where the nodes or the connectivity have an ephemeral and thus unpredictable nature. Although the capacity and computing power of small and medium devices (such as smartphones or tablets) are increasing swiftly, their computing capacities are usually underexploited. The availability of highly volatile heterogeneous computing resources capable of running software agents requires suitable algorithms to make proper use of the available resources while circumventing the potential problems that such non-reliable systems may produce. Due to the non-reliable nature of the systems where the algorithms under consideration should run, they have to be ephemerality-aware, having the self-capability to understand this kind of environment and adapt to it by means of flexibility, plasticity and robustness. Because of their decentralized functioning, intrinsic parallelism, resilience, and adaptiveness, bioinspired algorithms are well suited to this endeavour. The papers in this special issue address a variety of issues and concerns in ephemeral and complex domains, including signal reconstruction, large-scale social network analysis, disease detection and prevention, unit deployment, and collaborative hyper-heuristics.
... Currently, the distribution of a computation over a grid by a central server is a well-mastered task, implemented in several infrastructures such as Globus [60] or eXtremWeb [55,20]. Java technology, after having been strongly contested in the distributed computing community, now seems to be recognized for its portability, ease of programming and extensibility. There are currently three proposals based on this technology [161,61]. ...
... This means that all other strategies necessarily lead to a loss of profit and that, under these conditions, the strategy (σ0, δ2) is the best for an agent, whatever the other strategies may be. On the other hand, adding an incentive mechanism changes the equilibrium of the system. ...
Thesis
The peer-to-peer (P2P) model is now used in constrained environments. The decentralization induced by this model pushes back the limits of the client-server model. Nevertheless, in order to guarantee a level of service, it requires the integration of an adapted management infrastructure; this last point constitutes the framework of our work. Concerning the modelling of management information, we designed an extension of CIM for the P2P model. To validate it, we implemented it on Jxta. We then specialized our information model for distributed hash tables (DHTs): we abstracted the operation of DHTs, proposed a set of metrics that characterize their performance, and derived an information model that integrates them. Finally, concerning the organization of the management plane, we proposed a hierarchical model that allows peers to organize themselves into a tree of managers and agents. This proposal was implemented on top of a Pastry implementation.
... Utilizing unused computing resources from users' desktop/notebook machines to solve large computing tasks, known as volunteer computing, is introduced in [10,14]. For example, SETI@home [49] aggregates the computing power of thousands of anonymous volunteer users, residing in different countries around the world, to search for radio signals from extraterrestrial intelligence. ...
... When a job is executed, the BOINC client is exposed such that it starts retrieving and executing workunits from BOINC until the running time of the encapsulating grid job is over. Similarly, Urbah et al. [54] provide a bridge from EGEE grids to XtremWeb desktop grids [14], another volunteer computing platform. For cluster computing, BOINC has been integrated to create a scalable cluster for processing large batch jobs [37]. ...
Article
Full-text available
Deep learning is a very computing-intensive and time-consuming task. Training a sophisticated model within a reasonable time requires an amount of computing resources far greater than a single machine can afford. Normally, GPU clusters are required to reduce the training time of a deep learning model from days to hours. However, building large dedicated GPU clusters is not always feasible, or even effective, for most organizations, due to the cost of purchase, operation and maintenance while such systems are not fully utilized all the time. In this regard, volunteer computing can address this problem as it provides additional computing resources at little or no cost. This work presents a hybrid cluster and volunteer computing platform that scales out GPU clusters into volunteer computing for distributed deep learning. The owners of the machines contribute unused computing resources on their computers to extend the capability of the GPU cluster. The challenge is to seamlessly align the differences between the GPU cluster and the volunteer computing system so as to ensure scalability transparency, while performance is another major concern. We validate the proposed work with two well-known sample cases. The results show an efficient use of our hybrid platform at sub-linear speedup.
... The system was designed with fault tolerance in mind. This allows the mobility of clients, the volatility of workers and the failure of the coordination service [23]. In [24], Abdennadher and Boesch presented an upgraded version called XtremWeb-CH, which supports direct communication between workers to build effective peer-to-peer systems. ...
... A similar situation arises with the job management (R5) requirement. As described in [23], the coordinator component manages and supervises task execution, which includes the functionality described in requirement (R5). Similar functionality is provided by the task consistency component from CometCloud and the job management server from Analytics Cloud, respectively. ...
Conference Paper
Cloud computing has emerged as a new technology that provides on-demand access to a large amount of computing resources. This makes it an ideal environment for executing metaheuristic optimization experiments. In this paper, we investigate the use of cloud computing for metaheuristic optimization. This is done by analyzing job characteristics from our production system and conducting a performance comparison between different execution environments. Additionally, a cost analysis is done to incorporate expenses of using virtual resources.
... The system was designed with fault tolerance in mind. This allows the mobility of clients, the volatility of workers and the failure of the coordination service [23]. In [24], Abdennadher and Boesch presented an upgraded version called XtremWeb-CH, which supports direct communication between workers to build effective peer-to-peer systems. ...
... A similar situation arises with the job management (R5) requirement. As described in [23], the coordinator component manages and supervises task execution, which includes the functionality described in requirement (R5). Similar functionality is provided by the task consistency component from CometCloud and the job management server from Analytics Cloud, respectively. ...
Conference Paper
Cloud computing has gained widespread acceptance in both the scientific and commercial communities. Mathematical optimization is one of the domains that benefit from cloud computing by using additional computing power to reduce the calculation time of optimization problems. Of course this is also true for our field of metaheuristic optimization. Metaheuristics provide powerful methods to solve a wide range of optimization problems and may be used as a foundation for a data analysis service. Due to the lack of an agreed-upon reference architecture, it is quite cumbersome to compare existing solutions with regard to different kinds of aspects (e.g. scalability, custom extensions, workflow, etc.). Besides the usual user working with an optimization service, we also have those who are responsible for architecting and implementing these systems. The lack of a list of requirements and of any formal reference architecture makes it even harder to improve those systems. For that reason we have raised the following questions: i) what are the requirements, ii) what are the commonalities of existing optimization software, and iii) can we deduce a reference architecture for a cloud-based optimization service? This paper presents a comprehensive analysis of current research projects and important requirements in the context of optimization services, which then leads to the definition of a reference architecture and forms the base of any further evaluation. We also present our own hybrid cloud-based optimization service (OaaS), which is built upon the PaaS approach of Windows Azure. OaaS defines a generic and extensible service which can be adapted to support custom optimization scenarios.
... So far this field has experienced little integration with the area of distributed and peer-to-peer data mining. The main reason for this is the centralized nature of the popular volunteer computing platforms available today, such as BOINC [4] and XtremWeb [5,6], which requires all data to be served by a group of centrally maintained servers. However, the centralized approach can generate bottlenecks and single points of failure in the system. ...
... The volunteer computing [3] paradigm has been exploited in several scientific applications (i.e., Seti@home, Folding@home, Einstein@home), but its adoption for mining applications is more challenging. The two most popular volunteer computing platforms available today, BOINC [4] and XtremWeb [5,6], are especially well suited for CPU-intensive applications but are somewhat inappropriate for data-intensive tasks, for two main reasons. First, the centralized nature of such systems requires all data to be served by a group of centrally maintained servers. ...
... In this manner, remote sensing is defined as a big data problem, following the 5Vs definition of Big Data (volume, variety, velocity, veracity, and value) [4], bringing new challenges in data storage, data management, and data processing. In order to overcome the challenges related to data storage and management, researchers have proposed parallel and distributed techniques using supercomputers [5], [6], [7], [8], [9]. However, cloud computing technology has gained much more attention due to the advantage of commodity compute and storage devices. ...
Preprint
Full-text available
Given the high availability of data collected by different remote sensing instruments, the fusion of multispectral and hyperspectral images (HSI) is an important topic in remote sensing. In particular, super-resolution as a data fusion application using the spatial and spectral domains is highly investigated, because the fused images are used to improve classification and object-tracking accuracy. On the other hand, the huge amount of data obtained by remote sensing instruments represents a key concern in terms of data storage, management and pre-processing. This paper proposes a Big Data cloud platform using Hadoop and Spark to store, manage, and process remote sensing data. A study of the chunk size parameter is also presented to suggest an appropriate value for downloading imagery data from Hadoop into a Spark application, based on the format of our data. We also developed an alternative approach based on Long Short-Term Memory networks trained with different patch sizes for image super-resolution. This approach fuses hyperspectral and multispectral images. As a result, we obtain images with high spatial and high spectral resolution. The experimental results show that for a chunk size of 64k, an average of 3.5 s was required to download data from Hadoop into a Spark application. The proposed model for super-resolution provides a structural similarity index of 0.98 and 0.907 for the used dataset.
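The chunk-size study mentioned in this abstract can be illustrated with a much simpler, self-contained experiment: time how long it takes to stream a file using different chunk sizes. The sketch below uses plain local file I/O rather than the paper's Hadoop/Spark stack, and the file name and sizes are arbitrary assumptions made for the example.

```python
# Hypothetical chunk-size timing sketch: measure elapsed time to stream a file
# for several chunk sizes. Local I/O stands in for a remote HDFS transfer.
import os
import time

def read_in_chunks(path, chunk_size):
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            total += len(block)
    return total

if __name__ == "__main__":
    path = "sample.bin"                       # stand-in for a remote image tile
    with open(path, "wb") as f:
        f.write(os.urandom(8 * 1024 * 1024))  # 8 MiB of dummy data

    for chunk_size in (4 * 1024, 64 * 1024, 1024 * 1024):
        start = time.perf_counter()
        read_in_chunks(path, chunk_size)
        elapsed = time.perf_counter() - start
        print(f"chunk={chunk_size // 1024:>5} KiB  elapsed={elapsed:.4f}s")
```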
... However, this solution was designed with a specific task in mind and was not universal. A good example of a more universal solution is the XtremWeb project [6], a peer-to-peer computation system capable of solving different computation problems. Such solutions enable large-scale computations in a distributed environment but still require extensive knowledge and experience to successfully develop a computation application. ...
... performance parameters), describe their technical solution and assume the implementation of certain user expectations that are not explicitly formulated as functional requirements. A good example of such practice is the XtremWeb project [7]. Its goal was to create a peer-to-peer computation system that would enable parallel computing in a distributed large-scale system. ...
Chapter
High Performance Computing (HPC) consists in the development and execution of sophisticated computation applications, developed by highly skilled IT personnel. Several past studies report significant problems with applying HPC in industrial practice, caused by a lack of the IT skills necessary to develop highly parallelised and distributed computation software. This calls for new methods to reduce software development effort when constructing new computation applications. In this paper we propose a generic requirements model consisting of a conceptual domain specification, a unified domain vocabulary and use-case-based functional requirements. The vocabulary definition provides detailed clarifications of the fundamental HPC component elements and their role in the system. Further, we address security issues by providing transparency principles for HPC. We also propose a research agenda that leads to the creation of a model-based software development system dedicated to building distributed HPC applications at a high level of abstraction, with the aim of making HPC more accessible to smaller institutions.
... Encryption is used to keep the confidentiality and integrity of the tasks/data at rest on the volunteer hosts [Fedak et al. 2001;Mengotti 2004] as well as during transit in a network. Sandboxing and virtualization technologies can be used for deployment of guest tasks so that they will be executed in a controlled environment [Cappello et al. 2005;Cappos et al. 2009;Chien et al. 2003;Zhou and Lo 2004]. This will prevent intentional as well as unintentional security attacks from a project server. ...
Article
Full-text available
Volunteer Computing is a kind of distributed computing that harnesses the aggregated spare computing resources of volunteer devices. It provides a cheaper and greener alternative computing infrastructure that can complement dedicated, centralized, and expensive data centres. The aggregated idle computing resources of devices ranging from desktop computers to routers and smart TVs are being utilized to provide the much-needed computing infrastructure for compute-intensive tasks such as scientific simulations and big data analysis. However, the use of Volunteer Computing is still dominated by scientific applications, and only a very small fraction of the potential volunteer nodes are participating. This paper provides a comprehensive survey of Volunteer Computing, covering key technical and operational issues such as security, task distribution, resource management, and incentive models. The paper also presents a taxonomy of Volunteer Computing systems, together with discussions of the characteristics of specific systems in each category. In order to harness the full potential of Volunteer Computing and make it a reliable alternative computing infrastructure for general applications, we need to improve the existing techniques and devise new mechanisms. Thus, this paper also sheds light on important issues regarding the future research and development of Volunteer Computing systems, with the aim of making them a viable alternative computing infrastructure.
... All of these computers contribute part of their resources and, by means of an application, take part in forming a computing grid. Such systems have been set up, for example, by Université Paris-Sud Orsay with the XtremWeb platform [24], or with the BOINC platform [5,21] of the University of Berkeley. The implementations of these platforms highlight the possibility of contributing to various research fields (often medical), by allowing users to exploit the idle or low-activity periods of their machines. ...
Thesis
The work presented in this thesis deals with the scheduling of linear multi-task workflow applications on distributed platforms. The particularity of the system under study is that the number of machines in the platform is smaller than the number of tasks to be performed. In this case the machines are assumed to be able to perform all the tasks of the application after a reconfiguration, with every reconfiguration taking a given amount of time that may or may not depend on the tasks. The problem is to maximize the throughput of the application, i.e. the average number of outputs per unit of time, or to minimize the period, i.e. the average time between two outputs. The problem therefore decomposes into two sub-problems: assigning the tasks to the machines of the platform (one or more tasks per machine), and scheduling these tasks within a given machine, taking the reconfiguration times into account. To this end the platform provides spaces called buffers, either allocatable or imposed, to store temporary production results and thus avoid reconfiguring the machines after each task. If the buffers are not pre-assigned, we must also solve the problem of allocating the available space to buffers in order to optimize the execution of the schedule within each machine. This document is an exhaustive study of the different problems associated with the heterogeneity of the application; indeed, while solving these problems is trivial with homogeneous reconfiguration times and buffers, it becomes much more complex when they are heterogeneous. We therefore study our three major problems for different degrees of heterogeneity of the application, and we propose heuristics to handle these problems when an optimal algorithmic solution cannot be found.
... controlling units [9][10][11]. In DCS, controllers are connected to different field devices such as actuators and sensors; they continuously receive data from them and send the data to other controllers in the hierarchy through a communication bus. Various communication channels are used for this purpose, some of them being Profibus, HART, ARCNET, Modbus, etc. DCS is employed in many different areas, including agriculture and chemical plants. ...
Article
Full-text available
As a disruptive technology, blockchain, particularly in its original form of bitcoin as a type of digital currency, has attracted great attention. Its innovative distributed decision-making and security mechanisms lay the technical foundation for its success, leading us to consider bringing the power of blockchain technology to distributed control and cooperative robotics, where distributed and secure mechanisms are also highly demanded. Indeed, security and distributed communication have long been unsolved problems in the field of distributed control and cooperative robotics, and network failures and intruder attacks on distributed control and multi-robotic systems have been reported. Blockchain technology promises to remedy this situation thoroughly. This work is intended to create a global picture of blockchain technology, its working principles and key elements, in the language of control and robotics, providing a shortcut for beginners stepping into this research field.
... In DCS, data acquisition and the control tasks are performed by microprocessors located near the control area. These controllers can communicate with each other as well as with other controlling units [10]-[12]. In DCS, controllers are connected to different field devices such as actuators and sensors; they continuously receive data from them and send the data to other controllers in the hierarchy through a communication bus. ...
Preprint
Full-text available
As a disruptive technology, blockchain, particularly in its original form of bitcoin as a type of digital currency, has attracted great attention. Its innovative distributed decision-making and security mechanisms lay the technical foundation for its success, leading us to consider bringing the power of blockchain technology to distributed control and cooperative robotics, where distributed and secure mechanisms are also highly demanded. Indeed, security and distributed communication have long been unsolved problems in the field of distributed control and cooperative robotics, and network failures and intruder attacks on distributed control and multi-robotic systems have been reported. Blockchain technology promises to remedy this situation thoroughly. This work is intended to create a global picture of blockchain technology, its working principles and key elements, in the language of control and robotics, providing a shortcut for beginners stepping into this research field.
... Organizations worry about cloud computing service availability [8]. Not only do Desktop Grids such as BOINC or XtremWeb [13] have centralized architectures, causing a potential bottleneck in the continuing evolution of volunteer computing systems, but there are also worrying signs of stagnation in active users and projects. This causes problems related to data storage and distribution [3,16,17]. ...
Article
Public distributed computing is a type of distributed computing in which so-called volunteers provide computing resources to projects. Research shows that public distributed computing has the potential and capabilities required to handle big data mining tasks. Considering that one of the biggest advantages of such a computational model is its low computational resource cost, this raises the question of why the method is not widely used for solving today's computational challenges such as big data mining. The purpose of this paper is to review the capabilities of public distributed computing for big data mining tasks. The outcome of this paper provides the foundation for future research required to bring attention back to this low-cost public distributed computing method and make it a suitable platform for big data analysis.
... Two types of grids are distinguished: computing grids and data grids. In grid computing solutions, such as SETI@home (Anderson, 2002), BOINC (Anderson, 2004), XtremWEB (Cappello, 2004), Diet (Caron, 2006), Globus (Allcock, 2001), and CONFIIT (Flauzac, 2010), resources are associated with computing (processor, memory, etc.). In data grids, such as OceanStore (Kubiatowicz, 2000) and Freenet (Freenet, 2011), resources are associated with data storage. ...
... Developed from the SETI@Home [2] project, this platform allows a researcher to run the required computations on BOINC grid clients, which can be any PCs with Internet connections. Other grid frameworks are XtremWeb [3], UnaCloud [4], and OurGrid [5]. Each of these has a suite of tools to address problems related to the characteristics of a grid environment. ...
Article
Full-text available
In scientific computing, more computational power generally implies faster and possibly more detailed results. The goal of this study was to develop a framework to submit computational jobs to powerful workstations that are underused because they run only non-intensive tasks. This is achieved by using a virtual machine in each of these workstations, where the computations are done. This group of virtual machines is called the Gridlan. The Gridlan framework is intermediate between the cluster and grid computing paradigms. The Gridlan is able to profit from existing cluster software tools, such as resource managers like Torque, so a user with previous experience in cluster operation can dispatch jobs seamlessly. A benchmark test of the Gridlan implementation shows the system's suitability for computational tasks, principally in embarrassingly parallel computations.
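Because the Gridlan reuses standard cluster tooling such as Torque, job dispatch from the user's point of view looks like ordinary PBS submission. The sketch below is a hypothetical example of such a submission: the script contents, resource requests and the compute.py payload are assumptions made for the example, and it only runs where a Torque/PBS installation provides the qsub command.

```python
# Illustrative dispatch of a job through a Torque/PBS resource manager.
# Everything in the script body is an assumption for the example.
import subprocess
import tempfile

PBS_SCRIPT = """#!/bin/bash
#PBS -N gridlan_demo
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
python3 compute.py > result.txt
"""

def submit_job(script_text):
    """Write a PBS script to disk and hand it to qsub; returns the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script_text)
        path = f.name
    out = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    print("submitted:", submit_job(PBS_SCRIPT))
```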
... This will allow the product to earn money for support and further development. Today, if a researcher's main goal is to run calculations with minimum effort on their part, the most appropriate solution would be to use the GridUNAM platform to organize their calculations (Taufer, 2004) (Franck Cappello, 2005). However, if the researchers constantly have upcoming tasks requiring calculations in fairly large volumes, it is better to choose the original BOINC as the platform for organizing a volunteer computing project, and to actively work on involving participants in the project. ...
Article
Full-text available
The article is based on experience of running BOINC projects. We interviewed the developers of projects on the BOINC platform in order to learn from their experience with it: the issues they were confronted with, how they solved them, what changes they made in BOINC, their opinion of the BOINC platform, and what should be improved to make it better. We then studied materials about experience of using the BOINC platform and about BOINC issues. Finally, we drew conclusions about the actions to be taken for the development of BOINC: increase the number of crunchers; rewrite the platform using modern architectural solutions and the latest technologies; and initiate the creation of services providing access to the computing resources of crunchers.
... Finally, decentralized infrastructures do not rely on any central point of reference for device discovery and task allocation; instead, peers are in charge of discovering other peers, handling task allocation and distribution, and collecting the results. This strategy is taken by peer-to-peer infrastructures like OurGrid [22], XtremWeb [15], and the Mini-Grid [6]. For a particular infrastructure, the chosen organization schema does not restrict what each participating device can do in the infrastructure (whether it is symmetric or asymmetric); however, it does constrain the information each machine has access to, a fundamental condition for fulfilling the data acquisition requirement. ...
Thesis
Full-text available
The Mini-Grid is a volunteer computing infrastructure that gathers computational power from multiple participants and uses it to execute bio-informatics algorithms. The Mini-Grid is an instance of a larger set of systems that I call participative computational infrastructures (PCI). PCIs depend on their participants to provide a service, with every instance of the system executing similar tasks and collaborating with others. Participants in these infrastructures come together to contribute resources like computational power, storage capacity, network connectivity and human reasoning skills. While plenty of research has focused on the technical aspects of these infrastructures (task parallelization, distribution, robustness, and security), the participative aspect, which deals with how to recruit and retain participants, has been largely overlooked. Despite the multiple experiences with volunteer computing projects, only a few researchers have looked into the motivational factors affecting the enrolment and permanence of participants. This dissertation studies participation from the broader context of the relationship between users and infrastructures in the field of Human-Computer Interaction (HCI), and argues that participative computational infrastructures face a fundamental recruitment challenge derived from their being "invisible" computational systems. To counter this challenge, this dissertation proposes the notion of Infrastructure Awareness: a feedback mechanism on the state of, and changes in, the properties of computational infrastructures, provided in the periphery of the user's attention, and supporting gradual disclosure of detailed information on the user's request. Working with users of the Mini-Grid, this thesis shows the design process of two infrastructure awareness systems aimed at supporting the recruitment of participants, the implementation of one possible technical strategy, and an in-the-wild evaluation. The thesis concludes with a discussion of the results and implications of infrastructure awareness for participative and other computational infrastructures.
... Another well-known framework is the XtremWeb [12] research and development project. It was designed to create light and flexible distributed computing networks locally within universities, companies or any other local networks. ...
Article
Full-text available
Existing solutions to the problem of finding valuable information on the Web suffer from several limitations, such as simplified query languages, out-of-date information or arbitrary sorting of results. In this paper a different approach to this problem is described. It is based on the idea of distributed processing of Web page content. To provide sufficient performance, the idea of browser-based volunteer computing is utilized, which requires the implementation of text processing algorithms in JavaScript. In this paper the architecture of the Web page content analysis system is presented, details concerning the implementation of the system and of the text processing algorithms are described, and test results are provided.
... The Cloud@Home goal is to use "domestic" computing resources to build desktop Clouds made of voluntarily contributed resources. Therefore, following the volunteer computing wave [1], across Grid computing and desktop Grids [2,5], we think about desktop Cloud platforms able to engage and retain contributors for providing virtual (processing, storage, networking, sensing) resources as a service, in the Infrastructure as a Service (IaaS) fashion. This novel, revised view of Cloud computing could perfectly fit with private and community needs, but our real, long-term challenge is to exploit it in hybrid and especially in business contexts towards public deployment models. ...
Conference Paper
Recent developments in Cloud computing technology provide capabilities for an extensible, reliable, effective and dynamic infrastructure to technology-enabled enterprises, in order to efficiently leverage (or even monetize) their on-premise equipment. Furthermore, the virtualization technologies powering the Cloud revolution expand their reach by the day, and are nowadays commonly available, nearly household, capabilities. In this light, the intersection between volunteering and Cloud computing may bring massive and ubiquitous compute power for IaaS users. For instance, scientists and researchers, as a category of very demanding users, may benefit from such an enlargement of the pool of resources to tap into for high complexity computational workloads and big data problems without concern for the setup and maintenance of the underlying infrastructure. We have investigated this concept in the past under the Cloud@Home project, aimed at implementing a desktop-powered Cloud. In this paper we propose a blueprint of a Cloud@Home implementation starting from OpenStack, a well-known platform for Cloud solutions, a de-facto standard with variety of features, high interoperability and Open Source support. The reference, layered architecture and the preliminary implementation of a Cloud@Home framework based on OpenStack are discussed in the paper.
... There are several centralized desktop grid systems, such as BOINC [44], Condor [45] and XtremWeb [46], and decentralized ones such as CCOF (Cluster Computing On the Fly) [47]. However, none of these systems address the issue of providing incentives for the donation of resources. ...
Article
Full-text available
Desktop grids (DG) offer large amounts of computing power coming from Internet-based volunteer networks. They suffer from the free-riding phenomenon: it may be possible for users to free-ride, consuming resources donated by others while not donating any of their own. In this paper, we present PGTrust, our decentralized free-riding prevention model designed for PastryGrid. PastryGrid is a decentralized DG system which manages resources over a decentralized P2P network. PGTrust relies on the notion of a score, a reputation metric used to evaluate the level of QoS of a peer. We have conducted our experiments on the Grid'5000 testbed. The obtained results demonstrate the benefits of our free-riding prevention model. PGTrust is able to improve application running time by discouraging free-riders and motivating selfish peers to contribute. It offers a considerable speedup for distributed applications.
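The score-based reputation idea behind PGTrust can be sketched very compactly: each peer's score is updated from observed behaviour, and tasks are preferentially delegated to high-score peers, so free-riders gradually lose work. The update rule and constants below are illustrative assumptions, not PGTrust's actual formulas.

```python
# Toy reputation-score sketch: contribution raises a peer's score, free-riding
# lowers it, and the best-scored peer is chosen when delegating work.
class Peer:
    def __init__(self, name, score=0.5):
        self.name = name
        self.score = score

    def record(self, contributed, alpha=0.1):
        target = 1.0 if contributed else 0.0
        # Exponential moving average toward the observed behaviour.
        self.score = (1 - alpha) * self.score + alpha * target

def pick_executor(peers):
    """Prefer the peer with the best reputation when delegating a task."""
    return max(peers, key=lambda p: p.score)

if __name__ == "__main__":
    peers = [Peer("contributor"), Peer("free_rider")]
    for _ in range(20):
        peers[0].record(contributed=True)
        peers[1].record(contributed=False)
    best = pick_executor(peers)
    print({p.name: round(p.score, 2) for p in peers}, "->", best.name)
```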
... On the other hand, the problems of effectively exploiting the existing high-performance resources, taking their heterogeneity into account, have not been completely solved either (see (Fougère et al., 2005), (Díaz et al., 2012) and (Cappello et al., 2005)). One evident way forward is to employ technologies for the virtualization of computer systems and to integrate them with parallel computing technologies. ...
Conference Paper
Full-text available
The paper presents an advanced iterative MapReduce solution that employs Hadoop and MPI technologies. First, we present an overview of working implementations that make use of the same technologies. Then we define an academic example of a numerical problem with an emphasis on its computational features. This definition is used to justify the design of the proposed solution.
... The volunteer computing [1] paradigm has been exploited in several scientific applications (i.e., Seti@home, Folding@home, Einstein@home), but its adoption for mining applications is more challenging. The two most popular volunteer computing platforms available today, BOINC [2] and XtremWeb [6,8], are especially well suited for CPU-intensive applications but are somewhat inappropriate for data-intensive tasks, for two main reasons. First, the centralized nature of such systems requires all data to be served by a group of centrally maintained servers. ...
Conference Paper
Full-text available
Mining@Home was recently designed as a distributed architecture for running data mining applications according to the “volunteer computing” paradigm. Mining@Home already proved its efficiency and scalability when used for the discovery of frequent itemsets from a transactional database. However, it can also be adopted in several different scenarios, especially in those where the overall application can be divided into distinct jobs that may be executed in parallel, and input data can be reused, which naturally leads to the use of data cachers. This paper describes the architecture and implementation of the Mining@Home system and evaluates its performance for the execution of ensemble learning applications. In this scenario, multiple learners are used to compute models from the same input data, so as to extract a final model with stronger statistical accuracy. Performance evaluation on a real network, reported in the paper, confirms the efficiency and scalability of the framework.
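The data-cacher idea that Mining@Home relies on (input partitions are fetched once and reused by many independent jobs, such as the learners of an ensemble) can be sketched as follows; the class name and the fake fetch function are invented for the example and do not mirror the Mining@Home implementation.

```python
# Minimal data-cacher sketch: a partition is downloaded on first request and
# served from the local cache afterwards, so several learners share one fetch.
class DataCacher:
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn   # downloads a partition from the data source
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, partition_id):
        if partition_id in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[partition_id] = self.fetch_fn(partition_id)
        return self.cache[partition_id]

def fetch_from_server(partition_id):
    # Stand-in for a (slow) remote transfer of one data partition.
    return [partition_id * 10 + i for i in range(5)]

if __name__ == "__main__":
    cacher = DataCacher(fetch_from_server)
    # Three "learners" of an ensemble all read the same two partitions.
    for learner in range(3):
        for pid in (0, 1):
            cacher.get(pid)
    print("hits:", cacher.hits, "misses:", cacher.misses)  # hits: 4, misses: 2
```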
... Indeed, these often require deploying the same application several times while varying parameters each time. Among deployment tools we can cite DeployWare [53], OpenCCM [90,154], Pegasus [44,43], SmartFrog [65], Concerto [88], TAKTUK [89], APST [35], and XtremWeb [31]. In this section, we present in more detail the tools related to the deployment of applications on grids: JDF [11], GODIET [32], ADAGE [81] and KADEPLOY [62]. ...
Article
Federating physical resources located in different universities, institutes and companies leads to the concept of grid computing. These infrastructures are particularly well suited to supporting the heavy computing demand coming from scientific distributed applications. Unfortunately, both applications and infrastructures are complex to use, especially when dealing with the very first deployment step. This requires the user to select physical resources, transfer programs and monitor the execution of the application. As of today, a large number of systems can automate these operations in very simple static cases. Unfortunately, only a few of them can handle complex deployments such as the re-deployment of some additional parts of the application or the coordinated deployment of multiple applications. In this thesis we propose a model that helps in dynamically deploying applications over computing grids. This model offers two main functionalities. First, it translates high-level application-specific actions into low-level generic operations to manage resources. Second, it performs a pre-planning of deployments, as well as re-deployments and co-deployments. This model satisfies three properties: 1) resource management is made transparent to the application and the user; 2) actions are specific to each application type; 3) applying the model is as little intrusive as possible with regard to the application programming model and source code. CORDAGE is an architecture that has been proposed to illustrate this model. It has been developed on top of the OAR job scheduler and the ADAGE deployment tool. CORDAGE has been validated using the JXTA peer-to-peer framework, the JUXMEM data-sharing service and the GFARM distributed file system. Our approach has been tested within the GRID'5000 experimental testbed. http://cordage.gforge.inria.fr/
Article
Full-text available
Over the past six decades, the computing systems field has experienced significant transformations, profoundly impacting society with transformational developments, such as the Internet and the commodification of computing. Underpinned by technological advancements, computer systems, far from being static, have been continuously evolving and adapting to cover multifaceted societal niches. This has led to new paradigms such as cloud, fog, edge computing, and the Internet of Things (IoT), which offer fresh economic and creative opportunities. Nevertheless, this rapid change poses complex research challenges, especially in maximizing potential and enhancing functionality. As such, to maintain an economical level of performance that meets ever-tighter requirements, one must understand the drivers of new model emergence and expansion, and how contemporary challenges differ from past ones. To that end, this article investigates and assesses the factors influencing the evolution of computing systems, covering established systems and architectures as well as newer developments, such as serverless computing, quantum computing, and on-device AI on edge devices. Trends emerge when one traces technological trajectory, which includes the rapid obsolescence of frameworks due to business and technical constraints, a move towards specialized systems and models, and varying approaches to centralized and decentralized control. This comprehensive review of modern computing systems looks ahead to the future of research in the field, highlighting key challenges and emerging trends, and underscoring their importance in cost-effectively driving technological progress.
Article
Hybrid Cloud environments allow the utilization of local resources in private Clouds together with resources from public Clouds when needed. Such environments represent systems with high failure rates because they feature heterogeneous components and a large number of servers under intensive workload, built as complex architectures. For these reasons, the availability of such systems can easily be compromised if the failure of these heterogeneous components is not handled correctly, which may cause request rejection and frequent performance degradation. Providing highly reliable Cloud applications, in particular in a hybrid Cloud environment, is a challenging and critical research problem. Therefore, the question we address in this paper is how to provision resources for user requests in the presence of failures in a hybrid Cloud environment. To this end, we propose a reconfigurable formal model of the hybrid Cloud architecture; we then utilize instantiations of this model, simulation and real-time execution runs to estimate different performance metrics related to fault detection and self-recovery strategies in the hybrid Cloud. Our approach is based on the combination of model-based and probabilistic approaches.
Chapter
A desktop grid system is one of the most common types of distributed systems. Its distinctive features are the high heterogeneity and unreliability of computing nodes. Desktop grid systems deployed on the BOINC platform are considered. To simulate the functioning of the desktop grid, a modified ComBos simulator based on SimGrid is used. The modified ComBos simulator adds support for applications with a limited number of tasks, asynchronous execution of multiple applications, and various computing resources. Data from existing voluntary distributed computing projects were used to simulate the functioning of the desktop grid. The paper deals with the modification of the scheduling system for a desktop grid. The algorithms FS, FCFS, SRPT, and SWRPT were selected from existing heuristic algorithms for comparison, and two heuristic task scheduling algorithms, MSF and MPSF, were proposed. A simulation of the desktop grid was performed based on data from existing voluntary distributed computing projects, taking into account the asynchronous execution of five different computing applications on several types of computing resources. A comparative analysis of the results of the various scheduling algorithms in the desktop grid is carried out. The analysis showed that the proposed MPSF algorithm gives the best results among the compared algorithms. The proposed heuristic scheduling algorithm can be applied to umbrella distributed computing projects and to desktop grids in general.
Chapter
On volunteer computing platforms, inter-task dependency leads to serious performance degradation when failed tasks must be re-executed because of volatile peers. This paper discusses a performance-oriented task dispatch policy based on failure probability estimation. The tasks with the highest failure probabilities are selected for dispatch when multiple task enquiries arrive at the dispatcher, and the estimated failure probabilities are used to find the optimized task assignment that minimizes the overall failure probability of these tasks. This performance-oriented task dispatch policy is evaluated with two real-world trace data sets on a simulator. The evaluation results demonstrate the effectiveness of the policy.
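A toy version of the dispatch policy described above might look like this: when several workers enquire at once, the tasks with the highest estimated failure probabilities are selected, and the task-to-worker assignment minimizing their combined failure probability is chosen by simple enumeration. The probability values and model below are made up for the example, and are not the chapter's estimator.

```python
# Illustrative failure-probability-aware dispatch: pick the most failure-prone
# tasks and assign them to enquiring workers to minimize the product of
# per-assignment failure probabilities.
from itertools import permutations

def best_assignment(task_fail, fail_prob):
    """task_fail: {task: current failure prob}.
    fail_prob[task][worker]: probability of failing if dispatched to that worker."""
    workers = list(next(iter(fail_prob.values())).keys())
    # Select as many of the currently most failure-prone tasks as there are workers.
    tasks = sorted(task_fail, key=task_fail.get, reverse=True)[:len(workers)]
    best, best_score = None, float("inf")
    for perm in permutations(workers, len(tasks)):
        score = 1.0
        for task, worker in zip(tasks, perm):
            score *= fail_prob[task][worker]
        if score < best_score:
            best, best_score = dict(zip(tasks, perm)), score
    return best, best_score

if __name__ == "__main__":
    task_fail = {"t1": 0.9, "t2": 0.7, "t3": 0.2}
    fail_prob = {
        "t1": {"w1": 0.5, "w2": 0.1},
        "t2": {"w1": 0.3, "w2": 0.4},
        "t3": {"w1": 0.2, "w2": 0.2},
    }
    print(best_assignment(task_fail, fail_prob))  # ({'t1': 'w2', 't2': 'w1'}, 0.03)
```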
Chapter
This paper shows how to parallelize a compute-intensive application in mathematics (group theory) for an institutional Desktop Grid platform coordinated by a meta-grid middleware named BonjourGrid. The paper is twofold: it shows how to parallelize a sequential program for a multicore CPU that participates in the computation, and it demonstrates the effort required to launch multiple instances of the solution to the mathematical problem with the BonjourGrid middleware. BonjourGrid is a fully decentralized Desktop Grid middleware. The main results of the paper are: a) an efficient multi-threaded version of a sequential program to compute Littlewood-Richardson coefficients, namely the Multi-LR program, and b) a proof of concept, centered on user needs, for the BonjourGrid middleware dedicated to coordinating multiple instances of programs for Desktop Grids, with the help of Multi-LR. In this paper, the scientific work consists in starting from a model for the solution of a compute-intensive problem in mathematics, incorporating this concrete model into a middleware, and running it on a commodity PC platform managed by an innovative meta Desktop Grid middleware.
Chapter
Cloud Computing (CC) offers simple and cost-effective outsourcing in dynamic service environments and allows the construction of service-based applications extensible with the latest achievements of diverse research areas. CC is built using dedicated and reliable resources and provides uniform, seemingly unlimited capacities. Volunteer Computing (VC), on the other hand, uses volatile, heterogeneous and unreliable resources. This chapter, per the authors, makes an attempt, starting from a definition of Cloud Computing, to identify the required steps and formulate a definition of what can be considered the next evolutionary stage of Volunteer Computing: Volunteer Clouds (VCl). There are many idiosyncrasies of VC to overcome (e.g., volatility, heterogeneity, reliability, responsiveness, scalability, etc.). Heterogeneity exists in VC at different levels, while the vision of CC promises to provide a homogeneous environment. The goal of this chapter, per the authors, is to identify methods and propose solutions that tackle these heterogeneities and thus make a step towards Volunteer Clouds.
Chapter
This article proposes an adaptive fuzzy-logic-based decentralized scheduling mechanism suitable for dynamic computing environments, in which matchmaking is achieved between the resource requirements of outstanding tasks and the resource capabilities of available workers. The feasibility of the proposed method is demonstrated on a real-time system. Experimental results show that implementing the proposed fuzzy matchmaking based scheduling mechanism maximized the resource utilization of executing workers without exceeding the maximum execution time of the task. It is concluded that the efficiency of FMA-based decentralized scheduling, in the case of parallel execution, is reduced by increasing the number of subtasks.
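One possible reading of fuzzy matchmaking between task requirements and worker capabilities is sketched below: each resource dimension yields a degree of satisfaction in [0, 1], the degrees are aggregated with a fuzzy AND (min), and the task goes to the best-scoring worker. The membership function, the aggregation rule and the sample resource figures are assumptions for the example, not the chapter's exact design.

```python
# Toy fuzzy matchmaking: per-resource satisfaction degrees aggregated with min.
def degree_of_satisfaction(required, offered):
    """1.0 when the offer meets or exceeds the requirement, decaying linearly to 0."""
    if offered >= required:
        return 1.0
    return max(0.0, offered / required)

def match_score(task_req, worker_cap):
    # Fuzzy AND over all required resource dimensions.
    return min(degree_of_satisfaction(task_req[k], worker_cap.get(k, 0.0))
               for k in task_req)

def pick_worker(task_req, workers):
    scored = {name: match_score(task_req, cap) for name, cap in workers.items()}
    return max(scored, key=scored.get), scored

if __name__ == "__main__":
    task = {"cpu_ghz": 2.0, "ram_gb": 4.0, "disk_gb": 10.0}
    workers = {
        "laptop":  {"cpu_ghz": 1.6, "ram_gb": 8.0, "disk_gb": 50.0},
        "desktop": {"cpu_ghz": 3.2, "ram_gb": 4.0, "disk_gb": 20.0},
    }
    print(pick_worker(task, workers))   # desktop matches fully, laptop partially
```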
Chapter
In this chapter we introduce the key concepts of SOA, grid, and cloud computing and the relations between them. The chapter illustrates the paradigm shift in technological services due to the incorporation of these models and how we can combine them to develop highly scalable application systems such as petascale computing. We also cover some concepts of Web 2.0 and why it needs grid computing and the on-demand enterprise model. Finally, we discuss some standardization efforts on these models as a further step in developing interoperable grid systems.
Chapter
Network security is a constantly evolving domain. Every day, new attacks, viruses, and intrusion techniques are released. Hence, network devices, enterprise servers, and personal computers are potential targets of these attacks. Current security solutions like firewalls, intrusion detection systems (IDS), and virtual private networks (VPN) are centralized solutions, which rely mostly on the analysis of inbound network connections. This approach notably forgets the effects of a rogue station, whose communications cannot be easily controlled unless the administrators establish a global authentication policy using methods like 802.1x to control all network communications among each device. To the best of the authors' knowledge, a distributed and easily manageable solution for the global security of an enterprise network does not exist. In this chapter, they present a new approach to deploying a distributed security solution where communication between devices can be controlled in a collaborative manner. Indeed, each device has its own security rules, which can be shared and improved through exchanges with other devices. With this new approach, called the grid of security, a community of devices ensures that a device is trustworthy and that communications between devices proceed in accordance with the system's security policies. To support this approach, the authors present a new communication model that helps structure the distribution of security services among the devices. This can secure ad-hoc, local-area or enterprise networks in a decentralized manner, preventing the risk of a security breach in the case of a failure.
Article
Full-text available
Volunteer computing resembles private desktop grids, whereas desktop grids are not fully equivalent to volunteer computing. There are several attempts to distinguish and categorize them using informal and formal methods. However, most formal approaches model a particular middleware and do not focus on the general notion of volunteer or desktop grid computing. This work makes an attempt to formalize their characteristics and relationship. To this end, formal modeling is applied that tries to grasp the semantics of their functionalities, as opposed to comparisons based on properties, features, etc. We apply this modeling method to formalize the Berkeley Open Infrastructure for Network Computing (BOINC) [Anderson D. P., 2004] volunteer computing system.
Chapter
Different forms of parallel computing have been proposed to address the high computational requirements of many applications. Building on advances in parallel computing, volunteer computing has been shown to be an efficient way to exploit the computational resources of under utilized devices that are available around the world. The idea of including mobile devices, such as smartphones and tablets, in existing volunteer computing systems has recently been investigated. In this chapter, we present the current state of the art in the mobile volunteer computing research field, where personal mobile devices are the elements that perform the computation. Starting from the motivations and challenges behind the adoption of personal mobile devices as computational resources, we then provide a literature review of the different architectures that have been proposed to support parallel computing on mobile devices. Finally, we present some open issues that need to be investigated in order to extend user participation and improve the overall system performance for mobile volunteer computing.
Article
This paper presents a comprehensive survey on filtering-based defense mechanisms against distributed denial of service (DDoS) attacks. Several filtering techniques are analyzed and their advantages and disadvantages are presented. In order to help network security analysts choose the most appropriate mechanism according to their security requirements, a comparative classification of these methods is provided. The relevant research efforts are identified and discussed for rendering the current state of the art in the literature. This classification will also serve researchers to address weaknesses of these filtering methods, and thus mitigate DDoS attacks using more effective defense mechanisms.
Article
MapReduce offers an ease-of-use programming paradigm for processing large datasets. In our previous work, we designed a MapReduce framework called BitDew-MapReduce for desktop grid and volunteer computing environments, which allows non-expert users to run data-intensive MapReduce jobs on top of volunteer resources over the Internet. However, network distance and resource availability have a great impact on MapReduce applications running over the Internet. To address this, an availability- and network-aware MapReduce framework over the Internet is proposed. Simulation results show that the MapReduce job response time can be decreased by 40.05%, thanks to Weighted Naive Bayes Classifier-based availability prediction and landmark-based network estimation. The effectiveness of the new MapReduce framework is further proved by a performance evaluation in a real distributed environment.
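The availability-prediction ingredient mentioned above (a Weighted Naive Bayes classifier deciding whether a volunteer node is likely to be up) can be illustrated with a tiny classifier over categorical features. The features, weights, history data and Laplace smoothing below are illustrative choices, not the paper's actual model.

```python
# Small weighted Naive Bayes sketch for predicting node availability from
# categorical features; feature log-likelihoods are scaled by per-feature weights.
import math
from collections import defaultdict

class WeightedNaiveBayes:
    def __init__(self, weights):
        self.weights = weights                       # per-feature importance
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def fit(self, samples):
        for features, label in samples:
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feat_counts[(label, name)][value] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for name, value in features.items():
                seen = self.feat_counts[(label, name)]
                p = (seen[value] + 1) / (count + 2)  # Laplace smoothing
                score += self.weights.get(name, 1.0) * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

if __name__ == "__main__":
    # (hour-of-day bucket, weekday?) -> was the volunteer node available?
    history = [({"hour": "night", "weekday": True}, "up"),
               ({"hour": "night", "weekday": False}, "up"),
               ({"hour": "day", "weekday": True}, "down"),
               ({"hour": "day", "weekday": False}, "up")]
    model = WeightedNaiveBayes(weights={"hour": 2.0, "weekday": 0.5})
    model.fit(history)
    print(model.predict({"hour": "day", "weekday": True}))
```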
Article
This paper presents the Virtual EZ Grid project, based on the XtremWeb-CH (XWCH) volunteer computing platform. The goal of the project is to introduce a flexible distributed computing system, with (i) an infrastructure with a non-trivial amount of computing resources from various institutes, (ii) a stable platform that manages these computing resources and provides advanced interfaces for applications, and (iii) a set of applications that benefit from the platform. This paper concentrates on the application support of the new version of XWCH, and describes how two medical applications, MedGIFT and NeuroWeb, utilise it.
Conference Paper
This paper attempts to decentralize volunteer computing (VC) coordination with the goal of reducing the reliance on a central coordination server, which has been criticized as a performance bottleneck and single point of failure. After analyzing the roles and functions that the VC components play in the centralized master/worker coordination model, this paper proposes a decentralized VC coordination framework based on a distributed hash table (DHT) and a peer-to-peer (P2P) overlay, and successfully maps the centralized VC coordination onto distributed VC coordination. The proposed framework has been implemented on the performance-proven DHT P2P overlay Chord. Initial verification has demonstrated the effectiveness of the framework when working in distributed environments.
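The core of DHT-based coordination is the mapping from task identifiers to responsible peers. The sketch below shows the Chord-style idea of hashing peers and task ids onto a ring and routing each task to its successor peer; the 16-bit ring, peer names and task ids are assumptions for the example, not the paper's implementation.

```python
# Consistent-hashing sketch: tasks are routed to the first peer whose position
# on the ring follows the task's hash (Chord-style successor lookup).
import hashlib
from bisect import bisect_right

RING_BITS = 16

def ring_hash(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** RING_BITS)

class Ring:
    def __init__(self, peers):
        self.points = sorted((ring_hash(p), p) for p in peers)

    def coordinator_for(self, task_id):
        h = ring_hash(task_id)
        idx = bisect_right([pt for pt, _ in self.points], h)
        return self.points[idx % len(self.points)][1]   # wrap around the ring

if __name__ == "__main__":
    ring = Ring([f"peer-{i}" for i in range(5)])
    for task in ("task-42", "task-43", "task-44"):
        print(task, "->", ring.coordinator_for(task))
```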
Conference Paper
Desktop Grids are composed of several thousands of resources. They are characterized by high volatility of resources, due to voluntary disconnections or failures, which can affect the proper termination of application execution. PastryGrid is a decentralized system that manages desktop grid resources and user applications over a fully decentralized P2P network. In this paper we present PastryGridCP, our checkpoint-based rollback-recovery protocol designed for the decentralized Desktop Grid system PastryGrid. It provides fault tolerance for grid applications and ensures the termination of application execution in a way that is transparent to users. We have conducted experiments on 110 nodes of Grid’5000. The results validate our protocol and show improved application performance.
Article
Desktop grids are a form of grid computing that incorporates desktop resources into the grid infrastructure. In desktop grids, it is important to guarantee fast turnaround time in the presence of dynamic properties such as volatility and heterogeneity. In this paper, we propose a nearest neighbor (NN)-based task scheduling that can selectively allocate tasks to the resources that are suitable for the current situation of a desktop grid environment. The experimental results show that our scheduling is more efficient than existing scheduling approaches with respect to reducing both turnaround time and the number of resources consumed.
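A minimal sketch of the nearest-neighbor idea follows; the feature set, metric and all names are assumptions for illustration, not the paper's actual attributes. Each resource advertises a small numeric profile, and a task is allocated to the resource whose profile is closest to the task's desired profile.

import java.util.List;

// Sketch: nearest-neighbour task allocation in a desktop grid.
public class NearestNeighbourScheduler {

    // Hypothetical resource descriptor: e.g. normalised CPU speed, availability, free memory.
    public record Resource(String name, double[] features) {}

    private static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Returns the resource whose advertised features are nearest to the task profile.
    public static Resource select(double[] taskProfile, List<Resource> resources) {
        Resource best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Resource r : resources) {
            double d = distance(taskProfile, r.features());
            if (d < bestDist) { bestDist = d; best = r; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Resource> pool = List.of(
            new Resource("desktop-1", new double[]{0.9, 0.4, 0.5}),   // fast but often interrupted
            new Resource("desktop-2", new double[]{0.5, 0.9, 0.6}),   // slower but highly available
            new Resource("desktop-3", new double[]{0.7, 0.7, 0.8}));
        double[] task = {0.6, 0.8, 0.6};   // this task values availability over raw speed
        System.out.println("allocate to: " + select(task, pool).name());
    }
}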
Article
Service Oriented Architecture (SOA) and Web Services play an invaluable role in grid and cloud computing models and are widely seen as a base for new models of distributed applications and system management tools. SOA, grid and cloud computing models share core behavioral features and characteristics, creating a synergy for developing and implementing new services that facilitate the on-demand computing model. In this chapter we introduce the key concepts of SOA, grid, and cloud computing and the relations between them. The chapter illustrates the paradigm shift in technological services due to the incorporation of these models and how they can be combined to develop highly scalable application systems such as petascale computing. We also cover some concepts of Web 2.0 and why it needs grid computing and the on-demand enterprise model. Finally, we discuss some standardization efforts on these models as a further step towards developing interoperable grid systems.
Article
Contents: Introduction to peer-to-peer systems; The peer-to-peer paradigms; Services on structured overlays; Building trust in P2P systems; Conclusion; Bibliography.
Article
It is important to reduce turnaround time for all tasks in the presence of execution failures in desktop grids. To achieve this objective, this paper proposes a checkpoint-sharing-based replication scheme where each task is allocated to multiple desktop resources under a hybrid P2P desktop grid architecture and intermediate execution results (i.e., checkpoints) can be transferred to other resources for its successive execution. To further reduce turnaround time, sequential task distribution based on checkpoints is applied in the scheme. Performance evaluation shows that our scheme is superior to the existing scheme with respect to reducing both turnaround time and total execution time, regardless of the failure rate.
Chapter
This paper reports on the activities of the IAG Working Group 1.1.1 on combination and comparison of precise orbits based on different space geodetic techniques. It will focus on the Dancer project which implements a distributed parameter estimation process that is scalable in the number of GPS receivers, so that an arbitrarily large number of receivers can be processed in a single reference frame realization. The background of this project will be summarized and its mathematical principles will be explained, as well as the essential aspects of the involved internet communication. It will show that the workload for data processing at a single participating receiver remains independent of the network size, while the data traffic only grows as a logarithmic function of the network size.
Article
In this paper we follow a simple approach that allows machine learning (ML) techniques to be applied to large data sets. More specifically, we study the case of on-demand dynamic creation of a local model in the neighborhood of a target datum instead of creating a global one on the whole training data set. This approach exploits the advanced data structures and algorithms embedded in modern relational databases to rapidly identify the neighborhood of a target datum. Preliminary experimental results from a large-scale classification problem (HIGGS dataset) show that typical machine learning techniques are applicable to large data sets through this approach, under particular conditions. We highlight some restrictions of the method and some issues that arise when implementing it.
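The approach can be pictured with a short, hypothetical JDBC sketch: the database is asked only for the k training rows nearest to the query point, and a throwaway local model (here a simple majority vote) is built over that neighborhood. The table name, column names and connection URL below are assumptions, not the paper's setup.

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

// Sketch: on-demand local model built over a SQL-selected neighbourhood.
public class LocalModelOnDemand {

    public static int classify(Connection db, double f1, double f2, int k) throws SQLException {
        // The database's indexes and executor do the neighbourhood search.
        String sql =
            "SELECT label FROM training_data " +
            "ORDER BY (feature1 - ?) * (feature1 - ?) + (feature2 - ?) * (feature2 - ?) " +
            "LIMIT ?";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setDouble(1, f1); ps.setDouble(2, f1);
            ps.setDouble(3, f2); ps.setDouble(4, f2);
            ps.setInt(5, k);
            Map<Integer, Integer> votes = new HashMap<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) votes.merge(rs.getInt("label"), 1, Integer::sum);
            }
            // The local "model" is just the majority label of the neighbourhood.
            return votes.entrySet().stream()
                        .max(Map.Entry.comparingByValue())
                        .map(Map.Entry::getKey)
                        .orElseThrow();
        }
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection; any JDBC-accessible relational database would do.
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/ml")) {
            System.out.println("predicted label: " + classify(db, 0.31, -1.2, 50));
        }
    }
}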
Article
Full-text available
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve. Keywords: Fault-tolerance, Scientific Computing, Computational Servers, Checkpointing, Migration.
Conference Paper
Full-text available
We describe a theory of authentication and a system that implements it. Our theory is based on the notion of principal and a "speaks for" relation between principals. A simple principal either has a name or is a communication channel; a compound principal can express an adopted role or delegation of authority. The theory explains how to reason about a principal's authority by deducing the other principals that it can speak for; authenticating a channel is one important application. We use the theory to explain many existing and proposed mechanisms for security. In particular, we describe the system we have built. It passes principals efficiently as arguments or results of remote procedure calls, and it handles public and shared key encryption, name lookup in a large name space, groups of principals, loading programs, delegation, access control, and revocation.
Conference Paper
Full-text available
Omni remote procedure call facility, OmniRPC, is a thread-safe grid RPC facility for cluster and global computing environments. The remote libraries are implemented as executable programs in each remote computer, and OmniRPC automatically allocates remote library calls dynamically on appropriate remote computers to facilitate location transparency. We propose to use OpenMP as an easy-to-use and simple programming environment for the multi-threaded client of OmniRPC. We use the POSIX thread implementation of the Omni OpenMP compiler which allows multi-threaded execution of OpenMP programs by POSIX threads even on a single processor. Multiple outstanding requests of OmniRPC calls in an OpenMP work-sharing construct are dispatched to different remote computers to exploit network-wide parallelism.
Conference Paper
Full-text available
This paper discusses preliminary work on standardizing and implementing a remote procedure call (RPC) mechanism for grid computing. The GridRPC API is designed to address the lack of a standardized, portable, and simple programming interface. Our initial work on GridRPC shows that client access to existing grid computing systems such as NetSolve and Ninf can be unified via a common API, a task that has proven to be problematic in the past.
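The GridRPC API itself is specified in C; the Java sketch below is only a hypothetical analogue of the calling pattern it standardizes (bind a function handle to a remote service, then mix synchronous and asynchronous calls), and none of the class or method names belong to the real API.

import java.util.concurrent.*;

// Hypothetical illustration of the handle/call/wait pattern of a GridRPC-style client.
public class GridRpcPatternSketch {

    // A "function handle": the binding of a remote routine name to a server.
    public record FunctionHandle(String server, String routine) {}

    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Synchronous call: blocks until the remote routine returns.
    public double call(FunctionHandle h, double[] args) {
        return invokeRemote(h, args);
    }

    // Asynchronous call: returns immediately with a session the caller can wait on.
    public Future<Double> callAsync(FunctionHandle h, double[] args) {
        return pool.submit(() -> invokeRemote(h, args));
    }

    // Stand-in for the real remote invocation (marshalling + network + remote execution).
    private double invokeRemote(FunctionHandle h, double[] args) {
        double s = 0;
        for (double a : args) s += a * a;   // pretend the named server computed this
        return s;
    }

    public static void main(String[] args) throws Exception {
        GridRpcPatternSketch rpc = new GridRpcPatternSketch();
        FunctionHandle norm = new FunctionHandle("server-a.example.org", "squared_norm");
        Future<Double> async = rpc.callAsync(norm, new double[]{3, 4});   // overlap with local work
        double sync = rpc.call(norm, new double[]{1, 2, 2});
        System.out.println("sync=" + sync + " async=" + async.get());
        rpc.pool.shutdown();
    }
}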
Conference Paper
Full-text available
Ninf is an ongoing global network-wide computing infrastructure project which allows users to access computational resources including hardware, software and scientific data distributed across a wide area network. Ninf is intended not only to exploit high performance in network parallel computing, but also to provide high-quality numerical computation services and access to scientific databases published by other researchers. Computational resources are shared as Ninf remote libraries executable on a remote Ninf server. Users can build an application by calling the libraries with the Ninf Remote Procedure Call, which is designed to provide a programming interface similar to conventional function calls in existing languages, and is tailored for scientific computation. In order to facilitate location transparency and network-wide parallelism, the Ninf metaserver maintains global resource information regarding computational servers and databases, allocating and scheduling coarse-grained computation for global load balancing. Ninf also interfaces with WWW browsers for easy accessibility.
Conference Paper
Full-text available
Omni remote procedure call facility, OmniRPC, is a thread-safe grid RPC facility for cluster and global computing environments. The remote libraries are implemented as executable programs in each remote computer, and OmniRPC automatically allocates remote library calls dynamically on appropriate remote computers to facilitate location transparency. We propose to use OpenMP as an easy-to-use and simple programming environment for the multi-threaded client of OmniRPC. We use the POSIX thread implementation of the Omni OpenMP compiler which allows multi-threaded execution of OpenMP programs by POSIX threads even in a single processor. Multiple outstanding requests of OmniRPC calls in OpenMP work-sharing construct are dispatched to different remote computers to exploit network-wide parallelism.
Conference Paper
Full-text available
It has been reported [25] that life holds but two certainties, death and taxes. And indeed, it does appear that any society, and in the context of this article, any large-scale distributed system, must address both death (failure) and the establishment and maintenance of infrastructure (which we assert is a major motivation for taxes, so as to justify our title!). Two supposedly new approaches to distributed computing have emerged in the past few years, both claiming to address the problem of organizing large-scale computational societies: peer-to-peer (P2P) [15, 36, 49] and Grid computing [21]. Both approaches have seen rapid evolution, widespread deployment, successful application, considerable hype, and a certain amount of (sometimes warranted) criticism. The two technologies appear to have the same final objective, the pooling and coordinated use of large sets of distributed resources, but are based in different communities and, at least in their current designs, focus on different requirements.
Conference Paper
Full-text available
We describe a theory of authentication and a system that implements it. Our theory is based on the notion of principal and a ‘speaks for’ relation between principals. A simple principal either has a name or is a communication channel; a compound principal can express an adopted role or delegated authority. The theory shows how to reason about a principal’s authority by deducing the other principals that it can speak for; authenticating a channel is one important application. We use the theory to explain many existing and proposed security mechanisms. In particular, we describe the system we have built. It passes principals efficiently as arguments or results of remote procedure calls, and it handles public and shared key encryption, name lookup in a large name space, groups of principals, program loading, delegation, access control, and revocation.
Conference Paper
Full-text available
Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.
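The sender-based message-logging idea can be sketched as follows (an illustration under simplifying assumptions, not MPICH-V2 code): every outgoing message is kept in the sender's volatile memory together with a logical clock, so that after a receiver restarts from its last uncoordinated checkpoint the senders can replay exactly the messages it had already consumed; only the small event records (sender, logical clock) would be pushed to a reliable remote logger.

import java.util.*;

// Sketch of sender-based message logging with logical clocks.
public class SenderBasedMessageLog {

    public record LoggedMessage(int destRank, long logicalClock, byte[] payload) {}

    private long clock = 0;
    private final List<LoggedMessage> log = new ArrayList<>();

    // Called instead of a bare send: log the payload locally, then transmit.
    public LoggedMessage send(int destRank, byte[] payload) {
        LoggedMessage m = new LoggedMessage(destRank, ++clock, payload.clone());
        log.add(m);                 // payload stays in the sender's memory, not on a server
        transmit(m);                // normal delivery path
        return m;
    }

    // Replay everything a restarted receiver consumed after its checkpointed clock value.
    public void replayFor(int destRank, long lastCheckpointedClock) {
        log.stream()
           .filter(m -> m.destRank() == destRank && m.logicalClock() > lastCheckpointedClock)
           .sorted(Comparator.comparingLong(LoggedMessage::logicalClock))
           .forEach(this::transmit);
    }

    private void transmit(LoggedMessage m) {
        System.out.printf("deliver clock=%d to rank %d (%d bytes)%n",
                          m.logicalClock(), m.destRank(), m.payload().length);
    }

    public static void main(String[] args) {
        SenderBasedMessageLog rank0 = new SenderBasedMessageLog();
        rank0.send(1, "a".getBytes());
        rank0.send(1, "b".getBytes());
        rank0.send(2, "c".getBytes());
        rank0.replayFor(1, 1);      // rank 1 restarts from a checkpoint taken after clock 1
    }
}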
Article
Full-text available
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Article
Full-text available
The consensus problem involves an asynchronous system of processes, some of which may be unreliable. The problem is for the reliable processes to agree on a binary value. In this paper, it is shown that every protocol for this problem has the possibility of nontermination, even with only one faulty process. By way of contrast, solutions are known for the synchronous case, the “Byzantine Generals” problem.
Conference Paper
Full-text available
Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new, SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers can be completely configured, as well as the node volatility. We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
Conference Paper
Full-text available
Global computing achieves high throughput computing by harvesting a very large number of unused computing resources connected to the Internet. This parallel computing model targets a parallel architecture defined by a very high number of nodes, poor communication performance and continuously varying resources. The unprecedented scale of the global computing architecture paradigm requires us to revisit many basic issues related to parallel architecture programming models, performance models, and class of applications or algorithms suitable for this architecture. XtremWeb is an experimental global computing platform dedicated to provide a tool for such studies. The paper presents the design of XtremWeb. Two essential features of this design are multi-applications and high-performance. Accepting multiple applications allows institutions or enterprises to set up their own global computing applications or experiments. High-performance is ensured by scalability, fault tolerance, efficient scheduling and a large base of volunteer PCs. We also present an implementation of the first global application running on XtremWeb
Conference Paper
Full-text available
The design, implementation, and performance of the Condor scheduling system, which operates in a workstation environment, are presented. The system aims to maximize the utilization of workstations with as little interference as possible between the jobs it schedules and the activities of the people who own workstations. It identifies idle workstations and schedules background jobs on them. When the owner of a workstation resumes activity at a station, Condor checkpoints the remote job running on the station and transfers it to another workstation. The system guarantees that the job will eventually complete, and that very little, if any, work will be performed more than once. A performance profile of the system is presented that is based on data accumulated from 23 stations during one month
Article
Full-text available
The popularity of peer-to-peer multimedia file sharing applications such as Gnutella and Napster has created a flurry of recent research activity into peer-to-peer architectures. We believe that the proper evaluation of a peer-to-peer system must take into account the characteristics of the peers that choose to participate. Surprisingly, however, few of the peer-to-peer architectures currently being developed are evaluated with respect to such considerations. In this paper, we remedy this situation by performing a detailed measurement study of the two popular peer-to-peer file sharing systems, namely Napster and Gnutella. In particular, our measurement study seeks to precisely characterize the population of end-user hosts that participate in these two systems. This characterization includes the bottleneck bandwidths between these hosts and the Internet at large, IP-level latencies to send packets to these hosts, how often hosts connect and disconnect from the system, how many files hosts share and download, the degree of cooperation between the hosts, and several correlations between these characteristics. Our measurements show that there is significant heterogeneity and lack of cooperation across peers participating in these systems.
Conference Paper
Grid technologies enable large-scale sharing of resources within formal or informal consortia of individuals and/or institutions: what are sometimes called virtual organizations. In these settings, the discovery, characterization, and monitoring of resources, services, and computations are challenging problems due to the considerable diversity, large numbers, dynamic behavior, and geographical distribution of the entities in which a user might be interested. Consequently, information services are a vital part of any Grid software infrastructure, providing fundamental mechanisms for discovery and monitoring, and hence for planning and adapting application behavior. We present here an information services architecture that addresses performance, security, scalability, and robustness requirements. Our architecture defines simple low-level enquiry and registration protocols that make it easy to incorporate individual entities into various information structures, such as aggregate directories that support a variety of different query languages and discovery strategies. These protocols can also be combined with other Grid protocols to construct additional higher-level services and capabilities such as brokering, monitoring, fault detection, and troubleshooting. Our architecture has been implemented as MDS-2, which forms part of the Globus Grid toolkit and has been widely deployed and applied.
Conference Paper
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java.
Article
Initial versions of MPI were designed to work efficiently on multi-processors which had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model would have affected their performance. As current HPC systems increase in size with greater potential levels of individual node failure, the need arises for new fault tolerant systems to be developed. Here we present a new implementation of MPI called fault tolerant MPI (FT-MPI) that allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, example applications, debugging tools and some performance issues, and also discuss the experimental HARNESS core (G_HCORE) implementation that FT-MPI is built to operate upon.
Article
In this paper, we address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results, by presenting sabotage-tolerance mechanisms that work without depending on checksums or cryptographic techniques. We first analyze the traditional technique of voting and show how it reduces error rates exponentially with redundancy, but requires all work to be done several times, and does not work well when there are many saboteurs. We then present a new technique called spot-checking which reduces the error rate linearly (i.e. inversely) with the amount of work to be done, while only costing an extra fraction of the original time. Integrating these mechanisms, we then present the new idea of credibility-based fault-tolerance, wherein we estimate the conditional probability of results and workers being correct, based on the results of using voting, spot-checking and other techniques, and then use these probability estimates to direct the use of further redundancy. Using this technique, we are able to attain mathematically guaranteeable levels of correctness, and do so with much smaller slowdown than possible with voting or spot-checking alone. Finally, we validate these new ideas with Monte Carlo simulations, and discuss other possible variations of these techniques.
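A minimal way to see the exponential claim for voting, under the simplifying assumption that each returned result is erroneous independently with probability f < 1/2 and every work unit is replicated m times (m odd) with majority voting, is the standard Chernoff-type bound:

\Pr[\text{wrong result accepted}] \;=\; \sum_{k=\lceil m/2 \rceil}^{m} \binom{m}{k} f^{k}(1-f)^{m-k} \;\le\; \bigl(4f(1-f)\bigr)^{m/2}.

Since 4f(1-f) < 1 for f < 1/2, the bound decays exponentially in the redundancy m, but the base approaches 1 as f approaches 1/2, which is exactly the many-saboteurs regime where the abstract notes voting performs poorly; spot-checking and credibility-based fault tolerance target that regime while requiring far less redundant work.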
Article
MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
Conference Paper
This paper presents the design and evaluation of Pastry, a scalable, distributed object location and routing substrate for wide-area peer-to-peer applications. Pastry performs application-level routing and object location in a potentially very large overlay network of nodes connected via the Internet. It can be used to support a variety of peer-to-peer applications, including global data storage, data sharing, group communication and naming. Each node in the Pastry network has a unique identifier (nodeId). When presented with a message and a key, a Pastry node efficiently routes the message to the node with a nodeId that is numerically closest to the key, among all currently live Pastry nodes. Each Pastry node keeps track of its immediate neighbors in the nodeId space, and notifies applications of new node arrivals, node failures and recoveries. Pastry takes into account network locality; it seeks to minimize the distance messages travel, according to a scalar proximity metric such as the number of IP routing hops. Pastry is completely decentralized, scalable, and self-organizing; it automatically adapts to the arrival, departure and failure of nodes. Experimental results obtained with a prototype implementation on an emulated network of up to 100,000 nodes confirm Pastry’s scalability and efficiency, its ability to self-organize and adapt to node failures, and its good network locality properties.
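A much-simplified routing sketch follows; it is an assumption-laden illustration rather than Pastry's algorithm, which resolves one identifier digit per hop using a prefix-organized routing table and a leaf set, and whose identifier space is circular. Here each node simply forwards a message to whichever node it knows is numerically closest to the key, and delivers it when the node itself is closest, which preserves the delivery invariant stated in the abstract.

import java.math.BigInteger;
import java.util.Set;

// Sketch: greedy "numerically closest nodeId" forwarding.
public class PastryLikeRouting {

    private static BigInteger distance(BigInteger a, BigInteger b) {
        return a.subtract(b).abs();
    }

    // One routing step on node `self`, which knows `neighbours` (its leaf set + table entries).
    // Returns the nodeId to forward to, or `self` when the message should be delivered here.
    public static BigInteger nextHop(BigInteger self, Set<BigInteger> neighbours, BigInteger key) {
        BigInteger best = self;
        for (BigInteger n : neighbours) {
            if (distance(n, key).compareTo(distance(best, key)) < 0) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        BigInteger self = new BigInteger("1200");
        Set<BigInteger> known = Set.of(new BigInteger("800"),
                                       new BigInteger("1500"),
                                       new BigInteger("4000"));
        BigInteger key = new BigInteger("1450");
        BigInteger hop = nextHop(self, known, key);
        System.out.println(hop.equals(self) ? "deliver locally" : "forward to node " + hop);
    }
}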
Conference Paper
Global Computing harvests the idle time of Internet-connected computers to run very large distributed applications. The unprecedented scale of the Global Computing System (GCS) paradigm requires revisiting the basic issues of distributed systems: performance models, security, fault-tolerance and scalability. The first parts of this paper review recent work in Global Computing, with particular interest in Peer-to-Peer systems. In the last section, we present XtremWeb, the Global Computing System we are currently developing.
Article
The “worm” programs were an experiment in the development of distributed computations: programs that span machine boundaries and also replicate themselves in idle machines. A “worm” is composed of multiple “segments,” each running on a different machine. The underlying worm maintenance mechanisms are responsible for maintaining the worm—finding free machines when needed and replicating the program for each additional segment. These techniques were successfully used to support several real applications, ranging from a simple multimachine test program to a more sophisticated real-time animation system harnessing multiple machines.
Article
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.
Article
One of the most interesting challenges for part of the high-performance community is to exploit existing computing resources for executing long-running number-crunching applications. Several important issues have to be addressed, such as portability, robustness, security, heterogeneity, load balancing and fault tolerance. Java is an emerging language that is receiving extraordinary enthusiasm and acceptance from several fields of programming. Interestingly, it presents some nice characteristics that partially solve some of those problems. This paper briefly describes JET, a parallel library implemented in Java that supports the execution of parallel applications over the Web. It is oriented to Master/Worker applications which present a coarse-grain task distribution. The library provides a high-level programming interface, support for fault-tolerance and some schemes to mask the latency of the communication. It can be used to execute massively distributed applications usi...
Article
Java offers the basic infrastructure needed to integrate computers connected to the Internet into a seamless parallel computational resource: a flexible, easily-installed infrastructure for running coarse-grained parallel applications on numerous, anonymous machines. Ease of participation is seen as a key property for such a resource to realize the vision of a multiprocessing environment comprising thousands of computers. We present Javelin, a Java-based infrastructure for global computing. The system is based on Internet software technology that is essentially ubiquitous: Web technology. Its architecture and implementation require participants to have access only to a Java-enabled Web browser. The security constraints implied by this, the resulting architecture, and current implementation are presented. The Javelin architecture is intended to be a substrate on which various programming models may be implemented. Several such models are presented: A Linda Tuple Space, an SPMD ...
Conference Paper
This paper presents the design and implementation of a remote procedure call (RPC) API for programming applications in Peer-to-Peer environments. The P2P-RPC API is designed to address one neglected aspect of Peer-to-Peer computing: the lack of a simple programming interface. In this paper we examine one concrete implementation of the P2P-RPC API derived from OmniRPC (an existing RPC API for the Grid based on the Ninf system). This new API is implemented on top of the low-level functionalities of the XtremWeb Peer-to-Peer Computing System. The minimal API defined in this paper provides a basic mechanism for migrating a wide variety of applications that use an RPC mechanism to Peer-to-Peer systems. We evaluate P2P-RPC on a numerical application (the NAS EP Benchmark) and demonstrate its performance and fault tolerance properties.
Conference Paper
Grid technologies enable large-scale sharing of resources within formal or informal consortia of individuals and/or institutions: what are sometimes called virtual organizations. In these settings, the discovery, characterization, and monitoring of resources, services, and computations are challenging problems due to the considerable diversity, large numbers, dynamic behavior, and geographical distribution of the entities in which a user might be interested. Consequently, information services are a vital part of any Grid software infrastructure, providing fundamental mechanisms for discovery and monitoring, and hence for planning and adapting application behavior. We present an information services architecture that addresses performance, security, scalability, and robustness requirements. Our architecture defines simple low-level enquiry and registration protocols that make it easy to incorporate individual entities into various information structures, such as aggregate directories that support a variety of different query languages and discovery strategies. These protocols can also be combined with other Grid protocols to construct additional higher-level services and capabilities such as brokering, monitoring, fault detection, and troubleshooting. Our architecture has been implemented as MDS-2, which forms part of the Globus Grid toolkit and has been widely deployed and applied.
Conference Paper
We address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results by presenting sabotage-tolerance mechanisms that work without depending on checksums or cryptographic techniques. We first analyze the traditional technique of voting, and show how it reduces error rates exponentially with redundancy, but requires all work to be done at least twice, and does not work well when there are many saboteurs. We then present a new technique called spot-checking which reduces the error rate linearly (i.e., inversely) with the amount of work to be done, while only costing an extra fraction of the original time. We then integrate these mechanisms by presenting the new idea of credibility-based fault-tolerance, which uses probability estimates to efficiently limit and direct the use of redundancy. By using voting and spot-checking together credibility-based fault-tolerance effectively allows us to exponentially shrink an already linearly-reduced error rate, and thus achieve error-rates that are orders-of-magnitude smaller than those offered by voting or spot-checking alone. We validate this new idea with Monte Carlo simulations, and discuss how credibility-based fault tolerance can be used with other mechanisms and in other applications.
Conference Paper
This paper describes MW (Master-Worker), a software framework that allows users to quickly and easily parallelize scientific computations using the master-worker paradigm on the Computational Grid. MW provides both a “top-level” interface to application software and a “bottom-level” interface to existing Grid computing toolkits. Both interfaces are briefly described. We conclude with a case study, where the necessary Grid services are provided by the Condor high-throughput computing system, and the MW-enabled application code is used to solve a combinatorial optimization problem of unprecedented complexity.
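The master-worker paradigm that frameworks like MW package can be sketched in a few lines of ordinary Java; this is a toy illustration, not MW's interface, and threads stand in for remote, volatile volunteer nodes. The master owns a queue of independent work units and collects results; workers are interchangeable, and a unit whose worker vanishes would simply be re-queued.

import java.util.concurrent.*;

// Sketch: master-worker execution of independent work units.
public class MasterWorkerSketch {

    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        BlockingQueue<long[]> results = new LinkedBlockingQueue<>();
        int nTasks = 20;
        for (int t = 0; t < nTasks; t++) tasks.add(t);

        int nWorkers = 4;                              // stand-ins for volunteer nodes
        ExecutorService workers = Executors.newFixedThreadPool(nWorkers);
        for (int w = 0; w < nWorkers; w++) {
            workers.submit(() -> {
                Integer task;
                while ((task = tasks.poll()) != null) {
                    long value = compute(task);        // the application-specific kernel
                    results.add(new long[]{task.intValue(), value});
                    // in a real deployment, a crash here would be detected by the master
                    // (timeout/heartbeat) and the lost unit pushed back onto the queue
                }
            });
        }

        long total = 0;                                // master-side result collection
        for (int i = 0; i < nTasks; i++) total += results.take()[1];
        System.out.println("aggregate result = " + total);
        workers.shutdown();
    }

    private static long compute(int task) {            // toy work unit: sum of squares
        long s = 0;
        for (int i = 0; i <= task; i++) s += (long) i * i;
        return s;
    }
}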
Conference Paper
The POPCORN project provides an infrastructure for globally distributed computation over the whole Internet. It provides any programmer connected to the Internet with a single huge virtual parallel computer composed of all processors on the Internet which care to participate at any given moment. The system provides a market-based mechanism of trade in CPU time to motivate processors to provide their CPU cycles for other people's computations. Selling CPU time is as easy as visiting a certain Web site with a Java-enabled browser. Buying CPU time is done by writing a parallel program, using our programming paradigm (and libraries). This paradigm was designed to fit the situation of global computation. A third entity in our system is a market for CPU time, which is where buyers and sellers meet and trade. The system has been implemented and may be visited and used on our Web site: http://www.cs.huji.ac.il/-popcorn
Conference Paper
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java. We developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment
Article
Increasingly, computing addresses collaboration, data sharing, and interaction modes that involve distributed resources, resulting in an increased focus on the interconnection of systems both within and across enterprises. These evolutionary pressures have led to the development of Grid technologies. The authors' work focuses on the nature of the services that respond to protocol messages. Grid provides an extensible set of services that can be aggregated in various ways to meet the needs of virtual organizations, which themselves can be defined in part by the services they operate and share
Article
Hash tables -- which map "keys" onto "values" -- are an essential building block in modern software systems. We believe a similar functionality would be equally valuable to large distributed systems. In this paper, we introduce the concept of a Content-Addressable Network (CAN) as a distributed infrastructure that provides hash table-like functionality on Internet-like scales. The CAN design is scalable, fault-tolerant and completely self-organizing, and we demonstrate its scalability, robustness and low-latency properties through simulation.
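The hash-table-on-a-coordinate-space idea can be sketched as follows; this is illustrative only, with zone splitting, neighbor maintenance and greedy routing omitted, and all names hypothetical. A key is hashed to a point in the d-dimensional unit space, and the (key, value) pair lives at the node whose zone contains that point.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Sketch: CAN-style placement of keys onto zones of a d-dimensional coordinate space.
public class CanStyleKeyPlacement {

    // A zone is a half-open axis-aligned box [lo, hi) owned by one node.
    public record Zone(String owner, double[] lo, double[] hi) {
        boolean contains(double[] p) {
            for (int i = 0; i < p.length; i++)
                if (p[i] < lo[i] || p[i] >= hi[i]) return false;
            return true;
        }
    }

    // Deterministically hash a key to a point in [0,1)^d, one coordinate per 4-byte hash slice.
    public static double[] toPoint(String key, int d) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256").digest(key.getBytes(StandardCharsets.UTF_8));
            double[] p = new double[d];
            for (int i = 0; i < d; i++) {
                byte[] slice = Arrays.copyOfRange(h, 4 * i, 4 * i + 4);
                p[i] = new BigInteger(1, slice).doubleValue() / 4294967296.0;   // divide by 2^32
            }
            return p;
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        // A 2-D space split into two zones owned by two hypothetical nodes.
        List<Zone> zones = List.of(
            new Zone("node-A", new double[]{0.0, 0.0}, new double[]{0.5, 1.0}),
            new Zone("node-B", new double[]{0.5, 0.0}, new double[]{1.0, 1.0}));
        double[] p = toPoint("some-file-name", 2);
        String owner = zones.stream().filter(z -> z.contains(p))
                            .findFirst().map(Zone::owner).orElseThrow();
        System.out.printf("key maps to point (%.3f, %.3f), stored on %s%n", p[0], p[1], owner);
    }
}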
Article
The Internet, best known by most users as the World-Wide-Web, continues to expand at an amazing pace. We propose a new infrastructure to harness the combined resources, such as CPU cycles or disk storage, and make them available to everyone interested. This infrastructure has the potential for solving parallel supercomputing applications involving thousands of cooperating components. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing embodied in languages such as Java. We developed a prototype of a global computing infrastructure, called SuperWeb, that consists of hosts, brokers and clients. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Client computations are then mapped by the broker onto the registered resources. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment.
Article
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. The development of NetSolve was motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which require no programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the computational resources available as efficiently as possible. NetSolve offers the ability to look for computational resources on a network, choose the best one available, solve a problem (with retry for fault-tolerance), and return the answer to the user.