Article

A survey of rollback protocols in message-passing systems

... We have implemented a simple PFTP component that adopts the coordinated checkpointing algorithm [5]. This component creates global consistent states at intervals specified in a configuration file. ...
... The global consistent state is created as shown in Figure 8. With this algorithm, a global state, which consists of checkpoints of all the MPI processes, is consistent [5]. ...
Conference Paper
Full-text available
Fault tolerance for HPC systems running long-lived applications of massive and growing scale is now essential. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome in real systems: collective application memory grows with problem scale while I/O performance does not keep pace, causing substantial overhead for the overall system. Automated optimization of the checkpoint interval is therefore essential, but the optimal point depends on hardware failure rates and I/O bandwidth, both of which may change fairly quickly over time and are often very difficult for the user to determine without considerable overhead and/or divergence; as far as we know, no prior work has addressed this issue for large parallel systems. Our new model and algorithm, an extension of Vaidya's proposal, solve the problem by taking these parameters into account. A prototype implementation on our fault-tolerant MPI framework ABARIS showed approximately 5% to 20% improvement over statically user-determined intervals.
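As a rough illustration of why the optimal interval depends on both failure rate and checkpoint cost (and hence on I/O bandwidth), the sketch below uses Young's classic first-order approximation rather than the extended Vaidya-style model the paper proposes; all numbers and names are illustrative.

    import math

    def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
        """First-order optimal checkpoint interval (Young's approximation)."""
        return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

    # Illustrative numbers: 1 TB of aggregate state written at 10 GB/s -> 100 s per checkpoint.
    checkpoint_cost = 1e12 / 10e9
    for mtbf_hours in (6, 24, 168):          # failure rates that may drift over the run
        t = young_interval(checkpoint_cost, mtbf_hours * 3600)
        print(f"MTBF {mtbf_hours:>3} h -> checkpoint every {t / 60:.1f} min")

The point of the sweep is simply that a fixed, user-chosen interval cannot track an MTBF or I/O bandwidth that drifts by an order of magnitude during the run.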
... The states of the resources during an operation are captured by the resource state capturing service. Unlike existing traditional checkpointing mechanisms which capture the states for the whole software system [13], our state capturing method only captures the states of the resources involved in the determined resource space of the operation rather than the resources of the whole cloud system, hence increasing the efficiency of state capturing especially when the system scale is large. The state capturing algorithm is shown in Figure 8. ...
... Undoability checker is able to identify which opera- tions are not undoable and why. If undo is possible and desired, an AI planning technique [26] can be applied to automatically create a workflow that takes the system back to the desired earlier state, which is similar to rolling back a system state to a previous consistent system state in the context of message-passing systems [13]. ...
Conference Paper
Sporadic operations such as rolling upgrade or machine instance redeployment are prone to unpredictable failures in the cloud, largely due to the inherently high variability of the cloud. Previous dependability research has established several recovery methods for cloud failures. In this paper, we first propose eight recovery patterns for sporadic operations. We then present the filtering process which filters applicable recovery patterns for a given operational step. We also propose a methodology to evaluate the recovery actions generated for the applicable recovery patterns based on the metrics of Recovery Time, Recovery Cost and Recovery Impact. This quantitative evaluation leads to the selection of optimal recovery actions. We implement a recovery service and illustrate its applicability by recovering from errors occurring in the Asgard rolling upgrade operation on the cloud. The experimental results show that the recovery service enhances automated recovery from operational failures by selecting the optimal recovery actions.
... The states of the resources during an operation are captured by the resource state capturing service. Unlike existing traditional checkpointing mechanisms which capture the states for the whole software system [13], our state capturing method only captures the states of the resources involved in the determined resource space of the operation rather than the resources of the whole cloud system, hence increasing the efficiency of state capturing especially when the system scale is large. The state capturing algorithm is shown in Figure 8. ...
... Undoability checker is able to identify which operations are not undoable and why. If undo is possible and desired, an AI planning technique [26] can be applied to automatically create a workflow that takes the system back to the desired earlier state, which is similar to rolling back a system state to a previous consistent system state in the context of message-passing systems [13]. ...
Article
Sporadic operations such as rolling upgrade or machine instance redeployment are prone to unpredictable failures in the public cloud, largely because of the inherently high variability of the public cloud. Previous dependability research has established several recovery methods for cloud failures. In this paper, we first propose eight recovery patterns for sporadic operations on the public cloud. We then present the filtering process which filters applicable recovery patterns. We propose an automation mechanism to automatically generate recovery actions for those applicable recovery patterns based on our resource state transition algorithm. We also propose a methodology to evaluate the recovery actions generated for the applicable recovery patterns based on the recovery evaluation metrics of Recovery Time, Recovery Cost, and Recovery Impact. This quantitative evaluation leads to the selection of acceptable recovery actions. We propose two recovery-action selection mechanisms: one is based on user constraints on the recovery evaluation metrics, and the other is based on a Pareto-set search algorithm. We implement a recovery service and illustrate its applicability by recovering from errors occurring in the rolling upgrade operation on the AWS cloud.
... Among other systems, distributed systems make use of rollback recovery techniques for error recovery similar to the log-based technique used in this thesis. The papers [29,30,31] were evaluated for guidance during its design. ...
... that all previous computational work completed by one or multiple nodes is lost [31]. ...
Article
CubeSat satellites have redefined the standard solution for conducting missions in space due to their unique form factor and cost. The harsh environment of space necessitates examining features that improve satellite robustness and ultimately extend lifetime, which is typical and vital for mission success. The CubeSat development team at Cal Poly, PolySat, has recently redefined its standard avionics platform to support more complex mission capabilities with this robustness in mind. A significant addition was the integration of the Linux operating system, which provides the flexibility to develop much more elaborate protection mechanisms within software, such as support for remote on-orbit software updates. This thesis details the design and development of such a feature-set with critical software recovery and multiple-mission single-CubeSat functionality in mind. As a result, features that focus on software update usability, validation, system recovery, upset tolerance, and extensibility have been developed. These include backup Linux kernel and file system image availability, image validation prior to boot, and the use of multiple file system devices to protect against system upsets. Furthermore, each feature has been designed for usability on current and future missions.
... Global state consistency requires that if the state of a process (a middleware server or an end client process) includes a message receive, either request or reply, the sender process's state must include the message send [13]. Figure 2 illustrates a violation of such consistency. ...
... Log-based recovery for general application processes over a Java virtual machine was partially implemented [30], but no consistency among interacting processes was considered. Pessimistic logging and optimistic logging were invented in the fault tolerance community [13] and were used for consistent recovery of message passing systems, where entities interacted with one another via message exchanges only. Individual threads were considered as separate recovery units with optimistic logging [10]; however, log management of a multi-threaded process with optimistic logging was not explored. ...
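The consistency condition quoted above can be illustrated with a small check over a candidate global checkpoint (a "cut"). This is only a sketch, assuming each checkpoint records the identifiers of the messages its process has sent and received; the data layout and names are hypothetical.

    def is_consistent_cut(checkpoints: dict) -> bool:
        """Every message receive recorded by some checkpoint must have its matching
        send recorded by the sender's checkpoint, otherwise the cut is inconsistent."""
        for state in checkpoints.values():
            for msg_id, sender in state["received"]:       # (message id, sending process)
                if msg_id not in checkpoints[sender]["sent"]:
                    return False                           # orphan message found
        return True

    # Hypothetical cut: p2 recorded receiving m1, but p1 checkpointed before sending it.
    cut = {
        "p1": {"sent": set(),   "received": set()},
        "p2": {"sent": set(),   "received": {("m1", "p1")}},
    }
    print(is_consistent_cut(cut))   # False -- exactly the kind of violation described above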
Article
Full-text available
Providing enterprises with reliable and available Web-based application programs is a challenge. Applications are traditionally spread over multiple nodes, from user (client), to middle tier servers, to back end transaction systems, e.g. databases. It has proven very difficult to ensure that these applications persist across system crashes so that “exactly once” execution is produced, always important and sometimes essential, e.g., in the financial area. Our system provides a framework for exactly once execution of multi-tier Web applications, built on a commercially available Web infrastructure. Its capabilities include low logging overhead, recovery isolation (independence), and consistency between mid-tier and transactional back end. Good application performance is enabled via persistent shared state in the middle tier while providing for private session state as well. Our extensive experiments confirm both the desired properties and the good performance. KeywordsApplication fault tolerance–Exactly once execution–Transaction processing–Recovery–Optimistic logging–Distributed systems
... In forward error recovery techniques, the nature of errors and the damage caused by faults must be completely and accurately assessed, so that it becomes possible to remove those errors from the process state and enable the process to move forward [27]. In a distributed system, precise assessment of all the faults may not be possible. ...
... The platform utilizes digital pen and touch-sensitive screen technologies to build a digital desk, which is demonstrated to be able to effectively support traditional lecturing activities and to directly capture teachers' classroom presentations with minimal disruption. [10] This study covers rollback-recovery strategies that do not require the use of a special language. We describe rollback-recovery protocols, which the first section of the survey classifies into checkpoint-based and log-based approaches. ...
Article
Various errors attack VLSI SoC designs; they corrupt mathematical operations and distort results, which is why soft errors are increasingly monitored. With the growth of information communication, sources of noise (SON), interference, and parallel processing, fault-tolerance requirements increase, so designers have been striving for more efficient and reliable techniques for detecting and correcting faults in parallel transmission (TX) and reception (RX) of data. Even though several methods and advances have been proposed and applied in past years, data reliability in TX and RX is still a problem. In this research we propose a more efficient combined error-detection and correction technique based on artificial-intelligence algorithmic-based fault tolerance (AIABFT) with parallel orthogonal codes and vertical parity. Using the proposed method, we design a parallel-processing fault detection and correction FFT. The AIABFT method has been experimentally implemented and simulated using Xilinx Vivado. Simulation results indicate that the suggested method detects 97% of the errors and corrects them as expected in the received impaired n-bit code, for up to (n/2 - 1) bits of errors.
... The state of the cyber recovery refers to computing information such as variable values in the program or the cyber state. The conventional method is that after a fault is detected, the computing tasks will be rolled back to a globally consistent state checkpointed in the history [17]. By contrast, this article aims to restore the state of a physical system or the physical state. ...
Article
Full-text available
The increasing autonomy and connectivity in cyber-physical systems (CPS) come with new security vulnerabilities that are easily exploitable by malicious attackers to spoof a system to perform dangerous actions. While the vast majority of existing works focus on attack prevention and detection, the key question is “what to do after detecting an attack?”. This problem has attracted relatively little attention, though its significance is emphasized by the need to mitigate or even eliminate attack impacts on a system. In this article, we study this attack response problem and propose novel real-time recovery for securing CPS. First, this work’s core component is a recovery control calculator using a Linear-Quadratic Regulator (LQR) with timing and safety constraints. This component can smoothly steer back a physical system under control to a target state set before a safe deadline and maintain the system state in the set once it is driven to it. We further propose an Alternating Direction Method of Multipliers (ADMM) based algorithm that can quickly solve the LQR-based recovery problem. Second, supporting components for the attack recovery computation include a checkpointer, a state reconstructor, and a deadline estimator. To realize these components respectively, we propose (i) a sliding-window-based checkpointing protocol that governs sufficient trustworthy data, (ii) a state reconstruction approach that uses the checkpointed data to estimate the current system state, and (iii) a reachability-based approach to conservatively estimate a safe deadline. Finally, we implement our approach and demonstrate its effectiveness in dealing with a total of 15 experimental scenarios designed based on 5 CPS simulators and 3 types of sensor attacks.
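A minimal sketch of what a sliding-window checkpointer of the kind described above might look like, assuming states are timestamped and anything recorded after the estimated attack onset is discarded; the class and parameter names are illustrative and not the paper's implementation.

    from collections import deque

    class SlidingWindowCheckpointer:
        """Keep only checkpoints from the last `window_s` seconds of execution."""

        def __init__(self, window_s: float):
            self.window_s = window_s
            self.buffer = deque()                 # (timestamp, state) pairs, oldest first

        def checkpoint(self, now: float, state) -> None:
            self.buffer.append((now, state))
            while self.buffer and now - self.buffer[0][0] > self.window_s:
                self.buffer.popleft()             # states outside the window are dropped

        def trusted_states(self, detection_time: float, attack_lag: float):
            """Checkpoints taken before the attack is assumed to have started."""
            cutoff = detection_time - attack_lag
            return [(t, s) for (t, s) in self.buffer if t <= cutoff]

    cp = SlidingWindowCheckpointer(window_s=5.0)
    for step in range(10):
        cp.checkpoint(now=float(step), state={"x": step})
    print(cp.trusted_states(detection_time=9.0, attack_lag=2.0))   # states up to t = 7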
... The first reason is the generally lower equipment cost, both in terms of initial investment and maintenance. The second reason is the resilience against failures because, when a single processor fails within an HPC application, the system can still continue operating by initiating a partial recovery (e.g., based on communication-driven checkpointing [46] or partial re-computation [168]). The third reason is the increase in aggregate I/O bandwidth compared to a single machine [49]. ...
Article
The demand for artificial intelligence has grown significantly over the past decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, and turning the centralized into a distributed system. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.
... The literature on failures and fault tolerance is vast [10,29,41,42,64,65,73,79,80,89]. The notion of fault tolerance, which we develop later, is the ability of a system to maintain operation despite the failures that occur. ...
Thesis
The work presented in this thesis concerns the scheduling of linear multi-task workflow applications on distributed platforms. The particularity of the system studied is that the number of machines composing the platform is smaller than the number of tasks to perform. In this case the machines are assumed to be able to perform all the tasks of the application given a reconfiguration, knowing that any reconfiguration takes a given amount of time, which may or may not depend on the tasks. The problem posed is to maximize the throughput of the application, i.e., the average number of outputs per unit of time, or to minimize the period, i.e., the average time between two outputs. The problem therefore decomposes into two sub-problems: the assignment of tasks to the machines of the platform (one or several tasks per machine), and the scheduling of these tasks within a single machine given the reconfiguration times. To this end, the platform provides spaces called buffers, either allocatable or imposed, to store temporary production results and thus avoid having to reconfigure the machines after each task. If the buffers are not pre-assigned, we must also solve the problem of allocating the available space to buffers in order to optimize the execution of the schedule within each machine. This document is an exhaustive study of the different problems associated with the heterogeneity of the application; indeed, while the problems are trivial to solve with homogeneous reconfiguration times and buffers, they become much more complex when these are heterogeneous. We therefore propose to study our three major problems for different degrees of heterogeneity of the application. We propose heuristics to handle these problems when it is not possible to find an optimal algorithmic solution.
... Error handling in a graph processing system, as with other systems, has two main phases: (1) failure detection, in which the system discovers the error, and (2) fault recovery, in which the system tries to resolve the problem and resume the operation. Much research has been done on various fault-tolerance techniques in parallel and distributed systems (Kavila et al. 2013; Treaster 2005; Elnozahy et al. 2002). In Treaster (2005), for example, two types of components in an application, called central components and parallel components, are investigated, both of which mostly use rollback and replication methods for fault recovery. ...
Article
Full-text available
The world is becoming a more conjunct place and the number of data sources such as social networks, online transactions, web search engines, and mobile devices is increasing even more than had been predicted. A large percentage of this growing dataset exists in the form of linked data, more generally, graphs, and of unprecedented sizes. While today's data from social networks contain hundreds of millions of nodes connected by billions of edges, inter-connected data from globally distributed sensors that forms the Internet of Things can cause this to grow exponentially larger. Although analyzing these large graphs is critical for the companies and governments that own them, big data tools designed for text and tuple analysis such as MapReduce cannot process them efficiently. So, graph distributed processing abstractions and systems are developed to design iterative graph algorithms and process large graphs with better performance and scalability. These graph frameworks propose novel methods or extend previous methods for processing graph data. In this article, we propose a taxonomy of graph processing systems and map existing systems to this classification. This captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks. Our effort helps to highlight key distinctions in architectural approaches, and identifies gaps for future research in scalable graph systems.
... Many techniques for improving the performance and reliability of systems hinge on the ability to automatically manipulate program state in memory. In particular, checkpointing[14], transactions[20,21,33], replication[36,37], multiversion concurrency[1,4], etc., involve snapshotting parts of program state. This, in turn, requires traversing pointer-linked data structures in memory. ...
Conference Paper
Rust is a new system programming language that offers a practical and safe alternative to C. Rust is unique in that it enforces safety without runtime overhead, most importantly, without the overhead of garbage collection. While zero-cost safety is remarkable on its own, we argue that the superpowers of Rust go beyond safety. In particular, Rust's linear type system enables capabilities that cannot be implemented efficiently in traditional languages, both safe and unsafe, and that dramatically improve security and reliability of system software. We show three examples of such capabilities: zero-copy software fault isolation, efficient static information flow analysis, and automatic checkpointing. While these capabilities have been in the spotlight of systems research for a long time, their practical use is hindered by high cost and complexity. We argue that with the adoption of Rust these mechanisms will become commoditized.
... Synchronization is needed because concurrent processes, while passing messages, can change their internal state. Three checkpointing strategies were described for concurrent processes [95]. ...
Article
Full-text available
A data grid provides an efficient solution for data-oriented applications that need to manage and process large data sets located at geographically distributed storage resources. The data grid relies on data replicas to enhance performance and to ensure fault-tolerant results for users. Replicas are developed to increase the availability of data and to provide better data access. Replicas have their own advantages, but there are a number of issues that must be resolved. Among the various existing issues, the critical concern is replica consistency. Various replica consistency strategies are available in the literature. These strategies rationalize and investigate various parameters like bandwidth consumption, access cost, scalability, execution time, storage consumption, staleness, and freshness of replicas. In this paper, several asynchronous replica consistency strategies are classified and analyzed based on various criteria such as topology, level of abstraction, update propagation, and locality. Some other strategies are also discussed and analyzed, like adaptive consistency, quorum-based consistency, load balancing, agent-based economically efficient strategies, checkpointing, fault tolerance, and conflict management. The parameters on which these strategies are analyzed are methodology, replication classification, consistency, grid topology, environment, evaluation parameters, and performance.
... All the raw data and operations that depend on the malicious transaction can be found by the recovery algorithm. Because of the high cost of the former (redo) recovery method [12] , we must make clear the execution semantics of each transaction, so we adopt the backward recovery (undo) method, and the file system will be restored to the previous consistent state. ...
Article
This paper first proposes, through IBAC (an integration of TE and RBAC) and the use of compensating well-formed transactions, an integrity-monitor model in which partial malicious transactions can be recovered. For partial revocation of constituent transactions, two recovery policies are provided for tracking the operations on data and the operations affected. The conservative recovery policy stops normal transaction execution during recovery and, by analyzing the dependency list in the log file, cancels each affected operation according to the order in which the operations were performed. The optimistic recovery policy, by contrast, can run while the system continues normal operation: it builds compensating operations corresponding to the operations to be recovered and submits them to the integrity-monitoring scheduler. This method can recover the system to a secure state in the face of failures and improves the availability of the system. It provides an important exploration for the design and implementation of trusted recovery mechanisms for high-level secure operating systems.
... The time interval between two successive failures of a processor obeys an exponential distribution, which is a continuous random variable. Only one rollback recovery occurs in the system: among the N processors, only one breaks down in the interval 0 ~ T after the checkpoint interval begins [6,7]. The rollback-recovery procedure then runs until it completes. ...
Conference Paper
Full-text available
In order to keep the system transferable, a snapshot of the VMs of all the nodes participating in the distributed simulation system needs to be taken every once in a while. It is very important to set a reasonable checkpoint interval to optimize the system's average utilization rate. By establishing a mathematical model, the availability of a distributed simulation system based on virtualization technology is analyzed, the equations for maximizing system availability are derived, and the best checkpoint interval for system fault tolerance is obtained; the fault tolerance of the simulation system is thus addressed, which is important for improving the effectiveness and credibility of the system.
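To make the trade-off concrete, the sketch below uses one common first-order availability model (snapshot cost C, restart cost R, failure rate λ, interval T) and sweeps candidate intervals; this is not the paper's exact model, and the numbers are illustrative only.

    def availability(T: float, C: float, R: float, failure_rate: float) -> float:
        """First-order model: each cycle spends T computing plus C checkpointing; a
        failure (probability ~ failure_rate * cycle) costs R plus ~T/2 of redone work."""
        cycle = T + C
        expected_loss = failure_rate * cycle * (R + T / 2.0)
        return T / (cycle + expected_loss)

    C, R, lam = 30.0, 60.0, 1.0 / (12 * 3600)    # 30 s snapshot, 60 s restart, one fault per 12 h
    best_T = max(range(60, 7200, 60), key=lambda T: availability(T, C, R, lam))
    print(best_T, round(availability(best_T, C, R, lam), 4))

Short intervals waste time snapshotting, long intervals waste time re-executing lost work; the maximum of this curve is the "best checkpoint interval" the abstract refers to.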
... Communication induced protocols [31] [18] [24] try to combine the advantages of uncoordinated protocols and coordinated protocols. They have two kinds of snapshots, local snapshots and forced snapshots [20]. On one hand, like uncoordinated protocols, communication induced protocols allow the processes to take local snapshots whenever they want. ...
... Unlike most checkpointing middleware, CoCheck is visible to the user, and is available at a layer above the message-passing middleware. Common problems with checkpointing and recovery such as global inconsistent states and domino effects [11] are eliminated through the use of a protocol to "flush" all in-transit messages before a checkpoint is created. CoCheck is primarily intended to facilitate process migration, load balancing, and stalling of long running applications for resumption at a later time. ...
Article
Full-text available
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incorporate high overhead because of their inability to utilize application level information. This paper describes the design and implementation of MPI/FT, a high-performance MPI-1.2 implementation enhanced with low-overhead functionality to detect and recover from process failures. The strategy behind MPI/FT is that fault tolerance in message-passing middleware can be optimized based on an application's execution model derived from its communication topology and parallel programming semantics. MPI/FT exploits the specific characteristics of two parallel application execution models in order to optimize performance. MPI/FT also introduces the self-checking thread that monitors the functioning of the middleware itself. User aware checkpointing and user-assisted recovery are compatible with MPI/FT and complement the techniques used here. This paper offers a classification of MPI applications for fault tolerant MPI purposes, and the MPI/FT implementation discussed here provides different middleware versions specifically tailored to each of the two models studied in detail. The interplay of various parameters affecting the cost of fault tolerance is investigated. Experimental results demonstrate that the approach used to design and implement MPI/FT results in a low-overhead MPI-based fault tolerant communication middleware implementation.
... This chapter added process failure and savepoints to π mlt , not to study the properties of the resulting calculus in any depth, but rather as a stepping stone towards the full 2PCP. Persistence and savepointing have received much attention from the distributed systems community [33,47], but to the best of our knowledge process theoretic accounts are lacking. This leaves much work to be done, in particular, a satisfactory theory of the induced reduction congruence. ...
Article
For historical, sociological and technical reasons, λ-calculi have been the dominant theoretical paradigm in the study of functional computation. Similarly, but to a lesser degree, π-calculi dominate advanced mathematical accounts of concurrency. Alas, and despite its ever increasing ubiquity, an equally convincing formal foundation for distributed computing has not been forthcoming. This thesis seeks to contribute towards ameliorating that omission. To this end, guided by the assumption that...
... · Respect of machines' ownership: When a machine executing some work of a MARS application is requisitioned by its owner the MARS system folds this work. · Fault tolerance: MARS integrates a fault tolerance mechanism which includes a checkpointing algorithm [11]. This characteristic is developed in Section 5.3. ...
Article
Load analysis of meta-systems including NOWs or COWs has shown that only a small percentage of the available power is used over long periods of time. Therefore, in order to exploit the idle time when executing a parallel application, workload must be sent to a machine as soon as the latter becomes available. Furthermore, in order to respect the ownership of workstations, work has to be stopped, and resumed later, as soon as the machine executing it is requisitioned by its owner. As a consequence, users need an adaptive system that reports events related to the goings and comings of workstations. On the other hand, it is necessary to provide them with a parallel adaptive programming methodology that plans the handling of these events. In this paper, we present the MARS (multi-user adaptive resource scheduler, developed at the LIFL laboratory, Université de Lille I) system and its parallel adaptive programming methodology through the block-based Gauss–Jordan algorithm used in numerical analysis to invert large matrices. Moreover, we propose a work scheduling strategy and an application-oriented solution for the fault tolerance issue. Furthermore, we present some experimental results obtained on a DEC/ALPHA COW and a SUN/Sparc4 NOW. The results show that very high absolute efficiencies can be obtained if the size of the blocks is well chosen. We also present some experiments related to the adaptability of the application in a meta-system including the DEC/ALPHA COW and the SUN/Sparc4 NOW. The results show that the management of the adaptability consumes only a small percentage of execution time.
... Figure 3 shows a sample of the data generated by running a scenario. Here we compare three index-based [5] algorithms: BCS [8], BQF [3] and HMNR [6]. In this scenario we define a fully connected network and vary the number of processes in the system. ...
Article
Distributed checkpointing algorithms play an important role in the majority of the fault tolerant software components existent today. Unfortunately, there is a lack of comprehensive and uniform performance testing of those algorithms. Our research focuses on the provision of a toolkit, Metapromela, that helps with the implementation and testing of distributed checkpointing algorithms. This paper is concerned primarily with the description of Metapromela and the characteristics that make it a good tool for the evaluation of checkpointing algorithms.
... In the area of software fault-tolerance the goals are to be able to restart an application after a fault, periodically rejuvenate a system in order to avoid a fault, or roll back and/or replay an application after a fault. This can be achieved through reboot [8, 15, 30] and its variations [2, 7], software rejuvenation [18] and roll-back recovery and replay techniques [1, 5, 11, 13, 21]. These approaches cannot be directly applied to post-attack recovery because they do not address availability, may allow the system to receive several attacks from the same source, may lose important system state while recovering, or do not include an undo or repair mechanism. ...
Article
Availability is difficult for systems to maintain in the face of Internet worms. Large systems have vulnerabilities, and if a system attempts to continue operation after an attack, it may not behave properly. Traditional mechanisms for detecting attacks disrupt service and can convert such attacks into denial-of-service. Current recovery approaches have at least one of the following limitations: they cannot recover the complete system state, they cannot recover from zero-day exploits, they undo the effects of the attack speculatively, or they require the application's source code to be available. This paper presents WormHealer, a replay-based, architecture-level post-attack recovery framework using VM technology. After a control-flow hijacking attack has been detected, we replay the checkpointed run using symbolic execution to discover the source of the malicious attack. We then replay the run a second time but ignore inputs from the malicious source. We evaluated WormHealer on five exploits for Linux and Windows. In all cases, it recovered the full system state and resumed execution. It also recovered all TCP connections with non-malicious clients and the communication that had taken place during the attack, except for some limited cases.
... There are several systems that offer C/R capabilities, e.g., Condor [23], Manetho [13], and LoadLeveler [25], and quite a few protocols and techniques for C/R have been proposed. Generally, C/R protocols can be categorized as either coordinated, in which case all processes coordinate their checkpointing to form a global consistent state [10,15,32], or as uncoordinated, in which case every process can perform checkpointing independently [1,29,32,34,41]. One of the important aspects of Starfish architecture is that it enables us to implement and study both coordinated and uncoordinated checkpointing within a single framework. ...
Article
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
... For other distributed initiator schemes like Prakash and Singhal's algorithm[3], there may be multiple instances of checkpointing in the system at any given time. A detailed description of various checkpoint and rollback-recovery protocols can be found in [4, 5] . Our aim is to design a checkpointing algorithm with multiple initiators, but which has only one instance of checkpointing going on in the system at any point in time and has all of the above features. ...
Conference Paper
Full-text available
In this paper, we describe an efficient coordinated checkpointing and recovery algorithm which can work even when the channels are assumed to be non-FIFO, and messages may be lost. Nodes are assumed to be autonomous, and they do not block while taking checkpoints. Based on local conditions, any process can request the previous coordinator for 'permission' to initiate a new checkpoint. Allowing multiple initiators of checkpoints avoids the bottleneck associated with a single initiator, but the algorithm permits only a single instance of the checkpointing process at any given time, thus reducing much of the overhead associated with multiple initiators of distributed algorithms.
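The single-instance idea can be illustrated with a small permission gate, assuming a current coordinator that grants or denies initiation requests; this sketch shows only that one aspect, not the full non-blocking, non-FIFO algorithm, and all names are illustrative.

    import threading

    class CheckpointCoordinator:
        """Grant at most one checkpointing instance at a time: a node asks the current
        coordinator for permission and, if granted, runs the next round itself."""

        def __init__(self):
            self._lock = threading.Lock()
            self._round_active = False

        def request_initiation(self, node_id: str) -> bool:
            with self._lock:
                if self._round_active:
                    return False              # a round is already in progress: denied
                self._round_active = True     # `node_id` becomes coordinator of this round
                return True

        def round_finished(self) -> None:
            with self._lock:
                self._round_active = False    # the next initiator may now be granted

    coord = CheckpointCoordinator()
    print(coord.request_initiation("n1"))     # True  -- n1 starts a checkpoint round
    print(coord.request_initiation("n2"))     # False -- only one instance at a time
    coord.round_finished()
    print(coord.request_initiation("n2"))     # True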
... Nearly all existing MA platforms are Java based, so it is easy to utilize the serialization technique provided by Java to make checkpoints. Checkpointing techniques have been classified into three major schemes, namely independent (or uncoordinated), coordinated, or communication-induced [7]. Among the three checkpointing schemes, independent checkpointing is the simplest one. ...
Conference Paper
Full-text available
As a widely used fault tolerance technique, checkpointing has evolved into several schemes: independent, coordinated, and communication-induced (CIC). Independent and coordinated checkpointing have been adopted in many works on fault tolerant mobile agent (MA) systems. However, CIC, a flexible, efficient, and scalable checkpointing scheme, has not been applied to MA systems. Based on the analysis of the behavior of mobile agent, we argue that CIC is a well suited checkpointing scheme for MA systems. CIC not only establishes the consistent recovery lines efficiently but also integrates well with the independent checkpointing for reliable MA migration. In this paper, we propose an important improvement to CIC, referred to as the deferred message processing based CIC algorithm (DM-CIC), which achieves higher efficiency by exempting the CIC algorithm from making the forced checkpoints in MA systems. Through simulation, we find out that DM-CIC is stable and better suited to large scale MA systems.
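For context, the generic index-based CIC rule (the classic BCS-style rule, compared elsewhere on this page) can be sketched as follows; it illustrates the forced-checkpoint mechanism that DM-CIC seeks to avoid, not DM-CIC itself, and all names are illustrative.

    class CicProcess:
        """Index-based CIC (BCS-style): take a forced checkpoint before delivering a
        message whose piggybacked checkpoint index exceeds the local one."""

        def __init__(self, name: str):
            self.name = name
            self.index = 0                    # sequence number of the latest local checkpoint

        def basic_checkpoint(self) -> None:
            self.index += 1                   # taken whenever the process chooses to

        def send(self, payload):
            return (self.index, payload)      # piggyback the index on every message

        def receive(self, message):
            piggybacked, payload = message
            if piggybacked > self.index:      # forced checkpoint keeps recovery lines consistent
                self.index = piggybacked
                print(f"{self.name}: forced checkpoint #{self.index}")
            return payload

    p, q = CicProcess("p"), CicProcess("q")
    p.basic_checkpoint()                      # p is now ahead of q
    q.receive(p.send("m1"))                   # q takes a forced checkpoint before delivering m1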
... Checkpoint consistency has been well-studied in the last decade [8]. Approaches to the consistent recovery can be categorized into different protocols: uncoordinated, coordinated and communication induced checkpoint, and message logging. ...
Conference Paper
Full-text available
The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean-time-between-failures (MTBF). Hardware failures must be tolerated by the parallel applications to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is a very useful technique to implement fault-tolerant applications. Although extensive research has been carried out in this field, there are few available tools to help parallel programmers enhance their applications with fault-tolerance capability. This work presents two different approaches to endow the MPI version of an air quality simulation with fault tolerance. A segment-level solution has been implemented by means of the extension of a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between both approaches are portability, transparency level and checkpointing overheads. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
Conference Paper
Full-text available
This paper presents Phoenix, a communication and synchronization substrate that implements a novel protocol for recovering from fail-stop faults when executing graph analytics applications on distributed-memory machines. The standard recovery technique in this space is checkpointing, which rolls back the state of the entire computation to a state that existed before the fault occurred. The insight behind Phoenix is that this is not necessary since it is sufficient to continue the computation from a state that will ultimately produce the correct result. We show that for graph analytics applications, the necessary state adjustment can be specified easily by the programmer using a thin API supported by Phoenix. Phoenix has no observable overhead during fault-free execution, and it is resilient to any number of faults while guaranteeing that the correct answer will be produced at the end of the computation. This is in contrast to other systems in this space which may either have overheads even during fault-free execution or produce only approximate answers when faults occur during execution. We incorporated Phoenix into D-Galois, the state-of-the-art distributed graph analytics system, and evaluated it on two production clusters. Our evaluation shows that in the absence of faults, Phoenix is ~24x faster than GraphX, which provides fault tolerance using the Spark system. Phoenix also outperforms the traditional checkpoint-restart technique implemented in D-Galois: in fault-free execution, Phoenix has no observable overhead, while the checkpointing technique has 31% overhead. Furthermore, Phoenix mostly outperforms checkpointing when faults occur, particularly in the common case when only a small number of hosts fail simultaneously.
Chapter
Parallel operating systems are the interface between parallel computers (or computer systems) and the applications (parallel or not) that are executed on them. They translate the hardware’s capabilities into concepts usable by programming languages. Great diversity marked the beginning of parallel architectures and their operating systems. This diversity has since been reduced to a small set of dominating configurations: symmetric multiprocessors running commodity applications and operating systems (UNIX and Windows NT) and multicomputers running custom kernels and parallel applications. Additionally, there is some (mostly experimental) work done towards the exploitation of the shared memory paradigm on top of networks of workstations or personal computers. In this chapter, we discuss the operating system components that are essential to support parallel systems and the central concepts surrounding their operation: scheduling, synchronization, multi-threading, inter-process communication, memory management and fault tolerance. Currently, SMP computers are the most widely used multiprocessors. Users find it a very interesting model to have a computer, which, although it derives its processing power from a set of processors, does not require any changes to applications and only minor changes to the operating system. Furthermore, the most popular parallel programming languages have been ported to SMP architectures enabling also the execution of demanding parallel applications on these machines. However, users who want to exploit parallel processing to the fullest use those same parallel programming languages on top of NORMA computers. These multicomputers with fast interconnects are the ideal hardware support for message passing parallel applications. The surviving commercial models with NORMA architectures are very expensive machines, which one will find running calculus intensive applications, such as weather forecasting or fluid dynamics modelling. We also discuss some of the experiments that have been made both in hardware (DASH, Alewife) and in software systems (TreadMarks, Shasta) to deal with the scalability issues of maintaining consistency in shared-memory systems and to prove their applicability on a large-scale.
Conference Paper
Mobile agents are distributed programs which can move autonomously in a network to perform tasks on behalf of users. Though mobile agents offer much more flexibility compared to client-server computing, they have additional costs and issues such as security, reliability and fault tolerance which need to be addressed for successful adoption of mobile agent technology in developing real-life applications. Fault tolerance aims to provide reliable execution of agents even in the face of failures that may occur on account of various errors that emerge during migration request failures, communication exceptions, system crashes or security violations. Graph-based fault tolerance protocols have been successfully used for the implementation of fault tolerance in distributed computing. This paper proposes the use of antecedence graphs and message logs for maintaining fault tolerance information of mobile agents. In order to reduce the overhead of carrying fault tolerance information in the form of large antecedence graphs, we propose the use of a parallel checkpointing algorithm. For checkpointing, dependent agents are marked out using antecedence graphs, and only these agents are involved in the process of taking checkpoints. In case of failures, the antecedence graphs and message logs are regenerated for recovery and normal operation then continues. Analysis of the results shows considerable improvement in terms of reduced message overhead, execution and recovery times as compared to the existing graph-based approach.
Article
Burst error model describes phenomena which cause potentially random faults over a bounded time interval. We study the worst response time for tasks under fixed-priority with checkpoints. Since checkpointing schemes depend on time redundancy, they could affect the correctness of the system by causing deadlines to be missed. In this paper, we provide an exact schedulability analysis for burst error model and derive the optimal number of checkpoints. We also explore the fault-tolerant priority assignment policy to get the most efficient scheme.
Article
Clusters of workstations are an attractive environment for high performance computing. For some applications, however, clusters still lack certain properties. One such property is responsive (dependable and timely) execution of programs. This paper studies two mechanisms (checkpointing and replication) to improve the responsiveness (the probability of meeting a deadline in the presence of faults) of a parallel programming system, Calypso, by ameliorating a single point of failure of Calypso. Experiments show that checkpointing is a suitable tool to achieve high responsiveness and that already a very modest degree of replication is sufficient for improved responsiveness.
Article
The flexibility offered by mobile agents is quite noticeable in distributed computing environments. However, the greater flexibility of the mobile agent paradigm compared to the client/server computing paradigm comes with additional threats, since agent systems are prone to failures originating from bad communication, security attacks, agent server crashes, system resource unavailability, network congestion, or even deadlock situations. In such events, mobile agents either get lost or damaged (partially or totally) during execution. In this paper, we propose a parallel checkpointing approach based on the use of antecedence graphs for providing fault tolerance in mobile agent systems. During normal computation and message transmission, the dependency information among mobile agents is recorded in the form of antecedence graphs by the participating mobile agents of a mobile agent group. When a checkpointing procedure begins, the initiator concurrently informs the relevant mobile agents, which minimizes the identification time. The proposed scheme utilizes the checkpointed information, stored in the form of antecedence graphs, for fault tolerance. In case of failures, using the checkpointed information, the antecedence graphs and message logs are regenerated for recovery and normal operation then continues. Moreover, compared with existing schemes, our algorithm involves the minimum number of mobile agents during the identification and checkpointing procedure, which leads to improved system performance. In addition, the proposed algorithm is a domino-free checkpointing algorithm, which is especially desirable for mobile agent systems. Quantitative analysis and experimental simulation show that our algorithm outperforms other coordinated checkpointing schemes in terms of identification time and the number of blocked mobile agents, and can thus provide better system performance. The main contribution of the proposed checkpointing scheme is the enhancement of the graph-based approach, with considerable improvement in terms of reduced message overhead, execution, and recovery times.
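A minimal sketch of how the set of dependent agents might be derived from logged messages before a parallel checkpoint, assuming the antecedence information is available as (sender, receiver) pairs; names and data layout are illustrative rather than the paper's algorithm.

    from collections import defaultdict

    def dependent_agents(message_log, initiator):
        """Agents in the initiator's causal past (via logged messages); only these
        need to take part in the checkpoint, the rest are left undisturbed."""
        senders_to = defaultdict(set)
        for sender, receiver in message_log:      # each logged message adds a dependency edge
            senders_to[receiver].add(sender)

        involved, stack = {initiator}, [initiator]
        while stack:
            agent = stack.pop()
            for dep in senders_to[agent] - involved:
                involved.add(dep)
                stack.append(dep)
        return involved

    log = [("A", "B"), ("B", "C"), ("D", "E")]    # D and E never influenced C
    print(dependent_agents(log, "C"))             # {'A', 'B', 'C'}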
Article
In recent years many researchers have been incorporating mobile agents into e-service applications, especially in e-learning and e-commerce, to improve network latency and to reduce network traffic. On the other hand, security issues hold back mobile agent usage. The main intention of an attacker is to kill or modify the behaviour of an agent in the middle of its journey in order to degrade the trustworthiness of the agent environment. In this paper, we propose a fault tolerance mechanism for preventing agent blocking in scenarios where the agent is captured by a malicious host in the network. This approach makes use of acknowledgements and partial result retrieval and, when implemented in a mobile agent platform, allows the originator to retrieve partial results and track the location of the mobile agent at any time during transaction execution. During recovery of the mobile agent, all of its components (agent code, itinerary, credential information, collected information and state) can be recovered. The proposed mechanism is capable of improving fault-tolerance time, reliability and performance, especially for mobile agents in e-commerce Internet applications.
Article
Full-text available
This paper presents a Checkpoint-based Rollback Recovery and Migration System for the Message Passing Interface, ChaRM4MPI, for Linux clusters. Several important fault-tolerance mechanisms are designed and implemented in this system, including a coordinated checkpointing protocol, synchronized rollback recovery, and process migration. With ChaRM4MPI, transient node faults are recovered from automatically, and permanent faults can also be recovered from through checkpoint mirroring and process migration techniques. Moreover, users can migrate MPI processes from one node to another manually for load balancing or system maintenance. ChaRM4MPI is a user-transparent implementation and introduces little runtime overhead.
Article
Full-text available
Supply chains are made up of distributed, heterogeneous, and autonomous elements in a dynamic relationship with one another. Agricultural supply chains are strictly regulated to ensure food safety and multi-level traceability. Contracts in such chains need sophisticated specification and management of chain agents to ensure auditability. A framework that attacks these problems is proposed. It is centered on three elements that support and manage agent interactions: contracts, coordination plans (i.e., business processes), and regulations (i.e., business rules). The main contributions are (1) a contract model suitable for agricultural supply chains, (2) a negotiation protocol able to produce contracts, and (3) negotiation implementation via Web services. Interoperability among chain processes is fostered by maintaining independence between business processes and contract negotiation.
Article
Cyclic debugging, where a program is executed over and over again, is a popular methodology for tracking down and eliminating bugs. To debug parallel programs, it requires additional techniques due to nondeterministic behavior. Such techniques are record&replay mechanisms. A problem is the cost associated with restarting the program’s execution every time from the beginning. A corresponding solution is offered by combining checkpointing and debugging, which allows restarting executions at an intermediate state. However, minimizing the replay time is still a challenge. Previous methods cannot ensure that the replay time has an upper bound. This quality is important for developing a debugging tool, in which some degree of interactivity for the user’s investigations is required. This problem is discussed in this paper and an approach to minimize the replay time, the MRT method, is described. It ensures a small upper bound for the replay time with low overhead. The resulting technique is able to reduce the waiting time and the costs of cyclic debugging.
Article
Wide-area systems are becoming a popular infrastructure for long-running applications. Rollback-recovery, as a common technology for fault tolerance and load balance, must meet the challenges of scalability and inherent variability in such applications. Most of the rollback-recovery protocols, however, are poor in scalability. Although pessimistic message logging protocols have no such problem, their fault-free overhead sometimes is prohibitive. Aiming at good scalability and acceptable overhead, this paper introduces the concept of pessimism grain and presents a coarse-grained pessimistic message-logging scheme. The paper also evaluates the impact of pessimism grain on the performance of the recovery scheme. Experimental results show that pessimism grain is one of the key configuration parameters to reach a desired performance level. In practice, the proper pessimism grain should be selected based on the characteristics of the applications.
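The notion of a pessimism grain can be sketched as batching the forced writes that a purely pessimistic logger would perform once per message; this is only an illustration of the trade-off under that assumption (a list stands in for stable storage), not the paper's protocol, and all names are hypothetical.

    class CoarseGrainedLogger:
        """Pessimistic-style message logging with a pessimism grain g: messages are
        forced to stable storage in batches of g instead of one synchronous write each."""

        def __init__(self, grain: int, stable_storage: list):
            self.grain = grain
            self.stable = stable_storage          # stands in for a synchronous disk write
            self.pending = []

        def log_and_deliver(self, message):
            self.pending.append(message)
            if len(self.pending) >= self.grain:   # one forced write per `grain` messages
                self.stable.extend(self.pending)
                self.pending.clear()
            return message                        # at most grain-1 messages are at risk on a crash

    disk = []
    logger = CoarseGrainedLogger(grain=4, stable_storage=disk)
    for i in range(10):
        logger.log_and_deliver(f"m{i}")
    print(len(disk), len(logger.pending))         # 8 persisted, 2 still pending

A grain of 1 reduces to classical pessimistic logging; larger grains trade a bounded amount of potentially lost log data for fewer synchronous writes, which is the kind of fault-free overhead the abstract discusses.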
Article
Full-text available
Grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity imposed by the distribution, which poses a number of problems related to the heterogeneity of hardware, operating systems, networks, middleware and applications, dynamic resources, scalability, the lack of common memory, the lack of a common clock, and asynchronous communication between processes. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resistance to these system faults. Fault tolerance is intended to allow the system to provide service as specified in spite of occurrences of faults. It appears as an indispensable element in distributed systems. To meet this need, several techniques have been proposed in the literature. We study the protocols based on rollback recovery. These protocols are classified into two categories: coordinated checkpointing and rollback protocols, and log-based independent checkpointing protocols or message logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proved inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or a hierarchical environment such as grid computing. While some protocols have been used successfully at small scale, they are not suitable for use at large scale. Hence there is a need to implement these protocols in a hierarchical fashion to compare their performance in grid computing. In this paper, we propose hierarchical versions of four well-known protocols. We have implemented and compared the performance of these protocols in clusters and grid computing using the OMNeT++ simulator.
Article
Full-text available
TxLinux is the first operating system to use hardware transactional memory (HTM) as a synchronization primitive, and the first to manage HTM in the scheduler. TxLinux, which is a modification of Linux, is the first real-scale benchmark for transactional memory (TM). MetaTM is a modification of the x86 architecture that supports HTM in general and TxLinux specifically. This paper describes and measures TxLinux and MetaTM, the HTM model that supports it. TxLinux greatly benefits from a new primitive, called the cooperative transactional spinlock (cxspinlock) that allows locks and transactions to protect the same data while maintaining the advantages of both synchronization primitives. Integrating the TxLinux scheduler with the MetaTM's architectural support for HTM eliminates priority inversion for several real-world benchmarks.
Chapter
The sections in this article are
Article
Full-text available
One obtains in this paper a process algebra RCCS, in the style of CCS, where processes can backtrack. Backtrack, just as plain forward computation, is seen as a synchronization and incurs no additional cost on the communication structure. It is shown that, given a past, a computation step can be taken back if and only if it leads to a causally equivalent past.
Article
Full-text available
Time-based coordinated checkpointing protocols are well suited for mobile computing systems because no explicit coordination message is needed while the advantages of coordinated checkpointing are kept. However, without coordination, every process has to take a checkpoint during a checkpointing process. In this paper, an efficient time-based coordinated checkpointing protocol for mobile computing systems over Mobile IP is proposed. The protocol reduces the number of checkpoints per checkpointing process to nearly minimum, so that fewer checkpoints need to be transmitted through the costly wireless link. Our protocol also performs very well in the aspects of minimizing the number and size of messages transmitted in the wireless network. In addition, the protocol is nonblocking because inconsistencies can be avoided by the piggybacked information in every message. Therefore, the protocol brings very little overhead to a mobile host with limited resource. Additionally, by taking advantage of reliable timers in mobile support stations, the time-based checkpointing protocol can adapt to wide area networks.
Conference Paper
Recently, numerous studies have focused on multi-agent based intrusion detection systems (IDSs) in order to detect intrusion behavior more efficiently. However, since an agent is easily subverted by a process that is faulty, a multi-agent based intrusion detection system must be fault tolerant by being able to recover from system crashes, caused either accidentally or by malicious activity. Many of the existing IDSs have no means of providing such failure recovery. In this paper, we propose the novel intrusion-tolerant IDS using communication-induced checkpointing and pessimistic message logging techniques. When the failed agent is restarted, therefore, our proposed system can recover its previous state and resume its operation unaffected. In addition, agents communicate with each other by sending messages without causality violation using vector timestamps.
Conference Paper
This paper presents a mechanism that organizes processes in the hierarchy and efficiently maintains it in the presence of addition/removal of nodes to the system, and in the presence of node failures. This mechanism can support total order of broadcasts and does not rely on any specific system features or special hardware. In addition, it can concurrently support multiple logical structures, such as a ring, a hypercube, a mesh, and a tree.
Article
We give a detailed analysis of communication-induced checkpointing protocols that are free of domino effect. We investigate the validity of a common intuition in the literature and demonstrate that there is no optimal on-line domino-free protocol in terms of the number of forced checkpoints. Formal proofs on comparing existing protocols in the literature are given.
Article
This paper introduces a combination of the existing parallel checkpointing techniques for software heterogeneous ClusterGrid infrastructures. Most of the existing solutions are aiming at supporting application transparency (no checkpoint related code development in application), but some others build middleware transparent (no service modification) solutions. The main contribution of this paper is to introduce a solution providing both application and middleware transparency at the same time. Compatibility and integrity requirements are identified and corresponding conditions are established using Abstract State Machines. The most relevant checkpointing systems are checked against the conditions in order to examine their conformity. Based on the conditions, a novel checkpointing method is defined and a proof of concept checkpointing tool, called TotalCheckpoint (TCKPT) is introduced.
Article
Full-text available
We examine extensions to the π-calculus for representing basic elements of distributed systems. In spite of its expressiveness for encoding various programming constructs, some of the phenomena inherent in distributed systems are hard to model in the π-calculus. We consider message loss, sites, timers, site failure and persistence as extensions to the calculus and examine their descriptive power, taking the Two Phase Commit Protocol (2PCP), a basic instance of an atomic commitment protocol, as a testbed. Our extensions enable us to represent the 2PCP under various failure assumptions, as well as to reason about the essential properties of the protocol.
Conference Paper
Full-text available
The paper presents MPICH-CM - a new architecture of communications in message-passing systems, developed for MPICH-V - a MPI implementation for P2P systems. MPICH-CM implies communications between nodes through special Channel Memories introducing fully decoupled communication media. Some new properties of communications based on MPICH-CM are described in comparison with other communication architectures, with emphasis on grid-like and volunteer computing systems. The first implementation of MPICH-CM is performed as a special MPICH device connected with Channel Memory servers. To estimate the overhead of MPICH-CM, the performance of MPICH-CM is presented for basic point-to-point and collective operations in comparison with MPICH p4 implementation.
Conference Paper
To cope with various intrusion patterns, an intrusion detection system, which is based on multiple agents working collectively, was proposed recently. Since an agent is easily subverted by a process that is faulty, a multi-agent based intrusion detection system must be fault tolerant by being able to recover from system crashes, either accidental or malicious activity. However, there have been very few attempts to provide fault tolerance in intrusion detection system. In this paper, we propose the rollback recovery algorithm for intrusion-tolerant intrusion detection system using communication-induced checkpointing and pessimistic message logging techniques. Thus, our proposed scheme guarantees a consistent global snapshot.