Article

A survey of rollback protocols in message-passing systems

... We have implemented a simple PFTP component that adopts the coordinated checkpointing algorithm [5]. This component creates global consistent states at intervals specified in a configuration file. ...
... The global consistent state is created as shown in Figure 8. With this algorithm, a global state, which consists of checkpoints of all the MPI processes, is consistent [5]. ...
Conference Paper
Full-text available
Fault tolerance for HPC systems running long-lived applications of massive and growing scale is now essential. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome in real systems: collective application memory grows with problem scale while I/O performance does not keep pace, causing substantial overhead for the overall system. Automated optimization of the checkpoint interval is therefore essential, but the optimal point depends on hardware failure rates and I/O bandwidth, both of which may change fairly quickly over time and are often very difficult for the user to determine without considerable overhead and/or divergence; as far as we know, no prior work has addressed this issue for large parallel systems. Our new model and algorithm, an extension of Vaidya's proposal, solve the problem by taking these parameters into account. A prototype implementation on our fault-tolerant MPI framework ABARIS showed approximately 5% to 20% improvement over statically user-determined intervals.
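As a rough illustration of why the optimal interval depends on both failure rate and checkpoint cost (and hence on I/O bandwidth), the sketch below uses Young's classic first-order approximation rather than the extended Vaidya-style model the paper proposes; all numbers and names are illustrative.

    import math

    def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
        """First-order optimal checkpoint interval (Young's approximation)."""
        return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

    # Illustrative numbers: 1 TB of aggregate state written at 10 GB/s -> 100 s per checkpoint.
    checkpoint_cost = 1e12 / 10e9
    for mtbf_hours in (6, 24, 168):          # failure rates that may drift over the run
        t = young_interval(checkpoint_cost, mtbf_hours * 3600)
        print(f"MTBF {mtbf_hours:>3} h -> checkpoint every {t / 60:.1f} min")

The point of the sweep is simply that a fixed, user-chosen interval cannot track an MTBF or I/O bandwidth that drifts by an order of magnitude during the run.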
... The states of the resources during an operation are captured by the resource state capturing service. Unlike existing traditional checkpointing mechanisms which capture the states for the whole software system [13], our state capturing method only captures the states of the resources involved in the determined resource space of the operation rather than the resources of the whole cloud system, hence increasing the efficiency of state capturing especially when the system scale is large. The state capturing algorithm is shown in Figure 8. ...
... Undoability checker is able to identify which opera- tions are not undoable and why. If undo is possible and desired, an AI planning technique [26] can be applied to automatically create a workflow that takes the system back to the desired earlier state, which is similar to rolling back a system state to a previous consistent system state in the context of message-passing systems [13]. ...
Conference Paper
Sporadic operations such as rolling upgrade or machine instance redeployment are prone to unpredictable failures in the cloud, largely due to the inherently high variability of the cloud. Previous dependability research has established several recovery methods for cloud failures. In this paper, we first propose eight recovery patterns for sporadic operations. We then present the filtering process which filters applicable recovery patterns for a given operational step. We also propose a methodology to evaluate the recovery actions generated for the applicable recovery patterns based on the metrics of Recovery Time, Recovery Cost and Recovery Impact. This quantitative evaluation leads to the selection of optimal recovery actions. We implement a recovery service and illustrate its applicability by recovering from errors occurring in the Asgard rolling upgrade operation on the cloud. The experimental results show that the recovery service enhances automated recovery from operational failures by selecting the optimal recovery actions.
... The states of the resources during an operation are captured by the resource state capturing service. Unlike existing traditional checkpointing mechanisms which capture the states for the whole software system [13], our state capturing method only captures the states of the resources involved in the determined resource space of the operation rather than the resources of the whole cloud system, hence increasing the efficiency of state capturing especially when the system scale is large. The state capturing algorithm is shown in Figure 8. ...
... Undoability checker is able to identify which operations are not undoable and why. If undo is possible and desired, an AI planning technique [26] can be applied to automatically create a workflow that takes the system back to the desired earlier state, which is similar to rolling back a system state to a previous consistent system state in the context of message-passing systems [13]. ...
Article
Sporadic operations such as rolling upgrade or machine instance redeployment are prone to unpredictable failures in the public cloud, largely because of the inherently high variability of the public cloud. Previous dependability research has established several recovery methods for cloud failures. In this paper, we first propose eight recovery patterns for sporadic operations on the public cloud. We then present the filtering process which filters applicable recovery patterns. We propose an automation mechanism to automatically generate recovery actions for those applicable recovery patterns based on our resource state transition algorithm. We also propose a methodology to evaluate the recovery actions generated for the applicable recovery patterns based on the recovery evaluation metrics of Recovery Time, Recovery Cost, and Recovery Impact. This quantitative evaluation leads to the selection of acceptable recovery actions. We propose two recovery-action selection mechanisms: one is based on user constraints on the recovery evaluation metrics, and the other is based on a Pareto-set search algorithm. We implement a recovery service and illustrate its applicability by recovering from errors occurring in the rolling upgrade operation on the AWS cloud.
... Among other systems, distributed systems make use of rollback recovery techniques for error recovery similar to the log-based technique used in this thesis. The papers [29,30,31] were evaluated for guidance during its design. ...
... that all previous computational work completed by one or multiple nodes is lost [31]. ...
Article
CubeSat satellites have redefined the standard solution for conducting missions in space due to their unique form factor and cost. The harsh environment of space necessitates examining features that improve satellite robustness and ultimately extend lifetime, which is typical and vital for mission success. The CubeSat development team at Cal Poly, PolySat, has recently redefined its standard avionics platform to support more complex mission capabilities with this robustness in mind. A significant addition was the integration of the Linux operating system, which provides the flexibility to develop much more elaborate protection mechanisms within software, such as support for remote on-orbit software updates. This thesis details the design and development of such a feature-set with critical software recovery and multiple-mission single-CubeSat functionality in mind. As a result, features that focus on software update usability, validation, system recovery, upset tolerance, and extensibility have been developed. These include backup Linux kernel and file system image availability, image validation prior to boot, and the use of multiple file system devices to protect against system upsets. Furthermore, each feature has been designed for usability on current and future missions.
... Global state consistency requires that if the state of a process (a middleware server or an end client process) includes a message receive, either request or reply, the sender process's state must include the message send [13]. Figure 2 illustrates a violation of such consistency. ...
... Log-based recovery for general application processes over a Java virtual machine was partially implemented [30], but no consistency among interacting processes was considered. Pessimistic logging and optimistic logging were invented in the fault tolerance community [13] and were used for consistent recovery of message passing systems, where entities interacted with one another via message exchanges only. Individual threads were considered as separate recovery units with optimistic logging [10]; however, log management of a multi-threaded process with optimistic logging was not explored. ...
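The consistency condition quoted above can be illustrated with a small check over a candidate global checkpoint (a "cut"). This is only a sketch, assuming each checkpoint records the identifiers of the messages its process has sent and received; the data layout and names are hypothetical.

    def is_consistent_cut(checkpoints: dict) -> bool:
        """Every message receive recorded by some checkpoint must have its matching
        send recorded by the sender's checkpoint, otherwise the cut is inconsistent."""
        for state in checkpoints.values():
            for msg_id, sender in state["received"]:       # (message id, sending process)
                if msg_id not in checkpoints[sender]["sent"]:
                    return False                           # orphan message found
        return True

    # Hypothetical cut: p2 recorded receiving m1, but p1 checkpointed before sending it.
    cut = {
        "p1": {"sent": set(),   "received": set()},
        "p2": {"sent": set(),   "received": {("m1", "p1")}},
    }
    print(is_consistent_cut(cut))   # False -- exactly the kind of violation described above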
Article
Full-text available
Providing enterprises with reliable and available Web-based application programs is a challenge. Applications are traditionally spread over multiple nodes, from user (client), to middle tier servers, to back end transaction systems, e.g. databases. It has proven very difficult to ensure that these applications persist across system crashes so that “exactly once” execution is produced, always important and sometimes essential, e.g., in the financial area. Our system provides a framework for exactly once execution of multi-tier Web applications, built on a commercially available Web infrastructure. Its capabilities include low logging overhead, recovery isolation (independence), and consistency between mid-tier and transactional back end. Good application performance is enabled via persistent shared state in the middle tier while providing for private session state as well. Our extensive experiments confirm both the desired properties and the good performance. KeywordsApplication fault tolerance–Exactly once execution–Transaction processing–Recovery–Optimistic logging–Distributed systems
... In forward error recovery techniques, the nature of errors and the damage caused by faults must be completely and accurately assessed, so that it becomes possible to remove those errors from the process state and enable the process to move forward [27]. In a distributed system, precise assessment of all the faults may not be possible. ...
... The platform utilizes digital pen and touch-sensitive screen technologies to build a digital desk, which is demonstrated to be able to effectively support traditional lecturing activities and to directly capture teachers' classroom presentations with minimal disruption. [10] This study covers rollback-recovery strategies that do not require the use of a special language. We describe rollback-recovery protocols, which the first section of the survey classifies into checkpoint-based and log-based approaches. ...
Article
Various errors attack VLSI SoC designs; they corrupt mathematical operations and distort results, which is why soft errors are increasingly monitored. With the growth of information communication, sources of noise (SON), interference, and parallel processing, fault-tolerance requirements increase, so designers have been striving for more efficient and reliable techniques for detecting and correcting faults in parallel transmission (TX) and reception (RX) of data. Even though several methods and advances have been proposed and applied in past years, data reliability in TX and RX is still a problem. In this research we propose a more efficient combined error-detection and correction technique based on artificial-intelligence algorithmic-based fault tolerance (AIABFT) with parallel orthogonal codes and vertical parity. Using the proposed method, we design a parallel-processing fault detection and correction FFT. The AIABFT method has been experimentally implemented and simulated using Xilinx Vivado. Simulation results indicate that the suggested method detects 97% of the errors and corrects them as expected in the received impaired n-bit code, for up to (n/2 - 1) bits of errors.
... The state of the cyber recovery refers to computing information such as variable values in the program or the cyber state. The conventional method is that after a fault is detected, the computing tasks will be rolled back to a globally consistent state checkpointed in the history [17]. By contrast, this article aims to restore the state of a physical system or the physical state. ...
Article
Full-text available
The increasing autonomy and connectivity in cyber-physical systems (CPS) come with new security vulnerabilities that are easily exploitable by malicious attackers to spoof a system to perform dangerous actions. While the vast majority of existing works focus on attack prevention and detection, the key question is “what to do after detecting an attack?”. This problem has attracted relatively little attention, though its significance is emphasized by the need to mitigate or even eliminate attack impacts on a system. In this article, we study this attack response problem and propose novel real-time recovery for securing CPS. First, this work’s core component is a recovery control calculator using a Linear-Quadratic Regulator (LQR) with timing and safety constraints. This component can smoothly steer back a physical system under control to a target state set before a safe deadline and maintain the system state in the set once it is driven to it. We further propose an Alternating Direction Method of Multipliers (ADMM) based algorithm that can quickly solve the LQR-based recovery problem. Second, supporting components for the attack recovery computation include a checkpointer, a state reconstructor, and a deadline estimator. To realize these components respectively, we propose (i) a sliding-window-based checkpointing protocol that governs sufficient trustworthy data, (ii) a state reconstruction approach that uses the checkpointed data to estimate the current system state, and (iii) a reachability-based approach to conservatively estimate a safe deadline. Finally, we implement our approach and demonstrate its effectiveness in dealing with a total of 15 experimental scenarios designed based on 5 CPS simulators and 3 types of sensor attacks.
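A minimal sketch of what a sliding-window checkpointer of the kind described above might look like, assuming states are timestamped and anything recorded after the estimated attack onset is discarded; the class and parameter names are illustrative and not the paper's implementation.

    from collections import deque

    class SlidingWindowCheckpointer:
        """Keep only checkpoints from the last `window_s` seconds of execution."""

        def __init__(self, window_s: float):
            self.window_s = window_s
            self.buffer = deque()                 # (timestamp, state) pairs, oldest first

        def checkpoint(self, now: float, state) -> None:
            self.buffer.append((now, state))
            while self.buffer and now - self.buffer[0][0] > self.window_s:
                self.buffer.popleft()             # states outside the window are dropped

        def trusted_states(self, detection_time: float, attack_lag: float):
            """Checkpoints taken before the attack is assumed to have started."""
            cutoff = detection_time - attack_lag
            return [(t, s) for (t, s) in self.buffer if t <= cutoff]

    cp = SlidingWindowCheckpointer(window_s=5.0)
    for step in range(10):
        cp.checkpoint(now=float(step), state={"x": step})
    print(cp.trusted_states(detection_time=9.0, attack_lag=2.0))   # states up to t = 7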
... The first reason is the generally lower equipment cost, both in terms of initial investment and maintenance. The second reason is the resilience against failures because, when a single processor fails within an HPC application, the system can still continue operating by initiating a partial recovery (e.g., based on communication-driven checkpointing [46] or partial re-computation [168]). The third reason is the increase in aggregate I/O bandwidth compared to a single machine [49]. ...
Article
The demand for artificial intelligence has grown significantly over the past decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, and turning the centralized into a distributed system. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.
... The literature on failures and fault tolerance is vast [10,29,41,42,64,65,73,79,80,89]. The notion of fault tolerance, which we develop later, is the ability of a system to maintain operation despite the failures that occur. ...
Thesis
The work presented in this thesis concerns the scheduling of linear multi-task workflow applications on distributed platforms. The particularity of the system studied is that the number of machines composing the platform is smaller than the number of tasks to perform. In this case the machines are assumed to be able to perform all the tasks of the application given a reconfiguration, knowing that any reconfiguration takes a given amount of time, which may or may not depend on the tasks. The problem posed is to maximize the throughput of the application, i.e., the average number of outputs per unit of time, or to minimize the period, i.e., the average time between two outputs. The problem therefore decomposes into two sub-problems: the assignment of tasks to the machines of the platform (one or several tasks per machine), and the scheduling of these tasks within a single machine given the reconfiguration times. To this end, the platform provides spaces called buffers, either allocatable or imposed, to store temporary production results and thus avoid having to reconfigure the machines after each task. If the buffers are not pre-assigned, we must also solve the problem of allocating the available space to buffers in order to optimize the execution of the schedule within each machine. This document is an exhaustive study of the different problems associated with the heterogeneity of the application; indeed, while the problems are trivial to solve with homogeneous reconfiguration times and buffers, they become much more complex when these are heterogeneous. We therefore propose to study our three major problems for different degrees of heterogeneity of the application. We propose heuristics to handle these problems when it is not possible to find an optimal algorithmic solution.
... Error handling in a graph processing system, as with other systems, has two main phases: (1) failure detection, in which the system discovers the error, and (2) fault recovery, in which the system tries to resolve the problem and resume the operation. Much research has been done on various fault-tolerance techniques in parallel and distributed systems (Kavila et al. 2013; Treaster 2005; Elnozahy et al. 2002). In Treaster (2005), for example, two types of components in an application, called central components and parallel components, are investigated, both of which mostly use rollback and replication methods for fault recovery. ...
Article
Full-text available
The world is becoming a more conjunct place and the number of data sources such as social networks, online transactions, web search engines, and mobile devices is increasing even more than had been predicted. A large percentage of this growing dataset exists in the form of linked data, more generally, graphs, and of unprecedented sizes. While today's data from social networks contain hundreds of millions of nodes connected by billions of edges, inter-connected data from globally distributed sensors that forms the Internet of Things can cause this to grow exponentially larger. Although analyzing these large graphs is critical for the companies and governments that own them, big data tools designed for text and tuple analysis such as MapReduce cannot process them efficiently. So, graph distributed processing abstractions and systems are developed to design iterative graph algorithms and process large graphs with better performance and scalability. These graph frameworks propose novel methods or extend previous methods for processing graph data. In this article, we propose a taxonomy of graph processing systems and map existing systems to this classification. This captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks. Our effort helps to highlight key distinctions in architectural approaches, and identifies gaps for future research in scalable graph systems.
... Many techniques for improving the performance and reliability of systems hinge on the ability to automatically manipulate program state in memory. In particular, checkpointing[14], transactions[20,21,33], replication[36,37], multiversion concurrency[1,4], etc., involve snapshotting parts of program state. This, in turn, requires traversing pointer-linked data structures in memory. ...
Conference Paper
Rust is a new system programming language that offers a practical and safe alternative to C. Rust is unique in that it enforces safety without runtime overhead, most importantly, without the overhead of garbage collection. While zero-cost safety is remarkable on its own, we argue that the superpowers of Rust go beyond safety. In particular, Rust's linear type system enables capabilities that cannot be implemented efficiently in traditional languages, both safe and unsafe, and that dramatically improve security and reliability of system software. We show three examples of such capabilities: zero-copy software fault isolation, efficient static information flow analysis, and automatic checkpointing. While these capabilities have been in the spotlight of systems research for a long time, their practical use is hindered by high cost and complexity. We argue that with the adoption of Rust these mechanisms will become commoditized.
... Synchronization is needed because concurrent processes, while passing messages, can change their internal state. Three checkpointing strategies were described for concurrent processes [95]. ...
Article
Full-text available
A data grid provides an efficient solution for data-oriented applications that need to manage and process large data sets located at geographically distributed storage resources. The data grid relies on data replicas to enhance performance and to ensure fault-tolerant results for users. Replicas are developed to increase the availability of data and to provide better data access. Replicas have their own advantages, but there are a number of issues that must be resolved. Among the various existing issues, the critical concern is replica consistency. Various replica consistency strategies are available in the literature. These strategies rationalize and investigate various parameters like bandwidth consumption, access cost, scalability, execution time, storage consumption, staleness, and freshness of replicas. In this paper, several asynchronous replica consistency strategies are classified and analyzed based on various criteria such as topology, level of abstraction, update propagation, and locality. Some other strategies are also discussed and analyzed, like adaptive consistency, quorum-based consistency, load balancing, agent-based economically efficient strategies, checkpointing, fault tolerance, and conflict management. The parameters on which these strategies are analyzed are methodology, replication classification, consistency, grid topology, environment, evaluation parameters, and performance.
... All the raw data and operations that depend on the malicious transaction can be found by the recovery algorithm. Because of the high cost of the former (redo) recovery method [12] , we must make clear the execution semantics of each transaction, so we adopt the backward recovery (undo) method, and the file system will be restored to the previous consistent state. ...
Article
This paper first proposes, through IBAC (an integration of TE and RBAC) and the use of compensating well-formed transactions, an integrity-monitor model in which partial malicious transactions can be recovered. For partial revocation of constituent transactions, two recovery policies are provided for tracking the operations on data and the operations affected. The conservative recovery policy stops normal transaction execution during recovery and, by analyzing the dependency list in the log file, cancels each affected operation according to the order in which the operations were performed. The optimistic recovery policy, by contrast, can run while the system continues normal operation: it builds compensating operations corresponding to the operations to be recovered and submits them to the integrity-monitoring scheduler. This method can recover the system to a secure state in the face of failures and improves the availability of the system. It provides an important exploration for the design and implementation of trusted recovery mechanisms for high-level secure operating systems.
... The time interval between two successive failures of a processor obeys an exponential distribution, which is a continuous random variable. Only one rollback recovery occurs in the system: among the N processors, only one breaks down in the interval 0 ~ T after the checkpoint interval begins [6,7]. The rollback-recovery procedure then runs until it completes. ...
Conference Paper
Full-text available
In order to keep the system transferable, a snapshot of the VMs of all the nodes participating in the distributed simulation system needs to be taken every once in a while. It is very important to set a reasonable checkpoint interval to optimize the system's average utilization rate. By establishing a mathematical model, the availability of a distributed simulation system based on virtualization technology is analyzed, the equations for maximizing system availability are derived, and the best checkpoint interval for system fault tolerance is obtained; the fault tolerance of the simulation system is thus addressed, which is important for improving the effectiveness and credibility of the system.
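To make the trade-off concrete, the sketch below uses one common first-order availability model (snapshot cost C, restart cost R, failure rate λ, interval T) and sweeps candidate intervals; this is not the paper's exact model, and the numbers are illustrative only.

    def availability(T: float, C: float, R: float, failure_rate: float) -> float:
        """First-order model: each cycle spends T computing plus C checkpointing; a
        failure (probability ~ failure_rate * cycle) costs R plus ~T/2 of redone work."""
        cycle = T + C
        expected_loss = failure_rate * cycle * (R + T / 2.0)
        return T / (cycle + expected_loss)

    C, R, lam = 30.0, 60.0, 1.0 / (12 * 3600)    # 30 s snapshot, 60 s restart, one fault per 12 h
    best_T = max(range(60, 7200, 60), key=lambda T: availability(T, C, R, lam))
    print(best_T, round(availability(best_T, C, R, lam), 4))

Short intervals waste time snapshotting, long intervals waste time re-executing lost work; the maximum of this curve is the "best checkpoint interval" the abstract refers to.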
... Communication induced protocols [31] [18] [24] try to combine the advantages of uncoordinated protocols and coordinated protocols. They have two kinds of snapshots, local snapshots and forced snapshots [20]. On one hand, like uncoordinated protocols, communication induced protocols allow the processes to take local snapshots whenever they want. ...
... Unlike most checkpointing middleware, CoCheck is visible to the user, and is available at a layer above the message-passing middleware. Common problems with checkpointing and recovery such as global inconsistent states and domino effects [11] are eliminated through the use of a protocol to "flush" all in-transit messages before a checkpoint is created. CoCheck is primarily intended to facilitate process migration, load balancing, and stalling of long running applications for resumption at a later time. ...
Article
Full-text available
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incorporate high overhead because of their inability to utilize application level information. This paper describes the design and implementation of MPI/FT, a high-performance MPI-1.2 implementation enhanced with low-overhead functionality to detect and recover from process failures. The strategy behind MPI/FT is that fault tolerance in message-passing middleware can be optimized based on an application's execution model derived from its communication topology and parallel programming semantics. MPI/FT exploits the specific characteristics of two parallel application execution models in order to optimize performance. MPI/FT also introduces the self-checking thread that monitors the functioning of the middleware itself. User aware checkpointing and user-assisted recovery are compatible with MPI/FT and complement the techniques used here. This paper offers a classification of MPI applications for fault tolerant MPI purposes, and the MPI/FT implementation discussed here provides different middleware versions specifically tailored to each of the two models studied in detail. The interplay of various parameters affecting the cost of fault tolerance is investigated. Experimental results demonstrate that the approach used to design and implement MPI/FT results in a low-overhead MPI-based fault tolerant communication middleware implementation.
... This chapter added process failure and savepoints to π mlt , not to study the properties of the resulting calculus in any depth, but rather as a stepping stone towards the full 2PCP. Persistence and savepointing have received much attention from the distributed systems community [33,47], but to the best of our knowledge process theoretic accounts are lacking. This leaves much work to be done, in particular, a satisfactory theory of the induced reduction congruence. ...
Article
For historical, sociological and technical reasons, λ-calculi have been the dominant theoretical paradigm in the study of functional computation. Similarly, but to a lesser degree, π-calculi dominate advanced mathematical accounts of concurrency. Alas, and despite its ever increasing ubiquity, an equally convincing formal foundation for distributed computing has not been forthcoming. This thesis seeks to contribute towards ameliorating that omission. To this end, guided by the assumption that...
... · Respect of machines' ownership: When a machine executing some work of a MARS application is requisitioned by its owner the MARS system folds this work. · Fault tolerance: MARS integrates a fault tolerance mechanism which includes a checkpointing algorithm [11]. This characteristic is developed in Section 5.3. ...
Article
Load analysis of meta-systems including NOWs or COWs has shown that only a small percentage of the available power is used over long periods of time. Therefore, in order to exploit the idle time when executing a parallel application, workload must be sent to a machine as soon as the latter becomes available. Furthermore, in order to respect the ownership of workstations, work has to be stopped, and resumed later, as soon as the machine executing it is requisitioned by its owner. As a consequence, users need an adaptive system that reports events related to the goings and comings of workstations. On the other hand, it is necessary to provide them with a parallel adaptive programming methodology that plans the handling of these events. In this paper, we present the MARS (multi-user adaptive resource scheduler, developed at the LIFL laboratory, Université de Lille I) system and its parallel adaptive programming methodology through the block-based Gauss–Jordan algorithm used in numerical analysis to invert large matrices. Moreover, we propose a work scheduling strategy and an application-oriented solution for the fault tolerance issue. Furthermore, we present some experimental results obtained on a DEC/ALPHA COW and a SUN/Sparc4 NOW. The results show that very high absolute efficiencies can be obtained if the size of the blocks is well chosen. We also present some experiments related to the adaptability of the application in a meta-system including the DEC/ALPHA COW and the SUN/Sparc4 NOW. The results show that the management of the adaptability consumes only a small percentage of execution time.
... Figure 3 shows a sample of the data generated by running a scenario. Here we compare three index-based [5] algorithms: BCS [8], BQF [3] and HMNR [6]. In this scenario we define a fully connected network and vary the number of processes in the system. ...
Article
Distributed checkpointing algorithms play an important role in the majority of the fault tolerant software components existent today. Unfortunately, there is a lack of comprehensive and uniform performance testing of those algorithms. Our research focuses on the provision of a toolkit, Metapromela, that helps with the implementation and testing of distributed checkpointing algorithms. This paper is concerned primarily with the description of Metapromela and the characteristics that make it a good tool for the evaluation of checkpointing algorithms.
... In the area of software fault-tolerance the goals are to be able to restart an application after a fault, periodically rejuvenate a system in order to avoid a fault, or roll back and/or replay an application after a fault. This can be achieved through reboot [8, 15, 30] and its variations [2, 7], software rejuvenation [18] and roll-back recovery and replay techniques [1, 5, 11, 13, 21]. These approaches cannot be directly applied to post-attack recovery because they do not address availability, may allow the system to receive several attacks from the same source, may lose important system state while recovering, or do not include an undo or repair mechanism. ...
Article
Availability is difficult for systems to maintain in the face of Internet worms. Large systems have vulnerabilities, and if a system attempts to continue operation after an attack, it may not behave properly. Traditional mechanisms for detecting attacks disrupt service and can convert such attacks into denial-of-service. Current recovery approaches have at least one of the following limitations: they cannot recover the complete system state, they cannot recover from zero-day exploits, they undo the effects of the attack speculatively, or they require the application's source code to be available. This paper presents WormHealer, a replay-based, architecture-level post-attack recovery framework using VM technology. After a control-flow hijacking attack has been detected, we replay the checkpointed run using symbolic execution to discover the source of the malicious attack. We then replay the run a second time but ignore inputs from the malicious source. We evaluated WormHealer on five exploits for Linux and Windows. In all cases, it recovered the full system state and resumed execution. It also recovered all TCP connections with non-malicious clients and the communication that had taken place during the attack, except for some limited cases.
... There are several systems that offer C/R capabilities, e.g., Condor [23], Manetho [13], and LoadLeveler [25], and quite a few protocols and techniques for C/R have been proposed. Generally, C/R protocols can be categorized as either coordinated, in which case all processes coordinate their checkpointing to form a global consistent state [10,15,32], or as uncoordinated, in which case every process can perform checkpointing independently [1,29,32,34,41]. One of the important aspects of Starfish architecture is that it enables us to implement and study both coordinated and uncoordinated checkpointing within a single framework. ...
Article
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
... For other distributed initiator schemes like Prakash and Singhal's algorithm[3], there may be multiple instances of checkpointing in the system at any given time. A detailed description of various checkpoint and rollback-recovery protocols can be found in [4, 5] . Our aim is to design a checkpointing algorithm with multiple initiators, but which has only one instance of checkpointing going on in the system at any point in time and has all of the above features. ...
Conference Paper
Full-text available
In this paper, we describe an efficient coordinated checkpointing and recovery algorithm which can work even when the channels are assumed to be non-FIFO, and messages may be lost. Nodes are assumed to be autonomous, and they do not block while taking checkpoints. Based on local conditions, any process can request the previous coordinator for 'permission' to initiate a new checkpoint. Allowing multiple initiators of checkpoints avoids the bottleneck associated with a single initiator, but the algorithm permits only a single instance of the checkpointing process at any given time, thus reducing much of the overhead associated with multiple initiators of distributed algorithms.
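The single-instance idea can be illustrated with a small permission gate, assuming a current coordinator that grants or denies initiation requests; this sketch shows only that one aspect, not the full non-blocking, non-FIFO algorithm, and all names are illustrative.

    import threading

    class CheckpointCoordinator:
        """Grant at most one checkpointing instance at a time: a node asks the current
        coordinator for permission and, if granted, runs the next round itself."""

        def __init__(self):
            self._lock = threading.Lock()
            self._round_active = False

        def request_initiation(self, node_id: str) -> bool:
            with self._lock:
                if self._round_active:
                    return False              # a round is already in progress: denied
                self._round_active = True     # `node_id` becomes coordinator of this round
                return True

        def round_finished(self) -> None:
            with self._lock:
                self._round_active = False    # the next initiator may now be granted

    coord = CheckpointCoordinator()
    print(coord.request_initiation("n1"))     # True  -- n1 starts a checkpoint round
    print(coord.request_initiation("n2"))     # False -- only one instance at a time
    coord.round_finished()
    print(coord.request_initiation("n2"))     # True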
... Nearly all existing MA platforms are Java based, so it is easy to utilize the serialization technique provided by Java to make checkpoints. Checkpointing techniques have been classified into three major schemes, namely independent (or uncoordinated), coordinated, or communication-induced [7]. Among the three checkpointing schemes, independent checkpointing is the simplest one. ...
Conference Paper
Full-text available
As a widely used fault tolerance technique, checkpointing has evolved into several schemes: independent, coordinated, and communication-induced (CIC). Independent and coordinated checkpointing have been adopted in many works on fault tolerant mobile agent (MA) systems. However, CIC, a flexible, efficient, and scalable checkpointing scheme, has not been applied to MA systems. Based on the analysis of the behavior of mobile agent, we argue that CIC is a well suited checkpointing scheme for MA systems. CIC not only establishes the consistent recovery lines efficiently but also integrates well with the independent checkpointing for reliable MA migration. In this paper, we propose an important improvement to CIC, referred to as the deferred message processing based CIC algorithm (DM-CIC), which achieves higher efficiency by exempting the CIC algorithm from making the forced checkpoints in MA systems. Through simulation, we find out that DM-CIC is stable and better suited to large scale MA systems.
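For context, the generic index-based CIC rule (the classic BCS-style rule, compared elsewhere on this page) can be sketched as follows; it illustrates the forced-checkpoint mechanism that DM-CIC seeks to avoid, not DM-CIC itself, and all names are illustrative.

    class CicProcess:
        """Index-based CIC (BCS-style): take a forced checkpoint before delivering a
        message whose piggybacked checkpoint index exceeds the local one."""

        def __init__(self, name: str):
            self.name = name
            self.index = 0                    # sequence number of the latest local checkpoint

        def basic_checkpoint(self) -> None:
            self.index += 1                   # taken whenever the process chooses to

        def send(self, payload):
            return (self.index, payload)      # piggyback the index on every message

        def receive(self, message):
            piggybacked, payload = message
            if piggybacked > self.index:      # forced checkpoint keeps recovery lines consistent
                self.index = piggybacked
                print(f"{self.name}: forced checkpoint #{self.index}")
            return payload

    p, q = CicProcess("p"), CicProcess("q")
    p.basic_checkpoint()                      # p is now ahead of q
    q.receive(p.send("m1"))                   # q takes a forced checkpoint before delivering m1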
... Checkpoint consistency has been well-studied in the last decade [8]. Approaches to the consistent recovery can be categorized into different protocols: uncoordinated, coordinated and communication induced checkpoint, and message logging. ...
Conference Paper
Full-text available
The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean-time-between-failures (MTBF). Hardware failures must be tolerated by the parallel applications to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is a very useful technique to implement fault-tolerant applications. Although extensive research has been carried out in this field, there are few available tools to help parallel programmers enhance their applications with fault-tolerance capability. This work presents two different approaches to endow the MPI version of an air quality simulation with fault tolerance. A segment-level solution has been implemented by means of the extension of a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between both approaches are portability, transparency level and checkpointing overheads. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
Conference Paper
Full-text available
This paper presents Phoenix, a communication and synchronization substrate that implements a novel protocol for recovering from fail-stop faults when executing graph analytics applications on distributed-memory machines. The standard recovery technique in this space is checkpointing, which rolls back the state of the entire computation to a state that existed before the fault occurred. The insight behind Phoenix is that this is not necessary since it is sufficient to continue the computation from a state that will ultimately produce the correct result. We show that for graph analytics applications, the necessary state adjustment can be specified easily by the programmer using a thin API supported by Phoenix. Phoenix has no observable overhead during fault-free execution, and it is resilient to any number of faults while guaranteeing that the correct answer will be produced at the end of the computation. This is in contrast to other systems in this space which may either have overheads even during fault-free execution or produce only approximate answers when faults occur during execution. We incorporated Phoenix into D-Galois, the state-of-the-art distributed graph analytics system, and evaluated it on two production clusters. Our evaluation shows that in the absence of faults, Phoenix is ~24x faster than GraphX, which provides fault tolerance using the Spark system. Phoenix also outperforms the traditional checkpoint-restart technique implemented in D-Galois: in fault-free execution, Phoenix has no observable overhead, while the checkpointing technique has 31% overhead. Furthermore, Phoenix mostly outperforms checkpointing when faults occur, particularly in the common case when only a small number of hosts fail simultaneously.
Chapter
Parallel operating systems are the interface between parallel computers (or computer systems) and the applications (parallel or not) that are executed on them. They translate the hardware’s capabilities into concepts usable by programming languages. Great diversity marked the beginning of parallel architectures and their operating systems. This diversity has since been reduced to a small set of dominating configurations: symmetric multiprocessors running commodity applications and operating systems (UNIX and Windows NT) and multicomputers running custom kernels and parallel applications. Additionally, there is some (mostly experimental) work done towards the exploitation of the shared memory paradigm on top of networks of workstations or personal computers. In this chapter, we discuss the operating system components that are essential to support parallel systems and the central concepts surrounding their operation: scheduling, synchronization, multi-threading, inter-process communication, memory management and fault tolerance. Currently, SMP computers are the most widely used multiprocessors. Users find it a very interesting model to have a computer, which, although it derives its processing power from a set of processors, does not require any changes to applications and only minor changes to the operating system. Furthermore, the most popular parallel programming languages have been ported to SMP architectures enabling also the execution of demanding parallel applications on these machines. However, users who want to exploit parallel processing to the fullest use those same parallel programming languages on top of NORMA computers. These multicomputers with fast interconnects are the ideal hardware support for message passing parallel applications. The surviving commercial models with NORMA architectures are very expensive machines, which one will find running calculus intensive applications, such as weather forecasting or fluid dynamics modelling. We also discuss some of the experiments that have been made both in hardware (DASH, Alewife) and in software systems (TreadMarks, Shasta) to deal with the scalability issues of maintaining consistency in shared-memory systems and to prove their applicability on a large-scale.
Conference Paper
Mobile agents are distributed programs which can move autonomously in a network to perform tasks on behalf of users. Though mobile agents offer much more flexibility compared to client-server computing, they have additional costs and issues such as security, reliability and fault tolerance which need to be addressed for successful adoption of mobile agent technology in developing real-life applications. Fault tolerance aims to provide reliable execution of agents even in the face of failures that may occur on account of various errors that emerge during migration request failures, communication exceptions, system crashes or security violations. Graph-based fault tolerance protocols have been successfully used for the implementation of fault tolerance in distributed computing. This paper proposes the use of antecedence graphs and message logs for maintaining fault tolerance information of mobile agents. In order to reduce the overhead of carrying fault tolerance information in the form of large antecedence graphs, we propose the use of a parallel checkpointing algorithm. For checkpointing, dependent agents are marked out using antecedence graphs, and only these agents are involved in the process of taking checkpoints. In case of failures, the antecedence graphs and message logs are regenerated for recovery and normal operation then continues. Analysis of the results shows considerable improvement in terms of reduced message overhead, execution and recovery times as compared to the existing graph-based approach.
Article
Burst error model describes phenomena which cause potentially random faults over a bounded time interval. We study the worst response time for tasks under fixed-priority with checkpoints. Since checkpointing schemes depend on time redundancy, they could affect the correctness of the system by causing deadlines to be missed. In this paper, we provide an exact schedulability analysis for burst error model and derive the optimal number of checkpoints. We also explore the fault-tolerant priority assignment policy to get the most efficient scheme.
Article
Clusters of workstations are an attractive environment for high performance computing. For some applications, however, clusters still lack certain properties. One such property is responsive (dependable and timely) execution of programs. This paper studies two mechanisms (checkpointing and replication) to improve the responsiveness (the probability of meeting a deadline in the presence of faults) of a parallel programming system, Calypso, by ameliorating a single point of failure of Calypso. Experiments show that checkpointing is a suitable tool to achieve high responsiveness and that already a very modest degree of replication is sufficient for improved responsiveness.
Article
The flexibility offered by mobile agents is quite noticeable in distributed computing environments. However, the greater flexibility of the mobile agent paradigm compared to the client/server computing paradigm comes with additional threats, since agent systems are prone to failures originating from bad communication, security attacks, agent server crashes, system resource unavailability, network congestion, or even deadlock situations. In such events, mobile agents either get lost or damaged (partially or totally) during execution. In this paper, we propose a parallel checkpointing approach based on the use of antecedence graphs for providing fault tolerance in mobile agent systems. During normal computation and message transmission, the dependency information among mobile agents is recorded in the form of antecedence graphs by the participating mobile agents of a mobile agent group. When a checkpointing procedure begins, the initiator concurrently informs the relevant mobile agents, which minimizes the identification time. The proposed scheme utilizes the checkpointed information, stored in the form of antecedence graphs, for fault tolerance. In case of failures, using the checkpointed information, the antecedence graphs and message logs are regenerated for recovery and normal operation then continues. Moreover, compared with existing schemes, our algorithm involves the minimum number of mobile agents during the identification and checkpointing procedure, which leads to improved system performance. In addition, the proposed algorithm is a domino-free checkpointing algorithm, which is especially desirable for mobile agent systems. Quantitative analysis and experimental simulation show that our algorithm outperforms other coordinated checkpointing schemes in terms of identification time and the number of blocked mobile agents, and can thus provide better system performance. The main contribution of the proposed checkpointing scheme is the enhancement of the graph-based approach, with considerable improvement in terms of reduced message overhead, execution, and recovery times.
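A minimal sketch of how the set of dependent agents might be derived from logged messages before a parallel checkpoint, assuming the antecedence information is available as (sender, receiver) pairs; names and data layout are illustrative rather than the paper's algorithm.

    from collections import defaultdict

    def dependent_agents(message_log, initiator):
        """Agents in the initiator's causal past (via logged messages); only these
        need to take part in the checkpoint, the rest are left undisturbed."""
        senders_to = defaultdict(set)
        for sender, receiver in message_log:      # each logged message adds a dependency edge
            senders_to[receiver].add(sender)

        involved, stack = {initiator}, [initiator]
        while stack:
            agent = stack.pop()
            for dep in senders_to[agent] - involved:
                involved.add(dep)
                stack.append(dep)
        return involved

    log = [("A", "B"), ("B", "C"), ("D", "E")]    # D and E never influenced C
    print(dependent_agents(log, "C"))             # {'A', 'B', 'C'}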
Article
In recent years many researchers have been incorporating mobile agents into e-service applications, especially in e-learning and e-commerce, to improve network latency and to reduce network traffic. On the other hand, security issues hold back mobile agent usage. The main intention of an attacker is to kill or modify the behaviour of an agent in the middle of its journey in order to degrade the trustworthiness of the agent environment. In this paper, we propose a fault tolerance mechanism for preventing agent blocking in scenarios where the agent is captured by a malicious host in the network. This approach makes use of acknowledgements and partial result retrieval and, when implemented in a mobile agent platform, allows the originator to retrieve partial results and track the location of the mobile agent at any time during transaction execution. During recovery of the mobile agent, all of its components (agent code, itinerary, credential information, collected information and state) can be recovered. The proposed mechanism is capable of improving fault-tolerance time, reliability and performance, especially for mobile agents in e-commerce Internet applications.
Article
Full-text available
This paper presents a Checkpoint-based Rollback Recovery and Migration System for the Message Passing Interface, ChaRM4MPI, for Linux clusters. Several important fault-tolerance mechanisms are designed and implemented in this system, including a coordinated checkpointing protocol, synchronized rollback recovery, and process migration. With ChaRM4MPI, transient node faults are recovered from automatically, and permanent faults can also be recovered from through checkpoint mirroring and process migration techniques. Moreover, users can migrate MPI processes from one node to another manually for load balancing or system maintenance. ChaRM4MPI is a user-transparent implementation and introduces little runtime overhead.
Article
Full-text available
Supply chains are made up of distributed, heterogeneous, and autonomous elements in a dynamic relationship with one another. Agricultural supply chains are strictly regulated to ensure food safety and multi-level traceability. Contracts in such chains need sophisticated specification and management of chain agents to ensure auditability. A framework that attacks these problems is proposed. It is centered on three elements that support and manage agent interactions: contracts, coordination plans (i.e., business processes), and regulations (i.e., business rules). The main contributions are (1) a contract model suitable for agricultural supply chains, (2) a negotiation protocol able to produce contracts, and (3) negotiation implementation via Web services. Interoperability among chain processes is fostered by maintaining independence between business processes and contract negotiation.
Article
Cyclic debugging, where a program is executed over and over again, is a popular methodology for tracking down and eliminating bugs. To debug parallel programs, it requires additional techniques due to nondeterministic behavior. Such techniques are record&replay mechanisms. A problem is the cost associated with restarting the program’s execution every time from the beginning. A corresponding solution is offered by combining checkpointing and debugging, which allows restarting executions at an intermediate state. However, minimizing the replay time is still a challenge. Previous methods cannot ensure that the replay time has an upper bound. This quality is important for developing a debugging tool, in which some degree of interactivity for the user’s investigations is required. This problem is discussed in this paper and an approach to minimize the replay time, the MRT method, is described. It ensures a small upper bound for the replay time with low overhead. The resulting technique is able to reduce the waiting time and the costs of cyclic debugging.
Article
Wide-area systems are becoming a popular infrastructure for long-running applications. Rollback-recovery, as a common technology for fault tolerance and load balance, must meet the challenges of scalability and inherent variability in such applications. Most of the rollback-recovery protocols, however, are poor in scalability. Although pessimistic message logging protocols have no such problem, their fault-free overhead sometimes is prohibitive. Aiming at good scalability and acceptable overhead, this paper introduces the concept of pessimism grain and presents a coarse-grained pessimistic message-logging scheme. The paper also evaluates the impact of pessimism grain on the performance of the recovery scheme. Experimental results show that pessimism grain is one of the key configuration parameters to reach a desired performance level. In practice, the proper pessimism grain should be selected based on the characteristics of the applications.
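The notion of a pessimism grain can be sketched as batching the forced writes that a purely pessimistic logger would perform once per message; this is only an illustration of the trade-off under that assumption (a list stands in for stable storage), not the paper's protocol, and all names are hypothetical.

    class CoarseGrainedLogger:
        """Pessimistic-style message logging with a pessimism grain g: messages are
        forced to stable storage in batches of g instead of one synchronous write each."""

        def __init__(self, grain: int, stable_storage: list):
            self.grain = grain
            self.stable = stable_storage          # stands in for a synchronous disk write
            self.pending = []

        def log_and_deliver(self, message):
            self.pending.append(message)
            if len(self.pending) >= self.grain:   # one forced write per `grain` messages
                self.stable.extend(self.pending)
                self.pending.clear()
            return message                        # at most grain-1 messages are at risk on a crash

    disk = []
    logger = CoarseGrainedLogger(grain=4, stable_storage=disk)
    for i in range(10):
        logger.log_and_deliver(f"m{i}")
    print(len(disk), len(logger.pending))         # 8 persisted, 2 still pending

A grain of 1 reduces to classical pessimistic logging; larger grains trade a bounded amount of potentially lost log data for fewer synchronous writes, which is the kind of fault-free overhead the abstract discusses.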
Article
Full-text available
Grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity imposed by the distribution, which poses a number of problems related to the heterogeneity of hardware, operating systems, networks, middleware and applications, dynamic resources, scalability, the lack of common memory, the lack of a common clock, and asynchronous communication between processes. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resistance to these system faults. Fault tolerance is intended to allow the system to provide service as specified in spite of occurrences of faults. It appears as an indispensable element in distributed systems. To meet this need, several techniques have been proposed in the literature. We study the protocols based on rollback recovery. These protocols are classified into two categories: coordinated checkpointing and rollback protocols, and log-based independent checkpointing protocols or message logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proved inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or a hierarchical environment such as grid computing. While some protocols have been used successfully at small scale, they are not suitable for use at large scale. Hence there is a need to implement these protocols in a hierarchical fashion to compare their performance in grid computing. In this paper, we propose hierarchical versions of four well-known protocols. We have implemented and compared the performance of these protocols in clusters and grid computing using the OMNeT++ simulator.
Article
Full-text available
TxLinux is the first operating system to use hardware transactional memory (HTM) as a synchronization primitive, and the first to manage HTM in the scheduler. TxLinux, which is a modification of Linux, is the first real-scale benchmark for transactional memory (TM). MetaTM is a modification of the x86 architecture that supports HTM in general and TxLinux specifically. This paper describes and measures TxLinux and MetaTM, the HTM model that supports it. TxLinux greatly benefits from a new primitive, called the cooperative transactional spinlock (cxspinlock) that allows locks and transactions to protect the same data while maintaining the advantages of both synchronization primitives. Integrating the TxLinux scheduler with the MetaTM's architectural support for HTM eliminates priority inversion for several real-world benchmarks.
Chapter
The sections in this article are
Article
Full-text available
One obtains in this paper a process algebra RCCS, in the style of CCS, where processes can backtrack. Backtrack, just as plain forward computation, is seen as a synchronization and incurs no additional cost on the communication structure. It is shown that, given a past, a computation step can be taken back if and only if it leads to a causally equivalent past.
Article
Full-text available
Time-based coordinated checkpointing protocols are well suited for mobile computing systems because no explicit coordination message is needed while the advantages of coordinated checkpointing are kept. However, without coordination, every process has to take a checkpoint during a checkpointing process. In this paper, an efficient time-based coordinated checkpointing protocol for mobile computing systems over Mobile IP is proposed. The protocol reduces the number of checkpoints per checkpointing process to nearly minimum, so that fewer checkpoints need to be transmitted through the costly wireless link. Our protocol also performs very well in the aspects of minimizing the number and size of messages transmitted in the wireless network. In addition, the protocol is nonblocking because inconsistencies can be avoided by the piggybacked information in every message. Therefore, the protocol brings very little overhead to a mobile host with limited resource. Additionally, by taking advantage of reliable timers in mobile support stations, the time-based checkpointing protocol can adapt to wide area networks.
Conference Paper
Recently, numerous studies have focused on multi-agent based intrusion detection systems (IDSs) in order to detect intrusion behavior more efficiently. However, since an agent is easily subverted by a process that is faulty, a multi-agent based intrusion detection system must be fault tolerant by being able to recover from system crashes, caused either accidentally or by malicious activity. Many of the existing IDSs have no means of providing such failure recovery. In this paper, we propose the novel intrusion-tolerant IDS using communication-induced checkpointing and pessimistic message logging techniques. When the failed agent is restarted, therefore, our proposed system can recover its previous state and resume its operation unaffected. In addition, agents communicate with each other by sending messages without causality violation using vector timestamps.
Conference Paper
This paper presents a mechanism that organizes processes in the hierarchy and efficiently maintains it in the presence of addition/removal of nodes to the system, and in the presence of node failures. This mechanism can support total order of broadcasts and does not rely on any specific system features or special hardware. In addition, it can concurrently support multiple logical structures, such as a ring, a hypercube, a mesh, and a tree.
Article
We give a detailed analysis of communication-induced checkpointing protocols that are free of domino effect. We investigate the validity of a common intuition in the literature and demonstrate that there is no optimal on-line domino-free protocol in terms of the number of forced checkpoints. Formal proofs on comparing existing protocols in the literature are given.
Article
This paper introduces a combination of the existing parallel checkpointing techniques for software heterogeneous ClusterGrid infrastructures. Most of the existing solutions are aiming at supporting application transparency (no checkpoint related code development in application), but some others build middleware transparent (no service modification) solutions. The main contribution of this paper is to introduce a solution providing both application and middleware transparency at the same time. Compatibility and integrity requirements are identified and corresponding conditions are established using Abstract State Machines. The most relevant checkpointing systems are checked against the conditions in order to examine their conformity. Based on the conditions, a novel checkpointing method is defined and a proof of concept checkpointing tool, called TotalCheckpoint (TCKPT) is introduced.
Article
Full-text available
We examine extensions to the π-calculus for representing basic elements of distributed systems. In spite of its expressiveness for encoding various programming constructs, some of the phenomena inherent in distributed systems are hard to model in the π-calculus. We consider message loss, sites, timers, site failure and persistence as extensions to the calculus and examine their descriptive power, taking the Two Phase Commit Protocol (2PCP), a basic instance of an atomic commitment protocol, as a testbed. Our extensions enable us to represent the 2PCP under various failure assumptions, as well as to reason about the essential properties of the protocol.
Conference Paper
Full-text available
The paper presents MPICH-CM - a new architecture of communications in message-passing systems, developed for MPICH-V - a MPI implementation for P2P systems. MPICH-CM implies communications between nodes through special Channel Memories introducing fully decoupled communication media. Some new properties of communications based on MPICH-CM are described in comparison with other communication architectures, with emphasis on grid-like and volunteer computing systems. The first implementation of MPICH-CM is performed as a special MPICH device connected with Channel Memory servers. To estimate the overhead of MPICH-CM, the performance of MPICH-CM is presented for basic point-to-point and collective operations in comparison with MPICH p4 implementation.
Conference Paper
To cope with various intrusion patterns, an intrusion detection system, which is based on multiple agents working collectively, was proposed recently. Since an agent is easily subverted by a process that is faulty, a multi-agent based intrusion detection system must be fault tolerant by being able to recover from system crashes, either accidental or malicious activity. However, there have been very few attempts to provide fault tolerance in intrusion detection system. In this paper, we propose the rollback recovery algorithm for intrusion-tolerant intrusion detection system using communication-induced checkpointing and pessimistic message logging techniques. Thus, our proposed scheme guarantees a consistent global snapshot.