Article

A Survey of Rollback-Recovery Protocols in Message-Passing Systems

Authors: E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson

Abstract

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
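To make the distinction concrete, the sketch below (illustrative Python only; names such as `Determinant` and `PessimisticLogger` are hypothetical and not from the survey) shows the kind of tuple a log-based protocol records for each nondeterministic delivery event, and how a pessimistic logger can replay those determinants, in order, to bring a recovering process past its last checkpoint.

```python
from dataclasses import dataclass, field
from typing import Any, List

# A determinant captures what is needed to replay one nondeterministic event
# (here, a message delivery): which message was delivered to which process,
# and in which position of the receiver's delivery order.
@dataclass(frozen=True)
class Determinant:
    dest: str          # receiving process
    source: str        # sending process
    send_seq: int      # sender's sequence number for the message
    recv_seq: int      # position of this delivery at the receiver

@dataclass
class PessimisticLogger:
    """Logs each determinant to (simulated) stable storage before the receive
    is allowed to influence the computation, i.e., pessimistic logging."""
    stable_log: List[Determinant] = field(default_factory=list)

    def deliver(self, det: Determinant, payload: Any) -> Any:
        self.stable_log.append(det)   # stand-in for a synchronous write to stable storage
        return payload                # only now may the application consume the message

    def replay_after(self, checkpoint_recv_seq: int, dest: str) -> List[Determinant]:
        """During recovery, re-deliver messages in their original order,
        starting from the receiver's last checkpointed state."""
        return sorted(
            (d for d in self.stable_log
             if d.dest == dest and d.recv_seq > checkpoint_recv_seq),
            key=lambda d: d.recv_seq,
        )
```

A purely checkpoint-based protocol keeps only the checkpoints and must roll back to the most recent consistent set of them, whereas a logger like the one above can deterministically re-execute past the checkpoint.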


... The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. In this context, a popular approach is based on checkpointing and rollback recovery (see, e.g., the survey by Elnozahy et al. [4]). Checkpointing requires each process to take a snapshot of its state at specific points in time. ...
... The reduction of check creates a checkpoint, which turns on the reversible mode of a process as a side-effect (assuming it was not already on). As in [21,23], reversibility is achieved by defining an appropriate Landauer embedding [17], i.e., by adding a history of the computation to each process configuration. A checkpoint is propagated to other processes when a causally dependent action is performed (i.e., spawn and send); following the terminology of [4], these checkpoints are called forced checkpoints. A call of the form commit(τ) removes τ from the list of active checkpoints of a process, turning the reversible mode off when the list of active checkpoints is empty. ...
... There is abundant literature on checkpoint-based rollback recovery to improve fault tolerance (see, e.g., the survey by Elnozahy et al. [4] and references therein). In contrast to most of them, our distinctive features are the extension of the underlying language with explicit operators for rollback recovery, the automatic generation of forced checkpoints (somewhat similarly to communication-induced checkpointing [32]), and the use of a reversible semantics. ...
Preprint
Full-text available
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy is able to bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on some explicit checkpointing primitives and the use of a (partially) reversible semantics for rolling back the system.
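A minimal, single-process sketch of the bookkeeping behind such explicit checkpointing primitives might look as follows (plain Python rather than the authors' message-passing language; the class and method names are hypothetical): taking a checkpoint stores a snapshot and switches on reversible mode, commit discards it, and rollback restores it.

```python
import copy
from typing import Any, Dict, List

class Process:
    """Toy model of a process with explicit checkpoint/commit/rollback
    operators, loosely mirroring the primitives described above."""

    def __init__(self, state: Dict[str, Any]):
        self.state = state
        self.snapshots: Dict[str, Dict[str, Any]] = {}   # active checkpoints, by tag
        self.history: List[Dict[str, Any]] = []          # recorded only in reversible mode

    @property
    def reversible(self) -> bool:
        return bool(self.snapshots)          # reversible mode is on while checkpoints are active

    def checkpoint(self, tag: str) -> None:
        self.snapshots[tag] = copy.deepcopy(self.state)

    def step(self, update: Dict[str, Any]) -> None:
        if self.reversible:
            self.history.append(copy.deepcopy(self.state))   # keep a history for rolling back
        self.state.update(update)

    def commit(self, tag: str) -> None:
        self.snapshots.pop(tag, None)
        if not self.reversible:              # last active checkpoint gone: drop the history
            self.history.clear()

    def rollback(self, tag: str) -> None:
        self.state = self.snapshots.pop(tag)  # restore the checkpointed state
        self.history.clear()
```

The actual strategy also propagates forced checkpoints to causally dependent processes on spawn and send, which this single-process toy omits.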
... Despite a broad body of work regarding checkpoint-based techniques, there are just a few comprehensive reports about the advances in the field. In particular, the surveys focus on specific areas in the distributed systems, such as global checkpoints in message-passing systems [Elnozahy et al. 2002], or checkpoints in HPC [Egwutuoha et al. 2013]. This paper sheds some light on advances in checkpoint-based research over the last 50 years. ...
... In contrast to coordinated checkpointing, uncoordinated checkpointing allows processes to take checkpoints independently [Mostefaoui and Raynal 1996, Elnozahy et al. 2002, Mendizabal et al. 2014]. Each process captures its state without the knowledge of others, enabling them to take the checkpoint when it is more convenient and resume normal operation immediately after checkpoint completion. ...
... Many research works have investigated the side effects and challenges of developing consistent checkpoint-based recovery approaches for general-purpose distributed systems, including the comprehensive survey of checkpoint/recovery protocols in message-passing systems presented by [Elnozahy et al. 2002]. The paper defines the system model of message passing, in which different processes exchange messages with each other and the outside world (defined as something not controlled by the current system). ...
Conference Paper
This paper concisely reviews checkpointing techniques in distributed systems, focusing on various aspects such as coordinated and uncoordinated checkpointing, incremental checkpoints, fuzzy checkpoints, adaptive checkpoint intervals, and kernel-based and user-space checkpoints. The review highlights interesting points, outlines how each checkpoint approach works, and discusses their advantages and drawbacks. It also provides a brief overview of the adoption of checkpoints in different contexts in distributed computing, including Database Management Systems (DBMS), State Machine Replication (SMR), and High-Performance Computing (HPC) environments. Additionally, the paper briefly explores the application of checkpointing strategies in modern cloud and container environments, discussing their role in live migration and application state management. The review offers valuable insights into their adoption and application across various distributed computing contexts by summarizing the historical development, advances, and challenges in checkpointing techniques.
... To counteract the negative effects of such misfortunes, low-overhead fault-tolerance techniques are essentially required for the system [4]. For this purpose, the rollback recovery technique is one of the appropriate tools and is classified into two kinds: checkpoint-based and message-logging-based [5]. First, the checkpoint-based technique achieves fault tolerance by letting each process periodically record its local state, called a checkpoint, on stable storage [6]. ...
... In this paper, a novel SBML protocol is presented to solve this important problem with the following features. First, the proposed protocol permits no rollback of live processes during recovery, even in the case of concurrent process failures, called the always no rollback property [5], by the symmetric distribution of redundant determinants of each message. Second, it can still preserve the first feature even in a network environment enabling a composite of point-to-point and group communication. ...
... However, it is assumed that the channels are immune to partitioning. In addition, we assume that processes may fail based on the crash-failure model, in which they lose contents in their volatile memories and stop their executions [5]. This system is augmented with an unreliable failure detector [20] in order to solve the impossibility problem on distributed consensus. ...
Article
Full-text available
Most of the existing sender-based message logging protocols cannot commonly handle simultaneous failures because, if both the sender and the receiver(s) of each message fail together, the receiver(s) cannot obtain the recovery information of the message. This unfortunate situation may happen due to their asymmetric logging behavior. This paper presents a novel sender-based message logging protocol for broadcast network based distributed systems to overcome the critical constraint of the previous ones with the following three features. First, when more than one process crashes at the same time, the protocol enables the system to ensure the always no rollback property by symmetrically replicating the recovery information at each process or group member connected on a network. Second, it can make the first feature persist even if the general form of communication for the system is a combination of point-to-point and group ones. Third, the communication overhead resulting from the replication can be greatly reduced by making full use of the capability of the standard broadcast network in both communication modes. Experimental outcomes verify that, no matter which communication patterns are applied, it can reduce the total application execution time by about 4.23∼9.96% compared with the latest protocol enabling the traditional ones to cope with simultaneous failures.
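The following toy sketch (hypothetical names; the actual protocol is considerably richer) illustrates the core idea of sender-based logging and why symmetrically replicating each message's recovery information at other nodes protects against a sender and receiver failing together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LogEntry:
    payload: bytes
    recv_seq: int = -1                   # delivery order at the receiver, once acknowledged

@dataclass
class Node:
    name: str
    send_log: Dict[Tuple[str, int], LogEntry] = field(default_factory=dict)
    mirror_log: Dict[Tuple[str, int], LogEntry] = field(default_factory=dict)
    next_seq: int = 0

    def send(self, dest: "Node", payload: bytes, peers: List["Node"]) -> None:
        key = (dest.name, self.next_seq)
        self.next_seq += 1
        self.send_log[key] = LogEntry(payload)       # classic sender-based logging
        for p in peers:                              # symmetric replication: peers keep a copy,
            p.mirror_log[key] = LogEntry(payload)    # surviving a sender+receiver crash

    def record_order(self, key: Tuple[str, int], recv_seq: int, peers: List["Node"]) -> None:
        """The receiver's delivery order (the determinant) is replicated the same way."""
        self.send_log[key].recv_seq = recv_seq
        for p in peers:
            p.mirror_log[key].recv_seq = recv_seq
```

In a broadcast network the replication to peers comes almost for free, which is the observation the third feature above exploits.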
... The delay in the message transfer is finite and unpredictable. These are the characteristics of the well-known "asynchronous distributed systems" [6]. When computation is extensive, the possibility of process failures may be on the rise. ...
... The present work discards such a message by adopting a technique at the receiving end, whereas in another approach [7] a process refrains from sending during the interval between the receipt of the CP initiation message and the completion of committing that CP. Distributed systems that use the recovery block approach [6,10] and have a common time base may estimate a time by which the participating processes would take the acceptance test. These estimated instants form the pseudo point times as described in [14]. ...
... The performance (in Mega Flops) of DCP is (48% and 30%), (54.5% and 50.34%) and (56.19% and 58.59%) higher during compression with the CCP and CICP protocols when 1, 4 and 8 processors are in action. Table 1 summarizes the comparison of the different checkpointing techniques discussed in the survey papers [6,10]. The following notations are used to compare the present work: ...
Article
The checkpointing mechanism is one of the most attractive approaches for providing software fault tolerance in distributed message-passing systems. This paper aims to implement a distributed checkpointing technique that eliminates the drawbacks of the centralized approach, such as the “domino effect”, “useless checkpoints” (checkpoints that do not contribute to global consistency), and “hidden and zigzag” dependencies. The proposed checkpointing protocol has a checkpoint initiator, but coordination among the local checkpoints is done in a distributed fashion. The guarantee that no message is lost when a failure occurs is maintained in this work by exchanging information among the processes. There is no central checkpoint initiator; instead, each of the processes takes turns acting as an initiator. Processes take local checkpoints only after being notified by the initiator. The processes synchronize their activities for the current checkpointing interval before finally committing their checkpoints. Thus, the checkpointing pattern described in this paper takes only those checkpoints that contribute to a consistent global snapshot, thereby eliminating useless checkpoints.
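A rough sketch of two-phase, rotating-initiator coordination of this kind is shown below (illustrative Python; the names are hypothetical and the real protocol also exchanges dependency information among processes): tentative checkpoints are taken first and committed only if every participant succeeded.

```python
from typing import Dict, List, Optional

class Participant:
    def __init__(self, name: str):
        self.name = name
        self.state: Dict[str, object] = {}
        self.stable: Dict[str, object] = {}                 # last committed checkpoint
        self.tentative: Optional[Dict[str, object]] = None  # checkpoint of the current round

    def take_tentative(self) -> bool:
        self.tentative = dict(self.state)    # in practice this can fail and return False
        return True

    def commit_tentative(self) -> None:
        self.stable, self.tentative = self.tentative or self.stable, None

    def discard_tentative(self) -> None:
        self.tentative = None

class RotatingCoordinator:
    """One round of two-phase coordinated checkpointing with a rotating initiator:
    tentative checkpoints first, committed only if every participant succeeded."""

    def __init__(self, processes: List[Participant]):
        self.processes = processes
        self.turn = 0

    def run_round(self) -> bool:
        initiator = self.processes[self.turn % len(self.processes)]  # role merely rotates in this toy
        self.turn += 1
        if not all(p.take_tentative() for p in self.processes):      # phase 1: tentative checkpoints
            for p in self.processes:
                p.discard_tentative()       # abort the round: no useless checkpoints are kept
            return False
        for p in self.processes:            # phase 2: make the checkpoints permanent
            p.commit_tentative()
        return True
```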
... This redundancy allows the system to recover from failures by accessing the data from alternate locations [22]. Regular system state backups, known as checkpointing, involve periodically saving the state of the system to persistent storage [23]. Checkpointing enables the system to recover from failures by rolling back to a previously saved consistent state [24]. ...
... Comparison of fault tolerance strategies [17]–[27] ...
Research
Full-text available
Resilient design patterns play a crucial role in developing fault-oblivious stateful workflow systems in distributed computing. This article explores advanced techniques and strategies for building resilient distributed systems that can gracefully handle failures and maintain operational continuity. It delves into fault tolerance strategies, such as data redundancy, checkpointing, and transactional consistency, to ensure system reliability and data integrity. The article discusses the benefits of microservices architecture in achieving fault isolation and minimizing the impact of failures. It highlights the importance of self-healing mechanisms, including automated fault detection and correction, to ensure continuous operation. Scalability and load balancing techniques, such as dynamic resource adjustment and workload distribution, are explored to accommodate fluctuating demands and optimize system performance. The article also examines error handling and recovery mechanisms, including automated rollbacks and distributed consensus protocols, to maintain data consistency and coordinate recovery actions across nodes. Additionally, it emphasizes the significance of proactive system health monitoring and rapid fault identification and resolution in minimizing downtime and ensuring a smooth user experience. The article concludes by discussing emerging trends, open research problems, and providing recommendations for building resilient distributed systems that can withstand the challenges of modern computing environments.
... There has been much work in the field of distributed checkpointing and rollback recovery. A sound survey of checkpointing protocols is given by (Elnozahy M et al., 1996) [1], though it lacks scalability properties for future and expanded supercomputers. ...
Research
Full-text available
Networks of computers and other systems range from simple to complex. When a system is referred to as complex and stochastic, the challenges of availability, dependability, stability and reliability become serious indicators of effective performance. Fault tolerance plays a crucial role towards achieving dependability, reliability and stability, and is a fundamental requirement for the design and development of effective and efficient fault-tolerance mechanisms. It is also important that the power, weight, space and cost constraints of systems are addressed by efficiently using the available resources for fault tolerance. In life-critical mission systems, reliability is paramount; hence this paper investigates a fault-tolerance mechanism using checkpointing in distributed systems.
... There are several techniques to make an HPC system fault-tolerant (Egwutuoha et al., 2013; Herault and Robert, 2015). Those techniques include, among others, rollback recovery (Elnozahy et al., 2002), replication (Bougeret et al., 2014), computation migration (Filiposka et al., 2019), and algorithm-based fault tolerance (ABFT). ABFT relies on properties of the parallel algorithm itself to tolerate faults during its execution (Huang and Abraham, 1984; Chen and Jack, 2008; Hursey and Graham, 2011; Bagherpour et al., 2017). ...
... Rollback recovery is perhaps the most widely adopted technique to improve the reliability of HPC systems (Egwutuoha et al., 2013). This technique consists of establishing checkpoints to which the system can roll back in case of failures, instead of restarting from the very beginning (Elnozahy et al., 2002). It is a challenge to apply rollback recovery to HPC systems that take a very long time to complete their execution and have a low MTBF. ...
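When rollback recovery is applied to such long-running jobs, the checkpoint interval is typically chosen from the checkpoint cost and the MTBF; Young's classic first-order approximation is a common rule of thumb (the snippet below is a generic illustration, not taken from the cited works).

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval:
    T_opt ~ sqrt(2 * C * MTBF), with C the time to write one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: a 5-minute checkpoint cost and a 24-hour MTBF give roughly a 2-hour interval.
print(young_interval(300.0, 24 * 3600.0) / 3600.0, "hours between checkpoints")
```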
... II. Background. Rollback recovery techniques [28] have been studied in the context of a wide variety of applications, ranging from programming language constructs [29] to recovery protocols for distributed systems [28]. At a high level, such techniques described in prior work can be divided into checkpoint-based and log-based approaches. ...
... But detecting faults for dependent tasks with multiple workflows is really challenging. Several solutions can be found in the literature for classic distributed system research [152,[161][162][163]. However, considering the uniqueness of MCC, incorporating them straightforwardly into MCC would not be effective. ...
... Over the years, several approaches have been proposed to mitigate faults in distributed systems [164]. One of the most common approaches is redundancy or replication [163]. If the same task is submitted to multiple nodes, it is highly improbable that all the instances would be faulty. ...
Article
Full-text available
Owing to the enormous advancement in miniature hardware, modern smart mobile devices (SMDs) have become computationally powerful. Mobile crowd computing (MCC) is the computing paradigm that uses public-owned SMDs to garner affordable high-performance computing (HPC). Though several empirical works have established the feasibility of mobile-based computing for various applications, there is a lack of comprehensive coverage of MCC. This paper aims to explore the fundamentals and other nitty–gritty of the idea of MCC in a comprehensive manner. Starting with an explicit definition of MCC, the enabling backdrops and the detailed architectural layouts of different models of MCC are presented, along with categorising different types of MCC based on infrastructure and application demands. MCC is compared extensively with other HPC systems (e.g. desktop grid, cloud, clusters and supercomputers) and similar mobile computing systems (e.g. mobile grid, mobile cloud, ad hoc mobile cloud, and mobile crowdsourcing). MCC being a complex system, various design requirements and considerations are extensively analysed. The potential benefits of MCC are meticulously mentioned, with special discussions on the ubiquity and sustainability of MCC. The issues and challenges of MCC are critically presented in light of further research scopes. Several real-world applications of MCC are identified and propositioned. Finally, to carry forward the accomplishment of the MCC vision, the future prospects are briefly elucidated.
... The various overheads involved with each method for improving multiple-fault tolerance need to be reduced along with improving the algorithm. Additionally, addressing these complicated factors requires analyzing the method and identifying the critical factors responsible for low performance, so that the multiple-fault tolerance capability is enhanced along with the performance [6]. ...
... A. MPICH: MPICH [6] (also called MPICH1), developed at Argonne National Laboratory, is a freely available, portable implementation of MPI. It implements the MPI-1 standard. ...
Article
Present communication societies rely on High-Performance Computing (HPC) systems for balancing the messaging layers. However, HPC systems are vulnerable to different types of software and hardware failures, which require extra effort to resume the operation of such systems. Because of these vulnerabilities, the layers of message passing in the communication framework become unbalanced. Hence, there is a need for fault-tolerance methods in such HPC systems. Different solutions have been proposed for efficient fault tolerance in HPC systems, but they suffer from various limitations. The checkpointing/restart technique is the most commonly and frequently studied one. The goal of this research is therefore to present an efficient new checkpointing/restart method of fault tolerance in HPC systems for MPI (Message Passing Interface) applications in order to balance the messaging layers. Checkpointing is the most important function of fault-tolerant systems, but the additional overhead, in added program time and in extra storage, has been the main limitation of this fault-tolerance method. Hence, we develop a new, enhanced checkpoint/restart technique for MPI applications. The checkpointing method is supported by an efficient and trusted distributed storage system, with checkpoints ensuring the availability of data at the time of a hardware failure.
... A high replication frequency reduces the work that has to be redone on the backup when that replica becomes active [Elnozahy et al. 2002], providing fast failover. The influence of replication on the performance of the hosted applications is not negligible; previous studies [da Silva et al. 2014, Gerofi and Ishikawa 2011] show that the higher the checkpointing frequency, the lower the latency in the communication with the client interacting with the protected VM. ...
... Remus allows clients to observe consistent system behavior in the event of faults. This consistency, called linearizability, specifies that only what has already been saved to stable storage [Guerraoui and Schiper 1997], in this case the disk or memory of the backup host [Elnozahy et al. 2002], is processed on the primary. Remus considers only outgoing network traffic as the means of guaranteeing linearizability. ...
Conference Paper
Remus is a virtual machine (VM) replication mechanism that provides high availability in the presence of crash faults. Replication is performed through checkpointing at a fixed, predetermined time interval. However, there is an antagonism between processing and communication with respect to the ideal interval between checkpoints: while larger intervals benefit processing-intensive applications, smaller intervals favor applications whose performance is dominated by the network. Therefore, the interval used is not always suited to the resource-usage characteristics of the application running in the VM, limiting the applicability of Remus in certain scenarios. This work presents an adaptive checkpointing proposal for Remus that dynamically adjusts the replication frequency according to the characteristics of the running applications. The results indicate that the proposal achieves better performance for applications that use both processing and communication resources, without penalizing applications that use only one of these types of resources.
... Typically, these tools create incremental checkpoints by identifying and storing dirty memory pages, then piece back together the process image (across multiple files) for restoration [26]. The main limitations of these tools are (1) large incremental checkpoint file sizes resulting from coarse page-level deltas [36], (2) inability to checkpoint across multiple processes [27], and (3) they can only restore a process from scratch: while we found a patent [88] and paper [40] addressing these limitations, enabling OS-level checkpointing for multiprocessing jobs and sub-memory-page granularity incremental checkpointing, respectively, we have been unable to locate working implementations. In comparison, for notebook states, Kishu is able to achieve significantly lower checkpoint overheads via finer Co-variable granularity deltas, checkpoint multiprocessing and off-CPU notebooks via application-level instructions, and fast incremental restore with minimal data loading via state difference detection and leveraging existing data in the process/kernel. ...
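The page-granularity incremental checkpointing mentioned above can be sketched roughly as follows (a simplified illustration, not the implementation of any of the cited tools): only pages whose contents changed since the previous checkpoint are written, and restoration re-applies the deltas in order.

```python
import hashlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4096

def pages(image: bytes) -> Dict[int, bytes]:
    """Split a process image into fixed-size pages keyed by offset."""
    return {off: image[off:off + PAGE_SIZE] for off in range(0, len(image), PAGE_SIZE)}

def incremental_checkpoint(image: bytes,
                           prev_digests: Dict[int, str]) -> Tuple[Dict[int, bytes], Dict[int, str]]:
    """Keep only the pages that changed since the previous checkpoint
    (a coarse, page-granularity delta, as discussed above)."""
    delta: Dict[int, bytes] = {}
    digests: Dict[int, str] = {}
    for off, page in pages(image).items():
        digests[off] = hashlib.sha256(page).hexdigest()
        if prev_digests.get(off) != digests[off]:
            delta[off] = page                    # dirty page: include it in this checkpoint
    return delta, digests

def restore(base: bytearray, deltas: List[Dict[int, bytes]]) -> bytes:
    """Rebuild the process image by re-applying each incremental delta in order."""
    for delta in deltas:
        for off, page in delta.items():
            base[off:off + len(page)] = page
    return bytes(base)
```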
Preprint
Full-text available
Computational notebooks (e.g., Jupyter, Google Colab) are widely used by data scientists. A key feature of notebooks is the interactive computing model of iteratively executing cells (i.e., a set of statements) and observing the result (e.g., model or plot). Unfortunately, existing notebook systems do not offer time-traveling to past states: when the user executes a cell, the notebook session state consisting of user-defined variables can be irreversibly modified - e.g., the user cannot 'un-drop' a dataframe column. This is because, unlike DBMS, existing notebook systems do not keep track of the session state. Existing techniques for checkpointing and restoring session states, such as OS-level memory snapshot or application-level session dump, are insufficient: checkpointing can incur prohibitive storage costs and may fail, while restoration can only be inefficiently performed from scratch by fully loading checkpoint files. In this paper, we introduce a new notebook system, Kishu, that offers time-traveling to and from arbitrary notebook states using an efficient and fault-tolerant incremental checkpoint and checkout mechanism. Kishu creates incremental checkpoints that are small and correctly preserve complex inter-variable dependencies at a novel Co-variable granularity. Then, to return to a previous state, Kishu accurately identifies the state difference between the current and target states to perform incremental checkout at sub-second latency with minimal data loading. Kishu is compatible with 146 object classes from popular data science libraries (e.g., Ray, Spark, PyTorch), and reduces checkpoint size and checkout time by up to 4.55x and 9.02x, respectively, on a variety of notebooks.
... Second, incarnation numbers are key elements used in several failure recovery schemes and distributed algorithms. Under various names (epoch numbers, incarnation numbers) they are required in several rollback-recovery schemes surveyed in [EAWJ02], including optimistic recovery schemes [SY85,DTG99] and causal logging schemes [EZ92]. They are also a key ingredient in several distributed algorithms such as, e.g., algorithms for scalable distributed failure detection [GCG01], membership management [DGM02], and diskless crash-recovery [MPSS17]. ...
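The role incarnation numbers play in these schemes can be illustrated with a small sketch (hypothetical names): each recovery bumps the incarnation, and peers discard messages from older incarnations so that a recovered process is distinguishable from its past selves.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class ProcessId:
    node: str
    incarnation: int      # bumped on every recovery from a crash

class Peer:
    """Track the highest incarnation seen per node and discard messages
    from older, dead incarnations (illustrative only)."""

    def __init__(self) -> None:
        self.latest: Dict[str, int] = {}

    def accept(self, sender: ProcessId) -> bool:
        known = self.latest.get(sender.node, -1)
        if sender.incarnation < known:
            return False                          # stale message from a past incarnation
        self.latest[sender.node] = sender.incarnation
        return True
```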
Preprint
Full-text available
Distributed systems can be subject to various kinds of partial failures, therefore building fault-tolerance or failure mitigation mechanisms for distributed systems remains an important domain of research. In this paper, we present a calculus to formally model distributed systems subject to crash failures with recovery. The recovery model considered in the paper is weak, in the sense that it makes no assumption on the exact state in which a failed node resumes its execution, only its identity has to be distinguishable from past incarnations of itself. Our calculus is inspired in part by the Erlang programming language and in part by the distributed $\pi$-calculus with nodes and link failures (D$\pi$F) introduced by Francalanza and Hennessy. In order to reason about distributed systems with failures and recovery we develop a behavioral theory for our calculus, in the form of a contextual equivalence, and of a fully abstract coinductive characterization of this equivalence by means of a labelled transition system semantics and its associated weak bisimilarity. This result is valuable for it provides a compositional proof technique for proving or disproving contextual equivalence between systems.
... The VSM component is responsible for the recovery of stateful VNFs and is based on checkpoint/restore (Elnozahy et al., 2002). Since VNFs run on virtual devices, which can be either virtual machines or containers, capturing the VNF state without modifying the VNF source code is perfectly feasible and represents a very attractive option. ...
... Note that the recovery addressed in this work is different from the recovery concept in computer systems. The latter is to recover computing tasks, e.g., variables' values, and thus is limited to the cyber part [44]–[49]. In contrast, this paper focuses on recovering the state of the physical system or the physical state, e.g., the speed of a vehicle. ...
... An orphan message is a message which is received but never sent between a pair of local checkpoints of two different processes during recovery [20,21]. Definition 3. The consistency of a global checkpoint is ensured if and only if there exists no orphan message between every pair of local checkpoints belonging to the global checkpoint [22]. ...
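Under these definitions, the consistency of a global checkpoint can be checked directly from the sets of messages recorded as sent and received before each local checkpoint, as in the small sketch below (illustrative only).

```python
from typing import Set, Tuple

Message = Tuple[str, str, int]    # (sender, receiver, message id)

def has_orphan(sent_before_ckpt: Set[Message], received_before_ckpt: Set[Message]) -> bool:
    """A global checkpoint is inconsistent iff some message is recorded as received
    by a local checkpoint but not as sent by the sender's checkpoint (an orphan)."""
    return bool(received_before_ckpt - sent_before_ckpt)

# Message ("p1", "p2", 1) was received before p2's checkpoint but sent after p1's
# checkpoint, so this pair of local checkpoints is inconsistent.
sent = {("p1", "p2", 0)}
received = {("p1", "p2", 0), ("p1", "p2", 1)}
assert has_orphan(sent, received)
```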
Article
Full-text available
The existing communication-induced checkpointing protocols may not scale well due to their slow acquisition of the most recent timestamps of the next checkpoints of other processes. Accurate situation awareness with diversified information conveyance paths is needed to reduce the number of unnecessary forced checkpoints taken to as few as possible. In this paper, a scalable communication-induced checkpointing protocol is proposed to considerably cut down the possibility of performing unnecessary forced checkpointing by exploiting the beneficial features of reliable communication channels. The protocol enables the sender of an application message to swiftly attain the most recent timestamp-related information of the next checkpoint of its receiver and accelerate the spread of the information to others, with little overhead. This behavioral feature may significantly elevate the accuracy of the awareness of the situations in which forced checkpointing is actually needed for useless-checkpoint-free recovery. In addition, it generates no extra control message and no message logging overhead while significantly lessening the latency of message sending. Moreover, the protocol can always be operated under the non-deterministic execution model. The evaluation results indicate that the proposed protocol outperforms the existing ones, with forced checkpointing overheads reduced by 12.5% to 84.2% and total execution times reduced by 2.5% to 11.5%.
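The general flavor of index-based communication-induced checkpointing, of which the above is a refinement, can be sketched as follows (a simplified classic-style rule, not the specific protocol of this paper): every message piggybacks the sender's checkpoint index, and the receiver takes a forced checkpoint before delivering a message that carries a larger index.

```python
from typing import Any, List, Tuple

class CicProcess:
    """Index-based communication-induced checkpointing, classic style:
    piggyback the checkpoint index on every message and take a forced
    checkpoint before delivering a message with a larger index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.index = 0                                   # index of the current checkpoint interval
        self.checkpoints: List[Tuple[int, str]] = []

    def take_basic_checkpoint(self) -> None:
        self.index += 1
        self.checkpoints.append((self.index, "basic"))

    def send(self, payload: Any) -> Tuple[int, Any]:
        return (self.index, payload)                     # piggybacked index, no extra control messages

    def receive(self, message: Tuple[int, Any]) -> Any:
        sender_index, payload = message
        if sender_index > self.index:
            self.index = sender_index                    # forced checkpoint taken before delivery
            self.checkpoints.append((self.index, "forced"))
        return payload
```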
... (CCP) is a 2-tuple (Ĥ, CĤ), where Ĥ is a partially ordered set modeling a distributed computation and CĤ is a set of local checkpoints involved in Ĥ [2]. ...
... Another common approach to imposing fault tolerance in distributed systems is rollback and recovery. The system periodically maintains checkpoints which save the system state at particular instances [476]. If the system crashes, it returns to the most recent checkpoint and restarts. ...
Thesis
Full-text available
The effects of environmental pollution and global warming have become a reality and severe. In addition to other causes, wide adoption and huge demands for computational resources have aggravated it significantly. The production process of the computing devices involves hazardous and toxic substances which not only harm human and other living beings’ health but also contaminate the water and soil. The production and operations of these computers on a large scale also result in massive energy consumption and greenhouse gas generation. Moreover, the low use cycle of these devices produces a huge amount of not easy-to-decompose e-waste. In this outlook, instead of buying new devices, it is advisable to use the existing resources to their fullest, which will minimize the environmental penalties of production and e-waste. In this study, we advocate for mobile crowd computing (MCC) to ease off the use of centralized high-performance computers (HPCs) such as data centres and supercomputers by utilising SMDs (smart mobile devices) as computing devices. We envision establishing MCC as the most feasible computing system solution for sustainable computing. Successful real-world implementation of MCC involves several challenges. In this study, we primarily focus on resource management. We devised a methodological and structured approach for resource profiling. We present the resource selection problem as an MCDM (multi-criteria decision making) problem and compared five MCDM approaches of dissimilar families to find the appropriate methodology for dynamic resource selection in MCC. To improve the overall performance of MCC, we present two scheduling algorithms, considering different objectives such as makespan, resource utilisation, load balance, and energy efficiency. We propose a deep learning based resource availability prediction to minimise the job offloading in a local MCC. We further propose a mobility-aware resource provisioning scheme for a P2P MCC. Finally, we present a proof-of-concept of the proposed MCC model as a working prototype. For this, we consider a smart HVAC (heating, ventilation, and cooling) system in a multistorey building and use MCC as a local edge computing infrastructure. The successful implementation of the prototype proves the feasibility and potential of MCC as alternative sustainable computing.
... The use of process recovery to provide fault tolerance in distributed systems is already a well-known subject. Several protocols have been proposed in the literature [Elnozahy et al., 2002] for checkpointing (saving the information that may be used later) and process recovery in distributed environments, but little is known about implementations of these proposals. The implementation of recovery algorithms in distributed systems, especially those in the asynchronous category, has been a target of research, since open issues still remain. ...
Conference Paper
Rollback recovery based on recovery points is widely used as a fault-tolerance technique. The complex model of distributed systems has motivated the development of several algorithms in search of simpler and more efficient solutions. In the Fault Tolerance Group at UFRGS, an algorithm was recently proposed that targets applications in asynchronous message-passing distributed systems, operates with coordinated saving of recovery points, and provides for the handling of orphan and lost messages. This article describes the design decisions, the implementation of the algorithm, and the results obtained so far.
... The VSM is the component responsible for the recovery of stateful VNFs. The VSM is based on checkpoint/restore [4]. As VNFs run on virtual devices, which can be either virtual machines or containers, saving the network function instance is a feasible and attractive strategy to capture the VNF state without requiring any modification of the VNF source code. ...
Article
Full-text available
Developing embedded software applications is a challenging task, chiefly due to the limitations that are imposed by the hardware devices or platforms on which they operate, as well as due to the heterogeneous non-functional requirements that they need to exhibit. Modern embedded systems need to be energy efficient and dependable, whereas their maintenance costs should be minimized, in order to ensure the success and longevity of their application. Being able to build embedded software that satisfies the imposed hardware limitations, while maintaining high quality with respect to critical non-functional requirements is a difficult task that requires proper assistance. To this end, in the present paper, we present the SDK4ED Platform, which facilitates the development of embedded software that exhibits high quality with respect to important quality attributes, with a main focus on energy consumption, dependability, and maintainability. This is achieved through the provision of state-of-the-art and novel quality attribute-specific monitoring and optimization mechanisms, as well as through a novel fuzzy multi-criteria decision-making mechanism for facilitating the selection of code refactorings, which is based on trade-off analysis among the three main attributes of choice. Novel forecasting techniques are also proposed to further support decision making during the development of embedded software. The usefulness, practicality, and industrial relevance of the SDK4ED platform were evaluated in a real-world setting, through three use cases on actual commercial embedded software applications stemming from the airborne, automotive, and healthcare domains, as well as through an industrial study. To the best of our knowledge, this is the first quality analysis platform that focuses on multiple quality criteria, which also takes into account their trade-offs to facilitate code refactoring selection.
Article
A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy is in operation, aiming to protect against possibly lengthy recovery periods by backing up the current state at periodic checkpoints. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits.
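The trade-off such a model captures can also be illustrated numerically with a crude Monte Carlo sketch (not the exact queueing analysis of the paper): longer intervals waste more work per failure, while shorter intervals pay more checkpointing overhead.

```python
import random

def expected_completion_time(work: float, interval: float, ckpt_cost: float,
                             mtbf: float, recovery: float, runs: int = 5000) -> float:
    """Monte Carlo estimate of the mean time to finish `work` seconds of computation,
    checkpointing every `interval` seconds of useful work, with exponentially
    distributed failures (rate 1/mtbf), a fixed checkpoint cost and a fixed
    recovery time. Illustrative only."""
    total = 0.0
    for _ in range(runs):
        clock, done = 0.0, 0.0
        while done < work:
            useful = min(interval, work - done)
            segment = useful + ckpt_cost
            time_to_failure = random.expovariate(1.0 / mtbf)
            if time_to_failure < segment:
                clock += time_to_failure + recovery   # lose this segment, roll back and recover
            else:
                clock += segment                      # segment and its checkpoint completed
                done += useful
        total += clock
    return total / runs

# Sweep a few intervals (seconds) to see where checkpointing pays off.
for t in (60, 300, 900, 3600):
    print(t, round(expected_completion_time(8 * 3600, t, 30, 4 * 3600, 120)))
```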
Article
Full-text available
Fault tolerance is crucial in ensuring smooth working of distributed and cloud computing. It is challenging to implement because of the constantly changing infrastructure and complex configurations in distributed and cloud computing. Implementation of various fault tolerance methods require domain‐specific knowledge as well as in‐depth understanding of the existing techniques and approaches. Recent surveys on fault tolerance in cloud and distributed environments exist, but they have limitations. This article systematically reviews fault tolerance approaches in distributed and cloud computing and discusses their taxonomy. Based on the taxonomy provided, fault‐tolerance approaches are divided into four types, that is, reactive approaches, proactive approaches, adaptive approaches, and hybrid approaches. Reactive approaches provide a preventive measure after the occurrence of faults in the system. Proactive approaches prevent the system or minimize failure effects by predicting in advance. The adaptive approaches predict, learn, and adapt the changes to deal with new faults in the system. The hybrid approaches combine reactive, proactive, and adaptive approaches. The objective of this article is to give a better understanding of handling faults using suitable approaches and further compare them on various parameters. The paper also presents a promising research direction based on the challenges and issues in multiple approaches.
Chapter
Serverless computing promises to significantly simplify cloud computing by providing Functions-as-a-Service where invocations of functions, triggered by events, are automatically scheduled for execution on compute nodes. Notably, the serverless computing model does not require the manual provisioning of virtual machines; instead, FaaS enables load-based billing and auto-scaling according to the workload, reducing costs and making scheduling more efficient. While early serverless programming models only supported stateless functions and severely restricted program composition, recently proposed systems offer greater flexibility by adopting ideas from actor and dataflow programming. This paper presents a survey of actor-like programming abstractions for stateful serverless computing, and provides a characterization of their properties and highlights their origin.
Chapter
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy is able to bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on some explicit checkpointing operators and the use of a (partially) reversible semantics for rolling back the system.
Article
Full-text available
Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between the first (’00–’10) and second (’11–’23) generation of stream processing systems, and discuss future trends and open problems.
Article
Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as "second-class citizens" when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput. We present Rhythm , a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service's tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
Article
A multi‐tenant microservice architecture involving components with asynchronous interactions and batch jobs requires efficient strategies for managing asynchronous workloads. This article addresses this issue in the context of a leading company developing tax software solutions for many national and multi‐national corporations in Brazil. A critical process provided by the company's cloud‐based solutions encompasses tax integration, which includes coordinating complex tax calculation tasks and needs to be supported by asynchronous operations using a message broker to guarantee order correctness. We explored and implemented two approaches for managing asynchronous workloads related to tax integration within a multi‐tenant microservice architecture in the company's context: (i) a polling‐based approach that employs a queue as a distributed lock (DL) and (ii) a push‐based approach named single active consumer (SAC) that relies on the message broker's logic to deliver messages. These approaches aim to achieve efficient resource allocation when dealing with a growing number of container replicas and tenants. In this article, we evaluate the correctness and performance of the DL and SAC approaches to shed light on how asynchronous workloads impact the management of multi‐tenant microservice architectures from delivery and deployment perspectives.
Conference Paper
Full-text available
Condition monitoring of machinery is becoming more and more popular, particularly in process plants, where a sudden breakdown may prove costly or even fatal. The uncertainties faced by rotating machinery, such as misalignment, looseness, bearing defects and electrical faults, are corrected by familiar and ordinary maintenance procedures. Replacing faulty bearings, gears, drive belts, couplings and other machine components is also a rather straightforward process. However, correcting unbalance requires some special knowledge and understanding. An unbalanced rotor always causes more vibration, generates excessive forces on the bearing areas and reduces the life of the machine. In this work a vector balancing method is presented, which minimizes the number of trial runs by eliminating all guesswork. In the case of non-stationary signals, the use of spectrum analysis based on the Fourier transform has some limitations. Hence, the vibration signal analysis is evaluated using the continuous wavelet transform method, one of the most important tools for signal processing, particularly for non-stationary signals.
Chapter
To react to unforeseen circumstances or amend abnormal situations in communication-centric systems, programmers are in charge of “undoing” the interactions which led to an undesired state. To assist this task, session-based languages can be endowed with reversibility mechanisms. In this paper we propose a language enriched with programming facilities to commit session interactions, to roll back the computation to a previous commit point, and to abort the session. Rollbacks in our language always bring the system to previous visited states and a rollback cannot bring the system back to a point prior to the last commit. Programmers are relieved from the burden of ensuring that a rollback never restores a checkpoint imposed by a session participant different from the rollback requester. Such undesired situations are prevented at design-time (statically) by relying on a decidable compliance check at the type level, implemented in MAUDE. We show that the language satisfies error-freedom and progress of a session.
Article
Cloud developers have to build applications that are resilient to failures and interruptions. We advocate for a fault-tolerant programming model for the cloud based on actors, retry orchestration, and tail calls. This model builds upon persistent data stores and message queues readily available on the cloud. Retry orchestration not only guarantees that (1) failed actor invocations will be retried but also that (2) completed invocations are never repeated and (3) it preserves a strict happen-before relationship across failures within call stacks. Tail calls can break complex tasks into simple steps to minimize re-execution during recovery. We review key application patterns and failure scenarios. We formalize a process calculus to precisely capture the mechanisms of fault tolerance in this model. We briefly describe our implementation. Using an application inspired by a typical enterprise scenario, we validate the functional correctness of our implementation and assess the impact of fault preparedness and recovery on performance.
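The at-most-once bookkeeping behind retry orchestration can be sketched as follows (a toy illustration with hypothetical names; the paper's model also covers call stacks and tail calls): every invocation result is recorded in a durable store, so retries re-execute only work that never completed and completed invocations are never repeated.

```python
from typing import Any, Callable, Dict, Tuple

class RetryOrchestrator:
    """Toy retry orchestration: record every completed invocation in a
    (simulated) persistent store so that retried calls are re-executed only
    if they never completed, and completed calls are never repeated."""

    def __init__(self) -> None:
        self.completed: Dict[Tuple[str, str], Any] = {}   # stand-in for a durable store

    def invoke(self, actor: str, call_id: str,
               method: Callable[..., Any], *args: Any) -> Any:
        key = (actor, call_id)
        if key in self.completed:
            return self.completed[key]        # already done: do not repeat the effect
        result = method(*args)                # may fail; a retry re-enters this function
        self.completed[key] = result          # persist the result before acknowledging
        return result
```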
Article
Real-time systems are susceptible to adversarial factors such as faults and attacks, leading to severe consequences. This paper presents an optimal checkpoint scheme to bolster fault resilience in real-time systems, addressing both logical consistency and timing correctness. First, we partition message-passing processes into a directed acyclic graph (DAG) based on their dependencies, ensuring checkpoint logical consistency. Then, we identify the DAG’s critical path, representing the longest sequential path, and analyze the optimal checkpoint strategy along this path to minimize overall execution time, including checkpointing overhead. Upon fault detection, the system rolls back to the nearest valid checkpoints for recovery. Our algorithm derives the optimal checkpoint count and intervals, and we evaluate its performance through extensive simulations and a case study. Results show a 99.97% and 67.86% reduction in execution time compared to checkpoint-free systems in simulations and the case study, respectively. Moreover, our proposed strategy outperforms prior work and baseline methods, increasing deadline achievement rates by 31.41% and 2.92% for small-scale tasks and 78.53% and 4.15% for large-scale tasks.
Chapter
Three variants of consensus problems are consensus, Byzantine agreement, and interactive consistency. These problems are equivalent: if any of the three has a solution, we can obtain a solution for the remaining ones. Knowing impossibility results makes us aware of the limitations of solutions. Reaching a consensus with non-faulty processes is difficult. Even under a minimal set of assumptions, the consensus solutions require exponential time during which processes may experience timeouts and send wrong messages to others saying that non-responding processes have failed. Failures could block commits in distributed transactions. There may be different types of faults, such as crash, omission, timeout, value, and Byzantine. This forms the basis for the three-phase commit protocol under fail-stop assumptions. Processes experiencing Byzantine faults send arbitrary messages to others. Byzantine agreement, also known as the Byzantine generals problem (BGP), has a solution under certain restrictions. The solution requires many execution rounds in a synchronous distributed system and can sustain f failures with 3f+1 processes. Reaching consensus in an asynchronous distributed system is impossible. Lamport proposed the Paxos algorithm to reach an agreement in the eventual synchrony model. Paxos is a complicated protocol to understand and requires non-trivial architectural adaptations for implementation. Raft is a simpler alternative that is as efficient as Paxos and more amenable to building practical systems.
Chapter
A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy aiming to protect against possibly lengthy recovery periods is in operation. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits. Keywords: Breakdowns, Repairs, Recovery, M/G/1 queue, Embedded Markov chains, Server vacations, Checkpoint optimization
Article
Full-text available
In coordinated DRL-accumulation (Dependable Recovery Line Accumulation) protocols for mobile distributed systems, if a single operation fails to capture its reclamation-dot (checkpoint), all the DRL-accumulation effort goes to waste, because each operation has to abort its partially-committed reclamation-dot. In order to capture its partially-committed reclamation-dot, a Mob Nod (Mobile Node) needs to transfer large reclamation-dot data to its local Mob-Spt-Stn (Mobile Support Station) over wireless channels. The DRL-accumulation effort may be exceedingly high due to repeated aborts, especially in mobile systems. We try to minimize the loss of DRL-accumulation effort when any operation fails to capture its reclamation-dot in coordination with others. In the first phase, we capture perishable reclamation-dots only. In this case, if any operation fails to capture its reclamation-dot in the first phase, all concerned operations need to abort their perishable reclamation-dots only, and not the partially-committed ones. We design a minimum-process DRL-accumulation protocol for mobile distributed systems, where no useless reclamation-dots are captured and an effort has been made to minimize the intrusion on operations. We propose to delay the processing of selective reckoning-messages, at the receiver end only, during the DRL-accumulation period. An operation is allowed to perform its normal computations and send reckoning-messages during its intrusion period. In this way, we try to keep the intrusion on operations to a bare minimum. In order to keep the intrusion time minimal, we collect the dependency vectors and compute the exact minimum set at the beginning of the protocol.
Conference Paper
Full-text available
This paper discusses the modifications made to the UNIX operating system for the VAX-11/780 to convert it from a swap-based segmented system to a paging-based virtual memory system. Of particular interest is that the host machine architecture does not include page-referenced bits. We discuss considerations in the design of page-replacement and load-control policies for such an architecture, and outline current work in modeling the policies employed by the system. We describe our experience with the chosen algorithms based on benchmark-driven studies and production system use.
Article
Full-text available
We describe a method for implementing checkpoints on a UNIX® system. The method requires no special operating system support. The checkpoints (a term we use both for the act of saving state and the result) are created in the file system name space. Availability in the name space allows facilities to duplicate and transfer files to be applied; in this case, we get replicated processes and process migration rather naturally. We describe the process migration implementation. Our process migration implementation was easily optimized to achieve an execution speed improvement of greater than 7 times over our first implementation; this was accomplished by a combination of a faster file transfer mechanism and a change in the underlying protocol. We have incorporated the mechanism into a library routine, rfork(). We conclude with a discussion of advantages, limitations and applications of our approach.
Chapter
Full-text available
Many important problems in distributed computing admit solutions that contain a phase where some global property needs to be detected. This subproblem can be seen as an instance of the Global Predicate Evaluation (GPE) problem where the objective is to establish the truth of a Boolean expression whose variables may refer to the global system state. Given the uncertainties in asynchronous distributed systems that arise from communication delays and relative speeds of computations, the formulation and solution of GPE reveal most of the subtleties in global reasoning with imperfect information. In this chapter, we use GPE as a canonical problem in order to survey concepts and mechanisms that are useful in understanding global states of distributed computations. We illustrate the utility of the developed techniques by examining distributed deadlock detection and distributed debugging as two instances of GPE.
Conference Paper
Full-text available
Recovery techniques may be distinguished on the basis of the time when the recovery lines are built: at the time of recording the recovery points, or at the time of rollback. Consequently we distinguish "planned" and "unplanned" policies for determining recovery lines. With an unplanned policy a "domino effect" can occur. The planned policy is usually intended as being static, in the sense that the recovery lines are established a priori at design time. In this paper an algorithm for "dynamic" planning of recovery lines is specified. We shall define a computational model for a distributed system of communicating processes using asynchronous message passing and shall describe the recovery algorithms by means of axioms.
Article
Full-text available
Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent software solutions. This paper describes our implementation of a software instruction counter for program debugging. We show that an instruction counter can be reasonably implemented in software, often with less than 10% execution overhead. Our experience suggests that a hardware instruction counter is not necessary for a practical implementation of watch-points and reverse execution, however it will make program instrumentation much easier for the system developer.
Article
Full-text available
The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Article
Full-text available
This paper presents a new algorithm for supporting fault-tolerant objects in distributed object-oriented systems. The fault tolerance provided by the algorithm is fully user-transparent. The algorithm uses a checkpointing and message logging scheme. However, the novelty of this scheme is in identifying the checkpointing instants such that the checkpointing time will not affect the regular response time for object requests. It also results in storing the minimum amount of object state (object address space). A simple message logging scheme that pairs the logging of the response message and the next request message reduces the message logging time by half on average compared to other similar logging schemes. The scheme exploits the general features and concepts associated with the notion of objects and object interactions to its advantage.
Article
Full-text available
In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line, which helps bound rollback propagation during recovery. Thus, it has the simplicity and low overhead of asynchronous checkpointing and the recovery-time advantages of synchronous checkpointing. No extra messages are exchanged during checkpointing, and the additional checkpointing overhead is nominal. The algorithm ensures that a recovery line consistent with the latest checkpoint of any process exists at all times. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid the domino effect, it uses selective pessimistic message logging at the receiver end. Recovery is asynchronous for a single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps for determining dependencies between checkpoints, since vector timestamps generally incur high message overhead during failure-free operation.
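A minimal sketch of the communication-induced ingredient described above (a simplification, not the paper's exact algorithm; the Process class and its methods are illustrative): each message piggybacks the sender's checkpoint index, and a receiver whose index lags behind takes a forced checkpoint before delivering the message:

```python
# Index-based communication-induced checkpointing, sketched: forced
# checkpoints keep a recovery line at each index and bound rollback
# propagation.

class Process:
    def __init__(self, name):
        self.name = name
        self.ckpt_index = 0
        self.checkpoints = []
        self.take_checkpoint()            # initial checkpoint

    def take_checkpoint(self):
        self.checkpoints.append(f"{self.name}:state@{self.ckpt_index}")

    def local_checkpoint(self):
        # basic checkpoint taken autonomously, e.g. on a timer
        self.ckpt_index += 1
        self.take_checkpoint()

    def send(self, msg):
        return (self.ckpt_index, msg)     # piggyback the checkpoint index

    def receive(self, tagged_msg):
        sender_index, msg = tagged_msg
        if sender_index > self.ckpt_index:
            # forced checkpoint before delivery, so the message is not
            # received before a checkpoint it causally depends on
            self.ckpt_index = sender_index
            self.take_checkpoint()
        return msg                        # deliver to the application

p, q = Process("P"), Process("Q")
p.local_checkpoint()                      # P is now at index 1
q.receive(p.send("m1"))                   # Q takes a forced checkpoint at index 1
print(q.checkpoints)                      # ['Q:state@0', 'Q:state@1']
```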
Article
Full-text available
Large-scale distributed systems are very attractive for the execution of parallel applications requiring huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single-node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach limits the hardware development and takes advantage of the data replication provided by a DSM in order to limit the amount of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on an Intel Paragon with 56 nodes.
Article
Full-text available
The paper describes our experience with the implementation and applications of the Unix checkpointing library libckp, and identifies two concepts that have proven to be the key to making checkpointing a powerful tool. First, including all persistent states, i.e., user files, as part of the process state that can be checkpointed and recovered provides a truly transparent and consistent rollback. Second, excluding part of the persistent state from the process state allows user programs to process future inputs from a desirable state, which leads to interesting new applications of checkpointing. We use real-life examples to demonstrate the use of libckp for bypassing premature software exits, for fast initialization and for memory rejuvenation.
Article
Typical debugging tools are insufficiently powerful to find the most difficult types of program misbehaviors. We have implemented a prototype of a new debugging system, IGOR, which provides a great deal more useful information and offers new abilities that are quite promising. The system runs fast enough to be quite useful while providing many features that are usually available only in an interpreted environment. We describe here some improved facilities (reverse execution, selective searching of execution history, substitution of data and executable parts of the programs) that are needed for serious debugging and are not found in traditional single-thread debugging tools. With a little help from the operating system, we provide these capabilities at reasonable cost without modifying the executable code and running fairly close to full speed. The prototype runs under the DUNE distributed operating system. The current system only supports debugging of single-thread programs. The paper describes planned extensions to make use of extra processors to speed the system and for applying the technique to multi-thread and time dependent executions.
Conference Paper
In this paper, we present a technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
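The checksum half of the technique can be illustrated as follows (a hedged sketch, not the authors' implementation, and it omits the reverse-computation part used for recovery; checksum_matmul and verify are made-up names): extending A with a column-checksum row and B with a row-checksum column yields a product whose checksums can be verified to detect a corrupted element:

```python
# Checksum idea behind algorithm-based fault tolerance for matrix multiply.

import numpy as np

def checksum_matmul(A, B):
    A_ext = np.vstack([A, A.sum(axis=0)])                 # column-checksum row
    B_ext = np.hstack([B, B.sum(axis=1, keepdims=True)])  # row-checksum column
    return A_ext @ B_ext

def verify(C_ext):
    C = C_ext[:-1, :-1]
    row_ok = np.allclose(C_ext[-1, :-1], C.sum(axis=0))
    col_ok = np.allclose(C_ext[:-1, -1], C.sum(axis=1))
    return row_ok and col_ok

A = np.arange(9.0).reshape(3, 3)
B = np.eye(3)
C_ext = checksum_matmul(A, B)
print(verify(C_ext))          # True: no fault injected
C_ext[1, 1] += 5.0            # simulate a corrupted result element
print(verify(C_ext))          # False: checksum mismatch reveals the fault
```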
Article
Fault-tolerance is an essential feature of distributed systems designed for mission-critical applications, e.g., on-board computers for rocket launch vehicles. In many systems, when a processor fault is detected, it is replaced by a spare. However, to continue working normally, the system must restart from a globally consistent state. Hence, the state of the system must be periodically checkpointed. In this paper, we describe a checkpointing and recovery scheme in which recovery is extremely fast because checkpointing is done continuously and no explicit rollback is involved.
Article
Many fault tolerance techniques that are implemented via software are based on the use of process checkpointing and restore primitives. This is true both for methods used in system fault tolerance and for methods used in software fault tolerance, such as Recovery Blocks, but usually system and software fault tolerance appear to require different ad hoc primitives. Moreover, the use of checkpointing primitives within components implementing different kinds of fault tolerance should be coordinated, to save space and time. In this paper we present a unified interface for checkpointing and restore primitives, which is suitable both for software and for system fault tolerance in UNIX-type systems. We provide examples of the use of such primitives, including the use in a dedicated software component (the Recovery Meta Program) which may implement various techniques for fault tolerance. Finally, we discuss the implementation of the proposed primitives, and provide a comparison with some complementary approaches.
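A purely hypothetical sketch of what such a unified interface might look like (the CheckpointManager class and its checkpoint/restore/discard methods are illustrative, not the primitives proposed in the paper): both system-level and software fault tolerance can be built on the same save/restore pair, with nesting to support constructs such as recovery blocks:

```python
# Hypothetical unified checkpoint/restore interface, sketched in Python.

import copy

class CheckpointManager:
    def __init__(self):
        self._stack = []                  # nested checkpoints, newest last

    def checkpoint(self, state, label=""):
        self._stack.append((label, copy.deepcopy(state)))
        return len(self._stack) - 1       # handle used to restore/discard

    def restore(self, handle):
        label, saved = self._stack[handle]
        del self._stack[handle + 1:]      # drop any more recent checkpoints
        return copy.deepcopy(saved)

    def discard(self, handle):
        del self._stack[handle:]          # commit: checkpoint no longer needed

mgr = CheckpointManager()
state = {"x": 1}
h = mgr.checkpoint(state, label="before risky update")
state["x"] = 99                           # risky update goes wrong...
state = mgr.restore(h)                    # ...so roll back to the saved state
print(state)                              # {'x': 1}
```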
Article
This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in "backward error recovery", i.e., restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors, are formalised and generalised so as to apply to concurrent, e.g. distributed, systems. The formalisation is based on the use of "Occurrence Graphs" to represent the cause-effect relationships that exist between the events that occur when a system is operational, and to indicate existing possibilities for state restoration. A protocol is presented which could be used in each of the nodes of a distributed computing system in order to provide system recoverability even in the face of multiple faults.
Article
A general requirement on checkpoint-based fault recovery schemes in distributed systems (DS) is maintaining a consistent DS state, i.e., the effects of all interactions of the failed process with other processes after the checkpoint must be taken into consideration. This paper proposes an approach to autonomous logging of asynchronous messages and their recovery simulation, using the communication dependencies in a transputer-based High Performance Computing System. The proposed checkpointing service aims at providing system support to several fault-tolerance methods: rollback recovery, recovery blocks, and backup recovery. The concept of asynchronous checkpointing allows us to separate the checkpointing functions from the fault-tolerance methods.
Article
A scheme for facilitating efficient backward recovery in loosely coupled networks is presented. The scheme allows for the independent and uncoordinated design of the error detection and recovery capabilities of distributed processes. It makes provision for properly coordinating such distributed processes at run time for cooperative recovery without incurring a cyclic chain of rollback propagations, called the domino effect. The operational rules of the scheme have been devised to minimize the number of recovery points used for maintaining the capability for recovery with minimum-distance rollbacks. The system design philosophy is that each process must be solely responsible for detecting the errors that it originates. An approach to making judicious exceptions (i.e., utilizing the cooperative error detection capabilities of processes without incurring a domino effect) has been devised in order to further enhance the system's robustness.
Article
The backward recovery of a computation to a previously existing state is a well-known method for attaining a degree of fault tolerance in digital systems. In this paper a protocol is developed for the purpose of providing “unplanned” recovery control in a distributed system of communicating processes. The protocol has the property of ensuring that the whole system reverts to a consistent state in the event of one or more processes initiating recovery action, and it supports the determination of recovery point safety, that is, of when a recovery point cannot possibly be rolled back to. It provides recovery control that is “unplanned” in the sense that the consistent state to which the system reverts after the initiation of recovery action is not predetermined: it is determined dynamically when recovery action is initiated, based on the recorded information flow between the processes. The protocol is first developed for a model of computation in which each process independently implements a succession of single-level, i.e. non-nested, recovery regions and where no restrictions are placed on inter-process message passing. The model is then extended to cover the case where processes may implement nested recovery regions. A development of the basic protocol which covers this case is presented and is shown to be significantly more complicated.
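The "unplanned" determination of a consistent state can be sketched as a fixed-point computation over the recorded message flow (a simplification, not the paper's protocol; recovery_line and its arguments are illustrative): rolling one process back past the send of a message orphans that message, so its receiver must roll back past the receipt, and so on until no orphan remains:

```python
# Sketch: determine a consistent recovery line dynamically from recorded
# messages by propagating rollbacks until a fixed point is reached.

def recovery_line(current, initiator_target, messages):
    """current: {process: current interval index}
    initiator_target: (process, interval) the initiator rolls back to
    messages: list of (sender, send_interval, receiver, recv_interval)"""
    line = dict(current)
    proc, interval = initiator_target
    line[proc] = interval
    changed = True
    while changed:                        # propagate until consistent
        changed = False
        for snd, s_iv, rcv, r_iv in messages:
            # message sent after the sender's recovery point but received
            # before the receiver's => orphan: receiver rolls back further
            if s_iv > line[snd] and r_iv <= line[rcv]:
                line[rcv] = r_iv - 1
                changed = True
    return line

# P rolls back to interval 1; its message sent in interval 2 forces Q back too.
print(recovery_line({"P": 3, "Q": 3}, ("P", 1),
                    [("P", 2, "Q", 2), ("Q", 3, "P", 3)]))
# {'P': 1, 'Q': 1}
```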
Article
Fault tolerance is an issue of high importance to distributed systems, a fact that is well recognized in the ISO/ITU Reference Model of ODP by the inclusion of failure transparency. The Persistent Object Group Service (POGS) described in this article keeps track of the state of a distributed application as far as global checkpoint consistency is concerned. Application objects take checkpoints of their own in an uncoordinated fashion, using the POGS to detect global state inconsistencies. As a consequence of consulting POGS, objects take additional checkpoints that would not have occurred otherwise but which are necessary to ensure global state consistency. The advantage of the POGS approach lies in the fact that global checkpoint consistency control is separated from the objects that actually do the checkpointing. This is a necessary step toward integrating fault tolerance mechanisms at a late stage of the software development process. A prototype of the POGS has been implemented using CORBA as a standard distributed systems technology.
Article
To avoid having to restart a job from the beginning in case of a random failure, it is standard practice to periodically save sufficient information to enable the job to be restarted from the most recent point at which such information was saved. Such points are referred to as checkpoints, and the saving of such information at these points is called checkpointing [1].
Article
A mathematical model of a transaction-oriented system under intermittent failures is proposed. The system is assumed to operate with a checkpointing and rollback/recovery method to ensure reliable information processing. The model is used to derive the principal performance measures, including availability, response time, and the system saturation point.
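As an illustration of the kind of trade-off such models capture, the following is a simple first-order approximation (not the paper's model; T, C, λ and R are assumed symbols for the checkpoint interval, checkpoint cost, failure rate and mean recovery time):

```latex
% Not the paper's model: a first-order estimate of availability under
% periodic checkpointing with interval T, checkpoint cost C, failure
% rate \lambda, and mean recovery (rollback + replay) time R.
\[
  A \;\approx\; \frac{T}{T + C}\,\Bigl(1 - \lambda\Bigl(R + \frac{T}{2}\Bigr)\Bigr)
\]
% Maximizing A over T gives, to first order, the familiar optimum interval
\[
  T_{\mathrm{opt}} \;\approx\; \sqrt{2C/\lambda}.
\]
```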