Article

A Survey of Rollback-Recovery Protocols in Message-Passing Systems

Authors: E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson

Abstract

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
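To make the distinction concrete, the sketch below (illustrative Python only; names such as `Determinant` and `PessimisticLogger` are hypothetical and not from the survey) shows the kind of tuple a log-based protocol records for each nondeterministic delivery event, and how a pessimistic logger can replay those determinants, in order, to bring a recovering process past its last checkpoint.

```python
from dataclasses import dataclass, field
from typing import Any, List

# A determinant captures what is needed to replay one nondeterministic event
# (here, a message delivery): which message was delivered to which process,
# and in which position of the receiver's delivery order.
@dataclass(frozen=True)
class Determinant:
    dest: str          # receiving process
    source: str        # sending process
    send_seq: int      # sender's sequence number for the message
    recv_seq: int      # position of this delivery at the receiver

@dataclass
class PessimisticLogger:
    """Logs each determinant to (simulated) stable storage before the receive
    is allowed to influence the computation, i.e., pessimistic logging."""
    stable_log: List[Determinant] = field(default_factory=list)

    def deliver(self, det: Determinant, payload: Any) -> Any:
        self.stable_log.append(det)   # stand-in for a synchronous write to stable storage
        return payload                # only now may the application consume the message

    def replay_after(self, checkpoint_recv_seq: int, dest: str) -> List[Determinant]:
        """During recovery, re-deliver messages in their original order,
        starting from the receiver's last checkpointed state."""
        return sorted(
            (d for d in self.stable_log
             if d.dest == dest and d.recv_seq > checkpoint_recv_seq),
            key=lambda d: d.recv_seq,
        )
```

A purely checkpoint-based protocol keeps only the checkpoints and must roll back to the most recent consistent set of them, whereas a logger like the one above can deterministically re-execute past the checkpoint.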


... The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. In this context, a popular approach is based on checkpointing and rollback recovery (see, e.g., the survey by Elnozahy et al. [4]). Checkpointing requires each process to take a snapshot of its state at specific points in time. ...
... The reduction of check creates a checkpoint, which turns on the reversible mode of a process as a side-effect (assuming it was not already on). As in [21,23], reversibility is achieved by defining an appropriate Landauer embedding [17], i.e., by adding a history of the computation to each process configuration. A checkpoint is propagated to other processes when a causally dependent action is performed (i.e., spawn and send); following the terminology of [4], these checkpoints are called forced checkpoints. A call of the form commit(τ) removes τ from the list of active checkpoints of a process, turning the reversible mode off when the list of active checkpoints is empty. ...
... There is abundant literature on checkpoint-based rollback recovery to improve fault tolerance (see, e.g., the survey by Elnozahy et al. [4] and references therein). In contrast to most of them, our distinctive features are the extension of the underlying language with explicit operators for rollback recovery, the automatic generation of forced checkpoints (somewhat similarly to communication-induced checkpointing [32]), and the use of a reversible semantics. ...
Preprint
Full-text available
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy is able to bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on some explicit checkpointing primitives and the use of a (partially) reversible semantics for rolling back the system.
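A minimal, single-process sketch of the bookkeeping behind such explicit checkpointing primitives might look as follows (plain Python rather than the authors' message-passing language; the class and method names are hypothetical): taking a checkpoint stores a snapshot and switches on reversible mode, commit discards it, and rollback restores it.

```python
import copy
from typing import Any, Dict, List

class Process:
    """Toy model of a process with explicit checkpoint/commit/rollback
    operators, loosely mirroring the primitives described above."""

    def __init__(self, state: Dict[str, Any]):
        self.state = state
        self.snapshots: Dict[str, Dict[str, Any]] = {}   # active checkpoints, by tag
        self.history: List[Dict[str, Any]] = []          # recorded only in reversible mode

    @property
    def reversible(self) -> bool:
        return bool(self.snapshots)          # reversible mode is on while checkpoints are active

    def checkpoint(self, tag: str) -> None:
        self.snapshots[tag] = copy.deepcopy(self.state)

    def step(self, update: Dict[str, Any]) -> None:
        if self.reversible:
            self.history.append(copy.deepcopy(self.state))   # keep a history for rolling back
        self.state.update(update)

    def commit(self, tag: str) -> None:
        self.snapshots.pop(tag, None)
        if not self.reversible:              # last active checkpoint gone: drop the history
            self.history.clear()

    def rollback(self, tag: str) -> None:
        self.state = self.snapshots.pop(tag)  # restore the checkpointed state
        self.history.clear()
```

The actual strategy also propagates forced checkpoints to causally dependent processes on spawn and send, which this single-process toy omits.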
... Despite a broad body of work regarding checkpoint-based techniques, there are just a few comprehensive reports about the advances in the field. In particular, the surveys focus on specific areas in the distributed systems, such as global checkpoints in message-passing systems [Elnozahy et al. 2002], or checkpoints in HPC [Egwutuoha et al. 2013]. This paper sheds some light on advances in checkpoint-based research over the last 50 years. ...
... In contrast to coordinated checkpointing, uncoordinated checkpointing allows processes to take checkpoints independently [Mostefaoui and Raynal 1996, Elnozahy et al. 2002, Mendizabal et al. 2014]. Each process captures its state without the knowledge of others, enabling them to take the checkpoint when it is more convenient and resume normal operation immediately after checkpoint completion. ...
... Many research works have investigated the side effects and challenges of developing consistent checkpoint-based recovery approaches for general-purpose distributed systems, including the comprehensive survey of checkpoint/recovery protocols in message-passing systems presented by [Elnozahy et al. 2002]. The paper defines the system model of message passing, in which different processes exchange messages with each other and the outside world (defined as something not controlled by the current system). ...
Conference Paper
This paper concisely reviews checkpointing techniques in distributed systems, focusing on various aspects such as coordinated and uncoordinated checkpointing, incremental checkpoints, fuzzy checkpoints, adaptive checkpoint intervals, and kernel-based and user-space checkpoints. The review highlights interesting points, outlines how each checkpoint approach works, and discusses their advantages and drawbacks. It also provides a brief overview of the adoption of checkpoints in different contexts in distributed computing, including Database Management Systems (DBMS), State Machine Replication (SMR), and High-Performance Computing (HPC) environments. Additionally, the paper briefly explores the application of checkpointing strategies in modern cloud and container environments, discussing their role in live migration and application state management. The review offers valuable insights into their adoption and application across various distributed computing contexts by summarizing the historical development, advances, and challenges in checkpointing techniques.
... To counteract the negative effects of such misfortunes, low-overhead fault-tolerance techniques are essentially required for the system [4]. For this purpose, the rollback recovery technique is one of the appropriate tools and is classified into two kinds: checkpoint-based and message-logging-based [5]. First, the checkpoint-based technique achieves fault tolerance by letting each process periodically record its local state, called a checkpoint, on stable storage [6]. ...
... In this paper, a novel SBML protocol is presented to solve this important problem with the following features. First, the proposed protocol permits no rollback of live processes during recovery, even in the case of concurrent process failures, called the always no rollback property [5], by the symmetric distribution of redundant determinants of each message. Second, it can still preserve the first feature even in a network environment enabling a composite of point-to-point and group communication. ...
... However, it is assumed that the channels are immune to partitioning. In addition, we assume that processes may fail based on the crash-failure model, in which they lose contents in their volatile memories and stop their executions [5]. This system is augmented with an unreliable failure detector [20] in order to solve the impossibility problem on distributed consensus. ...
Article
Full-text available
Most of the existing sender-based message logging protocols cannot commonly handle simultaneous failures because, if both the sender and the receiver(s) of each message fail together, the receiver(s) cannot obtain the recovery information of the message. This unfortunate situation may happen due to their asymmetric logging behavior. This paper presents a novel sender-based message logging protocol for broadcast network based distributed systems to overcome the critical constraint of the previous ones with the following three features. First, when more than one process crashes at the same time, the protocol enables the system to ensure the always no rollback property by symmetrically replicating the recovery information at each process or group member connected on a network. Second, it can make the first feature persist even if the general form of communication for the system is a combination of point-to-point and group ones. Third, the communication overhead resulting from the replication can be greatly reduced by making full use of the capability of the standard broadcast network in both communication modes. Experimental outcomes verify that, no matter which communication patterns are applied, it can reduce the total application execution time by about 4.23∼9.96% compared with the latest protocol enabling the traditional ones to cope with simultaneous failures.
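The following toy sketch (hypothetical names; the actual protocol is considerably richer) illustrates the core idea of sender-based logging and why symmetrically replicating each message's recovery information at other nodes protects against a sender and receiver failing together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LogEntry:
    payload: bytes
    recv_seq: int = -1                   # delivery order at the receiver, once acknowledged

@dataclass
class Node:
    name: str
    send_log: Dict[Tuple[str, int], LogEntry] = field(default_factory=dict)
    mirror_log: Dict[Tuple[str, int], LogEntry] = field(default_factory=dict)
    next_seq: int = 0

    def send(self, dest: "Node", payload: bytes, peers: List["Node"]) -> None:
        key = (dest.name, self.next_seq)
        self.next_seq += 1
        self.send_log[key] = LogEntry(payload)       # classic sender-based logging
        for p in peers:                              # symmetric replication: peers keep a copy,
            p.mirror_log[key] = LogEntry(payload)    # surviving a sender+receiver crash

    def record_order(self, key: Tuple[str, int], recv_seq: int, peers: List["Node"]) -> None:
        """The receiver's delivery order (the determinant) is replicated the same way."""
        self.send_log[key].recv_seq = recv_seq
        for p in peers:
            p.mirror_log[key].recv_seq = recv_seq
```

In a broadcast network the replication to peers comes almost for free, which is the observation the third feature above exploits.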
... The delay in the message transfer is finite and unpredictable. These are the characteristics of the well-known "asynchronous distributed systems" [6]. When computation is extensive, the possibility of process failures may be on the rise. ...
... The present work discards such a message by adopting a technique at the receiving end, whereas in another approach [7] a process refrains from sending during the interval between the receipt of the CP initiation message and the completion of committing that CP. Distributed systems that use the recovery block approach [6,10] and have a common time base may estimate a time by which the participating processes would take the acceptance test. These estimated instants form the pseudo point times as described in [14]. ...
... The performance (in Mega Flops) of DCP is (48% and 30%), (54.5% and 50.34%) and (56.19% and 58.59%) higher during compression with the CCP and CICP protocols when 1, 4 and 8 processors are in action. Table 1 summarizes the comparison of the different checkpointing techniques discussed in the survey papers [6,10]. The following notations are used to compare the present work: ...
Article
The checkpointing mechanism is one of the most attractive approaches for providing software fault tolerance in distributed message-passing systems. This paper aims to implement a distributed checkpointing technique that eliminates the drawbacks of the centralized approach, such as the “domino effect”, “useless checkpoints” (checkpoints that do not contribute to global consistency), and “hidden and zigzag” dependencies. The proposed checkpointing protocol has a checkpoint initiator, but coordination among the local checkpoints is done in a distributed fashion. The guarantee that no message is lost when a failure occurs is maintained in this work by exchanging information among the processes. There is no central checkpoint initiator; instead, each of the processes takes turns acting as an initiator. Processes take local checkpoints only after being notified by the initiator. The processes synchronize their activities for the current checkpointing interval before finally committing their checkpoints. Thus, the checkpointing pattern described in this paper takes only those checkpoints that contribute to a consistent global snapshot, thereby eliminating useless checkpoints.
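A rough sketch of two-phase, rotating-initiator coordination of this kind is shown below (illustrative Python; the names are hypothetical and the real protocol also exchanges dependency information among processes): tentative checkpoints are taken first and committed only if every participant succeeded.

```python
from typing import Dict, List, Optional

class Participant:
    def __init__(self, name: str):
        self.name = name
        self.state: Dict[str, object] = {}
        self.stable: Dict[str, object] = {}                 # last committed checkpoint
        self.tentative: Optional[Dict[str, object]] = None  # checkpoint of the current round

    def take_tentative(self) -> bool:
        self.tentative = dict(self.state)    # in practice this can fail and return False
        return True

    def commit_tentative(self) -> None:
        self.stable, self.tentative = self.tentative or self.stable, None

    def discard_tentative(self) -> None:
        self.tentative = None

class RotatingCoordinator:
    """One round of two-phase coordinated checkpointing with a rotating initiator:
    tentative checkpoints first, committed only if every participant succeeded."""

    def __init__(self, processes: List[Participant]):
        self.processes = processes
        self.turn = 0

    def run_round(self) -> bool:
        initiator = self.processes[self.turn % len(self.processes)]  # role merely rotates in this toy
        self.turn += 1
        if not all(p.take_tentative() for p in self.processes):      # phase 1: tentative checkpoints
            for p in self.processes:
                p.discard_tentative()       # abort the round: no useless checkpoints are kept
            return False
        for p in self.processes:            # phase 2: make the checkpoints permanent
            p.commit_tentative()
        return True
```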
... This redundancy allows the system to recover from failures by accessing the data from alternate locations [22]. Regular system state backups, known as checkpointing, involve periodically saving the state of the system to persistent storage [23]. Checkpointing enables the system to recover from failures by rolling back to a previously saved consistent state [24]. ...
... Comparison of fault tolerance strategies [17]–[27] ...
Research
Full-text available
Resilient design patterns play a crucial role in developing fault-oblivious stateful workflow systems in distributed computing. This article explores advanced techniques and strategies for building resilient distributed systems that can gracefully handle failures and maintain operational continuity. It delves into fault tolerance strategies, such as data redundancy, checkpointing, and transactional consistency, to ensure system reliability and data integrity. The article discusses the benefits of microservices architecture in achieving fault isolation and minimizing the impact of failures. It highlights the importance of self-healing mechanisms, including automated fault detection and correction, to ensure continuous operation. Scalability and load balancing techniques, such as dynamic resource adjustment and workload distribution, are explored to accommodate fluctuating demands and optimize system performance. The article also examines error handling and recovery mechanisms, including automated rollbacks and distributed consensus protocols, to maintain data consistency and coordinate recovery actions across nodes. Additionally, it emphasizes the significance of proactive system health monitoring and rapid fault identification and resolution in minimizing downtime and ensuring a smooth user experience. The article concludes by discussing emerging trends, open research problems, and providing recommendations for building resilient distributed systems that can withstand the challenges of modern computing environments.
... There has been much work in the field of distributed checkpointing and rollback recovery. A sound survey of checkpointing protocols is given by (Elnozahy M et al., 1996) [1], though it lacks scalability properties for future and expanded supercomputers. ...
Research
Full-text available
Networks of computers and other systems range from simple to complex. When a system is referred to as complex and stochastic, the challenges of availability, dependability, stability and reliability become serious indicators of effective performance. Fault tolerance plays a crucial role towards achieving dependability, reliability and stability, and is a fundamental requirement for the design and development of effective and efficient fault-tolerance mechanisms. It is also important that the power, weight, space and cost constraints of systems are addressed by efficiently using the available resources for fault tolerance. In life-critical mission systems, reliability is paramount; hence this paper investigates a fault-tolerance mechanism using checkpointing in distributed systems.
... There are several techniques to make an HPC system fault-tolerant (Egwutuoha et al., 2013; Herault and Robert, 2015). Those techniques include, among others, rollback recovery (Elnozahy et al., 2002), replication (Bougeret et al., 2014), computation migration (Filiposka et al., 2019), and algorithm-based fault tolerance (ABFT). ABFT relies on properties of the parallel algorithm itself to tolerate faults during its execution (Huang and Abraham, 1984; Chen and Jack, 2008; Hursey and Graham, 2011; Bagherpour et al., 2017). ...
... Rollback recovery is perhaps the most widely adopted technique to improve the reliability of HPC systems (Egwutuoha et al., 2013). This technique consists of establishing checkpoints to which the system can roll back in case of failures, instead of restarting from the very beginning (Elnozahy et al., 2002). It is a challenge to apply rollback recovery to HPC systems that take a very long time to complete their execution and have a low MTBF. ...
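When rollback recovery is applied to such long-running jobs, the checkpoint interval is typically chosen from the checkpoint cost and the MTBF; Young's classic first-order approximation is a common rule of thumb (the snippet below is a generic illustration, not taken from the cited works).

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval:
    T_opt ~ sqrt(2 * C * MTBF), with C the time to write one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: a 5-minute checkpoint cost and a 24-hour MTBF give roughly a 2-hour interval.
print(young_interval(300.0, 24 * 3600.0) / 3600.0, "hours between checkpoints")
```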
... II. Background. Rollback recovery techniques [28] have been studied in the context of a wide variety of applications, ranging from programming language constructs [29] to recovery protocols for distributed systems [28]. At a high level, such techniques described in prior work can be divided into checkpoint-based and log-based approaches. ...
... But detecting faults for dependent tasks with multiple workflows is really challenging. Several solutions can be found in the literature for classic distributed system research [152,[161][162][163]. However, considering the uniqueness of MCC, incorporating them straightforwardly into MCC would not be effective. ...
... Over the years, several approaches have been proposed to mitigate faults in distributed systems [164]. One of the most common approaches is redundancy or replication [163]. If the same task is submitted to multiple nodes, it is highly improbable that all the instances would be faulty. ...
Article
Full-text available
Owing to the enormous advancement in miniature hardware, modern smart mobile devices (SMDs) have become computationally powerful. Mobile crowd computing (MCC) is the computing paradigm that uses public-owned SMDs to garner affordable high-performance computing (HPC). Though several empirical works have established the feasibility of mobile-based computing for various applications, there is a lack of comprehensive coverage of MCC. This paper aims to explore the fundamentals and other nitty–gritty of the idea of MCC in a comprehensive manner. Starting with an explicit definition of MCC, the enabling backdrops and the detailed architectural layouts of different models of MCC are presented, along with categorising different types of MCC based on infrastructure and application demands. MCC is compared extensively with other HPC systems (e.g. desktop grid, cloud, clusters and supercomputers) and similar mobile computing systems (e.g. mobile grid, mobile cloud, ad hoc mobile cloud, and mobile crowdsourcing). MCC being a complex system, various design requirements and considerations are extensively analysed. The potential benefits of MCC are meticulously mentioned, with special discussions on the ubiquity and sustainability of MCC. The issues and challenges of MCC are critically presented in light of further research scopes. Several real-world applications of MCC are identified and propositioned. Finally, to carry forward the accomplishment of the MCC vision, the future prospects are briefly elucidated.
... The various overheads involved with each method for improving multiple-fault tolerance need to be reduced along with improving the algorithm. Additionally, addressing these complicated factors requires analyzing the method and identifying the critical factors responsible for low performance, so that the multiple-fault tolerance capability is enhanced along with the performance [6]. ...
... A. MPICH: MPICH [6] (also called MPICH1), developed at Argonne National Laboratory, is a freely available, portable implementation of MPI. It implements the MPI-1 standard. ...
Article
Present communication societies rely on High-Performance Computing (HPC) systems for balancing the messaging layers. However, HPC systems are vulnerable to different types of software and hardware failures, which require extra effort to resume the operation of such systems. Because of these vulnerabilities, the layers of message passing in the communication framework become unbalanced. Hence, there is a need for fault-tolerance methods in such HPC systems. Different solutions have been proposed for efficient fault tolerance in HPC systems, but they suffer from various limitations. The checkpointing/restart technique is the most commonly and frequently studied one. The goal of this research is therefore to present an efficient new checkpointing/restart method of fault tolerance in HPC systems for MPI (Message Passing Interface) applications in order to balance the messaging layers. Checkpointing is the most important function of fault-tolerant systems, but the additional overhead, in added program time and in extra storage, has been the main limitation of this fault-tolerance method. Hence, we develop a new, enhanced checkpoint/restart technique for MPI applications. The checkpointing method is supported by an efficient and trusted distributed storage system, with checkpoints ensuring the availability of data at the time of a hardware failure.
... A high replication frequency reduces the work that has to be redone on the backup when that replica becomes active [Elnozahy et al. 2002], providing fast failover. The influence of replication on the performance of the hosted applications is not negligible; previous studies [da Silva et al. 2014, Gerofi and Ishikawa 2011] show that the higher the checkpointing frequency, the lower the latency in the communication with the client interacting with the protected VM. ...
... Remus allows clients to observe consistent system behavior in the event of faults. This consistency, called linearizability, specifies that only what has already been saved to stable storage [Guerraoui and Schiper 1997], in this case the disk or memory of the backup host [Elnozahy et al. 2002], is processed on the primary. Remus considers only outgoing network traffic as the means of guaranteeing linearizability. ...
Conference Paper
Remus is a virtual machine (VM) replication mechanism that provides high availability in the presence of crash faults. Replication is performed through checkpointing at a fixed, predetermined time interval. However, there is an antagonism between processing and communication with respect to the ideal interval between checkpoints: while larger intervals benefit processing-intensive applications, smaller intervals favor applications whose performance is dominated by the network. Therefore, the interval used is not always suited to the resource-usage characteristics of the application running in the VM, limiting the applicability of Remus in certain scenarios. This work presents an adaptive checkpointing proposal for Remus that dynamically adjusts the replication frequency according to the characteristics of the running applications. The results indicate that the proposal achieves better performance for applications that use both processing and communication resources, without penalizing applications that use only one of these types of resources.
... Typically, these tools create incremental checkpoints by identifying and storing dirty memory pages, then piece back together the process image (across multiple files) for restoration [26]. The main limitations of these tools are (1) large incremental checkpoint file sizes resulting from coarse page-level deltas [36], (2) inability to checkpoint across multiple processes [27], and (3) they can only restore a process from scratch: while we found a patent [88] and paper [40] addressing these limitations, enabling OS-level checkpointing for multiprocessing jobs and sub-memory-page granularity incremental checkpointing, respectively, we have been unable to locate working implementations. In comparison, for notebook states, Kishu is able to achieve significantly lower checkpoint overheads via finer Co-variable granularity deltas, checkpoint multiprocessing and off-CPU notebooks via application-level instructions, and fast incremental restore with minimal data loading via state difference detection and leveraging existing data in the process/kernel. ...
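The page-granularity incremental checkpointing mentioned above can be sketched roughly as follows (a simplified illustration, not the implementation of any of the cited tools): only pages whose contents changed since the previous checkpoint are written, and restoration re-applies the deltas in order.

```python
import hashlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4096

def pages(image: bytes) -> Dict[int, bytes]:
    """Split a process image into fixed-size pages keyed by offset."""
    return {off: image[off:off + PAGE_SIZE] for off in range(0, len(image), PAGE_SIZE)}

def incremental_checkpoint(image: bytes,
                           prev_digests: Dict[int, str]) -> Tuple[Dict[int, bytes], Dict[int, str]]:
    """Keep only the pages that changed since the previous checkpoint
    (a coarse, page-granularity delta, as discussed above)."""
    delta: Dict[int, bytes] = {}
    digests: Dict[int, str] = {}
    for off, page in pages(image).items():
        digests[off] = hashlib.sha256(page).hexdigest()
        if prev_digests.get(off) != digests[off]:
            delta[off] = page                    # dirty page: include it in this checkpoint
    return delta, digests

def restore(base: bytearray, deltas: List[Dict[int, bytes]]) -> bytes:
    """Rebuild the process image by re-applying each incremental delta in order."""
    for delta in deltas:
        for off, page in delta.items():
            base[off:off + len(page)] = page
    return bytes(base)
```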
Preprint
Full-text available
Computational notebooks (e.g., Jupyter, Google Colab) are widely used by data scientists. A key feature of notebooks is the interactive computing model of iteratively executing cells (i.e., a set of statements) and observing the result (e.g., model or plot). Unfortunately, existing notebook systems do not offer time-traveling to past states: when the user executes a cell, the notebook session state consisting of user-defined variables can be irreversibly modified - e.g., the user cannot 'un-drop' a dataframe column. This is because, unlike DBMS, existing notebook systems do not keep track of the session state. Existing techniques for checkpointing and restoring session states, such as OS-level memory snapshot or application-level session dump, are insufficient: checkpointing can incur prohibitive storage costs and may fail, while restoration can only be inefficiently performed from scratch by fully loading checkpoint files. In this paper, we introduce a new notebook system, Kishu, that offers time-traveling to and from arbitrary notebook states using an efficient and fault-tolerant incremental checkpoint and checkout mechanism. Kishu creates incremental checkpoints that are small and correctly preserve complex inter-variable dependencies at a novel Co-variable granularity. Then, to return to a previous state, Kishu accurately identifies the state difference between the current and target states to perform incremental checkout at sub-second latency with minimal data loading. Kishu is compatible with 146 object classes from popular data science libraries (e.g., Ray, Spark, PyTorch), and reduces checkpoint size and checkout time by up to 4.55x and 9.02x, respectively, on a variety of notebooks.
... Second, incarnation numbers are key elements used in several failure recovery schemes and distributed algorithms. Under various names (epoch numbers, incarnation numbers) they are required in several rollback-recovery schemes surveyed in [EAWJ02], including optimistic recovery schemes [SY85,DTG99] and causal logging schemes [EZ92]. They are also a key ingredient in several distributed algorithms such as, e.g., algorithms for scalable distributed failure detection [GCG01], membership management [DGM02], and diskless crash-recovery [MPSS17]. ...
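The role incarnation numbers play in these schemes can be illustrated with a small sketch (hypothetical names): each recovery bumps the incarnation, and peers discard messages from older incarnations so that a recovered process is distinguishable from its past selves.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class ProcessId:
    node: str
    incarnation: int      # bumped on every recovery from a crash

class Peer:
    """Track the highest incarnation seen per node and discard messages
    from older, dead incarnations (illustrative only)."""

    def __init__(self) -> None:
        self.latest: Dict[str, int] = {}

    def accept(self, sender: ProcessId) -> bool:
        known = self.latest.get(sender.node, -1)
        if sender.incarnation < known:
            return False                          # stale message from a past incarnation
        self.latest[sender.node] = sender.incarnation
        return True
```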
Preprint
Full-text available
Distributed systems can be subject to various kinds of partial failures, therefore building fault-tolerance or failure mitigation mechanisms for distributed systems remains an important domain of research. In this paper, we present a calculus to formally model distributed systems subject to crash failures with recovery. The recovery model considered in the paper is weak, in the sense that it makes no assumption on the exact state in which a failed node resumes its execution, only its identity has to be distinguishable from past incarnations of itself. Our calculus is inspired in part by the Erlang programming language and in part by the distributed $\pi$-calculus with nodes and link failures (D$\pi$F) introduced by Francalanza and Hennessy. In order to reason about distributed systems with failures and recovery we develop a behavioral theory for our calculus, in the form of a contextual equivalence, and of a fully abstract coinductive characterization of this equivalence by means of a labelled transition system semantics and its associated weak bisimilarity. This result is valuable for it provides a compositional proof technique for proving or disproving contextual equivalence between systems.
... The VSM component is responsible for the recovery of stateful VNFs and is based on checkpoint/restore (Elnozahy et al., 2002). Since VNFs run on virtual devices, which can be either virtual machines or containers, capturing the VNF state without modifying the VNF source code is perfectly feasible and represents a very attractive option. ...
... Note that the recovery addressed in this work is different from the recovery concept in computer systems. The latter is to recover computing tasks, e.g., variables' values, and thus is limited to the cyber part [44]–[49]. In contrast, this paper focuses on recovering the state of the physical system or the physical state, e.g., the speed of a vehicle. ...
... An orphan message is a message which is received but never sent between a pair of local checkpoints of two different processes during recovery [20,21]. Definition 3. The consistency of a global checkpoint is ensured if and only if there exists no orphan message between every pair of local checkpoints belonging to the global checkpoint [22]. ...
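Under these definitions, the consistency of a global checkpoint can be checked directly from the sets of messages recorded as sent and received before each local checkpoint, as in the small sketch below (illustrative only).

```python
from typing import Set, Tuple

Message = Tuple[str, str, int]    # (sender, receiver, message id)

def has_orphan(sent_before_ckpt: Set[Message], received_before_ckpt: Set[Message]) -> bool:
    """A global checkpoint is inconsistent iff some message is recorded as received
    by a local checkpoint but not as sent by the sender's checkpoint (an orphan)."""
    return bool(received_before_ckpt - sent_before_ckpt)

# Message ("p1", "p2", 1) was received before p2's checkpoint but sent after p1's
# checkpoint, so this pair of local checkpoints is inconsistent.
sent = {("p1", "p2", 0)}
received = {("p1", "p2", 0), ("p1", "p2", 1)}
assert has_orphan(sent, received)
```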
Article
Full-text available
The existing communication-induced checkpointing protocols may not scale well due to their slow acquisition of the most recent timestamps of the next checkpoints of other processes. Accurate situation awareness with diversified information conveyance paths is needed to reduce the number of unnecessary forced checkpoints taken to as few as possible. In this paper, a scalable communication-induced checkpointing protocol is proposed to considerably cut down the possibility of performing unnecessary forced checkpointing by exploiting the beneficial features of reliable communication channels. The protocol enables the sender of an application message to swiftly attain the most recent timestamp-related information of the next checkpoint of its receiver and accelerate the spread of the information to others, with little overhead. This behavioral feature may significantly elevate the accuracy of the awareness of the situations in which forced checkpointing is actually needed for useless-checkpoint-free recovery. In addition, it generates no extra control message and no message logging overhead while significantly lessening the latency of message sending. Moreover, the protocol can always be operated under the non-deterministic execution model. The evaluation results indicate that the proposed protocol outperforms the existing ones, with forced checkpointing overheads reduced by 12.5% to 84.2% and total execution times reduced by 2.5% to 11.5%.
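The general flavor of index-based communication-induced checkpointing, of which the above is a refinement, can be sketched as follows (a simplified classic-style rule, not the specific protocol of this paper): every message piggybacks the sender's checkpoint index, and the receiver takes a forced checkpoint before delivering a message that carries a larger index.

```python
from typing import Any, List, Tuple

class CicProcess:
    """Index-based communication-induced checkpointing, classic style:
    piggyback the checkpoint index on every message and take a forced
    checkpoint before delivering a message with a larger index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.index = 0                                   # index of the current checkpoint interval
        self.checkpoints: List[Tuple[int, str]] = []

    def take_basic_checkpoint(self) -> None:
        self.index += 1
        self.checkpoints.append((self.index, "basic"))

    def send(self, payload: Any) -> Tuple[int, Any]:
        return (self.index, payload)                     # piggybacked index, no extra control messages

    def receive(self, message: Tuple[int, Any]) -> Any:
        sender_index, payload = message
        if sender_index > self.index:
            self.index = sender_index                    # forced checkpoint taken before delivery
            self.checkpoints.append((self.index, "forced"))
        return payload
```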
... (CCP) is a 2-tuple (Ĥ, CĤ), where Ĥ is a partially ordered set modeling a distributed computation and CĤ is a set of local checkpoints involved in Ĥ [2]. ...
... Another common approach to imposing fault tolerance in distributed systems is rollback and recovery. The system periodically maintains checkpoints which save the system state at particular instances [476]. If the system crashes, it returns to the most recent checkpoint and restarts. ...
Thesis
Full-text available
The effects of environmental pollution and global warming have become a reality and severe. In addition to other causes, wide adoption and huge demands for computational resources have aggravated it significantly. The production process of the computing devices involves hazardous and toxic substances which not only harm human and other living beings’ health but also contaminate the water and soil. The production and operations of these computers on a large scale also result in massive energy consumption and greenhouse gas generation. Moreover, the low use cycle of these devices produces a huge amount of not easy-to-decompose e-waste. In this outlook, instead of buying new devices, it is advisable to use the existing resources to their fullest, which will minimize the environmental penalties of production and e-waste. In this study, we advocate for mobile crowd computing (MCC) to ease off the use of centralized high-performance computers (HPCs) such as data centres and supercomputers by utilising SMDs (smart mobile devices) as computing devices. We envision establishing MCC as the most feasible computing system solution for sustainable computing. Successful real-world implementation of MCC involves several challenges. In this study, we primarily focus on resource management. We devised a methodological and structured approach for resource profiling. We present the resource selection problem as an MCDM (multi-criteria decision making) problem and compared five MCDM approaches of dissimilar families to find the appropriate methodology for dynamic resource selection in MCC. To improve the overall performance of MCC, we present two scheduling algorithms, considering different objectives such as makespan, resource utilisation, load balance, and energy efficiency. We propose a deep learning based resource availability prediction to minimise the job offloading in a local MCC. We further propose a mobility-aware resource provisioning scheme for a P2P MCC. Finally, we present a proof-of-concept of the proposed MCC model as a working prototype. For this, we consider a smart HVAC (heating, ventilation, and cooling) system in a multistorey building and use MCC as a local edge computing infrastructure. The successful implementation of the prototype proves the feasibility and potential of MCC as alternative sustainable computing.
... The use of process recovery to provide fault tolerance in distributed systems is already a well-known subject. Several protocols have been proposed in the literature [Elnozahy et al., 2002] for checkpointing (saving the information that may be used later) and process recovery in distributed environments, but little is known about implementations of these proposals. The implementation of recovery algorithms in distributed systems, especially those in the asynchronous category, has been a target of research, since open issues still remain. ...
Conference Paper
Rollback recovery based on recovery points is widely used as a fault-tolerance technique. The complex model of distributed systems has motivated the development of several algorithms in search of simpler and more efficient solutions. In the Fault Tolerance Group at UFRGS, an algorithm was recently proposed that targets applications in asynchronous message-passing distributed systems, operates with coordinated saving of recovery points, and provides for the handling of orphan and lost messages. This article describes the design decisions, the implementation of the algorithm, and the results obtained so far.
... The VSM is the component responsible for the recovery of stateful VNFs. The VSM is based on checkpoint/restore [4]. As VNFs run on virtual devices, which can be either virtual machines or containers, saving the network function instance is a feasible and attractive strategy to capture the VNF state without requiring any modification of the VNF source code. ...
Article
Full-text available
Developing embedded software applications is a challenging task, chiefly due to the limitations that are imposed by the hardware devices or platforms on which they operate, as well as due to the heterogeneous non-functional requirements that they need to exhibit. Modern embedded systems need to be energy efficient and dependable, whereas their maintenance costs should be minimized, in order to ensure the success and longevity of their application. Being able to build embedded software that satisfies the imposed hardware limitations, while maintaining high quality with respect to critical non-functional requirements is a difficult task that requires proper assistance. To this end, in the present paper, we present the SDK4ED Platform, which facilitates the development of embedded software that exhibits high quality with respect to important quality attributes, with a main focus on energy consumption, dependability, and maintainability. This is achieved through the provision of state-of-the-art and novel quality attribute-specific monitoring and optimization mechanisms, as well as through a novel fuzzy multi-criteria decision-making mechanism for facilitating the selection of code refactorings, which is based on trade-off analysis among the three main attributes of choice. Novel forecasting techniques are also proposed to further support decision making during the development of embedded software. The usefulness, practicality, and industrial relevance of the SDK4ED platform were evaluated in a real-world setting, through three use cases on actual commercial embedded software applications stemming from the airborne, automotive, and healthcare domains, as well as through an industrial study. To the best of our knowledge, this is the first quality analysis platform that focuses on multiple quality criteria, which also takes into account their trade-offs to facilitate code refactoring selection.
Article
A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy is in operation, aiming to protect against possibly lengthy recovery periods by backing up the current state at periodic checkpoints. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits.
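The trade-off such a model captures can also be illustrated numerically with a crude Monte Carlo sketch (not the exact queueing analysis of the paper): longer intervals waste more work per failure, while shorter intervals pay more checkpointing overhead.

```python
import random

def expected_completion_time(work: float, interval: float, ckpt_cost: float,
                             mtbf: float, recovery: float, runs: int = 5000) -> float:
    """Monte Carlo estimate of the mean time to finish `work` seconds of computation,
    checkpointing every `interval` seconds of useful work, with exponentially
    distributed failures (rate 1/mtbf), a fixed checkpoint cost and a fixed
    recovery time. Illustrative only."""
    total = 0.0
    for _ in range(runs):
        clock, done = 0.0, 0.0
        while done < work:
            useful = min(interval, work - done)
            segment = useful + ckpt_cost
            time_to_failure = random.expovariate(1.0 / mtbf)
            if time_to_failure < segment:
                clock += time_to_failure + recovery   # lose this segment, roll back and recover
            else:
                clock += segment                      # segment and its checkpoint completed
                done += useful
        total += clock
    return total / runs

# Sweep a few intervals (seconds) to see where checkpointing pays off.
for t in (60, 300, 900, 3600):
    print(t, round(expected_completion_time(8 * 3600, t, 30, 4 * 3600, 120)))
```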
Article
Full-text available
Fault tolerance is crucial in ensuring smooth working of distributed and cloud computing. It is challenging to implement because of the constantly changing infrastructure and complex configurations in distributed and cloud computing. Implementation of various fault tolerance methods require domain‐specific knowledge as well as in‐depth understanding of the existing techniques and approaches. Recent surveys on fault tolerance in cloud and distributed environments exist, but they have limitations. This article systematically reviews fault tolerance approaches in distributed and cloud computing and discusses their taxonomy. Based on the taxonomy provided, fault‐tolerance approaches are divided into four types, that is, reactive approaches, proactive approaches, adaptive approaches, and hybrid approaches. Reactive approaches provide a preventive measure after the occurrence of faults in the system. Proactive approaches prevent the system or minimize failure effects by predicting in advance. The adaptive approaches predict, learn, and adapt the changes to deal with new faults in the system. The hybrid approaches combine reactive, proactive, and adaptive approaches. The objective of this article is to give a better understanding of handling faults using suitable approaches and further compare them on various parameters. The paper also presents a promising research direction based on the challenges and issues in multiple approaches.
Chapter
Serverless computing promises to significantly simplify cloud computing by providing Functions-as-a-Service where invocations of functions, triggered by events, are automatically scheduled for execution on compute nodes. Notably, the serverless computing model does not require the manual provisioning of virtual machines; instead, FaaS enables load-based billing and auto-scaling according to the workload, reducing costs and making scheduling more efficient. While early serverless programming models only supported stateless functions and severely restricted program composition, recently proposed systems offer greater flexibility by adopting ideas from actor and dataflow programming. This paper presents a survey of actor-like programming abstractions for stateful serverless computing, and provides a characterization of their properties and highlights their origin.
Chapter
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy is able to bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on some explicit checkpointing operators and the use of a (partially) reversible semantics for rolling back the system.
Article
Full-text available
Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between the first (’00–’10) and second (’11–’23) generation of stream processing systems, and discuss future trends and open problems.
Article
Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as "second-class citizens" when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput. We present Rhythm , a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service's tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
Article
A multi‐tenant microservice architecture involving components with asynchronous interactions and batch jobs requires efficient strategies for managing asynchronous workloads. This article addresses this issue in the context of a leading company developing tax software solutions for many national and multi‐national corporations in Brazil. A critical process provided by the company's cloud‐based solutions encompasses tax integration, which includes coordinating complex tax calculation tasks and needs to be supported by asynchronous operations using a message broker to guarantee order correctness. We explored and implemented two approaches for managing asynchronous workloads related to tax integration within a multi‐tenant microservice architecture in the company's context: (i) a polling‐based approach that employs a queue as a distributed lock (DL) and (ii) a push‐based approach named single active consumer (SAC) that relies on the message broker's logic to deliver messages. These approaches aim to achieve efficient resource allocation when dealing with a growing number of container replicas and tenants. In this article, we evaluate the correctness and performance of the DL and SAC approaches to shed light on how asynchronous workloads impact the management of multi‐tenant microservice architectures from delivery and deployment perspectives.
Conference Paper
Full-text available
Condition monitoring of machinery is becoming more and more popular, particularly in process plants, where a sudden breakdown may prove costly or even fatal. The uncertainties faced by rotating machinery, such as misalignment, looseness, bearing defects and electrical faults, are corrected by familiar and ordinary maintenance procedures. Replacing faulty bearings, gears, drive belts, couplings and other machine components is also a rather straightforward process. However, correcting unbalance requires some special knowledge and understanding. An unbalanced rotor always causes more vibration, generates excessive forces on the bearing areas and reduces the life of the machine. In this work a vector balancing method is presented, which minimizes the number of trial runs by eliminating all guesswork. In the case of non-stationary signals, the use of spectrum analysis based on the Fourier transform has some limitations. Hence, the vibration signal analysis is evaluated using the continuous wavelet transform method, one of the most important tools for signal processing, particularly for non-stationary signals.
Chapter
To react to unforeseen circumstances or amend abnormal situations in communication-centric systems, programmers are in charge of “undoing” the interactions which led to an undesired state. To assist this task, session-based languages can be endowed with reversibility mechanisms. In this paper we propose a language enriched with programming facilities to commit session interactions, to roll back the computation to a previous commit point, and to abort the session. Rollbacks in our language always bring the system to previous visited states and a rollback cannot bring the system back to a point prior to the last commit. Programmers are relieved from the burden of ensuring that a rollback never restores a checkpoint imposed by a session participant different from the rollback requester. Such undesired situations are prevented at design-time (statically) by relying on a decidable compliance check at the type level, implemented in MAUDE. We show that the language satisfies error-freedom and progress of a session.
Article
Cloud developers have to build applications that are resilient to failures and interruptions. We advocate for a fault-tolerant programming model for the cloud based on actors, retry orchestration, and tail calls. This model builds upon persistent data stores and message queues readily available on the cloud. Retry orchestration not only guarantees that (1) failed actor invocations will be retried but also that (2) completed invocations are never repeated and (3) it preserves a strict happen-before relationship across failures within call stacks. Tail calls can break complex tasks into simple steps to minimize re-execution during recovery. We review key application patterns and failure scenarios. We formalize a process calculus to precisely capture the mechanisms of fault tolerance in this model. We briefly describe our implementation. Using an application inspired by a typical enterprise scenario, we validate the functional correctness of our implementation and assess the impact of fault preparedness and recovery on performance.
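The at-most-once bookkeeping behind retry orchestration can be sketched as follows (a toy illustration with hypothetical names; the paper's model also covers call stacks and tail calls): every invocation result is recorded in a durable store, so retries re-execute only work that never completed and completed invocations are never repeated.

```python
from typing import Any, Callable, Dict, Tuple

class RetryOrchestrator:
    """Toy retry orchestration: record every completed invocation in a
    (simulated) persistent store so that retried calls are re-executed only
    if they never completed, and completed calls are never repeated."""

    def __init__(self) -> None:
        self.completed: Dict[Tuple[str, str], Any] = {}   # stand-in for a durable store

    def invoke(self, actor: str, call_id: str,
               method: Callable[..., Any], *args: Any) -> Any:
        key = (actor, call_id)
        if key in self.completed:
            return self.completed[key]        # already done: do not repeat the effect
        result = method(*args)                # may fail; a retry re-enters this function
        self.completed[key] = result          # persist the result before acknowledging
        return result
```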
Article
Real-time systems are susceptible to adversarial factors such as faults and attacks, leading to severe consequences. This paper presents an optimal checkpoint scheme to bolster fault resilience in real-time systems, addressing both logical consistency and timing correctness. First, we partition message-passing processes into a directed acyclic graph (DAG) based on their dependencies, ensuring checkpoint logical consistency. Then, we identify the DAG’s critical path, representing the longest sequential path, and analyze the optimal checkpoint strategy along this path to minimize overall execution time, including checkpointing overhead. Upon fault detection, the system rolls back to the nearest valid checkpoints for recovery. Our algorithm derives the optimal checkpoint count and intervals, and we evaluate its performance through extensive simulations and a case study. Results show a 99.97% and 67.86% reduction in execution time compared to checkpoint-free systems in simulations and the case study, respectively. Moreover, our proposed strategy outperforms prior work and baseline methods, increasing deadline achievement rates by 31.41% and 2.92% for small-scale tasks and 78.53% and 4.15% for large-scale tasks.
Chapter
Three variants of consensus problems are consensus, Byzantine agreement, and interactive consistency. These problems are equivalent: if any of the three has a solution, we can obtain a solution for the remaining ones. Knowing impossibility results makes us aware of the limitations of solutions. Reaching a consensus with non-faulty processes is difficult. Even under a minimal set of assumptions, the consensus solutions require exponential time during which processes may experience timeouts and send wrong messages to others saying that non-responding processes have failed. Failures could block commits in distributed transactions. There may be different types of faults, such as crash, omission, timeout, value, and Byzantine. This forms the basis for the three-phase commit protocol under fail-stop assumptions. Processes experiencing Byzantine faults send arbitrary messages to others. Byzantine agreement, also known as the Byzantine generals problem (BGP), has a solution under certain restrictions. The solution requires many execution rounds in a synchronous distributed system and can sustain f failures with 3f+1 processes. Reaching consensus in an asynchronous distributed system is impossible. Lamport proposed the Paxos algorithm to reach an agreement in the eventual synchrony model. Paxos is a complicated protocol to understand and requires non-trivial architectural adaptations for implementation. Raft is a simpler alternative that is as efficient as Paxos and more amenable to building practical systems.
Chapter
A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy aiming to protect against possibly lengthy recovery periods is in operation. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits. Keywords: Breakdowns, Repairs, Recovery, M/G/1 queue, Embedded Markov chains, Server vacations, Checkpoint optimization
Article
Full-text available
In coordinated DRL-accumulation (Dependable Recovery Line Accumulation) protocols for mobile distributed systems, if a single operation fails to capture its reclamation-dot (checkpoint), all the DRL-accumulation effort goes to waste, because each operation has to abort its partially-committed reclamation-dot. In order to capture its partially-committed reclamation-dot, a Mob Nod (Mobile Node) needs to transfer large reclamation-dot data to its local Mob-Spt-Stn (Mobile Support Station) over wireless channels. The DRL-accumulation effort may be exceedingly high due to repeated aborts, especially in mobile systems. We try to minimize the loss of DRL-accumulation effort when any operation fails to capture its reclamation-dot in coordination with others. In the first phase, we capture perishable reclamation-dots only. In this case, if any operation fails to capture its reclamation-dot in the first phase, all concerned operations need to abort their perishable reclamation-dots only, and not the partially-committed ones. We design a minimum-process DRL-accumulation protocol for mobile distributed systems, where no useless reclamation-dots are captured and an effort has been made to minimize the intrusion on operations. We propose to delay the processing of selective reckoning-messages, at the receiver end only, during the DRL-accumulation period. An operation is allowed to perform its normal computations and send reckoning-messages during its intrusion period. In this way, we try to keep the intrusion on operations to a bare minimum. In order to keep the intrusion time minimal, we collect the dependency vectors and compute the exact minimum set at the beginning of the protocol.
Conference Paper
Full-text available
This paper discusses the modifications made to the UNIX operating system for the VAX-11/780 to convert it from a swap-based segmented system to a paging-based virtual memory system. Of particular interest is that the host machine architecture does not include page-referenced bits. We discuss considerations in the design of page-replacement and load-control policies for such an architecture, and outline current work in modeling the policies employed by the system. We describe our experience with the chosen algorithms based on benchmark-driven studies and production system use.
Article
Full-text available
We describe a method for implementing checkpoints on a UNIX® system. The method requires no special operating system support. The checkpoints (a term we use both for the act of saving state and the result) are created in the file system name space. Availability in the name space allows facilities to duplicate and transfer files to be applied; in this case, we get replicated processes and process migration rather naturally. We describe the process migration implementation. Our process migration implementation was easily optimized to achieve an execution speed improvement of greater than 7 times over our first implementation; this was accomplished by a combination of a faster file transfer mechanism and a change in the underlying protocol. We have incorporated the mechanism into a library routine, rfork(). We conclude with a discussion of advantages, limitations and applications of our approach.
Chapter
Full-text available
Many important problems in distributed computing admit solutions that contain a phase where some global property needs to be detected. This subproblem can be seen as an instance of the Global Predicate Evaluation (GPE) problem where the objective is to establish the truth of a Boolean expression whose variables may refer to the global system state. Given the uncertainties in asynchronous distributed systems that arise from communication delays and relative speeds of computations, the formulation and solution of GPE reveal most of the subtleties in global reasoning with imperfect information. In this chapter, we use GPE as a canonical problem in order to survey concepts and mechanisms that are useful in understanding global states of distributed computations. We illustrate the utility of the developed techniques by examining distributed deadlock detection and distributed debugging as two instances of GPE.
Conference Paper
Full-text available
Recovery techniques may be distinguished on the basis of the time when the recovery lines are built: at the time of recording the recovery points, or at the time of rollback. Consequently we distinguish "planned" and "unplanned" policies for determining recovery lines. With an unplanned policy a "domino effect" can occur. The planned policy is usually intended as being static, in the sense that the recovery lines are established a priori at design time. In this paper an algorithm for "dynamic" planning of recovery lines is specified. We shall define a computational model for a distributed system of communicating processes using asynchronous message passing and shall describe the recovery algorithms by means of axioms.
Article
Full-text available
Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent software solutions. This paper describes our implementation of a software instruction counter for program debugging. We show that an instruction counter can be reasonably implemented in software, often with less than 10% execution overhead. Our experience suggests that a hardware instruction counter is not necessary for a practical implementation of watch-points and reverse execution, however it will make program instrumentation much easier for the system developer.
Article
Full-text available
The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Article
Full-text available
This paper presents a new algorithm for supporting fault-tolerant objects in distributed object-oriented systems. The fault tolerance provided by the algorithm is fully user-transparent. The algorithm uses a checkpointing and message logging scheme. However, the novelty of this scheme is in identifying the checkpointing instants such that the checkpointing time will not affect the regular response time for object requests. It also results in storing the minimum amount of object state (object address space). A simple message logging scheme that pairs the logging of the response message and the next request message reduces the message logging time by half on average compared to other similar logging schemes. The scheme exploits the general features and concepts associated with the notion of objects and object interactions to its advantage.
Article
Full-text available
In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line, which helps bound rollback propagation during recovery. Thus, it has the simplicity and low overhead of asynchronous checkpointing and the recovery-time advantages of synchronous checkpointing. No extra messages are exchanged during checkpointing, and the additional checkpointing overhead is nominal. The algorithm ensures that a recovery line consistent with the latest checkpoint of any process exists at all times. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid the domino effect, it uses selective pessimistic message logging at the receiver end. Recovery is asynchronous for a single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps for determining dependencies between checkpoints, since vector timestamps generally incur high message overhead during failure-free operation.
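A minimal sketch of the communication-induced ingredient described above (a simplification, not the paper's exact algorithm; the Process class and its methods are illustrative): each message piggybacks the sender's checkpoint index, and a receiver whose index lags behind takes a forced checkpoint before delivering the message:

```python
# Index-based communication-induced checkpointing, sketched: forced
# checkpoints keep a recovery line at each index and bound rollback
# propagation.

class Process:
    def __init__(self, name):
        self.name = name
        self.ckpt_index = 0
        self.checkpoints = []
        self.take_checkpoint()            # initial checkpoint

    def take_checkpoint(self):
        self.checkpoints.append(f"{self.name}:state@{self.ckpt_index}")

    def local_checkpoint(self):
        # basic checkpoint taken autonomously, e.g. on a timer
        self.ckpt_index += 1
        self.take_checkpoint()

    def send(self, msg):
        return (self.ckpt_index, msg)     # piggyback the checkpoint index

    def receive(self, tagged_msg):
        sender_index, msg = tagged_msg
        if sender_index > self.ckpt_index:
            # forced checkpoint before delivery, so the message is not
            # received before a checkpoint it causally depends on
            self.ckpt_index = sender_index
            self.take_checkpoint()
        return msg                        # deliver to the application

p, q = Process("P"), Process("Q")
p.local_checkpoint()                      # P is now at index 1
q.receive(p.send("m1"))                   # Q takes a forced checkpoint at index 1
print(q.checkpoints)                      # ['Q:state@0', 'Q:state@1']
```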
Article
Full-text available
Large-scale distributed systems are very attractive for the execution of parallel applications requiring huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single-node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach limits the hardware development and takes advantage of the data replication provided by a DSM in order to limit the amount of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on an Intel Paragon with 56 nodes.
Article
Full-text available
The paper describes our experience with the implementation and applications of the Unix checkpointing library libckp, and identifies two concepts that have proven to be the key to making checkpointing a powerful tool. First, including all persistent states, i.e., user files, as part of the process state that can be checkpointed and recovered provides a truly transparent and consistent rollback. Second, excluding part of the persistent state from the process state allows user programs to process future inputs from a desirable state, which leads to interesting new applications of checkpointing. We use real-life examples to demonstrate the use of libckp for bypassing premature software exits, for fast initialization and for memory rejuvenation.
Article
Typical debugging tools are insufficiently powerful to find the most difficult types of program misbehaviors. We have implemented a prototype of a new debugging system, IGOR, which provides a great deal more useful information and offers new abilities that are quite promising. The system runs fast enough to be quite useful while providing many features that are usually available only in an interpreted environment. We describe here some improved facilities (reverse execution, selective searching of execution history, substitution of data and executable parts of the programs) that are needed for serious debugging and are not found in traditional single-thread debugging tools. With a little help from the operating system, we provide these capabilities at reasonable cost without modifying the executable code and running fairly close to full speed. The prototype runs under the DUNE distributed operating system. The current system only supports debugging of single-thread programs. The paper describes planned extensions to make use of extra processors to speed the system and for applying the technique to multi-thread and time dependent executions.
Conference Paper
In this paper, we present a technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
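The checksum half of the technique can be illustrated as follows (a hedged sketch, not the authors' implementation, and it omits the reverse-computation part used for recovery; checksum_matmul and verify are made-up names): extending A with a column-checksum row and B with a row-checksum column yields a product whose checksums can be verified to detect a corrupted element:

```python
# Checksum idea behind algorithm-based fault tolerance for matrix multiply.

import numpy as np

def checksum_matmul(A, B):
    A_ext = np.vstack([A, A.sum(axis=0)])                 # column-checksum row
    B_ext = np.hstack([B, B.sum(axis=1, keepdims=True)])  # row-checksum column
    return A_ext @ B_ext

def verify(C_ext):
    C = C_ext[:-1, :-1]
    row_ok = np.allclose(C_ext[-1, :-1], C.sum(axis=0))
    col_ok = np.allclose(C_ext[:-1, -1], C.sum(axis=1))
    return row_ok and col_ok

A = np.arange(9.0).reshape(3, 3)
B = np.eye(3)
C_ext = checksum_matmul(A, B)
print(verify(C_ext))          # True: no fault injected
C_ext[1, 1] += 5.0            # simulate a corrupted result element
print(verify(C_ext))          # False: checksum mismatch reveals the fault
```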
Article
Fault-tolerance is an essential feature of distributed systems designed for mission-critical applications, e.g., on-board computers for rocket launch vehicles. In many systems, when a processor fault is detected, it is replaced by a spare. However, to continue working normally, the system must restart from a globally consistent state. Hence, the state of the system must be periodically checkpointed. In this paper, we describe a checkpointing and recovery scheme in which recovery is extremely fast because checkpointing is done continuously and no explicit rollback is involved.
Article
Many fault tolerance techniques that are implemented via software are based on the use of process checkpointing and restore primitives. This is true both for methods used in system fault tolerance and for methods used in software fault tolerance, such as Recovery Blocks, but usually system and software fault tolerance appear to require different ad hoc primitives. Moreover, the use of checkpointing primitives within components implementing different kinds of fault tolerance should be coordinated, to save space and time. In this paper we present a unified interface for checkpointing and restore primitives, which is suitable both for software and for system fault tolerance in UNIX-type systems. We provide examples of the use of such primitives, including the use in a dedicated software component (the Recovery Meta Program) which may implement various techniques for fault tolerance. Finally, we discuss the implementation of the proposed primitives, and provide a comparison with some complementary approaches.
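A purely hypothetical sketch of what such a unified interface might look like (the CheckpointManager class and its checkpoint/restore/discard methods are illustrative, not the primitives proposed in the paper): both system-level and software fault tolerance can be built on the same save/restore pair, with nesting to support constructs such as recovery blocks:

```python
# Hypothetical unified checkpoint/restore interface, sketched in Python.

import copy

class CheckpointManager:
    def __init__(self):
        self._stack = []                  # nested checkpoints, newest last

    def checkpoint(self, state, label=""):
        self._stack.append((label, copy.deepcopy(state)))
        return len(self._stack) - 1       # handle used to restore/discard

    def restore(self, handle):
        label, saved = self._stack[handle]
        del self._stack[handle + 1:]      # drop any more recent checkpoints
        return copy.deepcopy(saved)

    def discard(self, handle):
        del self._stack[handle:]          # commit: checkpoint no longer needed

mgr = CheckpointManager()
state = {"x": 1}
h = mgr.checkpoint(state, label="before risky update")
state["x"] = 99                           # risky update goes wrong...
state = mgr.restore(h)                    # ...so roll back to the saved state
print(state)                              # {'x': 1}
```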
Article
This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in "backward error recovery", i.e., restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors, are formalised and generalised so as to apply to concurrent, e.g. distributed, systems. The formalisation is based on the use of "Occurrence Graphs" to represent the cause-effect relationships that exist between the events that occur when a system is operational, and to indicate existing possibilities for state restoration. A protocol is presented which could be used in each of the nodes of a distributed computing system in order to provide system recoverability even in the face of multiple faults.
Article
A general requirement on checkpoint-based fault recovery schemes in distributed systems (DS) is maintaining a consistent DS state, i.e., the effects of all interactions of the failed process with other processes after the checkpoint must be taken into consideration. This paper proposes an approach to autonomous logging of asynchronous messages and their recovery simulation, using the communication dependencies in a transputer-based High Performance Computing System. The proposed checkpointing service aims at providing system support to several fault-tolerance methods: rollback recovery, recovery blocks, and backup recovery. The concept of asynchronous checkpointing allows us to separate the checkpointing functions from the fault-tolerance methods.
Article
A scheme for facilitating efficient backward recovery in loosely coupled networks is presented. The scheme allows for the independent and uncoordinated design of the error detection and recovery capabilities of distributed processes. It makes provision for properly coordinating such distributed processes at run time for cooperative recovery without incurring a cyclic chain of rollback propagations, called the domino effect. The operational rules of the scheme have been devised to minimize the number of recovery points used for maintaining the capability for recovery with minimum-distance rollbacks. The system design philosophy is that each process must be solely responsible for detecting the errors that it originates. An approach to making judicious exceptions (i.e., utilizing the cooperative error detection capabilities of processes without incurring a domino effect) has been devised in order to further enhance the system's robustness.
Article
The backward recovery of a computation to a previously existing state is a well-known method for attaining a degree of fault tolerance in digital systems. In this paper a protocol is developed for the purpose of providing “unplanned” recovery control in a distributed system of communicating processes. The protocol has the property of ensuring that the whole system reverts to a consistent state in the event of one or more processes initiating recovery action, and it supports the determination of recovery point safety, that is, of when a recovery point cannot possibly be rolled back to. It provides recovery control that is “unplanned” in the sense that the consistent state to which the system reverts after the initiation of recovery action is not predetermined: it is determined dynamically when recovery action is initiated, based on the recorded information flow between the processes. The protocol is first developed for a model of computation in which each process independently implements a succession of single-level, i.e. non-nested, recovery regions and where no restrictions are placed on inter-process message passing. The model is then extended to cover the case where processes may implement nested recovery regions. A development of the basic protocol which covers this case is presented and is shown to be significantly more complicated.
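The "unplanned" determination of a consistent state can be sketched as a fixed-point computation over the recorded message flow (a simplification, not the paper's protocol; recovery_line and its arguments are illustrative): rolling one process back past the send of a message orphans that message, so its receiver must roll back past the receipt, and so on until no orphan remains:

```python
# Sketch: determine a consistent recovery line dynamically from recorded
# messages by propagating rollbacks until a fixed point is reached.

def recovery_line(current, initiator_target, messages):
    """current: {process: current interval index}
    initiator_target: (process, interval) the initiator rolls back to
    messages: list of (sender, send_interval, receiver, recv_interval)"""
    line = dict(current)
    proc, interval = initiator_target
    line[proc] = interval
    changed = True
    while changed:                        # propagate until consistent
        changed = False
        for snd, s_iv, rcv, r_iv in messages:
            # message sent after the sender's recovery point but received
            # before the receiver's => orphan: receiver rolls back further
            if s_iv > line[snd] and r_iv <= line[rcv]:
                line[rcv] = r_iv - 1
                changed = True
    return line

# P rolls back to interval 1; its message sent in interval 2 forces Q back too.
print(recovery_line({"P": 3, "Q": 3}, ("P", 1),
                    [("P", 2, "Q", 2), ("Q", 3, "P", 3)]))
# {'P': 1, 'Q': 1}
```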
Article
Fault tolerance is an issue of high importance to distributed systems, a fact that is well recognized in the ISO/ITU Reference Model of ODP by the inclusion of failure transparency. The Persistent Object Group Service (POGS) described in this article keeps track of the state of a distributed application as far as global checkpoint consistency is concerned. Application objects take checkpoints of their own in an uncoordinated fashion, using the POGS to detect global state inconsistencies. As a consequence of consulting POGS, objects take additional checkpoints that would not have occurred otherwise but which are necessary to ensure global state consistency. The advantage of the POGS approach lies in the fact that global checkpoint consistency control is separated from the objects that actually do the checkpointing. This is a necessary step toward integrating fault tolerance mechanisms at a late stage of the software development process. A prototype of the POGS has been implemented using CORBA as a standard distributed systems technology.
Article
To avoid having to restart a job from the beginning in case of a random failure, it is standard practice to periodically save sufficient information to enable the job to be restarted from the most recent point at which such information was saved. Such points are referred to as checkpoints, and the saving of such information at these points is called checkpointing [1].
Article
A mathematical model of a transaction-oriented system under intermittent failures is proposed. The system is assumed to operate with a checkpointing and rollback/recovery method to ensure reliable information processing. The model is used to derive the principal performance measures, including availability, response time, and the system saturation point.
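As an illustration of the kind of trade-off such models capture, the following is a simple first-order approximation (not the paper's model; T, C, λ and R are assumed symbols for the checkpoint interval, checkpoint cost, failure rate and mean recovery time):

```latex
% Not the paper's model: a first-order estimate of availability under
% periodic checkpointing with interval T, checkpoint cost C, failure
% rate \lambda, and mean recovery (rollback + replay) time R.
\[
  A \;\approx\; \frac{T}{T + C}\,\Bigl(1 - \lambda\Bigl(R + \frac{T}{2}\Bigr)\Bigr)
\]
% Maximizing A over T gives, to first order, the familiar optimum interval
\[
  T_{\mathrm{opt}} \;\approx\; \sqrt{2C/\lambda}.
\]
```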