Figure 1. Component failure model.

Source publication
Conference Paper
Full-text available
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component coun...

Context in source publication

Context 1
... of nominal MTBF, typically there is a phase in component lifetime where failure rates can be higher. Generally, the failure rate model for components follows a bathtub curve, as shown in Figure 1. As can be seen, there are three different phases: component burn-in, normal aging, and late failure. Failures are frequent during the burn-in and late-failure phases due to defects in the components and component aging, respectively. For the sake of discussion, the MTBF of individual components is defined as MTBF = 1/λ, where the parameter λ is called the failure rate of the component. In a system where the failure of a single component results in the failure of the entire system, the MTBF of the system is the reciprocal of the sum of the component failure rates: MTBF_system = 1/(λ_1 + λ_2 + ... + λ_N). Figure 2 shows the expected system MTBF obtained from this model for three different component reliability levels: MTBFs of 10^4, 10^5, and 10^6 hours. As can be seen, increasing the number of components in the system dramatically decreases the MTBF of the entire system. For projected Pflop systems the MTBF is only a few hours even when all components are of very high reliability (MTBF of 10^6 hours). In particular, for a system with 100,000 nodes the MTBF would be 10 hours. The BlueGene/L system with 65,536 nodes is expected to have an MTBF of less than 24 hours. In reality these numbers are optimistic for several reasons: MTBF numbers for individual components assume ideal environmental conditions, components will not generally have uniformly high MTBFs, and failure of some components can damage others, reducing their MTBFs. ...
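The scaling described in this excerpt is easy to reproduce numerically. The following minimal Python sketch (the function name and the node counts are ours, chosen for illustration, not taken from the paper) evaluates the serial-reliability model in which the system failure rate is the sum of N identical component failure rates:

def system_mtbf(component_mtbf_hours, num_components):
    """MTBF of a system that fails whenever any single component fails."""
    component_rate = 1.0 / component_mtbf_hours      # failure rate (lambda) of one component
    system_rate = num_components * component_rate    # per-component failure rates add up
    return 1.0 / system_rate

for mtbf in (1e4, 1e5, 1e6):                         # the three reliability levels of Figure 2
    for nodes in (1_000, 10_000, 100_000):           # node counts chosen for illustration
        print(f"component MTBF {mtbf:9.0f} h, {nodes:7d} nodes -> "
              f"system MTBF {system_mtbf(mtbf, nodes):10.2f} h")

For 100,000 nodes with a component MTBF of 10^6 hours this reproduces the 10-hour system MTBF quoted above.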

Citations

... Users can select desired operating systems, software stacks, and computing environments for their applications, making the VC an attractive platform for high-performance computing (HPC) applications [1]. However, as cloud providers scale out hardware resources to meet user demands, the number of failures increases [2]-[4], making fault tolerance mechanisms essential for large-scale cloud computing environments. Checkpoint-restart [5] is a widely used technique for mitigating failures by periodically saving the state of processes to storage and restarting the computation from the saved state when a failure occurs. ...
Article
Full-text available
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of virtual machines (VMs) offer an attractive fault tolerance capability for cloud data centers. However, existing mechanisms have suffered from high checkpoint downtimes and overheads. This paper introduces Mekha, a novel hypervisor-level, in-memory coordinated checkpoint-restart mechanism for VCs that leverages precopy live migration. During a VC checkpoint event, Mekha creates a shadow VM for each VM and employs a novel memory-bound timed-multiplex data (MTD) transfer mechanism to replicate the state of each VM to its corresponding shadow VM. We also propose a global ending condition that enables the checkpoint coordinator to control the termination of the MTD algorithm for every VM in a VC, thereby reducing overall checkpoint latency. Furthermore, the checkpoint protocols of Mekha are designed based on barrier synchronizations and virtual time, ensuring the global consistency of checkpoints and utilizing existing data retransmission capabilities to handle message loss. We conducted several experiments to evaluate Mekha using a message passing interface (MPI) application from the NASA advanced supercomputing (NAS) parallel benchmark. The results demonstrate that Mekha significantly reduces checkpoint downtime compared to traditional checkpoint mechanisms. Consequently, Mekha effectively decreases checkpoint overheads while offering efficiency and practicality, making it a viable solution for cloud computing environments.
... The development of Exascale stream processing systems [11,18,34], to cope with an ever-increasing big data demand, presents many challenges, one of which is fault tolerance: as the degree of parallelism in a stream processing system increases, the mean time to failure (MTTF) of the stream processing system as a whole decreases [25,28]. For instance, the MTTF of Exascale systems is anticipated to be in minutes [31,4], which raises the need for efficient fault tolerance approaches. ...
... For example, in Exascale systems multiple failures are expected every day [37,2] and the MTTF is anticipated to be in minutes [31,4]. For systems such as Flink, where a failure of a single node results in restarting the whole application from the previous checkpoint, the failure rate of the system is the sum of the per-node rates, λ_1 + λ_2 + ... + λ_n, where λ_i is the failure rate of node i [28]. Fig. 13 shows how the failure rate changes with the number of nodes in the system, assuming a failure rate of 0.0022 per hour for every node. ...
Preprint
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and is evidently more efficient than replication as state size grows. However current systems use a nominal value for the checkpoint interval, indicative of assuming roughly 1 failure every 19 days, that does not take into account the salient aspects of the checkpoint process, nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization -- the fraction of total time available for the system to do useful work -- that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing it to be dependent only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of our model through experiments with Apache Flink, where we obtain improvements in system utilization for every case, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
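As a rough numerical companion to the two excerpts above, the sketch below first sums identical per-node failure rates, using the 0.0022 failures/hour quoted in the citation context (roughly one failure per node every 19 days, matching the abstract), and then applies the classical Young/Daly approximation for the checkpoint interval, which, like the preprint's result, depends only on checkpoint cost and failure rate. The node counts and the 60-second checkpoint cost are assumed illustrative values, and the formula is the textbook approximation, not the preprint's own expression:

import math

NODE_RATE = 0.0022       # failures per hour per node, as quoted above (about 1 failure per 19 days)
CHECKPOINT_COST = 60.0   # seconds per checkpoint -- an assumed, illustrative value

def young_daly_interval(checkpoint_cost_s, failure_rate_per_s):
    """Classical first-order approximation of the optimal interval: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)

for nodes in (64, 256, 1024, 4096):              # illustrative system sizes
    system_rate = nodes * NODE_RATE              # per-node failure rates add up (failures/hour)
    mttf_hours = 1.0 / system_rate
    interval_s = young_daly_interval(CHECKPOINT_COST, system_rate / 3600.0)
    print(f"{nodes:5d} nodes: MTTF {mttf_hours:6.2f} h, checkpoint every ~{interval_s / 60:5.1f} min")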
... In the case of a cluster with multiple nodes, a failure in any component of a node can cause the whole program to fail. Moreover, the failure rate grows linearly with the number of nodes when we assume that all of the nodes operate under identical conditions [24], [25]. When considering the effect of temperature, the MTBF of a system, M, is defined as: ...
Conference Paper
Full-text available
Checkpointing with a constant checkpoint interval, a so-called constant checkpointing method, is commonly used in the HPC field and has been proven to be the optimal solution for failures whose inter-arrival times are exponentially distributed. On the other hand, previous works have shown that there is a high correlation between processor temperature and failure rate. By analyzing the results of temperature monitoring on a parallel application, we noticed that the failure rate changes dynamically and the failure inter-arrival times do not follow an exponential distribution. Under such a scenario, the constant checkpointing method is not the optimal solution, and thus a checkpointing method with an adaptive checkpoint interval, called an adaptive checkpointing method, is required to achieve high performance. However, to use the adaptive method, the processor temperature must be constantly monitored in order to decide the timing for checkpointing. In this paper, we propose an adaptive checkpointing method with less reliance on temperature monitoring. Our proposed method uses the timings of failures that have already occurred, called the prior failures, to estimate the mean time to failure (MTTF) of the next failure, called the posterior failure. The timing of the posterior failure is predicted based on the characteristics of a truncated Weibull distribution. The simulation results show that the proposed method can reduce the total wasted time compared to the constant checkpointing method with a considerably small temperature monitoring period.
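The estimator itself is not spelled out in this abstract, so the following is only a hedged sketch of the underlying idea: under a Weibull failure model with shape parameter below one, the expected time to the next ("posterior") failure grows with the time already elapsed since the last ("prior") failure, which is what makes an adaptive interval attractive. The scale and shape values below are assumptions for illustration, not the paper's fitted parameters:

import math

def weibull_survival(t, eta, k):
    """P(T > t) for a Weibull distribution with scale eta and shape k."""
    return math.exp(-((t / eta) ** k))

def expected_time_to_next_failure(elapsed, eta, k, steps=20000):
    """E[T - elapsed | T > elapsed], by midpoint integration of the survival function."""
    horizon = elapsed + 50.0 * eta                   # far enough out that S() is negligible
    h = (horizon - elapsed) / steps
    area = sum(weibull_survival(elapsed + (i + 0.5) * h, eta, k) * h for i in range(steps))
    return area / weibull_survival(elapsed, eta, k)

eta, k = 24.0, 0.7                                   # assumed scale (hours) and shape < 1
for elapsed in (0.0, 12.0, 48.0):                    # hours since the last observed failure
    print(f"{elapsed:5.1f} h since last failure -> "
          f"expected {expected_time_to_next_failure(elapsed, eta, k):6.1f} h until the next")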
... With the development of high performance systems with massive numbers of processors [1] and long-running scalable scientific applications that can use the processors for execution [2], the mean time between failures (MTBF) of the processors used for a single application execution has decreased tremendously [3]. Hence, many checkpointing systems have been developed to enable fault tolerance for application executions [4], [5], [6], [7], [8]. ...
Article
Selecting optimal intervals for checkpointing an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for sequential applications, while few efforts deal with parallel applications where the applications are executed on the same number of processors for the entire duration of execution. Some checkpointing systems support parallel applications where the number of processors on which the applications execute can be changed during the execution. We refer to these kinds of parallel applications as malleable applications. In this paper, we develop a performance model for malleable parallel applications that estimates the amount of useful work performed in unit time (UWT) by a malleable application in the presence of failures as a function of the checkpointing interval. We use this performance model function with different intervals and select the interval that maximizes the UWT value. By conducting a large number of simulations with traces obtained on real supercomputing systems, we show that the checkpointing intervals determined by our model can lead to high efficiency of applications in the presence of failures.
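The abstract describes evaluating a useful-work-per-unit-time (UWT) model over candidate checkpoint intervals and choosing the maximizer. The sketch below illustrates only that selection step; the UWT function used here is a simple first-order stand-in (checkpoint overhead plus expected recomputation), not the paper's model for malleable applications, and the candidate grid, checkpoint cost, and failure rate are assumed values:

def uwt(interval_s, checkpoint_cost_s, failure_rate_per_s):
    """Stand-in useful-work fraction: 1 minus checkpoint overhead minus expected rework."""
    return max(0.0, 1.0 - checkpoint_cost_s / interval_s
                        - failure_rate_per_s * interval_s / 2.0)

def best_interval(candidates_s, checkpoint_cost_s, failure_rate_per_s):
    """Pick the candidate interval that maximizes the UWT estimate."""
    return max(candidates_s, key=lambda tau: uwt(tau, checkpoint_cost_s, failure_rate_per_s))

candidates = [60.0 * m for m in range(1, 121)]       # 1 to 120 minutes, assumed grid
C, lam = 30.0, 1.0 / (12 * 3600.0)                   # assumed checkpoint cost and failure rate
tau_star = best_interval(candidates, C, lam)
print(f"selected interval: {tau_star / 60:.0f} min, UWT = {uwt(tau_star, C, lam):.4f}")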
... for more information. For today's large-scale systems, the MTBF ranges from a few hours to several days, depending mainly on the system size [63]. Researchers have predicted that an Exascale system might fail on the order of once every 30 minutes [64]. ...
Thesis
Technology scaling and a continual increase in operating frequency have been the main drivers of processor performance for several decades. A recent slowdown in this evolution is compensated for by multi-core architectures, which challenge application developers and also increase the disparity between processor and memory performance. The increasing core count and growing scale of computing systems furthermore turn attention to communication as a significant contributor to application run-times. Larger systems also comprise many more components that are subject to failures. In order to mitigate the effects of these failures, fault tolerance techniques such as Checkpoint/Restart are used. These techniques often rely on message-based communication, and the associated data transport stresses the local memory interface. In order to reduce communication overhead it is desirable either to decrease the number of messages or to accelerate the execution of commonly used global operations. Finally, power consumption of large-scale systems has become a major concern, and the efficiency of such systems must improve considerably to allow future Exascale systems to operate within a reasonable power budget. This work addresses the topics of memory interface, communication, fault tolerance, and energy efficiency in large-scale systems. It presents Network Attached Memory (NAM), an FPGA-based hardware prototype that can be directly connected to a common high-performance interconnection network in large-scale systems. It provides access to the emerging memory technology Hybrid Memory Cube (HMC) as a shared memory resource, tightly integrated with processing elements. The first part introduces the HMC memory architecture and serial interface, and thoroughly evaluates it in an FPGA using a custom-developed host controller, which has become an open-source initiative. The next part describes the hardware architecture of the NAM design and prototype, and theoretically evaluates the expected performance and bottlenecks. The NAM design was fully prototyped in an FPGA, and the contribution also comprises a corresponding software stack. As a first use case, NAM serves as a Checkpoint/Restart target, aiming to reduce inter-node communication and to accelerate the creation of checkpoint parity information. Reducing checkpointing overhead improves application run-times and energy efficiency alike. The final part of this work evaluates the NAM performance in a 16-node test system. It shows good read/write scaling behavior for an increasing number of nodes. For Checkpoint/Restart with a real application, a 2.1X improvement over a standard approach is a remarkable result. This demonstrates the viability of a dedicated hardware component for reducing communication and fault tolerance overhead in current and future large-scale systems.
... TICK considers transparency to be the most important criterion in scalable systems, so system-level checkpointing is the natural choice to ensure transparency in grid computations. TICK uses buffered coscheduling (BCS) [25] to ensure checkpointing consistency. In BCS, messages are buffered and scheduled before transmission, so there are no late or in-transit messages at the end of an interval. ...
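A highly simplified sketch of the buffering idea described in this excerpt (not TICK's or BCS's actual implementation; the class and its methods are hypothetical): messages produced during an interval are only held in a buffer, and everything is handed to the network at the interval boundary, so no message is in transit when the interval ends and a consistent checkpoint can be taken there.

from collections import deque

class IntervalBufferedChannel:
    """Toy model of interval-based message buffering (BCS-like, greatly simplified)."""

    def __init__(self):
        self.pending = deque()   # messages produced during the current interval
        self.delivered = []      # messages handed to the network at interval boundaries

    def send(self, message):
        # During an interval, messages are only buffered, never put on the wire.
        self.pending.append(message)

    def end_of_interval(self):
        # At the boundary, flush everything: afterwards nothing is in transit,
        # so a checkpoint taken here sees no in-flight messages.
        while self.pending:
            self.delivered.append(self.pending.popleft())
        return len(self.delivered)

channel = IntervalBufferedChannel()
for i in range(3):
    channel.send(f"msg-{i}")
print("delivered at boundary:", channel.end_of_interval(), "in transit:", len(channel.pending))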
Article
A common approach to guaranteeing an acceptable level of fault tolerance in scientific computing is checkpointing. In this strategy, when a task fails it can be restarted from the most recently checkpointed state rather than from the beginning, which reduces the loss of work and ensures reliability. Several kinds of systems use checkpointing to ensure fault tolerance, such as HPC systems, distributed discrete-event simulation, and clouds. The literature proposes several classifications of checkpointing techniques using different metrics and criteria. In this paper we focus on the classification based on abstraction level, in which checkpointing is categorized into two principal types: application level and system level. Each of these levels has its advantages and suffers from several problems. The difference between this paper and other surveys proposed in the literature is that we study each level in detail. We also study and analyze works that propose solutions to overcome the problems and limits of each abstraction level.
... With the development of high performance systems with massive numbers of processors [1] and long-running scalable scientific applications that can use large numbers of processors for execution [2, 3], the mean time between failures (MTBF) of the processors used for a single application execution has decreased tremendously [4]. Current petascale systems are reported to have MTBFs of less than 10 hours [5, 6], and future exascale systems are anticipated to have MTBFs of less than an hour [6]. ...
Article
Full-text available
Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, where the number of processors on which the applications execute can be changed during execution, can make use of their malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long-running malleable applications to maximize application performance in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live migration, and rescheduling, and runtime decisions for dynamically selecting the fault tolerance actions at different points of application execution to maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.
... This common belief is based on the traditional exponential distribution model. In the exponential model, the probability of job interruption is the same as the failure probability P(t < T) = 1 − e^(−λT), where λ is related to the job width and T is determined by the job length [18], [19]. However, a case study shows that the failure interarrival rate fits a Weibull distribution, P(t < T) = 1 − e^(−(λT)^α), where the probability of job interruption is determined not only by λ and T but also by the time since the most recent failure, as a conditional probability [30]. ...
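A small numerical illustration of the two models contrasted above (the rate, shape, and job lengths are assumed values, not the cited study's fitted parameters): under the exponential model the interruption probability depends only on λ and the job length T, while under the Weibull model the shape parameter α changes how that probability grows with T.

import math

lam, alpha = 1.0 / 24.0, 0.7        # assumed failure rate (per hour) and Weibull shape
for T in (1.0, 6.0, 24.0):          # assumed job lengths in hours
    p_exponential = 1.0 - math.exp(-lam * T)
    p_weibull = 1.0 - math.exp(-((lam * T) ** alpha))
    print(f"T = {T:4.1f} h: P_exp = {p_exponential:.3f}, P_weibull = {p_weibull:.3f}")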
Conference Paper
Full-text available
With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on the RAS logs alone has proved to be insufficient for understanding failures and system behaviors. To overcome the limitations of these existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on fault resilience of large-scale systems.
... Using a virtual machine to achieve portability is arguably also the weak point of the approach, given that the code is then interpreted and consequently slower. [49,111] is an implementation of the MPI specification in which the application's execution time is discretized into fixed intervals (a few hundred microseconds) during which communications are scheduled. An important consequence of this approach is the absence of messages on the network at the end of each interval. ...
Thesis
PC clusters are distributed architectures whose adoption is spreading because of their low cost but also their extensibility in terms of nodes. In particular, the growing number of nodes is the source of an increasing number of fail-stop failures that jeopardize the execution of distributed applications. The absence of efficient and portable solutions confines their use to non-critical applications or applications without time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring the application with fault-tolerant skeletons, as well as collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS is implemented in C++ on top of MPI, whereas ToMaWork is implemented in Java on top of a virtual shared memory system provided by JavaSpaces technology. The evaluation of the frameworks shows a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate a better efficiency of FT-GReLoSSS's fault tolerance solution compared to existing system-level solutions (LAM/MPI and DMTCP).
... Using a virtual machine to achieve portability is arguably also the weak point of the approach, given that the code is then interpreted and consequently slower. [49,111] is an implementation of the MPI specification in which the application's execution time is discretized into fixed intervals (a few hundred microseconds) during which communications are scheduled. An important consequence of this approach is the absence of messages on the network at the end of each interval. ...
Article
PC clusters are distributed architectures whose adoption is spreading as a result of their low cost but also their extensibility in terms of nodes. In particular, the increase in the number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to non-critical applications or applications without time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications using fault-tolerant skeletons, as well as collaborations between the programmer and the fault tolerance system to gain efficiency. The application of MoLOToF to the SPMD and Master-Worker families of parallel algorithms leads to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate the better efficiency of FT-GReLoSSS's fault tolerance solution compared to existing system-level solutions (LAM/MPI and DMTCP).