Figure 1. Component failure model.

Source publication
Conference Paper
Full-text available
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component coun...

Context in source publication

Context 1
... of nominal MTBF, typically there is a phase in component lifetime where failure rates can be higher. Generally, the failure rate model for components follows a bathtub curve, as shown in Figure 1. As can be seen, there are three different phases: component burn-in, normal aging, and late failure. Failures are frequent during the burn-in and late-failure phases due to defects in the components and component aging, respectively. For the sake of discussion, the MTBF of individual components is defined as MTBF = 1/λ, where the parameter λ is called the failure rate of the component. In a system where the failure of a single component results in the failure of the entire system, the MTBF of the system is the reciprocal of the sum of the component failure rates: MTBF_system = 1/(λ_1 + λ_2 + ... + λ_N). Figure 2 shows the expected system MTBF obtained from this model for three different component reliability levels: MTBFs of 10^4, 10^5, and 10^6 hours. As can be seen, increasing the number of components in the system dramatically decreases the MTBF of the entire system. For projected Pflop systems the MTBF is only a few hours even when all components are of very high reliability (MTBF of 10^6 hours). In particular, for a system with 100,000 nodes the MTBF would be 10 hours. The BlueGene/L system with 65,536 nodes is expected to have an MTBF of less than 24 hours. In reality these numbers are optimistic for several reasons: MTBF numbers for individual components assume ideal environmental conditions, components will not generally have uniformly high MTBFs, and failure of some components can damage others, reducing their MTBFs. ...
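The scaling described in this excerpt is easy to reproduce numerically. The following minimal Python sketch (the function name and the node counts are ours, chosen for illustration, not taken from the paper) evaluates the serial-reliability model in which the system failure rate is the sum of N identical component failure rates:

def system_mtbf(component_mtbf_hours, num_components):
    """MTBF of a system that fails whenever any single component fails."""
    component_rate = 1.0 / component_mtbf_hours      # failure rate (lambda) of one component
    system_rate = num_components * component_rate    # per-component failure rates add up
    return 1.0 / system_rate

for mtbf in (1e4, 1e5, 1e6):                         # the three reliability levels of Figure 2
    for nodes in (1_000, 10_000, 100_000):           # node counts chosen for illustration
        print(f"component MTBF {mtbf:9.0f} h, {nodes:7d} nodes -> "
              f"system MTBF {system_mtbf(mtbf, nodes):10.2f} h")

For 100,000 nodes with a component MTBF of 10^6 hours this reproduces the 10-hour system MTBF quoted above.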

Citations

... Users can select desired operating systems, software stacks, and computing environments for their applications, making the VC an attractive platform for high-performance computing (HPC) applications [1]. However, as cloud providers scale out hardware resources to meet user demands, the number of failures increases [2]-[4], making fault tolerance mechanisms essential for large-scale cloud computing environments. Checkpoint-restart [5] is a widely used technique for mitigating failures by periodically saving the state of processes to storage and restarting the computation from the saved state when a failure occurs. ...
Article
Full-text available
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of virtual machines (VMs) offer an attractive fault tolerance capability for cloud data centers. However, existing mechanisms have suffered from high checkpoint downtimes and overheads. This paper introduces Mekha, a novel hypervisor-level, in-memory coordinated checkpoint-restart mechanism for VCs that leverages precopy live migration. During a VC checkpoint event, Mekha creates a shadow VM for each VM and employs a novel memory-bound timed-multiplex data (MTD) transfer mechanism to replicate the state of each VM to its corresponding shadow VM. We also propose a global ending condition that enables the checkpoint coordinator to control the termination of the MTD algorithm for every VM in a VC, thereby reducing overall checkpoint latency. Furthermore, the checkpoint protocols of Mekha are designed based on barrier synchronizations and virtual time, ensuring the global consistency of checkpoints and utilizing existing data retransmission capabilities to handle message loss. We conducted several experiments to evaluate Mekha using a message passing interface (MPI) application from the NASA advanced supercomputing (NAS) parallel benchmark. The results demonstrate that Mekha significantly reduces checkpoint downtime compared to traditional checkpoint mechanisms. Consequently, Mekha effectively decreases checkpoint overheads while offering efficiency and practicality, making it a viable solution for cloud computing environments.
... The development of Exascale stream processing systems [11,18,34], to cope with an ever-increasing big data demand, presents many challenges, one of which is fault tolerance: as the degree of parallelism in a stream processing system increases, the mean time to failure (MTTF) of the stream processing system as a whole decreases [25,28]. For instance, the MTTF of Exascale systems is anticipated to be in minutes [31,4], which raises the need for efficient fault tolerance approaches. ...
... For example, in Exascale systems multiple failures are expected every day [37,2] and the MTTF is anticipated to be in minutes [31,4]. For systems such as Flink, where a failure of a single node results in restarting the whole application from the previous checkpoint, the failure rate of the system is the sum of the per-node rates, λ_1 + λ_2 + ... + λ_n, where λ_i is the failure rate of node i [28]. Fig. 13 shows how the failure rate changes with the number of nodes in the system, assuming a failure rate of 0.0022 per hour for every node. ...
Preprint
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and is evidently more efficient than replication as state size grows. However current systems use a nominal value for the checkpoint interval, indicative of assuming roughly 1 failure every 19 days, that does not take into account the salient aspects of the checkpoint process, nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization -- the fraction of total time available for the system to do useful work -- that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing it to be dependent only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of our model through experiments with Apache Flink, where we obtain improvements in system utilization for every case, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
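As a rough numerical companion to the two excerpts above, the sketch below first sums identical per-node failure rates, using the 0.0022 failures/hour quoted in the citation context (roughly one failure per node every 19 days, matching the abstract), and then applies the classical Young/Daly approximation for the checkpoint interval, which, like the preprint's result, depends only on checkpoint cost and failure rate. The node counts and the 60-second checkpoint cost are assumed illustrative values, and the formula is the textbook approximation, not the preprint's own expression:

import math

NODE_RATE = 0.0022       # failures per hour per node, as quoted above (about 1 failure per 19 days)
CHECKPOINT_COST = 60.0   # seconds per checkpoint -- an assumed, illustrative value

def young_daly_interval(checkpoint_cost_s, failure_rate_per_s):
    """Classical first-order approximation of the optimal interval: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)

for nodes in (64, 256, 1024, 4096):              # illustrative system sizes
    system_rate = nodes * NODE_RATE              # per-node failure rates add up (failures/hour)
    mttf_hours = 1.0 / system_rate
    interval_s = young_daly_interval(CHECKPOINT_COST, system_rate / 3600.0)
    print(f"{nodes:5d} nodes: MTTF {mttf_hours:6.2f} h, checkpoint every ~{interval_s / 60:5.1f} min")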
... In the case of a cluster with multiple nodes, a failure in any component of a node can cause the whole program to fail. Moreover, the failure rate grows linearly with the number of nodes when we assume that all of the nodes operate under identical conditions [24], [25]. When considering the effect of temperature, the MTBF of a system, M, is defined as: ...
Conference Paper
Full-text available
Checkpointing with a constant checkpoint interval, a so-called constant checkpointing method, is commonly used in the HPC field and has been proven to be the optimal solution for failures whose inter-arrival times are exponentially distributed. On the other hand, previous works have shown that there is a high correlation between processor temperature and failure rate. By analyzing the results of temperature monitoring on a parallel application, we noticed that the failure rate changes dynamically and the failure inter-arrival times do not follow an exponential distribution. Under such a scenario, the constant checkpointing method is not the optimal solution, and thus a checkpointing method with an adaptive checkpoint interval, called an adaptive checkpointing method, is required to achieve high performance. However, to use the adaptive method, the processor temperature must be constantly monitored in order to decide the timing for checkpointing. In this paper, we propose an adaptive checkpointing method with less reliance on temperature monitoring. Our proposed method uses the timings of failures that have already occurred, called the prior failures, to estimate the mean time to failure (MTTF) of the next failure, called the posterior failure. The timing of the posterior failure is predicted based on the characteristics of a truncated Weibull distribution. The simulation results show that the proposed method can reduce the total wasted time compared to the constant checkpointing method with a considerably small temperature monitoring period.
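The estimator itself is not spelled out in this abstract, so the following is only a hedged sketch of the underlying idea: under a Weibull failure model with shape parameter below one, the expected time to the next ("posterior") failure grows with the time already elapsed since the last ("prior") failure, which is what makes an adaptive interval attractive. The scale and shape values below are assumptions for illustration, not the paper's fitted parameters:

import math

def weibull_survival(t, eta, k):
    """P(T > t) for a Weibull distribution with scale eta and shape k."""
    return math.exp(-((t / eta) ** k))

def expected_time_to_next_failure(elapsed, eta, k, steps=20000):
    """E[T - elapsed | T > elapsed], by midpoint integration of the survival function."""
    horizon = elapsed + 50.0 * eta                   # far enough out that S() is negligible
    h = (horizon - elapsed) / steps
    area = sum(weibull_survival(elapsed + (i + 0.5) * h, eta, k) * h for i in range(steps))
    return area / weibull_survival(elapsed, eta, k)

eta, k = 24.0, 0.7                                   # assumed scale (hours) and shape < 1
for elapsed in (0.0, 12.0, 48.0):                    # hours since the last observed failure
    print(f"{elapsed:5.1f} h since last failure -> "
          f"expected {expected_time_to_next_failure(elapsed, eta, k):6.1f} h until the next")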
... With the development of high performance systems with massive numbers of processors [1] and long-running scalable scientific applications that can use the processors for execution [2], the mean time between failures (MTBF) of the processors used for a single application execution has decreased tremendously [3]. Hence, many checkpointing systems have been developed to enable fault tolerance for application executions [4], [5], [6], [7], [8]. ...
Article
Selecting optimal intervals for checkpointing an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for sequential applications, while few efforts deal with parallel applications where the applications are executed on the same number of processors for the entire duration of execution. Some checkpointing systems support parallel applications where the number of processors on which the applications execute can be changed during the execution. We refer to these kinds of parallel applications as malleable applications. In this paper, we develop a performance model for malleable parallel applications that estimates the amount of useful work performed in unit time (UWT) by a malleable application in the presence of failures as a function of the checkpointing interval. We use this performance model function with different intervals and select the interval that maximizes the UWT value. By conducting a large number of simulations with traces obtained on real supercomputing systems, we show that the checkpointing intervals determined by our model can lead to high efficiency of applications in the presence of failures.
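The abstract describes evaluating a useful-work-per-unit-time (UWT) model over candidate checkpoint intervals and choosing the maximizer. The sketch below illustrates only that selection step; the UWT function used here is a simple first-order stand-in (checkpoint overhead plus expected recomputation), not the paper's model for malleable applications, and the candidate grid, checkpoint cost, and failure rate are assumed values:

def uwt(interval_s, checkpoint_cost_s, failure_rate_per_s):
    """Stand-in useful-work fraction: 1 minus checkpoint overhead minus expected rework."""
    return max(0.0, 1.0 - checkpoint_cost_s / interval_s
                        - failure_rate_per_s * interval_s / 2.0)

def best_interval(candidates_s, checkpoint_cost_s, failure_rate_per_s):
    """Pick the candidate interval that maximizes the UWT estimate."""
    return max(candidates_s, key=lambda tau: uwt(tau, checkpoint_cost_s, failure_rate_per_s))

candidates = [60.0 * m for m in range(1, 121)]       # 1 to 120 minutes, assumed grid
C, lam = 30.0, 1.0 / (12 * 3600.0)                   # assumed checkpoint cost and failure rate
tau_star = best_interval(candidates, C, lam)
print(f"selected interval: {tau_star / 60:.0f} min, UWT = {uwt(tau_star, C, lam):.4f}")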
... for more information. For today's large-scale systems, the MTBF ranges from a few hours to several days, depending mainly on the system size [63]. Researchers have predicted that an Exascale system might fail on the order of once every 30 minutes [64]. ...
Thesis
Technology scaling and a continual increase in operating frequency have been the main drivers of processor performance for several decades. A recent slowdown in this evolution is compensated for by multi-core architectures, which challenge application developers and also increase the disparity between processor and memory performance. The increasing core count and growing scale of computing systems furthermore turn attention to communication as a significant contributor to application run-times. Larger systems also comprise many more components that are subject to failures. In order to mitigate the effects of these failures, fault tolerance techniques such as Checkpoint/Restart are used. These techniques often rely on message-based communication, and the associated data transport stresses the local memory interface. In order to reduce communication overhead it is desirable either to decrease the number of messages or to accelerate the execution of commonly used global operations. Finally, power consumption of large-scale systems has become a major concern, and the efficiency of such systems must improve considerably to allow future Exascale systems to operate within a reasonable power budget. This work addresses the topics of memory interface, communication, fault tolerance, and energy efficiency in large-scale systems. It presents Network Attached Memory (NAM), an FPGA-based hardware prototype that can be directly connected to a common high-performance interconnection network in large-scale systems. It provides access to the emerging memory technology Hybrid Memory Cube (HMC) as a shared memory resource, tightly integrated with processing elements. The first part introduces the HMC memory architecture and serial interface, and thoroughly evaluates it in an FPGA using a custom-developed host controller, which has become an open-source initiative. The next part describes the hardware architecture of the NAM design and prototype, and theoretically evaluates the expected performance and bottlenecks. The NAM design was fully prototyped in an FPGA, and the contribution also comprises a corresponding software stack. As a first use case, NAM serves as a Checkpoint/Restart target, aiming to reduce inter-node communication and to accelerate the creation of checkpoint parity information. Reducing checkpointing overhead improves application run-times and energy efficiency alike. The final part of this work evaluates the NAM performance in a 16-node test system. It shows good read/write scaling behavior for an increasing number of nodes. For Checkpoint/Restart with a real application, a 2.1X improvement over a standard approach is a remarkable result. This demonstrates the viability of a dedicated hardware component for reducing communication and fault tolerance overhead in current and future large-scale systems.
... TICK considers transparency to be the most important criterion in scalable systems, so system-level checkpointing is the natural choice to ensure transparency in grid computations. TICK uses buffered coscheduling (BCS) [25] to ensure checkpointing consistency. In BCS, messages are buffered and scheduled before transmission, so there are no late or in-transit messages at the end of an interval. ...
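A highly simplified sketch of the buffering idea described in this excerpt (not TICK's or BCS's actual implementation; the class and its methods are hypothetical): messages produced during an interval are only held in a buffer, and everything is handed to the network at the interval boundary, so no message is in transit when the interval ends and a consistent checkpoint can be taken there.

from collections import deque

class IntervalBufferedChannel:
    """Toy model of interval-based message buffering (BCS-like, greatly simplified)."""

    def __init__(self):
        self.pending = deque()   # messages produced during the current interval
        self.delivered = []      # messages handed to the network at interval boundaries

    def send(self, message):
        # During an interval, messages are only buffered, never put on the wire.
        self.pending.append(message)

    def end_of_interval(self):
        # At the boundary, flush everything: afterwards nothing is in transit,
        # so a checkpoint taken here sees no in-flight messages.
        while self.pending:
            self.delivered.append(self.pending.popleft())
        return len(self.delivered)

channel = IntervalBufferedChannel()
for i in range(3):
    channel.send(f"msg-{i}")
print("delivered at boundary:", channel.end_of_interval(), "in transit:", len(channel.pending))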
Article
A common approach to guaranteeing an acceptable level of fault tolerance in scientific computing is checkpointing. In this strategy, when a task fails it can be restarted from the most recently checkpointed state rather than from the beginning, which reduces the loss of work and ensures reliability. Several kinds of systems use checkpointing to ensure fault tolerance, such as HPC systems, distributed discrete-event simulation, and clouds. The literature proposes several classifications of checkpointing techniques using different metrics and criteria. In this paper we focus on the classification based on abstraction level, in which checkpointing is categorized into two principal types: application level and system level. Each of these levels has its advantages and suffers from several problems. The difference between this paper and other surveys proposed in the literature is that we study each level in detail. We also study and analyze works that propose solutions to overcome the problems and limits of each abstraction level.
... With the development of high performance systems with massive numbers of processors [1] and long-running scalable scientific applications that can use large numbers of processors for execution [2, 3], the mean time between failures (MTBF) of the processors used for a single application execution has decreased tremendously [4]. Current petascale systems are reported to have MTBFs of less than 10 hours [5, 6], and future exascale systems are anticipated to have MTBFs of less than an hour [6]. ...
Article
Full-text available
Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, where the number of processors on which the applications execute can be changed during execution, can make use of their malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long-running malleable applications to maximize application performance in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live migration, and rescheduling, and runtime decisions for dynamically selecting the fault tolerance actions at different points of application execution to maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.
... This common belief is based on the traditional exponential distribution model. In the exponential model, the probability of job interruption is the same as the failure probability P(t < T) = 1 − e^(−λT), where λ is related to the job width and T is determined by the job length [18], [19]. However, a case study shows that the failure interarrival rate fits a Weibull distribution, P(t < T) = 1 − e^(−(λT)^α), where the probability of job interruption is determined not only by λ and T but also by the time since the most recent failure, as a conditional probability [30]. ...
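A small numerical illustration of the two models contrasted above (the rate, shape, and job lengths are assumed values, not the cited study's fitted parameters): under the exponential model the interruption probability depends only on λ and the job length T, while under the Weibull model the shape parameter α changes how that probability grows with T.

import math

lam, alpha = 1.0 / 24.0, 0.7        # assumed failure rate (per hour) and Weibull shape
for T in (1.0, 6.0, 24.0):          # assumed job lengths in hours
    p_exponential = 1.0 - math.exp(-lam * T)
    p_weibull = 1.0 - math.exp(-((lam * T) ** alpha))
    print(f"T = {T:4.1f} h: P_exp = {p_exponential:.3f}, P_weibull = {p_weibull:.3f}")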
Conference Paper
Full-text available
With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on the RAS logs alone has proved to be insufficient for understanding failures and system behaviors. To overcome the limitations of these existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on fault resilience of large-scale systems.
... Using a virtual machine to achieve portability is arguably also the weak point of the approach, given that the code is then interpreted and consequently slower. [49,111] is an implementation of the MPI specification in which the application's execution time is discretized into fixed intervals (a few hundred microseconds) during which communications are scheduled. An important consequence of this approach is the absence of messages on the network at the end of each interval. ...
Thesis
PC clusters are distributed architectures whose adoption is spreading because of their low cost but also their extensibility in terms of nodes. In particular, the growing number of nodes is the source of an increasing number of fail-stop failures that jeopardize the execution of distributed applications. The absence of efficient and portable solutions confines their use to non-critical applications or applications without time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring the application with fault-tolerant skeletons, as well as collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS is implemented in C++ on top of MPI, whereas ToMaWork is implemented in Java on top of a virtual shared memory system provided by JavaSpaces technology. The evaluation of the frameworks shows a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate a better efficiency of FT-GReLoSSS's fault tolerance solution compared to existing system-level solutions (LAM/MPI and DMTCP).
... Using a virtual machine to achieve portability is arguably also the weak point of the approach, given that the code is then interpreted and consequently slower. [49,111] is an implementation of the MPI specification in which the application's execution time is discretized into fixed intervals (a few hundred microseconds) during which communications are scheduled. An important consequence of this approach is the absence of messages on the network at the end of each interval. ...
Article
PC clusters are distributed architectures whose adoption is spreading as a result of their low cost but also their extensibility in terms of nodes. In particular, the increase in the number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to non-critical applications or applications without time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications using fault-tolerant skeletons, as well as collaborations between the programmer and the fault tolerance system to gain efficiency. The application of MoLOToF to the SPMD and Master-Worker families of parallel algorithms leads to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate the better efficiency of FT-GReLoSSS's fault tolerance solution compared to existing system-level solutions (LAM/MPI and DMTCP).