Article

Reliability MicroKernel: Providing Application-Aware Reliability in the OS


Abstract

This paper describes the Reliability MicroKernel (RMK) framework, a loadable kernel module (or device driver) for providing application-aware reliability and dynamically configuring reliability mechanisms. Characteristics of application/system execution are exploited transparently through application-aware reliability techniques to achieve low-latency detection and low-overhead checkpointing. The RMK prototype is implemented in both Linux and Windows, and it supports detection of application/OS failures and transparent application checkpointing. Experimental results show that the system hang detection and application hang detection, which exploit characteristics of application and system behavior, achieve high coverage (100% observed in our experiments) with a low false-positive rate. Moreover, the performance overhead of RMK and its detection/checkpointing mechanisms is small: 0.6% for application hang detection and 0.1% for transparent application checkpointing in the experiments.


... The primary method for detecting faults that lead to hanging is based on tracing all execution information [22,23]. A system hang detector [22] uses a counter of the instructions executed between context switches. This method, which determines the system to be hanging when the counter value exceeds a maximum, can be applied only when a process or the OS stops operating. ...
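A minimal user-space sketch of this counting idea is shown below; read_instruction_counter() and the context-switch hook are hypothetical stand-ins for the CPU performance counter and kernel instrumentation used by the detectors cited above, and the threshold is an arbitrary illustrative value.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical accessor for the CPU's retired-instruction counter. */
extern uint64_t read_instruction_counter(void);

/* Upper bound on instructions a task may retire between two context
 * switches before the system is suspected to be hung (illustrative). */
static uint64_t max_instructions_between_switches = 50000000ULL;

static uint64_t count_at_last_switch;

/* Called from (hypothetical) context-switch instrumentation. */
void on_context_switch(void)
{
    count_at_last_switch = read_instruction_counter();
}

/* Called periodically, e.g., from a timer; returns true if the current
 * task has exceeded the instruction budget without yielding the CPU. */
bool system_hang_suspected(void)
{
    uint64_t executed = read_instruction_counter() - count_at_last_switch;
    return executed > max_instructions_between_switches;
}

The detectors above run such a check in kernel context so that it keeps working even when user space no longer makes progress.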
... In practice, this agent does not modify the software code running in the system. Therefore, unlike other methods that obtain data through changes to the code, our method did not generate any memory overhead from an increase in the amount of software code [19,22]. However, the log recorded when a system call was executed was 128 bytes in size. ...
Article
Full-text available
Debugging in an embedded system where hardware and software are tightly coupled and have restricted resources is far from trivial. When hardware defects appear as if they were software defects, determining the real source becomes challenging. In this study, we propose an automated method of distinguishing whether a defect originates from the hardware or software at the stage of integration testing of hardware and software. Our method overcomes the limitations of the embedded environment, minimizes the effects on runtime, and identifies defects by obtaining and analyzing software execution data and hardware performance counters. We analyze the effects of the proposed method through an empirical study. The experimental results reveal that our method can effectively distinguish defects.
... In addition, as better detailed in section 2, in most cases it has not been shown whether they are able to detect software hangs. To the best of our knowledge, only one of the most relevant works in this field accounts for active application hangs [7]; however, no means is provided to detect passive hang conditions. This work proposes an indirect failure detection framework for pinpointing active and passive halt failures (see section 3). ...
... These approaches can be labeled as indirect failure detection techniques. As an example, the work in [7] exploits hardware performance counters and OS signals to monitor the system behavior and to signal possible anomalous conditions. For instance, as better detailed in section 3.2, an active hang is flagged if the instruction count executed by the process goes outside a given bound. ...
... Most indirect detection approaches in the literature, such as the ones cited in section 2, are based on a single monitor; e.g., they monitor all the system calls made by the process under observation. The work in [7] proposed a software architecture (namely, the Reliability MicroKernel) for the development of detection policies using hardware- or software-based monitors (e.g., CPU hardware counters for instruction counting, OS context switches, system call invocations): remarkable events in the system are monitored in order to detect hangs in the applications or the OS. In [7], application hangs are detected by estimating an upper bound on the instructions executed by each thread in each code section (e.g., a critical section or a loop body). ...
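A sketch of the per-code-section bound described in [7] could look like the following; the section bookkeeping, the instruction-count accessor, and the bound values are illustrative assumptions rather than the RMK implementation.

#include <stdint.h>
#include <stdbool.h>

extern uint64_t read_instruction_counter(void);  /* hypothetical */

/* One monitored code section (e.g., a critical section or loop body)
 * with an estimated upper bound on instructions it may execute. */
struct code_section {
    const char *name;
    uint64_t    instr_bound;     /* estimated upper bound    */
    uint64_t    instr_at_entry;  /* snapshot taken on entry  */
    bool        active;
};

void section_enter(struct code_section *s)
{
    s->instr_at_entry = read_instruction_counter();
    s->active = true;
}

void section_exit(struct code_section *s)
{
    s->active = false;
}

/* Checked by an external monitor: an active hang is flagged when the
 * thread has spent more instructions in the section than its bound. */
bool section_hang_suspected(const struct code_section *s)
{
    if (!s->active)
        return false;
    return read_instruction_counter() - s->instr_at_entry > s->instr_bound;
}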
Article
Full-text available
On-line failure detection is an essential means to control and assess the dependability of complex and critical software systems. In such a context, effective detection strategies are required in order to minimize the possibility of catastrophic consequences. This objective is, however, difficult to achieve in complex systems, especially due to the several sources of non-determinism (e.g., multi-threading and distributed interaction) which may lead to software hangs, i.e., the system is active but no longer capable of delivering its services. The paper proposes a detection approach to uncover application hangs. It exploits multiple indirect data gathered at the operating system level to monitor the system and to trigger alarms if the observed behavior deviates from the expected one. By means of fault injection experiments conducted on a research prototype, it is shown how the combination of several operating system monitors actually leads to a high quality of detection at an acceptable overhead.
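As a toy illustration of combining several OS-level monitors into a single alarm, consider the following sketch; the particular monitors, thresholds, and two-out-of-three voting rule are assumptions made for the example, not the detector evaluated in the paper.

#include <stdbool.h>

/* Readings sampled from the OS for the monitored process over a window. */
struct os_sample {
    unsigned long syscalls;         /* system calls issued     */
    unsigned long context_switches; /* times it was scheduled  */
    unsigned long io_bytes;         /* bytes read or written   */
};

/* Minimum activity expected per monitor in a healthy window
 * (illustrative values; real thresholds are learned or profiled). */
static const struct os_sample threshold = { 1, 1, 1 };

/* Raise an alarm when a majority of monitors report no activity:
 * a single silent monitor may be normal, several usually are not. */
bool hang_alarm(const struct os_sample *s)
{
    int silent = 0;
    silent += (s->syscalls < threshold.syscalls);
    silent += (s->context_switches < threshold.context_switches);
    silent += (s->io_bytes < threshold.io_bytes);
    return silent >= 2;
}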
... However, this approach was not aimed at hang detection in individual OS components. In [13], the support of special hardware (e.g., performance counters in the CPU) is exploited to detect hangs in the OS and in user programs. In [14], a machine learning approach is proposed to identify anomalies in virtual machines, by monitoring information collected by the hypervisor such as CPU usage and I/O operations. ...
... The Max algorithm is a heuristic for estimating the number of instructions executed by a process before blocking (e.g., the process invokes a system call or it is preempted by the scheduler) [13]. This algorithm estimates the timeout by picking the highest response time in the history of response times. ...
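The Max heuristic as described reduces to a few lines; the 20% safety margin added below is an assumption for illustration, not part of the cited algorithm.

/* History of observed heartbeat response times, in microseconds. */
static unsigned long max_response_us;

void record_response(unsigned long response_us)
{
    if (response_us > max_response_us)
        max_response_us = response_us;
}

/* Timeout for the next heartbeat: the highest response time observed,
 * inflated by an (assumed) 20% safety margin to tolerate jitter. */
unsigned long next_timeout_us(void)
{
    return max_response_us + max_response_us / 5;
}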
Conference Paper
Full-text available
The microkernel architecture has been investigated by both industries and the academia for the development of dependable Operating Systems (OSs). This work copes with a relevant issue for this architecture, namely unresponsive components because of deadlocks and infinite loops. In particular, a monitor sends heartbeat messages to a component that should reply within a timeout. The timeout choice is tricky, since it should be dynamically adapted to the load conditions of the system. Therefore, our approach is based on an adaptive heartbeat mechanism, in which the timeout is estimated from past response times. We implement and compare three estimation algorithms for the choice of the timeout in the context of the Minix 3 OS. From the analysis we derive useful guidelines for choosing the best algorithm with respect to system requirements.
... At runtime, if an event deviates above or below the threshold within a given specified temporal window, an alarm is raised. Wang et al. [71] developed a framework for runtime system hang detection and application checkpointing. One of the functions of their kernel module is to detect hangs in the operating system and its applications in order to provide a low-latency error detection and recovery mechanism. ...
Preprint
Safety-critical systems must always have predictable and reliable behavior; otherwise, systems fail and lives are put at risk. Even with the most rigorous testing, it is impossible to test systems using all possible inputs. Complex software systems will often fail when given novel sets of inputs; thus, safety-critical systems may behave in unintended, dangerous ways when subject to input combinations that were not seen in development. Safety-critical systems are normally designed to be fault tolerant so they do not fail when given unexpected inputs. Anomaly detection has been proposed as a technique for improving the fault tolerance of safety-critical systems. Past work, however, has been largely limited to behavioral parameter thresholds that miss many kinds of system deviations. Here we propose a novel approach to anomaly detection in fault-tolerant safety-critical systems using patterns of messages between threads. This approach is based on techniques originally developed for detecting security violations on systems with UNIX-like system call APIs; here we show that they can be adapted to the constraints of safety-critical microkernel-based hard real-time systems. We present the design, implementation, and initial evaluation of tH (thread Homeostasis), implemented on a QNX-based self-driving car platform.
... Based on this analysis, we design a soft hang filter that reads the selected performance events and compares them with their thresholds to find soft hang bugs and minimize the number of false positives and negatives. Compared to resource utilizations [35,45,53], monitoring and accessing performance event counters is more lightweight and provides a wider variety of low-level hardware metrics. Compared to just monitoring the response time [28], using both response time and performance events makes it possible to minimize the number of false positives, thus improving the detection performance. ...
Conference Paper
A critical quality factor for smartphone apps is responsiveness, which indicates how fast an app reacts to user actions. A soft hang occurs when the app's response time of handling a certain user action is longer than a user-perceivable delay. Soft hangs can be caused by normal User Interface (UI) rendering or some blocking operations that should not be conducted on the app's main thread (i.e., soft hang bugs). Existing solutions on soft hang bug detection focus mainly on offline app code examination to find previously known blocking operations and then move them off the main thread. Unfortunately, such offline solutions can fail to identify blocking operations that are previously unknown or hidden in libraries. In this paper, we present Hang Doctor, a runtime methodology that supplements the existing offline algorithms by detecting and diagnosing soft hangs caused by previously unknown blocking operations. Hang Doctor features a two-phase algorithm that first checks response time and performance event counters for detecting possible soft hang bugs with small overheads, and then performs stack trace analysis when diagnosis is necessary. A novel soft hang filter based on correlation analysis is designed to minimize false positives and negatives for high detection performance and low overhead. We have implemented a prototype of Hang Doctor and tested it with the latest releases of 114 real-world apps. Hang Doctor has identified 34 new soft hang bugs that are previously unknown to their developers, among which 62%, so far, have been confirmed by the developers, and 68% are missed by offline algorithms.
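The two-phase structure described above can be sketched as follows; the metric names, threshold values, and the rule that combines them are illustrative assumptions, not Hang Doctor's actual filter.

#include <stdbool.h>

/* Lightweight measurements taken while a user action is handled. */
struct action_metrics {
    unsigned long response_ms;   /* handler response time      */
    unsigned long cache_misses;  /* example performance event  */
    unsigned long page_faults;   /* example performance event  */
};

/* Illustrative thresholds; a real filter would derive them from the
 * correlation analysis described in the paper. */
static const struct action_metrics limit = { 100, 1000000, 1000 };

extern void capture_stack_trace(void);  /* heavier diagnosis step */

/* Phase 1: cheap threshold checks on every action.
 * Phase 2: stack-trace collection, run only when phase 1 fires. */
void check_action(const struct action_metrics *m)
{
    bool suspicious = m->response_ms  > limit.response_ms &&
                      (m->cache_misses > limit.cache_misses ||
                       m->page_faults  > limit.page_faults);
    if (suspicious)
        capture_stack_trace();
}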
... Data are either generated by the target component itself (these approaches are also known as push-based monitoring) or they are obtained by querying the target component (pull-based monitoring approaches). Indirect monitoring aims to collect the data without relying on the monitored component [13], [14]. In this case, information concerning the component execution is collected at different probe points of the system, such as the network or operating system, by inserting internal or external probes. ...
Conference Paper
Full-text available
The analysis of monitoring data is extremely valuable for critical computer systems. It allows to gain insights into the failure behavior of a given system under real workload conditions, which is crucial to assure service continuity and downtime reduction. This paper proposes an experimental evaluation of different direct monitoring techniques, namely event logs, assertions, and source code instrumentation, that are widely used in the context of critical industrial systems. We inject 12,733 software faults in a real-world air traffic control (ATC) middleware system with the aim of analyzing the ability of mentioned techniques to produce information in case of failures. Experimental results indicate that each technique is able to cover a limited number of failure manifestations. Moreover, we observe that the quality of collected data to support failure diagnosis tasks strongly varies across the techniques considered in this study.
... The monitors are designed to observe the exchanged messages among the entities of the distributed system, which are then used to deduce a runtime state transition diagram executed by all the entities in the system. An anomaly-based strategy is also adopted in [8], which exploits hardware performance counters and IPC (Inter-Process Communication) signals to monitor the system behavior and to detect possible anomalous conditions. Other solutions, such as [11], are based on statistical learning approaches. ...
Conference Paper
Software systems employed in critical scenarios are increasingly large and complex. The usage of many heterogeneous components causes complex interdependencies and introduces sources of non-determinism that often lead to the activation of subtle faults. Such behaviors, due to their complex triggering patterns, typically escape the testing phase. Effective on-line monitoring is the only way to detect them and to promptly react in order to avoid more serious consequences. In this paper, we propose an error detection framework to cope with software failures, which combines multiple sources of data gathered both at application level and OS level. The framework is evaluated through a fault injection campaign on a complex system from the Air Traffic Management (ATM) domain. Results show that the combination of several monitors is effective to detect errors in terms of false alarms, precision, and recall.
... However, this approach has a non-negligible overhead (all system call parameters are recorded) and is not suited for failures that cannot be reliably reproduced, such as hang failures. The work that appeared in Wang et al. (2007) is the closest to ours; it proposed a detection approach at the OS level using CPU hardware counters. On the one hand, application hangs are detected by estimating an upper bound on the number of instructions executed in each code block of the application. ...
Article
Full-text available
Many critical services are nowadays provided by large and complex software systems. However, the increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but part of its services is perceived as unresponsive. Online monitoring is the only way to detect and to promptly react to such failures. However, when dealing with off-the-shelf-based systems, online detection can be tricky since instrumentation and log data collection may not be feasible in practice. In this paper, a detection framework to cope with software hangs is proposed. The framework enables the non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. Collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that the combination of several monitors at the OS level is effective to detect hang failures in terms of coverage and false positives and with a negligible impact on performance.
Chapter
System logs have been extensively used over the past decades to gain insight about dependability properties of computer systems. Log files contain textual information about regular and anomalous events detected by a system under real workload conditions. By mining the information contained in the logs it is possible to characterize the real failure behavior of the system. By real, we mean considering only the failures that manifest naturally, during system operation. This chapter provides an overview of the main tools and techniques for log-based failure analysis, which have been proposed in the last four decades. By surveying the relevant work in the area, the chapter highlights the main objectives, research trends and applications, and it also discusses the main limitations and recent proposals to improve log-based failure analysis. Keywords: event logs, log processing, failure analysis, dependability evaluation.
Article
Full-text available
Monitoring is a consolidated practice to characterize the dependability behavior of a software system. A variety of techniques, such as event logging and operating system probes, are currently used to generate monitoring data for troubleshooting and failure analysis. In spite of the importance of monitoring, whose role can be essential in critical software systems, there is a lack of studies addressing the assessment and the comparison of the techniques aiming to monitor the occurrence of failures during operations. This paper proposes a method to characterize the monitoring techniques implemented in a software system. The method is based on a fault injection approach and allows measuring 1) precision and recall of a monitoring technique and 2) the dissimilarity of the data it generates upon failures. The method has been used in two critical software systems implementing event logging, assertion checking, and source code instrumentation techniques. We analyzed a total of 3,844 failures. With respect to our data, we observed that the effectiveness of a technique is strongly affected by the system and type of failure, and that the combination of different techniques is potentially beneficial to increase the overall failure reporting ability. More importantly, our analysis revealed a number of practical implications to be taken into account when developing a monitoring technique.
Chapter
This work presents an overview of monitoring approaches to support the diagnosis of software faults and proposes a framework to reveal and diagnose the activation of faults in complex and Off-The-Shelf (OTS) based software systems. The activation of a fault is detected by means of anomaly detection on data collected by OS-level monitors. Instead, the fault diagnosis is accomplished by means of a machine learning approach. The evaluation of the proposed framework is carried out using an industrial prototype from the Air Traffic Control domain by means of software fault injection. Results show that the monitoring and diagnosis framework is able to reveal and diagnose faults with high recall and precision with low latency and low overhead.
Article
Revealing anomalies at the operating system (OS) level to support online diagnosis activities of complex software systems is a promising approach when traditional detection mechanisms (e.g., based on event logs, probes and heartbeats) are inadequate or cannot be applied. In this paper we propose a configurable detection framework to reveal anomalies in the OS behavior related to system misbehaviors. The detector is based on online statistical analysis techniques, and it is designed for systems that operate under variable and non-stationary conditions. The framework is evaluated to detect the activation of software faults in a complex distributed system for Air Traffic Management (ATM). Results of experiments with two different OSs, namely Linux Red Hat EL5 and Windows Server 2008, show that the detector is effective for mission-critical systems. The framework can be configured to select the monitored indicators so as to tune the level of intrusiveness. A sensitivity analysis of the detector parameters is carried out to show their impact on performance and to give practitioners guidelines for its field tuning.
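One simple way to realize an online statistical check of this kind is an exponentially weighted moving average with an adaptive band, sketched below; this is a generic illustration under assumed parameters, not the detector proposed in the paper.

#include <stdbool.h>
#include <math.h>

/* Online EWMA of an OS indicator (e.g., syscall rate) and of its
 * deviation, used to form an adaptive band under changing load. */
struct ewma_detector {
    double mean;   /* smoothed indicator value             */
    double dev;    /* smoothed absolute deviation          */
    double alpha;  /* smoothing factor, e.g., 0.1          */
    double k;      /* band width in deviations, e.g., 4.0  */
};

/* Returns true if the new sample falls outside the adaptive band,
 * then folds the sample into the running statistics. */
bool ewma_update(struct ewma_detector *d, double sample)
{
    double diff = fabs(sample - d->mean);
    bool anomaly = (d->dev > 0.0) && (diff > d->k * d->dev);
    d->mean = (1.0 - d->alpha) * d->mean + d->alpha * sample;
    d->dev  = (1.0 - d->alpha) * d->dev  + d->alpha * diff;
    return anomaly;
}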
Article
Checkpointing and rollback techniques enhance reliability and availability of virtual machines and their hosted IT services. This paper proposes VM-μCheckpoint, a light-weight pure-software mechanism for high-frequency checkpointing and rapid recovery for VMs. Compared with existing techniques of VM checkpointing, VM-μCheckpoint tries to minimize checkpoint overhead and speed up recovery by means of copy-on-write, dirty-page prediction and in-place recovery, as well as saving incremental checkpoints in volatile memory. Moreover, VM-μCheckpoint deals with the issue that latency in error detection potentially results in corrupted checkpoints, particularly when checkpointing frequency is high. We also constructed Markov models to study the availability improvements provided by VM-μCheckpoint (from 99 to 99.98 percent on reasonably reliable hypervisors). We designed and implemented VM-μCheckpoint in the Xen VMM. The evaluation results demonstrate that VM-μCheckpoint incurs an average of 6.3 percent overhead (in terms of program execution time) for 50 ms checkpoint intervals when executing the SPEC CINT 2006 benchmark. Error injection experiments demonstrate that VM-μCheckpoint, combined with error detection techniques in RMK, provides high coverage of recovery.
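The reported availability gain can be read through the standard steady-state relation; the numbers below are assumed purely to show the shape of the effect and are not the parameters of the paper's Markov models.

A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}

With an assumed MTTF of 100 h, shrinking the mean time to repair from 1 h to roughly 0.02 h (as rapid in-place recovery from a recent in-memory checkpoint aims to do) moves A from 100/101 ≈ 99.0% to 100/100.02 ≈ 99.98%.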
Conference Paper
Almost every computer user has encountered an unresponsive system failure or system hang, which leaves the user no choice but to power off the computer. In this paper, the causes of such failures are analyzed in detail and one empirical hypothesis for detecting system hang is proposed. This hypothesis exploits a small set of system performance metrics provided by the OS itself, thereby avoiding modifying the OS kernel and introducing additional cost (e.g., hardware modules). Under this hypothesis, we propose SHFH, a self-healing framework to handle system hang, which can be deployed on the OS dynamically. One unique feature of SHFH is that its "light-heavy" detection strategy is designed to make intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Another feature is that its diagnosis-based recovery strategy offers a better granularity to recover from system hang. Our experimental results show that SHFH can cover 95.34% of system hang scenarios, with a false positive rate of 0.58% and 0.6% performance overhead, validating the effectiveness of our empirical hypothesis.
Article
Event logs have been widely used over the last three decades to analyze the failure behavior of a variety of systems. Nevertheless, the implementation of the logging mechanism lacks a systematic approach and collected logs are often inaccurate at reporting software failures: This is a threat to the validity of log-based failure analysis. This paper analyzes the limitations of current logging mechanisms and proposes a rule-based approach to make logs effective to analyze software failures. The approach leverages artifacts produced at system design time and puts forth a set of rules to formalize the placement of the logging instructions within the source code. The validity of the approach, with respect to traditional logging mechanisms, is shown by means of around 12,500 software fault injection experiments into real-world systems.
Conference Paper
To face the challenges resulting from the increasing complexity of portable consumer electronics, a dependability framework that improves the availability of application software at runtime is needed. In this paper, a system-layer division based on user event processing is given, and faults are classified accordingly. A hardware/software co-designed, application-aware framework architecture is proposed to handle these types of faults, and the software components on the different layers are provided with separate services to improve availability. A validation based on software-implemented fault injection is also given to evaluate the performance, and the experimental results show that the framework can achieve high coverage and fast error recovery.
Conference Paper
Full-text available
We propose a fault injection framework to assess hang detection facilities within the Linux Operating System (OS). The novelty of the framework consists in the adoption of a more representative faultload than existing ones, and in the effectiveness in terms of number of hang failures produced; representativeness is supported by a field data study on the Linux OS. Using the proposed fault injection framework, along with realistic workloads, we find that the Linux OS is unable to detect hangs in several cases. We experience a relative coverage of 75%. To improve detection facilities, we propose a simple yet effective hang detector, which periodically tests OS liveness, as perceived by applications, by means of I/O system calls; it is shown that this approach can improve relative coverage up to 94%. The hang detector can be deployed on any Linux system, with an acceptable overhead.
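A user-space sketch of such an I/O-based liveness probe, assuming POSIX interfaces; the probed path, deadline, and alarm action are illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Periodically exercise the OS through an I/O system call and check
 * that it completes within a deadline; if it does not, the OS is
 * suspected to be hung as perceived by applications. */
int probe_once(const char *path, double deadline_s)
{
    struct timespec start, end;
    char buf[1];

    clock_gettime(CLOCK_MONOTONIC, &start);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        (void)read(fd, buf, sizeof(buf));
        close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    if (fd < 0 || elapsed > deadline_s) {
        fprintf(stderr, "liveness probe failed (%.3f s)\n", elapsed);
        return -1;   /* caller escalates: log, recover, reboot, etc. */
    }
    return 0;
}

In practice the probe must itself be supervised by an independent timer (e.g., a hardware watchdog or a separate thread), since a truly hung kernel would also block the probe.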
Conference Paper
This paper presents an approach to formally verify the detection capability of a system hang detector. To achieve this goal, an abstract formal model of a typical Linux system is created to thoroughly exercise all execution scenarios that may lead to hangs. The goal is to expose cases (i.e., hang scenarios) that escape detection. Our system model abstracts the basic hardware (e.g., timer, hardware counter) and software (e.g., processes/threads) components present in the Linux system. The model enables: (i) capturing behavior of these components so as to depict execution scenarios that lead to hangs, and (ii) evaluating hang detection coverage. Explicit-state model checking is applied to reason about system behavior and uncover hang scenarios that escape detection. The results indicate that the proposed framework allows identification of corner cases of hang scenarios that escape detection and provides valuable insight to developers for enhancing detection mechanisms.
Article
Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes a Reliability MicroKernel (RMK) architecture, which provides an infrastructure that enables the design and deployment of software modules for providing application-aware error detection and recovery. The purpose of the RMK is to provide an automatic approach for low-latency crash/hang detection and rapid recovery via checkpointing. We first demonstrate how the RMK works in a native system and then enhance the RMK to work in VMs. In a native system, the RMK is installed as a device driver, while in a virtualized system, the RMK is both installed as a device driver in VMs and deployed as a hypercall (which is like a system call) in a hypervisor. Our approach is transparent to applications and VMs, i.e., it is not required to modify or recompile the kernel source code in a native system or in a VM. The implemented RMK modules include OS/application crash detection, system hang detection, and transparent checkpoint. Traditionally, an external hardware watchdog is used to force a system reboot whenever the watchdog is not reset within a predefined timeout interval. The detection latency might be significant because the timeout interval for resetting the watchdog timer is usually a matter of seconds to reduce false alarms. The approach in this thesis enables low-latency OS-hang detection (within hundreds of milliseconds or less) by measuring the count of instructions executed between two consecutive context switches and checking if the count exceeds a predefined threshold value. The RMK is enhanced to support virtualized environments. Specifically, we present the description, implementation, and experimental assessment of VM-μCheckpoint, a VM checkpointing framework to protect both the guest OS and applications against runtime errors. Compared with the existing VM checkpoint techniques, our VM-μCheckpoint has small overhead and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints per second) to reduce the recomputation necessary when recovering a VM from a failure. The key point of VM-μCheckpoint is that we do an incremental checkpoint by considering the whole memory of the protected VM as part of the checkpoint. The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 processor and is also implemented in the Xen VMM. (The Xen hypervisor is recompiled for installing RMK, but the OS of a native system or a VM is not recompiled.) Error injection experiments show that our RMK detects all the crashes and system hangs, and VM-μCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental evaluation of the RMK using real-world applications shows that we achieve high coverage and low false-positive rates for error detection (e.g., no false positives for system hang detection) as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in VM-μCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals). We also apply a formal method and analytical/probabilistic models to verify the capability of our system hang detection and to study the availability enhancement provided by the RMK.
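A highly simplified sketch of the copy-on-write idea behind high-frequency incremental checkpointing follows; the page-protection helpers, fixed-size buffer, and write-fault hook are assumptions for illustration, whereas VM-μCheckpoint itself operates on guest-VM memory inside the Xen VMM.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_SAVED 1024

struct saved_page {
    unsigned long pfn;              /* page frame number       */
    uint8_t       data[PAGE_SIZE];  /* contents at epoch start */
};

extern void write_protect_all_pages(void);          /* hypothetical */
extern void clear_write_protection(unsigned long);  /* hypothetical */

static struct saved_page checkpoint[MAX_SAVED];
static unsigned int      n_saved;

/* Begin a checkpoint epoch: discard the old increment and
 * write-protect memory so first writes can be intercepted. */
void checkpoint_epoch_begin(void)
{
    n_saved = 0;
    write_protect_all_pages();
}

/* Invoked from the (hypothetical) write-fault handler the first time a
 * page is written in this epoch: save its old contents for possible
 * rollback, then let the write proceed. */
void on_first_write(unsigned long pfn, const void *page_contents)
{
    if (n_saved < MAX_SAVED) {
        checkpoint[n_saved].pfn = pfn;
        memcpy(checkpoint[n_saved].data, page_contents, PAGE_SIZE);
        n_saved++;
    }
    clear_write_protection(pfn);
}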
Article
Full-text available
The CHORUS technology has been designed for building new generations of open, distributed, scalable operating systems. CHORUS has the following main characteristics: - a communication-based architecture, relying on a minimal Nucleus which integrates distributed processing and communication at the lowest level, and which implements generic services used by a set of subsystem servers to extend standard operating system interfaces. A UNIX subsystem has been developed; other subsystems such as object-oriented systems are planned; - a real-time Nucleus providing real-time services which are accessible to system programmers; - a modular architecture providing scalability, and allowing, in particular, dynamic configuration of the system and its applications over a wide range of hardware and network configurations, including parallel and multiprocessor systems. CHORUS-V3 is the current version of the CHORUS Distributed Operating System, developed by Chorus systèmes. Earlier versions were studied and implemented within the Chorus research project at INRIA between 1979 and 1986. This paper presents the CHORUS architecture and the facilities provided by the CHORUS-V3 Nucleus. It also describes the UNIX subsystem built with the CHORUS technology that provides: - binary compatibility with UNIX; - extended UNIX services, supporting distributed applications by providing network IPC, distributed virtual memory, light-weight processes, and real-time facilities.
Conference Paper
Full-text available
Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today's systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic activity networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.
Conference Paper
Full-text available
This paper proposes an application-transparent, low-overhead checkpointing strategy for maintaining consistency of control structures in a commercial main memory database (MMDB) system, based on the ARMOR (adaptive reconfigurable mobile object of reliability) infrastructure. Performance measurements and availability estimates show that the proposed checkpointing scheme significantly enhances database availability (an extra nine of availability compared with major-recovery-based solutions) while incurring only a small performance overhead (less than 2% in a typical workload of real applications).
Conference Paper
Full-text available
This paper explores hardware-implemented error-detection and security mechanisms embedded as modules in a hardware-level framework called the reliability and security engine (RSE), which is implemented as an integral part of a modern microprocessor. The RSE interacts with the processor through an input/output interface. The CHECK instruction, a special extension of the instruction set architecture of the processor, is the interface of the application with the RSE. The detection mechanisms described here in detail are: (1) the memory layout randomization (MLR) module, which randomizes the memory layout of a process in order to foil attackers who assume a fixed system layout, (2) the data dependency tracking (DDT) module, which tracks the dependencies among threads of a process and maintains checkpoints of shared memory pages in order to roll back the threads when an offending (potentially malicious) thread is terminated, and (3) the instruction checker module (ICM), which checks an instruction for its validity or the control-flow of the program just as the instruction enters the pipeline for execution. Performance simulations for the studied modules indicate low overhead of the proposed solutions.
Conference Paper
Full-text available
This paper describes an experimental study of Linux kernel behavior in the presence of errors that impact the instruction stream of the kernel code. Extensive error injection experiments including over 35,000 errors are conducted targeting the most frequently used functions in the selected kernel subsystems. Three types of faults/errors injection campaigns are conducted: (1) random non-branch instruction, (2) random conditional branch, and (3) valid but incorrect branch. The analysis of the obtained data shows: (i) 95% of the crashes are due to four major causes, namely, unable to handle kernel NULL pointer, unable to handle kernel paging request, invalid opcode, and general protection fault, (ii) less than 10% of the crashes are associated with fault propagation and nearly 40% of crash latencies are within 10 cycles, (iii) errors in the kernel can result in crashes that require reformatting the file system to restore system operation; the process of bringing up the system can take nearly an hour.
Conference Paper
Full-text available
Many fault injection tools are available for dependability assessment. Although these tools are good at injecting a single fault model into a single system, they suffer from two main limitations for use in distributed systems: (1) no single tool is sufficient for injecting all necessary fault models; (2) it is difficult to port these tools to new systems. NFTAPE, a tool for composing automated fault injection experiments from available lightweight fault injectors, triggers, monitors, and other components, helps to solve these problems. We have conducted experiments using NFTAPE with several types of lightweight fault injectors, including driver-based, debugger-based, target-specific, simulation-based, hardware-based, and performance-fault injections. Two example experiments are described in this paper. The first uses a hardware fault injector with a Myrinet LAN; the other uses a Software Implemented Fault Injection (SWIFI) fault injector to target a space-imaging application.
Conference Paper
Full-text available
This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a protocol in the system with reduced programming effort. To support a broad range of applications, RENEW exports, as its external interface, the industry endorsed Message Passing Interface (MPI). Three distinct classes of protocols were evaluated using the RENEW environment with SPEC and NAS benchmarks on a network of workstations connected by ATM. It was observed that the communication-induced protocol emulated the behavior of the coordinated protocol, with comparable performance. The message logging protocol degraded the performance. Even though the message logging protocol was slower due to log replay, all three protocols required a similar amount of time to restore the application to the same state as before failure occurred and recovery was initiated.
Article
Full-text available
Many current approaches to software-implemented fault tolerance (SIFT) rely on process replication, which is often prohibitively expensive for practical use due to its high performance overhead and cost. The adaptive reconfigurable mobile objects of reliability (Armor) middleware architecture offers a scalable low-overhead way to provide high-dependability services to applications. It uses coordinated multithreaded processes to manage redundant resources across interconnected nodes, detect errors in user applications and infrastructural components, and provide failure recovery. The authors describe the experiences and lessons learned in deploying Armor in several diverse fields.
Article
Full-text available
Embedded systems generally interact in some way with the outside world. This may involve measuring sensors and controlling actuators, communicating with other systems, or interacting with users. These functions impose real-time constraints on system design. Verification of these specifications requires computing an upper bound on the worst-case execution time (WCET) of a hardware/software system. Furthermore, it is critical to derive a tight upper bound on WCET in order to make efficient use of system resources. The problem of bounding WCET is particularly difficult on modern processors. These processors use cache-based memory systems that vary memory access time based on the dynamic memory access pattern of the program. This must be accurately modeled in order to tightly bound WCET. Several analysis methods have been proposed to bound WCET on processors with instruction caches. Existing approaches either search all possible program paths, an intractable problem, or they use highly pessimistic assumptions to limit the search space. In this paper we present a more effective method for modeling instruction cache activity and computing a tight bound on WCET. The method uses an integer linear programming formulation and does not require explicit enumeration of program paths. The method is implemented in the program cinderella and we present some experimental results of this implementation.
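In this style of formulation (implicit path enumeration), each basic block B_i has a per-execution cost c_i and an unknown execution count x_i, and the WCET bound is the maximum of the total cost subject to structural flow constraints and loop bounds; the sketch below is a simplification that omits the cache-modeling variables and constraints the paper adds.

\max \sum_i c_i x_i \quad \text{s.t.} \quad x_{\mathrm{entry}} = 1, \qquad x_i = \sum_{e \in \mathrm{in}(B_i)} f_e = \sum_{e \in \mathrm{out}(B_i)} f_e, \qquad x_{\mathrm{body}} \le N \cdot x_{\mathrm{header}}

Here f_e are edge execution counts and N is a user- or analysis-supplied bound on loop iterations per entry of the loop header.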
Article
Full-text available
We have created Zap, a novel system for transparent migration of legacy and networked applications. Zap provides a thin virtualization layer on top of the operating system that introduces pods, which are groups of processes that are provided a consistent, virtualized view of the system. This decouples processes in pods from dependencies to the host operating system and other processes on the system. By integrating Zap virtualization with a checkpoint-restart mechanism, Zap can migrate a pod of processes as a unit among machines running independent operating systems without leaving behind any residual state after migration. We have implemented a Zap prototype in Linux that supports transparent migration of unmodified applications without any kernel modifications. We demonstrate that our Linux Zap prototype can provide general-purpose process migration functionality with low overhead. Our experimental results for migrating pods used for running a standard user's X windows desktop computing environment and for running an Apache web server show that these kinds of pods can be migrated with subsecond checkpoint and restart latencies.
Article
Full-text available
Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
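The flavor of user-directed checkpointing can be illustrated as below; exclude_bytes and checkpoint_here follow the interface style described for libckpt, but the exact signatures and the workload are assumptions of this sketch.

#include <stdlib.h>

/* Illustrative declarations in the style of a user-directed
 * checkpointing library (exact libckpt signatures may differ). */
void exclude_bytes(void *addr, long size);  /* omit region from checkpoint */
void checkpoint_here(void);                 /* take a checkpoint now       */

#define N (1 << 20)   /* number of scratch doubles (illustrative) */

int main(void)
{
    double *scratch = malloc(N * sizeof(double));

    /* scratch is recomputed from persistent data after every phase,
     * so it need not be saved: excluding it shrinks the checkpoint. */
    exclude_bytes(scratch, N * sizeof(double));

    for (int phase = 0; phase < 10; phase++) {
        /* ... long-running computation using scratch ... */
        checkpoint_here();   /* the user decides where checkpoints go */
    }
    free(scratch);
    return 0;
}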
Article
Unmanned space missions rely on some degree of autonomous operation of on-board devices, including on-board computers. Spacecraft designers are challenged to balance autonomy and high performance with the strong reliability needed in a harsh space environment. The primary problem is that reliable radiation-hardened components are not as technologically advanced as the more common, but more vulnerable, commercial components. The goal of the X2000 Future Deliveries project at JPL-NASA is to explore common, off-the-shelf (COTS) technologies (such as processors and network protocols) that can be used in a space environment in a way that retains high reliability. We investigate the fault tolerance of one of these components, the IEEE 1394 (FireWire) network bus, in the context of an embedded X2000 testbed. A software framework, NFTAPE, supports automated fault injection into the nodes. The results of this research indicate that the 1394 bus implements a robust network protocol in the presence of faults in the chipset. It is shown that faults do not propagate to other nodes and only a small number of faults actually produce erroneous behavior. In addition, recovery never involves a complete system restart and the vast majority of errors on a node do not affect the communication between other nonfaulty nodes. These observations provide evidence that the IEEE 1394 bus could be a suitable implementation for computing in a space environment. More detailed studies are necessary to rigorously confirm these findings.
Article
This paper addresses the problem of using COTS microkernels in dependable systems. Because they are not developed with this aim, their behavior in the presence of faults is a main concern to system designers. We propose a novel approach to contain the effect of both external and internal faults that may affect their behavior. As microkernels can be decomposed into simple components, modeling of their expected behavior in the absence of faults is most often possible, which allows for the easy definition of dynamic predicates. For an efficient implementation of fault containment wrappers checking for these predicates, we introduce the notion of MetaKernel to reify the information required for implementing the predicates and to reflect appropriate actions. This approach is exemplified on a case study using an open version of the Chorus microkernel. MAFALDA, a software-implemented fault injection tool, is used to illustrate the benefits procured by the proposed wrappers
Conference Paper
This paper addresses the challenges faced in practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented. The techniques discussed in this work are: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions, (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops, and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating system neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and light-weight compile-time instrumentation methodology to detect all process hangs (including infinite loops), the detection being performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the techniques for hang detection shows a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.
Conference Paper
This paper describes an experimental study of Linux kernel behavior in the presence of errors that impact the instruction stream of the kernel code. Extensive error injection experiments including over 35,000 errors are conducted targeting the most frequently used functions in the selected kernel subsystems. Three types of faults/errors injection campaigns are conducted: (1) random non-branch instruction, (2) random conditional branch, and (3) valid but incorrect branch. The analysis of the obtained data shows: (i) 95% of the crashes are due to four major causes, namely, unable to handle kernel NULL pointer, unable to handle kernel paging request, invalid opcode, and general protection fault, (ii) less than 10% of the crashes are associated with fault propagation and nearly 40% of crash latencies are within 10 cycles, (iii) errors in the kernel can result in crashes that require reformatting the file system to restore system operation; the process of bringing up the system can take nearly an hour. Subsequently, over 35,000 faults/errors are injected into the kernel functions within four subsystems: architecture-dependent code (arch), virtual file system interface (fs), central section of the kernel (kernel), and memory management (mm). Three types of fault/error injection campaigns are conducted: random non-branch, random conditional branch, and valid but incorrect conditional branch. The data is analyzed to quantify the response of the OS as a whole based on the subsystem and to determine which functions are responsible for error sensitivity. The analysis provides a detailed insight into the OS behavior under faults/errors. The major findings include: • Most crashes (95%) are due to four major causes: unable to handle kernel NULL pointer, unable to handle kernel paging request, invalid opcode, and general protection fault. • Nine errors in the kernel result in crashes (most severe crash category), which require reformatting the file system. The process of bringing up the system can take nearly an hour. • Less than 10% of the crashes are associated with fault propagation, and nearly 40% of crash latencies are within 10 cycles. The closer analysis of the propagation patterns indicates that it is feasible to identify strategic locations for embedding additional assertions in the source code of a given subsystem to detect errors and, hence, to prevent error propagation.
Conference Paper
We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previous studies that consider errors found by manual inspection of logs, testing, and surveys because static analysis is applied uniformly to the entire kernel source, though our approach necessarily considers a less comprehensive variety of errors than previous studies. In addition, automation allows us to track errors over multiple versions of the kernel source to estimate how long errors remain in the system before they are fixed.We found that device drivers have error rates up to three to seven times higher than the rest of the kernel. We found that the largest quartile of functions have error rates two to six times higher than the smallest quartile. We found that the newest quartile of files have error rates up to twice that of the oldest quartile, which provides evidence that code "hardens" over time. Finally, we found that bugs remain in the Linux kernel an average of 1.8 years before being fixed.
Conference Paper
Software robustness has significant impact on system availability. Unfortunately, finding software bugs is a very challenging task because many bugs are hard to reproduce. While debugging a program, it would be very useful to rollback a crashed program to a previous execution point and deterministically re-execute the "buggy" code region. However, most previous work on rollback and replay support was designed to survive hardware or operating system failures, and is therefore too heavyweight for the fine-grained rollback and replay needed for software debugging. This paper presents Flashback, a lightweight OS extension that provides fine-grained rollback and replay to help debug software. Flashback uses shadow processes to efficiently roll back in-memory state of a process, and logs a process' interactions with the system to support deterministic replay. Both shadow processes and logging of system calls are implemented in a lightweight fashion specifically designed for the purpose of software debugging. We have implemented a prototype of Flashback in the Linux operating system. Our experimental results with micro-benchmarks and real applications show that Flashback adds little overhead and can quickly roll back a debugged program to a previous execution point and deterministically replay from that point.
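A coarse user-space approximation of the same idea is sketched below: fork() captures a copy-on-write image of the process's memory as a rollback point, and results of non-deterministic calls are logged so a replay can be made deterministic. This is only a sketch of the concept; Flashback implements shadow processes and system-call logging inside the kernel for efficiency.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Results of non-deterministic calls are logged during normal runs so
 * that a rolled-back execution could replay them deterministically. */
static FILE *syscall_log;

static ssize_t logged_read(int fd, void *buf, size_t n)
{
    ssize_t r = read(fd, buf, n);
    fwrite(&r, sizeof(r), 1, syscall_log);   /* record the outcome */
    return r;
}

int main(void)
{
    syscall_log = fopen("syscalls.log", "wb");

    /* Shadow snapshot: the forked child holds a copy-on-write image of
     * the pre-region memory state and simply waits.  To roll back, a
     * debugger would resume execution from this frozen point and replay
     * the logged system-call results; to commit, the child is discarded. */
    pid_t shadow = fork();
    if (shadow == 0) {
        pause();      /* frozen rollback point (never resumed here) */
        _exit(0);
    }

    /* ... suspect code region runs here, using logged_read() ... */
    char buf[16];
    logged_read(STDIN_FILENO, buf, sizeof(buf));

    kill(shadow, SIGKILL);     /* commit: the rollback point is dropped */
    waitpid(shadow, NULL, 0);
    fclose(syscall_log);
    return 0;
}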
Article
The Sequoia computer is a tightly coupled multiprocessor that avoids most of the fault-tolerance disadvantages of tight coupling by using a fault-tolerant hardware-design approach. An overview is given of how the hardware architecture and operating system (OS) work together to provide a high degree of fault tolerance with good system performance. A description of hardware is followed by a discussion of the multiprocessor synchronization problem. Kernel support for fault recovery and the recovery process itself are examined. It is shown that the kernel, through a combination of locking, shadowed memory, and controlled flushing of non-write-through cache, maintains a consistent main memory state recoverable from any single-point failure. The user shared memory is also discussed.
Conference Paper
Embedded systems generally interact with the outside world. Thus, some real-time constraints may be imposed on the system design. Verification of these constraints requires computing a tight upper bound on the worst case execution time (WCET) of a hardware/software system. The problem of bounding WCET is particularly difficult on modern processors, which use cache-based memory systems that vary memory access time significantly. This must be accurately modeled in order to tightly bound WCET. Existing approaches either search all possible program paths, an intractable problem, or they use pessimistic assumptions to limit the search space. In this paper we present a far more effective and accurate method for modeling instruction cache activity and computing a tight bound on WCET. It is implemented in the program cinderella. We present some preliminary results of using this tool on sample embedded programs.
Conference Paper
The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen from a client. Fault tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating system based middleware support that provides the power and flexibility needed by diverse fault tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism, several policies have been implemented using sentries, including checkpointing and journaling. The implementation shows that complex fault tolerant policies can be efficiently and transparently implemented as middleware. Performance overhead of input journaling is less than 5% and application suspension during the checkpoint is typically under 10 seconds in length. A standard hard disk is used to store journal and checkpoint information with dedicated storage requirements of less than 20 MB.
Conference Paper
An alternative approach to building an entire operating system (OS), separating those parts of the OS that control the basic hardware resources (the kernel) from those that determine the unique characteristics of an OS environment, is examined, taking the Mach kernel as an example. Mach features which support OS emulation are discussed. In-kernel and out-of-kernel emulation are described. Two instances of the latter approach, the multithreaded Unix server and the multiserver Unix, are considered. Related work and Mach availability are addressed.
Article
Operating systems form a foundation for robust application software, making it important to understand how effective they are at handling exceptional conditions. The Ballista testing system was used to characterize the handling of exceptional input parameter values for up to 233 POSIX functions and system calls on each of 15 widely used operating system (OS) implementations. This identified ways to crash systems with a single call, ways to cause task hangs within OS code, ways to cause abnormal task termination within OS and library code, failures to implement defined POSIX functionality, and failures to report unsuccessful operations. Overall, only 55 percent to 76 percent of the exceptional tests performed generated error codes, depending on the operating system being tested. Approximately 6 percent to 19 percent of tests failed to generate any indication of error despite exceptional inputs. Approximately 1 percent to 3 percent of tests revealed failures to implement defined POSIX functionality for unusual, but specified, situations. Between 18 percent and 33 percent of exceptional tests caused the abnormal termination of an OS system call or library function, and five systems were completely crashed by individual system calls with exceptional parameter values. The most prevalent sources of these robustness failures were illegal pointer values, numeric overflows, and end-of-file overruns.
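A single hand-written case in the spirit of this methodology might look like the sketch below: call a POSIX function with an exceptional parameter value in a child process and classify the outcome as an error return, abnormal termination, or silent success. The specific call and harness are illustrative, not Ballista's actual test generator.

/* One robustness-test case: invoke a POSIX call with an illegal pointer
 * in a forked child so that a crash does not take down the harness, then
 * report whether it returned an error code or terminated abnormally. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void one_case(const char *name, void (*fn)(void))
{
    pid_t pid = fork();
    if (pid == 0) { fn(); _exit(0); }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("%s: abnormal termination (signal %d)\n", name, WTERMSIG(status));
    else
        printf("%s: exited %d\n", name, WEXITSTATUS(status));
}

static void read_bad_pointer(void)
{
    errno = 0;
    ssize_t r = read(0, (void *)0x1, 4096);   /* illegal buffer pointer */
    printf("  read -> %zd, errno=%s\n", r, strerror(errno));
}

int main(void)
{
    one_case("read(0, bad_ptr, 4096)", read_bad_pointer);
    return 0;
}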
Article
The commercial offer concerning microkernel technology constitutes an attractive alternative for developing operating systems to suit a wide range of application domains. However, the integration of commercial off-the-shelf (COTS) microkernels into critical embedded computer systems is a problem for system developers, in particular due to the lack of objective data concerning their behavior in the presence of faults. This paper addresses this issue by describing a prototype environment, called MAFALDA (Microkernel Assessment by Fault injection AnaLysis and Design Aid), that is aimed at providing objective failure data on a candidate microkernel and also improving its error detection capabilities. The paper first presents the overall architecture of MAFALDA. Then, a case study carried out on an instance of the Chorus microkernel is used to illustrate the benefits that can be obtained with MAFALDA both from the dependability assessment and design-aid viewpoints. Implementation issues are also addressed that account for the specific API of the target microkernel. Some overall insights and lessons learned, gained during the various studies conducted on both Chorus and another target microkernel (LynxOS), are then depicted and discussed. Finally, we conclude the paper by summarizing the main features of the work presented and by identifying future research.
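The flavor of such API-level fault injection can be sketched as follows: corrupt one bit of a call parameter before invoking the target service and observe what error, if any, is reported. The target here is an ordinary write() rather than a Chorus or LynxOS primitive, and the whole example is illustrative rather than MAFALDA's actual injector.

/* Sketch of parameter-level fault injection: flip one bit of a system-call
 * argument and record the reported outcome. A real campaign would drive
 * many such injections against microkernel primitives and log the results. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long flip_bit(long value, unsigned bit)
{
    return value ^ (1L << bit);              /* single-bit corruption */
}

int main(void)
{
    const char msg[] = "hello\n";
    size_t len = sizeof msg - 1;

    size_t faulty_len = (size_t)flip_bit((long)len, 30);  /* corrupt the length argument */
    errno = 0;
    ssize_t r = write(1, msg, faulty_len);
    printf("write with corrupted length -> %zd, errno=%s\n", r, strerror(errno));
    return 0;
}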
Article
In this paper we present compiler-assisted checkpointing, a new technique which uses static program analysis to optimize the performance of checkpointing. We achieve this performance gain using libckpt, a checkpointing library which implements memory exclusion in the context of user-directed checkpointing. The correctness of user-directed checkpointing is dependent on program analysis and insertion of memory exclusion calls by the programmer. With compiler-assisted checkpointing, this analysis is automated by a compiler or preprocessor. The resulting memory exclusion calls will optimize the performance of checkpointing, and are guaranteed to be correct. We provide a full description of our program analysis techniques and present detailed examples of analyzing three Fortran programs. The results of these analyses have been implemented in libckpt, and we present the performance improvements that they yield.
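A minimal sketch of memory exclusion in the user-directed style is shown below. The call names (exclude_bytes, checkpoint_here) follow the libckpt interface described in the paper, but the exact prototypes and the usage flag are assumptions, and the scratch-buffer example is invented; compiler-assisted checkpointing would insert such calls automatically.

/* Sketch of user-directed memory exclusion: a large scratch buffer that is
 * dead at the checkpoint is excluded, so it is never written to the image.
 * Prototypes below are assumed, not taken from a specific libckpt release. */
#include <stdlib.h>

extern void exclude_bytes(char *addr, int size, int usage);  /* assumed prototype */
extern int  checkpoint_here(void);
#define CKPT_DEAD 0   /* placeholder for the "dead at checkpoint" usage flag */

#define SCRATCH_SZ (64 * 1024 * 1024)

static void solver_step(char *scratch)
{
    scratch[0] ^= 1;                       /* stand-in for real per-iteration work */
}

int main(void)
{
    char *scratch = malloc(SCRATCH_SZ);

    for (int iter = 0; iter < 100; iter++) {
        solver_step(scratch);                          /* scratch is recomputed each step  */
        exclude_bytes(scratch, SCRATCH_SZ, CKPT_DEAD); /* dead data: skip it in the image  */
        checkpoint_here();                             /* user-directed checkpoint          */
    }
    free(scratch);
    return 0;
}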
Article
Introduction. Operating systems have become one of the most hotly contested battlegrounds for "open system standards". Various national, international and industry groups are attempting to define, implement and ultimately convince users to buy new "open" computing environments. Most of these efforts have centered around versions of the UNIX [9] operating system, but there is no consensus among industrial groups as to which "version" of UNIX is ultimately the correct basis for an open system standard. Already two major industrial organizations, the Open Software Foundation and U... Ballooning software development costs translate into increased cost for the user and delays in the introduction of new features and new applications. 2. An Alternative Approach to OS Organization. An alternative approach to building an entire operating system is to separate those parts of the operating system which control the basic hardware resources -- often called the...
Article
The CHORUS technology has been designed for building new generations of open, distributed, scalable operating systems. CHORUS has the following main characteristics: a communication-based architecture, relying on a minimal Nucleus which integrates distributed processing and communication at the lowest level, and which implements generic services used by a set of subsystem servers to extend standard operating system interfaces (a UNIX subsystem has been developed; other subsystems such as object-oriented systems are planned); a real-time Nucleus providing real-time services which are accessible to system programmers; and a modular architecture providing scalability and allowing, in particular, dynamic configuration of the system and its applications over a wide range of hardware and network configurations, including parallel and multiprocessor systems. CHORUS-V3 is the current version of the CHORUS Distributed Operating System, developed by Chorus systemes. Earlier versions were st...
Article
We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previous studies that consider errors found by manual inspection of logs, testing, and surveys because static analysis is applied uniformly to the entire kernel source, though our approach necessarily considers a less comprehensive variety of errors than previous studies. In addition, automation allows us to track errors over multiple versions of the kernel source to estimate how long errors remain in the system before they are fixed. We found that device drivers have error rates up to three to seven times higher than the rest of the kernel. We found that the largest quartile of functions have error rates two to six times higher than the smallest quartile. We found that the newest quartile of files have error rates up to twice that of the oldest quartile, which provides evidence that code "hardens" over time. Finally, we found that bugs remain in the Linux kernel an average of 1.8 years before being fixed.
Article
Checkpointing is a technique used for many purposes, including, but not limited to, assistance in fault-tolerant applications, rollback transactions and migration. Many tools have been proposed in the past to help solve these problems. But in the field of migration there is still a gap, because either: (a) the tool is limited to some kind of parallel programming library (PVM, MPI), (b) the size of the checkpoint image is too big to be worth migrating, (c) the checkpoint is limited to some well-behaved applications, or (d) the checkpoint image must be saved to file or sent to centralized servers instead of going directly to the target machine's memory. We developed a new tool called Epckpt that can fill this gap in process migration. Epckpt can: (a) checkpoint almost all kinds of applications independent of their behavior, (b) limit the size of the application's image to its minimum, (c) checkpoint fork-parallel applications, (d) checkpoint an application that was not meant for being ch...
Article
With the advent of the Hyper Text Transfer Protocol (HTTP) it was just a matter of time before its commercial use would be evident. Performance testing of different hardware platforms as well as different implementations of HTTP has made it necessary to create a new benchmark that allows a customer to easily understand the performance characteristics of different vendors. WebSTONE, a web-serving benchmark, has been developed in an attempt to better understand the performance characteristics of both hardware and software. The following paper describes the benchmark in technical detail and the issues involved in developing it. This benchmark was developed because there is currently no other way of testing the application. It is intended for free distribution, both this white paper and the code, and it is the intent for this benchmark to grow and better help test system performance for future systems and HTTP implementations.
Checking the NMI watchdogs
  • D Bovet
  • M Cesati
D. Bovet and M. Cesati, "Checking the NMI watchdogs," in Understanding the Linux Kernel, 3rd ed.: O'Reilly & Associates, Inc., 2005, ch. 6.4.2.
Open, secure, scalable, reliable UNIX for POWER5 servers
  • Ibm
  • Corporation
IBM Corporation, " Open, secure, scalable, reliable UNIX for POWER5 servers, " AIX 5L for POWER Version 5.3, 2004.
Fig. 9. Address space of a process in Linux.
1) Periodically initiate a checkpoint by setting a chkpt_started flag to 1.
2) Check whether there is a pending I/O operation, signal, or IPC message in the target process. If TRUE, go to sleep; otherwise, go to step 6.
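Read as pseudocode, the surviving steps amount to a deferral loop: the checkpoint proceeds only once the target process has no pending I/O, signals, or IPC. A loose C sketch with invented helper names:

/* Loose sketch of the checkpoint-initiation steps quoted above: set the
 * chkpt_started flag, then defer the checkpoint while the target process
 * still has pending I/O, signals, or IPC messages. The helper functions
 * and the task_state type are invented for illustration. */
struct task_state;                     /* opaque handle on the target process */

extern int  has_pending_io(struct task_state *t);
extern int  has_pending_signal(struct task_state *t);
extern int  has_pending_ipc(struct task_state *t);
extern void sleep_until_quiescent(struct task_state *t);
extern void save_address_space(struct task_state *t);   /* "step 6" onward */

static volatile int chkpt_started;

void periodic_checkpoint(struct task_state *t)
{
    chkpt_started = 1;                               /* step 1 */

    while (has_pending_io(t) || has_pending_signal(t) || has_pending_ipc(t))
        sleep_until_quiescent(t);                    /* step 2: wait for quiescence */

    save_address_space(t);                           /* continue with the checkpoint */
    chkpt_started = 0;
}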
Clock and timer circuits
  • D Bovet
  • M Cesati
D. Bovet and M. Cesati, "Clock and timer circuits," in Understanding the Linux Kernel, 3rd ed.: O'Reilly & Associates, Inc., 2005.
MetaKernels and fault containment wrappers
  • F Salles
  • M Rodriguez
  • J.-C Fabre
  • J Arlat