Article

Analyze-NOW - An environment for collection & analysis of failures in a network of workstations


Abstract

This paper describes Analyze-NOW, an environment for the collection and analysis of failures/errors in a network of workstations. Descriptions cover the data collection methodology and the tool implemented to facilitate this process. The software tools used for analysis are described, with emphasis on the implementation of the Analyzer, the primary analysis tool. Application of the tools is demonstrated by using them to collect and analyze failure data, gathered over a 32-week period, from a network of 69 SunOS-based workstations. Classification based on the source and effect of faults is used to identify problem areas, and the different types of failures encountered on the machines and the network are highlighted to develop a proper understanding of failures in a network environment. The results from the analysis tool can be used to pinpoint problem areas in the network. The results obtained by applying Analyze-NOW to failure data from the monitored network reveal some interesting behavior. Nearly 70% of the failures were network-related, whereas disk errors were few; network-related failures accounted for 75% of all hard failures (failures that make a workstation unusable). Half of the network-related failures were due to servers not responding to clients, and the rest were performance-related and other problems. Problem areas in the network were identified using the tool, and the authors' approach was compared to the method of using the network architecture to locate problem areas. This comparison showed that locating problem areas from the network architecture alone over-estimates their number.
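
The classification step the abstract describes, tallying failures by source and effect and separating out hard failures, reduces at its core to counting labeled failure records per category. The sketch below is a minimal illustration of that bookkeeping in Python; the record layout, category names, and figures are invented for the example and are not Analyze-NOW's actual data model.

    # Minimal tally of failures by category, in the spirit of the source/effect
    # classification described above. Record layout and categories are invented.
    from collections import Counter

    failures = [
        # (machine, category, is_hard_failure)
        ("ws01", "network", True),
        ("ws02", "network", False),
        ("ws03", "disk", True),
        ("ws04", "software", False),
        ("ws01", "network", True),
    ]

    by_category = Counter(cat for _, cat, _ in failures)
    hard_by_category = Counter(cat for _, cat, hard in failures if hard)
    total = sum(by_category.values())
    hard_total = sum(hard_by_category.values())

    for cat, n in by_category.most_common():
        share = 100.0 * n / total
        hard_share = 100.0 * hard_by_category[cat] / hard_total if hard_total else 0.0
        print(f"{cat:10s} {n:3d} ({share:5.1f}% of all failures, "
              f"{hard_share:5.1f}% of hard failures)")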


... The lack of real data collected from networked distributed systems is one of the reasons for the lack of published results on dependability analysis and modelling of interconnected systems. Examples of such data are reported in [16], where event logs collected from a network of 69 Sun workstations monitored over a period of 32 weeks are analysed, and in [8], where availability analyses are carried out based on event logs collected from 70 Windows NT mail servers. ...
... The identification of events corresponding to errors and the definition of error classification criteria according to the origin of the error or its severity requires a thorough manual analysis of log files. Some examples of classification criteria are presented in [16]. In this paper, we focus on the identification of machine reboots from the event logs and the evaluation of statistics characterizing: a) the distribution of reboots (per machine, time), b) the distribution of uptimes and downtimes associated to these reboots, c) the availability of machines including workstations and servers, and d) error dependencies between clients and servers. ...
... However, these dependencies will not be activated if the clients do not access the servers upon the occurrence of server failures. Similar results were observed in the experimental study presented in [16]. ...
Conference Paper
Full-text available
This paper presents a measurement-based availability study of networked Unix systems, based on data collected during 11 months from 298 workstations and servers interconnected through a local area computing network. The data corresponds to event logs recorded by the Unix operating system via the Syslogd daemon. Our study focuses on the identification of machine reboots and the evaluation of statistical measures characterizing: (a) the distribution of reboots (per machine, time), (b) the distribution of uptimes and downtimes associated to these reboots, (c) the availability of machines including workstations and servers, and (d) error dependencies between clients and servers.
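
The uptime/downtime and availability measures mentioned in this abstract can be illustrated with a small calculation: once downtime intervals have been reconstructed around each reboot, availability over the monitoring window is one minus the downtime fraction. A minimal sketch follows, with invented timestamps and without the paper's actual reboot-identification step.

    # Availability over a monitoring window, given reconstructed downtime
    # intervals around reboots. Timestamps are invented for illustration.
    from datetime import datetime, timedelta

    monitoring_start = datetime(2024, 1, 1)
    monitoring_end = datetime(2024, 1, 31)

    # (last event before the outage, first event after the reboot)
    downtimes = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 12, 30)),
        (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 2, 45)),
    ]

    total = monitoring_end - monitoring_start
    down = sum((up - dn for dn, up in downtimes), timedelta())
    availability = 1.0 - down / total
    print(f"total downtime {down}, availability {availability:.5f}")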
... In past years, several software technologies have been developed that integrate state-of-the-art collection techniques to manipulate and model log data for log-based error analysis; for example, "MEADEP" [35], "NOW" [36], and "SEC" [37,38]. However, since log-based investigation is not supported by fully automated procedures, most of the processing burden falls on analysts who often have inadequate knowledge of the system. ...
... In addition, an error that activates multiple messages in the log requires considerable effort to group the entries related to the same error manifestation. Preprocessing tasks are crucial for accurate error analysis [6,22,27,36]. ...
... Computer system dependability analysis based on event logs has been the focus of several published papers [1,2,4,5,7,8,9]. Various types of systems have been studied (Tandem, VAX/VMS, Unix, Windows NT, Windows 2000, etc.) including mainframes and largely deployed commercial systems. ...
... Such task is tedious and requires the development of heuristics and predefined failure criteria. An example of such analysis is reported in [7]. ...
Conference Paper
Full-text available
This paper presents a measurement-based availability assessment study using field data collected during a 4-year period from 373 SunOS/Solaris Unix workstations and servers interconnected through a local area network. We focus on the estimation of machine uptimes, downtimes and availability based on the identification of failures that caused total service loss. Data corresponds to syslogd event logs that contain a large amount of information about the normal activity of the studied systems as well as their behavior in the presence of failures. It is widely recognized that the information contained in such event logs might be incomplete or imperfect. The solution investigated in this paper to address this problem is based on the use of auxiliary sources of data obtained from wtmpx files maintained by the SunOS/Solaris Unix operating system. The results obtained suggest that the combined use of wtmpx and syslogd log files provides more complete information on the state of the target systems that is useful to provide availability estimations that better reflect reality.
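
The idea of combining wtmpx and syslogd data amounts to merging reboot indications from two imperfect sources and deduplicating events that both sources report. The sketch below illustrates such a merge under the assumption of a fixed tolerance window; the timestamps, the tolerance value, and the pairing logic are illustrative only and not the paper's procedure.

    # Merging reboot indications from two log sources (e.g., syslogd messages
    # and wtmpx boot records); events closer than a tolerance are assumed to
    # be the same reboot. All values are invented.
    from datetime import datetime, timedelta

    syslog_boots = [datetime(2024, 1, 5, 12, 30), datetime(2024, 1, 20, 2, 45)]
    wtmpx_boots = [datetime(2024, 1, 5, 12, 31), datetime(2024, 1, 12, 8, 0)]
    tolerance = timedelta(minutes=5)

    merged = []
    for t in sorted(syslog_boots + wtmpx_boots):
        if not merged or t - merged[-1] > tolerance:
            merged.append(t)  # a reboot not yet seen in the merged timeline
    print(merged)             # wtmpx contributes the Jan 12 reboot syslog missed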
... Most of the previous work in measurement-based dependability evaluation was based on measurements made at either failure times [3, 10, 14] or at times an error was observed [11, 15, 20, 21]. Chillarege et. ...
... This comprehensive model takes into account both reactive recovery following a failure due to resource exhaustion and rejuvenation, and is used to derive optimal rejuvenation schedules. Other related work in measurement-based dependability evaluation is based on either measurements made at failure times [11] or at error observation times [42]. In our case, we monitor the system performance variables continuously since we are interested in trend estimation and, hence, in predicting the time to next failure and not in observing interfailure times or identifying error patterns. ...
Conference Paper
The phenomenon of software aging refers to the accumulation of errors during the execution of the software which eventually results in its crash/hang failure. A gradual performance degradation may also accompany software aging. Pro-active fault management techniques such as "software rejuvenation" (Y. Huang et al., 1995) may be used to counteract aging if it exists. We propose a methodology for detection and estimation of aging in the UNIX operating system. First, we present the design and implementation of an SNMP-based, distributed monitoring tool used to collect operating system resource usage and system activity data at regular intervals, from networked UNIX workstations. Statistical trend detection techniques are applied to this data to detect/validate the existence of aging. For quantifying the effect of aging in operating system resources, we propose a metric: "estimated time to exhaustion", which is calculated using well known slope estimation techniques. Although the distributed data collection tool is specific to UNIX, the statistical techniques can be used for detection and estimation of aging in other software as well.
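
The "estimated time to exhaustion" metric described above boils down to fitting a trend to a resource-usage series and extrapolating to the point where the resource runs out. The paper applies statistical slope estimators; the sketch below substitutes a plain least-squares fit and synthetic numbers, purely to illustrate the idea.

    # Trend fit and "estimated time to exhaustion" for a resource-usage series.
    # Synthetic data and a least-squares slope stand in for the paper's
    # slope estimation techniques.
    import numpy as np

    rng = np.random.default_rng(0)
    hours = np.arange(48, dtype=float)
    free_mem_mb = 900.0 - 6.0 * hours + rng.normal(0.0, 15.0, hours.size)

    slope, intercept = np.polyfit(hours, free_mem_mb, 1)
    if slope < 0:
        tte = -intercept / slope   # hours until the fitted line reaches zero
        print(f"depletion ~{-slope:.1f} MB/h, estimated time to exhaustion ~{tte:.0f} h")
    else:
        print("no downward trend detected")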
... Computer system dependability analysis based on event logs has been the focus of several published papers [1,2,4,5,7,8,9]. Various types of systems have been studied (Tandem, VAX/VMS, Unix, Windows NT, Windows 2000, etc.) including mainframes and largely deployed commercial systems. ...
... Such task is tedious and requires the development of heuristics and predefined failure criteria. An example of such analysis is reported in [7]. ...
Article
Full-text available
This paper presents a measurement-based availability assessment study using field data collected during a 4-year period from 373 SunOS/Solaris Unix workstations and servers interconnected through a local area network. We focus on the estimation of machine uptimes, downtimes and availability based on the identification of failures that caused total service loss. Data corresponds to syslogd event logs that contain a large amount of information about the normal activity of the studied systems as well as their behavior in the presence of failures. It is widely recognized that the information contained in such event logs might be incomplete or imperfect. The solution investigated in this paper to address this problem is based on the use of auxiliary sources of data obtained from wtmpx files maintained by the SunOS/Solaris Unix operating system. The results obtained suggest that the combined use of wtmpx and syslogd log files provides more complete information on the state of the target systems that is useful to provide availability estimations that better reflect reality.
... In the past, many software technologies [10], [11] have been developed to support "log-based defect analysis" and the integration of modern capture techniques for the processing and modeling of historical data, such as "MEADEP" [12], "Analyze NOW" [13] and "SEC" [14]. However, "log-based analysis" is not sustained by completely automated practices, so most analysis campaigns rely on analysts with often limited system knowledge. ...
Article
Full-text available
Software development is a multitasking activity carried out by an individual or a team. Each activity involves diverse tasks and complications. To accomplish quality improvement, it is essential to make every activity task free of defects, but locating and correcting defects is expensive and time-intensive. In the past, many methods have been used to predict potential drawbacks in a program based on probability theory. Because a probability method applies random variables and probability distributions to find a solution, the result always lies in a possible range that may be true at some times and wrong at others. Therefore, an additional calculation method is coupled with the probability method to make defect prediction more accurate. In this paper, we propose a Probabilistic and Deterministic based Defect Prediction (PD-DP) through Defect Association Learning (DAL). The PD-DP implements a Probability Association Method (PAM) and a Deterministic Association Method (DAM) to predict software defects accurately during software development. The experimental evaluation of PD-DP, compared to existing prediction methods, shows an enhancement in prediction accuracy.
... On the other hand, log-based techniques consist in analyzing the log files produced by the component, if available [11,12]. In fact, these may contain many useful data to comprehend the system dependability behavior, as shown by many studies on the field [13,14]. Moreover, logs are the only direct source of information available in the case of OTS items. ...
... Table 2 shows the references per tag in the "Type of Model" category, and Figure 3 lists the references tagged as activations, failures, and meta. In the following plots, we therefore visualize tag combinations using a correspondence value. The correspondence value is our measure of correlation between two tags. ...
Article
Full-text available
The software engineering field has a long history of classifying software failure causes. Understanding them is paramount for fault injection, focusing testing efforts or reliability prediction. Since software fails in manifold complex ways, a broad range of software failure cause models is meanwhile published in dependability literature. We present the results of a meta-study that classifies publications containing a software failure cause model in topic clusters. Our results structure the research field and can help to identify gaps. We applied the systematic mapping methodology for performing a repeatable analysis. We identified 156 papers presenting a model of software failure causes. Their examination confirms the assumption that a large number of the publications discusses source code defects only. Models of fault-activating state conditions and error states are rare. Research seems to be driven mainly by the need for better testing methods and code-based quality improvement. Other motivations such as online error detection are less frequently given. Mostly, the IEEE definitions or orthogonal defect classification is used as base terminology. The majority of use cases comes from web, safety- and security-critical applications.
... Failures in networks of workstations have been studied in [25]. A large number of outages are due to planned maintenance, software installation and configuration. ...
... Over the years, several software packages have emerged to support log-based failure analysis, integrating state-of-the-art techniques to collect, manipulate, and model the log data, e.g., MEADEP [25], Analyze NOW [26], and SEC [27], [28]. However, log-based analysis is not yet supported by fully-automated procedures, leaving most of the processing burden to log analysts, who often have a limited knowledge of the system. ...
Article
Event logs have been widely used over the last three decades to analyze the failure behavior of a variety of systems. Nevertheless, the implementation of the logging mechanism lacks a systematic approach and collected logs are often inaccurate at reporting software failures: This is a threat to the validity of log-based failure analysis. This paper analyzes the limitations of current logging mechanisms and proposes a rule-based approach to make logs effective to analyze software failures. The approach leverages artifacts produced at system design time and puts forth a set of rules to formalize the placement of the logging instructions within the source code. The validity of the approach, with respect to traditional logging mechanisms, is shown by means of around 12,500 software fault injection experiments into real-world systems.
... It consists of 4 software modules: a data preprocessor for converting data in various formats to the MEADEP format, a data analyzer for graphical data-presentation and parameter estimation, a graphical modeling interface for building block diagrams (including the exponential block, Weibull block, and k-out-of-n block) and Markov reward chains, and a model-solution module for availability/reliability calculations with graphical parametric analysis. Analyze NOW [9] is a set of tools specifically tailored for the FFDA of networks of workstations. It embodies tools for the automated data collection from all the workstations, and tools for the analysis of such data. ...
Article
Full-text available
Field failure data play a key role in complex distributed systems, as they often represent the only available source of information useful to control the dependability level of the system. However, the analysis of these data can be compromised by several factors, such as log heterogeneity and inaccuracy, which increase the level of distrust in logs and make it difficult to compare different analyses to provide general results. The paper proposes a framework to overcome these limitations, based on three key aspects: (i) the use of an "accurate enough" model of the system in hand, (ii) the definition of common logging rules, to be used at design and development time to enhance accuracy, and (iii) the design of a logging platform to orchestrate the collection and analysis processes. A case study of the proposed framework is presented, in the context of a real-world complex system.
... Whereas in [5] only time based trend detection and estimation of resource exhaustion are considered, we also take the system workload into account for building our model. Other previous work in measurement-based dependability evaluation is based on either measurements made at failure times [1, 12, 15] or at error observation times [13, 23, 24]. In our case, we need to monitor the system parameters continuously since we are interested in trend estimation and not in inter-failure times or identifying error patterns. ...
... Analyze NOW [104] is a set of tools specifically tailored for the FFDA of networks of workstations. It embodies tools for the automated data collection from all the workstations, and tools for the analysis of such data. ...
... Query-based techniques rely on the presence of an external module, which is in charge of querying the monitored component, to discover latent errors [6]. Log-based techniques consist in examining the log files produced by the components, in order to figure out the system behavior by correlating different events [7]. ...
Conference Paper
Software systems employed in critical scenarios are increasingly large and complex. The usage of many heterogeneous components causes complex interdependencies, and introduces sources of non-determinism, that often lead to the activation of subtle faults. Such behaviors, due to their complex triggering patterns, typically escape the testing phase. Effective on-line monitoring is the only way to detect them and to promptly react in order to avoid more serious consequences. In this paper, we propose an error detection framework to cope with software failures, which combines multiple sources of data gathered both at application-level and OS-level. The framework is evaluated through a fault injection campaign on a complex system from the Air Traffic Management (ATM) domain. Results show that the combination of several monitors is effective to detect errors in terms of false alarms, precision and recall.
... An example is MEADEP [18], which consists of four software modules, i.e., a data preprocessor for converting data in various formats to the MEADEP format, a data analyzer for graphical data presentation and parameter estimation, a graphical modeling interface for building block diagrams, e.g., Weibull and k-out-of-n block, and Markov reward chains, and a model-solution module for availability/reliability estimation with graphical parametric analysis. Analyze NOW [19] is a set of tools tailored for networks of workstations. It embodies tools for the automated data collection from all the workstations, and tools for automating the data analysis task. ...
Article
Full-text available
Field Failure Data Analysis (FFDA) is a widely adopted methodology to characterize the dependability behavior of a computing system. It is often based on the analysis of logs available in the system under study. However, current logs do not seem to be actually conceived to perform FFDA, since their production usually lacks a systematic approach and relies on developers' experience and attitude. As a result, collected logs may be heterogeneous, inaccurate and redundant. This, in turn, increases analysis efforts and reduces the quality of FFDA results. This paper proposes a rule-based logging framework, which aims to improve the quality of logged data and to make the analysis phase more effective. Our proposal is compared to traditional log analysis in the context of a real-world case study in the field of Air Traffic Control. We demonstrate that the adoption of a rule-based strategy makes it possible to significantly improve dependability evaluation by reducing the amount of information actually needed to perform the analysis, without affecting system performance.
... These studies contributed to achieving a significant understanding of the failure modes of these systems and, as in [15], made it possible to improve their successive releases. Despite the existence of software packages that integrate state-of-the-art techniques to collect, manipulate, and model the log data, e.g., [24], [22], FFDA also requires ad-hoc strategies and algorithms to identify failure-related entries in the log and to coalesce the entries related to the same problem. The criticality of these tasks, towards the objective of achieving accurate measurements, has been recognized since early work in this area, such as [10], [2]. ...
Conference Paper
Full-text available
Log-based Field Failure Data Analysis (FFDA) is a widely-adopted methodology to assess dependability properties of an operational system. A key step in FFDA is filtering out entries that are not useful and redundant error entries from the log. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both the heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.
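
The tuple heuristic that this paper refines can be stated compactly: consecutive error entries whose inter-arrival time falls below a fixed window are coalesced into one fault manifestation. A minimal version is sketched below with made-up timestamps; the statistical indicators the improved heuristic adds are not shown.

    # Basic tuple heuristic: coalesce error entries separated by less than a
    # fixed time window. Timestamps (seconds) are invented; the statistical
    # refinements proposed in the paper are not shown.
    def tuplize(timestamps, window_s=300):
        groups = []
        for t in sorted(timestamps):
            if groups and t - groups[-1][-1] <= window_s:
                groups[-1].append(t)   # same fault manifestation
            else:
                groups.append([t])     # start a new tuple
        return groups

    events = [0, 40, 90, 2000, 2100, 9000]
    print(tuplize(events))             # [[0, 40, 90], [2000, 2100], [9000]]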
... Our implementation of the rejuvenation agent involves monitoring and trending, which is similar to the above works. Other previous work in measurement-based dependability evaluation is based on either measurements made at failure times [2, 13, 15] or at error observation times [14, 22, 23]. In [17], system parameters are constantly monitored to detect anomalies in a network automatically. ...
Conference Paper
Several recent studies have reported the phenomenon of "software aging", one in which the state of a software system degrades with time. This may eventually lead to performance degradation of the software or crash/hang failure or both. "Software rejuvenation" is a pro-active technique aimed to prevent unexpected or unplanned outages due to aging. The basic idea is to stop the running software, clean its internal state and restart it. In this paper, we discuss software rejuvenation as applied to cluster systems. This is both an innovative and an efficient way to improve cluster system availability and productivity. Using Stochastic Reward Nets (SRNs), we model and analyze cluster systems which employ software rejuvenation. For our proposed time-based rejuvenation policy, we determine the optimal rejuvenation interval based on system availability and cost. We also introduce a new rejuvenation policy based on prediction and show that it can dramatically increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior and performability measures, which we are just beginning to explore. We then briefly describe an implementation of a software rejuvenation system that performs periodic and predictive rejuvenation, and show some empirical data from systems that exhibit aging
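
The trade-off behind a time-based rejuvenation interval — rejuvenate too often and planned downtime dominates, too rarely and aging-induced crashes dominate — can be illustrated without the paper's stochastic reward nets by the classic age-replacement cost-rate formula C(T) = [c_f·F(T) + c_r·(1 − F(T))] / E[min(X, T)]. The sketch below evaluates it for a Weibull time-to-failure; the distribution, the costs, and the candidate intervals are all assumptions, not the paper's model.

    # Age-replacement cost rate as a simplified stand-in for choosing a
    # time-based rejuvenation interval. Weibull parameters and costs are assumed.
    import numpy as np

    shape, scale = 2.0, 1000.0     # increasing failure rate (aging), hours
    c_f, c_r = 10.0, 1.0           # cost of an unplanned crash vs. a rejuvenation

    def cost_rate(T, n=10000):
        t = np.linspace(0.0, T, n)
        surv = np.exp(-(t / scale) ** shape)       # Weibull survival S(t)
        F_T = 1.0 - surv[-1]                       # probability of failing before T
        expected_cycle = np.sum(surv) * (t[1] - t[0])   # numeric integral of S(t)
        return (c_f * F_T + c_r * (1.0 - F_T)) / expected_cycle

    candidates = [100, 250, 500, 1000, 2000]
    print({T: round(cost_rate(T), 5) for T in candidates})
    print("best interval:", min(candidates, key=cost_rate))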
... Several studies have examined failures in networks of workstations. Thakur et al. [32] examined failures in a network of 69 SunOS workstations but only divided problem root cause into network, non-disk machine problems, and disk-related machine problems. Kalyanakrishnam et al. [15] studied six months of event logs from a LAN of Windows NT workstations used for mail delivery, to determine the causes of machines rebooting. ...
Conference Paper
We describe the architecture, operational practices, and failure characteristics of three very large-scale Internet services. Our research on architecture and operational practices took the form of interviews with architects and operations staff at those (and several other) services. Our research on component and service failure took the form of examining the operations problem tracking databases from two of the services and a log of service failure post-mortem reports from the third. Architecturally, we find convergence on a common structure: division of nodes into service front-ends and back-ends, multiple levels of redundancy and load-balancing, and use of custom-written software for both production services and administrative tools. Operationally, we find a thin line between service developers and operators, and a need to coordinate problem detection and repair across administrative domains. With respect to failures, we find that operator errors are their primary cause, operator error is the most difficult type of failure to mask, service front-ends are responsible for more problems than service back-ends but fewer minutes of unavailability, and that online testing and more thoroughly exposing and detecting component failures could reduce system failure rates for at least one service.
... While brown-outs at Internet services are pervasive, more serious outages also occur frequently, as noted in Table 2. Of the studies [53,54,77,89,121,124], there is only one that we know of that reports directly on causes of failures in Internet services [107]. In that report, Oppenheimer et al. study over a hundred post-mortem reports on user-visible failures at three different Internet services, and find that improved detection of application-level failures could have mitigated or avoided 65% of reported user-visible failures. ...
Article
Full-text available
Submitted to the Department of Computer Science. Copyright by the author. Thesis (Ph. D.)--Stanford University, 2005.
... Plenty of studies have focused on this topic, proposing techniques and methodologies for studying error logs collected from a variety of distributed systems. Examples are studies of Networks of Workstations [7], Windows NT operating systems [8], and, more recently, Large-Scale Heterogeneous Server Environments [9]. On the other hand, there is a growing number of works studying dependability issues of wireless and mobile infrastructures. ...
Conference Paper
Full-text available
The widespread use of mobile and wireless computing platforms is leading to a growing interest on dependability issues. Several research studies have been conducted on dependability of mobile environments, but none of them attempted to identify system bottlenecks and to quantify dependability measures. This paper proposes a distributed automated infrastructure for monitoring and collecting spontaneous failures of the Bluetooth infrastructure, which is nowadays more and more recognized as an enabler for mobile systems. Information sources for failure data are presented, and preliminary experimental results are discussed.
... The latter events are particularly useful for dependability analysis. Event-log-based dependability analysis of computer systems has been the focus of several research papers [1] [2] [3] [5] [8] [9]. While various types of systems have been studied, including mainframes and largely deployed commercial systems, only a few studies addressed Windows NT or Windows 2K systems [4] [6] [10]. ...
... Whereas in [5] only time based trend detection and estimation of resource exhaustion are considered, we also take the system workload into account for building our model. Other previous work in measurement-based dependability evaluation is based on either measurements made at failure times [1, 12, 15] or at error observation times [13, 23, 24]. In our case, we need to monitor the system parameters continuously since we are interested in trend estimation and not in inter-failure times or identifying error patterns. ...
Conference Paper
Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of “software aging”, in which the state of the software system degrades with time, has been reported (S. Garg et al., 1998). The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the software or crash/hang failure, or both. Earlier work in this area to detect aging and to estimate its effect on system resources did not take into account the system workload. In this paper, we propose a measurement-based model to estimate the rate of exhaustion of operating system resources both as a function of time and the system workload state. A semi-Markov reward model is constructed based on workload and resource usage data collected from the UNIX operating system. We first identify different workload states using statistical cluster analysis and build a state-space model. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource exhaustion in the different states. The model is then solved to obtain trends and the estimated exhaustion rates and the time-to-exhaustion for the resources. With the help of this measure, proactive fault management techniques such as “software rejuvenation” (Y. Huang et al., 1995) may be employed to prevent unexpected outages
... The study analyzed problems that resulted in panics and system crashes. Thakur [19] described a simple yet effective methodology for collecting and analyzing failures in a network of UNIX-based workstations. ...
Conference Paper
Full-text available
This paper presents results of a failure data analysis of a LAN of Windows NT machines. Data for the study was obtained from event logs collected over a six-month period from the mail routing network of a commercial organization. The study focuses on characterizing causes of machine reboots. The key observations from this study are: 1) most of the problems that lead to reboots are software related; 2) rebooting the machine does not always solve the problem; 3) there are indications of propagated or correlated failures; and 4) though the average availability evaluates to over 99%, the machine downtime lasts (on average) two hours. Since the machines are dedicated mail servers, bringing down one or more of them can potentially disrupt storage, forwarding, reception and delivery of mail. This suggests that the average availability is not a good measure to characterize this type of network service
... Other related work in measurement-based dependability evaluation is based on either measurements made at failure times [11] or at error observation times [42]. In our case, we monitor the system performance variables continuously since we are interested in trend estimation and, hence, in predicting the time to next failure and not in observing interfailure times or identifying error patterns. ...
Article
Full-text available
Recently, the phenomenon of software aging, one in which the state of the software system degrades with time, has been reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, estimate transition probabilities, and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
... Soft-error-induced network interface failures can be quite detrimental to the reliability of a distributed system. The failure data analysis reported in [6] indicates that network-related problems contributed to approximately 40 percent of the system failures observed in distributed environments. As we will see in the following sections, soft errors can cause the network interface to completely stop responding, function improperly, or greatly reduce network performance. ...
Article
Full-text available
Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures. Failure detection is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. Our failure recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact to the host system and be completely transparent to the user.
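
The software watchdog described above rests on a simple contract: the monitored component must signal progress periodically, and a missed deadline triggers recovery. The toy sketch below shows that contract with Python threads; the timeout, the callback, and the host-side placement are illustrative assumptions, not the paper's interface-firmware design.

    # Toy software watchdog: the monitored component must call pet() regularly;
    # a missed deadline fires the recovery callback. Values are illustrative.
    import threading

    class Watchdog:
        def __init__(self, timeout_s, on_hang):
            self.timeout_s = timeout_s
            self.on_hang = on_hang
            self._timer = None

        def pet(self):
            """Heartbeat from the monitored component; restarts the countdown."""
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout_s, self.on_hang)
            self._timer.daemon = True
            self._timer.start()

        def stop(self):
            if self._timer is not None:
                self._timer.cancel()

    wd = Watchdog(2.0, lambda: print("no heartbeat: reset the network interface"))
    wd.pet()   # in practice driven by observed progress of the interface
    wd.stop()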
... There are extensive studies on faults and failures in information systems in the reliability community, covering software reliability [5], classification of faults [5], [6], fault detection [7], [8], and fault-tolerant system design [9]- [12]. However, intrusions into information systems, which account for many real-world reliability incidents and QoS problems in information systems, have not been widely studied using theories and techniques of quality and reliability engineering. ...
Article
Reliability and quality of service from information systems has been threatened by cyber intrusions. To protect information systems from intrusions and thus assure reliability and quality of service, it is highly desirable to develop techniques that detect intrusions. Many intrusions manifest in anomalous changes in intensity of events occurring in information systems. In this study, we apply, test, and compare two EWMA techniques to detect anomalous changes in event intensity for intrusion detection: EWMA for autocorrelated data and EWMA for uncorrelated data. Different parameter settings and their effects on performance of these EWMA techniques are also investigated to provide guidelines for practical use of these techniques.
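
An EWMA chart on event intensity, as studied above, smooths the per-interval event count and raises an alarm when the smoothed statistic leaves its control limits. The sketch below uses the standard asymptotic limits for uncorrelated data with an invented count series and arbitrarily chosen lambda and L; it is not the paper's parameterization.

    # EWMA chart on per-interval event counts, using the standard asymptotic
    # control limits for uncorrelated data. Series, lambda and L are invented.
    import math

    counts = [12, 10, 11, 13, 12, 11, 40, 42, 39, 12, 11]
    lam, L = 0.2, 3.0

    baseline = counts[:5]
    mu = sum(baseline) / len(baseline)
    sigma = math.sqrt(sum((c - mu) ** 2 for c in baseline) / (len(baseline) - 1))
    width = L * sigma * math.sqrt(lam / (2.0 - lam))

    z = mu
    for i, c in enumerate(counts):
        z = lam * c + (1.0 - lam) * z
        if abs(z - mu) > width:
            print(f"interval {i}: EWMA {z:.1f} outside "
                  f"[{mu - width:.1f}, {mu + width:.1f}] -> possible anomaly")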
... The study analyzed problems that resulted in panics and system crashes. Thakur [22] described a simple yet effective methodology for collecting and analyzing failures in a network of UNIX-based workstations. The majority (68%) of the failures encountered were network related. ...
Article
This paper presents a measurement-based dependability study of a networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, (7) maintenance and configuration contribute to 24% of system downtime.
Chapter
High performance computing (HPC) systems are becoming the norm for daily use and care must be taken to ensure that these systems are resilient. Recent contributions on resiliency have been from quantitative and qualitative perspectives where general system failures are considered. However, there are limited contributions dealing with the specific classes of failures that are directly related to cyber-attacks. In this chapter, the author uses the concepts of transition processes and limiting distributions to perform a generic theoretical investigation of the effects of targeted failures by relating the actions of the cyberenemy (CE) to different risk levels in an HPC system. Special cases of constant attack strategies are considered where exact solutions are obtained. Additionally, a stopped process is introduced to model the effects of system termination. The results of this representation can be directly applied throughout the HPC community for monitoring and mitigating cyber-attacks.
Conference Paper
When a complex real-world application is deployed post-release, a number of crash reports are generated. As the number of clients using the product increases, so do the crash reports. Typically, the approach followed in many software organizations is to manually analyze a crash report to identify the erroneous module responsible for the crash. Naturally, when a large number of crash reports are generated daily, the development team requires a substantial amount of time to analyze all these reports. This in turn increases the turn-around time for crash report analysis which often leaves customers unhappy. In order to address this problem, we have developed an automated method to analyze a crash report and identify the erroneous module. This method is based on a novel algorithm that searches for exception-based patterns in crash reports and maps reference assemblies. We have applied this method to several thousand crash reports across four sub-systems of an industrial automation application. Results indicate that the algorithm not only achieves a high accuracy in finding the erroneous module and subsystem behind a crash, but also significantly reduces the turn-around time for crash report analysis.
Conference Paper
This paper proposes a heuristic to improve the analysis of supercomputer error logs. The heuristic is able to estimate the measurement error induced by the clustering of error events and consequently drive the analysis. The goal is to reduce errors induced by the clustering and to estimate how much they affect the measurements. The heuristic is validated against 40 synthetic datasets, for different systems ranging from 16k to 256k nodes under different failure assumptions. We show that (i) to accurately analyze the complex failure behavior of large computing systems, multiple time windows need to be adopted at the granularity of node subsystems, e.g. memory and I/O, and (ii) for large systems, the classical single-time-window analysis can overestimate the MTBF by more than 150%, while the proposed heuristic can decrease the measurement error by one order of magnitude.
Article
The link load balancing technique has been proved in theory to be able to improve system availability greatly. This paper presents a dynamic link load balancing algorithm which can avoid failed paths. The new algorithm obtains the length of the link waiting queue and the response time, which helps distribute all data traffic loads across multiple links and increases link bandwidth usage. In addition, the availability of the current links can be managed in real time, the currently disconnected link can be shielded, and the mean time to failure (MTTF) can be extended; therefore system availability greatly improves. Finally, the simulation results verify the effectiveness of this algorithm.
Chapter
Failure analysis is valuable to dependability engineers because it supports designing effective mitigation means, defining strategies to reduce maintenance costs, and improving system service. Event logs, which contain textual information about regular and anomalous events detected by the system under real workload conditions, represent a key source of data to conduct failure analysis. So far, event logs have been successfully used in a variety of domains. This chapter describes methodology and well-established techniques underlying log-based failure analysis. Description introduces the workflow leading to analysis results starting from the raw data in the log. Moreover, the chapter surveys relevant works in the area with the aim of highlighting main objectives and applications of log-based failure analysis. Discussion reveals benefits and limitations of logs for evaluating complex systems.
Article
This paper presents a novel approach to assess time coalescence techniques. These techniques are widely used to reconstruct the failure process of a system and to estimate dependability measurements from its event logs. The approach is based on the use of automatically generated logs, accompanied by the exact knowledge of the ground truth on the failure process. The assessment is conducted by comparing the presumed failure process, reconstructed via coalescence, with the ground truth. We focus on supercomputer logs, due to the increasing importance of automatic event log analysis for these systems. Experimental results show how the approach makes it possible to compare different time coalescence techniques and to identify their weaknesses with respect to given system settings. In addition, the results revealed an interesting correlation between errors caused by the coalescence and errors in the estimation of dependability measurements.
Article
Reliability is a rapidly growing concern in the contemporary Personal Computer (PC) industry, both for computer users and for product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we analyze crash data from Windows machines. We collected our data from two different sources – the UC Berkeley EECS department and a population of volunteers who contribute to the BOINC project. We study both application crash behavior and operating system crashes. We found that application crashes are caused both by faulty, non-robust dll files and by impatient users who prematurely terminate non-responding applications, especially web browsers. OS crashes are predominantly caused by poorly-written device driver code. Users as well as product developers will benefit from understanding the crash behaviors and crash-prevention techniques we have revealed in this paper.
Article
Full-text available
On-line failure detection is an essential means to control and assess the dependability of complex and critical software systems. In such a context, effective detection strategies are required in order to minimize the possibility of catastrophic consequences. This objective is however difficult to achieve in complex systems, especially due to the several sources of non-determinism (e.g., multi-threading and distributed interaction) which may lead to software hangs, i.e., the system is active but no longer capable of delivering its services. The paper proposes a detection approach to uncover application hangs. It exploits multiple indirect data gathered at the operating system level to monitor the system and to trigger alarms if the observed behavior deviates from the expected one. By means of fault injection experiments conducted on a research prototype, it is shown how the combination of several operating system monitors actually leads to a high quality of detection, at an acceptable overhead.
Conference Paper
Since around the 1980s, researchers and software engineers dealing with the dependability of systems and software products have recognized the crucial role of field data. Field data represent an attractive way of increasing the efficiency of testing activities and, once the product has been delivered, of improving the quality of subsequent releases. On the other hand, plenty of research studies have been conducted on techniques and methods for field data measurement of operational systems: from data filtering and analysis techniques to dependability measurements and improvements. Field data also represent a good opportunity for understanding sources of failures. Examples are provided in Siewiorek et al. (2004), such as the one reporting failure data collected on the public switched telephone network (PSTN), which emphasizes how failures in systems are due to the environment and operators. Despite these efforts, there is an increasing gap between software practitioners and researchers involved in the development of accurate and efficient methodologies for field data measurement campaigns. This paper tries to shed some light on this problem, which is a sort of "hide and seek game" of real-world field data between industries and researchers. Here, the author tries to emphasize the importance of real collaboration between industries and the research community toward the definition of an effective "design for dependability evaluation" methodology.
Conference Paper
Full-text available
Striping data across multiple nodes has been recognized as an effective technique for delivering high-bandwidth I/O to applications running on clusters. However, the technique is vulnerable to disk failure. We present an I/O architecture for clusters called reliable array of autonomous controllers (RAAC) that builds on the technique of RAID-style data redundancy. The RAAC architecture uses a two-tier layout that enables the system to scale in terms of storage capacity and transfer bandwidth while avoiding the synchronization overhead incurred in a distributed RAID system. We describe our implementation of RAAC in PVFS, and compare the performance of parity-based redundancy in RAAC and in a conventional distributed RAID architecture.
Conference Paper
The dependability of a system can be experimentally evaluated at different phases of its life cycle. In the design phase, computer-aided design (CAD) environments are used to evaluate the design via simulation, including simulated fault injection. Such fault injection tests the effectiveness of fault-tolerant mechanisms and evaluates system dependability, providing timely feedback to system designers. Simulation, however, requires accurate input parameters and validation of output results. Although the parameter estimates can be obtained from past measurements, this is often complicated by design and technology changes. In the prototype phase, the system runs under controlled workload conditions. In this stage, controlled physical fault injection is used to evaluate the system behavior under faults, including the detection coverage and the recovery capability of various fault tolerance mechanisms. Fault injection on the real system can provide information about the failure process, from fault occurrence to system recovery, including error latency, propagation, detection, and recovery (which may involve reconfiguration). But this type of fault injection can only study artificial faults; it cannot provide certain important dependability measures, such as mean time between failures (MTBF) and availability. In the operational phase, a direct measurement-based approach can be used to measure systems in the field under real workloads. The collected data contain a large amount of information about naturally occurring errors/failures.
Conference Paper
Machine Check Architecture (MCA) is a processor internal architecture subsystem that detects and logs correctable and uncorrectable errors in the data or control paths in each CPU core and the Northbridge. These errors include parity errors associated with caches, TLBs, ECC errors associated with caches and DRAM, and system bus errors. This paper reports on an experimental study on: (i) monitoring a computing cluster for machine checks and using this data to identify patterns that can be employed for error diagnostics and (ii) introducing faults into the machine to understand the resulting machine checks and correlate this data with relevant performance metrics.
Conference Paper
PC users have started viewing crashes as a fact of life rather than a problem. To improve operating system dependability, systems designers and programmers must analyze and understand failure data. In this paper, we analyze Windows XP kernel crash data collected from a population of volunteers who contribute to the Berkeley Open Infrastructure for Network Computing (BOINC) project. We found that OS crashes are predominantly caused by poorly-written device driver code. Users as well as product developers will benefit from understanding the crash behaviors elaborated in this paper.
Conference Paper
Mobile Peer-to-Peer (P2P) is a base paradigm for many new killer applications for mobile ad-hoc networks and the Mobile Internet. Currently, it is not well understood whether this paradigm is able to meet business and consumer dependability expectations. Dependability assessment of P2P applications can be achieved by field failure data analysis. The collection of failure data from wireless ad-hoc networks is a challenging task due to the intermittent usage and the mobility of users that do not allow to measure time-based dependability parameters. For this reason, we propose to deploy automated workloads on the actual peer nodes which have to operate continuously. Specifically, this paper formalizes the problem and presents the design of a workload for mobile P2P that aims to orchestrate the peers uniformly, letting the failure occurrence be independent of the network load. Simulation results and experimentation over an actual Bluetooth network demonstrate that the proposed workload meets the defined requirements.
Article
In response to the strong desire of customers to be provided with advance notice of unplanned outages, techniques were developed that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system. The resulting techniques are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures.
Conference Paper
This work presents a measurement-based dependability evaluation of the Bluetooth data communication channel, i.e., the Baseband layer. The main contribution is the definition of the Baseband's error/recovery model according to the Markov chains formalism. The model is derived by analyzing field data, which are collected via a commercial air sniffer deployed over real-world Bluetooth piconets. The model is parametric and actual values for its parameters are estimated by analyzing the field data. The paper also proposes the evaluation of dependability statistics (e.g., the error and failure times distributions, and the availability estimate), and the study of the failing behavior of the Bluetooth communication channel under Wi-Fi interferences.
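
The error/recovery model mentioned above can be miniaturized to a two-state continuous-time Markov chain whose steady-state probability of the OK state is the availability estimate. The sketch below solves that reduced chain; the Baseband model in the paper has more states, and the rates here are invented.

    # Two-state OK/ERROR continuous-time Markov chain; steady-state probability
    # of OK is the availability estimate. Rates are invented, and the paper's
    # Baseband model has more states than this.
    import numpy as np

    error_rate = 0.02      # OK -> ERROR, per second
    recovery_rate = 1.5    # ERROR -> OK, per second

    Q = np.array([[-error_rate, error_rate],
                  [recovery_rate, -recovery_rate]])

    # Solve pi Q = 0 with sum(pi) = 1 as a least-squares system.
    A = np.vstack([Q.T, np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(f"steady-state availability ~ {pi[0]:.4f}")   # ~ recovery/(error+recovery)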
Conference Paper
This work presents a failure data analysis campaign on Bluetooth personal area networks (PANs) conducted on two kinds of heterogeneous testbeds (working for more than one year). The obtained results reveal how the failure distribution is characterized and suggest how to improve the dependability of Bluetooth PANs. Specifically, we define the failure model and then identify the most effective recovery actions and masking strategies that can be adopted for each failure. We then integrate the discovered recovery actions and masking strategies into our testbeds, improving availability by 3.64% (up to 36.6%) and reliability (in terms of mean time to failure) by 202%, respectively.
Conference Paper
The increasing complexity of mobile phones directly affects their reliability, while users' tolerance for failures is decreasing, especially when the phone is used for business- or mission-critical applications. Despite these concerns, there is still little understanding of how and why these devices fail, and no techniques have been defined to gather useful information about failure manifestation from the phone. This paper presents the design of a logger application to collect failure-related information from mobile phones. Preliminary failure data collected from real-world mobile phones confirm that the proposed logger is a useful instrument to gain knowledge about the dynamics and causes of mobile phone failures.
Conference Paper
This paper presents a measurement-based dependability study of a networked Windows NT system based on field data collected from NT system logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, and (7) maintenance and configuration contribute 24% of system downtime.
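As an illustration of how per-server availability figures and downtime breakdowns of this kind can be derived from reboot records (the field names below are illustrative, not the NT event-log schema):

from collections import defaultdict

def summarize(reboots, observation_hours):
    """`reboots` is a list of dicts with illustrative fields:
    {'server': str, 'downtime_h': float, 'cause': str}.
    Returns per-server availability and the share of total downtime per cause."""
    down_per_server = defaultdict(float)
    down_per_cause = defaultdict(float)
    for r in reboots:
        down_per_server[r["server"]] += r["downtime_h"]
        down_per_cause[r["cause"]] += r["downtime_h"]
    availability = {s: 1.0 - d / observation_hours for s, d in down_per_server.items()}
    total_down = sum(down_per_cause.values())
    cause_share = {c: d / total_down for c, d in down_per_cause.items()}
    return availability, cause_share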
Conference Paper
The paper presents an injection-based approach to analyze the dependability of high-speed networks, using Myrinet as an example testbed. Instead of injecting faults related to network protocols, the authors injected faults into the host interface component, which performs the actual send and receive operations. The fault model used was a temporary single bit flip in an instruction executing on the host interface's custom processor, corresponding to a transient fault in the processor itself. Results show that more than 25% of the injected faults resulted in interface failures. Furthermore, the authors observed fault propagation from an interface to its host computer or to another interface to which it sent a message. These findings suggest that two important issues for high-speed networking in critical applications are protecting the host computer from errant or malicious interface components and implementing thorough message acceptance test mechanisms to prevent errant messages from propagating faults between interfaces.
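A minimal sketch of the stated fault model, a transient single-bit flip in an instruction word (assumed 32 bits wide here; the study injected into the Myrinet interface's custom processor, which is not modeled below):

import random

def flip_one_bit(instruction_word, width=32):
    """Transient single-bit-flip fault model: invert one randomly chosen
    bit of an instruction word (assumed `width` bits wide)."""
    bit = random.randrange(width)
    return instruction_word ^ (1 << bit)

original = 0x8C220004  # an arbitrary 32-bit instruction word, purely illustrative
corrupted = flip_one_bit(original)
print(f"{original:#010x} -> {corrupted:#010x}")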
Article
Full-text available
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
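As a concrete example of one of the statistical techniques mentioned (parameter and confidence-interval estimation), a sketch assuming exponentially distributed interfailure times, for which a chi-square interval on the mean applies; the sample data are illustrative.

from scipy.stats import chi2

def mtbf_confidence_interval(interfailure_times, confidence=0.90):
    """Point estimate and two-sided confidence interval for the MTBF,
    assuming exponentially distributed times between failures."""
    n = len(interfailure_times)
    total = sum(interfailure_times)
    mtbf = total / n
    alpha = 1.0 - confidence
    lower = 2 * total / chi2.ppf(1 - alpha / 2, 2 * n)
    upper = 2 * total / chi2.ppf(alpha / 2, 2 * n)
    return mtbf, (lower, upper)

times_h = [120.0, 75.0, 210.0, 98.0, 160.0, 55.0, 300.0, 130.0]
print(mtbf_confidence_interval(times_h))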
Conference Paper
Full-text available
In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is measuring the customer-perceived failure rate of commercial software. Unfortunately, even order-of-magnitude measures of failure rate are not truly available for commercial software that is widely distributed. Given repeated reports on the criticality and significance of software, the industry still lacks real baselines. The paper reports the failure rate of a commercial software product of several million lines of code distributed to hundreds of thousands of customers. To a first order of approximation, the MTBF reaches around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release, and time are also provided. The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, corresponding to the number of failures due to a fault, and the failure window, measuring the length of time between the first and last failure attributed to that fault. Both metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant with severity. The fault weight and failure window are natural, intuitive measures of the failure process: the fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new basis for discussion and an opportunity to gain greater understanding of the processes involved.
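A sketch of how the two metrics could be computed from failure records tagged with the fault each failure was traced to; the record layout is illustrative, not the paper's data format.

from collections import defaultdict

def fault_weight_and_window(failures):
    """`failures` is a list of (fault_id, failure_date) pairs, dates as floats
    (e.g. days since release). Fault weight = number of failures attributed
    to the fault; failure window = time between the first and last of them."""
    dates = defaultdict(list)
    for fault_id, when in failures:
        dates[fault_id].append(when)
    return {f: {"weight": len(ds), "window": max(ds) - min(ds)}
            for f, ds in dates.items()}

records = [("F1", 10.0), ("F1", 34.0), ("F1", 90.0), ("F2", 12.0), ("F2", 15.0)]
print(fault_weight_and_window(records))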
Article
Full-text available
Orthogonal defect classification (ODC), a concept that enables in-process feedback to software developers by extracting signatures of the development process from defects, is described. The ideas evolved from an earlier finding that demonstrates the use of semantic information from defects to extract cause-effect relationships in the development process. This finding is leveraged to develop a systematic framework for building measurement and analysis methods. The authors define ODC and discuss the necessary and sufficient conditions required to provide feedback to a developer; illustrate the use of the defect type distribution to measure the progress of a product through a process; illustrate the use of the defect trigger distribution to evaluate the effectiveness and eventually the completeness of verification processes such as inspection or testing; provide sample results from pilot projects using ODC; and open the door to a wide variety of analysis techniques for providing effective and fast feedback based on the concepts of ODC.
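For illustration, a sketch of the kind of defect type distribution ODC tracks per development phase; the phase and type names below are merely indicative.

from collections import Counter

def defect_type_distribution(defects):
    """`defects` is a list of (phase, defect_type) pairs. Returns, per phase,
    the fraction of defects of each type, which can be tracked across phases
    as a progress signature."""
    per_phase = {}
    for phase, dtype in defects:
        per_phase.setdefault(phase, Counter())[dtype] += 1
    return {phase: {t: n / sum(c.values()) for t, n in c.items()}
            for phase, c in per_phase.items()}

sample = [("unit test", "assignment"), ("unit test", "checking"),
          ("system test", "function"), ("system test", "interface"),
          ("system test", "function")]
print(defect_type_distribution(sample))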
Conference Paper
The paper presents results from an investigation of failures in several releases of Tandem's NonStop-UX Operating System, which is based on Unix System V. The analysis covers software failures from the field and failures reported by Tandem's test center. Fault classification is based on the status of the reported failures, the detection point of the errors in the operating system code, the panic message generated by the systems, the module that was found to be faulty, and the type of programming mistake. This classification reveals which modules in the operating system generate the most faults and in which modules most errors are detected. We also present distributions of the failure and repair times, including the interarrival time of unique failures and the time between duplicate failures. These distributions, unlike generic time distributions such as time between failures, help characterize the software quality. The distribution of repair times highlights the repair process and the factors influencing repair. The distribution of system uptime before a panic reveals the factors triggering the panic.
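A sketch of how interarrival times of unique failures and times between duplicate failures could be separated from a chronologically ordered report stream; the record layout is illustrative.

def interarrival_stats(reports):
    """`reports` is a chronologically sorted list of (time, fault_id) pairs.
    Returns interarrival times of first (unique) occurrences and the times
    between duplicate reports of an already-seen fault."""
    seen = {}
    unique_gaps, duplicate_gaps = [], []
    last_unique_time = None
    for t, fault_id in reports:
        if fault_id not in seen:
            if last_unique_time is not None:
                unique_gaps.append(t - last_unique_time)
            last_unique_time = t
        else:
            duplicate_gaps.append(t - seen[fault_id])
        seen[fault_id] = t
    return unique_gaps, duplicate_gaps

print(interarrival_stats([(0, "A"), (5, "B"), (7, "A"), (20, "C"), (21, "A")]))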
Conference Paper
An analysis is given of the software error logs produced by the VAX/VMS operating system from two VAXcluster multicomputer environments. Basic error characteristics are identified by statistical analysis. Correlations between software and hardware errors, and among software errors on different machines, are investigated. Finally, reward analysis and reliability growth analysis are performed to evaluate software dependability. Results show that the major software problems in the measured systems stem from program flow control and I/O management. The network-related software is suspected to be a reliability bottleneck. It is shown that a multicomputer software 'time between error' distribution can be modeled by a 2-phase hyperexponential random variable: a lower error rate pattern that characterizes regular errors, and a higher error rate pattern that characterizes error bursts and concurrent errors on multiple machines.
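A sketch of the 2-phase hyperexponential time-between-error model named above, with illustrative (not measured) parameters: with probability p the gap follows the lower "regular" error rate, otherwise the higher "burst" rate.

import math
import random

def hyperexp_pdf(t, p, lam_regular, lam_burst):
    """Density of a 2-phase hyperexponential time-between-error model:
    regular errors with rate lam_regular, error bursts with rate lam_burst."""
    return (p * lam_regular * math.exp(-lam_regular * t)
            + (1 - p) * lam_burst * math.exp(-lam_burst * t))

def hyperexp_sample(p, lam_regular, lam_burst):
    lam = lam_regular if random.random() < p else lam_burst
    return random.expovariate(lam)

# illustrative parameters: 90% regular errors (~1/day), 10% bursts (~1/minute)
p, lam_r, lam_b = 0.9, 1 / 86400.0, 1 / 60.0
samples = [hyperexp_sample(p, lam_r, lam_b) for _ in range(5)]
print(samples, hyperexp_pdf(3600.0, p, lam_r, lam_b))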
Conference Paper
One heuristic for data reduction that is widely used in the literature is the coalescence of events occurring close together in time. The authors explore the validity of this heuristic by developing a model of how the contents of an event log are formed by multiple independent error processes. The probability of coalescing events from two or more error sources is formulated and compared with results from hand analysis of actual event logs taken from a Tandem TNS II system. Results indicate that the probability of coalescing events from more than one error source is a strong function of the selected time constant. The model can be used to select an appropriate time constant and also has implications for designing event logging systems.
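A sketch of the time-based coalescence heuristic being modeled: successive events closer together than a chosen time constant are merged into one tuple, so too large a constant risks collapsing events from independent error sources.

def coalesce(timestamps, time_constant):
    """Group chronologically sorted event timestamps into tuples: an event
    joins the current tuple if it falls within `time_constant` of the
    previous event, otherwise it starts a new tuple."""
    tuples = []
    for t in timestamps:
        if tuples and t - tuples[-1][-1] <= time_constant:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples

events = [0.0, 2.0, 3.5, 60.0, 61.0, 300.0]
print(coalesce(events, time_constant=5.0))  # -> [[0.0, 2.0, 3.5], [60.0, 61.0], [300.0]]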
Conference Paper
Network performance monitoring is essential for managing a network efficiently and for ensuring continuous operation of the network. The paper discusses the implementation of a network monitoring tool and the results obtained by monitoring the College of Engineering Network at the University of South Carolina using this monitoring tool. The monitoring tool collects statistics useful for fault management, resource management, congestion control, and performance management.
Conference Paper
The author presents a method for detecting anomalous events in communication networks and other similarly characterized environments in which performance anomalies are indicative of failure. The methodology, based on automatically learning the difference between normal and abnormal behavior, has been implemented as part of an automated diagnosis system, from which performance results are drawn and presented. The dynamic nature of the model enables a diagnostic system to deal with continuously changing environments without explicit control, reacting to the way the world is now as opposed to the way the world was planned to be. Results of a successful deployment in a noisy, real-time monitoring environment are shown.
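As a stand-in for the learned model of normal behavior (the author's learning method is not reproduced here), a sketch that flags observations deviating strongly from recent history:

import statistics

def detect_anomalies(series, window=20, k=3.0):
    """Flag points deviating more than k standard deviations from the mean
    of the preceding `window` observations, a simple stand-in for a learned
    model of normal behavior."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies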
Conference Paper
A description is given of network management on the Ethernet-based engineering design network (EDEN). Past experiences and current network management techniques are presented for the various areas of network management, such as protocol-level management, network monitoring, network database, problem reporting, problem diagnosis, and fault isolation. Near-term expectations in network management for EDEN are discussed.
Article
Fault detection and diagnosis depend critically on good fault definitions, but the dynamic, noisy, and nonstationary character of networks makes it hard to define what a fault is in a network environment. The authors take the position that a fault or failure is a violation of expectations. In accordance with empirically based expectations, operating behaviors of networks (and other devices) can be classified as either normal or anomalous. Because network failures most frequently manifest themselves as performance degradations or deviations from expected behavior, periods of anomalous performance can be attributed to causes assignable as network faults. The half-year case study presented here used a system in which observations of distributed-computing network behavior were automatically and systematically classified as normal or anomalous. Anomalous behaviors were then traced to faulty conditions. In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail. The examples are taken from the distributed file-system network at Carnegie Mellon University.
Article
A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics, such as error/failure distributions and hazard rates, are obtained for both the individual machines and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate the performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low; however, its effect on system unavailability is significant.
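A small illustrative Markov reward model (hypothetical states, rates, and reward values, not those of the measured VAXcluster), showing how transient state probabilities obtained from the generator's matrix exponential combine with per-state reward rates to give an expected reward rate over time:

import numpy as np
from scipy.linalg import expm

# hypothetical 3-state model: 0 = OK, 1 = degraded (error), 2 = failed
Q = np.array([[-0.02,  0.015, 0.005],   # generator matrix (rates per hour)
              [ 0.50, -0.60,  0.10],
              [ 0.25,  0.00, -0.25]])
reward = np.array([1.0, 0.5, 0.0])      # per-state reward rate (relative capacity)
p0 = np.array([1.0, 0.0, 0.0])          # start in the OK state

for t in (1.0, 24.0, 720.0):            # hours
    p_t = p0 @ expm(Q * t)              # transient state probabilities at time t
    print(f"t={t:6.1f} h  expected reward rate = {reward @ p_t:.4f}")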