Conference Paper

Analyze-NOW: an environment for collection and analysis of failures in a network of workstations


Abstract

This paper describes Analyze-NOW, an environment for the collection and analysis of failures/errors in a network of workstations. Descriptions cover the data collection methodology and the tool implemented to facilitate this process. Software tools used for analysis are described, with emphasis on the details of the implementation of the Analyzer, the primary analysis tool. Application of the tools is demonstrated by using them to collect and analyze failure data (for a 32-week period) from a network of 69 SunOS-based workstations. Classification based on the source and the effect of faults is used to identify problem areas. Different types of failures encountered on the machines and the network are highlighted to develop a proper understanding of failures in a network environment. Lastly, a case is made for using the results from the analysis tool to pinpoint the problem areas in the network.
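
The source/effect classification described above amounts to a rule-based pass over the collected event records. The sketch below is a minimal illustration of that idea only, not the paper's implementation; the record format and category keywords are assumptions.

    # Minimal sketch of source classification for logged failure events.
    # Keyword lists and message formats are illustrative assumptions.
    FAULT_SOURCES = {
        "network":  ("nfs", "timeout", "unreachable", "ethernet"),
        "disk":     ("scsi", "bad block", "i/o error"),
        "software": ("panic", "segmentation", "daemon"),
    }

    def classify(message: str) -> str:
        """Assign a failure record to a source category by keyword match."""
        text = message.lower()
        for source, keywords in FAULT_SOURCES.items():
            if any(k in text for k in keywords):
                return source
        return "unknown"

    if __name__ == "__main__":
        events = [
            "NFS server not responding, timeout",
            "SCSI disk error: bad block relocated",
            "kernel panic: null pointer dereference",
        ]
        for e in events:
            print(classify(e), "<-", e)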


... consist in examining the log files produced by the components, in order to figure out the system behavior by correlating different events [7]. These techniques belong to direct detection strategies: they try to infer the system's health by directly querying it, or by analyzing events it is able to produce (e.g., logs, exceptions). ...
Conference Paper
Software systems employed in critical scenarios are increasingly large and complex. The usage of many heterogeneous components causes complex interdependencies and introduces sources of non-determinism that often lead to the activation of subtle faults. Such behaviors, due to their complex triggering patterns, typically escape the testing phase. Effective on-line monitoring is the only way to detect them and to promptly react in order to avoid more serious consequences. In this paper, we propose an error detection framework to cope with software failures, which combines multiple sources of data gathered both at application level and OS level. The framework is evaluated through a fault injection campaign on a complex system from the Air Traffic Management (ATM) domain. Results show that the combination of several monitors is effective at detecting errors in terms of false alarms, precision, and recall.
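
For reference, the detection-quality measures mentioned above are the standard ones, with TP, FP, and FN denoting true positives (detected errors), false positives (false alarms), and false negatives (missed errors):

    precision = TP / (TP + FP),    recall = TP / (TP + FN)
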
... Soft-error-induced network interface failures can be quite detrimental to the reliability of a distributed system. The failure data analysis reported in [6] indicates that network-related problems contributed to approximately 40 percent of the system failures observed in distributed environments. As we will see in the following sections, soft errors can cause the network interface to completely stop responding, function improperly, or greatly reduce network performance. ...
Article
Full-text available
Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures. Failure detection is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. Our failure recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact to the host system and be completely transparent to the user.
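
The hang-detection part of such a scheme can be pictured as a host-side watchdog that polls a heartbeat counter the network processor is expected to advance. The sketch below is only an illustration under that assumption; read_heartbeat and reset_interface are hypothetical callbacks, not the paper's interface.

    import time

    HEARTBEAT_PERIOD = 1.0   # seconds between checks (illustrative value)
    MAX_MISSED = 3           # consecutive stalls before declaring a hang

    def watchdog(read_heartbeat, reset_interface):
        """Declare a network-processor hang when the heartbeat counter stops advancing."""
        last, missed = read_heartbeat(), 0
        while True:
            time.sleep(HEARTBEAT_PERIOD)
            current = read_heartbeat()
            missed = missed + 1 if current == last else 0
            last = current
            if missed >= MAX_MISSED:
                reset_interface()   # e.g., restore state from the small backup copy
                missed = 0
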
Article
Full-text available
Process variations manifest themselves as variations in different parameters in the device, such as the effective channel length Leff, the oxide thickness tox, and the threshold voltage Vth. These variations increase as the feature size reduces due to the difficulty of fabricating small structures consistently across a die or a wafer [1]. Controlling the variation in device parameters during fabrication is therefore becoming a great challenge for scaled technologies. The performance and power consumption of integrated circuits are greatly affected by these variations. This can cause deviation from the intended design parameters for a chip and severely affect the yield as well as performance. Thus, process variations must be taken into consideration while designing circuits and architectures. We present a new adaptive cache architecture design which takes into consideration the effect of process variations on access latency. Preliminary results show that our new design can achieve a 13% to 29% performance improvement on the applications studied compared to a conventional design.
Article
This paper presents the results of a reliability study on a set of nearly 100 popular Web sites done from an end user's perspective. The study attempts to address the following issues: (a) What is the (stationary) probability that a user's request to access a Web site succeeds? (b) On average, what percentage of Web sites remain accessible to the user at a given moment? (c) What are the major causes of access failures as seen by the user? (d) Typically, how long could a Web site be unavailable to the user? (e) What parameters could be used to quantitatively describe the behavior of a host on the Internet? Data for the study was acquired by periodically attempting to fetch an HTML file from each Web site and recording the outcome of such attempts. Analysis of the acquired data revealed: (i) Over 94% of the HTML file fetch requests succeed on average. (ii) Over 70% of the failures last less than 15 min. (iii) Network-related outages account for over half of the failures. (iv) Network-related outages can potentially render more than 70% of the hosts inaccessible to the user. (v) Host-related failures tend to be of a shorter duration than failures that might involve the network. (vi) Network connectivity is good on average, with 93% of the sites being accessible at any given time. (vii) Mean availability of the hosts is high (0.993). (viii) On average, a host remains unavailable to a user for about 2.5 days per year. However, the total downtimes exhibited by individual hosts varied from about 2 h per year to nearly 20 days per year.
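
The probing methodology can be reproduced with a few lines of scripting: periodically fetch a page from each site, record the outcome, and estimate availability as the fraction of successful fetches. A minimal sketch, assuming placeholder URLs and an arbitrary probing interval:

    import time
    import urllib.request
    from collections import defaultdict

    SITES = ["http://example.com/index.html"]   # placeholder probe targets
    INTERVAL = 600                               # seconds between probe rounds (assumed)

    def probe(url: str, timeout: float = 30.0) -> bool:
        """Return True if fetching an HTML file from the site succeeds."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def run(rounds: int) -> None:
        ok = defaultdict(int)
        for _ in range(rounds):
            for url in SITES:
                ok[url] += probe(url)
            time.sleep(INTERVAL)
        for url in SITES:
            print(url, "availability ~", ok[url] / rounds)

Recording the exception type alongside the boolean outcome (DNS error, connection refused, timeout) would support the cause breakdown reported in the study.
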
Conference Paper
Full-text available
Striping data across multiple nodes has been recognized as an effective technique for delivering high-bandwidth I/O to applications running on clusters. However, the technique is vulnerable to disk failure. We present an I/O architecture for clusters called reliable array of autonomous controllers (RAAC) that builds on the technique of RAID-style data redundancy. The RAAC architecture uses a two-tier layout that enables the system to scale in terms of storage capacity and transfer bandwidth while avoiding the synchronization overhead incurred in a distributed RAID system. We describe our implementation of RAAC in PVFS, and compare the performance of parity-based redundancy in RAAC and in a conventional distributed RAID architecture.
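
The RAID-style redundancy RAAC builds on reduces, at its core, to keeping a parity block that is the bitwise XOR of the data blocks in a stripe, from which any single lost block can be rebuilt. The sketch below shows only that core idea, not the RAAC two-tier layout:

    def parity(blocks):
        """Bitwise XOR of equal-sized data blocks (RAID-4/5 style parity)."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    def rebuild(surviving_blocks, parity_block):
        """Reconstruct the single missing block from the survivors and the parity."""
        return parity(list(surviving_blocks) + [parity_block])

    data = [b"aaaa", b"bbbb", b"cccc"]
    p = parity(data)
    assert rebuild([data[0], data[2]], p) == data[1]
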
Conference Paper
Machine Check Architecture (MCA) is a processor-internal architecture subsystem that detects and logs correctable and uncorrectable errors in the data or control paths in each CPU core and the Northbridge. These errors include parity errors associated with caches and TLBs, ECC errors associated with caches and DRAM, and system bus errors. This paper reports on an experimental study on: (i) monitoring a computing cluster for machine checks and using this data to identify patterns that can be employed for error diagnostics and (ii) introducing faults into the machine to understand the resulting machine checks and correlate this data with relevant performance metrics.
Article
In response to the strong desire of customers to be provided with advance notice of unplanned outages, techniques were developed that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system. The resulting techniques are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures.
Conference Paper
Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of “software aging”, in which the state of the software system degrades with time, has been reported (S. Garg et al., 1998). The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This may eventually lead to performance degradation of the software, crash/hang failure, or both. Earlier work in this area to detect aging and to estimate its effect on system resources did not take into account the system workload. In this paper, we propose a measurement-based model to estimate the rate of exhaustion of operating system resources both as a function of time and the system workload state. A semi-Markov reward model is constructed based on workload and resource usage data collected from the UNIX operating system. We first identify different workload states using statistical cluster analysis and build a state-space model. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource exhaustion in the different states. The model is then solved to obtain trends, the estimated exhaustion rates, and the time-to-exhaustion for the resources. With the help of this measure, proactive fault management techniques such as “software rejuvenation” (Y. Huang et al., 1995) may be employed to prevent unexpected outages.
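
In reward-model terms, the long-run depletion rate of a resource is the occupancy-weighted sum of its per-state depletion rates; a rough reading of the construction above, in our notation rather than the paper's, is

    slope_k = \sum_j \pi_j \rho_{k,j},    time-to-exhaustion_k \approx (C_k - u_k) / slope_k

where \pi_j is the steady-state probability of workload state j, \rho_{k,j} is the exhaustion rate (reward) of resource k in state j, C_k is its capacity, and u_k is its current usage.
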
Conference Paper
The phenomenon of software aging refers to the accumulation of errors during the execution of the software which eventually results in its crash/hang failure. A gradual performance degradation may also accompany software aging. Proactive fault management techniques such as “software rejuvenation” (Y. Huang et al., 1995) may be used to counteract aging if it exists. We propose a methodology for detection and estimation of aging in the UNIX operating system. First, we present the design and implementation of an SNMP-based, distributed monitoring tool used to collect operating system resource usage and system activity data at regular intervals from networked UNIX workstations. Statistical trend detection techniques are applied to this data to detect/validate the existence of aging. For quantifying the effect of aging in operating system resources, we propose a metric: “estimated time to exhaustion”, which is calculated using well-known slope estimation techniques. Although the distributed data collection tool is specific to UNIX, the statistical techniques can be used for detection and estimation of aging in other software as well.
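
The "estimated time to exhaustion" metric can be illustrated with a plain least-squares slope fit over a resource-usage series; the paper's slope estimators may differ, and the numbers below are made up:

    def slope(ts, ys):
        """Ordinary least-squares slope of resource usage versus time."""
        n = len(ts)
        mt, my = sum(ts) / n, sum(ys) / n
        num = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
        den = sum((t - mt) ** 2 for t in ts)
        return num / den

    def time_to_exhaustion(ts, used, capacity):
        """Time units until the fitted trend reaches capacity (None if no upward trend)."""
        b = slope(ts, used)
        return (capacity - used[-1]) / b if b > 0 else None

    hours = [0, 24, 48, 72, 96]
    swap_used_mb = [100, 130, 155, 190, 215]   # illustrative usage samples
    print(time_to_exhaustion(hours, swap_used_mb, capacity=1024))   # ~ 670 hours
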
Conference Paper
This paper presents the results of a 40-day reliability study on a set of 97 popular Web sites done from an end user's perspective. Data for the study was acquired by periodically attempting to fetch an HTML file from each Web site and recording the outcome of such attempts. Analysis of the acquired data revealed: (i) 94% of the HTML file fetch requests succeed on average; (ii) most failures last less than 15 minutes; (iii) the underlying network plays a dominant role in determining host accessibility: (a) network-related outages account for a major part of the failures, (b) some network-related outages rendered more than 70% of the hosts inaccessible, and (c) host-related failures tend to be shorter than failures that might involve the network; (iv) the network connectivity is high on average, with 93% of the sites being accessible at any given time; and (v) the mean availability of the hosts is high (0.993).
Conference Paper
The paper presents an injection-based approach to analyze dependability of high-speed networks using the Myrinet as an example testbed. Instead of injecting faults related to network protocols, the authors injected faults into the host interface component, which performs the actual send and receive operations. The fault model used was a temporary single bit flip in an instruction executing on the host interface's custom processor, corresponding to a transient fault in the processor itself. Results show that more than 25% of the injected faults resulted in interface failures. Furthermore, they observed fault propagation from an interface to its host computer or to another interface to which it sent a message. These findings suggest that two important issues for high-speed networking in critical applications are protecting the host computer from errant or malicious interface components and implementing thorough message acceptance test mechanisms to prevent errant messages from propagating faults between interfaces.
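
The fault model described (a temporary single-bit flip in an instruction word) is straightforward to mimic in simulation by XOR-ing one randomly chosen bit; the instruction encoding below is arbitrary:

    import random

    def flip_random_bit(word: int, width: int = 32) -> int:
        """Return the instruction word with one randomly chosen bit inverted,
        a simulation stand-in for a transient single-bit fault."""
        return word ^ (1 << random.randrange(width))

    original = 0x8C820004          # arbitrary 32-bit instruction word
    corrupted = flip_random_bit(original)
    print(hex(original), "->", hex(corrupted))
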
Article
Full-text available
Recently, the phenomenon of software aging, one in which the state of the software system degrades with time, has been reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, estimate transition probabilities, and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
Article
Full-text available
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
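
As one concrete instance of the statistical machinery the survey covers: under an exponential (constant-rate) assumption, n observed failures over cumulative observation time T give the maximum-likelihood failure rate and a large-sample confidence interval

    \hat{\lambda} = n / T,    \hat{\lambda} \pm z_{1-\alpha/2} \hat{\lambda} / \sqrt{n}

which is the kind of point estimate and interval estimation referred to above.
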
Conference Paper
Full-text available
In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is that of measuring the customer-perceived failure rate of commercial software. Unfortunately, even order of magnitude measures of failure rate are not truly available for commercial software which is widely distributed. Given repeated reports on the criticality of software, and its significance, the industry flounders for some real baselines. The paper reports the failure rate of a several-million-line commercial software product distributed to hundreds of thousands of customers. To a first order of approximation, the MTBF reaches around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release and time are also provided. The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics, the fault weight, corresponding to the number of failures due to a fault, and the failure window, measuring the length of time between the first and last failure due to that fault, are defined and characterized. The two metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant by severity. The fault weight and failure window are natural measures and are intuitive about the failure process. The fault weight measures the impact of a fault on the overall failure rate and the failure window the dispersion of that impact over time. These two provide a new forum for discussion and an opportunity to gain greater understanding of the processes involved.
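
Both metrics are simple to compute once each failure report is mapped to its underlying fault; a minimal sketch, with an assumed (fault_id, failure_time) record layout:

    from collections import defaultdict

    def fault_metrics(failures):
        """failures: iterable of (fault_id, failure_time) pairs.
        Returns {fault_id: (fault_weight, failure_window)}, where the weight is
        the number of failures charged to the fault and the window is the time
        span between its first and last reported failure."""
        times = defaultdict(list)
        for fault_id, t in failures:
            times[fault_id].append(t)
        return {f: (len(ts), max(ts) - min(ts)) for f, ts in times.items()}

    reports = [("F1", 10.0), ("F1", 40.0), ("F2", 12.0), ("F1", 95.0)]
    print(fault_metrics(reports))   # {'F1': (3, 85.0), 'F2': (1, 0.0)}
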
Conference Paper
Full-text available
An analysis of software defects reported at customer sites in two large IBM database management products, DB2 and IMS, is presented. The analysis considers several different error classification systems and compares the results to those of an earlier study of field defects in IBM's MVS operating system. The authors compare the error type, defect type, and error trigger distributions of the DB2, IMS, and MVS products; show that there may exist an asymptotic behavior in the error type distribution as a function of a defect type; and discuss the undefined state errors that dominate the error type distribution
Article
Full-text available
Orthogonal defect classification (ODC), a concept that enables in-process feedback to software developers by extracting signatures on the development process from defects, is described. The ideas are evolved from an earlier finding that demonstrates the use of semantic information from defects to extract cause-effect relationships in the development process. This finding is leveraged to develop a systematic framework for building measurement and analysis methods. The authors define ODC and discuss the necessary and sufficient conditions required to provide feedback to a developer; illustrate the use of the defect type distribution to measure the progress of a product through a process; illustrate the use of the defect trigger distribution to evaluate the effectiveness and eventually the completeness of verification processes such as inspection or testing; provide sample results from pilot projects using ODC; and open the doors to a wide variety of analysis techniques for providing effective and fast feedback based on the concepts of ODC.
Conference Paper
The paper presents results from an investigation of failures in several releases of Tandem's NonStop-UX Operating System, which is based on Unix System V. The analysis covers software failures from the field and failures reported by Tandem's test center. Fault classification is based on the status of the reported failures, the detection point of the errors in the operating system code, the panic message generated by the systems, the module that was found to be faulty, and the type of programming mistake. This classification reveals which modules in the operating system generate the most faults and the modules in which most errors are detected. We also present distributions of the failure and repair times, including the interarrival time of unique failures and the time between duplicate failures. These distributions, unlike generic time distributions such as time between failures, help characterize the software quality. Distribution of the repair times emphasizes the repair process and the factors influencing repair. Distribution of up time of the systems before the panic reveals the factors triggering the panic.
Conference Paper
An analysis is given of the software error logs produced by the VAX/VMS operating system from two VAXcluster multicomputer environments. Basic error characteristics are identified by statistical analysis. Correlations between software and hardware errors, and among software errors on different machines, are investigated. Finally, reward analysis and reliability growth analysis are performed to evaluate software dependability. Results show that major software problems in the measured systems are from program flow control and I/O management. The network-related software is suspected to be a reliability bottleneck. It is shown that a multicomputer software "time between error" distribution can be modeled by a 2-phase hyperexponential random variable: a lower error rate pattern which characterizes regular errors, and a higher error rate pattern which characterizes error bursts and concurrent errors on multiple machines.
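
The 2-phase hyperexponential fit mentioned above has the density

    f(t) = p \lambda_1 e^{-\lambda_1 t} + (1 - p) \lambda_2 e^{-\lambda_2 t},    t >= 0

where the phase with the lower rate \lambda_1 captures regular errors, the phase with the higher rate \lambda_2 captures error bursts and concurrent errors, and p is the mixing probability.
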
Conference Paper
One heuristic for data reduction that is widely used in the literature is coalescence of events occurring close in time. The authors explore the validity of this heuristic by developing a model for the formation of the contents of an event log by multiple independent error processes. The probability of coalescing events from two or more error sources is formulated and compared with results from hand analysis of actual event logs taken from a Tandem TNS II system. Results indicate that the probability of coalescing events from more than one error source is a strong function of the time constant selected. The model can be used to select an appropriate time constant and also has implications in designing event logging systems
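
Time-based coalescence of the kind being modeled groups log entries whose inter-event gap falls below a chosen time constant; the sketch below illustrates that grouping heuristic only, not the authors' probability model:

    def coalesce(timestamps, window):
        """Group event timestamps into tuples: an event joins the current tuple
        if it arrives within `window` seconds of the previous event."""
        groups, current = [], []
        for t in sorted(timestamps):
            if current and t - current[-1] > window:
                groups.append(current)
                current = []
            current.append(t)
        if current:
            groups.append(current)
        return groups

    events = [0, 2, 3, 120, 121, 500]
    print(coalesce(events, window=30))   # [[0, 2, 3], [120, 121], [500]]
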
Conference Paper
Network performance monitoring is essential for managing a network efficiently and for ensuring continuous operation of the network. The paper discusses the implementation of a network monitoring tool and the results obtained by monitoring the College of Engineering Network at the University of South Carolina using this monitoring tool. The monitoring tool collects statistics useful for fault management, resource management, congestion control, and performance management.
Conference Paper
The author presents a method for detecting anomalous events in communication networks and other similarly characterized environments in which performance anomalies are indicative of failure. The methodology, based on automatically learning the difference between normal and abnormal behavior, has been implemented as part of an automated diagnosis system from which performance results are drawn and presented. The dynamic nature of the model enables a diagnostic system to deal with continuously changing environments without explicit control, reacting to the way the world is now, as opposed to the way the world was planned to be. Results of successful deployment in a noisy, real-time monitoring environment are shown.
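
In its simplest form, learning the difference between normal and abnormal behavior can be approximated by flagging observations that deviate from a trailing baseline by more than k standard deviations. The window length and threshold below are arbitrary illustrations, not the author's method:

    from statistics import mean, stdev

    def anomalies(series, window=20, k=3.0):
        """Indices whose value deviates from the trailing-window mean by more
        than k standard deviations (a crude stand-in for learned 'normal')."""
        flagged = []
        for i in range(window, len(series)):
            base = series[i - window:i]
            mu, sigma = mean(base), stdev(base)
            if sigma > 0 and abs(series[i] - mu) > k * sigma:
                flagged.append(i)
        return flagged

    series = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11] * 3 + [40]
    print(anomalies(series))   # [30]
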
Conference Paper
A description is given of the network management on the Ethernet-based engineering design network (EDEN). Past experiences and current network management techniques are presented for the various areas of network management such as protocol-level management, network monitoring, network database, problem reporting, problem diagnosis, and fault isolation. Near-term expectations in network management for EDEN are discussed
Article
Based on extensive field failure data for Tandem's GUARDIAN operating system, the paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.
Article
Fault detection and diagnosis depend critically on good fault definitions, but the dynamic, noisy, and nonstationary character of networks makes it hard to define what a fault is in a network environment. The authors take the position that a fault or failure is a violation of expectations. In accordance with empirically based expectations, operating behaviors of networks (and other devices) can be classified as being either normal or anomalous. Because network failures most frequently manifest themselves as performance degradations or deviations from expected behavior, periods of anomalous performance can be attributed to causes assignable as network faults. The half-year case study presented used a system in which observations of distributed-computing network behavior were automatically and systematically classified as normal or anomalous. Anomalous behaviors were traced to faulty conditions. In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail. Examples are taken from the distributed file-system network at Carnegie Mellon University
Article
In this correspondence we present a statistical model which relates mean computer failure rates to level of system activity. Our analysis reveals a strong statistical dependency of both hardware and software component failure rates on several common measures of utilization (specifically CPU utilization, I/O initiation, paging, and job-step initiation rates). We establish that this effect is not dominated by a specific component type, but exists across the board in the two systems studied. Our data covers three years of normal operation (including significant upgrades and reconfigurations) for two large Stanford University computer complexes. The complexes, which are composed of IBM mainframe equipment of differing models and vintage, run similar operating systems and provide the same interface and capability to their users. The empirical data comes from identically structured and maintained failure logs at the two sites along with IBM OS/VS2 operating system performance/load records.
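
In its simplest linear form, a model of the kind described can be written (our notation, not necessarily the authors' exact specification) as

    \lambda = \beta_0 + \beta_1 u_{CPU} + \beta_2 r_{IO} + \beta_3 r_{page} + \beta_4 r_{job} + \epsilon

where \lambda is the mean failure rate and the regressors are CPU utilization and the I/O initiation, paging, and job-step initiation rates.
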
Article
A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics such as error/failure distributions and hazard rate are obtained for both the individual machine and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for the short-term operation and underestimates it for the long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low. However, its effect on system unavailability is significant.
R. A. Maxion and F. E. Feather, "A case study of Ethernet anomalies in a distributed computing environment," IEEE Trans. Reliability, vol. 39, no. 4, 1990.

M. Sullivan and R. Chillarege, "A comparison of software defects in database management systems and operating systems," Int. Symp. on Fault-Tolerant Computing, July 1992.

D. Tang and R. K. Iyer, "Analysis of the VAX/VMS error logs in multicomputer environments - a case study of software dependability," Int. Symp. on Software Reliability Engineering, Oct. 1992.

J. P. Hansen and D. P. Siewiorek, "Models for time coalescence in event logs," Int. Symp. on Fault-Tolerant Computing, 1992.

P. Ho, "Network management on Hughes Aircraft's engineering design network," Proc. Conf. on Local Computer Networks, 1989.