Article

Analyze-NOW - An environment for collection & analysis of failures in a network of workstations


Abstract

This paper describes Analyze-NOW, an environment for the collection and analysis of failures/errors in a network of workstations. Descriptions cover the data collection methodology and the tool implemented to facilitate this process. The software tools used for analysis are described, with emphasis on the implementation of the Analyzer, the primary analysis tool. Application of the tools is demonstrated by using them to collect and analyze failure data, gathered over a 32-week period, from a network of 69 SunOS-based workstations. Classification based on the source and effect of faults is used to identify problem areas, and the different types of failures encountered on the machines and the network are highlighted to develop a proper understanding of failures in a network environment. The results from the analysis tool can be used to pinpoint problem areas in the network. The results obtained by applying Analyze-NOW to failure data from the monitored network reveal some interesting behavior. Nearly 70% of the failures were network-related, whereas disk errors were few; network-related failures accounted for 75% of all hard failures (failures that make a workstation unusable). Half of the network-related failures were due to servers not responding to clients, and the rest were performance-related and other problems. Problem areas in the network were identified using the tool, and the authors' approach was compared to the method of using the network architecture to locate problem areas. This comparison showed that locating problem areas from the network architecture alone over-estimates their number.
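
The classification step the abstract describes, tallying failures by source and effect and separating out hard failures, reduces at its core to counting labeled failure records per category. The sketch below is a minimal illustration of that bookkeeping in Python; the record layout, category names, and figures are invented for the example and are not Analyze-NOW's actual data model.

    # Minimal tally of failures by category, in the spirit of the source/effect
    # classification described above. Record layout and categories are invented.
    from collections import Counter

    failures = [
        # (machine, category, is_hard_failure)
        ("ws01", "network", True),
        ("ws02", "network", False),
        ("ws03", "disk", True),
        ("ws04", "software", False),
        ("ws01", "network", True),
    ]

    by_category = Counter(cat for _, cat, _ in failures)
    hard_by_category = Counter(cat for _, cat, hard in failures if hard)
    total = sum(by_category.values())
    hard_total = sum(hard_by_category.values())

    for cat, n in by_category.most_common():
        share = 100.0 * n / total
        hard_share = 100.0 * hard_by_category[cat] / hard_total if hard_total else 0.0
        print(f"{cat:10s} {n:3d} ({share:5.1f}% of all failures, "
              f"{hard_share:5.1f}% of hard failures)")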


... The lack of real data collected from networked distributed systems is one of the reasons for the lack of published results on dependability analysis and modelling of interconnected systems. Examples of such data are reported in [16], where event logs collected from a network of 69 Sun workstations monitored over a period of 32 weeks are analysed, and in [8], where availability analyses are carried out based on event logs collected from 70 Windows NT mail servers. ...
... The identification of events corresponding to errors and the definition of error classification criteria according to the origin of the error or its severity requires a thorough manual analysis of log files. Some examples of classification criteria are presented in [16]. In this paper, we focus on the identification of machine reboots from the event logs and the evaluation of statistics characterizing: a) the distribution of reboots (per machine, time), b) the distribution of uptimes and downtimes associated to these reboots, c) the availability of machines including workstations and servers, and d) error dependencies between clients and servers. ...
... However, these dependencies will not be activated if the clients do not access the servers upon the occurrence of server failures. Similar results were observed in the experimental study presented in [16]. ...
Conference Paper
Full-text available
This paper presents a measurement-based availability study of networked Unix systems, based on data collected during 11 months from 298 workstations and servers interconnected through a local area computing network. The data corresponds to event logs recorded by the Unix operating system via the Syslogd daemon. Our study focuses on the identification of machine reboots and the evaluation of statistical measures characterizing: (a) the distribution of reboots (per machine, time), (b) the distribution of uptimes and downtimes associated to these reboots, (c) the availability of machines including workstations and servers, and (d) error dependencies between clients and servers.
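
The uptime/downtime and availability measures mentioned in this abstract can be illustrated with a small calculation: once downtime intervals have been reconstructed around each reboot, availability over the monitoring window is one minus the downtime fraction. A minimal sketch follows, with invented timestamps and without the paper's actual reboot-identification step.

    # Availability over a monitoring window, given reconstructed downtime
    # intervals around reboots. Timestamps are invented for illustration.
    from datetime import datetime, timedelta

    monitoring_start = datetime(2024, 1, 1)
    monitoring_end = datetime(2024, 1, 31)

    # (last event before the outage, first event after the reboot)
    downtimes = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 12, 30)),
        (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 2, 45)),
    ]

    total = monitoring_end - monitoring_start
    down = sum((up - dn for dn, up in downtimes), timedelta())
    availability = 1.0 - down / total
    print(f"total downtime {down}, availability {availability:.5f}")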
... In past years, several software technologies have been developed that integrate state-of-the-art collection techniques to manipulate and model log data for log-based error analysis; for example, "MEADEP" [35], "NOW" [36], and "SEC" [37,38]. However, since log-based investigation is not supported by fully automated procedures, most of the processing burden falls on analysts who often have inadequate knowledge of the system. ...
... In addition, an error that activates multiple messages in the log requires considerable effort to group the entries related to the same error manifestation. Preprocessing tasks are crucial for accurate error analysis [6,22,27,36]. ...
... Computer system dependability analysis based on event logs has been the focus of several published papers [1,2,4,5,7,8,9]. Various types of systems have been studied (Tandem, VAX/VMS, Unix, Windows NT, Windows 2000, etc.) including mainframes and largely deployed commercial systems. ...
... Such task is tedious and requires the development of heuristics and predefined failure criteria. An example of such analysis is reported in [7]. ...
Conference Paper
Full-text available
This paper presents a measurement-based availability assessment study using field data collected during a 4-year period from 373 SunOS/Solaris Unix workstations and servers interconnected through a local area network. We focus on the estimation of machine uptimes, downtimes and availability based on the identification of failures that caused total service loss. Data corresponds to syslogd event logs that contain a large amount of information about the normal activity of the studied systems as well as their behavior in the presence of failures. It is widely recognized that the information contained in such event logs might be incomplete or imperfect. The solution investigated in this paper to address this problem is based on the use of auxiliary sources of data obtained from wtmpx files maintained by the SunOS/Solaris Unix operating system. The results obtained suggest that the combined use of wtmpx and syslogd log files provides more complete information on the state of the target systems that is useful to provide availability estimations that better reflect reality.
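
The idea of combining wtmpx and syslogd data amounts to merging reboot indications from two imperfect sources and deduplicating events that both sources report. The sketch below illustrates such a merge under the assumption of a fixed tolerance window; the timestamps, the tolerance value, and the pairing logic are illustrative only and not the paper's procedure.

    # Merging reboot indications from two log sources (e.g., syslogd messages
    # and wtmpx boot records); events closer than a tolerance are assumed to
    # be the same reboot. All values are invented.
    from datetime import datetime, timedelta

    syslog_boots = [datetime(2024, 1, 5, 12, 30), datetime(2024, 1, 20, 2, 45)]
    wtmpx_boots = [datetime(2024, 1, 5, 12, 31), datetime(2024, 1, 12, 8, 0)]
    tolerance = timedelta(minutes=5)

    merged = []
    for t in sorted(syslog_boots + wtmpx_boots):
        if not merged or t - merged[-1] > tolerance:
            merged.append(t)  # a reboot not yet seen in the merged timeline
    print(merged)             # wtmpx contributes the Jan 12 reboot syslog missed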
... Most of the previous work in measurement-based dependability evaluation was based on measurements made at either failure times [3, 10, 14] or at times an error was observed [11, 15, 20, 21]. Chillarege et. ...
... This comprehensive model takes into account both reactive recovery following a failure due to resource exhaustion and rejuvenation, and is used to derive optimal rejuvenation schedules. Other related work in measurement-based dependability evaluation is based on either measurements made at failure times [11] or at error observation times [42]. In our case, we monitor the system performance variables continuously since we are interested in trend estimation and, hence, in predicting the time to next failure and not in observing interfailure times or identifying error patterns. ...
Conference Paper
The phenomenon of software aging refers to the accumulation of errors during the execution of the software which eventually results in its crash/hang failure. A gradual performance degradation may also accompany software aging. Pro-active fault management techniques such as "software rejuvenation" (Y. Huang et al., 1995) may be used to counteract aging if it exists. We propose a methodology for detection and estimation of aging in the UNIX operating system. First, we present the design and implementation of an SNMP-based, distributed monitoring tool used to collect operating system resource usage and system activity data at regular intervals, from networked UNIX workstations. Statistical trend detection techniques are applied to this data to detect/validate the existence of aging. For quantifying the effect of aging in operating system resources, we propose a metric: "estimated time to exhaustion", which is calculated using well known slope estimation techniques. Although the distributed data collection tool is specific to UNIX, the statistical techniques can be used for detection and estimation of aging in other software as well.
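
The "estimated time to exhaustion" metric described above boils down to fitting a trend to a resource-usage series and extrapolating to the point where the resource runs out. The paper applies statistical slope estimators; the sketch below substitutes a plain least-squares fit and synthetic numbers, purely to illustrate the idea.

    # Trend fit and "estimated time to exhaustion" for a resource-usage series.
    # Synthetic data and a least-squares slope stand in for the paper's
    # slope estimation techniques.
    import numpy as np

    rng = np.random.default_rng(0)
    hours = np.arange(48, dtype=float)
    free_mem_mb = 900.0 - 6.0 * hours + rng.normal(0.0, 15.0, hours.size)

    slope, intercept = np.polyfit(hours, free_mem_mb, 1)
    if slope < 0:
        tte = -intercept / slope   # hours until the fitted line reaches zero
        print(f"depletion ~{-slope:.1f} MB/h, estimated time to exhaustion ~{tte:.0f} h")
    else:
        print("no downward trend detected")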
... Computer system dependability analysis based on event logs has been the focus of several published papers [1,2,4,5,7,8,9]. Various types of systems have been studied (Tandem, VAX/VMS, Unix, Windows NT, Windows 2000, etc.) including mainframes and largely deployed commercial systems. ...
... Such task is tedious and requires the development of heuristics and predefined failure criteria. An example of such analysis is reported in [7]. ...
Article
Full-text available
This paper presents a measurement-based availability assessment study using field data collected during a 4-year period from 373 SunOS/Solaris Unix workstations and servers interconnected through a local area network. We focus on the estimation of machine uptimes, downtimes and availability based on the identification of failures that caused total service loss. Data corresponds to syslogd event logs that contain a large amount of information about the normal activity of the studied systems as well as their behavior in the presence of failures. It is widely recognized that the information contained in such event logs might be incomplete or imperfect. The solution investigated in this paper to address this problem is based on the use of auxiliary sources of data obtained from wtmpx files maintained by the SunOS/Solaris Unix operating system. The results obtained suggest that the combined use of wtmpx and syslogd log files provides more complete information on the state of the target systems that is useful to provide availability estimations that better reflect reality.
... In the past, many software technologies [10], [11] have been developed to support "log-based defect analysis" and the integration of modern capture techniques for the processing and modeling of historical data, such as "MEADEP" [12], "Analyze NOW" [13] and "SEC" [14]. However, "log-based analysis" is not sustained by completely automated practices, so most analysis campaigns rely on analysts with often limited system knowledge. ...
Article
Full-text available
Software development is a multitasking activity carried out by an individual or a team. Each activity involves diverse tasks and complications. To accomplish quality improvement, it is essential to make every activity task free of defects, but locating and correcting defects is expensive and time-intensive. In the past, many methods have been used to predict potential drawbacks in a program based on probability theory. Because a probability method applies random variables and probability distributions to find a solution, the result always lies in a possible range that may be true at some times and wrong at others. Therefore, an additional calculation method is coupled with the probability method to make defect prediction more accurate. In this paper, we propose a Probabilistic and Deterministic based Defect Prediction (PD-DP) through Defect Association Learning (DAL). The PD-DP implements a Probability Association Method (PAM) and a Deterministic Association Method (DAM) to predict software defects accurately during software development. The experimental evaluation of PD-DP, compared to existing prediction methods, shows an enhancement in prediction accuracy.
... On the other hand, log-based techniques consist in analyzing the log files produced by the component, if available [11,12]. In fact, these may contain many useful data to comprehend the system dependability behavior, as shown by many studies on the field [13,14]. Moreover, logs are the only direct source of information available in the case of OTS items. ...
... Table 2 shows the references per tag in the "Type of Model" category, and Figure 3 lists the references tagged as activations, failures, and meta. In the following plots, we therefore visualize tag combinations using a correspondence value. The correspondence value is our measure of correlation between two tags. ...
Article
Full-text available
The software engineering field has a long history of classifying software failure causes. Understanding them is paramount for fault injection, focusing testing efforts or reliability prediction. Since software fails in manifold complex ways, a broad range of software failure cause models is meanwhile published in dependability literature. We present the results of a meta-study that classifies publications containing a software failure cause model in topic clusters. Our results structure the research field and can help to identify gaps. We applied the systematic mapping methodology for performing a repeatable analysis. We identified 156 papers presenting a model of software failure causes. Their examination confirms the assumption that a large number of the publications discusses source code defects only. Models of fault-activating state conditions and error states are rare. Research seems to be driven mainly by the need for better testing methods and code-based quality improvement. Other motivations such as online error detection are less frequently given. Mostly, the IEEE definitions or orthogonal defect classification is used as base terminology. The majority of use cases comes from web, safety- and security-critical applications.
... Failures in networks of workstations have been studied in [25]. A large number of outages are due to planned maintenance, software installation and configuration. ...
... Over the years, several software packages have emerged to support log-based failure analysis, integrating state-of-the-art techniques to collect, manipulate, and model the log data, e.g., MEADEP [25], Analyze NOW [26], and SEC [27], [28]. However, log-based analysis is not yet supported by fully-automated procedures, leaving most of the processing burden to log analysts, who often have a limited knowledge of the system. ...
Article
Event logs have been widely used over the last three decades to analyze the failure behavior of a variety of systems. Nevertheless, the implementation of the logging mechanism lacks a systematic approach and collected logs are often inaccurate at reporting software failures: This is a threat to the validity of log-based failure analysis. This paper analyzes the limitations of current logging mechanisms and proposes a rule-based approach to make logs effective to analyze software failures. The approach leverages artifacts produced at system design time and puts forth a set of rules to formalize the placement of the logging instructions within the source code. The validity of the approach, with respect to traditional logging mechanisms, is shown by means of around 12,500 software fault injection experiments into real-world systems.
... It consists of 4 software modules: a data preprocessor for converting data in various formats to the MEADEP format, a data analyzer for graphical data-presentation and parameter estimation, a graphical modeling interface for building block diagrams (including the exponential block, Weibull block, and k-out-of-n block) and Markov reward chains, and a model-solution module for availability/reliability calculations with graphical parametric analysis. Analyze NOW [9] is a set of tools specifically tailored for the FFDA of networks of workstations. It embodies tools for the automated data collection from all the workstations, and tools for the analysis of such data. ...
Article
Full-text available
Field failure data play a key role in complex distributed systems, as they often represent the only available source of information useful to control the dependability level of the system. However, the analysis of these data can be compromised by several factors, such as log heterogeneity and inaccuracy, which increase the level of distrust in logs and make it difficult to compare different analyses to provide general results. The paper proposes a framework to overcome these limitations, based on three key aspects: (i) the use of an "accurate enough" model of the system in hand, (ii) the definition of common logging rules, to be used at design and development time to enhance accuracy, and (iii) the design of a logging platform to orchestrate the collection and analysis processes. A case study of the proposed framework is presented, in the context of a real-world complex system.
... Whereas in [5] only time based trend detection and estimation of resource exhaustion are considered, we also take the system workload into account for building our model. Other previous work in measurement-based dependability evaluation is based on either measurements made at failure times [1, 12, 15] or at error observation times [13, 23, 24]. In our case, we need to monitor the system parameters continuously since we are interested in trend estimation and not in inter-failure times or identifying error patterns. ...
... Analyze NOW [104] is a set of tools specifically tailored for the FFDA of networks of workstations. It embodies tools for the automated data collection from all the workstations, and tools for the analysis of such data. ...
... Query-based techniques rely on the presence of an external module, which is in charge of querying the monitored component, to discover latent errors [6]. Log-based techniques consist in examining the log files produced by the components, in order to figure out the system behavior by correlating different events [7]. ...
Conference Paper
Software systems employed in critical scenarios are increasingly large and complex. The usage of many heterogeneous components causes complex interdependencies, and introduces sources of non-determinism, that often lead to the activation of subtle faults. Such behaviors, due to their complex triggering patterns, typically escape the testing phase. Effective on-line monitoring is the only way to detect them and to promptly react in order to avoid more serious consequences. In this paper, we propose an error detection framework to cope with software failures, which combines multiple sources of data gathered both at application-level and OS-level. The framework is evaluated through a fault injection campaign on a complex system from the Air Traffic Management (ATM) domain. Results show that the combination of several monitors is effective to detect errors in terms of false alarms, precision and recall.
... An example is MEADEP [18], which consists of four software modules, i.e., a data preprocessor for converting data in various formats to the MEADEP format, a data analyzer for graphical data presentation and parameter estimation, a graphical modeling interface for building block diagrams, e.g., Weibull and k-out-of-n block, and Markov reward chains, and a model-solution module for availability/reliability estimation with graphical parametric analysis. Analyze NOW [19] is a set of tools tailored for networks of workstations. It embodies tools for the automated data collection from all the workstations, and tools for automating the data analysis task. ...
Article
Full-text available
Field Failure Data Analysis (FFDA) is a widely adopted methodology to characterize the dependability behavior of a computing system. It is often based on the analysis of logs available in the system under study. However, current logs do not seem to be actually conceived to perform FFDA, since their production usually lacks a systematic approach and relies on developers' experience and attitude. As a result, collected logs may be heterogeneous, inaccurate and redundant. This, in turn, increases analysis efforts and reduces the quality of FFDA results. This paper proposes a rule-based logging framework, which aims to improve the quality of logged data and to make the analysis phase more effective. Our proposal is compared to traditional log analysis in the context of a real-world case study in the field of Air Traffic Control. We demonstrate that the adoption of a rule-based strategy makes it possible to significantly improve dependability evaluation by reducing the amount of information actually needed to perform the analysis, without affecting system performance.
... These studies contributed to achieving a significant understanding of the failure modes of these systems and, as in [15], made it possible to improve their successive releases. Despite the existence of software packages that integrate state-of-the-art techniques to collect, manipulate, and model the log data, e.g., [24], [22], FFDA also requires ad-hoc strategies and algorithms to identify failure-related entries in the log and to coalesce the entries related to the same problem. The criticality of these tasks, towards the objective of achieving accurate measurements, has been recognized since early work in this area, such as [10], [2]. ...
Conference Paper
Full-text available
Log-based Field Failure Data Analysis (FFDA) is a widely-adopted methodology to assess dependability properties of an operational system. A key step in FFDA is filtering out entries that are not useful and redundant error entries from the log. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both the heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.
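
The tuple heuristic that this paper refines can be stated compactly: consecutive error entries whose inter-arrival time falls below a fixed window are coalesced into one fault manifestation. A minimal version is sketched below with made-up timestamps; the statistical indicators the improved heuristic adds are not shown.

    # Basic tuple heuristic: coalesce error entries separated by less than a
    # fixed time window. Timestamps (seconds) are invented; the statistical
    # refinements proposed in the paper are not shown.
    def tuplize(timestamps, window_s=300):
        groups = []
        for t in sorted(timestamps):
            if groups and t - groups[-1][-1] <= window_s:
                groups[-1].append(t)   # same fault manifestation
            else:
                groups.append([t])     # start a new tuple
        return groups

    events = [0, 40, 90, 2000, 2100, 9000]
    print(tuplize(events))             # [[0, 40, 90], [2000, 2100], [9000]]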
... Our implementation of the rejuvenation agent involves monitoring and trending, which is similar to the above works. Other previous work in measurement-based dependability evaluation is based on either measurements made at failure times [2, 13, 15] or at error observation times [14, 22, 23]. In [17], system parameters are constantly monitored to detect anomalies in a network automatically. ...
Conference Paper
Several recent studies have reported the phenomenon of "software aging", one in which the state of a software system degrades with time. This may eventually lead to performance degradation of the software or crash/hang failure or both. "Software rejuvenation" is a pro-active technique aimed to prevent unexpected or unplanned outages due to aging. The basic idea is to stop the running software, clean its internal state and restart it. In this paper, we discuss software rejuvenation as applied to cluster systems. This is both an innovative and an efficient way to improve cluster system availability and productivity. Using Stochastic Reward Nets (SRNs), we model and analyze cluster systems which employ software rejuvenation. For our proposed time-based rejuvenation policy, we determine the optimal rejuvenation interval based on system availability and cost. We also introduce a new rejuvenation policy based on prediction and show that it can dramatically increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior and performability measures, which we are just beginning to explore. We then briefly describe an implementation of a software rejuvenation system that performs periodic and predictive rejuvenation, and show some empirical data from systems that exhibit aging
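
The trade-off behind a time-based rejuvenation interval — rejuvenate too often and planned downtime dominates, too rarely and aging-induced crashes dominate — can be illustrated without the paper's stochastic reward nets by the classic age-replacement cost-rate formula C(T) = [c_f·F(T) + c_r·(1 − F(T))] / E[min(X, T)]. The sketch below evaluates it for a Weibull time-to-failure; the distribution, the costs, and the candidate intervals are all assumptions, not the paper's model.

    # Age-replacement cost rate as a simplified stand-in for choosing a
    # time-based rejuvenation interval. Weibull parameters and costs are assumed.
    import numpy as np

    shape, scale = 2.0, 1000.0     # increasing failure rate (aging), hours
    c_f, c_r = 10.0, 1.0           # cost of an unplanned crash vs. a rejuvenation

    def cost_rate(T, n=10000):
        t = np.linspace(0.0, T, n)
        surv = np.exp(-(t / scale) ** shape)       # Weibull survival S(t)
        F_T = 1.0 - surv[-1]                       # probability of failing before T
        expected_cycle = np.sum(surv) * (t[1] - t[0])   # numeric integral of S(t)
        return (c_f * F_T + c_r * (1.0 - F_T)) / expected_cycle

    candidates = [100, 250, 500, 1000, 2000]
    print({T: round(cost_rate(T), 5) for T in candidates})
    print("best interval:", min(candidates, key=cost_rate))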
... Several studies have examined failures in networks of workstations. Thakur et al. [32] examined failures in a network of 69 SunOS workstations but only divided problem root cause into network, non-disk machine problems, and disk-related machine problems. Kalyanakrishnam et al. [15] studied six months of event logs from a LAN of Windows NT workstations used for mail delivery, to determine the causes of machines rebooting. ...
Conference Paper
We describe the architecture, operational practices, and failure characteristics of three very large-scale Internet services. Our research on architecture and operational practices took the form of interviews with architects and operations staff at those (and several other) services. Our research on component and service failure took the form of examining the operations problem tracking databases from two of the services and a log of service failure post-mortem reports from the third. Architecturally, we find convergence on a common structure: division of nodes into service front-ends and back-ends, multiple levels of redundancy and load-balancing, and use of custom-written software for both production services and administrative tools. Operationally, we find a thin line between service developers and operators, and a need to coordinate problem detection and repair across administrative domains. With respect to failures, we find that operator errors are their primary cause, operator error is the most difficult type of failure to mask, service front-ends are responsible for more problems than service back-ends but fewer minutes of unavailability, and that online testing and more thoroughly exposing and detecting component failures could reduce system failure rates for at least one service.
... While brown-outs at Internet services are pervasive, more serious outages also occur frequently, as noted in Table 2. Of the studies [53,54,77,89,121,124], there is only one that we know of that reports directly on causes of failures in Internet services [107]. In that report, Oppenheimer et al. study over a hundred post-mortem reports on user-visible failures at three different Internet services, and find that improved detection of application-level failures could have mitigated or avoided 65% of reported user-visible failures. ...
Article
Full-text available
Submitted to the Department of Computer Science. Copyright by the author. Thesis (Ph. D.)--Stanford University, 2005.
... Plenty of studies have focused on this topic, proposing techniques and methodologies for studying error logs collected from a variety of distributed systems. Examples are studies of Networks of Workstations [7], Windows NT operating systems [8], and, more recently, Large-Scale Heterogeneous Server Environments [9]. On the other hand, there is a growing number of works studying dependability issues of wireless and mobile infrastructures. ...
Conference Paper
Full-text available
The widespread use of mobile and wireless computing platforms is leading to a growing interest on dependability issues. Several research studies have been conducted on dependability of mobile environments, but none of them attempted to identify system bottlenecks and to quantify dependability measures. This paper proposes a distributed automated infrastructure for monitoring and collecting spontaneous failures of the Bluetooth infrastructure, which is nowadays more and more recognized as an enabler for mobile systems. Information sources for failure data are presented, and preliminary experimental results are discussed.
... The latter events are particularly useful for dependability analysis. Event-log-based dependability analysis of computer systems has been the focus of several research papers [1] [2] [3] [5] [8] [9]. While various types of systems have been studied, including mainframes and largely deployed commercial systems, only a few studies addressed Windows NT or Windows 2K systems [4] [6] [10]. ...
... Whereas in [5] only time based trend detection and estimation of resource exhaustion are considered, we also take the system workload into account for building our model. Other previous work in measurement-based dependability evaluation is based on either measurements made at failure times [1, 12, 15] or at error observation times [13, 23, 24]. In our case, we need to monitor the system parameters continuously since we are interested in trend estimation and not in inter-failure times or identifying error patterns. ...
Conference Paper
Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of “software aging”, in which the state of the software system degrades with time, has been reported (S. Garg et al., 1998). The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the software or crash/hang failure, or both. Earlier work in this area to detect aging and to estimate its effect on system resources did not take into account the system workload. In this paper, we propose a measurement-based model to estimate the rate of exhaustion of operating system resources both as a function of time and the system workload state. A semi-Markov reward model is constructed based on workload and resource usage data collected from the UNIX operating system. We first identify different workload states using statistical cluster analysis and build a state-space model. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource exhaustion in the different states. The model is then solved to obtain trends and the estimated exhaustion rates and the time-to-exhaustion for the resources. With the help of this measure, proactive fault management techniques such as “software rejuvenation” (Y. Huang et al., 1995) may be employed to prevent unexpected outages
... The study analyzed problems that resulted in panics and system crashes. Thakur [19] described a simple yet effective methodology for collecting and analyzing failures in a network of UNIX-based workstations. ...
Conference Paper
Full-text available
This paper presents results of a failure data analysis of a LAN of Windows NT machines. Data for the study was obtained from event logs collected over a six-month period from the mail routing network of a commercial organization. The study focuses on characterizing causes of machine reboots. The key observations from this study are: 1) most of the problems that lead to reboots are software related; 2) rebooting the machine does not always solve the problem; 3) there are indications of propagated or correlated failures; and 4) though the average availability evaluates to over 99%, the machine downtime lasts (on average) two hours. Since the machines are dedicated mail servers, bringing down one or more of them can potentially disrupt storage, forwarding, reception and delivery of mail. This suggests that the average availability is not a good measure to characterize this type of network service
... Other related work in measurement-based dependability evaluation is based on either measurements made at failure times [11] or at error observation times [42]. In our case, we monitor the system performance variables continuously since we are interested in trend estimation and, hence, in predicting the time to next failure and not in observing interfailure times or identifying error patterns. ...
Article
Full-text available
Recently, the phenomenon of software aging, one in which the state of the software system degrades with time, has been reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, estimate transition probabilities, and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
... Soft-error-induced network interface failures can be quite detrimental to the reliability of a distributed system. The failure data analysis reported in [6] indicates that network-related problems contributed to approximately 40 percent of the system failures observed in distributed environments. As we will see in the following sections, soft errors can cause the network interface to completely stop responding, function improperly, or greatly reduce network performance. ...
Article
Full-text available
Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures. Failure detection is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. Our failure recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact to the host system and be completely transparent to the user.
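
The software watchdog described above rests on a simple contract: the monitored component must signal progress periodically, and a missed deadline triggers recovery. The toy sketch below shows that contract with Python threads; the timeout, the callback, and the host-side placement are illustrative assumptions, not the paper's interface-firmware design.

    # Toy software watchdog: the monitored component must call pet() regularly;
    # a missed deadline fires the recovery callback. Values are illustrative.
    import threading

    class Watchdog:
        def __init__(self, timeout_s, on_hang):
            self.timeout_s = timeout_s
            self.on_hang = on_hang
            self._timer = None

        def pet(self):
            """Heartbeat from the monitored component; restarts the countdown."""
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout_s, self.on_hang)
            self._timer.daemon = True
            self._timer.start()

        def stop(self):
            if self._timer is not None:
                self._timer.cancel()

    wd = Watchdog(2.0, lambda: print("no heartbeat: reset the network interface"))
    wd.pet()   # in practice driven by observed progress of the interface
    wd.stop()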
... There are extensive studies on faults and failures in information systems in the reliability community, covering software reliability [5], classification of faults [5], [6], fault detection [7], [8], and fault-tolerant system design [9]- [12]. However, intrusions into information systems, which account for many real-world reliability incidents and QoS problems in information systems, have not been widely studied using theories and techniques of quality and reliability engineering. ...
Article
Reliability and quality of service from information systems has been threatened by cyber intrusions. To protect information systems from intrusions and thus assure reliability and quality of service, it is highly desirable to develop techniques that detect intrusions. Many intrusions manifest in anomalous changes in intensity of events occurring in information systems. In this study, we apply, test, and compare two EWMA techniques to detect anomalous changes in event intensity for intrusion detection: EWMA for autocorrelated data and EWMA for uncorrelated data. Different parameter settings and their effects on performance of these EWMA techniques are also investigated to provide guidelines for practical use of these techniques.
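
An EWMA chart on event intensity, as studied above, smooths the per-interval event count and raises an alarm when the smoothed statistic leaves its control limits. The sketch below uses the standard asymptotic limits for uncorrelated data with an invented count series and arbitrarily chosen lambda and L; it is not the paper's parameterization.

    # EWMA chart on per-interval event counts, using the standard asymptotic
    # control limits for uncorrelated data. Series, lambda and L are invented.
    import math

    counts = [12, 10, 11, 13, 12, 11, 40, 42, 39, 12, 11]
    lam, L = 0.2, 3.0

    baseline = counts[:5]
    mu = sum(baseline) / len(baseline)
    sigma = math.sqrt(sum((c - mu) ** 2 for c in baseline) / (len(baseline) - 1))
    width = L * sigma * math.sqrt(lam / (2.0 - lam))

    z = mu
    for i, c in enumerate(counts):
        z = lam * c + (1.0 - lam) * z
        if abs(z - mu) > width:
            print(f"interval {i}: EWMA {z:.1f} outside "
                  f"[{mu - width:.1f}, {mu + width:.1f}] -> possible anomaly")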
... The study analyzed problems that resulted in panics and system crashes. Thakur [22] described a simple yet effective methodology for collecting and analyzing failures in a network of UNIX-based workstations. The majority (68%) of the failures encountered were network related. ...
Article
This paper presents a measurement-based dependability study of a networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, (7) maintenance and configuration contribute to 24% of system downtime.
Chapter
High performance computing (HPC) systems are becoming the norm for daily use and care must be taken to ensure that these systems are resilient. Recent contributions on resiliency have been from quantitative and qualitative perspectives where general system failures are considered. However, there are limited contributions dealing with the specific classes of failures that are directly related to cyber-attacks. In this chapter, the author uses the concepts of transition processes and limiting distributions to perform a generic theoretical investigation of the effects of targeted failures by relating the actions of the cyberenemy (CE) to different risk levels in an HPC system. Special cases of constant attack strategies are considered where exact solutions are obtained. Additionally, a stopped process is introduced to model the effects of system termination. The results of this representation can be directly applied throughout the HPC community for monitoring and mitigating cyber-attacks.
Conference Paper
When a complex real-world application is deployed post-release, a number of crash reports are generated. As the number of clients using the product increases, so do the crash reports. Typically, the approach followed in many software organizations is to manually analyze a crash report to identify the erroneous module responsible for the crash. Naturally, when a large number of crash reports are generated daily, the development team requires a substantial amount of time to analyze all these reports. This in turn increases the turn-around time for crash report analysis which often leaves customers unhappy. In order to address this problem, we have developed an automated method to analyze a crash report and identify the erroneous module. This method is based on a novel algorithm that searches for exception-based patterns in crash reports and maps reference assemblies. We have applied this method to several thousand crash reports across four sub-systems of an industrial automation application. Results indicate that the algorithm not only achieves a high accuracy in finding the erroneous module and subsystem behind a crash, but also significantly reduces the turn-around time for crash report analysis.
Conference Paper
This paper proposes a heuristic to improve the analysis of supercomputer error logs. The heuristic is able to estimate the measurement error induced by the clustering of error events and consequently drive the analysis. The goal is to reduce errors induced by the clustering and to estimate how much they affect the measurements. The heuristic is validated against 40 synthetic datasets, for different systems ranging from 16k to 256k nodes under different failure assumptions. We show that (i) to accurately analyze the complex failure behavior of large computing systems, multiple time windows need to be adopted at the granularity of node subsystems, e.g. memory and I/O, and (ii) for large systems, the classical single-time-window analysis can overestimate the MTBF by more than 150%, while the proposed heuristic can decrease the measurement error by one order of magnitude.
Article
The link load balancing technique has been proved in theory to be able to improve system availability greatly. This paper presents a dynamic link load balancing algorithm which can avoid failed paths. The new algorithm obtains the length of the link waiting queue and the response time, which helps distribute all data traffic loads across multiple links and increases link bandwidth usage. In addition, the availability of the current links can be managed in real time, the currently disconnected link can be shielded, and the mean time to failure (MTTF) can be extended; therefore system availability greatly improves. Finally, the simulation results verify the effectiveness of this algorithm.
Chapter
Failure analysis is valuable to dependability engineers because it supports designing effective mitigation means, defining strategies to reduce maintenance costs, and improving system service. Event logs, which contain textual information about regular and anomalous events detected by the system under real workload conditions, represent a key source of data to conduct failure analysis. So far, event logs have been successfully used in a variety of domains. This chapter describes methodology and well-established techniques underlying log-based failure analysis. Description introduces the workflow leading to analysis results starting from the raw data in the log. Moreover, the chapter surveys relevant works in the area with the aim of highlighting main objectives and applications of log-based failure analysis. Discussion reveals benefits and limitations of logs for evaluating complex systems.
Article
This paper presents a novel approach to assess time coalescence techniques. These techniques are widely used to reconstruct the failure process of a system and to estimate dependability measurements from its event logs. The approach is based on the use of automatically generated logs, accompanied by the exact knowledge of the ground truth on the failure process. The assessment is conducted by comparing the presumed failure process, reconstructed via coalescence, with the ground truth. We focus on supercomputer logs, due to the increasing importance of automatic event log analysis for these systems. Experimental results show how the approach makes it possible to compare different time coalescence techniques and to identify their weaknesses with respect to given system settings. In addition, the results revealed an interesting correlation between errors caused by the coalescence and errors in the estimation of dependability measurements.
Article
Reliability is a rapidly growing concern in the contemporary Personal Computer (PC) industry, both for computer users and for product developers. To improve dependability, systems designers and programmers must consider failure and usage data for operating systems as well as applications. In this paper, we analyze crash data from Windows machines. We collected our data from two different sources – the UC Berkeley EECS department and a population of volunteers who contribute to the BOINC project. We study both application crash behavior and operating system crashes. We found that application crashes are caused both by faulty, non-robust dll files and by impatient users who prematurely terminate non-responding applications, especially web browsers. OS crashes are predominantly caused by poorly-written device driver code. Users as well as product developers will benefit from understanding the crash behaviors and crash-prevention techniques we have revealed in this paper.
Article
Full-text available
On-line failure detection is an essential means to control and assess the dependability of complex and critical software systems. In such a context, effective detection strategies are required in order to minimize the possibility of catastrophic consequences. This objective is however difficult to achieve in complex systems, especially due to the several sources of non-determinism (e.g., multi-threading and distributed interaction) which may lead to software hangs, i.e., the system is active but no longer capable of delivering its services. The paper proposes a detection approach to uncover application hangs. It exploits multiple indirect data gathered at the operating system level to monitor the system and to trigger alarms if the observed behavior deviates from the expected one. By means of fault injection experiments conducted on a research prototype, it is shown how the combination of several operating system monitors actually leads to a high quality of detection, at an acceptable overhead.
Conference Paper
Since around the 1980s, researchers and software engineers dealing with the dependability of systems and software products have recognized the crucial role of field data. Field data represent an attractive way of increasing the efficiency of testing activities and, once the product has been delivered, of improving the quality of subsequent releases. On the other hand, plenty of research studies have been conducted on techniques and methods for field data measurement of operational systems: from data filtering and analysis techniques to dependability measurements and improvements. Field data also represent a good opportunity for understanding sources of failures. Examples are provided in Siewiorek et al. (2004), such as the one reporting failure data collected on the public switched telephone network (PSTN), which emphasizes how failures in systems are due to the environment and operators. Despite these efforts, there is an increasing gap between software practitioners and researchers involved in the development of accurate and efficient methodologies for field data measurement campaigns. This paper tries to shed some light on this problem, which is a sort of "hide and seek game" of real-world field data between industries and researchers. Here, the author tries to emphasize the importance of real collaboration between industries and the research community toward the definition of an effective "design for dependability evaluation" methodology.
Conference Paper
Full-text available
Striping data across multiple nodes has been recognized as an effective technique for delivering high-bandwidth I/O to applications running on clusters. However, the technique is vulnerable to disk failure. We present an I/O architecture for clusters called reliable array of autonomous controllers (RAAC) that builds on the technique of RAID-style data redundancy. The RAAC architecture uses a two-tier layout that enables the system to scale in terms of storage capacity and transfer bandwidth while avoiding the synchronization overhead incurred in a distributed RAID system. We describe our implementation of RAAC in PVFS, and compare the performance of parity-based redundancy in RAAC and in a conventional distributed RAID architecture.
Conference Paper
The dependability of a system can be experimentally evaluated at different phases of its life cycle. In the design phase, computer-aided design (CAD) environments are used to evaluate the design via simulation, including simulated fault injection. Such fault injection tests the effectiveness of fault-tolerant mechanisms and evaluates system dependability, providing timely feedback to system designers. Simulation, however, requires accurate input parameters and validation of output results. Although the parameter estimates can be obtained from past measurements, this is often complicated by design and technology changes. In the prototype phase, the system runs under controlled workload conditions. In this stage, controlled physical fault injection is used to evaluate the system behavior under faults, including the detection coverage and the recovery capability of various fault tolerance mechanisms. Fault injection on the real system can provide information about the failure process, from fault occurrence to system recovery, including error latency, propagation, detection, and recovery (which may involve reconfiguration). But this type of fault injection can only study artificial faults; it cannot provide certain important dependability measures, such as mean time between failures (MTBF) and availability. In the operational phase, a direct measurement-based approach can be used to measure systems in the field under real workloads. The collected data contain a large amount of information about naturally occurring errors/failures.
Conference Paper
Machine Check Architecture (MCA) is a processor internal architecture subsystem that detects and logs correctable and uncorrectable errors in the data or control paths in each CPU core and the Northbridge. These errors include parity errors associated with caches, TLBs, ECC errors associated with caches and DRAM, and system bus errors. This paper reports on an experimental study on: (i) monitoring a computing cluster for machine checks and using this data to identify patterns that can be employed for error diagnostics and (ii) introducing faults into the machine to understand the resulting machine checks and correlate this data with relevant performance metrics.
Conference Paper
PC users have started viewing crashes as a fact of life rather than a problem. To improve operating system dependability, systems designers and programmers must analyze and understand failure data. In this paper, we analyze Windows XP kernel crash data collected from a population of volunteers who contribute to the Berkeley Open Infrastructure for Network Computing (BOINC) project. We found that OS crashes are predominantly caused by poorly-written device driver code. Users as well as product developers will benefit from understanding the crash behaviors elaborated in this paper.
Conference Paper
Mobile Peer-to-Peer (P2P) is a base paradigm for many new killer applications for mobile ad-hoc networks and the Mobile Internet. Currently, it is not well understood whether this paradigm is able to meet business and consumer dependability expectations. Dependability assessment of P2P applications can be achieved by field failure data analysis. The collection of failure data from wireless ad-hoc networks is a challenging task due to the intermittent usage and the mobility of users that do not allow to measure time-based dependability parameters. For this reason, we propose to deploy automated workloads on the actual peer nodes which have to operate continuously. Specifically, this paper formalizes the problem and presents the design of a workload for mobile P2P that aims to orchestrate the peers uniformly, letting the failure occurrence be independent of the network load. Simulation results and experimentation over an actual Bluetooth network demonstrate that the proposed workload meets the defined requirements.
Article
In response to the strong desire of customers to be provided with advance notice of unplanned outages, techniques were developed that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system. The resulting techniques are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures.
Conference Paper
This work presents a measurement-based dependability evaluation of the Bluetooth data communication channel, i.e., the Baseband layer. The main contribution is the definition of the Baseband's error/recovery model according to the Markov chains formalism. The model is derived by analyzing field data, which are collected via a commercial air sniffer deployed over real-world Bluetooth piconets. The model is parametric and actual values for its parameters are estimated by analyzing the field data. The paper also proposes the evaluation of dependability statistics (e.g., the error and failure times distributions, and the availability estimate), and the study of the failing behavior of the Bluetooth communication channel under Wi-Fi interferences.
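
The error/recovery model mentioned above can be miniaturized to a two-state continuous-time Markov chain whose steady-state probability of the OK state is the availability estimate. The sketch below solves that reduced chain; the Baseband model in the paper has more states, and the rates here are invented.

    # Two-state OK/ERROR continuous-time Markov chain; steady-state probability
    # of OK is the availability estimate. Rates are invented, and the paper's
    # Baseband model has more states than this.
    import numpy as np

    error_rate = 0.02      # OK -> ERROR, per second
    recovery_rate = 1.5    # ERROR -> OK, per second

    Q = np.array([[-error_rate, error_rate],
                  [recovery_rate, -recovery_rate]])

    # Solve pi Q = 0 with sum(pi) = 1 as a least-squares system.
    A = np.vstack([Q.T, np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(f"steady-state availability ~ {pi[0]:.4f}")   # ~ recovery/(error+recovery)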
Conference Paper
This work presents a failure data analysis campaign on Bluetooth personal area networks (PANs) conducted on two kinds of heterogeneous testbeds (working for more than one year). The obtained results reveal how the failure distribution is characterized and suggest how to improve the dependability of Bluetooth PANs. Specifically, we define the failure model and then identify the most effective recovery actions and masking strategies that can be adopted for each failure. We then integrate the discovered recovery actions and masking strategies into our testbeds, improving availability by 3.64% (up to 36.6%) and reliability (in terms of mean time to failure) by 202%, respectively.
Conference Paper
The increasing complexity of mobile phones directly affects their reliability, while users' tolerance for failures is decreasing, especially when the phone is used for business- or mission-critical applications. Despite these concerns, there is still little understanding of how and why these devices fail, and no techniques have been defined to gather useful information about failure manifestation from the phone. This paper presents the design of a logger application to collect failure-related information from mobile phones. Preliminary failure data collected from real-world mobile phones confirm that the proposed logger is a useful instrument to gain knowledge about the dynamics and causes of mobile phone failures.
Conference Paper
This paper presents a measurement-based dependability study of a networked Windows NT system based on field data collected from NT system logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to total system downtime (22% and 10%), (2) recovery from application software failures is usually quick, (3) in many cases, more than one reboot is required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified, indicating the need for better logging techniques, and (7) maintenance and configuration contribute 24% of system downtime.
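As an illustration of how per-server availability figures and downtime breakdowns of this kind can be derived from reboot records (the field names below are illustrative, not the NT event-log schema):

from collections import defaultdict

def summarize(reboots, observation_hours):
    """`reboots` is a list of dicts with illustrative fields:
    {'server': str, 'downtime_h': float, 'cause': str}.
    Returns per-server availability and the share of total downtime per cause."""
    down_per_server = defaultdict(float)
    down_per_cause = defaultdict(float)
    for r in reboots:
        down_per_server[r["server"]] += r["downtime_h"]
        down_per_cause[r["cause"]] += r["downtime_h"]
    availability = {s: 1.0 - d / observation_hours for s, d in down_per_server.items()}
    total_down = sum(down_per_cause.values())
    cause_share = {c: d / total_down for c, d in down_per_cause.items()}
    return availability, cause_share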
Conference Paper
The paper presents an injection-based approach to analyze the dependability of high-speed networks, using Myrinet as an example testbed. Instead of injecting faults related to network protocols, the authors injected faults into the host interface component, which performs the actual send and receive operations. The fault model used was a temporary single bit flip in an instruction executing on the host interface's custom processor, corresponding to a transient fault in the processor itself. Results show that more than 25% of the injected faults resulted in interface failures. Furthermore, the authors observed fault propagation from an interface to its host computer or to another interface to which it sent a message. These findings suggest that two important issues for high-speed networking in critical applications are protecting the host computer from errant or malicious interface components and implementing thorough message acceptance test mechanisms to prevent errant messages from propagating faults between interfaces.
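A minimal sketch of the stated fault model, a transient single-bit flip in an instruction word (assumed 32 bits wide here; the study injected into the Myrinet interface's custom processor, which is not modeled below):

import random

def flip_one_bit(instruction_word, width=32):
    """Transient single-bit-flip fault model: invert one randomly chosen
    bit of an instruction word (assumed `width` bits wide)."""
    bit = random.randrange(width)
    return instruction_word ^ (1 << bit)

original = 0x8C220004  # an arbitrary 32-bit instruction word, purely illustrative
corrupted = flip_one_bit(original)
print(f"{original:#010x} -> {corrupted:#010x}")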
Article
Full-text available
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
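As a concrete example of one of the statistical techniques mentioned (parameter and confidence-interval estimation), a sketch assuming exponentially distributed interfailure times, for which a chi-square interval on the mean applies; the sample data are illustrative.

from scipy.stats import chi2

def mtbf_confidence_interval(interfailure_times, confidence=0.90):
    """Point estimate and two-sided confidence interval for the MTBF,
    assuming exponentially distributed times between failures."""
    n = len(interfailure_times)
    total = sum(interfailure_times)
    mtbf = total / n
    alpha = 1.0 - confidence
    lower = 2 * total / chi2.ppf(1 - alpha / 2, 2 * n)
    upper = 2 * total / chi2.ppf(alpha / 2, 2 * n)
    return mtbf, (lower, upper)

times_h = [120.0, 75.0, 210.0, 98.0, 160.0, 55.0, 300.0, 130.0]
print(mtbf_confidence_interval(times_h))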
Conference Paper
Full-text available
In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is measuring the customer-perceived failure rate of commercial software. Unfortunately, even order-of-magnitude measures of failure rate are not truly available for commercial software that is widely distributed. Given repeated reports on the criticality and significance of software, the industry still lacks real baselines. The paper reports the failure rate of a commercial software product of several million lines of code distributed to hundreds of thousands of customers. To a first order of approximation, the MTBF reaches around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release, and time are also provided. The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, corresponding to the number of failures due to a fault, and the failure window, measuring the length of time between the first and last failure attributed to that fault. Both metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant with severity. The fault weight and failure window are natural, intuitive measures of the failure process: the fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new basis for discussion and an opportunity to gain greater understanding of the processes involved.
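A sketch of how the two metrics could be computed from failure records tagged with the fault each failure was traced to; the record layout is illustrative, not the paper's data format.

from collections import defaultdict

def fault_weight_and_window(failures):
    """`failures` is a list of (fault_id, failure_date) pairs, dates as floats
    (e.g. days since release). Fault weight = number of failures attributed
    to the fault; failure window = time between the first and last of them."""
    dates = defaultdict(list)
    for fault_id, when in failures:
        dates[fault_id].append(when)
    return {f: {"weight": len(ds), "window": max(ds) - min(ds)}
            for f, ds in dates.items()}

records = [("F1", 10.0), ("F1", 34.0), ("F1", 90.0), ("F2", 12.0), ("F2", 15.0)]
print(fault_weight_and_window(records))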
Article
Full-text available
Orthogonal defect classification (ODC), a concept that enables in-process feedback to software developers by extracting signatures of the development process from defects, is described. The ideas evolved from an earlier finding that demonstrates the use of semantic information from defects to extract cause-effect relationships in the development process. This finding is leveraged to develop a systematic framework for building measurement and analysis methods. The authors define ODC and discuss the necessary and sufficient conditions required to provide feedback to a developer; illustrate the use of the defect type distribution to measure the progress of a product through a process; illustrate the use of the defect trigger distribution to evaluate the effectiveness and eventually the completeness of verification processes such as inspection or testing; provide sample results from pilot projects using ODC; and open the door to a wide variety of analysis techniques for providing effective and fast feedback based on the concepts of ODC.
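For illustration, a sketch of the kind of defect type distribution ODC tracks per development phase; the phase and type names below are merely indicative.

from collections import Counter

def defect_type_distribution(defects):
    """`defects` is a list of (phase, defect_type) pairs. Returns, per phase,
    the fraction of defects of each type, which can be tracked across phases
    as a progress signature."""
    per_phase = {}
    for phase, dtype in defects:
        per_phase.setdefault(phase, Counter())[dtype] += 1
    return {phase: {t: n / sum(c.values()) for t, n in c.items()}
            for phase, c in per_phase.items()}

sample = [("unit test", "assignment"), ("unit test", "checking"),
          ("system test", "function"), ("system test", "interface"),
          ("system test", "function")]
print(defect_type_distribution(sample))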
Conference Paper
The paper presents results from an investigation of failures in several releases of Tandem's NonStop-UX Operating System, which is based on Unix System V. The analysis covers software failures from the field and failures reported by Tandem's test center. Fault classification is based on the status of the reported failures, the detection point of the errors in the operating system code, the panic message generated by the systems, the module that was found to be faulty, and the type of programming mistake. This classification reveals which modules in the operating system generate the most faults and in which modules most errors are detected. We also present distributions of the failure and repair times, including the interarrival time of unique failures and the time between duplicate failures. These distributions, unlike generic time distributions such as time between failures, help characterize the software quality. The distribution of repair times highlights the repair process and the factors influencing repair. The distribution of system uptime before a panic reveals the factors triggering the panic.
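A sketch of how interarrival times of unique failures and times between duplicate failures could be separated from a chronologically ordered report stream; the record layout is illustrative.

def interarrival_stats(reports):
    """`reports` is a chronologically sorted list of (time, fault_id) pairs.
    Returns interarrival times of first (unique) occurrences and the times
    between duplicate reports of an already-seen fault."""
    seen = {}
    unique_gaps, duplicate_gaps = [], []
    last_unique_time = None
    for t, fault_id in reports:
        if fault_id not in seen:
            if last_unique_time is not None:
                unique_gaps.append(t - last_unique_time)
            last_unique_time = t
        else:
            duplicate_gaps.append(t - seen[fault_id])
        seen[fault_id] = t
    return unique_gaps, duplicate_gaps

print(interarrival_stats([(0, "A"), (5, "B"), (7, "A"), (20, "C"), (21, "A")]))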
Conference Paper
An analysis is given of the software error logs produced by the VAX/VMS operating system from two VAXcluster multicomputer environments. Basic error characteristics are identified by statistical analysis. Correlations between software and hardware errors, and among software errors on different machines, are investigated. Finally, reward analysis and reliability growth analysis are performed to evaluate software dependability. Results show that the major software problems in the measured systems stem from program flow control and I/O management. The network-related software is suspected to be a reliability bottleneck. It is shown that a multicomputer software 'time between error' distribution can be modeled by a 2-phase hyperexponential random variable: a lower error rate pattern that characterizes regular errors, and a higher error rate pattern that characterizes error bursts and concurrent errors on multiple machines.
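A sketch of the 2-phase hyperexponential time-between-error model named above, with illustrative (not measured) parameters: with probability p the gap follows the lower "regular" error rate, otherwise the higher "burst" rate.

import math
import random

def hyperexp_pdf(t, p, lam_regular, lam_burst):
    """Density of a 2-phase hyperexponential time-between-error model:
    regular errors with rate lam_regular, error bursts with rate lam_burst."""
    return (p * lam_regular * math.exp(-lam_regular * t)
            + (1 - p) * lam_burst * math.exp(-lam_burst * t))

def hyperexp_sample(p, lam_regular, lam_burst):
    lam = lam_regular if random.random() < p else lam_burst
    return random.expovariate(lam)

# illustrative parameters: 90% regular errors (~1/day), 10% bursts (~1/minute)
p, lam_r, lam_b = 0.9, 1 / 86400.0, 1 / 60.0
samples = [hyperexp_sample(p, lam_r, lam_b) for _ in range(5)]
print(samples, hyperexp_pdf(3600.0, p, lam_r, lam_b))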
Conference Paper
One heuristic for data reduction that is widely used in the literature is the coalescence of events occurring close together in time. The authors explore the validity of this heuristic by developing a model of how the contents of an event log are formed by multiple independent error processes. The probability of coalescing events from two or more error sources is formulated and compared with results from hand analysis of actual event logs taken from a Tandem TNS II system. Results indicate that the probability of coalescing events from more than one error source is a strong function of the selected time constant. The model can be used to select an appropriate time constant and also has implications for designing event logging systems.
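A sketch of the time-based coalescence heuristic being modeled: successive events closer together than a chosen time constant are merged into one tuple, so too large a constant risks collapsing events from independent error sources.

def coalesce(timestamps, time_constant):
    """Group chronologically sorted event timestamps into tuples: an event
    joins the current tuple if it falls within `time_constant` of the
    previous event, otherwise it starts a new tuple."""
    tuples = []
    for t in timestamps:
        if tuples and t - tuples[-1][-1] <= time_constant:
            tuples[-1].append(t)
        else:
            tuples.append([t])
    return tuples

events = [0.0, 2.0, 3.5, 60.0, 61.0, 300.0]
print(coalesce(events, time_constant=5.0))  # -> [[0.0, 2.0, 3.5], [60.0, 61.0], [300.0]]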
Conference Paper
Network performance monitoring is essential for managing a network efficiently and for ensuring continuous operation of the network. The paper discusses the implementation of a network monitoring tool and the results obtained by monitoring the College of Engineering Network at the University of South Carolina using this monitoring tool. The monitoring tool collects statistics useful for fault management, resource management, congestion control, and performance management.
Conference Paper
The author presents a method for detecting anomalous events in communication networks and other similarly characterized environments in which performance anomalies are indicative of failure. The methodology, based on automatically learning the difference between normal and abnormal behavior, has been implemented as part of an automated diagnosis system, from which performance results are drawn and presented. The dynamic nature of the model enables a diagnostic system to deal with continuously changing environments without explicit control, reacting to the way the world is now as opposed to the way the world was planned to be. Results of a successful deployment in a noisy, real-time monitoring environment are shown.
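As a stand-in for the learned model of normal behavior (the author's learning method is not reproduced here), a sketch that flags observations deviating strongly from recent history:

import statistics

def detect_anomalies(series, window=20, k=3.0):
    """Flag points deviating more than k standard deviations from the mean
    of the preceding `window` observations, a simple stand-in for a learned
    model of normal behavior."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies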
Conference Paper
A description is given of network management on the Ethernet-based engineering design network (EDEN). Past experiences and current network management techniques are presented for the various areas of network management, such as protocol-level management, network monitoring, network database, problem reporting, problem diagnosis, and fault isolation. Near-term expectations in network management for EDEN are discussed.
Article
Fault detection and diagnosis depend critically on good fault definitions, but the dynamic, noisy, and nonstationary character of networks makes it hard to define what a fault is in a network environment. The authors take the position that a fault or failure is a violation of expectations. In accordance with empirically based expectations, operating behaviors of networks (and other devices) can be classified as either normal or anomalous. Because network failures most frequently manifest themselves as performance degradations or deviations from expected behavior, periods of anomalous performance can be attributed to causes assignable as network faults. The half-year case study presented here used a system in which observations of distributed-computing network behavior were automatically and systematically classified as normal or anomalous. Anomalous behaviors were then traced to faulty conditions. In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail. The examples are taken from the distributed file-system network at Carnegie Mellon University.
Article
A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics, such as error/failure distributions and hazard rates, are obtained for both the individual machines and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate the performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low; however, its effect on system unavailability is significant.
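A small illustrative Markov reward model (hypothetical states, rates, and reward values, not those of the measured VAXcluster), showing how transient state probabilities obtained from the generator's matrix exponential combine with per-state reward rates to give an expected reward rate over time:

import numpy as np
from scipy.linalg import expm

# hypothetical 3-state model: 0 = OK, 1 = degraded (error), 2 = failed
Q = np.array([[-0.02,  0.015, 0.005],   # generator matrix (rates per hour)
              [ 0.50, -0.60,  0.10],
              [ 0.25,  0.00, -0.25]])
reward = np.array([1.0, 0.5, 0.0])      # per-state reward rate (relative capacity)
p0 = np.array([1.0, 0.0, 0.0])          # start in the OK state

for t in (1.0, 24.0, 720.0):            # hours
    p_t = p0 @ expm(Q * t)              # transient state probabilities at time t
    print(f"t={t:6.1f} h  expected reward rate = {reward @ p_t:.4f}")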