Conference Paper

Data mining approaches to software fault diagnosis


Abstract

Automatic identification of software faults has enormous practical significance. This requires characterizing program execution behavior and the use of appropriate data mining techniques on the chosen representation. In this paper we use the sequence of system calls to characterize program execution. The data mining tasks addressed are learning to map system call streams to fault labels and automatic identification of fault causes. Spectrum kernels and SVMs are used for the former, while latent semantic analysis is used for the latter. The techniques are demonstrated on the intrusion dataset containing system call traces. The results show that kernel techniques are as accurate as the best available results but are faster by orders of magnitude. We also show that latent semantic indexing is capable of revealing fault-specific features.
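The classification task described in the abstract (mapping system-call streams to fault labels) can be pictured with a small sketch: traces are turned into explicit k-spectrum feature vectors and a linear SVM assigns fault labels. This is not the authors' code; the traces, fault-label names, value of k and the use of scikit-learn are all illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code): system-call traces are mapped
# to explicit k-spectrum feature vectors (counts of all contiguous k-grams) and a
# linear SVM is trained to assign fault labels. Traces, labels and k are hypothetical.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def spectrum_features(trace, k=3):
    """Count every contiguous k-gram of system calls in a trace."""
    return Counter(" ".join(trace[i:i + k]) for i in range(len(trace) - k + 1))

# Hypothetical labelled traces: each trace is a sequence of system-call names.
labelled_traces = [
    (["open", "read", "read", "write", "close"], "normal"),
    (["open", "read", "mmap", "mmap", "exec"], "fault_A"),
    (["open", "write", "write", "write", "close"], "fault_B"),
]

vec = DictVectorizer()                                   # sparse k-gram count vectors
X = vec.fit_transform([spectrum_features(t) for t, _ in labelled_traces])
y = [label for _, label in labelled_traces]

clf = LinearSVC().fit(X, y)                              # linear SVM on the explicit spectrum map
new_trace = ["open", "read", "mmap", "exec", "close"]
print(clf.predict(vec.transform([spectrum_features(new_trace)])))
```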


... Further, the type of features or parameters monitored can influence detection/prevention. System parameters such as system call trace sequences, memory utilization statistics and network resource utilization statistics are widely used for detecting intrusions. Of all the system resources, detecting anomalies in system call trace execution [109,110,111,112,113,114,115] has been a popular approach to host-based intrusion detection. ...
... Then a complete system call trace, which is characteristic of a task, can be looked upon as a document. Contrary to techniques [112,114], which aim at separating anomalous system call traces from non-anomalous ones by eigendecomposition, in the present case we explore the possibilities of identifying each system call whose execution behaviour needs to be carefully watched over. Identifying these important system calls does not necessarily mean that the proposed method can be used directly to detect intrusions. ...
... It can only aid a HIDS by providing extra information. In that sense the proposed technique is different from many clustering based anomalous system call trace identification techniques such as [109,110,111,112]. In this section we discuss our investigations with regard to identifying system calls of importance. ...
Article
Full-text available
Automatic text summarization techniques, which can reduce a source text to a summary text by content generalization or selection, have assumed significance in recent times due to the ever-expanding information explosion created by the World Wide Web. Summaries generated by generalization of information are called abstracts, and those generated by selection of portions of text (sentences, phrases etc.) are called extracts. Further, summaries could be generated for each document separately, or multiple documents could be summarized together to produce a single summary. The challenges in making machines generate extracts or abstracts are primarily due to the lack of understanding of human cognitive processes. Summaries generated by humans seem to be influenced by their moral, emotional and ethical stance on the subject and by their background knowledge of the content being summarized. These characteristics are hardly understood and difficult to model mathematically. Further, automatic summarization is very much handicapped by the limitations of existing computing resources and the lack of good mathematical models of cognition. In view of these, the role of rigorous mathematical theory in summarization has been limited hitherto. The research reported in this thesis is a contribution towards bringing the power of well-established concepts of information theory to the field of summarization. Contributions of the Thesis: The specific focus of this thesis is on extractive summarization. Its domain spans multi-document summarization as well as single document summarization. Throughout the thesis, the words "summarization" and "summary" imply extract generation and sentence extracts respectively. In this thesis, two new summarizers referred to as ESCI (Extractive Summarization using Collocation Information) and De-ESCI (Dictionary enhanced ESCI) have been proposed. In addition, an automatic summary evaluation technique called DeFuSE (Dictionary enhanced Fuzzy Summary Evaluator) has also been introduced. The mathematical basis for the evolution of the scoring scheme proposed in this thesis and its relationship with other well-known summarization algorithms such as Latent Semantic Indexing (LSI) is also derived. The work detailed in this thesis is specific to the domain of extractive summarization of unstructured text, without taking into account data set characteristics such as the positional importance of sentences. This is to ensure that the summarizer works well for a broad class of documents, and to keep the proposed models as generic as possible. Central to the proposed work is the concept of the "Collocation Information of a word", its quantification and its application to summarization. "Collocation Information" (CI) is the amount of information (Shannon's measure) that a word and its collocations together contribute to the total information in the document(s) being summarized. The CI of a word has been computed using Shannon's measure for information over a joint probability distribution. Further, a base value of CI called the "Discrimination Threshold" (DT) has also been derived. To determine DT, sentences from a large collection of documents covering various topics, including the topic covered by the document(s) being summarized, were broken down into sequences of word collocations. The number of possible neighbors for a word within a specified collocation window was determined. This number has been called the "cardinality of the collocating set" and is represented as |ℵ(w)|.
It is proved that if |ℵ(w)|, determined from this large document collection for any word w, is fixed, then the maximum value of the CI for a word w is proportional to |ℵ(w)|. This constrained maximum is the "Discrimination Threshold" and is used as the base value of CI. Experimental evidence detailed in this thesis shows that sentences containing words with CI greater than DT are most likely to be useful in an extract. Words in every sentence of the document(s) being summarized have been assigned scores based on the difference between their current value of CI and their respective DT. Individual word scores have been summed to derive a score for every sentence. Sentences are ranked according to their scores, and the first few sentences in the rank order have been selected as the extract summary. Redundant and semantically similar sentences have been excluded from the selection process using a simple similarity detection algorithm. This novel method for extraction has been called ESCI in this thesis. In the second part of the thesis, the advantages of tagging words as nouns, verbs, adjectives and adverbs without the use of sense disambiguation have been explored. A hierarchical model for abstraction of knowledge has been proposed, and those cases where such a model can improve summarization accuracy have been explained. Knowledge abstraction has been achieved by converting collocations into their hypernymous versions. The number of levels of abstraction varies based on the sense tag given to each word in the collocation being abstracted. Once abstractions have been determined, the Expectation-Maximization algorithm is used to determine the probability value of each collocation at every level of abstraction. A combination of abstracted collocations from various levels is then chosen, and sentences are assigned scores based on the collocation information of these abstractions. This summarization scheme has been referred to as De-ESCI (Dictionary enhanced ESCI). It has been observed in many human summary data sets that the factual attribute of the human determines the choice of noun and verb pairs. Similarly, the emotional attribute of the human determines the choice of the number of noun and adjective pairs. In order to bring these attributes into the machine-generated summaries, two variants of De-ESCI have been proposed. The summarizer with the factual attribute has been called De-ESCI-F, while the summarizer with the emotional attribute has been called De-ESCI-E in this thesis. Both create summaries having two parts. The first part of the summary created by De-ESCI-F is obtained by scoring and selecting only those sentences in which a fixed number of nouns and verbs occur. The second part of De-ESCI-F is obtained by ranking and selecting those sentences which do not qualify for the selection process in the first part. Assigning sentence scores and selecting sentences for the second part of the summary is exactly as in ESCI. Similarly, the first part of De-ESCI-E is generated by scoring and selecting only those sentences in which a fixed number of nouns and adjectives occur.
The second part of the summary produced by De-ESCI-E is exactly like the second part in De-ESCI-F. As the model summary generated by human summarizers may or may not contain sentences with preference given to qualifiers (adjectives), the automatic summarizer does not know a priori whether to choose sentences with qualifiers over those without qualifiers. As there are two versions of the summary produced by De-ESCI-F and De-ESCI-E, one of them should be closer to the human summarizer's point of view (in terms of giving importance to qualifiers). This technique of choosing the best candidate summary has been referred to as De-ESCI-F/E. Performance Metrics: The focus of this thesis is to propose new models and sentence ranking techniques aimed at improving the accuracy of the extract in terms of sentences selected, rather than the readability of the summary. As a result, the order of sentences in the summary is not given importance during evaluation. Automatic evaluation metrics have been used, and the performance of the automatic summarizer has been evaluated in terms of precision, recall and f-scores obtained by comparing its output with model human-generated extract summaries. A novel summary evaluator called DeFuSE has been proposed in this thesis, and its scores are used along with the scores given by a standard evaluator called ROUGE. DeFuSE evaluates an extract in terms of precision, recall and f-score, relying on the WordNet hypernymy structure to identify semantically similar sentences in a document. It also uses fuzzy set theory to compute the extent to which a sentence from the machine-generated extract belongs to the model summary. Performance of candidate summarizers has been discussed in terms of percentage improvement in f-score relative to the baselines. The average of the ROUGE and DeFuSE f-scores for every summary is computed, and the mean value of these scores is used to compare performance improvement. Performance: For illustrative purposes, the DUC 2002 and DUC 2003 multi-document data sets have been used. From these data sets, only the 400-word summaries of DUC 2002 and the track-4 (novelty track) summaries of DUC 2003 are useful for evaluation of sentence extracts, and hence only these have been used. The f-score has been chosen as the measure of performance. Standard baselines such as coverage, size and lead, and also probabilistic baselines, have been used to measure the percentage improvement in f-score of candidate summarizers relative to these baselines. Further, summaries generated by MEAD using centroid and length as features for ranking (MEAD-CL), MEAD using positional, centroid and length as features for ranking (MEAD-CLP), the Microsoft Word automatic summarizer (MS-Word) and a Latent Semantic Indexing (LSI) based summarizer were used to compare the performance of the proposed summarization schemes.
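As a rough illustration of sentence scoring driven by collocation information, the sketch below assigns each word a score derived from the Shannon information of its windowed co-occurrences and ranks sentences by the sum of their word scores. It deliberately simplifies the thesis' CI/DT definitions (the Discrimination Threshold is omitted); the window size and toy sentences are assumptions.

```python
# Simplified sketch (not the thesis' ESCI implementation): each word receives a
# score derived from the Shannon information of its windowed co-occurrences, and
# sentences are ranked by the sum of their word scores.
import math
from collections import Counter

def collocation_scores(sentences, window=3):
    pair_counts = Counter()
    for words in sentences:
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                pair_counts[(w, v)] += 1
    total = sum(pair_counts.values()) or 1
    score = Counter()
    for (w, v), c in pair_counts.items():
        p = c / total                        # joint probability of the collocation
        info = -p * math.log2(p)             # its contribution to the total information
        score[w] += info
        score[v] += info
    return score

def extract(sentences, n=2, window=3):
    score = collocation_scores(sentences, window)
    ranked = sorted(sentences, key=lambda s: sum(score[w] for w in s), reverse=True)
    return ranked[:n]

doc = [["data", "mining", "helps", "software", "fault", "diagnosis"],
       ["fault", "diagnosis", "uses", "system", "call", "traces"],
       ["the", "weather", "was", "fine", "today"]]
print(extract(doc))
```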
... System call traces produced in the operational phase are then compared with nominal traces to reveal intrusion conditions. Other solutions, such as [16] [17], are based on data mining approaches, such as document classification. They extract recurrent execution patterns (using system calls or network connections) to model the application under nominal conditions, and to classify run-time behaviors as normal or anomalous. ...
Article
Full-text available
On-line failure detection is an essential means to control and assess the dependability of complex and critical software systems. In such context, effective detection strategies are required, in order to minimize the possibility of catastrophic consequences. This objective is however difficult to achieve in complex systems, especially due to the several sources of non-determinism (e.g., multi-threading and distributed interaction) which may lead to software hangs, i.e., the system is active but no longer capable of delivering its services. The paper proposes a detection approach to uncover application hangs. It exploits multiple indirect data gathered at the operating system level to monitor the system and to trigger alarms if the observed behavior deviates from the expected one. By means of fault injection experiments conducted on a research prototype, it is shown how the combination of several operating system monitors actually leads to an high quality of detection, at an acceptable overhead.
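A minimal sketch of the idea of combining several OS-level monitors into a single hang alarm is given below; the indicator names, nominal ranges and quorum rule are hypothetical stand-ins, not the paper's actual detector.

```python
# Minimal sketch of combining OS-level monitors into a hang alarm (not the paper's
# detector): indicator names, nominal ranges and the quorum rule are hypothetical.
NOMINAL = {                        # expected ranges, assumed learned offline
    "syscall_rate": (50, 5000),    # system calls per second
    "ctx_switches": (10, 2000),    # context switches per second
    "waiting_tasks": (0, 8),       # tasks blocked on I/O or locks
}

def monitor_votes(sample):
    """One vote per indicator that falls outside its nominal range."""
    return sum(not (lo <= sample[name] <= hi) for name, (lo, hi) in NOMINAL.items())

def is_hang(sample, quorum=2):
    """Combine monitors: raise an alarm only if a quorum of them agree."""
    return monitor_votes(sample) >= quorum

print(is_hang({"syscall_rate": 0, "ctx_switches": 3, "waiting_tasks": 12}))  # True
```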
... Failure diagnosis approaches in enterprise systems typically localize anomalous system behavior through statistical analysis of time-series data [6,8,9,11,14] or through control-flow and data-flow analysis [1,3,5,7,12]. However, the failure diagnosis approaches developed for enterprise systems might not be directly applicable to automotive systems because automotive systems have limited processing and storage capacity and might not support the level of instrumentation and processing needed by the enterprise approach. ...
Article
Despite extensive design processes, emergent and anomalous behavior can still appear at runtime in dependable automotive systems. This occurs due to the existence of unexpected interactions and unidentified dependencies between independently-developed components. Therefore, system-level mechanisms must be provided to quickly diagnose such behavior and determine an appropriate corrective action. DIAGNOSTIC FUSION describes a holistic process for synthesizing data across design stages and component boundaries in order to provide an actionable diagnosis.
... An anomaly-based strategy is also in [8], which exploits hardware performance counters and IPC (Inter Process Communication) signals to monitor the system behavior, and to detect possible anomalous conditions. Other solutions, such as [11], are based on statistical learning approaches. They extract recurrent execution patterns (using system calls) to model the application under nominal conditions, and to classify run-time behaviors as normal or anomalous. ...
Conference Paper
Software systems employed in critical scenarios are increasingly large and complex. The usage of many heterogeneous components causes complex interdependencies and introduces sources of non-determinism that often lead to the activation of subtle faults. Such behaviors, due to their complex triggering patterns, typically escape the testing phase. Effective on-line monitoring is the only way to detect them and to promptly react in order to avoid more serious consequences. In this paper, we propose an error detection framework to cope with software failures, which combines multiple sources of data gathered both at application-level and OS-level. The framework is evaluated through a fault injection campaign on a complex system from the Air Traffic Management (ATM) domain. Results show that the combination of several monitors is effective at detecting errors in terms of false alarms, precision and recall.
... In particular, they are often adopted to diagnose failures due to hardware faults, by using statistical analysis and heuristic rules (Iyer et al., 1990; Lin and Siewiorek, 1990). Data mining and language processing techniques have also been adopted to automatically analyse log files (Bose and Srinivasan, 2005). These techniques assume the occurrence of some events in the log file to detect a failure; unfortunately, we cannot rely on the availability of log messages when dealing with hang failures since the system may be unable to execute and thus to produce log messages (e.g., a stuck component cannot return an error code or cannot throw an exception). ...
Article
Full-text available
Many critical services are nowadays provided by large and complex software systems. However, the increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but part of its services is perceived as unresponsive. Online monitoring is the only way to detect and to promptly react to such failures. However, when dealing with off-the-shelf-based systems, online detection can be tricky since instrumentation and log data collection may not be feasible in practice. In this paper, a detection framework to cope with software hangs is proposed. The framework enables the non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. Collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that the combination of several monitors at the OS level is effective to detect hang failures in terms of coverage and false positives and with a negligible impact on performance.
... The machine learning paradigm underlies the location phase in the form of classification. This approach has been used in several works trying to solve different problems, e.g., works focusing on document classification (Manevitz & Yousef, 2002) (Jagadeesh, Bose, & Srinivasan, 2005) or aiming to find latent errors in software programs (Brun & Ernst, 2004). The location classifier has been trained in a supervised way, by means of the pseudo-algorithm in figure 6. ...
Article
Full-text available
This paper proposes an approach to software faults diagnosis in complex fault tolerant systems, encompassing the phases of error detection, fault location, and system recovery. Errors are detected in the first phase, exploiting the operating system support. Faults are identified during the location phase, through a machine learning based approach. Then, the best recovery action is triggered once the fault is located. Feedback actions are also used during the location phase to improve detection quality over time. A real world application from the Air Traffic Control field has been used as case study for evaluating the proposed approach. Experimental results, achieved by means of fault injection, show that the diagnosis engine is able to diagnose faults with high accuracy and at a low overhead.
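The machine-learning-based location phase can be pictured as supervised classification over detector outputs. The sketch below is only illustrative: the feature layout, the fault classes and the choice of a decision tree are assumptions, not the paper's implementation.

```python
# Illustrative sketch of machine-learning-based fault location (not the paper's
# implementation): a classifier is trained on feature vectors built from detector
# outputs and labelled with the injected fault class.
from sklearn.tree import DecisionTreeClassifier

# Each row: [hang_alarms, crash_alarms, log_errors, cpu_anomalies]  (hypothetical)
X_train = [[3, 0, 1, 0],
           [0, 2, 4, 0],
           [1, 0, 0, 3]]
y_train = ["deadlock", "memory_fault", "cpu_hog"]

locator = DecisionTreeClassifier().fit(X_train, y_train)
print(locator.predict([[2, 0, 1, 1]]))   # locate the most likely fault class
```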
... An n-gram is a consecutive sequence of n symbols. We use n-grams in language modeling [Gao and Zhang 2001], pattern recognition [Yannakoudakis et al. 1990], predicting web page accesses [Deshpande and Karypis 2004], information retrieval [Nie et al. 2000], text categorization and author attribution [Caropreso et al. 2001; Joula et al. 2006; Keselj and Cercone 2004], speech recognition [Jelinek 1998], multimedia [Paulus and Klapuri 2003], music retrieval [Doraisamy and Rüger 2003], text mining [Losiewicz et al. 2000], information theory [Shannon 1948], software fault diagnosis [Bose and Srinivasan 2005], data compression [Bacon and Houde 1984], data mining [Su et al. 2000], indexing [Kim et al. 2005], On-line Analytical Processing (OLAP) [Keith et al. 2005a], optical character recognition (OCR) [Droettboom 2003], automated translation [Lin and Hovy 2003], time series segmentation [Cohen et al. 2002], and so on. This paper concerns the use of previously published hash functions for n-grams, together with recent randomized algorithms for estimating the number of distinct items in a stream of data. Together, they permit memory-efficient estimation of the number of distinct n-grams and motivate finer theoretical investigations into efficient n-gram hashing. ...
Article
Full-text available
Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent, whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams, a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can improve the theoretical bounds on estimation accuracy by a factor of 2 by replacing pairwise independent hashing with 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice. Perhaps surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time to obtain exact n-gram counts using suffix arrays and show that, while we used hardly any storage, we were an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.
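A rolling (recursive) hash over n-grams can be sketched as follows. The base/modulus arithmetic shown is a generic polynomial rolling hash chosen for illustration, not the irreducible- or cyclic-polynomial constructions analysed in the paper.

```python
# Sketch of recursive (rolling) hashing over n-grams: the hash of each new n-gram
# is derived from the previous one in O(1). This is a generic polynomial rolling
# hash, not the paper's constructions.
B = 256              # base: one symbol per byte
M = (1 << 61) - 1    # a large prime modulus (arbitrary choice)

def rolling_hashes(symbols, n):
    hashes, h = [], 0
    top = pow(B, n - 1, M)
    for i, s in enumerate(symbols):
        h = (h * B + s) % M                          # shift in the new symbol
        if i >= n - 1:
            hashes.append(h)
            h = (h - symbols[i - n + 1] * top) % M   # drop the oldest symbol
    return hashes

text = b"the quick brown fox jumps over the lazy dog"
print(len(set(rolling_hashes(list(text), 5))), "distinct 5-gram hashes")
```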
... Hence, detectors provide data that have already been 'filtered'. A similar data mining approach is proposed in [22]. Detectors represent the publishers of the publish-subscribe infrastructure. ...
Conference Paper
Full-text available
This work addresses the problem of software fault diagnosis in complex safety critical software systems. The transient manifestations of software faults represent a challenging issue since they hamper a complete knowledge of the system fault model at design/development time. By taking into account existing diagnosis techniques, the paper proposes a novel diagnosis approach, which combines the detection and location processes. More specifically, detection and location modules have been designed to deal with partial knowledge about the system fault model. To this aim, they are tuned during system execution in order to improve diagnosis during system lifetime. A diagnosis engine has been realized to diagnose software faults in a real world middleware platform for safety critical applications. Preliminary experimental campaigns have been conducted to evaluate the proposed approach.
... Given a data source containing N symbols, there are up to N − n distinct n-grams. We use n-grams in language modeling [GZ01], pattern recognition [YTH90], predicting web page accesses [DK04], information retrieval [NGZZ00], text categorization and author attribution [CMS01,JSB06,KC04], speech recognition [Jel98], multimedia [PK03], music retrieval [DR03], text mining [LOK00], information theory [Sha48], software fault diagnosis [BS05], data compression [BH84], data mining [SYLZ00], indexing [KWLL05], On-line Analytical Processing (OLAP) [KKL05a], optimal character recognition (OCR) [Dro03], automated translation [LH03], time series segmentation [CHA02], and so on. This paper concerns the use of previously published hash functions for n-grams, together with recent randomized algorithms for estimating the number of distinct items in a stream of data. ...
Article
Full-text available
In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. To reduce costs further, we investigate recursive random hashing algorithms and show that they are sufficiently independent in practice. We compare our running times with exact counts using suffix arrays and show that, while we use hardly any storage, we are an order of magnitude faster. The approach is further extended to a one-pass/one-hash computation of n-gram entropy and iceberg counts. The experiments use a large collection of English text from the Gutenberg Project as well as synthetic data.
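The one-pass view-size estimation idea can be illustrated with a generic k-minimum-values sketch over hashed n-grams; this is a stand-in estimator chosen for brevity, not the algorithm or the independence analysis of the paper.

```python
# Generic one-pass distinct-count sketch over hashed n-grams (a k-minimum-values
# estimator), shown only to illustrate single-pass view-size estimation.
import hashlib
import heapq

def kmv_estimate(ngrams, k=64):
    """Keep the k smallest normalised hashes; estimate N ≈ (k - 1) / k-th smallest."""
    kept = []          # max-heap (negated values) of the k smallest hashes seen
    members = set()
    for g in ngrams:
        h = int(hashlib.blake2b(g.encode(), digest_size=8).hexdigest(), 16) / 2**64
        if h in members:
            continue
        if len(kept) < k:
            heapq.heappush(kept, -h)
            members.add(h)
        elif h < -kept[0]:
            evicted = -heapq.heappushpop(kept, -h)
            members.discard(evicted)
            members.add(h)
    if len(kept) < k:              # fewer distinct items than k: the count is exact
        return len(kept)
    return int((k - 1) / -kept[0])

text = "to be or not to be that is the question " * 50
print(kmv_estimate(text[i:i + 5] for i in range(len(text) - 4)))
```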
Chapter
In the present time, software plays a vital role in business, governance, and society in general, so a continuous improvement of software productivity and quality such as reliability, robustness, etc. is an important goal of software engineering. During software development, a large amount of data is produced, such as software attribute repositories and program execution trace, which may help in future development and project management activities. Effective software development needs quantification, measurement, and modelling of previous software artefacts. The development of large and complex software systems is a formidable challenge which requires some additional activities to support software development and project management processes. In this scenario, data mining can provide a helpful hand in the software development process. This chapter discusses the application of data mining in software engineering and includes static and dynamic defect detection, clone detection, maintenance, etc. It provides a way to understand the software artifacts and processes to assist in software engineering tasks.
Chapter
This paper proposes an approach to software faults diagnosis in complex fault tolerant systems, encompassing the phases of error detection, fault location, and system recovery. Errors are detected in the first phase, exploiting the operating system support. Faults are identified during the location phase, through a machine learning based approach. Then, the best recovery action is triggered once the fault is located. Feedback actions are also used during the location phase to improve detection quality over time. A real world application from the Air Traffic Control field has been used as case study for evaluating the proposed approach. Experimental results, achieved by means of fault injection, show that the diagnosis engine is able to diagnose faults with high accuracy and at a low overhead.
Chapter
This work presents an overview of monitoring approaches to support the diagnosis of software faults and proposes a framework to reveal and diagnose the activation of faults in complex and Off-The-Shelf (OTS) based software systems. The activation of a fault is detected by means of anomaly detection on data collected by OS-level monitors, while the fault diagnosis is accomplished by means of a machine learning approach. The evaluation of the proposed framework is carried out using an industrial prototype from the Air Traffic Control domain by means of software fault injection. Results show that the monitoring and diagnosis framework is able to reveal and diagnose faults with high recall and precision, with low latency and low overhead.
Thesis
The consumer electronics market is dominated by embedded systems, owing to their ever-increasing computing power and the many features they offer. To provide such capabilities, embedded system architectures have become increasingly complex (multiple and heterogeneous processing units, concurrent task execution, ...). This complexity has strongly affected their programmability, to the point of making it difficult to understand how an application executes on these architectures. The most widely used approach today for analysing application execution on embedded systems is the capture of execution traces (sequences of events, such as system calls or context switches, generated during application execution). This approach is used in testing, debugging and profiling activities. However, depending on the use case, the generated execution traces can become very large, on the order of several hundred gigabytes. This is the case for endurance tests and validation tests, which consist of tracing the execution of an application on an embedded system over long periods, ranging from several hours to several days. Current trace analysis tools and methods are not designed to handle such quantities of data. We propose an approach for reducing the volume of recorded traces through on-the-fly analysis of the trace during its capture. Our approach relies on the specific characteristics of multimedia applications, which are among the most important for the success of popular devices such as set-top boxes and smartphones. It aims to automatically detect suspicious fragments (periods) of an application's execution in order to record only the parts of the trace corresponding to these activity periods. The proposed approach comprises two steps: a learning step, which consists of discovering the regular behaviours of the application from the execution trace, and an anomaly detection step, which consists of identifying behaviours that deviate from the regular ones. Extensive experiments, carried out on synthetic and real data, show that our approach reduces the volume of recorded traces by an order of magnitude, with excellent performance in detecting suspicious behaviours.
Article
In the present time, software plays a vital role in business, governance, and society in general, so a continuous improvement of software productivity and quality such as reliability, robustness, etc. is an important goal of software engineering. During software development, a large amount of data is produced, such as software attribute repositories and program execution trace, which may help in future development and project management activities. Effective software development needs quantification, measurement, and modelling of previous software artefacts. The development of large and complex software systems is a formidable challenge which requires some additional activities to support software development and project management processes. In this scenario, data mining can provide a helpful hand in the software development process. This chapter discusses the application of data mining in software engineering and includes static and dynamic defect detection, clone detection, maintenance, etc. It provides a way to understand the software artifacts and processes to assist in software engineering tasks.
Article
Dependable complex systems often operate under variable and non-stationary conditions, which requires efficient and extensive monitoring and error detection solutions. Among the many available techniques, the paper focuses on anomaly detection, which monitors the evolution of specific indicators through time to identify anomalies, i.e. deviations from the expected operational behavior. The timely identification of anomalies in dependable, fault-tolerant systems allows errors in the services to be detected promptly and reacted to appropriately. In this paper, we investigate the possibility of monitoring the evolution of indicators through time using the random walk model, applied to indicators belonging to the Operating System, specifically Linux Red Hat EL5 in our study. The approach is based on the experimental evaluation of a large set of heterogeneous indicators, which are acquired under different operating conditions, both in terms of workload and faultload, on an air traffic management target system. The statistical analysis is based on a best-fitting approach aiming to minimize the integral distance between the empirical data distribution and some reference distributions. The outcomes of the analysis show that the idea of adopting a random walk model for the development of an anomaly detection monitor that operates at the Operating System level in critical systems is promising. Moreover, standard distributions such as Laplace and Cauchy, rather than the Normal, should be used for setting up the thresholds of the monitor. Further studies involving a new application, a different Operating System and a new layer (an Application Server) will allow the generalization of the approach to other fault-tolerant systems, monitored layers and sets of indicators to be verified.
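The threshold-setting step can be sketched by fitting a Laplace distribution to indicator increments and flagging steps that fall in the distribution's tails. The synthetic data, the chosen quantiles and the use of scipy below are illustrative assumptions, not the paper's tooling.

```python
# Sketch of the threshold-setting idea: increments of a monitored indicator observed
# under nominal conditions are fitted with a Laplace distribution, and steps in its
# extreme tails are flagged as anomalous.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nominal_series = np.cumsum(rng.laplace(0.0, 5.0, size=5000))   # synthetic random walk
increments = np.diff(nominal_series)

loc, scale = stats.laplace.fit(increments)                     # best-fitting parameters
low, high = stats.laplace.ppf([0.001, 0.999], loc=loc, scale=scale)

def anomalous_step(prev, curr):
    """Flag an indicator step that falls outside the fitted tails."""
    return not (low <= curr - prev <= high)

print(anomalous_step(100.0, 180.0))
```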
Conference Paper
For deployed systems, software fault detection can be challenging. Generally, faulty behaviors are detected based on execution logs, which may contain a large volume of execution traces, making analysis extremely difficult. This paper investigates and compares the effectiveness and efficiency of various data mining techniques for software fault detection based on execution logs, including clustering-based, density-based, and probabilistic automata based methods. However, some existing algorithms suffer from high complexity and do not scale well to large datasets. To address this problem, we present a suite of prefix tree based anomaly detection techniques. The prefix tree model serves as a compact, lossless data representation of execution traces. Also, the prefix tree distance metric provides an effective heuristic to guide the search for execution traces having close proximity to each other. In the density-based algorithm, the prefix tree distance is used to confine the K-nearest-neighbor search to a small subset of the nodes, which greatly reduces the computing time without sacrificing accuracy. Experimental studies show a significant speedup in our prefix tree based and prefix tree distance guided approaches, from days to minutes in the best cases, in automated identification of software failures.
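A minimal prefix-tree sketch of the idea is shown below: normal traces populate a trie and new traces are scored by how much of them the trie explains. It is deliberately far simpler than the clustering, density-based and automata-based methods compared in the paper, and the traces are hypothetical.

```python
# Minimal prefix-tree sketch: nominal execution traces populate a trie, and a new
# trace is scored by the fraction of it the trie explains.
def build_trie(traces):
    root = {}
    for trace in traces:
        node = root
        for event in trace:
            node = node.setdefault(event, {})
    return root

def matched_fraction(trie, trace):
    node, depth = trie, 0
    for event in trace:
        if event not in node:
            break
        node, depth = node[event], depth + 1
    return depth / max(len(trace), 1)   # fraction of the trace explained by the trie

nominal = [["start", "load", "compute", "store", "stop"],
           ["start", "load", "store", "stop"]]
trie = build_trie(nominal)
print(matched_fraction(trie, ["start", "load", "compute", "crash"]))  # 0.75: suspicious tail
```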
Conference Paper
Virtualization has been widely adopted by many companies. It is now becoming an effective way to manage massive hardware resources at flexible scales. While power and hardware costs can be significantly reduced through virtualization, fault detection becomes more difficult due to the increasing scalability and complexity of a virtualized environment. In this paper, a fault detection framework for virtualized environments is proposed, in which multiple levels, including the application server (AS), the operating system (OS) and the virtual machine monitor (VMM), are considered in a holistic manner. The framework has been evaluated through a benchmark, Bench4Q. Results show the effectiveness of our framework, and its overhead is less than 10%.
Article
Information is an important element in all decision making processes. Also, the quality of information has to be assured in order to support proper decision making in critical systems. In this paper, we present a comprehensive solution to information assurance. First we develop an ontology of information quality metrics, which can be used in assessment as well as in guiding the information processing procedures. To achieve proper information quality assessment, it is necessary to track the information flow as well as detect anomalous information processing flaws. We leverage existing data provenance and anomaly detection technologies and improve them to achieve real-time information quality assessment and problem detection.
Conference Paper
The Trust4All project aims to define an open, component-based framework for the middleware layer in high-volume embedded appliances that enables robust and reliable operation, upgrading and extension. To improve the availability of each individual application in a Trust4All system, we propose a runtime configurable fault management mechanism (FMM) which detects deviations from given service specifications by intercepting interface calls. There are two novel contributions associated with FMM. First, when repair is necessary, FMM picks a repair action that offers the best tradeoff between the success rate and the cost of repair. Second, considering that it is rather difficult to obtain sufficient information about third-party components during the early stage of their usage, FMM is designed to accumulate appropriate knowledge, e.g. the success rate of a specific repair action in the past and rules that can avoid a specific failure, and to self-adjust its capability accordingly.
Conference Paper
We present an approach to the inference of automata models of Web-based business applications using only execution traces recording the externally observable behavior of such applications. The proposed approach yields behavioral models representing both the control flow of an application and the data variations corresponding to different types of users. We also describe how the obtained models allow the use of verification techniques like model checking in the validation phase using a case study featuring a travel reservation agency.
Conference Paper
In certain circumstances mobile robots are unreachable by human beings, for example Mars exploration rovers. In such cases robots should detect and handle faults of their control software themselves. This paper addresses the detection of control software faults by computers. Support vector machine (SVM) based classification is applied to fault diagnostics of control software for a mobile robot. Both training and testing data are sampled by simulating several faulty software strategies and recording the operation parameters of the robot. The correct classification percentages for different situations are discussed.
Conference Paper
We present an automated framework for the inference of behavioral models from the execution traces of a Web-based business application (WBA). The model inference framework consists of a formal approach to infer automata models from traces of WBA's and an advanced prototype tool set implemented around the data mining engine Weka, the model checker SPIN, the formal language manipulation framework ANTLR and the graph visualization software GraphViz. The traces of a WBA are collected by monitoring the communications in client-server architectures, where a client can be an Internet browser or a service accessing the server side of the application. The inferred models depict both the control and data flow (showing data variations) of the WBA and can be used for its visualization and verification. Finally, we discuss Web-FIM an online deployment of the model inference framework and illustrate the use of the tools with an example.
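Model inference from traces can be illustrated with a prefix-tree acceptor; the sketch below is a toy stand-in for the Weka/SPIN-based tool chain described in the abstract, with hypothetical request traces.

```python
# Toy stand-in for trace-based model inference: a prefix-tree acceptor is built from
# observed request sequences and used to check whether a new session matches
# observed behaviour.
class PrefixTreeAcceptor:
    def __init__(self):
        self.next_state = 1
        self.delta = {}          # (state, symbol) -> state
        self.accepting = set()

    def add_trace(self, trace):
        state = 0
        for symbol in trace:
            key = (state, symbol)
            if key not in self.delta:
                self.delta[key] = self.next_state
                self.next_state += 1
            state = self.delta[key]
        self.accepting.add(state)

    def accepts(self, trace):
        state = 0
        for symbol in trace:
            state = self.delta.get((state, symbol))
            if state is None:
                return False
        return state in self.accepting

pta = PrefixTreeAcceptor()
pta.add_trace(["login", "search", "book", "logout"])
pta.add_trace(["login", "search", "logout"])
print(pta.accepts(["login", "book", "logout"]))   # False: control flow never observed
```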
Conference Paper
Full-text available
We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
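To make the kernel's definition concrete, the brute-force sketch below enumerates the m-mismatch neighbourhood of each k-mer and takes a dot product of the resulting feature maps; it ignores the mismatch-tree data structure that makes the published method efficient, and the alphabet and sequences are toy choices.

```python
# Brute-force sketch of the (k, m)-mismatch feature map: every observed k-mer
# contributes to all k-mers within m substitutions, and the kernel is the dot
# product of the resulting maps.
from collections import Counter
from itertools import combinations, product

def neighbourhood(kmer, m, alphabet):
    """All strings of the same length within Hamming distance m of kmer."""
    out = {kmer}
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            candidate = list(kmer)
            for pos, ch in zip(positions, letters):
                candidate[pos] = ch
            out.add("".join(candidate))
    return out

def mismatch_features(seq, k, m, alphabet):
    phi = Counter()
    for i in range(len(seq) - k + 1):
        for neighbour in neighbourhood(seq[i:i + k], m, alphabet):
            phi[neighbour] += 1
    return phi

def mismatch_kernel(s, t, k=3, m=1, alphabet="ACGT"):
    ps = mismatch_features(s, k, m, alphabet)
    pt = mismatch_features(t, k, m, alphabet)
    return sum(ps[w] * pt[w] for w in ps)

print(mismatch_kernel("ACGTAC", "ACGGAC"))
```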
Conference Paper
Full-text available
In a new method for automatic indexing and retrieval, implicit higher-order structure in the association of terms with documents is modeled to improve estimates of term-document association, and therefore the detection of relevant documents on the basis of terms found in queries. Singular-value decomposition is used to decompose a large term-by-document matrix into 50 to 150 orthogonal factors from which the original matrix can be approximated by linear combination; both documents and terms are represented as vectors in a 50- to 150-dimensional space. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents are ordered by their similarity to the query. Initial tests find this automatic method very promising.
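The SVD machinery behind this kind of latent semantic indexing can be sketched in a few lines; the term-document matrix, the query and the choice of two factors below are toy values, not the experiments reported in the abstract.

```python
# Few-line sketch of latent semantic indexing via truncated SVD.
import numpy as np

# Rows: terms ["fault", "system", "call", "kernel", "summary"]; columns: documents.
A = np.array([[2., 1., 0.],
              [1., 2., 0.],
              [1., 1., 0.],
              [0., 1., 0.],
              [0., 0., 3.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # latent factors to keep
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :]

query = np.array([1., 0., 1., 0., 0.])         # terms "fault" and "call"
q_hat = query @ Uk / sk                        # fold the query into the latent space
docs = Vk.T                                    # documents in the latent space

cosine = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-cosine))                     # documents ranked by similarity to the query
```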
Article
Full-text available
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.
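In practice the algorithm described here is available in standard libraries; the sketch below uses scikit-learn's OneClassSVM on synthetic data purely to illustrate the inlier/outlier decision, not to reproduce the paper's analysis.

```python
# Sketch of the one-class SVM decision on synthetic data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=(200, 2))        # unlabeled "normal" points

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)

test = np.array([[0.1, -0.2],                      # close to the training mass
                 [6.0, 6.0]])                      # far outside the estimated support
print(ocsvm.predict(test))                         # +1 for inliers, -1 for outliers
```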
Conference Paper
Full-text available
Software configuration problems are a major source of failures in computer systems. In this paper, we present a new framework for categorizing configuration problems. We apply this categorization to Windows registry-related problems obtained from various internal as well as external sources. Although infrequent, registry-related problems are difficult to diagnose and repair. Consequently, they frustrate users. We classify problems based on their manifestation and the scope of impact to gain useful insights into how problems affect users and why PCs are fragile. We then describe techniques to identify and eliminate such registry failures. We propose health predicate monitoring for detecting known problems, fault injection for improving application robustness, and access protection mechanisms for preventing fragility problems.
Conference Paper
Full-text available
Intrusion detection systems rely on a wide variety of observable data to distinguish between legitimate and illegitimate activities. We study one such observable-sequences of system calls into the kernel of an operating system. Using system-call data sets generated by several different programs, we compare the ability of different data modeling methods to represent normal behavior accurately and to recognize intrusions. We compare the following methods: simple enumeration of observed sequences; comparison of relative frequencies of different sequences; a rule induction technique; and hidden Markov models (HMMs). We discuss the factors affecting the performance of each method and conclude that for this particular problem, weaker methods than HMMs are likely sufficient
Conference Paper
Full-text available
A method for anomaly detection is introduced in which "normal" is defined by short-range correlations in a process' system calls. Initial experiments suggest that the definition is stable during normal behaviour for standard UNIX programs. Further, it is able to detect several common intrusions involving sendmail and lpr. This work is part of a research program aimed at building computer security systems that incorporate the mechanisms and algorithms used by natural immune systems.
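The short-range-correlation idea can be sketched directly as a "normal database" of sliding windows over system-call traces; the window length, the traces and the scoring rule below are illustrative, not the original experiments.

```python
# Sketch of the short-sequence ("normal database") idea: sliding windows of length n
# over nominal traces form the model, and a new trace is scored by the fraction of
# its windows never seen in training.
def windows(trace, n=3):
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def build_normal_db(traces, n=3):
    db = set()
    for trace in traces:
        db |= windows(trace, n)
    return db

def anomaly_score(trace, db, n=3):
    w = windows(trace, n)
    return len(w - db) / max(len(w), 1)

nominal_traces = [["open", "read", "mmap", "read", "close"],
                  ["open", "read", "write", "close"]]
db = build_normal_db(nominal_traces)
print(anomaly_score(["open", "read", "mmap", "exec", "exec"], db))  # high score: anomalous
```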
Article
Full-text available
We present a new Host-based Intrusion Detection System (IDS) that monitors accesses to the Microsoft Windows Registry using Registry Anomaly Detection (RAD). Our system uses a one-class Support Vector Machine (OCSVM) to detect anomalous registry behavior by training on a dataset of normal registry accesses. It then uses this model to detect outliers in new (unclassified) data generated from the same system. Given the success of OCSVMs in other applications, we apply them to the Windows Registry anomaly detection problem. We compare our system to the RAD system using the Probabilistic Anomaly Detection (PAD) algorithm on the same dataset. Surprisingly, we find that PAD outperforms our OCSVM system due to properties of the hierarchical prior incorporated in the PAD algorithm. In the future, these properties may be used to develop an improved kernel and increase the performance of the OCSVM system.
Article
Full-text available
This paper describes a new approach for dealing with the vocabulary problem in human-computer interaction. Most approaches to retrieving textual materials depend on a lexical match between words in users' requests and those in or assigned to database objects. Because of the tremendous diversity in the words people use to describe the same object, lexical matching methods are necessarily incomplete and imprecise [5]. The latent semantic indexing approach tries to overcome these problems by automatically organizing text objects into a semantic structure more appropriate for matching user requests. This is done by taking advantage of implicit higher-order structure in the association of terms with text objects. The particular technique used is singular-value decomposition, in which a large term by text-object matrix is decomposed into a set of about 50 to 150 orthogonal factors from which the original matrix can be approximated by linear combination. Terms and objects are represented by 5...
Article
Full-text available
Intrusion detection systems rely on a wide variety of observable data to distinguish between legitimate and illegitimate activities. In this paper we study one such observable--- sequences of system calls into the kernel of an operating system. Using system-call data sets generated by several different programs, we compare the ability of different data modeling methods to represent normal behavior accurately and to recognize intrusions. We compare the following methods: Simple enumeration of observed sequences, comparison of relative frequencies of different sequences, a rule induction technique, and Hidden Markov Models (HMMs). We discuss the factors affecting the performance of each method, and conclude that for this particular problem, weaker methods than HMMs are likely sufficient. 1. Introduction In 1996, Forrest and others introduced a simple intrusion detection method based on monitoring the system calls used by active, privileged processes [4]. Each process is represented by ...
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Conference Paper
Although computers continue to improve in speed and functionality, they remain difficult to use. Problems frequently occur, and it is hard to find fixes or workarounds. This paper argues for the importance and feasibility of building a global-scale automated problem diagnosis system that captures the natural, although labor intensive, workflow of system diagnosis and repair. The system collects problem symptoms from users' desktops, rather than requiring users to describe their problems to primitive search engines, automatically searches global databases of problem symptoms and fixes, and also allows ordinary users to contribute accurate problem reports in a structured manner.
Conference Paper
We describe a new approach, called Strider, to Change and Configuration Management and Support (CCMS). Strider is a black-box approach: without relying on specifications, it uses state differencing to identify potential causes of differing program behaviors, uses state tracing to identify actual, run-time state dependencies, and uses statistical behavior modeling for noise filtering. Strider is a state-based approach: instead of linking vague, high level descriptions and symptoms to relevant actions, it models management and support problems in terms of individual, named pieces of low level configuration state and provides precise mappings to user-friendly information through a computer genomics database. We use troubleshooting of configuration failures to demonstrate that the Strider approach reduces problem complexity by several orders of magnitude, making root-cause analysis possible.
Article
We introduce a new sequence-similarity kernel, the spectrum kernel, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. Our kernel is conceptually simple and efficient to compute and, in experiments on the SCOP database, performs well in comparison with state-of-the-art methods for homology detection. Moreover, our method produces an SVM classifier that allows linear time classification of test sequences. Our experiments provide evidence that string-based kernels, in conjunction with SVMs, could offer a viable and computationally efficient alternative to other methods of protein classification and homology detection.
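Combining the spectrum kernel with an SVM can be sketched through a precomputed Gram matrix; scikit-learn is used here only for convenience, and the sequences, labels and k are toy choices rather than the paper's protein data.

```python
# Sketch of a spectrum kernel used with an SVM via a precomputed Gram matrix.
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def spectrum(seq, k=3):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=3):
    pa, pb = spectrum(a, k), spectrum(b, k)
    return sum(pa[w] * pb[w] for w in pa)

seqs = ["ABCABCABD", "ABCABDABD", "XYZXYZXYA", "XYZXYAXYA"]
labels = [0, 0, 1, 1]

gram = np.array([[spectrum_kernel(a, b) for b in seqs] for a in seqs], dtype=float)
clf = SVC(kernel="precomputed").fit(gram, labels)

test = "ABCABD"
k_test = np.array([[spectrum_kernel(test, b) for b in seqs]], dtype=float)
print(clf.predict(k_test))    # expected: class 0 (shares k-mers with the first group)
```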
Computer Genomics: Towards Self-Change and Configuration Management
• Y.-M. Wang
Y.-M. Wang. Computer Genomics: Towards Self-Change and Configuration Management. Microsoft Research.