Article

Tracking recurring contexts using ensemble classifiers: An application to email filtering

Abstract

Concept drift constitutes a challenging problem for the machine learning and data mining community that frequently appears in real-world stream classification problems. It is usually defined as the unforeseeable change of the target concept in a prediction task. In this paper, we focus on the problem of recurring contexts, a special sub-type of concept drift that has not yet received proper attention from the research community. In the case of recurring contexts, concepts may re-appear in the future, and thus older classification models might be beneficial for future classifications. We propose a general framework for classifying data streams that exploits stream clustering in order to dynamically build and update an ensemble of incremental classifiers. To achieve this, a transformation function that maps batches of examples into a new conceptual representation model is proposed. A clustering algorithm is then applied in order to group batches of examples into concepts and identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. An experimental study is performed using (a) two new real-world concept drifting datasets from the email domain, (b) an instantiation of the proposed framework, and (c) five methods for dealing with drifting concepts. Results indicate the effectiveness of the proposed representation and the suitability of the concept-specific classifiers for problems with recurring contexts.
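The pipeline described in the abstract (batches of examples → conceptual vectors → clustering → one incremental classifier per concept) can be sketched in a few lines. The sketch below is illustrative only: the transformation function, the fixed distance threshold, and the majority-class "classifier" are simplified stand-ins for the paper's actual components, and all names are hypothetical.

```python
import math
from collections import defaultdict

def concept_vector(batch):
    """Map a batch of (features, label) pairs to a conceptual vector:
    here, each feature's empirical probability of co-occurring with the
    positive class (a simplified stand-in for the paper's transformation)."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [positive count, total count]
    for features, label in batch:
        for f in features:
            counts[f][1] += 1
            if label == 1:
                counts[f][0] += 1
    return {f: pos / tot for f, (pos, tot) in counts.items()}

def distance(v1, v2):
    """Euclidean distance over the union of the two sparse vectors' keys."""
    keys = set(v1) | set(v2)
    return math.sqrt(sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys))

class ConceptEnsemble:
    """One incremental (here: majority-class) classifier per discovered concept."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.centroids = []    # conceptual vector of each concept
        self.classifiers = []  # per-concept label counts [negatives, positives]
        self.active = None

    def update(self, batch):
        v = concept_vector(batch)
        # Assign the batch to the nearest known concept, or create a new one.
        best, best_d = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = distance(v, c)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > self.threshold:
            self.centroids.append(v)
            self.classifiers.append([0, 0])
            best = len(self.centroids) - 1
        for _, label in batch:
            self.classifiers[best][label] += 1
        self.active = best

    def predict(self):
        """Predict with the classifier of the currently active concept."""
        neg, pos = self.classifiers[self.active]
        return 1 if pos >= neg else 0
```

A batch whose conceptual vector lies far from all stored centroids opens a new concept, so a re-appearing concept is routed back to the classifier that was trained on it earlier.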
... We study the on-the-fly classification of evolving text streams. For example, detecting spam among incoming emails, where the definition of spam evolves over time, illustrates the challenges of our problem setting [30]. Due to the volume and velocity of incoming data, it is impractical to train the model with multiple passes over the data, as is the practice in offline settings. ...
... For other learning paradigms on text streams, Katakis et al. [30] introduce Conceptual Clustering and Prediction (CCP), which uses clustering to identify concepts and trains cluster-specific classifiers in an ensemble. Kumar et al. [39] propose OSDM, an online model for short text stream clustering that embeds word-occurrence semantic information in the clustering task. ...
... Then, the output layers' weighted average learning rate and loss values are used to optimize the weights of the hidden layers through backpropagation (lines 26–33). The weighted average learning rate of the output layers is calculated as: ...
Article
We study on-the-fly classification of evolving text streams in which the relation between the input data and target labels changes over time, i.e. “concept drift”. These variations decrease the model’s performance, as predictions become less accurate over time, and they necessitate a more adaptable system. While most studies focus on concept drift detection and handling with ensemble approaches, the application of neural models in this area is relatively less studied. We introduce Adaptive Neural Ensemble Network (AdaNEN), a novel ensemble-based neural approach capable of handling concept drift in data streams. With our novel architecture, we address some of the problems neural models face when exploited in online adaptive learning environments. Most current studies address concept drift detection and handling in numerical streams, and evolving text stream classification remains relatively unexplored. We hypothesize that the lack of public, large-scale experimental data could be one reason. To this end, we propose a method, based on an existing approach, for generating evolving text streams by introducing various types of concept drift to real-world text datasets. We provide an extensive evaluation of our proposed approach using 12 state-of-the-art baselines and 13 datasets. We first evaluate the concept drift handling capability of AdaNEN and the baseline models on evolving numerical streams; this aims to demonstrate the concept drift handling capabilities of our method on a general spectrum and motivate its use on evolving text streams. The models are then evaluated on evolving text stream classification. Our experimental results show that AdaNEN consistently outperforms the existing approaches in terms of predictive performance with conservative efficiency.
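As a rough illustration of performance-weighted ensembling of this general kind (not AdaNEN's actual architecture, which combines neural output layers and updates them via backpropagation), a minimal sketch might look like the following; all names are hypothetical.

```python
class WeightedVoteEnsemble:
    """Toy dynamically weighted ensemble: each member's vote is weighted by
    its exponentially decayed accuracy on recent examples, so members that
    adapt poorly after a drift lose influence."""
    def __init__(self, members, decay=0.9):
        self.members = members          # callables: x -> predicted label
        self.decay = decay              # how quickly old accuracy is forgotten
        self.weights = [1.0] * len(members)

    def predict(self, x):
        votes = {}
        for m, w in zip(self.members, self.weights):
            y = m(x)
            votes[y] = votes.get(y, 0.0) + w
        return max(votes, key=votes.get)

    def update(self, x, y_true):
        # After the true label arrives, shift weight toward accurate members.
        for i, m in enumerate(self.members):
            correct = 1.0 if m(x) == y_true else 0.0
            self.weights[i] = self.decay * self.weights[i] + (1 - self.decay) * correct
```

After a drift, a member that keeps predicting the old concept sees its weight decay geometrically, so the ensemble's prediction tracks the currently accurate members.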
... The spam [25] dataset, which is mainly used for progressive drift detection, has 9324 samples, 499 feature dimensions, and two classes. ...
Article
Full-text available
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time: the statistical properties of a target domain change in an arbitrary way. These changes might be caused by changes in hidden variables that cannot be measured directly. With the onset of the big data era, domains such as social networks, meteorology, and finance are generating copious amounts of streaming data. Embedded within these data, concept drift can affect the attributes of streaming data in various ways, leading to a decline in the accuracy and performance of models, so there is a pressing need for models that re-adapt to the changes in streaming data. Traditional concept drift detection algorithms struggle to effectively capture and utilize the key feature points of concept drift within complex time series, and therefore fail to maintain model accuracy and efficiency. In light of these challenges, this study introduces a novel concept drift detection method that incorporates a temporal attention mechanism within a prototypical network. By integrating a temporal attention mechanism during feature extraction, our approach enhances the capability to process complex time series data, preserves temporal locality, strengthens the learning of key features, and reduces the amount of labeled data required. This method significantly improves detection accuracy and efficiency on small-sample streaming data by better capturing the local features of the data. Experiments conducted across multiple datasets demonstrate that this method exhibits comprehensively leading performance in terms of accuracy and F1-score, with excellent recall and precision, validating its effectiveness in enhancing concept drift detection in streaming data.
... Conventional ensemble classifiers, such as Random Forests [37,38], Gradient-Boosted Trees [39], and Ensemble SVMs [40,41], consist of multiple weak classifiers, which increases the diversity of the classifiers and improves classification performance [42,43]; our primary objective, however, is to enhance the robustness of classifiers. Therefore, the first motivation of our approach is to build the ensemble from strong classifiers. ...
Article
Full-text available
Learning-based classifiers are found to be vulnerable to attacks by adversarial samples. Some works suggest that ensemble classifiers tend to be more robust than single classifiers against evasion attacks. However, recent studies have shown that this is not necessarily the case under the more realistic setting of black-box attacks. In this paper, we propose a novel ensemble approach to improve the robustness of classifiers against evasion attacks, using diversified feature selection and a stochastic aggregation strategy. Our proposed scheme has three stages. First, an adversarial feature selection algorithm is used to repeatedly select a feature that trades off classification accuracy against robustness, and add it to the feature vector bank. Second, each feature vector in the bank is used to train a base classifier, which is added to the base classifier bank. Finally, m classifiers from the classifier bank are randomly selected for decision making. In this way, each classifier in the base classifier bank performs well in terms of both classification accuracy and robustness, and it also becomes difficult to accurately estimate the gradients of the ensemble. Thus, the robustness of classifiers can be improved without reducing classification accuracy. Experiments performed using both linear and kernel SVMs on genuine datasets for spam filtering, malware detection, and handwritten digit recognition demonstrate that our proposed approach significantly improves the classifiers’ robustness against evasion attacks.
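The final stage, stochastic aggregation, is easy to sketch: at prediction time, m classifiers are drawn at random from the bank and their votes are combined. The snippet below is an illustrative simplification (binary labels, plain majority vote); the bank construction via adversarial feature selection is not reproduced, and the names are hypothetical.

```python
import random

def stochastic_ensemble_predict(bank, x, m, rng=None):
    """Randomly pick m classifiers from the bank and majority-vote their
    binary predictions. Because the subset changes per query, an attacker
    cannot reliably estimate the gradients of the aggregated decision."""
    rng = rng or random.Random()
    chosen = rng.sample(bank, m)            # random subset of the classifier bank
    votes = sum(clf(x) for clf in chosen)   # labels assumed to be in {0, 1}
    return 1 if votes * 2 >= len(chosen) else 0
```

For example, with a bank of six classifiers of which five predict 1, any random subset of three contains at most one dissenter, so the majority vote is stable while the attacked surface still varies per query.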
... 6. https://archive.ics.uci.edu/ml/datasets.php 7. https://moa.cms.waikato.ac.nz/ Simultaneously, we conduct experiments on various real-world datasets, consisting of seven UCI datasets 6 (i.e., Vowel, Image Segmentation, Satimage, Mushroom, Pendigits, Letter Recognition, and Shuttle), five real-world streaming datasets (namely Usenet-2 [56], Spam [57], Weather [49], Covertype 6, and Poker-Hand 6), three image datasets (that is, ORL 8, MNIST 9, and Fashion-MNIST 10), and two traffic data streams. Brief information on these datasets is summarized in Table 1. ...
Preprint
Full-text available
People can often acquire knowledge dynamically and rapidly from different types of data, yet existing incremental learning algorithms are still computationally time-consuming, and most stream learning methods are designed mainly for streaming data while ignoring other types of data. Hence, this paper proposes a novel dynamic concept learning (CL) algorithm that imitates human cognitive learning processes from the perspective of brain logical cognition, named the stream concept-cognitive computing system (streamC3S). streamC3S mainly consists of three aspects: the concept space, the CL process, and the model update process. Moreover, considering that concept drift frequently occurs in streaming data over time, an extended version of streamC3S (namely, streamC3S_E) is also proposed in this work. Specifically, we first present the related theories for streamC3S and streamC3S_E on the basis of the concept space. Then an overall framework and its corresponding algorithm are presented. Finally, experimental results on various types of datasets, including standard machine learning datasets, streaming datasets, image datasets, and two traffic data streams, validate the effectiveness of our streamC3S and streamC3S_E compared to state-of-the-art incremental learning and stream learning algorithms.
... Weather [96] contains daily weather measurement data for a certain area from 2006 to 2016, including temperature, humidity, wind direction, wind speed, visibility, atmospheric pressure, etc., for predicting rainfall. Spam [97] is mainly used to identify spam. CoverType [94] is derived from the forest cover of a certain area in the U.S. Forest Service system. ...
Article
Full-text available
With the advent of the fourth industrial revolution, data-driven approaches have become an integral part of decision making, and deep learning, one of the revolution's core technologies, has become vital to it. However, in the era of epidemics and big data, the volume of data has increased dramatically while its sources have become progressively more complex, making data distributions highly susceptible to change. These situations can easily lead to concept drift, which directly affects the effectiveness of prediction models. How to cope with such complex situations and make timely and accurate decisions from multiple perspectives is a challenging research issue. To address this challenge, we summarize concept drift adaptation methods under the deep learning framework, which helps decision makers make better decisions and analyze the causes of concept drift. First, we provide an overall introduction to concept drift, including the definition, causes, and types of concept drift and the process of adaptation methods under the deep learning framework. Second, we summarize concept drift adaptation methods in terms of discriminative learning, generative learning, hybrid learning, and others; for each aspect, we elaborate on the update modes, detection modes, and the drift types the methods adapt to. In addition, we briefly describe the characteristics and application fields of deep learning algorithms using concept drift adaptation methods. Finally, we summarize common datasets and evaluation metrics and present future directions.
Article
Incremental data drifting is a common problem when employing a machine-learning model in industrial applications. The underlying data distribution evolves gradually, e.g., users change their buying preferences on an E-commerce website over time. The problem needs to be addressed to obtain high performance. Currently, studies of incremental data drifting suffer from several issues. For one thing, there is a lack of clearly defined incremental drift datasets for examination. Existing efforts use either collected real datasets or synthetic datasets, which show two obvious limitations: one is that exactly when, and with which type of drift, the distribution changes is unknown; the other is that a simple synthesized dataset cannot reflect the complex representations we would normally face in the real world. For another, there is no well-defined protocol to evaluate a learner’s knowledge transfer capability on an incremental drift dataset. To provide a holistic discussion of these issues, we create approaches to generate datasets with specific drift types and define a novel protocol for evaluation. Besides, we investigate recent advances in the transfer learning field, including Domain Adaptation and Lifelong Learning, and examine how they perform in the presence of incremental data drifting. The results unfold the relationships among drift types, knowledge preservation, and learning approaches.
Article
Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams when the joint distribution of the data X and label y, P(X, y), changes over time, possibly degrading model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data, possibly using ensembles of previous models, and tend to discard the drifted historical data. However, we contend that explicitly utilizing the drifted data leads to much better model accuracy, and we propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy. To address the potential downside in efficiency, Quilt extends existing data subset selection techniques, which can reduce the training data without compromising model accuracy. These techniques cannot be used as is, because they assume only virtual drifts, where the posterior probabilities P(y|X) are assumed not to change. In contrast, a key challenge in our setup is to also discard undesirable data segments with concept drifts. Quilt thus discards drifted data segments and selects data segment subsets holistically for accurate and efficient model training. Both operations use gradient-based scores, which have little computation overhead. In our experiments, we show that Quilt outperforms state-of-the-art drift adaptation and data selection baselines on synthetic and real datasets.
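The idea of scoring historical segments by gradient alignment can be illustrated with a deliberately tiny stand-in: a 1-D linear model where a segment is kept only if its average gradient points the same way as the gradient on the current data. This is a hedged sketch of the general technique, not Quilt's actual scores or model.

```python
def avg_gradient(segment, w):
    """Average gradient of squared error for a 1-D linear model y ≈ w * x."""
    g = 0.0
    for x, y in segment:
        g += 2 * (w * x - y) * x
    return g / len(segment)

def select_segments(history, current, w, threshold=0.0):
    """Keep historical segments whose average gradient aligns (positive
    product) with the gradient computed on current data; segments that
    drifted to a different concept pull the model the opposite way and
    are discarded. Illustrative only."""
    g_cur = avg_gradient(current, w)
    return [seg for seg in history
            if avg_gradient(seg, w) * g_cur > threshold]
```

With current data drawn from y = 2x and w = 1, a historical segment from the same concept has a gradient of the same sign and is kept, while a segment from y = -2x is rejected.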
Article
Full-text available
In a data stream environment, classification models must effectively and efficiently handle concept drift. Ensemble methods are widely used for this purpose; however, the ones available in the literature either use a large data chunk to update the model or learn the data one by one. In the former, the model may miss the changes in the data distribution, while in the latter, the model may suffer from inefficiency and instability. To address these issues, we introduce a novel ensemble approach based on the Broad Learning System (BLS), where mini chunks are used at each update. BLS is an effective lightweight neural architecture recently developed for incremental learning. Although it is fast, it requires huge data chunks for effective updates and is unable to handle dynamic changes observed in data streams. Our proposed approach, named Broad Ensemble Learning System (BELS), uses a novel updating method that significantly improves best-in-class model accuracy. It employs an ensemble of output layers to address the limitations of BLS and handle drifts. Our model tracks the changes in the accuracy of the ensemble components and reacts to these changes. We present the mathematical derivation of BELS, perform comprehensive experiments with 35 datasets that demonstrate the adaptability of our model to various drift types, and provide its hyperparameter, ablation, and imbalanced dataset performance analysis. The experimental results show that the proposed approach outperforms 10 state-of-the-art baselines, and supplies an overall improvement of 18.59% in terms of average prequential accuracy.
Article
Full-text available
Real-world text classification applications are of special interest for the machine learning and data mining community, mainly because they introduce and combine a number of special difficulties. They deal with high-dimensional, streaming, unstructured, and, on many occasions, concept drifting data. Another important peculiarity of streaming text, not adequately discussed in the relevant literature, is the fact that the feature space is initially unavailable. In this paper, we discuss this aspect of textual data streams. We underline the necessity for a dynamic feature space and the utility of incremental feature selection in streaming text classification tasks. In addition, we describe a computationally undemanding incremental learning framework that could serve as a baseline in the field. Finally, we introduce a new concept drifting dataset which could assist other researchers in the evaluation of new methodologies.
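One simple way a dynamic feature space with incremental feature selection could be realized is to let the vocabulary grow as documents arrive and re-rank features by a class-discrimination score on demand. The sketch below uses a crude score, |P(f|spam) − P(f|ham)|, as a stand-in; the paper's own baseline framework and scoring may differ, and all names are hypothetical.

```python
from collections import defaultdict

class IncrementalFeatureSelector:
    """Maintain a growing feature space over a text stream and rank
    features by how differently they occur in the two classes."""
    def __init__(self):
        self.doc_counts = defaultdict(lambda: [0, 0])  # word -> [ham docs, spam docs]
        self.class_totals = [0, 0]                     # [ham docs, spam docs]

    def update(self, words, label):
        """Incorporate one labeled document; unseen words extend the
        feature space on the fly."""
        self.class_totals[label] += 1
        for w in set(words):
            self.doc_counts[w][label] += 1

    def top_features(self, k):
        """Return the k currently most discriminative features."""
        def score(w):
            ham, spam = self.doc_counts[w]
            p_ham = ham / max(self.class_totals[0], 1)
            p_spam = spam / max(self.class_totals[1], 1)
            return abs(p_spam - p_ham)
        return sorted(self.doc_counts, key=score, reverse=True)[:k]
```

Because the ranking is recomputed from running counts, the selected feature set can change as the stream (and the concept) evolves, without any pass over past documents.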
Article
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
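The core of BIRCH is the clustering feature (CF): a subcluster is summarized by the triple (N, LS, SS), its point count, per-dimension linear sum, and per-dimension square sum. Centroids (and radii) can be computed from the summary alone, and two subclusters merge by simple addition, which is what makes single-scan, memory-bounded clustering possible. A minimal sketch of this summary structure (the CF-tree itself is omitted):

```python
class ClusteringFeature:
    """BIRCH clustering feature: (N, LS, SS) = point count, linear sum,
    and square sum per dimension. Points never need to be kept in memory."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = [0.0] * dim   # square sum per dimension

    def add(self, point):
        """Absorb one data point into the summary."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        """Centroid recovered from the summary alone."""
        return [s / self.n for s in self.ls]

    def merge(self, other):
        """CF additivity: two subclusters merge by component-wise addition."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]
```

The square sums are carried so that a subcluster's radius/diameter can also be derived from the triple, which BIRCH uses to decide whether an incoming point fits an existing leaf entry.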