Conference Paper

Incremental clustering based on decomposed Cauchy-like density for imbalanced data classification from data stream


Abstract

This paper presents a new approach to incremental clustering based on a decomposed Cauchy-like (deCauchy) density distribution. The algorithm relies on a metric in which each data sample is written as a unit orientation vector multiplied by a scalar, the length of the data vector. This notation offers a clear and transparent way to calculate the orientation density and the length density of each sample, both of which can also be computed recursively. The density, used as a measure of similarity, is derived from the Cauchy density and is very similar to the typicality defined in the possibilistic clustering approach. The described incremental Cauchy clustering has just two tuning parameters: the maximal orientation density and the maximal length density. The algorithm works on-line to handle data streams and evolves the model structure during operation by adding, merging and removing clusters. It can be used efficiently in many different clustering problems.
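The abstract gives no formulas, so the following is only a minimal Python sketch of the decomposition it describes, not the authors' method. The function names, the `spread` parameters, the cosine distance used for orientation and the squared difference used for length are all assumptions made for illustration.

```python
import math


def decompose(x):
    """Split a sample into (unit orientation vector, length)."""
    length = math.sqrt(sum(v * v for v in x))
    if length == 0.0:
        raise ValueError("a zero vector has no orientation")
    return [v / length for v in x], length


def cauchy_like(dissimilarity, spread):
    """Cauchy-like density: 1 at zero dissimilarity, decaying toward 0."""
    return 1.0 / (1.0 + dissimilarity / spread)


def orientation_density(u, centre_u, spread=0.1):
    # Assumption: orientation dissimilarity is the cosine distance 1 - u.c
    cos_sim = sum(a * b for a, b in zip(u, centre_u))
    return cauchy_like(max(0.0, 1.0 - cos_sim), spread)


def length_density(r, centre_r, spread=1.0):
    # Assumption: length dissimilarity is the squared difference of lengths
    return cauchy_like((r - centre_r) ** 2, spread)


def recursive_mean(mean, k, value):
    """Update a running mean with the k-th value (k starts at 1)."""
    return mean + (value - mean) / k
```

For instance, `decompose([3.0, 4.0])` yields a length of 5 and an orientation of roughly `[0.6, 0.8]`. A running statistic such as `recursive_mean` is one way the cluster centre could be kept up to date without storing past samples.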

Article
In this work we propose to use the Gustafson-Kessel (GK) algorithm within the PFCM (Possibilistic Fuzzy c-Means), so that the cluster distributions adapt better to the natural distribution of the data. The PFCM, proposed by Pal et al. in 2005, is founded on the fuzzy membership degrees of the FCM and the typicality values of the PCM. Nevertheless, this algorithm uses the Euclidean distance, which gives circular clusters. By incorporating the GK algorithm and the Mahalanobis measure for computing the distance, we can also obtain ellipsoidal forms, allowing a better representation of the clusters.
Article
Determining whether a user of an organizational information system or the IT industry is a normal user or an attacker is a challenging task. An Intrusion Detection System is an effective method for dealing with this kind of problem in networks. Different classifiers are used to detect the different kinds of attacks in networks. In this paper, the intrusion detection performance of various neural network classifiers is compared. The four types of classifiers used in the proposed research are Feed Forward Neural Network (FFNN), Generalized Regression Neural Network (GRNN), Probabilistic Neural Network (PNN) and Radial Basis Neural Network (RBNN). The performance on the full-featured KDD Cup 1999 dataset is compared with that on the reduced-feature KDD Cup 1999 dataset. MATLAB is used to train and test on the dataset, and the efficiency and False Alarm Rate are measured. The results show that the reduced dataset performs better than the full-featured dataset.
Article
Clustering text documents is a fundamental task in modern data analysis, requiring approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). Performance of these solvers is investigated by means of a large scale benchmark experiment.
Article
Increasing demands on effluent quality and loads call for an improved control, monitoring and fault detection of waste-water treatment plants (WWTPs). Improved control and optimization of WWTP lead to increased pollutant removal, a reduced need for chemicals as well as energy savings. An important step towards the optimal functioning of a WWTP is to minimize the influence of sensor faults on the control quality. To achieve this a fault-detection system should be implemented. In this paper the idea of using an evolving method as a base for the fault-detection/monitoring system is tested. The system is based on the evolving-fuzzy-model method. This method allows us to model the nonlinear relations between the variables with the Takagi-Sugeno fuzzy model. The method uses basic evolving mechanisms to add and remove clusters and the adaptation mechanism to adapt the clusters' and local models' parameters. The proposed fault-detection system is tested on measured data from a real WWTP. The results indicate the potential improvement of the WWTP's control during a sensor malfunction.
Article
Knowledge about computer users is very beneficial for assisting them, predicting their future actions or detecting masqueraders. In this paper, a new approach for creating and recognizing automatically the behavior profile of a computer user is presented. In this case, a computer user behavior is represented as the sequence of the commands (s)he types during her/his work. This sequence is transformed into a distribution of relevant subsequences of commands in order to find out a profile that defines its behavior. Also, because a user profile is not necessarily fixed but rather it evolves/changes, we propose an evolving method to keep up to date the created profiles using an Evolving Systems approach. In this paper we combine the evolving classifier with a trie-based user profiling to obtain a powerful self-learning on-line scheme. We also develop further the recursive formula of the potential of a data point to become a cluster center using cosine distance, which is provided in the Appendix. The novel approach proposed in this paper can be applicable to any problem of dynamic/evolving user behavior modeling where it can be represented as a sequence of actions or events. It has been evaluated on several real data streams. Index Terms—Evolving fuzzy systems, fuzzy-rule-based (FRB) classifiers, user modeling.
Article
In this paper an on-line fuzzy identification of the Takagi-Sugeno fuzzy model is presented. The presented method combines a recursive Gustafson–Kessel clustering algorithm and the fuzzy recursive least squares method. The on-line Gustafson–Kessel clustering method is derived. The recursive equations for the fuzzy covariance matrix, its inverse and the cluster centers are given. The use of the method is presented on two examples. The first example demonstrates the use of the method for monitoring the waste water treatment process, and in the second example the method is used to develop an adaptive fuzzy predictive functional controller for a pH process. The results for the Mackey–Glass time series prediction are also given. Keywords: Recursive fuzzy clustering, Recursive fuzzy identification, Clustering, Online recursive identification, Recursive Gustafson–Kessel clustering
Article
A new approach to the online classification of streaming data is introduced in this paper. It is based on a self-developing (evolving) fuzzy-rule-based (FRB) classifier system of Takagi-Sugeno (eTS) type. The proposed approach, called eClass (evolving classifier), includes different architectures and online learning methods. The family of alternative architectures includes: 1) eClass0, with the classifier consequents representing class label and 2) the newly proposed method for regression over the features using a first-order eTS fuzzy classifier, eClass1. An important property of eClass is that it can start learning "from scratch." Not only do the fuzzy rules not need to be prespecified, but neither does the number of classes for eClass (the number may grow, with new class labels being added by the online learning process). In the event that an initial FRB exists, eClass can evolve/develop it further based on the newly arrived data. The proposed approach addresses the practical problems of the classification of streaming data (video, speech, sensory data generated from robotic, advanced industrial applications, financial and retail chain transactions, intruder detection, etc.). It has been successfully tested on a number of benchmark problems as well as on data from an intrusion detection data stream to produce a comparison with the established approaches. The results demonstrate that a flexible (with evolving structure) FRB classifier can be generated online from streaming data achieving high classification rates and using limited computational resources.
Conference Paper
Visual analytics experts realize that one effective way to push the field forward and to develop metrics for measuring the performance of various visual analytics components is to hold an annual competition. The VAST 2008 Challenge is the third year that such a competition was held in conjunction with the IEEE Visual Analytics Science and Technology (VAST) symposium. The authors restructured the contest format used in 2006 and 2007 to reduce the barriers to participation and offered four mini-challenges and a Grand Challenge. Mini Challenge participants were to use visual analytic tools to explore one of four heterogeneous data collections to analyze specific activities of a fictitious, controversial movement. Questions asked in the Grand Challenge required the participants to synthesize data from all four data sets. In this paper we give a brief overview of the data sets, the tasks, the participation, the judging, and the results.
Conference Paper
The spherical k-means algorithm, i.e., the k-means algorithm with cosine similarity, is a popular method for clustering high-dimensional text data. In this algorithm, each document as well as each cluster mean is represented as a high-dimensional unit-length vector. However, it has been mainly used in batch mode. That is, each cluster mean vector is updated only after all document vectors have been assigned, as the (normalized) average of all the document vectors assigned to that cluster. This paper investigates an online version of the spherical k-means algorithm based on the well-known winner-take-all competitive learning. In this online algorithm, each cluster centroid is incrementally updated given a document. We demonstrate that the online spherical k-means algorithm can achieve significantly better clustering results than the batch version, especially when an annealing-type learning rate schedule is used. We also present heuristics to improve the speed, yet almost without loss of clustering quality.
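A winner-take-all update of the kind described in this abstract could look roughly like the Python sketch below. The function names and the fixed learning rate `lr` are assumptions for illustration. The annealing-type schedule mentioned in the abstract is not specified there and is left to the caller.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def online_skmeans_step(centroids, x, lr):
    """One winner-take-all update; mutates centroids, returns the winner index."""
    x = normalize(x)
    # Cosine similarity reduces to a dot product for unit-length vectors
    sims = [sum(a * b for a, b in zip(c, x)) for c in centroids]
    winner = max(range(len(centroids)), key=sims.__getitem__)
    # Move only the winning centroid toward the sample, then re-project
    # it onto the unit sphere
    moved = [c + lr * a for c, a in zip(centroids[winner], x)]
    centroids[winner] = normalize(moved)
    return winner
```

Calling `online_skmeans_step` once per incoming document, with `lr` decreased over time by the caller, gives an incremental counterpart to the batch averaging step.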
Article
In 1997, we proposed the fuzzy-possibilistic c-means (FPCM) model and algorithm that generated both membership and typicality values when clustering unlabeled data. FPCM constrains the typicality values so that the sum over all data points of typicalities to a cluster is one. The row sum constraint produces unrealistic typicality values for large data sets. In this paper, we propose a new model called possibilistic-fuzzy c-means (PFCM) model. PFCM produces memberships and possibilities simultaneously, along with the usual point prototypes or cluster centers for each cluster. PFCM is a hybridization of possibilistic c-means (PCM) and fuzzy c-means (FCM) that often avoids various problems of PCM, FCM and FPCM. PFCM solves the noise sensitivity defect of FCM, overcomes the coincident clusters problem of PCM and eliminates the row sum constraints of FPCM. We derive the first-order necessary conditions for extrema of the PFCM objective function, and use them as the basis for a standard alternating optimization approach to finding local minima of the PFCM objective functional. Several numerical examples are given that compare FCM and PCM to PFCM. Our examples show that PFCM compares favorably to both of the previous models. Since PFCM prototypes are less sensitive to outliers and can avoid coincident clusters, PFCM is a strong candidate for fuzzy rule-based system identification.
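The two quantities that PFCM combines can be sketched in Python as below, using the standard FCM membership and PCM typicality formulas rather than the PFCM update itself, which the abstract does not spell out. The function names and the per-cluster scale `eta` are illustrative assumptions.

```python
def fcm_memberships(dists, m=2.0):
    """FCM memberships of one point, given its distances to every centre."""
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The point coincides with one or more centres: share membership among them
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]


def pcm_typicality(dist, eta, m=2.0):
    """PCM typicality for one cluster; eta is that cluster's scale parameter."""
    return 1.0 / (1.0 + (dist * dist / eta) ** (1.0 / (m - 1.0)))
```

Memberships are relative and sum to one across clusters, whereas a typicality depends only on the distance to its own cluster, which is why an outlier can receive low typicality everywhere. With `m = 2` the typicality takes the Cauchy-like form 1 / (1 + d²/eta).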
Article
An Intrusion Detection System (IDS) is an effective tool that can help to prevent unauthorized access to network resources. A good intrusion detection system should have a higher detection rate and a lower false positive rate. A new classification system using Principal Component Analysis (PCA) neural networks for ID is proposed to distinguish intrusions from normal connections with a satisfactory detection rate and false positive rate. Experiments and evaluations were performed with the KDD Cup 99 intrusion detection database. Comparison with other approaches based on different evaluation parameters showed that the proposed approach has noticeable performance, with a detection rate of 99.596% and a false positive rate of 0.404%, and can classify the network connections with satisfactory performance.
Article
Purely based on a hierarchy of self-organizing feature maps (SOMs), an approach to network intrusion detection is investigated. Our principal interest is to establish just how far such an approach can be taken in practice. To do so, the KDD benchmark data set from the International Knowledge Discovery and Data Mining Tools Competition is employed. Extensive analysis is conducted in order to assess the significance of the features employed, the partitioning of training data and the complexity of the architecture. Contributions that follow from such a holistic evaluation of the SOM include recognizing that (1) best performance is achieved using a two-layer SOM hierarchy, based on all 41 features from the KDD data set. (2) Only 40% of the original training data is sufficient for training purposes. (3) The ‘Protocol’ feature provides the basis for a switching parameter, thus supporting modular solutions to the detection problem. The ensuing detector provides false positive and detection rates of 1.38% and 90.4% under test conditions; where this represents the best performance to date of a detector based on an unsupervised learning algorithm.
Article
We explore an approach to possibilistic fuzzy clustering that avoids a severe drawback of the conventional approach, namely that the objective function is truly minimized only if all cluster centers are identical. Our approach is based on the idea that this undesired property can be avoided if we introduce a mutual repulsion of the clusters, so that they are forced away from each other. We develop this approach for the possibilistic fuzzy c-means algorithm and the Gustafson–Kessel algorithm. In our experiments we found that in this way we can combine the partitioning property of the probabilistic fuzzy c-means algorithm with the advantages of a possibilistic approach w.r.t. the interpretation of the membership degrees.
Article
A recursive approach for adaptation of fuzzy rule-based model structure has been developed and tested. It uses on-line clustering of the input–output data with a recursively calculated spatial proximity measure. Centres of these clusters are then used as prototypes of the centres of the fuzzy rules (as their focal points). The recursive nature of the algorithm makes it possible to design an evolving fuzzy rule-base in on-line mode, which adapts to the variations of the data pattern. The proposed algorithm is instrumental for on-line identification of Takagi–Sugeno models, exploiting their dual nature and combined with the recursive modified weighted least squares estimation of the parameters of the consequent part of the model. The resulting evolving fuzzy rule-based models have a high degree of transparency, compact form, and computational efficiency. This makes them strongly competitive candidates for on-line modelling, estimation and control in comparison with the neural networks, polynomial and regression models. The approach has been tested with data from a fermentation process of lactose oxidation.
Article
A novel online learning approach for neuro-fuzzy models is proposed in this paper. Unlike most of the previous online methods, which use spherical clusters to define the validity region of neurons, the proposed learning method is based on a recursive extension of the Gath-Geva clustering algorithm, which is capable of constructing elliptical clusters as well. Eliminating the constraint of spherical clusters by considering general structures for covariance matrices empowers the proposed evolving neuro-fuzzy model (ENFM) to capture more sophisticated behaviors with less modeling error as well as fewer neurons. The proposed recursive clustering method has the ability to cluster data streams using online identification of the number of required clusters and recursive estimation of cluster parameters. A merging strategy is also proposed to merge similar clusters, which consequently prevents the model from having an excessive number of neurons with similar behaviors. Applicability of ENFM is also investigated in modeling a time-varying heat exchanger system and in prediction of the Mackey-Glass and sunspot numbers time series. Simulation results indicate better performance of the proposed model as compared with that of several well-known modeling and prediction methods.
Article
In this work, a new method consisting of a combination of discretizers, filters and classifiers is presented. Its aim is to improve the performance results of classifiers while using a significantly reduced set of features. The method has been applied to a binary and to a multiple class classification problem. Specifically, the KDD Cup 99 benchmark was used for testing its effectiveness. A comparative study with other methods and the KDD winner was carried out. The results obtained showed the adequacy of the proposed method, achieving better performance in most cases while reducing the number of features by more than 80%.
Article
Unstructured text documents are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by ...
Maciel L., F. Gomide, and R. Ballini (2014), Recursive possibilistic fuzzy modeling, 2014 IEEE Symposium on Evolving and Autonomous Learning Systems (EALS), IEEE SSCI, Orlando, pp. 9-16.
Škrjanc I., J. Iglesias, A. Sanchis, D. Leite, E. Lughofer, and F. Gomide (2019), Evolving Fuzzy and Neuro-Fuzzy Approaches in Clustering, Regression, Identification, and Classification: A Survey, Information Sciences, https://doi.org/10.1016/j.ins.2019.03.060.
Angelov P.P., and R. Yager (2011), Simplified fuzzy rule-based systems using non-parametric antecedents and relative data density, 2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS), pp. 62-69.
Tavallaee M., E. Bagheri, W. Lu, and A.-A. Ghorbani (2009), A detailed analysis of the KDD cup 99 data set, in Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications 2009.
Parsazad S., E. Saboori, and A. Allahyar (2012), Fast feature reduction in intrusion detection datasets, in Proceedings of MIPRO.