Conference Paper · PDF Available

Hierarchical Rules for a Hierarchical Classifier

Abstract

A system for extracting rules from a complex hierarchical classifier is proposed in this paper. Several methods exist for extracting rules from trained artificial neural networks (ANNs), but these methods do not scale well, i.e. results are satisfactory only for small problems. For complicated problems, hundreds of rules are produced, which are hard to manage. In this paper a hierarchical classifier with a tree-like structure and simple ANNs at its nodes is presented, which splits the original problem into several overlapping sub-problems. The node classifiers are all weak (i.e. with accuracy only better than random), and their errors are corrected at lower levels. Each sub-problem consists of examples that were hard to separate. Such an architecture classifies better than single-network models. At the same time, if–then rules are extracted which answer only which sub-problem a given example belongs to. Such rules, by introducing hierarchy, are simpler and easier to modify by hand, and give better insight into the behaviour of the original classifier.
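To make the routing idea concrete, here is a minimal sketch of rules that answer only which sub-problem an example belongs to, as the abstract describes. All names (`Rule`, `route`, the attribute tests) are illustrative assumptions, not taken from the paper:

```python
# Sketch of the routing idea: extracted rules do not predict a final class,
# they only decide which overlapping sub-problem (child node) an example
# should be passed to.  Rule shapes and attributes are hypothetical.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    condition: Callable[[dict], bool]   # if-part: tests example attributes
    subproblem: str                     # then-part: target sub-problem id

def route(example: dict, rules: List[Rule]) -> List[str]:
    """Return every sub-problem whose rule fires; because sub-problems
    overlap, more than one rule may match a single example."""
    return [r.subproblem for r in rules if r.condition(example)]

# Hypothetical rules for a 2-attribute problem.
rules = [
    Rule(lambda e: e["x1"] > 0.5, "subproblem-A"),
    Rule(lambda e: e["x2"] < 0.2, "subproblem-B"),
]
print(route({"x1": 0.7, "x2": 0.1}, rules))  # ['subproblem-A', 'subproblem-B']
```

Because an example may satisfy several rules, classification can fork into several sub-problems, matching the overlapping partition described in the abstract.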
... Multi-stage or hierarchical classification (Giusti et al., 2002; Podolak, 2007; Kurzyński, 1988) is widely used in many complex multi-category classification tasks. Existing research shows such techniques can potentially achieve the right trade-off between accuracy and resource allocation (Giusti et al., 2002; Podolak, 2007). Our proposed hierarchical system has a tree-like structure with three different types of classifier at the nodes (see Figure 1). ...
Conference Paper
Full-text available
Entity sense disambiguation becomes difficult with few or even zero training instances available, a situation known in machine learning as the imbalanced learning problem. To overcome this problem, we create a new set of reliable training instances from a dictionary, called dictionary-based prototypes. A hierarchical classification system with a tree-like structure is designed to learn from both the prototypes and the training instances, and three different types of classifiers are employed. In addition, supervised dimensionality reduction is conducted in a similarity-based space. Experimental results show our system outperforms three baseline systems by at least 8.3% as measured by macro F1 score.
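The abstract reports macro F1, which averages per-class F1 scores without weighting, so rare senses count as much as frequent ones. A self-contained sketch of the measure (labels are illustrative):

```python
# Macro F1: compute F1 per class, then take the unweighted mean.

def macro_f1(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec  = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "b"]))  # ~0.389
```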
... When studying the properties of HC, it became apparent that the overall accuracy depends on the clustering found by the algorithm, reflected in the correct value found with $Cl_{mod}$. The actual clusterings are found using machine learning approaches [4, 3]. We have noted that the actual number of possible clusterings is not known, and this became the motivation for this work. ...
Article
Full-text available
This paper presents a new combinatorial problem which emerged from studies on a hierarchical classifier, an artificial intelligence classification model. We introduce the notion of a proper clustering and show how to count their number in the special case when 3 clusters are allowed. An algorithm that generates all clusterings is given. We also show that the proposed approach can be generalized to any number of clusters and can be automated. Finally, we show the relationship between the problem of counting clusterings and the Dedekind problem.
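The abstract does not define "proper clustering" here, so the following brute-force enumerator rests on one assumed reading that is at least consistent with the Dedekind connection it mentions: a family of non-empty class subsets that covers all K classes and forms an antichain (no cluster contained in another). Treat the counts as illustrative only:

```python
# Brute-force enumeration sketch under an ASSUMED definition of a proper
# clustering: 3 distinct non-empty subsets of the K classes that together
# cover every class, with no subset contained in another (an antichain).

from itertools import combinations

def proper_clusterings(K, n_clusters=3):
    classes = frozenset(range(K))
    subsets = [frozenset(s) for r in range(1, K + 1)
               for s in combinations(range(K), r)]
    found = []
    for fam in combinations(subsets, n_clusters):   # unordered families
        if frozenset().union(*fam) != classes:
            continue                                # must cover every class
        if any(a < b for a in fam for b in fam):    # antichain condition
            continue
        found.append(fam)
    return found

print(len(proper_clusterings(K=4)))  # number of 3-cluster families for K=4
```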
... An interesting transformation of a frame-based representation with uncertainty into a Bayesian model is described in [21]. A method for converting artificial neural networks to rules is presented in [22]; a related approach, the extraction of hierarchical rules from neural networks, is shown in [23]. The extraction of rules from a support vector machine model is described in [24]. ...
Conference Paper
Full-text available
In this paper the B2R algorithm, which converts Bayesian networks into sets of rules, is proposed. It is tested on several data sets with various configurations, and the results show that accuracy remains similar to that of the original Bayesian networks even after pruning a large number of rules. This makes it possible to exploit the advantages of both knowledge representation techniques.
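The abstract gives no details of B2R itself, so the sketch below shows only the general conditional-probability-table-to-rules idea, not the actual algorithm: one candidate rule per parent configuration, pruned by a confidence threshold (a stand-in for the pruning the abstract mentions). All structures are illustrative:

```python
# NOT the actual B2R algorithm; a generic CPT-to-rules sketch.

def cpt_to_rules(node, parents, cpt, threshold=0.8):
    """cpt maps (parent_value_tuple, node_value) -> probability."""
    rules = []
    values = sorted({v for (_, v) in cpt})
    parent_configs = sorted({cfg for (cfg, _) in cpt})
    for cfg in parent_configs:
        for v in values:
            p = cpt.get((cfg, v), 0.0)
            if p >= threshold:                      # prune weak rules
                cond = " AND ".join(f"{par}={val}"
                                    for par, val in zip(parents, cfg))
                rules.append(f"IF {cond} THEN {node}={v}  [p={p:.2f}]")
    return rules

cpt = {(("yes",), "wet"): 0.9, (("yes",), "dry"): 0.1,
       (("no",),  "wet"): 0.2, (("no",),  "dry"): 0.8}
for r in cpt_to_rules("grass", ["rain"], cpt):
    print(r)
```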
... A detailed discussion of this issue is given in [10], where a new notion of weakness for multiclass classification was introduced and analyzed. The HC model was introduced earlier in [11,12]. In this paper we introduce a new method for computing the risk estimation $\hat{R}(Cl_{HC})$ of HC. ...
Conference Paper
Full-text available
We describe the Hierarchical Classifier (HC), which is a hybrid architecture [1] built with the help of supervised training and unsupervised problem clustering. We prove a theorem giving the estimation $\hat{R}$ of HC risk. The proof works because of an improved way of computing cluster weights, introduced in this paper. Experiments show that $\hat{R}$ is correlated with HC real error. This allows us to use $\hat{R}$ as the approximation of HC risk without evaluating HC subclusters. We also show how $\hat{R}$ can be used in efficient clustering algorithms by comparing HC architectures with different methods of clustering.
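The abstract states that $\hat{R}$ combines sub-cluster information through improved cluster weights but does not give its closed form, so the following is only one assumed shape: the HC risk estimate as a weighted average of per-sub-cluster risks:

```python
# Minimal sketch of ONE assumed form of the risk estimate \hat{R}.

def risk_estimate(subcluster_risks, cluster_weights):
    """Weighted average of per-sub-cluster risks; weights are assumed
    non-negative and summing to 1 (e.g. the fraction of examples routed
    to each sub-cluster)."""
    assert abs(sum(cluster_weights) - 1.0) < 1e-9
    return sum(w * r for w, r in zip(cluster_weights, subcluster_risks))

# Three sub-clusters: risks measured on held-out data, weights from routing.
print(risk_estimate([0.30, 0.15, 0.22], [0.5, 0.3, 0.2]))  # 0.239
```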
Article
Full-text available
The notion of a weak classifier, as one which is "a little better" than a random one, was first introduced for 2-class problems [1]. Extensions to K-class problems are known. All are based on relative activations for correct and incorrect classes and do not take into account the final choice of the answer. A new understanding and definition is proposed here, which takes into account only the final classification choice that must be made. It is shown that for a K-class classifier to be called "weak", it needs to achieve a risk value lower than 1/K. This approach considers only the probability of the final answer choice, not the actual activations.
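Since the definition above depends only on the final answer choice, not on class activations, a weakness check reduces to an empirical risk computation over hard decisions. A minimal sketch, using the 1/K threshold quoted in the abstract:

```python
# Weakness test based on final answers only, per the abstract's criterion.

def is_weak(y_true, y_pred, K):
    """True iff the empirical risk (fraction of wrong final answers)
    is below 1/K, the bound stated in the abstract."""
    risk = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    return risk < 1.0 / K

print(is_weak([0, 1, 2, 0, 1], [0, 1, 2, 0, 2], K=3))  # risk 0.2 < 1/3
```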
Article
In this paper a novel complex classifier architecture is proposed. The architecture has a hierarchical tree-like structure with simple artificial neural networks (ANNs) at each node. The actual structure for a given problem is not preset but is built during training.

The training algorithm's ability to build the tree-like structure is based on the assumption that when a weak classifier (i.e., one that classifies only slightly better than a random classifier) is trained and examples from any two output classes are frequently mismatched, then they must carry similar information and constitute a sub-problem. After each ANN has been trained, its incorrect classifications are analyzed and new sub-problems are formed. Consequently, new ANNs are built for each of these sub-problems and form another layer of the hierarchical classifier.

An important feature of the hierarchical classifier proposed in this work is that the problem partition forms overlapping sub-problems. Thus, the classification follows not just a single path from the root, but may fork, enhancing the power of the classification. It is shown how to combine the results of these individual classifiers.
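The sub-problem formation step described above can be sketched from a confusion matrix: classes that a trained node frequently mixes up are grouped together. The thresholded confusion-graph grouping below is an illustrative stand-in for the paper's actual procedure:

```python
# Sketch: form overlapping sub-problems from frequently confused classes.

import numpy as np

def confused_subproblems(confusion, threshold):
    """confusion[i, j] = how often class i was labelled as class j.
    For each class, form a sub-problem from the classes it is most often
    mixed up with; sub-problems may overlap."""
    K = confusion.shape[0]
    mix = confusion + confusion.T           # symmetric confusion strength
    np.fill_diagonal(mix, 0)
    subproblems = []
    for i in range(K):
        group = {i} | {j for j in range(K) if mix[i, j] >= threshold}
        if len(group) > 1 and group not in subproblems:
            subproblems.append(group)
    return subproblems

conf = np.array([[50,  8,  1],
                 [ 9, 40,  2],
                 [ 0,  3, 47]])
print(confused_subproblems(conf, threshold=5))  # [{0, 1}, {0, 1, 2}, {1, 2}]
```

Note how class 1 appears in all three groups: an example of one class ending up in several sub-problems, as the overlapping partition requires.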
Conference Paper
Full-text available
A novel architecture for a hierarchical classifier (HC) is defined. The objective is to combine several weak classifiers to form a strong one, but the approach differs from known ones such as AdaBoost: the training set is split on the basis of the previous classifier's misclassifications between output classes. The problem is split into overlapping sub-problems, each classifying into a different set of output classes. This allows for a task-size reduction, as each sub-problem is smaller in the sense of having fewer output classes, and for higher accuracy. The groups of output classes overlap, so examples from a single class may end up in several sub-problems. It is shown that this approach ensures that such a hierarchical classifier achieves better accuracy. A notion of generalized accuracy is introduced. Sub-problem generation is simple, as it is performed with a clustering algorithm operating on classifier outputs. We propose to use the Growing Neural Gas algorithm [1] because of its good adaptiveness.
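The abstract clusters node-classifier outputs with Growing Neural Gas; GNG is too long to reproduce here, so this sketch plainly swaps in k-means on the per-class mean output vectors. The overlap comes from assigning a class to every cluster whose centre is close enough, not only the nearest one; the `slack` parameter is an assumption of this sketch:

```python
# Overlapping class groups from clustering classifier outputs
# (k-means used in place of the paper's Growing Neural Gas).

import numpy as np
from sklearn.cluster import KMeans

def cluster_classes(outputs, labels, K, n_clusters=2, slack=1.2):
    """outputs: (n_examples, K) classifier activations; labels: true class.
    Returns overlapping groups of class indices."""
    means = np.vstack([outputs[labels == c].mean(axis=0) for c in range(K)])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(means)
    dists = km.transform(means)             # class-to-centre distances
    nearest = dists.min(axis=1)
    groups = [set(np.where(dists[:, j] <= slack * nearest)[0])
              for j in range(n_clusters)]
    return groups

rng = np.random.default_rng(0)
outputs = rng.random((120, 4))
labels = rng.integers(0, 4, 120)
print(cluster_classes(outputs, labels, K=4))
```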
Article
This paper addresses the problem of improving the accuracy of an hypothesis output by a learning algorithm in the distribution-free (PAC) learning model. A concept class is learnable (or strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output an hypothesis that is correct on all but an arbitrarily small fraction of the instances. The concept class is weakly learnable if the learner can produce an hypothesis that performs only slightly better than random guessing. In this paper, it is shown that these two notions of learnability are equivalent. A method is described for converting a weak learning algorithm into one that achieves arbitrarily high accuracy. This construction may have practical applications as a tool for efficiently converting a mediocre learning algorithm into one that performs extremely well. In addition, the construction has some interesting theoretical consequences, including a set of general upper bounds on the complexity of any strong learning algorithm as a function of the allowed error ε.
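The core of this construction is that a majority vote over three suitably trained hypotheses turns error b into 3b² − 2b³ < b (for 0 < b < 1/2), and recursing on the vote drives the error arbitrarily low. The sketch below only traces that recurrence numerically; it is not the full boosting algorithm:

```python
# Trace of the error recurrence behind the weak-to-strong conversion:
# a three-way majority vote maps error b to 3*b**2 - 2*b**3.

def boosted_error(b, depth):
    """Error after `depth` levels of the three-hypothesis majority vote."""
    for _ in range(depth):
        b = 3 * b**2 - 2 * b**3
    return b

b = 0.45                       # a weak learner: barely better than random
for d in range(5):
    print(d, round(boosted_error(b, d), 6))
# The error shrinks towards 0 even though the base learner is only
# slightly better than guessing.
```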
Book
The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics. The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local "memory-based" models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.