Thesis (PDF available)

Decision Tree and Decision Forest Algorithms: On Improving Accuracy, Efficiency and Knowledge Discovery

Abstract

The “Digital Revolution” has blessed human civilization with an enormous amount of data. The challenge of automatically analyzing these data has heightened the need for sophisticated data mining methods. Within data mining, classification plays a very important role, both in predicting the “class” of an unseen instance and in discovering patterns in data. In today’s data-driven world, classification is applied in our day-to-day activities, so improving prediction accuracy and simplifying knowledge discovery from classifiers are of paramount importance. The decision tree is one of the most popular classifiers: it can predict unseen instances with high accuracy and generate human-interpretable knowledge. Moreover, owing to their unstable nature, decision trees are often used as base classifiers to form ensembles of decision trees. An ensemble of decision trees, popularly known as a decision forest, is generally more robust to noise and more accurate than a single decision tree. Many decision tree and decision forest building algorithms have been proposed in the literature; however, the existing algorithms have various limitations that leave room for further improvement. Furthermore, decision forests are more memory-intensive and less knowledge-extractable than a single decision tree. Hence, in this thesis we propose several novel algorithms for improving the accuracy of decision trees and decision forests, then propose a technique to reduce the size of decision forests while retaining or increasing ensemble accuracy, and finally propose a framework for effective knowledge discovery from decision forests. To validate the proposed algorithms and techniques, we carry out extensive experiments on several publicly available data sets. The experimental results indicate that the proposed algorithms and techniques clearly improve the current state of the art in the applicable areas.
... The use of ensembles in the domain of "classification" is an active area of research [1,2,3,4,5]. An ensemble of classifiers is found to be effective when it is generated from unstable classifiers such as decision trees [6]. ...
... A decision forest (in short, forest) is an ensemble of decision trees (in short, trees) in which each individual tree acts as a base classifier [6]. A forest is said to impart better generalization ability and to be more robust to noise, as the trees in a forest are less tightly coupled to the training data set through the selection of appropriate subsets of attributes and/or records [5]. ...
... On the other hand, due to their unstable nature, trees become structurally different when they are generated from differently perturbed training data sets. Structurally different trees can increase uncorrelated classification errors [14] and hence can increase the ensemble accuracy of the forest [15,5]. Therefore, the training data set needs to be perturbed differently for each tree to make the trees diverse in terms of classification errors. ...
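The perturbation idea in the excerpt above is commonly realized through bootstrap resampling, the mechanism bagging-style forests use to give each tree a differently perturbed training set. A minimal sketch (data and names are illustrative, not any specific algorithm from the thesis):

```python
import random

def bootstrap_sample(records, seed=None):
    """Draw len(records) records with replacement. Each tree in a
    bagged forest trains on a different such sample, so the trees
    see differently perturbed data and grow structurally diverse."""
    rng = random.Random(seed)
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]

data = list(range(10))
# One distinct bootstrap sample per tree in the forest.
per_tree_samples = [bootstrap_sample(data, seed=s) for s in range(3)]
```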
Chapter
Full-text available
One of the most important requirements for a decision forest to secure better ensemble accuracy is generating decision trees as base classifiers that are simultaneously accurate and diverse. Most decision forest algorithms in the literature exploit one or both of the two major sources of diversity, subspacing and sampling, to induce diversity among the decision trees. Recently, a decision forest algorithm named “Stochastic Forest” introduced stochastic selection of splitting attributes in the process of inducing decision trees and reported promising results. In this paper, we explore the worthiness of stochastic selection of splitting attributes as a source of diversity in comparison with the two existing major sources. We carry out experiments on twenty-five popular data sets that are publicly available from the UCI Machine Learning Repository. The experimental analysis demonstrates the worthiness of stochastic selection of splitting attributes as an effective source of diversity.
... For Random Forest, the size of the attribute subspace f is |f| = int(log2(m)) + 1, where m is the number of non-class attributes. This principally dictates the balance between individual tree accuracy and diversity among the trees [4,6,1]. It has been established that |f| = int(log2(m)) + 1 reacts unevenly for low- and high-dimensional data sets [4,6,1]. ...
... This principally dictates the balance between individual tree accuracy and diversity among the trees [4,6,1]. It has been established that |f| = int(log2(m)) + 1 reacts unevenly for low- and high-dimensional data sets [4,6,1]. For example, for a low-dimensional data set with 4 attributes, splitting attributes are determined from randomly selected subspaces of 3 attributes (int(log2(4)) + 1 = 3). ...
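The default subspace size discussed in these excerpts is easy to compute directly; a small sketch illustrating why it reacts unevenly at the two ends of the dimensionality scale:

```python
import math

def subspace_size(m):
    """Random Forest's default subspace size, int(log2(m)) + 1,
    where m is the number of non-class attributes."""
    return int(math.log2(m)) + 1

# Low-dimensional case: 3 of 4 attributes are split candidates at
# every node, leaving little room for diversity among the trees.
print(subspace_size(4))     # → 3
# High-dimensional case: only 10 of 1000 attributes are candidates,
# which can depress individual tree accuracy.
print(subspace_size(1000))  # → 10
```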
... Since the weights are assigned randomly from the same uniform distribution, all attributes, good or bad, have the same probability of acquiring a low or high weight value. Hence, as before, classification capacities retain their influence in determining the ultimate merit values of the attributes [6,1]. As a solution, a parameter p was introduced as an exponent on the random weights; four different p values (p = 1, 2, 3, and 4) were explored in RFW [37]. ...
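The role of the exponent p can be sketched as follows. The attribute names and merit values are hypothetical, and RFW's exact merit computation may differ in detail; this only illustrates how raising a uniform weight to a power p dampens the raw classification capacity:

```python
import random

def rfw_merits(merits, p=2, seed=0):
    """Multiply each attribute's raw merit (e.g. gain ratio) by a
    uniform random weight raised to the power p. A larger p spreads
    the weights further apart, so the random component counteracts
    the raw classification capacity more strongly."""
    rng = random.Random(seed)
    return {a: merit * rng.random() ** p for a, merit in merits.items()}

merits = {"outlook": 0.25, "humidity": 0.15, "windy": 0.05}
damped = rfw_merits(merits, p=4)
# With the same seed, p = 4 never yields a larger merit than p = 1,
# since w**4 <= w for any w in [0, 1).
```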
Article
Full-text available
The foremost requirement for a decision forest to achieve better ensemble accuracy is building a set of accurate and diverse individual decision trees as base classifiers. Existing decision forest algorithms mainly differ from each other on how they induce diversity among the decision trees. At the same time, most of the drawbacks of existing algorithms originate from their induction processes of diversity. In this paper, we propose a new decision forest algorithm that is more balanced through effective synchronization between different sources of diversity. The proposed algorithm is balanced theoretically and empirically. We carried out experiments on 25 well-known data sets that are publicly available from the UCI Machine Learning Repository, to perform an extensive empirical evaluation. The experimental results indicate that the proposed algorithm has the best average ensemble accuracy rank of 1.8 compared to its closest competitor at 3.5. Using the Friedman and Bonferroni-Dunn tests, we also show that such an improvement is indeed statistically significant. In addition, the proposed algorithm is found to be competitive in terms of complexity and other relevant parameters.
... According to Hunt's algorithm, a decision tree is induced recursively from the training data set, i.e., the data set in which every record is labeled with a class value. The induction process starts by evaluating each non-class attribute as a candidate for dividing the training data set D into a disjoint set of horizontal segments/partitions [11,12,16]. If the non-class attribute Ai is categorical with k different domain values i.e. ...
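The recursive partitioning step described in the excerpt above can be sketched for categorical attributes as follows. This is a toy illustration: a real implementation would choose the splitting attribute by a merit measure such as gain ratio rather than taking the first one.

```python
from collections import Counter

def hunts_tree(records, attributes):
    """Minimal sketch of Hunt's recursive induction for categorical
    attributes. `records` are (attribute-dict, class) pairs; each
    split partitions the data into one segment per domain value."""
    classes = [c for _, c in records]
    majority = Counter(classes).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain.
    if len(set(classes)) == 1 or not attributes:
        return majority
    A = attributes[0]  # a real algorithm selects A by gain ratio etc.
    partitions = {}
    for rec, c in records:
        partitions.setdefault(rec[A], []).append((rec, c))
    return {A: {v: hunts_tree(part, attributes[1:])
                for v, part in partitions.items()}}

data = [({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "yes"),
        ({"outlook": "rain"}, "yes")]
tree = hunts_tree(data, ["outlook"])
# → {'outlook': {'sunny': 'no', 'rain': 'yes'}}
```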
... We generate 100 trees for every decision forest, since this number is considered large enough to ensure convergence of the ensemble effect [22,32]. All the results reported in this paper are obtained using 10-fold Cross-Validation (10-CV) [33,16] on every data set. The best results are distinguished in bold face. ...
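The 10-CV protocol mentioned above partitions the records so that each one is tested exactly once; a minimal index-level sketch:

```python
def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross
    validation: the n records are split into k disjoint folds and
    each fold serves once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(100, k=10))
# 10 splits; every record appears in exactly one test fold.
```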
Chapter
Full-text available
Random Forest is one of the most popular decision forest building algorithms that use decision trees as the base classifier. The decision trees for Random Forest are formed from the records of a training data set, which makes them almost equally biased towards that training data set. In reality, the testing data set can differ significantly from the training data set. Thus, to reduce the bias of the decision trees, and hence of Random Forest, we introduce a random weight for each decision tree. We present experimental results on four widely used data sets from the UCI Machine Learning Repository. The experimental results indicate that the proposed technique can reduce the bias of Random Forest, making it less sensitive to noisy data. Keywords: Bias, Decision tree, Random Forest
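The per-tree random weight idea can be illustrated with a weighted majority vote. This is a sketch of the general idea only; the paper's exact weighting scheme may differ:

```python
import random
from collections import defaultdict

def weighted_vote(tree_predictions, seed=0):
    """Assign each tree a random weight and take a weighted majority
    vote over the trees' predictions, so that no single tree is
    fully trusted to mirror the training data."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in tree_predictions]
    tally = defaultdict(float)
    for w, pred in zip(weights, tree_predictions):
        tally[pred] += w
    return max(tally, key=tally.get)

# Three trees vote on one unseen instance.
print(weighted_vote(["yes", "yes", "no"]))
```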
... Though decision trees have proved efficient at classifying acoustic voice signal data, decision forests are preferable over traditional decision trees because they are more precise and accurate and handle overfitting and underfitting smartly through the decision tree ensemble. Random Forest, a random tree ensemble frequently used for Parkinson's detection [23,24], relies on the bagging approach to ensembling, which makes the decision forest versatile [25]. However, Random Forest [26] is not the only decision forest in the literature. ...
... Random Forest undertakes bagging as the training method on a random subspace. For a feature set f, the number of features chosen randomly for training is |f| = int(log2(m)) + 1, where m is the number of inputs [25]. The benefit of Random Forest is its versatility to varying data sizes. ...
Article
Full-text available
Biomedical engineers prefer decision forests over traditional decision trees to design state-of-the-art Parkinson’s Detection Systems (PDS) on massive acoustic signal data. However, a challenge researchers face with decision forests is identifying the minimum number of decision trees required to achieve maximum detection accuracy with the lowest error rate. This article examines two recent decision forest algorithms, Systematically Developed Forest (SysFor) and Decision Forest by Penalizing Attributes (ForestPA), along with the popular Random Forest, to design three distinct Parkinson’s detection schemes with an optimal number of decision trees. The proposed approach uses the minimum number of decision trees that achieves maximum detection accuracy. The training and testing samples and the density of trees in the forest are kept dynamic and incremental to obtain decision forests with maximum capability for detecting Parkinson’s Disease (PD). Incremental tree densities with dynamic training and testing of decision forests proved to be a better approach for detecting PD. The proposed approaches are examined alongside other state-of-the-art classifiers, including modern deep learning techniques, to observe their detection capability. The article also provides a guideline for generating an ideal training and testing split of two modern acoustic data sets of Parkinson’s and control subjects donated by the Department of Neurology in Cerrahpaşa, Istanbul and the Departamento de Matemáticas, Universidad de Extremadura, Cáceres, Spain. Among the three proposed detection schemes, Forest by Penalizing Attributes (ForestPA) proved to be a promising Parkinson’s disease detector, requiring only a small number of decision trees in the forest to score the highest detection accuracy of 94.12% to 95.00%.
... Generally, an essential aim of forest algorithms is to improve the Ensemble Accuracy (EA) [1]. As such, every contending forest algorithm described in this paper treats increasing EA as its principal performance metric. ...
Chapter
Full-text available
Decision forests have attracted the academic community’s interest mainly due to their simplicity and transparency. This paper proposes two novel decision forest building techniques, called Maximal Information Coefficient Forest (MICF) and Pearson’s Correlation Coefficient Forest (PCCF). The proposed algorithms use Pearson’s Correlation Coefficient (PCC) and the Maximal Information Coefficient (MIC) as extra measures of each feature’s classification capacity. With these measures, we improve the selection of the most suitable feature at each splitting node, the feature with the greatest Gain Ratio. We conduct experiments on 12 data sets available from the publicly accessible UCI Machine Learning Repository. Our experimental results indicate that the proposed methods have the best average ensemble accuracy ranks of 1.3 (for MICF) and 3.0 (for PCCF), compared to their closest competitor, Random Forest (RF), which has an average rank of 4.3. Additionally, the results of the Friedman and Bonferroni-Dunn tests indicate a statistically significant improvement.
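As a rough illustration of using PCC as an extra merit score (a sketch under our own assumptions, not the authors' exact procedure), the correlation between a numeric feature and a 0/1-encoded class can be computed directly:

```python
import math

def pearson(xs, ys):
    """Pearson's Correlation Coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

labels = [0, 0, 1, 1]               # 0/1-encoded class
aligned = [0.0, 0.0, 1.0, 1.0]      # feature tracking the class exactly
print(pearson(aligned, labels))     # → 1.0
# A feature with |PCC| near 1 would be favoured at the splitting node.
```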
... The predictive model is constructed by recursively partitioning the data set into subsets and evaluating the information gain before and after splitting, in order to choose the best split and produce a tree with a minimum error rate. Figure 3-25 illustrates how CART is constructed, and the details of the procedure are explained in the following steps (Adnan, 2017; Tan et al., 2006). Step 2: Choosing the best split cut point. Each candidate cut point splits the data set into two child nodes, as CART performs binary splitting. ...
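The binary cut-point search can be sketched as follows. This uses Gini impurity as the split criterion, a common choice for CART; the excerpt above speaks of information gain, but the search loop over candidate cut points is the same.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Try each candidate cut point and keep the one with the lowest
    weighted impurity of the two resulting child nodes."""
    n = len(values)
    best = (None, float("inf"))
    for cut in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= cut]
        right = [l for v, l in zip(values, labels) if v > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cut, score)
    return best

values = [1.0, 2.0, 3.0, 4.0]
labels = ["no", "no", "yes", "yes"]
print(best_cut(values, labels))  # → (2.0, 0.0): a pure binary split
```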
Thesis
Full-text available
Lameness can be described as painful, erratic movement relating to the locomotor system that results in the animal deviating from its normal gait or posture. Lameness is considered one of the major health and welfare concerns for the sheep industry in the UK; it leads to a substantial economic problem and causes a reduction in overall farm productivity. According to a 2013 report by ADAS entitled ‘Economic Impact of Health and Welfare Issues in Beef, Cattle and Sheep in England’, each lame ewe costs £89.80 due to the decline in body condition, lambing percentage, growth rate, and fertility. Thus, early lameness detection eliminates the negative impact of lameness and increases the chance of a favourable outcome from treatment. The development of wearable sensor technologies enables the idea of remotely monitoring the changes in animal behaviours or movements that relate to lameness. The aim of this thesis was to evaluate the feasibility and accessibility of a proposed data mining approach (SLDM) for detecting the early signs of lameness in sheep by analysing data retrieved from a wearable motion sensor mounted within a sheep’s neck collar, investigating the most cost-effective factors that contribute to lameness detection across the whole data mining process, including: sensor sampling rate, segmentation method, window size, extracted features, feature selection method, and applicable classification algorithm. Three classes are recognised for sheep while walking throughout the data collection process (sound, mild, and severe lameness). The sheep data were collected using three different sensor applications (Sheep Tracker, SensoDuino, SensorLog), which collect sheep movement data at sampling rates of 10, 5, and 4 Hz. Various sensing data were retrieved in the X, Y, and Z dimensions; however, only accelerometer, gyroscope, and orientation readings are considered in the current study.
Four sheep data sets are aggregated, comprising 31, 10, 18, and 7 sheep respectively. The work in this thesis evaluates the performance of ensemble classifiers (Bagging, Boosting, or RUSBoosting) using three different validation methods (5-fold, 0.3 hold-out, and the proposed ‘Single Sheep Splitting’), compared across three sampling rates (10, 5, 4 Hz), two segmentation approaches (FNSW and FOSW), three feature selection methods (ReliefF, GA, and RF), and three window sizes (10, 7, 5 s). Promising lameness prediction accuracies are achieved over most of the combinations (3 sampling rates, 2 segmentation methods, 3 window sizes, 183 extracted features, 3 feature selection methods, 3 ensemble classification models, and 3 model validation methods). The highest accuracy, however, is achieved by the Bagging ensemble classifier: 88.92%, with F-scores of 87.7%, 91.1%, and 88.2% for the sound, mildly lame, and severely lame walking classes, respectively. These results are obtained using 5-fold cross-validation over a 10 s window for sheep data collected at a 10 Hz sampling rate, using only the accelerometer hardware readings and calculated orientation readings. The number of features selected is 46, optimised by GA using a CHAID tree as the fitness function. Conversely, the lowest prediction accuracy of 56.25%, with F-scores of 63.4% (sound), 51.9% (mildly lame), and 48.8% (severely lame), is recorded when the RUSBoosting ensemble is applied using 5-fold cross-validation over a 10 s window for the data set collected at the 4 Hz sampling rate. The major research findings therefore recommend a 10 Hz sampling rate as adequate for collecting sheep movements, while the best segmentation method is FOSW, in which 20% of data points are shared between two successive windows. The preferable number of data points (sheep movements) to be pre-processed is around 100, obtained with a 10 s or 7 s window size.
Additionally, the 20 features selected by RF out of 183 achieve good accuracy compared to the whole set of extracted features. Although the GA feature selection method has a slower execution time than RF, competitive prediction accuracy is achieved when the features selected by GA are fed to the classifier. Finally, the acceleration sensor data alone are capable of supporting the decision about lame sheep, so no extra hardware sensor such as a gyroscope is required for decision making; moreover, the orientation features, which contribute most to lameness detection, can be derived directly from the accelerometer readings. Since the most cost-effective factors are identified in this research, the practice could in the meantime be applied by farmers, stakeholders, and manufacturers, as no sensor for detecting lame sheep has been developed yet. The multidisciplinary nature of the research therefore opens diverse paths towards further studies developing various data mining approaches and practical sensor kits to detect the early signs of sheep lameness, for better farm productivity and the prosperity of the sheep industry in the UK.
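The windowing parameters the thesis settles on (10 s windows at 10 Hz, FOSW with 20% overlap) can be sketched as follows. This is a minimal illustration of the segmentation step only, not the thesis's actual pipeline:

```python
def fosw_windows(samples, window, overlap=0.2):
    """Fixed-size Overlapping Sliding Window (FOSW) sketch: successive
    windows share `overlap` of their data points; FNSW is the special
    case overlap = 0. At 10 Hz, a 10 s window holds 100 readings."""
    step = int(window * (1 - overlap))
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]

signal = list(range(300))   # e.g. 30 s of 10 Hz accelerometer readings
wins = fosw_windows(signal, window=100, overlap=0.2)
# Successive windows share 20 of their 100 data points.
```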
Article
Full-text available
Whether classifying data in the field of neural networks or in biometrics applications such as handwriting classification or iris detection, arguably the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour classifier, in which classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. k-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the classifier's outputs under various distance measures, which may help in gauging the response of the classifier for a desired application; it also discusses computational issues in identifying nearest neighbours and mechanisms for reducing the dimension of the data.
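A minimal k-NN sketch showing how the choice of distance plugs in (the data and names here are hypothetical):

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3, dist=None):
    """Plain k-NN: find the k training examples nearest to the query
    under the chosen distance and return the majority class among
    them. Swapping `dist` changes which neighbours, and hence which
    class, win."""
    if dist is None:
        dist = math.dist                      # Euclidean by default
    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(c for _, c in neighbours).most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b"),
         ((5.1, 4.9), "b"), ((0.2, 0.1), "a")]
print(knn_classify((0.05, 0.05), train))  # → 'a'
```

Passing e.g. `dist=lambda a, b: sum(abs(x - y) for x, y in zip(a, b))` switches to Manhattan distance, one of the alternative metrics the paper's comparison explores.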
Article
Full-text available
Data mining is the computer-assisted process of digging through and analysing enormous sets of data and extracting their meaning. Data mining tools predict behaviours and future trends, allowing businesses to make proactive decisions, and can answer questions that were traditionally very time-consuming to resolve. They can therefore be used to predict meteorological data, i.e., for weather prediction. Weather prediction is a vital application in meteorology and has been one of the most scientifically and technologically challenging problems across the world in the last century. Predicting the weather is essential to help prepare for the best and the worst of the climate. Many weather prediction tasks, such as rainfall prediction, thunderstorm prediction, and predicting cloud conditions, are major challenges for atmospheric research. This paper reviews data mining techniques for weather prediction and studies the benefits of using them. It surveys the available literature on algorithms employed by different researchers utilising various data mining techniques for weather prediction. The work done by various researchers in this field is reviewed and compared in tabular form. For weather prediction, decision trees and k-means clustering prove to be good, with higher prediction accuracy than other data mining techniques.
Book
This brief provides methods for harnessing Twitter data to discover solutions to complex inquiries. It introduces the process of collecting data through Twitter's APIs and offers strategies for curating large data sets. The text illustrates Twitter data with real-world examples, presents the challenges and complexities of building visual analytic tools, and describes the best strategies to address these issues. Examples demonstrate how powerful measures can be computed from various Twitter data sources. Due to its openness in sharing data, Twitter is a prime example of social media in which researchers can verify their hypotheses and practitioners can mine interesting patterns and build their own applications. This brief is designed to provide researchers, practitioners, project managers, and graduate students with an entry point to jump-start their Twitter endeavours. It also serves as a convenient reference for readers seasoned in Twitter data analysis.
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
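XGBoost itself is a large engineered system, but the additive idea at the heart of tree boosting can be sketched with one-dimensional regression stumps fitted to residuals. This is a toy illustration of gradient boosting for squared loss, not XGBoost's actual algorithm (which adds regularization, sparsity-aware splits, quantile sketches, and systems-level optimizations):

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split minimising squared error when
    predicting the mean residual on each side."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, cut, lmean, rmean)
    _, cut, lmean, rmean = best
    return lambda x: lmean if x <= cut else rmean

def boost(xs, ys, rounds=20, lr=0.5):
    """Toy gradient tree boosting for squared loss: each round fits a
    stump to the current residuals and adds it, scaled by `lr`."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

model = boost([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
# The ensemble's predictions approach the targets as rounds increase.
```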