Thesis (PDF available)

Decision Tree and Decision Forest Algorithms: On Improving Accuracy, Efficiency and Knowledge Discovery

Abstract

The “Digital Revolution” has blessed human civilization with an enormous amount of data. The challenge of automatically analyzing these data has heightened the need for sophisticated data mining methods. Within data mining, classification plays a very important role, both in predicting the “class” of an unseen instance and in discovering patterns in data. In today’s data-driven world, classification is applied in our day-to-day activities, so improving prediction accuracy and simplifying knowledge discovery from classifiers are of paramount importance. The decision tree is one of the most popular classifiers: it can predict unseen instances with high accuracy and generate human-interpretable knowledge. Moreover, owing to their unstable nature, decision trees are often used as base classifiers to form ensembles of decision trees. An ensemble of decision trees, popularly known as a decision forest, is generally more robust to noise and more accurate than a single decision tree. Many decision tree and decision forest building algorithms have been proposed in the literature; however, the existing algorithms have various limitations that leave room for further improvement. Furthermore, decision forests are more memory-intensive and less knowledge-extractable than a single decision tree. Hence, in this thesis we propose several novel algorithms for improving the accuracy of decision trees and decision forests, then propose a technique to reduce the size of decision forests while retaining or increasing ensemble accuracy, and finally propose a framework for effective knowledge discovery from decision forests. To validate the proposed algorithms and techniques, we carry out extensive experiments on several publicly available data sets. The experimental results indicate that the proposed algorithms and techniques clearly improve the current state of the art in the applicable areas.
... The use of ensembles in the domain of "classification" is an active area of research [1,2,3,4,5]. An ensemble of classifiers is found to be effective when it is generated from unstable classifiers such as decision trees [6]. ...
... A decision forest (in short, forest) is an ensemble of decision trees (in short, trees) in which each individual tree acts as a base classifier [6]. A forest is said to impart better generalization ability and to be more robust to noise, as the trees in a forest are less tightly coupled to the training data set through the selection of appropriate subsets of attributes and/or records [5]. ...
... On the other hand, due to their unstable nature, trees become structurally different when they are generated from differently perturbed training data sets. Structurally different trees can increase uncorrelated classification errors [14] and hence can increase the ensemble accuracy of the forest [15,5]. Therefore, the training data set needs to be perturbed differently for each tree to make the trees diverse in terms of classification errors. ...
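The perturbation idea in the excerpt above is commonly realized through bootstrap resampling, the mechanism bagging-style forests use to give each tree a differently perturbed training set. A minimal sketch (data and names are illustrative, not any specific algorithm from the thesis):

```python
import random

def bootstrap_sample(records, seed=None):
    """Draw len(records) records with replacement. Each tree in a
    bagged forest trains on a different such sample, so the trees
    see differently perturbed data and grow structurally diverse."""
    rng = random.Random(seed)
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]

data = list(range(10))
# One distinct bootstrap sample per tree in the forest.
per_tree_samples = [bootstrap_sample(data, seed=s) for s in range(3)]
```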
Chapter
Full-text available
One of the most important requirements for a decision forest to secure better ensemble accuracy is generating decision trees as base classifiers that are simultaneously accurate and diverse. Most decision forest algorithms in the literature exploit one or both of the two major sources of diversity, subspacing and sampling, to induce diversity among the decision trees. Recently, a decision forest algorithm named “Stochastic Forest” introduced stochastic selection of splitting attributes in the process of inducing decision trees and reported promising results. In this paper, we explore the worthiness of stochastic selection of splitting attributes as a source of diversity in comparison with the two existing major sources. We carry out experiments on twenty-five popular data sets that are publicly available from the UCI Machine Learning Repository. The experimental analysis demonstrates the worthiness of stochastic selection of splitting attributes as an effective source of diversity.
... For Random Forest, the size of the attribute subspace f is |f| = int(log2(m)) + 1, where m is the number of non-class attributes. This principally dictates the balance between individual tree accuracy and diversity among the trees [4,6,1]. It has been established that |f| = int(log2(m)) + 1 reacts unevenly for low- and high-dimensional data sets [4,6,1]. ...
... This principally dictates the balance between individual tree accuracy and diversity among the trees [4,6,1]. It has been established that |f| = int(log2(m)) + 1 reacts unevenly for low- and high-dimensional data sets [4,6,1]. For example, for a low-dimensional data set with 4 attributes, splitting attributes are determined from randomly selected subspaces of 3 attributes (int(log2(4)) + 1 = 3). ...
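The default subspace size discussed in these excerpts is easy to compute directly; a small sketch illustrating why it reacts unevenly at the two ends of the dimensionality scale:

```python
import math

def subspace_size(m):
    """Random Forest's default subspace size, int(log2(m)) + 1,
    where m is the number of non-class attributes."""
    return int(math.log2(m)) + 1

# Low-dimensional case: 3 of 4 attributes are split candidates at
# every node, leaving little room for diversity among the trees.
print(subspace_size(4))     # → 3
# High-dimensional case: only 10 of 1000 attributes are candidates,
# which can depress individual tree accuracy.
print(subspace_size(1000))  # → 10
```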
... Since the weights are assigned randomly from the same uniform distribution, all attributes, good or bad, have the same probability of acquiring a low or high weight value. Hence, as before, classification capacities retain their influence in determining the ultimate merit values of the attributes [6,1]. As a solution, a parameter p was introduced as an exponent on the random weights; four different p values (p = 1, 2, 3, and 4) were explored in RFW [37]. ...
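The role of the exponent p can be sketched as follows. The attribute names and merit values are hypothetical, and RFW's exact merit computation may differ in detail; this only illustrates how raising a uniform weight to a power p dampens the raw classification capacity:

```python
import random

def rfw_merits(merits, p=2, seed=0):
    """Multiply each attribute's raw merit (e.g. gain ratio) by a
    uniform random weight raised to the power p. A larger p spreads
    the weights further apart, so the random component counteracts
    the raw classification capacity more strongly."""
    rng = random.Random(seed)
    return {a: merit * rng.random() ** p for a, merit in merits.items()}

merits = {"outlook": 0.25, "humidity": 0.15, "windy": 0.05}
damped = rfw_merits(merits, p=4)
# With the same seed, p = 4 never yields a larger merit than p = 1,
# since w**4 <= w for any w in [0, 1).
```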
Article
Full-text available
The foremost requirement for a decision forest to achieve better ensemble accuracy is building a set of accurate and diverse individual decision trees as base classifiers. Existing decision forest algorithms mainly differ from each other on how they induce diversity among the decision trees. At the same time, most of the drawbacks of existing algorithms originate from their induction processes of diversity. In this paper, we propose a new decision forest algorithm that is more balanced through effective synchronization between different sources of diversity. The proposed algorithm is balanced theoretically and empirically. We carried out experiments on 25 well-known data sets that are publicly available from the UCI Machine Learning Repository, to perform an extensive empirical evaluation. The experimental results indicate that the proposed algorithm has the best average ensemble accuracy rank of 1.8 compared to its closest competitor at 3.5. Using the Friedman and Bonferroni-Dunn tests, we also show that such an improvement is indeed statistically significant. In addition, the proposed algorithm is found to be competitive in terms of complexity and other relevant parameters.
... According to Hunt's algorithm, a decision tree is induced recursively from the training data set, i.e., the data set in which every record is labeled with a class value. The induction process starts by evaluating each non-class attribute as a candidate for dividing the training data set D into a disjoint set of horizontal segments/partitions [11,12,16]. If the non-class attribute Ai is categorical with k different domain values i.e. ...
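The recursive partitioning step described in the excerpt above can be sketched for categorical attributes as follows. This is a toy illustration: a real implementation would choose the splitting attribute by a merit measure such as gain ratio rather than taking the first one.

```python
from collections import Counter

def hunts_tree(records, attributes):
    """Minimal sketch of Hunt's recursive induction for categorical
    attributes. `records` are (attribute-dict, class) pairs; each
    split partitions the data into one segment per domain value."""
    classes = [c for _, c in records]
    majority = Counter(classes).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain.
    if len(set(classes)) == 1 or not attributes:
        return majority
    A = attributes[0]  # a real algorithm selects A by gain ratio etc.
    partitions = {}
    for rec, c in records:
        partitions.setdefault(rec[A], []).append((rec, c))
    return {A: {v: hunts_tree(part, attributes[1:])
                for v, part in partitions.items()}}

data = [({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "yes"),
        ({"outlook": "rain"}, "yes")]
tree = hunts_tree(data, ["outlook"])
# → {'outlook': {'sunny': 'no', 'rain': 'yes'}}
```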
... We generate 100 trees for every decision forest, since this number is considered large enough to ensure convergence of the ensemble effect [22,32]. All the results reported in this paper are obtained using 10-fold Cross-Validation (10-CV) [33,16] on every data set. The best results are distinguished in bold face. ...
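The 10-CV protocol mentioned above partitions the records so that each one is tested exactly once; a minimal index-level sketch:

```python
def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross
    validation: the n records are split into k disjoint folds and
    each fold serves once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(100, k=10))
# 10 splits; every record appears in exactly one test fold.
```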
Chapter
Full-text available
Random Forest is one of the most popular decision forest building algorithms that use decision trees as the base classifier. The decision trees for Random Forest are formed from the records of a training data set, which makes them almost equally biased towards that training data set. In reality, the testing data set can differ significantly from the training data set. Thus, to reduce the bias of the decision trees, and hence of Random Forest, we introduce a random weight for each decision tree. We present experimental results on four widely used data sets from the UCI Machine Learning Repository. The experimental results indicate that the proposed technique can reduce the bias of Random Forest, making it less sensitive to noisy data. Keywords: Bias, Decision tree, Random Forest
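The per-tree random weight idea can be illustrated with a weighted majority vote. This is a sketch of the general idea only; the paper's exact weighting scheme may differ:

```python
import random
from collections import defaultdict

def weighted_vote(tree_predictions, seed=0):
    """Assign each tree a random weight and take a weighted majority
    vote over the trees' predictions, so that no single tree is
    fully trusted to mirror the training data."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in tree_predictions]
    tally = defaultdict(float)
    for w, pred in zip(weights, tree_predictions):
        tally[pred] += w
    return max(tally, key=tally.get)

# Three trees vote on one unseen instance.
print(weighted_vote(["yes", "yes", "no"]))
```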
... Though decision trees have proved efficient at classifying acoustic voice signal data, decision forests are preferable over traditional decision trees because they are more precise and accurate and handle overfitting and underfitting smartly through the decision tree ensemble. Random Forest, a random tree ensemble frequently used for Parkinson's detection [23,24], relies on the bagging approach to ensembling, which makes the decision forest versatile [25]. However, Random Forest [26] is not the only decision forest in the literature. ...
... Random Forest undertakes bagging as the training method on a random subspace. For a feature set f, the number of features chosen randomly for training is |f| = int(log2(m)) + 1, where m is the number of inputs [25]. The benefit of Random Forest is its versatility to varying data sizes. ...
Article
Full-text available
Biomedical engineers prefer decision forests over traditional decision trees to design state-of-the-art Parkinson’s Detection Systems (PDS) on massive acoustic signal data. However, a challenge researchers face with decision forests is identifying the minimum number of decision trees required to achieve maximum detection accuracy with the lowest error rate. This article examines two recent decision forest algorithms, Systematically Developed Forest (SysFor) and Decision Forest by Penalizing Attributes (ForestPA), along with the popular Random Forest, to design three distinct Parkinson’s detection schemes with an optimal number of decision trees. The proposed approach uses the minimum number of decision trees that achieves maximum detection accuracy. The training and testing samples and the density of trees in the forest are kept dynamic and incremental to obtain decision forests with maximum capability for detecting Parkinson’s Disease (PD). Incremental tree densities with dynamic training and testing of decision forests proved to be a better approach for detecting PD. The proposed approaches are examined alongside other state-of-the-art classifiers, including modern deep learning techniques, to observe their detection capability. The article also provides a guideline for generating an ideal training and testing split of two modern acoustic data sets of Parkinson’s and control subjects donated by the Department of Neurology in Cerrahpaşa, Istanbul and the Departamento de Matemáticas, Universidad de Extremadura, Cáceres, Spain. Among the three proposed detection schemes, Forest by Penalizing Attributes (ForestPA) proved to be a promising Parkinson’s disease detector, requiring only a small number of decision trees in the forest to score the highest detection accuracy of 94.12% to 95.00%.
... Generally, an essential aim of forest algorithms is to improve the Ensemble Accuracy (EA) [1]. As such, every contending forest algorithm described in this paper treats increasing EA as its principal performance metric. ...
Chapter
Full-text available
Decision forests have attracted the academic community’s interest mainly due to their simplicity and transparency. This paper proposes two novel decision forest building techniques, called Maximal Information Coefficient Forest (MICF) and Pearson’s Correlation Coefficient Forest (PCCF). The proposed algorithms use Pearson’s Correlation Coefficient (PCC) and the Maximal Information Coefficient (MIC) as extra measures of each feature’s classification capacity. With these measures, we improve the selection of the most suitable feature at each splitting node, the feature with the greatest Gain Ratio. We conduct experiments on 12 data sets available from the publicly accessible UCI Machine Learning Repository. Our experimental results indicate that the proposed methods have the best average ensemble accuracy ranks of 1.3 (for MICF) and 3.0 (for PCCF), compared to their closest competitor, Random Forest (RF), which has an average rank of 4.3. Additionally, the results of the Friedman and Bonferroni-Dunn tests indicate a statistically significant improvement.
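As a rough illustration of using PCC as an extra merit score (a sketch under our own assumptions, not the authors' exact procedure), the correlation between a numeric feature and a 0/1-encoded class can be computed directly:

```python
import math

def pearson(xs, ys):
    """Pearson's Correlation Coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

labels = [0, 0, 1, 1]               # 0/1-encoded class
aligned = [0.0, 0.0, 1.0, 1.0]      # feature tracking the class exactly
print(pearson(aligned, labels))     # → 1.0
# A feature with |PCC| near 1 would be favoured at the splitting node.
```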
... The predictive model is constructed by recursively partitioning the data set into subsets and evaluating the information gain before and after splitting, in order to choose the best split and produce a tree with a minimum error rate. Figure 3-25 illustrates how CART is constructed, and the details of the procedure are explained in the following steps (Adnan, 2017; Tan et al., 2006). Step 2: Choosing the best split cut point. Each candidate cut point splits the data set into two child nodes, as CART performs binary splitting. ...
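The binary cut-point search can be sketched as follows. This uses Gini impurity as the split criterion, a common choice for CART; the excerpt above speaks of information gain, but the search loop over candidate cut points is the same.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Try each candidate cut point and keep the one with the lowest
    weighted impurity of the two resulting child nodes."""
    n = len(values)
    best = (None, float("inf"))
    for cut in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= cut]
        right = [l for v, l in zip(values, labels) if v > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cut, score)
    return best

values = [1.0, 2.0, 3.0, 4.0]
labels = ["no", "no", "yes", "yes"]
print(best_cut(values, labels))  # → (2.0, 0.0): a pure binary split
```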
Thesis
Full-text available
Lameness can be described as painful, erratic movement relating to the locomotor system that results in the animal deviating from its normal gait or posture. Lameness is considered one of the major health and welfare concerns for the sheep industry in the UK; it leads to a substantial economic problem and causes a reduction in overall farm productivity. According to a 2013 report by ADAS entitled ‘Economic Impact of Health and Welfare Issues in Beef, Cattle and Sheep in England’, each lame ewe costs £89.80 due to the decline in body condition, lambing percentage, growth rate, and fertility. Thus, early lameness detection eliminates the negative impact of lameness and increases the chance of a favourable outcome from treatment. The development of wearable sensor technologies enables the idea of remotely monitoring the changes in animal behaviours or movements that relate to lameness. The aim of this thesis was to evaluate the feasibility and accessibility of a proposed data mining approach (SLDM) for detecting the early signs of lameness in sheep by analysing data retrieved from a wearable motion sensor mounted within a sheep’s neck collar, investigating the most cost-effective factors that contribute to lameness detection across the whole data mining process, including: sensor sampling rate, segmentation method, window size, extracted features, feature selection method, and applicable classification algorithm. Three classes are recognised for sheep while walking throughout the data collection process (sound, mild, and severe lameness). The sheep data were collected using three different sensor applications (Sheep Tracker, SensoDuino, SensorLog), which collect sheep movement data at sampling rates of 10, 5, and 4 Hz. Various sensing data were retrieved in the X, Y, and Z dimensions; however, only accelerometer, gyroscope, and orientation readings are considered in the current study.
Four sheep data sets are aggregated, comprising 31, 10, 18, and 7 sheep respectively. The work in this thesis evaluates the performance of ensemble classifiers (Bagging, Boosting, or RUSBoosting) using three different validation methods (5-fold, 0.3 hold-out, and the proposed ‘Single Sheep Splitting’), compared across three sampling rates (10, 5, 4 Hz), two segmentation approaches (FNSW and FOSW), three feature selection methods (ReliefF, GA, and RF), and three window sizes (10, 7, 5 s). Promising lameness prediction accuracies are achieved over most of the combinations (3 sampling rates, 2 segmentation methods, 3 window sizes, 183 extracted features, 3 feature selection methods, 3 ensemble classification models, and 3 model validation methods). The highest accuracy, however, is achieved by the Bagging ensemble classifier: 88.92%, with F-scores of 87.7%, 91.1%, and 88.2% for the sound, mildly lame, and severely lame walking classes, respectively. These results are obtained using 5-fold cross-validation over a 10 s window for sheep data collected at a 10 Hz sampling rate, using only the accelerometer hardware readings and calculated orientation readings. The number of features selected is 46, optimised by GA using a CHAID tree as the fitness function. Conversely, the lowest prediction accuracy of 56.25%, with F-scores of 63.4% (sound), 51.9% (mildly lame), and 48.8% (severely lame), is recorded when the RUSBoosting ensemble is applied using 5-fold cross-validation over a 10 s window for the data set collected at the 4 Hz sampling rate. The major research findings therefore recommend a 10 Hz sampling rate as adequate for collecting sheep movements, while the best segmentation method is FOSW, in which 20% of data points are shared between two successive windows. The preferable number of data points (sheep movements) to be pre-processed is around 100, obtained with a 10 s or 7 s window size.
Additionally, the 20 features selected by RF out of 183 achieve good accuracy compared to the whole set of extracted features. Although the GA feature selection method has a slower execution time than RF, competitive prediction accuracy is achieved when the features selected by GA are fed to the classifier. Finally, the acceleration sensor data alone are capable of supporting the decision about lame sheep, so no extra hardware sensor such as a gyroscope is required for decision making; moreover, the orientation features, which contribute most to lameness detection, can be derived directly from the accelerometer readings. Since the most cost-effective factors are identified in this research, the practice could in the meantime be applied by farmers, stakeholders, and manufacturers, as no sensor for detecting lame sheep has been developed yet. The multidisciplinary nature of the research therefore opens diverse paths towards further studies developing various data mining approaches and practical sensor kits to detect the early signs of sheep lameness, for better farm productivity and the prosperity of the sheep industry in the UK.
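The windowing parameters the thesis settles on (10 s windows at 10 Hz, FOSW with 20% overlap) can be sketched as follows. This is a minimal illustration of the segmentation step only, not the thesis's actual pipeline:

```python
def fosw_windows(samples, window, overlap=0.2):
    """Fixed-size Overlapping Sliding Window (FOSW) sketch: successive
    windows share `overlap` of their data points; FNSW is the special
    case overlap = 0. At 10 Hz, a 10 s window holds 100 readings."""
    step = int(window * (1 - overlap))
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]

signal = list(range(300))   # e.g. 30 s of 10 Hz accelerometer readings
wins = fosw_windows(signal, window=100, overlap=0.2)
# Successive windows share 20 of their 100 data points.
```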
Article
Full-text available
Whether classifying data in the field of neural networks or in biometrics applications such as handwriting classification or iris detection, arguably the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour classifier, in which classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. k-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the classifier's outputs under various distance measures, which may help in gauging the response of the classifier for a desired application; it also discusses computational issues in identifying nearest neighbours and mechanisms for reducing the dimension of the data.
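A minimal k-NN sketch showing how the choice of distance plugs in (the data and names here are hypothetical):

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3, dist=None):
    """Plain k-NN: find the k training examples nearest to the query
    under the chosen distance and return the majority class among
    them. Swapping `dist` changes which neighbours, and hence which
    class, win."""
    if dist is None:
        dist = math.dist                      # Euclidean by default
    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(c for _, c in neighbours).most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b"),
         ((5.1, 4.9), "b"), ((0.2, 0.1), "a")]
print(knn_classify((0.05, 0.05), train))  # → 'a'
```

Passing e.g. `dist=lambda a, b: sum(abs(x - y) for x, y in zip(a, b))` switches to Manhattan distance, one of the alternative metrics the paper's comparison explores.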
Article
Full-text available
Data mining is the computer-assisted process of digging through and analysing enormous sets of data and extracting their meaning. Data mining tools predict behaviours and future trends, allowing businesses to make proactive decisions, and can answer questions that were traditionally very time-consuming to resolve. They can therefore be used to predict meteorological data, i.e., for weather prediction. Weather prediction is a vital application in meteorology and has been one of the most scientifically and technologically challenging problems across the world in the last century. Predicting the weather is essential to help prepare for the best and the worst of the climate. Many weather prediction tasks, such as rainfall prediction, thunderstorm prediction, and predicting cloud conditions, are major challenges for atmospheric research. This paper reviews data mining techniques for weather prediction and studies the benefits of using them. It surveys the available literature on algorithms employed by different researchers utilising various data mining techniques for weather prediction. The work done by various researchers in this field is reviewed and compared in tabular form. For weather prediction, decision trees and k-means clustering prove to be good, with higher prediction accuracy than other data mining techniques.
Book
This brief provides methods for harnessing Twitter data to discover solutions to complex inquiries. It introduces the process of collecting data through Twitter's APIs and offers strategies for curating large data sets. The text illustrates Twitter data with real-world examples, presents the challenges and complexities of building visual analytic tools, and describes the best strategies to address these issues. Examples demonstrate how powerful measures can be computed from various Twitter data sources. Due to its openness in sharing data, Twitter is a prime example of social media in which researchers can verify their hypotheses and practitioners can mine interesting patterns and build their own applications. This brief is designed to provide researchers, practitioners, project managers, and graduate students with an entry point to jump-start their Twitter endeavours. It also serves as a convenient reference for readers seasoned in Twitter data analysis.
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
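XGBoost itself is a large engineered system, but the additive idea at the heart of tree boosting can be sketched with one-dimensional regression stumps fitted to residuals. This is a toy illustration of gradient boosting for squared loss, not XGBoost's actual algorithm (which adds regularization, sparsity-aware splits, quantile sketches, and systems-level optimizations):

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split minimising squared error when
    predicting the mean residual on each side."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, cut, lmean, rmean)
    _, cut, lmean, rmean = best
    return lambda x: lmean if x <= cut else rmean

def boost(xs, ys, rounds=20, lr=0.5):
    """Toy gradient tree boosting for squared loss: each round fits a
    stump to the current residuals and adds it, scaled by `lr`."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

model = boost([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
# The ensemble's predictions approach the targets as rounds increase.
```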