Book

C4.5: Programs for Machine Learning

Authors:
  • rulequest research
... Information gain (IG), Fisher score (FS), and chi-square (CS) are only a few of the well-known RVS approaches (Ross, 1993;Okeh and Oyeka, 2013;Guyon and Elisseeff, 2003). The SIS (Sure Independent Screening) strategy, as proposed by Ross (1993), was designed to guarantee that all significant variables are retained after variable screening, with a probability that approaches one. The process of variable (gene) selection involves the identification and selection of a limited set of genes from a gene dataset characterized by high dimensionality. ...
... Information gain (IG) is used to measure the amount of information that a variable provides about a class (Ross, 1993). Information gain is frequently used in ranking variable selection (RVS) due to its computational efficiency and simple interpretability. ...
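As an illustration of the quantity described in the snippet above, the following is a minimal sketch of information gain for a discrete variable: the entropy of the class labels minus the weighted entropy that remains after grouping by the variable. The function names and the toy data are hypothetical and do not come from the cited study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: a discretised expression level, an unrelated variable,
# and a class label for four samples.
expression = ["high", "high", "low", "low"]
noise = ["a", "b", "a", "b"]
label = ["tumour", "tumour", "normal", "normal"]

ig_informative = information_gain(expression, label)   # separates the classes perfectly
ig_uninformative = information_gain(noise, label)      # tells us nothing about the class
print(ig_informative, ig_uninformative)
```

Ranking variables by this score and keeping the top few is the filter-style use that the snippet refers to.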
Article
Full-text available
The technological advancements in recent years have led to high-dimensional data becoming a prominent focus of research in genetics, bioinformatics, and biostatistics. This study develops a two-step approach using regularized logistic regression for cancer classification and gene selection with high-dimensional gene expression data. The method combines sure independence screening (SIS) using filtering methods for initial gene selection, followed by an adaptive LASSO (ALASSO) technique with a proposed weighting scheme. The model was applied to three cancer gene expression datasets: colon, leukemia, and prostate. The ALASSO with the Fisher score filter demonstrated the best performance, achieving classification accuracy up to 98.2% and AUC of 96.7% for leukemia data. The top genes selected were biologically relevant to their cancer types. This study demonstrates the promise of integrating SIS filtering and weighted ALASSO for informative gene selection and accurate cancer prediction from high-dimensional gene expression data.
... Learning -The learning algorithm processes the input file and classifies the requirements based on the tracechecking verdict (satisfied or violated). We run J48 [26], a widely used ML algorithm [27] that generates decision trees that classify training data. Figure 5 illustrates an example of a resulting decision tree where ϕ( and ) is the root node of the tree since splitting the ϕ( and ) operator renders a bigger information gain than a split in ϕ( 50 ), ϕ ( 20 ). ...
... Leaf nodes (True, False) are labeled with the frequency of whether the selected term results in the verdict. For the diagnostic generator component ( 3 ), we used the Java implementation of the C4.5 algorithm [26] available in Weka [28]. We selected the C4.5 algorithm, since it is a widely used learning algorithm for decision trees [27], and Weka, since it is a well-known library of machine learning algorithms [29]. ...
Preprint
Cyber-physical systems (CPS) development requires verifying whether system behaviors violate their requirements. This analysis often considers system behaviors expressed by execution traces and requirements expressed by signal-based temporal properties. When an execution trace violates a requirement, engineers need to solve the trace diagnostic problem: They need to understand the cause of the breach. Automated trace diagnostic techniques aim to support engineers in the trace diagnostic activity. This paper proposes search-based trace-diagnostic (SBTD), a novel trace-diagnostic technique for CPS requirements. Unlike existing techniques, SBTD relies on evolutionary search. SBTD starts from a set of candidate diagnoses, applies an evolutionary algorithm iteratively to generate new candidate diagnoses (via mutation, recombination, and selection), and uses a fitness function to determine the qualities of these solutions. Then, a diagnostic generator step is performed to explain the cause of the trace violation. We implemented Diagnosis, an SBTD tool for signal-based temporal logic requirements expressed using the Hybrid Logic of Signals (HLS). We evaluated Diagnosis by performing 34 experiments for 17 trace-requirements combinations leading to a property violation and by assessing the effectiveness of SBTD in producing informative diagnoses and its efficiency in generating them on a time basis. Our results confirm that Diagnosis can produce informative diagnoses in practical time for most of our experiments (33 out of 34).
... Quinlan [18] developed ID3, one of the first notable decision tree algorithms, in 1986. Furthermore, Quinlan [19] enhanced the ID3, introducing the C4.5 decision tree in 1993. These developments and integration of decision trees into ensemble methods like random forests and boosting algorithms have solidified their place as fundamental algorithms in machine learning. ...
... Quinlan [19] proposed C4.5 in 1993 as an extension of the ID3 algorithm; it is designed to handle both continuous and discrete attributes. It introduces the concept of information gain ratio, described in Equation 4, to select the best attribute to split the dataset at each node, aiming to overcome the bias towards attributes with more levels found in the original Information Gain criterion used by ID3. ...
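The Equation 4 mentioned in the snippet is not reproduced on this page. As a hedged sketch of the standard definition, gain ratio divides the information gain of a split by the split information, that is, the entropy of the partition the attribute induces. The helper names and toy data below are hypothetical, and the sketch omits further refinements found in C4.5 itself.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(values)  # entropy of the partition sizes
    return gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
binary_attr = ["a", "a", "b", "b"]        # two levels, perfectly predictive
id_like_attr = ["r1", "r2", "r3", "r4"]   # one level per row, also "perfect"

ratio_binary = gain_ratio(binary_attr, labels)
ratio_id_like = gain_ratio(id_like_attr, labels)
print(ratio_binary, ratio_id_like)
```

Both attributes have the same information gain on this toy data, but the many-level attribute is penalised by its larger split information, which is the bias correction the snippet describes.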
Article
Full-text available
Machine learning (ML) has been instrumental in solving complex problems and significantly advancing different areas of our lives. Decision tree-based methods have gained significant popularity among the diverse range of ML algorithms due to their simplicity and interpretability. This paper presents a comprehensive overview of decision trees, including the core concepts, algorithms, applications, their early development to the recent high-performing ensemble algorithms and their mathematical and algorithmic representations, which are lacking in the literature and will be beneficial to ML researchers and industry experts. Some of the algorithms include classification and regression tree (CART), Iterative Dichotomiser 3 (ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), conditional inference trees, and other tree-based ensemble algorithms, such as random forest, gradient-boosted decision trees, and rotation forest. Their utilisation in recent literature is also discussed, focusing on applications in medical diagnosis and fraud detection.
... In general, the single decision tree is prone to overfitting and has little generalizability. When forming a decision tree, small changes in learning patterns can cause fundamental changes in the structure of that tree [42]. RF can learn complex patterns and consider the nonlinear relationship between explanatory variables and dependent variables. ...
... All trees have a certain depth, and at each node the feature used for the division is randomly selected from the set of features, and the division or classification is done based on it. Due to the use of several sets of samples, this method is not sensitive to the existence of outliers and missing data [42]. In order to create a regression tree, recursive segmentation and multiple regressions are used. ...
Article
Full-text available
In recent years, the application of machine learning methods in the prediction of hydrological processes such as precipitation has been widely considered. These methods can analyze large volumes of data and detect the existing trends and patterns. Therefore, in the present study, machine learning methods, including random forests (RF), Kstar algorithm and Gaussian process regression (GPR), were used to predict the precipitation of Sindh River basin in India during period of 1901 to 2020. In the next step, three distinct input scenarios include (i) using monthly precipitation data and considering the memory of time series up to 5 months delay, (ii) adding periodic term to the first scenario inputs and (iii) decomposing the data using the Daubechies 4 wavelet function and creating hybrid wavelet-learning machine (W-ML) models, were prepared and introduced to the models. The performance of each method was evaluated using the root mean square error (RMSE), mean absolute error (MAE), Kling–Gupta efficiency score (KGE) and Willmott index (WI). The results showed that single models with the first scenario inputs (without taking into account the periodicity of the data) did not have good accuracy, but by adding the periodicity, the performance of these models was significantly improved, and the average value of KGE index for all studied stations increased from 0.466 to 0.672. It was also found that the GPR model for all stations could not have good performance and RF and Kstar models are the most appropriate methods for predicting precipitation in the Sindh River basin, respectively. With the application of the third scenario and the development of W-ML hybrid models, the accuracy of precipitation forecasting was significantly improved, especially the maximum precipitation values were estimated with higher accuracy than standalone models.
... The decision tree (DT) is a well-known non-parametric supervised learning technique [59]. It is used both for classification and regression [60]. ...
... It is used both for classification and regression [60]. The decision tree algorithms: Iterative Dichotomiser 3 (ID3) [61], C4.5 [59], and Classification and Regression Trees (CART) [60] are notable examples. The decision tree classifies instances by descending the tree to its leaf nodes. ...
Thesis
Full-text available
The fast growth of internet applications and services produces a massive amount of information daily. With that rapid development of information, it becomes a challenging task for users to find content that satisfies their needs when they use online applications. Therefore, Recommendation Systems (RS) have become necessary for users. RS is a filtering technique that tries to reduce the available selections for users by finding the relevant items that satisfy their desires. Deep learning algorithms have significantly succeeded in several fields, including RS. Recently, many deep learning-based RSs have been proposed; they involve all the users in datasets to extract the latent representation of input data to be used later for predicting the missing rates. Users have diverse preferences, making it challenging to create a single model that caters to all of them. This diversity results in recommendations that need to reflect individual user preferences accurately; bridging this gap is the primary objective of this work. This dissertation proposed a new Optimized Clustering-based Denoising Autoencoder model (OCB-DAE), which trains multiple models based on users' preferences. The proposed model combined the Artificial Fish Swarm Algorithm (AFSA) with K-means algorithm to determine the best initial centroids for clustering the users based on their similarities, and each cluster trains a Denoising Autoencoder model (DAE) to ensure that users with similar interests train each model. OCB-DAE is applied to movies and food datasets to generate recommendations by utilizing the items features as side information. The proposed model was trained and tested over MovieLens 100K (ML-100K), MovieLens 1M (ML-1M), and Food.com datasets. The Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics were used to evaluate the performance of the proposed model. In ML-100K dataset, the RMSE and MAE were 0.5589 and 0.5050, respectively, whereas in ML-1M dataset, the RMSE and MAE were 0.6742 and 0.5894, respectively. Regarding the Food dataset, the RMSE and MAE were 0.1922 and 0.1764, respectively. The proposed model outperformed the related works in terms of RMSE by 29.7% and 14.4% using ML-100K and ML-1M datasets, respectively. In the food dataset, the proposed model outperformed non-clustered model in terms of RMSE by 24.3%. The results showed that training multiple models based on users' preferences reduced prediction errors and improved the recommendation systems' performance.
... Our empirical study using 10 datasets compares the splitting criteria of popular decision tree algorithms, including the Gini index (CART) [25], entropy (ID3) [26], gain ratio (C4.5) [27], and chi-square (CHAID) [28]. Splitting criteria are the methods utilized to determine the optimal split in the branching process of the decision tree, which influences the shaping of the tree structure. ...
Article
Full-text available
Objectives: This paper proposes a novel stability metric for decision trees that does not rely on the elusive notion of tree similarity. Existing stability metrics have been constructed in a pairwise fashion to assess the tree similarity between two decision trees. However, quantifying the structural similarities between decision trees is inherently elusive. Conventional stability metrics are simply relying on partial information such as the number of nodes and the depth of the tree, which do not adequately capture structural similarities. Methods: We evaluate the stability based on the computational burden required to generate a stable tree. First, we generate a stable tree using the novel adaptive node-level stabilization method, which determines the most frequently selected predictor during the bootstrapping iterations of a decision tree branching process at each node. Second, the stability is measured based on the number of bootstraps required to achieve the stable tree. Findings: Using the proposed stability metric, we compare the stability of four popular decision tree splitting criteria: Gini index, entropy, gain ratio, and chi-square. In an empirical study across ten datasets, the gain ratio is the most stable splitting criterion among the four popular criteria. Additionally, a case study demonstrates that applying the proposed method to the classification and regression tree (CART) algorithm generates a more stable tree compared to the one produced by the original CART algorithm. Novelty: We propose a stability metric for decision trees without relying on measuring pairwise tree similarity. This paper provides a stability comparison of four popular decision tree splitting criteria, delivering practical insights into their reliability. The adaptive node-level stabilization method can be applied across various decision tree algorithms, enhancing tree stability and reliability in scenarios with updating data.
... Fernandez et al. [49] proposed the Ensemble classifier from a Feature and Instance Selection using a Multi-Objective Evolutionary Algorithm (EFIS-MOEA) for imbalanced data classification. Their approach embeds the C4.5 decision tree [50] in a wrapper procedure, applying the NSGA-II multiobjective optimization algorithm [51]. They use two genes for the chromosome representation, one for feature selection and another for instance selection. ...
Article
Full-text available
This paper introduces a novel algorithm called Ant-based Feature and Instance Selection. This new algorithm addresses the simultaneous selection of instances and features for mixed, incomplete, and imbalanced data in the context of lazy instance-based classifiers. The proposed algorithm uses a hybrid selection strategy based on metaheuristic procedures and Rough Sets. The Ant-based Feature and Instance Selection algorithm combines Ant Colony Optimization and Generic Extended Rough Sets for Mixed and Incomplete Information Systems. It has five stages: reduct computation, metadata computation, intelligent instance preprocessing, submatrices creation, and fusion. To test the performance of the proposed algorithm, we used 25 datasets from the Machine Learning repository of the University of California at Irvine. All these datasets are imbalanced, with multiple classes and represent real-world classification problems. The number of classes ranges between three and eight classes. Most of them also have mixed or incomplete descriptions. We used several performance measures and computed the Instance Retention ratio and the Feature Retention ratio. To determine the existence or not of significant differences in the performance of the compared algorithms, we used non-parametric hypothesis testing. The statistical analysis results confirm the high quality of the proposed algorithm for selecting features and instances in multiclass imbalanced data.
... The C4.5 algorithm is able to handle categorical and numerical data. For categorical data, the C4.5 algorithm selects the best attribute using the highest gain value, while for numerical data it first converts the values into two categories using a certain threshold (Quinlan, 1993). The first stage of building a decision tree using the C4.5 algorithm is to choose the attribute with the highest gain as the root. ...
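The treatment of numerical data described in the snippet above can be sketched as a search for the binary split `value <= threshold` that yields the highest information gain. This is an illustrative simplification, not Quinlan's implementation: it takes midpoints between consecutive distinct values as candidate thresholds (an assumption), and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with the highest gain."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, total):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        threshold = (lo + hi) / 2  # assumed candidate: midpoint of neighbours
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (base
                - (len(left) / total) * entropy(left)
                - (len(right) / total) * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: a numeric attribute and a binary class.
temperature = [64, 65, 68, 69, 70, 71]
play = ["yes", "yes", "yes", "no", "no", "no"]

threshold, gain = best_threshold(temperature, play)
print(threshold, gain)
```

Once the threshold is fixed, the numeric attribute behaves like a two-category attribute and can be compared with categorical attributes on the same gain scale.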
... Random forest is an ensemble-based classification method that combines multiple decision trees. A decision tree can be written as a set of if-then rules, e.g., using the C4.5 Rules algorithm (Quinlan, 1993). Pruning methods can be applied to reduce the set of rules. ...
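The idea that a decision tree can be written as a set of if-then rules can be sketched by emitting one rule per root-to-leaf path. The nested-dictionary tree format and the function name below are hypothetical. The sketch covers only the path enumeration; the rule simplification and pruning that the snippet mentions are not implemented here.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path of the tree."""
    if not isinstance(node, dict):  # a leaf holds a class label
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules


# Hypothetical tree: internal nodes are dicts, leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": "yes",
    },
}

rules = tree_to_rules(tree)
for conds, label in rules:
    antecedent = " AND ".join(f"{attr} = {val}" for attr, val in conds)
    print(f"IF {antecedent} THEN class = {label}")
```

Each rule's antecedent is the conjunction of tests along one path, so the rule set is equivalent to the tree before any pruning is applied.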
Article
Full-text available
Imbalanced graph node classification is a highly relevant and challenging problem in many real-world applications. The inherent data scarcity, a central characteristic of this task, substantially limits the performance of neural classification models driven solely by data. Given the limited instances of relevant nodes and complex graph structures, current methods fail to capture the distinct characteristics of node attributes and graph patterns within the underrepresented classes. In this article, we propose REFUEL—a novel approach for highly imbalanced node classification problems in graphs. Whereas symbolic and neural methods have complementary strengths and weaknesses when applied to such problems, REFUEL combines the power of symbolic and neural learning in a novel neural rule-extraction architecture. REFUEL captures the class semantics in the automatically extracted rule vectors. Then, REFUEL augments the graph nodes with the extracted rules vectors and adopts a Graph Attention Network-based neural node embedding, enhancing the downstream neural node representation. Our evaluation confirms the effectiveness of the proposed REFUEL approach for three real-world datasets with different minority class sizes. REFUEL achieves at least a 4% point improvement in precision on the minority classes of 1.5–2% compared to the baselines.
... Despite its widespread success, constructing each tree in the ensemble is computationally resource-intensive, especially for large datasets. State-of-the-art techniques for constructing each tree are based on threshold feature splitting [7][8][9][10], whose algorithmic complexity scales polynomially with the number of training samples. Once the model is trained and put online, it needs to be updated to account for new training data in order to maintain model accuracy. ...
Preprint
Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models need to account for new training data to maintain model accuracy. This is particularly important in applications where data is periodically and sequentially generated over time in data streams, such as auto-driving systems and credit card payments. In this setting, performing periodic model retraining with the old and new data accumulated is beneficial as it fully captures possible drifts in the data distribution over time. However, this is impractical with state-of-the-art classical algorithms for RF as they scale linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest leverages Des-q, a quantum algorithm for single tree construction and retraining proposed by Kumar et al. by expanding to multi-class classification, as the original proposal was limited to binary classes, and introducing an exact classical method to replace an underlying quantum subroutine incurring a finite error, while maintaining the same poly-logarithmic dependence. Finally, we showcase that QC-Forest achieves competitive accuracy in comparison to state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up model retraining.
... In the category of heuristic methods, Rivest [22] proposed greedy splitting techniques, also used in subsequent heuristic algorithms for learning decision trees [5,20]. RIPPER [7] builds rule sets in a greedy fashion, and a similar greedy strategy has been used for finding sets of robust rules in terms of minimum description length [9]. ...
Preprint
Learning interpretable models has become a major focus of machine learning research, given the increasing prominence of machine learning in socially important decision-making. Among interpretable models, rule lists are among the best-known and easily interpretable ones. However, finding optimal rule lists is computationally challenging, and current approaches are impractical for large datasets. We present a novel and scalable approach to learn nearly optimal rule lists from large datasets. Our algorithm uses sampling to efficiently obtain an approximation of the optimal rule list with rigorous guarantees on the quality of the approximation. In particular, our algorithm guarantees to find a rule list with accuracy very close to the optimal rule list when a rule list with high accuracy exists. Our algorithm builds on the VC-dimension of rule lists, for which we prove novel upper and lower bounds. Our experimental evaluation on large datasets shows that our algorithm identifies nearly optimal rule lists with a speed-up up to two orders of magnitude over state-of-the-art exact approaches. Moreover, our algorithm is as fast as, and sometimes faster than, recent heuristic approaches, while reporting higher quality rule lists. In addition, the rules reported by our algorithm are more similar to the rules in the optimal rule list than the rules from heuristic approaches.
... Decision trees learn from inputting the data and results, to create a predictive model [16]. While there are several machine learning techniques involving decision tree models, such as Classification and Regression Tree (CART) [17], C4.5 algorithm [18], and Iterative Dichotomiser 3 (ID3) algorithm [19], CART was used to predict the concrete compressive strength, considering the 4 parameters investigated in this study. This is because CART can train multiple factors, interactions, and relationships, including both categorical and numerical data [20]. ...
Article
Full-text available
Due to the ceramic tile waste’s (CTW) negative impact on workability, this study incorporated three aggregate modification treatments (AMTs) on the CTWs, namely cement impregnation (CI), sodium silicate soaking (SS), and slurry wrapping (SW). Concrete batches were prepared, with varying CTW replacements of 0%, 25%, and 50% to gravel, and water-cement ratios (w/c) of 0.5 and 0.6. Slump tests and compressive strength tests at curing periods of 7 and 28 days were conducted. Experimental results showed that concrete mixes with CI treatment produced the highest compressive strength, while the concrete batches with 0.6 w/c produced higher compressive strengths. However, the concrete mix that considered SW treatment showed a reduction in compressive strength relative to the mix with untreated CTW. The optimum design mix incorporated CI treatment, 25% CTW replacement, and 0.5 w/c. This mix yielded about 16.7% stronger nominal strength compared to the control mix. A decision tree regression (DTR) model was generated to predict the compressive strength based on different combinations for the concrete mix. Based on the model, the AMT showed the most influence on the prediction of compressive strength. Overall results indicate the use of CTW in sustainable concrete production could be further enhanced by CI treatment method.
... Several splitting criteria have been proposed, such as information gain [38], information gain ratio [39], and Gini impurity [33]. In this paper, we follow the previous studies [1,18] to use the modified Gini impurity, which is easy to compute in MPC [1,18], as the splitting criterion to implement our proposed Ents. ...
Preprint
Multi-party training frameworks for decision trees based on secure multi-party computation enable multiple parties to train high-performance models on distributed private data with privacy preservation. The training process essentially involves frequent dataset splitting according to the splitting criterion (e.g. Gini impurity). However, existing multi-party training frameworks for decision trees demonstrate communication inefficiency due to the following issues: (1) They suffer from huge communication overhead in securely splitting a dataset with continuous attributes. (2) They suffer from huge communication overhead due to performing almost all the computations on a large ring to accommodate the secure computations for the splitting criterion. In this paper, we are motivated to present an efficient three-party training framework, namely Ents, for decision trees by communication optimization. For the first issue, we present a series of training protocols based on the secure radix sort protocols to efficiently and securely split a dataset with continuous attributes. For the second issue, we propose an efficient share conversion protocol to convert shares between a small ring and a large ring to reduce the communication overhead incurred by performing almost all the computations on a large ring. Experimental results from eight widely used datasets show that Ents outperforms state-of-the-art frameworks by $5.5\times \sim 9.3\times$ in communication sizes and $3.9\times \sim 5.3\times$ in communication rounds. In terms of training time, Ents yields an improvement of $3.5\times \sim 6.7\times$. To demonstrate its practicality, Ents requires less than three hours to securely train a decision tree on a widely used real-world dataset (Skin Segmentation) with more than 245,000 samples in the WAN setting.
... This regularisation helps to maintain independence and uniqueness between features, thus improving the performance and robustness of the model. In traditional machine learning, logistic regression [11], support vector machine (SVM) [12] and decision tree [13] are widely used because of their simplicity and efficiency. Logistic regression is suitable for binary classification problems and has a simple, easy-to-interpret model, but it is difficult to deal with non-linear relationships. SVMs effectively deal with non-linear problems through kernel tricks, but are computationally inefficient on large-scale datasets. ...
Preprint
Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, a novel approach called Prior-Data Fitted Networks (TabPFN) has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no need for additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance when dealing with categorical features. To overcome this limitation, we propose FT-TabPFN, which is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle classification features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.
... Represented by a binary tree model, where each node will be represented by an input variable and the leaves will represent an output variable used to make the prediction. Its characteristics are the speed to make predictions and accuracy for most problems (Quinlan, 2014). • Logistic Regression (LR): a technique that uses concepts of statistics and probability for binary classification. ...
Article
Full-text available
Refactoring is the process of restructuring source code without changing the external behavior of the software. Refactoring can bring many benefits, such as removing code with poor structural quality, avoiding or reducing technical debt, and improving maintainability, reuse, or code readability. Although there is research on how to predict refactorings, there is still a clear lack of studies that assess the impact of operations considered less complex (trivial) to more complex (non-trivial). In addition, the literature suggests conducting studies that invest in improving automated solutions through detecting and correcting refactoring. This study aims to identify refactoring activity in non-trivial operations through trivial operations accurately. For this, we use classifier models of supervised learning, considering the influence of trivial refactorings and evaluating performance in other data domains. To achieve this goal, we assembled 3 datasets totaling 1,291 open-source projects, extracted approximately 1.9M refactoring operations, collected 45 attributes and code metrics from each file involved in the refactoring and used the algorithms Decision Tree, Random Forest, Logistic Regression, Naive Bayes and Neural Network of supervised learning to investigate the impact of trivial refactorings on the prediction of non-trivial refactorings. For this study, we contextualize the data and call context each experiment configuration in which it combines trivial and non-trivial refactorings. Our results indicate that: (i) Tree-based models such as Random Forest, Decision Tree, and Neural Networks performed very well when trained with code metrics to detect refactoring opportunities. However, only the first two were able to demonstrate good generalization in other data domain contexts of refactoring; (ii) Separating trivial and non-trivial refactorings into different classes resulted in a more efficient model. 
This approach still resulted in a more efficient model even when tested on different datasets; (iii) Using balancing techniques that increase or decrease samples may not be the best strategy to improve models trained on datasets composed of code metrics and configured according to our study.
... Random Forest (RF) [11] is a traditional ML algorithm that uses multiple Decision Trees (DTs) [12,13] to generate predictions. Figure 2 presents a schematic idea of the algorithm. ...
Conference Paper
Semantic segmentation has been successfully explored in biological studies to handle various applications, such as identifying wounds. This study explores two image segmentation approaches to identify mice wounds, specifically the U-Net and Random Forest algorithms. The latter was combined with features extracted from the first two layers of VGG16, which was used as a feature extractor. Experiments were performed with a real dataset developed by the Pain, Neuropathy, and Inflammation Laboratory at the State University of Londrina with the approval of the University Ethics Committee on Animal Research and Welfare. The experimental results were promising, showing that both alternatives can provide accurate predictions for most images regarding FScore and IoU evaluation measures. Statistical tests were also applied, showing that U-Net obtained statistically better results with an average FScore of 0.72 and IoU of 0.58.
... Unfortunately, DT optimization poses a significant challenge due to its NP-completeness, as established by Laurent and Rivest (1976). Consequently, heuristic methods, such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 2014) and CART (Breiman et al., 1984), have been favoured historically. These methods construct DTs greedily by maximising some local purity metric, however, while they are fast and scalable, their greedy nature often leads to suboptimal and overly complex DTs, detracting from their interpretability. ...
Preprint
Full-text available
Decision Tree Learning is a fundamental problem for Interpretable Machine Learning, yet it poses a formidable optimization challenge. Despite numerous efforts dating back to the early 1990's, practical algorithms have only recently emerged, primarily leveraging Dynamic Programming (DP) and Branch & Bound (B&B) techniques. These breakthroughs led to the development of two distinct approaches. Algorithms like DL8.5 and MurTree operate on the space of nodes (or branches), they are very fast, but do not penalise complex Decision Trees, i.e. they do not solve for sparsity. On the other hand, algorithms like OSDT and GOSDT operate on the space of Decision Trees, they solve for sparsity but at the detriment of speed. In this work, we introduce Branches, a novel algorithm that integrates the strengths of both paradigms. Leveraging DP and B&B, Branches achieves exceptional speed while also solving for sparsity. Central to its efficiency is a novel analytical bound enabling substantial pruning of the search space. Theoretical analysis demonstrates that Branches has lower complexity compared to state-of-the-art methods, a claim validated through extensive empirical evaluation. Our results illustrate that Branches not only greatly outperforms existing approaches in terms of speed and number of iterations, it also consistently yields optimal Decision Trees.
... This has led to the development of a variety of optimization techniques, aimed at refining every aspect of decision tree learning-from the construction phase to the model's final deployment. Techniques such as pruning [26] feature selection [25], and tree ensemble methods [5] have helped greatly improve the performance of decision trees. Advancements in splitting criteria [17], parallel/distributed computing [10], and incremental learning have allowed these algorithms to handle [12] large-scale data analyses. ...
Preprint
Full-text available
We present a novel and systematic method, called Superfast Selection, for selecting the "optimal split" for decision tree and feature selection algorithms over tabular data. The method speeds up split selection on a single feature by lowering the time complexity, from O(MN) (using the standard selection methods) to O(M), where M represents the number of input examples and N the number of unique values. Additionally, the need for pre-encoding, such as one-hot or integer encoding, for feature value heterogeneity is eliminated. To demonstrate the efficiency of Superfast Selection, we empower the CART algorithm by integrating Superfast Selection into it, creating what we call Ultrafast Decision Tree (UDT). This enhancement enables UDT to complete the training process with a time complexity O(KMlogM) (K is the number of features). Additionally, the Training Only Once Tuning enables UDT to avoid the repetitive training process required to find the optimal hyper-parameter. Experiments show that the UDT can finish a single training on KDD99-10% dataset (494K examples with 41 features) within 1 second and tuning with 214.8 sets of hyper-parameters within 0.25 second on a laptop.
... They can be pruned to avoid overfitting. In classification, leaf nodes are class labels; in regression, they are numbers [27,28]. The decision tree's prediction is made by traversing down the tree based on the feature values of the input sample until a leaf node is reached: y = prediction at the leaf node. ...
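The traversal rule quoted in the excerpt above can be sketched in a few lines of Python. This is a minimal illustration, not the citing paper's implementation: the `Node` class, its field names, the "go left when the value is at or below the threshold" convention, and the toy tree are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Node:
    """Hypothetical tree node: internal nodes hold a test, leaves hold a prediction."""
    feature: Optional[int] = None        # index of the feature tested at this node
    threshold: Optional[float] = None    # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Union[str, float, None] = None  # class label or number at a leaf


def predict(node: Node, x: Sequence[float]) -> Union[str, float]:
    """Walk from the root to a leaf and return the value stored there."""
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Toy classification tree (assumed values, for illustration only).
tree = Node(
    feature=0, threshold=2.5,
    left=Node(prediction="A"),
    right=Node(
        feature=1, threshold=1.0,
        left=Node(prediction="B"),
        right=Node(prediction="C"),
    ),
)

print(predict(tree, [1.0, 5.0]))
print(predict(tree, [3.0, 0.5]))
```

The same loop serves regression if the leaves store numbers instead of labels, which matches the excerpt's distinction between the two settings.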
Article
Full-text available
Current electricity sectors will be unable to keep up with commercial and residential customers’ increasing demand for data-enabled power systems. Therefore, next-generation power systems must be developed. It is possible for the smart grid, an advanced power system of the future, to make decisions, estimate loads, and execute other data-related jobs. Customers can adjust their needs in smart grid systems by monitoring bill information. Due to their reliance on data networks, smart grids are vulnerable to cyberattacks that could compromise billing data and cause power outages and other problems. A false data injection attack (FDIA) is a significant attack that targets the corruption of state estimation vectors. The primary goal of this paper is to show the impact of an FDIA attack on a power dataset and to use machine learning algorithms to detect the attack; to achieve this, the Python software is used. In the experiment, we used the power dataset from the IoT server of a 10 KV solar PV system (to mimic a smart grid system) in a controlled laboratory environment to test the effect of FDIA and detect this anomaly using a machine learning approach. Different machine learning models were used to detect the attack and find the most suitable approach to achieve this goal. This paper compares machine learning algorithms (such as random forest, isolation forest, logistic regression, decision tree, autoencoder, and feed-forward neural network) in terms of their effectiveness in detecting false data injection attacks (FDIAs). The highest F1 score of 0.99 was achieved by the decision tree algorithm, which was closely followed by the logistic regression method, which had an F1 score of 0.98. These algorithms also demonstrated high precision, recall, and model accuracy, demonstrating their efficacy in detecting FDIAs. 
The research presented in this paper indicates that combining logistic regression and decision tree in an ensemble leads to significant performance enhancements. The resulting model achieves an impressive accuracy of 0.99, a precision of 1, and an F1 score of 1.
... We use WEKA's J48 package (WEKA, Frank et al., 2016) to build our decision trees. J48 is an implementation of the C4.5 algorithm (Quinlan, 1993), which is a Classification and Regression Tree (or CART model; Breiman et al., 1984) that, given the input database, will make partitions within the data based on how well a partition is able to generalize for classification. Variables used for these partitions have a high information gain at that point in the algorithm; the higher information gain a variable has, the more evenly its values subdivide the space, which minimizes the number of additional variables needed to classify a data point. ...
Article
Full-text available
How might data analytic tools support intake decisions? When faced with a request for post-conviction assistance, innocence organizations’ intake staff must determine (1) whether the applicant can be shown to be factually innocent, and (2) whether the organization has the resources to help. These difficult categorization decisions are often made with incomplete information (Weintraub, 2022). We explore data from the National Registry of Exonerations (NRE; 4/26/2023, N = 3,284 exonerations) to inform such decisions, using patterns of features associated with successful prior cases. We first reproduce Berube et al. (2023)’s latent class analysis, identifying four underlying categories across cases. We then apply a second technique to increase transparency, decision tree analysis (WEKA, Frank et al., 2013). Decision trees can decompose complex patterns of data into ordered flows of variables, with the potential to guide intermediate steps that could be tailored to the particular organization’s limitations, areas of expertise, and resources.
... The underlying principle of machine learning classification models involves identifying a relationship or rule between the input data features and categories and subsequently employing this relationship or rule to predict the categories of new data (Aized and Arshad 2017;Kesavaraj and Sukumaran 2013). Taking the classic decision tree algorithm as an example, child nodes are generated and extended when a feature reaches a threshold, continuing until the result achieves optimal "purity"-measured by metrics such as information entropy (Quinlan 2014) and Gini coefficient (Breiman 2017). However, there is a paucity of research dedicated to three-dimensional features for classification purposes, with the majority of studies predominantly emphasizing the application of geometric morphometrics (Liew and Schilthuizen 2016b). ...
Article
Full-text available
Classification of cryptic species is important for assessing biodiversity and conducting ecological studies. However, morphological classification methods face the loss of morphological information due to subjectivity in geometric morphometrics, while an incomplete database and horizontal gene transfer limit the molecular approach. A novel approach combining 3D modeling and artificial intelligence algorithms using morphological and molecular data was developed for species classification. Cryptic species from the Vignadula genus were used to test the feasibility of this new approach. Molecular identification results as data labels were used for training models, and for validating classification results of machine learning and deep learning. Our approach achieved accuracies of over 80% in distinguishing between V. atrata and V. mangle, which were identified by molecular data along China’s coast. The result of the confusion matrix indicated the misidentified individuals were due to the morphological similarity in the intermediate zone. The feature importance analysis highlighted the significant contribution of average curvature—a 3D feature—to the task, indicating the feasibility of the 3D model in cryptic species classification. Utilizing 3D models and artificial intelligence, this study presents a novel approach for classifying cryptic species of molluscs.
... Model-centric XAI approaches aim to explain the learned model itself [15][16][17]. In this sense, decision trees [18] learned from data are one of the most transparent machine learning models. Conversely, post hoc XAI approaches aim to explain a model by providing verbal or visual explanations [15,19]. ...
Article
Full-text available
Gaining clinicians’ trust will unleash the full potential of artificial intelligence (AI) in medicine, and explaining AI decisions is seen as the way to build trustworthy systems. However, explainable artificial intelligence (XAI) methods in medicine often lack a proper evaluation. In this paper, we present our evaluation methodology for XAI methods using forward simulatability. We define the Forward Simulatability Score (FSS) and analyze its limitations in the context of clinical predictors. Then, we apply FSS to our XAI approach defined over an ML-RO, a machine learning clinical predictor based on random optimization over a multiple kernel support vector machine (SVM) algorithm. To compare FSS values before and after the explanation phase, we test our evaluation methodology for XAI methods on three clinical datasets, namely breast cancer, VTE, and migraine. The ML-RO system is a good model on which to test our XAI evaluation strategy based on the FSS. Indeed, ML-RO outperforms two other base models, a decision tree (DT) and a plain SVM, in the three datasets and gives the possibility of defining different XAI models: TOPK, MIGF, and F4G. The FSS evaluation score suggests that the explanation method F4G for the ML-RO is the most effective in two datasets out of the three tested, and it shows the limits of the learned model for one dataset. Our study aims to introduce a standard practice for evaluating XAI methods in medicine. By establishing a rigorous evaluation framework, we seek to provide healthcare professionals with reliable tools for assessing the performance of XAI methods to enhance the adoption of AI systems in clinical practice.
... The popularity of transformer models [27] shows why they are also an excellent choice for AD tasks [28]. [29] uses an attention-based LSTM model on top of a C4.5 [30] decision tree to predict the lane change intention and execution. While in this work, the attention mechanism automatically extracts the critical information and learns the relationship between the hidden layers, in [28], the trajectory prediction of other vehicles is solved by combining language models and trajectory prediction. ...
Article
Full-text available
This paper investigates the high-level decision-making problem in highway scenarios regarding lane changing and overtaking other slower vehicles. In particular, this paper aims to improve the Travel Assist feature for automatic overtaking and lane changes on highways. About 9 million samples including lane images and other dynamic objects are collected in simulation. This data is released as the Overtaking on Simulated HighwAys (OSHA) dataset to tackle this challenge. To solve this problem, an architecture called SwapTransformer is designed and implemented as an imitation learning approach on the OSHA dataset. Moreover, auxiliary tasks such as future points and car distance network predictions are proposed to aid the model in better understanding the surrounding environment. The performance of the proposed solution is compared with a multi-layer perceptron (MLP) and multi-head self-attention networks as baselines in a simulation environment. We also demonstrate the performance of the model with and without auxiliary tasks. All models are evaluated based on different metrics such as time to finish each lap, number of overtakes, and speed difference with speed limit. The evaluation shows that the SwapTransformer model outperforms other models in different traffic densities in the inference phase.
... The J48 classifier was used to build the DTs. This algorithm represents a Java extension of the better-known Quinlan C4.5 algorithm (Quinlan, 1993;Salzberg, 1994). Starting from a set of input data, the algorithm defines a DT that allows for classifying new data into the groups of a specific class variable. ...
Article
Full-text available
The development of psychological assessment tools that accurately and efficiently classify individuals as having or not a specific diagnosis is a major challenge for test developers and mental health professionals. This paper shows how machine learning (ML) provides a valuable framework to improve the accuracy and efficiency of psychodiagnostic classifications. The method is illustrated using an empirical example based on the Patient Health Questionnaire-9 (PHQ-9). The results show that, compared to traditional scorings of the PHQ-9, that based on decision tree (DT) algorithms is more advantageous in terms of accuracy and efficiency. In addition, the DT-based method facilitates the development of short test forms and improves the diagnostic performance of the test by integrating external information (e.g., demographic variables) into the scoring process. These findings suggest that DT-algorithms and ML applications such as feature selection represent a valuable method for supporting test developers and mental health professionals, and highlight the potential of ML for advancing the field of psychological assessment.
... An adaptation of the C4.5 algorithm [29] to the Multi-Label Learning (MLL) setting has been introduced [9]. The original C4.5 algorithm was developed to generate decision trees using the concept of entropy, and is capable of handling both continuous and discrete attributes. ...
Article
Full-text available
Lifelong machine learning concerns the development of systems that continuously learn from diverse tasks, incorporating new knowledge without forgetting the knowledge they have previously acquired. Multi-label classification is a supervised learning process in which each instance is assigned multiple non-exclusive labels, with each label denoted as a binary value. One of the main challenges within the lifelong learning paradigm is the stability-plasticity dilemma, which entails balancing a model’s adaptability in terms of incorporating new knowledge with its stability in terms of retaining previously acquired knowledge. When faced with multi-label data, the lifelong learning challenge becomes even more pronounced, as it becomes essential to preserve relations between multiple labels across sequential tasks. This scoping review explores the intersection of lifelong learning and multi-label classification, an emerging domain that integrates continual adaptation with intricate multi-label datasets. By analyzing the existing literature, we establish connections, identify gaps in the existing research, and propose new directions for research to improve the efficacy of multi-label lifelong learning algorithms. Our review unearths a growing number of algorithms and underscores the need for specialized evaluation metrics and methodologies for the accurate assessment of their performance. We also highlight the need for strategies that incorporate real-world data from varying contexts into the learning process to fully capture the nuances of real-world environments.
... After that, data scientists have made many useful explorations to find the method of selecting the optimal features, J. Ross Quinlan proposed to use "information gain" to select the optimal features in 1986, which formed the "ID3 algorithm", and then he realised the "ID3 algorithm" in 1993. In 1993, Ross Quinlan optimised the ID3 algorithm, i.e. the C4.5 algorithm, which adopts the "information gain ratio" to select the optimal features, and the shortcoming of the ID3 and C4.5 algorithms is that they can only be applied to classification, but not to regression, i.e. the variables to be interpreted are limited to discrete values, and to achieve this goal, they can only be used for classification [3]. In order to make the decision tree realise the application of regression, Leo Breiman used Gini impurity to select the optimal features, and thus proposed the CART (Classification and Regression Tree) algorithm [4]. ...
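The three selection criteria named in the excerpt above (information gain, information gain ratio, and Gini impurity) can be compared on a toy example. The following is a sketch under stated assumptions: the function names and the sample data are made up for illustration, attributes are treated as categorical, and the gain ratio is computed as information gain divided by the entropy of the attribute's own value distribution (the split information).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by the attribute value of each example."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the split information of the attribute."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


labels = ["y", "y", "n", "n"]
useful = ["a", "a", "b", "b"]    # two values that separate the classes
id_like = ["a", "b", "c", "d"]   # one distinct value per example

print(information_gain(useful, labels), gain_ratio(useful, labels))
print(information_gain(id_like, labels), gain_ratio(id_like, labels))
print(gini(labels))
```

Both attributes reach the same information gain on this data, but the identifier-like attribute gets a lower gain ratio because its split information is larger. That penalty on many-valued attributes is the usual motivation for preferring the ratio.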
Article
Full-text available
In this paper, the decision tree model in data mining is applied to select stock characteristics that can be effectively used for stock selection by using the C4.5 algorithm and the CART algorithm, respectively, in combination with the strategies of fundamental analysis and technical analysis. The paper concludes that the decision tree models constructed by the C4.5 and CART algorithms both have better classification ability for stock selection and portfolio construction, but the decision tree model constructed by the C4.5 algorithm is simpler. The stock portfolios determined by the decision tree model are able to achieve an excess return of 13.4% relative to the CSI 300 index, thus proving that the decision tree model is effective in stock selection and stock portfolio construction.
... Text classification has evolved through several methodologies, each with unique advantages and limitations. Early rule-based methods like decision trees and expert systems, such as C4.5 Quinlan (2014) and MYCIN Shortliffe (2012), were straightforward but prone to overfitting and inflexibility with new or noisy data. In contrast, probability-based models like Naive Bayes Xu (2018) and Hidden Markov Models Rabiner (1989) offered better generalization for tasks like spam detection and speech recognition, handling sequential and complex data effectively. ...
Preprint
Full-text available
Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers. The system simplifies the traditional text classification workflow, eliminating the need for extensive preprocessing and domain expertise. The performance of several LLMs, machine learning (ML) algorithms, and neural network (NN) based structures is evaluated on four datasets. Results demonstrate that certain LLMs surpass traditional methods in sentiment analysis, spam SMS detection and multi-label classification. Furthermore, it is shown that the system's performance can be further enhanced through few-shot or fine-tuning strategies, making the fine-tuned model the top performer across all datasets. Source code and datasets are available in this GitHub repository: https://github.com/yeyimilk/llm-zero-shot-classifiers.
... There are several algorithms that can be employed to construct the decision tree including ID3 (Goodfellow et al., 2016), C4.5 (Quinlan, 1985), and Classification and Regression Trees (CART) (Quinlan, 1993). The Scikit-Learn library (Berk, 2020) that uses the CART algorithm which constructs the binary tree to form a predictor was implemented in the present study. ...
Article
Full-text available
Although the aspects that affect the performance and the deterioration of abrasive belt grinding are known, wear prediction of abrasive belts in the robotic arm grinding process is still challenging. Massive wear of coarse grains on the belt surface has a serious impact on the integrity of the tool and it reduces the surface quality of the finished products. Conventional wear status monitoring strategies that use special tools result in the cessation of the manufacturing production process which sometimes takes a long time and is highly dependent on human capabilities. The erratic wear behavior of abrasive belts demands machining processes in the manufacturing industry to be equipped with intelligent decision-making methods. In this study, to maintain a uniform tool movement, an abrasive belt grinding is installed at the end-effector of a robotic arm to grind the surface of a mild steel workpiece. Simultaneously, accelerometers and force sensors are integrated into the system to record its vibration and forces in real-time. The vibration signal responses from the workpiece and the tool reflect the wear level of the grinding belt to monitor the tool’s condition. Intelligent monitoring of abrasive belt grinding conditions using several machine learning algorithms that include K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Decision Tree (DT) are investigated. The machine learning models with the optimized hyperparameters that produce the highest average test accuracy were found using the DT, Random Forest (RF), and XGBoost. Meanwhile, the lowest latency was obtained by DT and RF. A decision-tree-based classifier could be a promising model to tackle the problem of abrasive belt grinding prediction. The application of various algorithms will be a major focus of our research team in future research activities, investigating how we apply the selected methods in real-world industrial environments.
... J48 [58], a Java implementation of C4.5 algorithm, normalize bias using the gain ratio in decision tree classification. Naive Bayes [59] applies Bayes' theorem, assuming feature independence. ...
Article
Full-text available
In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.
... Those impurity metrics can be Information Gain, Entropy, Gain Ratio, etc. Because the decision variable (CSI) is a continuous type variable, we utilize the C4.5 [32] algorithm. To manage continuous attributes, the C4.5 algorithm creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. ...
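The threshold split described in the excerpt above can be sketched as a search over cut points of one continuous attribute. This is a simplified illustration rather than C4.5 itself: it assumes candidate thresholds are midpoints between adjacent distinct sorted values and ranks them by plain information gain, and the function names and sample data are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split value <= threshold vs. value > threshold."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point between equal values
        left = [y for _, y in pairs[:i]]    # examples at or below the candidate
        right = [y for _, y in pairs[i:]]   # examples above the candidate
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


# Toy data (assumed): the classes separate cleanly between 2.0 and 3.0.
threshold, gain = best_threshold([4.0, 1.0, 3.0, 2.0], ["y", "n", "y", "n"])
print(threshold, gain)
```

Recomputing the entropies at every candidate keeps the sketch short but costs quadratic time; an implementation meant for large data would typically update class counts incrementally while scanning the sorted values.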
Article
Full-text available
Recently, human activity recognition based on wireless signals has become an active and promising research direction. Researchers have shown that machine learning (ML) models can accurately classify some activities of a person standing between the WiFi transmitter and receiver. However, the availability of public datasets is limited due to labor-intensive dataset collection. Moreover, an efficient signal segmentation algorithm is required for application in practical scenarios. This paper presented a signal enhancement framework for WiFi-based human activity recognition using ML-based signal segmentation. Specifically, we proposed a stable channel state information (CSI) collection platform based on stable USRP devices. Using this platform, we released a public dataset (WiAR-UIT) for various human activities to control smart home devices. To enhance the prediction accuracy as well as the converging ability of ML models, we proposed two algorithms for automatic signal segmentation. The first algorithm uses conventional signal processing procedures (SIGPRO-SEGM). The second algorithm is dataset-independent and based on a CNN model (ML-SEGM). Applying these segmentation algorithms to our dataset, the best performance of 99.2% accuracy is obtained. Moreover, the accuracy is improved by 35% for some ML models including K-nearest neighbors, support vector machine, decision tree, random forest, and multi-layer perceptron. Finally, we have deployed a real-time client–server application using the above segmentation algorithms to emphasize the potential and practicality of the proposed research direction.
... Decision trees can suffer from sensitivity to noise or rotation variance with unconstrained splits. Regularization methods address overfitting, with pruning a common tactic to improve generalization capability [65]. Owing to their conceptual simplicity yet capability to uncover complex patterns, decision trees have thrived across predictive modeling applications in engineering, science and business [66]. ...
Article
Full-text available
Unmanned aerial systems/vehicles (UAS/UAVs) are widely employed for inspecting high-voltage (HV) Tx lines, characterized by elevated electric (E) and magnetic (H) fields. Operating on batteries, these UAVs are equipped with various electrical sensors, microprocessors, and motors, all susceptible to E/H field effects. This paper explores the distribution of E/H fields in multiple HVTx lines and a microwave tower. Data was collected from one 250 kV DC, four AC Tx lines (69kV, 230kV, 345kV, and 500kV), and one microwave tower, utilizing DJI UAVs (M2EA, M30, and M300) equipped with onboard setups. Measurements included E field in V/m, H field in mG, Battery voltage in V, Battery current in A, Battery percentage, Battery Temperature in F, latitude, and longitude. Preliminary findings highlight larger E/H field levels within AC Tx lines than DC Tx lines. The paper discusses conditions influencing E/H field strength during UAV operation. Additionally, a proposed multi-staged random forest regressor (RFR) and k-nearest neighbor (KNN) hybrid machine learning (ML) model forecasts UAV battery drain. Results indicate that the hybrid RFR and KNN model yields lower MAPE values compared to standalone models.
... The experiments demonstrate that the distance distributions for the same person and different persons classes, as tested with the VGG-Face facial recognition model using L2 Normalized Euclidean distance and enabled alignment mode on unit test items of DeepFace, can be separated, as illustrated in Figure 5. In this context, the C4.5 algorithm [20] was employed to identify the threshold that yields the maximum information gain. ...
Article
Full-text available
Researchers from leading technology companies, prestigious universities worldwide, and the open-source community have made substantial strides in the field of facial recognition studies in recent years. Experiments indicate that facial recognition approaches have not only achieved but surpassed human-level accuracy. A contemporary facial recognition process comprises four key stages: detection, alignment, representation, and verification. Presently, the focus of facial recognition research predominantly centers on the representation stage within the pipelines. This study conducted experiments exploring alternative combinations of nine state-of-the-art facial recognition models, six cutting-edge face detectors, three distance metrics, and two alignment modes. The co-usability performances of implementing and adapting these modules were assessed to precisely gauge the impact of each module on the pipeline. Theoretical and practical findings from the study aim to provide optimal configuration sets for facial recognition pipelines.
Article
Full-text available
This research embarked on a comprehensive analysis of customer churn prediction in the telecommunication sector using various machine learning algorithms. Primarily, the study concentrated on three algorithms: Logistic Regression, Random Forest, and Gradient Boosting. The performance of these algorithms was gauged on empirical data, revealing varied results. Logistic Regression offered a fundamental approach, often serving as a benchmark in churn prediction tasks. Meanwhile, the ensemble techniques, Random Forest and Gradient Boosting, showcased their prowess in handling large data with many predictors, often outperforming simpler models in intricate tasks. Furthermore, this study delved deep into hyperparameter tuning to amplify the accuracy of the Gradient Boosting and Random Forest algorithms. The results illustrated subtle performance enhancements, albeit the trade-offs in precision and recall became evident. Notably, the Gradient Boosting Classifier, when fine-tuned, displayed an accuracy of approximately 80%, with feature importance highlighting 'Contract', 'tenure', and 'MonthlyCharges' as significant predictors. In contrast, the Logistic Regression algorithm manifested consistent performance, making it a reliable option, albeit lacking the sophistication of ensemble methods. This investigation reaffirms the notion posited by numerous scholars, such as Moro et al. (2014) and Barakat et al. (2020), emphasizing the dynamic nature of machine learning algorithms in predicting customer churn. In conclusion, while each algorithm has its merits, their efficacious application rests heavily on understanding the underlying data and the specific business context. Future directions suggest delving into more advanced algorithms and further feature engineering to bolster prediction accuracy.
Article
Outlier detection is crucial for preventing financial fraud, network intrusions, and device failures. Users often expect systems to automatically summarize and interpret outlier detection results to reduce human effort and convert outliers into actionable insights. However, existing methods fail to effectively assist users in identifying the root causes of outliers, as they only pinpoint data attributes without considering that outliers in the same subspace may have different causes. To fill this gap, we propose STAIR, which learns concise and human-understandable rules to summarize and explain outlier detection results with finer granularity. These rules consider both attributes and associated values. STAIR employs an interpretation-aware optimization objective to generate a small number of rules with minimal complexity for strong interpretability. The learning algorithm of STAIR produces a rule set by iteratively splitting the large rules and is optimal in maximizing this objective in each iteration. Moreover, to effectively handle high-dimensional, highly complex data sets that are hard to summarize with simple rules, we propose a localized STAIR approach, called L-STAIR. Taking data locality into consideration, it simultaneously partitions data and learns a set of localized rules for each partition. Our experimental study on many outlier benchmark datasets shows that STAIR significantly reduces the complexity of the rules required to summarize the outlier detection results, making them more amenable for humans to understand and evaluate.
Conference Paper
Context: School dropout in distance learning has become a growing concern in higher education. Private institutions exhibit a 33.6% dropout rate, while public institutions show a slightly lower rate at 31.2%, with an upward trend. Problem: Studies focus on categorical indicators of lack of time, students' personal lives, the educational institution, and course instructors. However, research is still needed to explicitly focus on identifying patterns related to gender with students abandoning courses. Solution: Identifying gender-related patterns among indicators leading to dropout in 36 distance learning undergraduate courses. Theory: Our study incorporated Social Learning Theory alongside Social Cognitive Theory. Social Learning Theory provided insights into how academic performance metrics influence student dropout rates. Social Cognitive Theory also examined the relationship between students' personal factors, including gender and marital status, and their learning behaviors. Method: The research conducted is descriptive with a quantitative approach. An experiment was conducted to categorize and identify the most relevant features influencing dropout using machine learning. Results: The results provide patterns for investigated aspects, highlighting women in most analyses. Time-related characteristics exhibit a higher correlation with dropout. Features related to student academic performance and university campus location play a crucial role in classifying a student as a potential dropout, according to the XGBoost classifier, yielding the best performance results. Conclusion: These analyses offer an understanding of factors influencing distance learning dropout, drawing parallels with gender-related situations influencing dropout decisions. This allows for adopting preventive and personalized measures to enhance student retention and improve the academic experience.
Article
The migrating birds optimization algorithm is a promising metaheuristic recently introduced to the optimization community. In this study, we propose a superior version of the migrating birds optimization algorithm by hybridizing it with simulated annealing, one of the most popular metaheuristics. The new algorithm, called MBOx, is compared with the original migrating birds optimization (MBO) and four well-known metaheuristics: simulated annealing, differential evolution, the genetic algorithm, and the recently proposed Harris hawks optimization algorithm. Extensive experiments are conducted on problem instances from both discrete and continuous domains: the feature selection problem, the obstacle neutralization problem, the quadratic assignment problem, and continuous functions. On problems from the discrete domain, MBOx outperforms the original MBO and the others by up to 20.99%. On the continuous functions, MBOx does not lead the competition but takes second position. As a result, MBOx provides a significant performance improvement and is therefore a promising solver for computational optimization problems.
Chapter
In this chapter, we give an overview on predictive modeling, used by actuaries. Historically, we moved from relatively homogeneous portfolios to tariff classes, and then to modern insurance, with the concept of “premium personalization.” Modern modeling techniques are presented, starting with econometric approaches, before presenting machine-learning techniques.
Article
Memory has been the subject of scientific study for nearly 150 years. Because a broad range of studies have been done, we can now assess how effective memory is for a range of materials, from simple nonsense syllables to complex materials such as novels. Moreover, we can assess memory effectiveness for a variety of durations, anywhere from a few seconds up to decades later. Our aim here is to assess a range of factors that contribute to the patterns of retention and forgetting under various circumstances. This was done by taking a meta-analytic approach that assesses performance across a broad assortment of studies. Specifically, we assessed memory across 256 papers, involving 916 data sets (e.g., experiments and conditions). The results revealed that exponential-power, logarithmic, and linear functions best captured the widest range of data compared with power and hyperbolic-power functions. Given previous research on this topic, it was surprising that the power function was not the best-fitting function most often. Contrary to what would be expected, a substantial amount of data also revealed either stable memory over time or improvement. These findings can be used to improve our ability to model and predict the amount of information retained in memory. In addition, this analysis of a large set of memory data provides a foundation for expanding behavioral and neuroimaging research to better target areas of study that can inform the effectiveness of memory.
Article
In recent years, advanced machine learning and artificial intelligence techniques have gained popularity due to their ability to solve problems across various domains with high performance and quality. However, these techniques are often so complex that they fail to provide simple and understandable explanations for the outputs they generate. To address this issue, the field of explainable artificial intelligence has recently emerged. On the other hand, most data generated in different domains are inherently structural; that is, they consist of parts and relationships among them. Such data can be represented using either a simple data structure or form, such as a vector, or a complex data structure, such as a graph. The effect of this representation form on the explainability and interpretability of machine learning models is not extensively discussed in the literature. In this survey paper, we review efficient algorithms proposed for learning from inherently structured data, emphasizing how their representation form affects the explainability of learning models. A conclusion of our literature review is that using complex forms or data structures for data representation improves not only the learning performance, but also the explainability and transparency of the model.
Article
Machine Learning is a technology that allows machines to become more accurate in predicting outcomes without being explicitly programmed for it. The basic premise of machine learning is to build models and deploy algorithms that can receive input data and use statistical analysis to predict an output, updating that output as new data becomes available. These models can be used in different areas and trained to match expectations so that accurate steps can be taken to achieve the organization's target. In this paper, the case of Big Mart Shopping Centre is discussed, both to predict the sales of different types of items and to understand the effects of different factors on those sales. Using various features of a dataset collected for Big Mart and the methodology followed for building a predictive model, results with high levels of accuracy are generated, and these observations can be used to make decisions that improve sales. Keywords: Machine Learning, Sales Prediction, Big Mart, Voting classifier algorithm, Linear Regression.
Article
This paper compares five methods for pruning decision trees, developed from sets of examples. When used with uncertain rather than deterministic data, decision-tree induction involves three main stages—creating a complete tree able to classify all the training examples, pruning this tree to give statistical reliability, and processing the pruned tree to improve understandability. This paper concerns the second stage—pruning. It presents empirical comparisons of the five methods across several domains. The results show that three methods—critical value, error complexity and reduced error—perform well, while the other two may cause problems. They also show that there is no significant interaction between the creation and pruning methods.
Article
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
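As a rough illustration of the kind of tree synthesis this abstract refers to, the sketch below computes information gain, the criterion ID3 uses to choose which attribute to split on. The abstract itself gives no code; the function names, the dictionary-per-row data layout, and the toy weather-style data are assumptions made only for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    total = len(labels)
    # Group the class labels by the value each row takes for the attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes"]

# An ID3-style learner would split on the attribute with the highest gain.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

A full ID3 implementation would apply this selection recursively to each partition until the labels are pure or no attributes remain; that recursion, and any handling of noisy or incomplete data discussed in the paper, is omitted here.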