C4.5: Programs for Machine Learning

Gene Selection and Cancer Type Prediction via Filtering and Weighted Regularized Logistic Regression

Article

May 2024

Technological advances in recent years have made high-dimensional data a prominent focus of research in genetics, bioinformatics, and biostatistics. This study develops a two-step approach using regularized logistic regression for cancer classification and gene selection with high-dimensional gene expression data. The method combines sure independence screening (SIS) using filtering methods for initial gene selection, followed by an adaptive LASSO (ALASSO) technique with a proposed weighting scheme. The model was applied to three cancer gene expression datasets: colon, leukemia, and prostate. The ALASSO with the Fisher score filter demonstrated the best performance, achieving classification accuracy of up to 98.2% and an AUC of 96.7% for the leukemia data. The top genes selected were biologically relevant to their cancer types. This study demonstrates the promise of integrating SIS filtering and weighted ALASSO for informative gene selection and accurate cancer prediction from high-dimensional gene expression data.
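The filtering step described above can be illustrated with a small sketch. The abstract does not give the exact Fisher score definition or the number of genes retained, so the two-class form used here, the function names `fisher_scores` and `top_k_features`, and the toy data are all assumptions for illustration, not the authors' implementation. The adaptive LASSO stage is not shown.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Two-class Fisher score per column: (mean1 - mean0)^2 / (var1 + var0).

    Assumed form; columns with zero within-class variance are scored 0.0
    to avoid division by zero (a simplification).
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        col0 = [row[j] for row, label in zip(X, y) if label == 0]
        col1 = [row[j] for row, label in zip(X, y) if label == 1]
        denom = pvariance(col0) + pvariance(col1)
        num = (mean(col1) - mean(col0)) ** 2
        scores.append(num / denom if denom > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k columns with the highest Fisher score."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy expression matrix: rows are samples, columns are genes (made-up numbers).
X = [[1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [3.0, 5.0, 0.1], [3.2, 4.9, 0.8]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
selected = top_k_features(X, y, k=1)
```

In a two-step pipeline of this kind, only the columns in `selected` would be passed on to the regularized regression stage.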

Search-based Trace Diagnostic

Preprint

Jun 2024

Cyber-physical systems (CPS) development requires verifying whether system behaviors violate their requirements. This analysis often considers system behaviors expressed by execution traces and requirements expressed by signal-based temporal properties. When an execution trace violates a requirement, engineers need to solve the trace diagnostic problem: They need to understand the cause of the breach. Automated trace diagnostic techniques aim to support engineers in the trace diagnostic activity. This paper proposes search-based trace-diagnostic (SBTD), a novel trace-diagnostic technique for CPS requirements. Unlike existing techniques, SBTD relies on evolutionary search. SBTD starts from a set of candidate diagnoses, applies an evolutionary algorithm iteratively to generate new candidate diagnoses (via mutation, recombination, and selection), and uses a fitness function to determine the qualities of these solutions. Then, a diagnostic generator step is performed to explain the cause of the trace violation. We implemented Diagnosis, an SBTD tool for signal-based temporal logic requirements expressed using the Hybrid Logic of Signals (HLS). We evaluated Diagnosis by performing 34 experiments for 17 trace-requirements combinations leading to a property violation and by assessing the effectiveness of SBTD in producing informative diagnoses and its efficiency in generating them on a time basis. Our results confirm that Diagnosis can produce informative diagnoses in practical time for most of our experiments (33 out of 34).

A Survey of Decision Trees: Concepts, Algorithms, and Applications

Article

Jan 2024

Machine learning (ML) has been instrumental in solving complex problems and significantly advancing different areas of our lives. Decision tree-based methods have gained significant popularity among the diverse range of ML algorithms due to their simplicity and interpretability. This paper presents a comprehensive overview of decision trees, covering core concepts, algorithms, and applications, from their early development to recent high-performing ensemble algorithms, together with their mathematical and algorithmic representations, which are lacking in the literature and will be beneficial to ML researchers and industry experts. The algorithms covered include classification and regression tree (CART), Iterative Dichotomiser 3 (ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), conditional inference trees, and other tree-based ensemble algorithms, such as random forest, gradient-boosted decision trees, and rotation forest. Their utilisation in recent literature is also discussed, focusing on applications in medical diagnosis and fraud detection.

Prediction of precipitation using wavelet-based hybrid models considering the periodicity

Article

May 2024
NEURAL COMPUT APPL

In recent years, machine learning methods have been widely applied to the prediction of hydrological processes such as precipitation. These methods can analyze large volumes of data and detect existing trends and patterns. In the present study, machine learning methods, including random forests (RF), the Kstar algorithm, and Gaussian process regression (GPR), were used to predict the precipitation of the Sindh River basin in India over the period 1901 to 2020. Three distinct input scenarios were prepared and introduced to the models: (i) using monthly precipitation data and considering the memory of the time series up to a 5-month delay, (ii) adding a periodic term to the first scenario's inputs, and (iii) decomposing the data using the Daubechies 4 wavelet function and creating hybrid wavelet-learning machine (W-ML) models. The performance of each method was evaluated using the root mean square error (RMSE), mean absolute error (MAE), Kling–Gupta efficiency score (KGE), and Willmott index (WI). The results showed that single models with the first scenario inputs (without taking into account the periodicity of the data) did not have good accuracy, but adding the periodicity significantly improved the performance of these models, and the average value of the KGE index across all studied stations increased from 0.466 to 0.672. It was also found that the GPR model did not achieve good performance across the stations, and that RF and Kstar, in that order, are the most appropriate methods for predicting precipitation in the Sindh River basin. With the third scenario and the development of W-ML hybrid models, the accuracy of precipitation forecasting improved significantly; in particular, the maximum precipitation values were estimated with higher accuracy than by the standalone models.
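As a rough illustration of the first input scenario and two of the metrics named above, the sketch below builds lagged inputs from a monthly series and computes RMSE and the Kling–Gupta efficiency in its commonly used form, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), with alpha the ratio of standard deviations and beta the ratio of means. The function names and the toy series are assumptions; the abstract does not say which KGE variant the authors used.

```python
from math import sqrt
from statistics import mean, pstdev


def lagged_inputs(series, max_lag=5):
    """Pair each target x[t] with the inputs [x[t-1], ..., x[t-max_lag]]."""
    X, y = [], []
    for t in range(max_lag, len(series)):
        X.append([series[t - k] for k in range(1, max_lag + 1)])
        y.append(series[t])
    return X, y


def pearson_r(obs, sim):
    """Pearson correlation between observed and simulated values."""
    mo, ms = mean(obs), mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / sqrt(var_o * var_s)


def kge(obs, sim):
    """Kling-Gupta efficiency; 1.0 means a perfect match."""
    r = pearson_r(obs, sim)
    alpha = pstdev(sim) / pstdev(obs)  # variability ratio
    beta = mean(sim) / mean(obs)       # bias ratio
    return 1.0 - sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def rmse(obs, sim):
    """Root mean square error."""
    return sqrt(mean([(o - s) ** 2 for o, s in zip(obs, sim)]))


# Toy monthly series (made-up numbers, not data from the study).
X_lag, y_lag = lagged_inputs([1, 2, 3, 4, 5, 6, 7], max_lag=5)
obs = [10.0, 20.0, 30.0, 40.0]
perfect = kge(obs, obs)
doubled = kge(obs, [2 * o for o in obs])
```

A simulation that doubles every observation keeps the correlation at 1 but is penalized through both the variability and bias terms, which is why KGE can flag errors that correlation alone would miss.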

Optimized Clustering-based Denoising Autoencoder Model in Recommendation Systems

Thesis

Dec 2023

The fast growth of internet applications and services produces a massive amount of information daily. With this rapid growth of information, it becomes challenging for users to find content that satisfies their needs when they use online applications. Therefore, Recommendation Systems (RS) have become necessary for users. RS is a filtering technique that tries to reduce the available selections for users by finding the relevant items that satisfy their desires. Deep learning algorithms have significantly succeeded in several fields, including RS. Recently, many deep learning-based RSs have been proposed; they involve all the users in datasets to extract the latent representation of input data to be used later for predicting the missing ratings. Users have diverse preferences, making it challenging to create a single model that caters to all of them. Because of this diversity, recommendations need to reflect individual user preferences accurately; bridging this gap is the primary objective of this work. This dissertation proposed a new Optimized Clustering-based Denoising Autoencoder model (OCB-DAE), which trains multiple models based on users' preferences. The proposed model combined the Artificial Fish Swarm Algorithm (AFSA) with the K-means algorithm to determine the best initial centroids for clustering the users based on their similarities, and each cluster trains a Denoising Autoencoder model (DAE) to ensure that users with similar interests train each model. OCB-DAE is applied to movie and food datasets to generate recommendations by utilizing the items' features as side information. The proposed model was trained and tested over the MovieLens 100K (ML-100K), MovieLens 1M (ML-1M), and Food.com datasets. The Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics were used to evaluate the performance of the proposed model. On the ML-100K dataset, the RMSE and MAE were 0.5589 and 0.5050, respectively, whereas on the ML-1M dataset, the RMSE and MAE were 0.6742 and 0.5894, respectively. On the Food dataset, the RMSE and MAE were 0.1922 and 0.1764, respectively. The proposed model outperformed the related works in terms of RMSE by 29.7% and 14.4% on the ML-100K and ML-1M datasets, respectively. On the Food dataset, the proposed model outperformed the non-clustered model in terms of RMSE by 24.3%. The results showed that training multiple models based on users' preferences reduced prediction errors and improved the recommendation systems' performance.

Assessing Decision Tree Stability: A Comprehensive Method for Generating a Stable Decision Tree

Article

Jan 2024

Objectives: This paper proposes a novel stability metric for decision trees that does not rely on the elusive notion of tree similarity. Existing stability metrics have been constructed in a pairwise fashion to assess the tree similarity between two decision trees. However, quantifying the structural similarities between decision trees is inherently elusive. Conventional stability metrics simply rely on partial information such as the number of nodes and the depth of the tree, which do not adequately capture structural similarities. Methods: We evaluate the stability based on the computational burden required to generate a stable tree. First, we generate a stable tree using the novel adaptive node-level stabilization method, which determines the most frequently selected predictor during the bootstrapping iterations of a decision tree branching process at each node. Second, the stability is measured based on the number of bootstraps required to achieve the stable tree. Findings: Using the proposed stability metric, we compare the stability of four popular decision tree splitting criteria: Gini index, entropy, gain ratio, and chi-square. In an empirical study across ten datasets, the gain ratio is the most stable splitting criterion among the four popular criteria. Additionally, a case study demonstrates that applying the proposed method to the classification and regression tree (CART) algorithm generates a more stable tree compared to the one produced by the original CART algorithm. Novelty: We propose a stability metric for decision trees without relying on measuring pairwise tree similarity. This paper provides a stability comparison of four popular decision tree splitting criteria, delivering practical insights into their reliability. The adaptive node-level stabilization method can be applied across various decision tree algorithms, enhancing tree stability and reliability in scenarios with updating data.
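To make the node-level idea concrete, here is a minimal sketch of three of the splitting criteria mentioned above, plus a bootstrap vote that returns the predictor chosen most often at a single node. The abstract does not describe the adaptive stopping rule, so this version uses a fixed number of bootstraps, categorical predictors, and the gain ratio; the function names and toy data are assumptions, not the authors' method.

```python
import random
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(items):
    """Shannon entropy (in bits) of a list of labels or categorical values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of a multiway categorical split over its split information."""
    n = len(labels)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, []).append(lab)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(values)
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


def most_frequent_predictor(X, y, n_boot=50, seed=0):
    """Index of the predictor selected most often across bootstrap resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yb = [y[i] for i in idx]
        scores = [gain_ratio([X[i][j] for i in idx], yb) for j in range(p)]
        votes[max(range(p), key=lambda j: scores[j])] += 1  # ties go to the lowest index
    return votes.most_common(1)[0][0]


# Toy node: predictor 0 determines the class, predictor 1 is unrelated (made-up data).
X = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]]
y = [0, 0, 1, 1]
chosen = most_frequent_predictor(X, y)
```

Under the paper's metric, the quantity of interest would be how many bootstraps are needed before such a vote settles; the fixed `n_boot` here only shows the voting mechanism.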

Ant-based feature and instance selection for multiclass imbalanced data

Article

Jan 2024

This paper introduces a novel algorithm called Ant-based Feature and Instance Selection. This new algorithm addresses the simultaneous selection of instances and features for mixed, incomplete, and imbalanced data in the context of lazy instance-based classifiers. The proposed algorithm uses a hybrid selection strategy based on metaheuristic procedures and Rough Sets. The Ant-based Feature and Instance Selection algorithm combines Ant Colony Optimization and Generic Extended Rough Sets for Mixed and Incomplete Information Systems. It has five stages: reduct computation, metadata computation, intelligent instance preprocessing, submatrices creation, and fusion. To test the performance of the proposed algorithm, we used 25 datasets from the Machine Learning repository of the University of California at Irvine. All these datasets are imbalanced, have multiple classes, and represent real-world classification problems. The number of classes ranges from three to eight. Most of them also have mixed or incomplete descriptions. We used several performance measures and computed the Instance Retention ratio and the Feature Retention ratio. To determine whether there were significant differences in the performance of the compared algorithms, we used non-parametric hypothesis testing. The statistical analysis results confirm the high quality of the proposed algorithm for selecting features and instances in multiclass imbalanced data.

Student Readiness Scores a Rasch Model’s for Facing E-Learning Using Decision Tree and Ensemble Methods

Article

Apr 2024

REFUEL: rule extraction for imbalanced neural node classification

Article

Jun 2024
MACH LEARN

Imbalanced graph node classification is a highly relevant and challenging problem in many real-world applications. The inherent data scarcity, a central characteristic of this task, substantially limits the performance of neural classification models driven solely by data. Given the limited instances of relevant nodes and complex graph structures, current methods fail to capture the distinct characteristics of node attributes and graph patterns within the underrepresented classes. In this article, we propose REFUEL, a novel approach for highly imbalanced node classification problems in graphs. Whereas symbolic and neural methods have complementary strengths and weaknesses when applied to such problems, REFUEL combines the power of symbolic and neural learning in a novel neural rule-extraction architecture. REFUEL captures the class semantics in the automatically extracted rule vectors. Then, REFUEL augments the graph nodes with the extracted rule vectors and adopts a Graph Attention Network-based neural node embedding, enhancing the downstream neural node representation. Our evaluation confirms the effectiveness of the proposed REFUEL approach for three real-world datasets with different minority class sizes. REFUEL achieves an improvement in precision of at least 4 percentage points on minority classes of 1.5–2% compared to the baselines.

QC-Forest: a Classical-Quantum Algorithm to Provably Speedup Retraining of Random Forest

Preprint

Jun 2024

Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models must account for new training data to maintain model accuracy. This is particularly important in applications where data is periodically and sequentially generated over time in data streams, such as autonomous driving systems and credit card payments. In this setting, performing periodic model retraining with the accumulated old and new data is beneficial, as it fully captures possible drifts in the data distribution over time. However, this is impractical with state-of-the-art classical algorithms for RF, as they scale linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest leverages Des-q, a quantum algorithm for single tree construction and retraining proposed by Kumar et al., by expanding it to multi-class classification, as the original proposal was limited to binary classes, and by introducing an exact classical method to replace an underlying quantum subroutine that incurs a finite error, while maintaining the same poly-logarithmic dependence. Finally, we show that QC-Forest achieves competitive accuracy in comparison to state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up model retraining.

Scalable Rule Lists Learning with Sampling

Preprint

Jun 2024

Learning interpretable models has become a major focus of machine learning research, given the increasing prominence of machine learning in socially important decision-making. Among interpretable models, rule lists are among the best-known and easily interpretable ones. However, finding optimal rule lists is computationally challenging, and current approaches are impractical for large datasets. We present a novel and scalable approach to learn nearly optimal rule lists from large datasets. Our algorithm uses sampling to efficiently obtain an approximation of the optimal rule list with rigorous guarantees on the quality of the approximation. In particular, our algorithm guarantees to find a rule list with accuracy very close to the optimal rule list when a rule list with high accuracy exists. Our algorithm builds on the VC-dimension of rule lists, for which we prove novel upper and lower bounds. Our experimental evaluation on large datasets shows that our algorithm identifies nearly optimal rule lists with a speed-up up to two orders of magnitude over state-of-the-art exact approaches. Moreover, our algorithm is as fast as, and sometimes faster than, recent heuristic approaches, while reporting higher quality rule lists. In addition, the rules reported by our algorithm are more similar to the rules in the optimal rule list than the rules from heuristic approaches.
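For readers unfamiliar with the model class, a rule list is an ordered sequence of if-then rules with a default label; the first matching rule decides the prediction. The sketch below shows that structure and estimates accuracy on a random sample rather than the full dataset, which is the basic intuition behind sampling-based approaches. It does not reproduce the paper's algorithm or its VC-dimension guarantees; the representation, function names, and toy data are assumptions.

```python
import random


def predict(rule_list, default, x):
    """Return the label of the first rule whose conditions all hold for x."""
    for condition, label in rule_list:
        if all(x.get(feature) == value for feature, value in condition.items()):
            return label
    return default


def accuracy(rule_list, default, data):
    """Fraction of (example, label) pairs the rule list classifies correctly."""
    return sum(predict(rule_list, default, x) == y for x, y in data) / len(data)


def sampled_accuracy(rule_list, default, data, sample_size, seed=0):
    """Estimate accuracy from a uniform random sample drawn with replacement."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in range(sample_size)]
    return accuracy(rule_list, default, sample)


# Hypothetical rule list and toy data, for illustration only.
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rain"}, "yes"),
]
data = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rain", "humidity": "high"}, "yes"),
    ({"outlook": "overcast", "humidity": "low"}, "yes"),
    ({"outlook": "sunny", "humidity": "low"}, "no"),
]
full = accuracy(rules, "yes", data)
estimate = sampled_accuracy(rules, "yes", data, sample_size=100)
```

The last example falls through to the default label and is misclassified, so the full-data accuracy is 3 out of 4; the sampled estimate approximates that value without scanning every example.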

Enhancing Compressive Strength In Concrete With Waste Ceramic Tiles: Effects Of Selected Aggregate Modification Treatments, Water-cement Ratio And Curing Periods For Decision Tree Regression Analysis

Article

Jul 2024

Due to the negative impact of ceramic tile waste (CTW) on workability, this study applied three aggregate modification treatments (AMTs) to the CTW, namely cement impregnation (CI), sodium silicate soaking (SS), and slurry wrapping (SW). Concrete batches were prepared with CTW replacing 0%, 25%, and 50% of the gravel, and with water-cement ratios (w/c) of 0.5 and 0.6. Slump tests and compressive strength tests at curing periods of 7 and 28 days were conducted. Experimental results showed that concrete mixes with CI treatment produced the highest compressive strength, while the concrete batches with 0.6 w/c produced higher compressive strengths. However, the concrete mix with SW treatment showed a reduction in compressive strength relative to the mix with untreated CTW. The optimum design mix incorporated CI treatment, 25% CTW replacement, and 0.5 w/c. This mix yielded about 16.7% higher nominal strength than the control mix. A decision tree regression (DTR) model was generated to predict the compressive strength based on different combinations for the concrete mix. Based on the model, the AMT had the most influence on the prediction of compressive strength. Overall, the results indicate that the use of CTW in sustainable concrete production could be further enhanced by the CI treatment method.

Ents: An Efficient Three-party Training Framework for Decision Trees by Communication Optimization

Preprint

Jun 2024

Multi-party training frameworks for decision trees based on secure multi-party computation enable multiple parties to train high-performance models on distributed private data with privacy preservation. The training process essentially involves frequent dataset splitting according to the splitting criterion (e.g. Gini impurity). However, existing multi-party training frameworks for decision trees demonstrate communication inefficiency due to the following issues: (1) They suffer from huge communication overhead in securely splitting a dataset with continuous attributes. (2) They suffer from huge communication overhead due to performing almost all the computations on a large ring to accommodate the secure computations for the splitting criterion. In this paper, we are motivated to present an efficient three-party training framework, namely Ents, for decision trees by communication optimization. For the first issue, we present a series of training protocols based on the secure radix sort protocols to efficiently and securely split a dataset with continuous attributes. For the second issue, we propose an efficient share conversion protocol to convert shares between a small ring and a large ring to reduce the communication overhead incurred by performing almost all the computations on a large ring. Experimental results from eight widely used datasets show that Ents outperforms state-of-the-art frameworks by $5.5\times \sim 9.3\times$ in communication sizes and $3.9\times \sim 5.3\times$ in communication rounds. In terms of training time, Ents yields an improvement of $3.5\times \sim 6.7\times$. To demonstrate its practicality, Ents requires less than three hours to securely train a decision tree on a widely used real-world dataset (Skin Segmentation) with more than 245,000 samples in the WAN setting.

Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification

Preprint

Jun 2024

Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, a novel approach called Prior-Data Fitted Networks (TabPFN) has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no need for additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance when dealing with categorical features. To overcome this limitation, we propose FT-TabPFN, which is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle classification features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.

On the Effectiveness of Trivial Refactorings in Predicting Non-trivial Refactorings

Article

Apr 2024

Refactoring is the process of restructuring source code without changing the external behavior of the software. Refactoring can bring many benefits, such as removing code with poor structural quality, avoiding or reducing technical debt, and improving maintainability, reuse, or code readability. Although there is research on how to predict refactorings, there is still a clear lack of studies that assess the impact of operations considered less complex (trivial) on more complex (non-trivial) ones. In addition, the literature suggests conducting studies that invest in improving automated solutions through detecting and correcting refactoring. This study aims to accurately identify refactoring activity in non-trivial operations by means of trivial operations. For this, we use supervised learning classifier models, considering the influence of trivial refactorings and evaluating performance in other data domains. To achieve this goal, we assembled 3 datasets totaling 1,291 open-source projects, extracted approximately 1.9M refactoring operations, collected 45 attributes and code metrics from each file involved in the refactoring, and used the supervised learning algorithms Decision Tree, Random Forest, Logistic Regression, Naive Bayes, and Neural Network to investigate the impact of trivial refactorings on the prediction of non-trivial refactorings. For this study, we contextualize the data and use the term context for each experiment configuration that combines trivial and non-trivial refactorings. Our results indicate that: (i) models such as Random Forest, Decision Tree, and Neural Networks performed very well when trained with code metrics to detect refactoring opportunities. However, only the first two were able to demonstrate good generalization in other data domain contexts of refactoring; (ii) separating trivial and non-trivial refactorings into different classes resulted in a more efficient model, and this held even when the model was tested on different datasets; (iii) using balancing techniques that increase or decrease samples may not be the best strategy to improve models trained on datasets composed of code metrics and configured according to our study.

Branches: A Fast Dynamic Programming and Branch & Bound Algorithm for Optimal Decision Trees

Preprint

Jun 2024

Decision Tree Learning is a fundamental problem for Interpretable Machine Learning, yet it poses a formidable optimization challenge. Despite numerous efforts dating back to the early 1990s, practical algorithms have only recently emerged, primarily leveraging Dynamic Programming (DP) and Branch & Bound (B&B) techniques. These breakthroughs led to the development of two distinct approaches. Algorithms like DL8.5 and MurTree operate on the space of nodes (or branches); they are very fast but do not penalise complex Decision Trees, i.e. they do not solve for sparsity. On the other hand, algorithms like OSDT and GOSDT operate on the space of Decision Trees; they solve for sparsity but at the expense of speed. In this work, we introduce Branches, a novel algorithm that integrates the strengths of both paradigms. Leveraging DP and B&B, Branches achieves exceptional speed while also solving for sparsity. Central to its efficiency is a novel analytical bound enabling substantial pruning of the search space. Theoretical analysis demonstrates that Branches has lower complexity compared to state-of-the-art methods, a claim validated through extensive empirical evaluation. Our results illustrate that Branches not only greatly outperforms existing approaches in terms of speed and number of iterations, but also consistently yields optimal Decision Trees.

Semantic Segmentation of Mice Wounds

Conference Paper

May 2024

Semantic segmentation has been successfully explored in biological studies to handle various applications, such as identifying wounds. This study explores two image segmentation approaches to identify mice wounds, specifically the U-Net and Random Forest algorithms. The latter was combined with features extracted from the first two layers of VGG16, which was used as a feature extractor. Experiments were performed with a real dataset developed by the Pain, Neuropathy, and Inflammation Laboratory at the State University of Londrina with the approval of the University Ethics Committee on Animal Research and Welfare. The experimental results were promising, showing that both alternatives can provide accurate predictions for most images regarding FScore and IoU evaluation measures. Statistical tests were also applied, showing that U-Net obtained statistically better results with an average FScore of 0.72 and IoU of 0.58.
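The two evaluation measures named above can be computed from pixel-level counts of true positives, false positives, and false negatives. The sketch below assumes binary masks stored as nested lists of 0/1 and assumes the F-score is the balanced F1; the function names and the toy masks are illustrative and are not taken from the study.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives over all pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred, truth):
    """Intersection over union; defined as 1.0 when both masks are empty."""
    tp, fp, fn = confusion_counts(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 1.0


def f_score(pred, truth):
    """Balanced F-score (F1, equivalent to the Dice coefficient for binary masks)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Toy 2x3 masks: 1 marks a wound pixel (made-up values).
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
mask_iou = iou(pred, truth)
mask_f = f_score(pred, truth)
```

For a single pair of masks the two measures are tied by F = 2·IoU / (1 + IoU), so the F-score is never lower than the IoU.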

Superfast Selection for Decision Tree Algorithms

Preprint

May 2024

We present a novel and systematic method, called Superfast Selection, for selecting the "optimal split" for decision tree and feature selection algorithms over tabular data. The method speeds up split selection on a single feature by lowering the time complexity, from O(MN) (using the standard selection methods) to O(M), where M represents the number of input examples and N the number of unique values. Additionally, the need for pre-encoding, such as one-hot or integer encoding, for feature value heterogeneity is eliminated. To demonstrate the efficiency of Superfast Selection, we empower the CART algorithm by integrating Superfast Selection into it, creating what we call Ultrafast Decision Tree (UDT). This enhancement enables UDT to complete the training process with a time complexity O(KMlogM) (K is the number of features). Additionally, the Training Only Once Tuning enables UDT to avoid the repetitive training process required to find the optimal hyper-parameter. Experiments show that the UDT can finish a single training on KDD99-10% dataset (494K examples with 41 features) within 1 second and tuning with 214.8 sets of hyper-parameters within 0.25 second on a laptop.
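The abstract does not spell out how Superfast Selection works, so the following is only a sketch of the general counting idea that makes per-feature split selection cheap: one pass over the M examples collects class counts per unique value, and every candidate "x == value" split is then scored from those counts without rescanning the data. The function names, the equality-split form, and the use of weighted Gini impurity are assumptions, not the paper's method.

```python
from collections import Counter, defaultdict


def gini_from_counts(counts, total):
    """Gini impurity computed from class counts instead of raw labels."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_equality_split(values, labels):
    """Return (value, weighted_gini) for the best 'x == value' versus rest split."""
    per_value = defaultdict(Counter)  # unique value -> class counts
    overall = Counter()
    for v, lab in zip(values, labels):  # single pass over the M examples
        per_value[v][lab] += 1
        overall[lab] += 1
    m = len(labels)
    best = None
    for v, inside in per_value.items():  # one step per unique value, no rescan
        n_in = sum(inside.values())
        outside = overall - inside  # Counter subtraction drops zero counts
        n_out = m - n_in
        score = (n_in * gini_from_counts(inside, n_in)
                 + n_out * gini_from_counts(outside, n_out)) / m
        if best is None or score < best[1]:
            best = (v, score)  # ties keep the value seen first
    return best


# Toy categorical feature; no one-hot or integer encoding is needed (made-up data).
values = ["red", "red", "blue", "green", "blue", "green"]
labels = [1, 1, 0, 0, 0, 1]
best = best_equality_split(values, labels)
```

A naive alternative would rescan all M examples for each of the N unique values; working from precomputed counts is what removes that factor in this sketch.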

Detecting False Data Injection Attacks Using Machine Learning-Based Approaches for Smart Grid Networks

Article

May 2024

Current electricity sectors will be unable to keep up with commercial and residential customers’ increasing demand for data-enabled power systems. Therefore, next-generation power systems must be developed. The smart grid, an advanced power system of the future, can make decisions, estimate loads, and execute other data-related jobs. Customers can adjust their needs in smart grid systems by monitoring bill information. Due to their reliance on data networks, smart grids are vulnerable to cyberattacks that could compromise billing data and cause power outages and other problems. A false data injection attack (FDIA) is a significant attack that targets the corruption of state estimation vectors. The primary goal of this paper is to show the impact of an FDIA on a power dataset and to use machine learning algorithms, implemented in Python, to detect the attack. In the experiment, we used the power dataset from the IoT server of a 10 KV solar PV system (to mimic a smart grid system) in a controlled laboratory environment to test the effect of an FDIA and detect this anomaly using a machine learning approach. Different machine learning models were used to detect the attack and find the most suitable approach to achieve this goal. This paper compares machine learning algorithms (such as random forest, isolation forest, logistic regression, decision tree, autoencoder, and feed-forward neural network) in terms of their effectiveness in detecting FDIAs. The highest F1 score of 0.99 was achieved by the decision tree algorithm, closely followed by the logistic regression method, which had an F1 score of 0.98. These algorithms also demonstrated high precision, recall, and model accuracy, demonstrating their efficacy in detecting FDIAs. The research presented in this paper indicates that combining logistic regression and decision tree in an ensemble leads to significant performance enhancements. The resulting model achieves an accuracy of 0.99, a precision of 1, and an F1 score of 1.

A Computational Decision-Tree Approach to Inform Post-Conviction Intake Decisions

Article

May 2024

How might data analytic tools support intake decisions? When faced with a request for post-conviction assistance, innocence organizations’ intake staff must determine (1) whether the applicant can be shown to be factually innocent, and (2) whether the organization has the resources to help. These difficult categorization decisions are often made with incomplete information (Weintraub, 2022). We explore data from the National Registry of Exonerations (NRE; 4/26/2023, N = 3,284 exonerations) to inform such decisions, using patterns of features associated with successful prior cases. We first reproduce Berube et al. (2023)’s latent class analysis, identifying four underlying categories across cases. We then apply a second technique to increase transparency, decision tree analysis (WEKA, Frank et al., 2013). Decision trees can decompose complex patterns of data into ordered flows of variables, with the potential to guide intermediate steps that could be tailored to the particular organization’s limitations, areas of expertise, and resources.

Applications of 3D modeling in cryptic species classification of molluscs

Article

Full-text available

May 2024
MAR BIOL

Classification of cryptic species is important for assessing biodiversity and conducting ecological studies. However, morphological classification methods face the loss of morphological information due to subjectivity in geometric morphometrics, while an incomplete database and horizontal gene transfer limit the molecular approach. A novel approach combining 3D modeling and artificial intelligence algorithms using morphological and molecular data was developed for species classification. Cryptic species from the Vignadula genus were used to test the feasibility of this new approach. Molecular identification results as data labels were used for training models, and for validating classification results of machine learning and deep learning. Our approach achieved accuracies of over 80% in distinguishing between V. atrata and V. mangle, which were identified by molecular data along China’s coast. The result of the confusion matrix indicated the misidentified individuals were due to the morphological similarity in the intermediate zone. The feature importance analysis highlighted the significant contribution of average curvature—a 3D feature—to the task, indicating the feasibility of the 3D model in cryptic species classification. Utilizing 3D models and artificial intelligence, this study presents a novel approach for classifying cryptic species of molluscs.

Evaluating Explainable Machine Learning Models for Clinicians

Article

Full-text available

May 2024

Gaining clinicians’ trust will unleash the full potential of artificial intelligence (AI) in medicine, and explaining AI decisions is seen as the way to build trustworthy systems. However, explainable artificial intelligence (XAI) methods in medicine often lack a proper evaluation. In this paper, we present our evaluation methodology for XAI methods using forward simulatability. We define the Forward Simulatability Score (FSS) and analyze its limitations in the context of clinical predictors. Then, we applied FSS to our XAI approach defined over an ML-RO, a machine learning clinical predictor based on random optimization over a multiple kernel support vector machine (SVM) algorithm. To compare FSS values before and after the explanation phase, we test our evaluation methodology for XAI methods on three clinical datasets, namely breast cancer, VTE, and migraine. The ML-RO system is a good model on which to test our XAI evaluation strategy based on the FSS. Indeed, ML-RO outperforms two other base models—a decision tree (DT) and a plain SVM—in the three datasets and gives the possibility of defining different XAI models: TOPK, MIGF, and F4G. The FSS evaluation score suggests that the explanation method F4G for the ML-RO is the most effective in two datasets out of the three tested, and it shows the limits of the learned model for one dataset. Our study aims to introduce a standard practice for evaluating XAI methods in medicine. By establishing a rigorous evaluation framework, we seek to provide healthcare professionals with reliable tools for assessing the performance of XAI methods to enhance the adoption of AI systems in clinical practice.

SwapTransformer: Highway Overtaking Tactical Planner Model via Imitation Learning on OSHA Dataset

Article

Full-text available

Jan 2024

This paper investigates the high-level decision-making problem in highway scenarios regarding lane changing and overtaking slower vehicles. In particular, this paper aims to improve the Travel Assist feature for automatic overtaking and lane changes on highways. About 9 million samples including lane images and other dynamic objects are collected in simulation. This data is released as the Overtaking on Simulated HighwAys (OSHA) dataset to tackle this challenge. To solve this problem, an architecture called SwapTransformer is designed and implemented as an imitation learning approach on the OSHA dataset. Moreover, auxiliary tasks such as future points and car distance network predictions are proposed to aid the model in better understanding the surrounding environment. The performance of the proposed solution is compared with a multi-layer perceptron (MLP) and multi-head self-attention networks as baselines in a simulation environment. We also demonstrate the performance of the model with and without auxiliary tasks. All models are evaluated based on different metrics such as time to finish each lap, number of overtakes, and speed difference with speed limit. The evaluation shows that the SwapTransformer model outperforms other models in different traffic densities in the inference phase.

Shortening and Personalizing Psychodiagnostic Assessments with Decision Tree-Machine Learning Classifiers: An Application Example Based on the Patient Health Questionnaire-9

Article

Full-text available

May 2024
Int J Ment Health Addiction

The development of psychological assessment tools that accurately and efficiently classify individuals as having or not a specific diagnosis is a major challenge for test developers and mental health professionals. This paper shows how machine learning (ML) provides a valuable framework to improve the accuracy and efficiency of psychodiagnostic classifications. The method is illustrated using an empirical example based on the Patient Health Questionnaire-9 (PHQ-9). The results show that, compared to traditional scorings of the PHQ-9, that based on decision tree (DT) algorithms is more advantageous in terms of accuracy and efficiency. In addition, the DT-based method facilitates the development of short test forms and improves the diagnostic performance of the test by integrating external information (e.g., demographic variables) into the scoring process. These findings suggest that DT-algorithms and ML applications such as feature selection represent a valuable method for supporting test developers and mental health professionals, and highlight the potential of ML for advancing the field of psychological assessment.

Multi-Label Lifelong Machine Learning: A Scoping Review of Algorithms, Techniques, and Applications

Article

Full-text available

Jan 2024

Lifelong machine learning concerns the development of systems that continuously learn from diverse tasks, incorporating new knowledge without forgetting the knowledge they have previously acquired. Multi-label classification is a supervised learning process in which each instance is assigned multiple non-exclusive labels, with each label denoted as a binary value. One of the main challenges within the lifelong learning paradigm is the stability-plasticity dilemma, which entails balancing a model’s adaptability in terms of incorporating new knowledge with its stability in terms of retaining previously acquired knowledge. When faced with multi-label data, the lifelong learning challenge becomes even more pronounced, as it becomes essential to preserve relations between multiple labels across sequential tasks. This scoping review explores the intersection of lifelong learning and multi-label classification, an emerging domain that integrates continual adaptation with intricate multi-label datasets. By analyzing the existing literature, we establish connections, identify gaps in the existing research, and propose new directions for research to improve the efficacy of multi-label lifelong learning algorithms. Our review unearths a growing number of algorithms and underscores the need for specialized evaluation metrics and methodologies for the accurate assessment of their performance. We also highlight the need for strategies that incorporate real-world data from varying contexts into the learning process to fully capture the nuances of real-world environments.

A Comparative Study of Stock Selection Models Based on Decision Tree Algorithms

Article

Full-text available

May 2024

Yehan Wang

In this paper, the decision tree model in data mining is applied to select stock characteristics that can be effectively used for stock selection by using the C4.5 algorithm and the CART algorithm, respectively, in combination with the strategies of fundamental analysis and technical analysis. The paper concludes that the decision tree models constructed by the C4.5 and CART algorithms both have better classification ability for stock selection and portfolio construction, but the decision tree model constructed by the C4.5 algorithm is simpler. The stock portfolios determined by the decision tree model are able to achieve an excess return of 13.4% relative to the CSI 300 index, thus proving that the decision tree model is effective in stock selection and stock portfolio construction.
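The two algorithms compared in the abstract above differ, among other things, in how they score a candidate split: C4.5 uses the gain ratio and CART uses the decrease in Gini impurity. The sketch below illustrates only these two criteria on a toy categorical feature. The feature values and labels are invented for illustration and do not come from the paper. Pruning, continuous attributes, and missing-value handling are omitted, and the toy feature is binary so that the multiway partition coincides with a CART-style binary split.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by the value of a categorical feature."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def gain_ratio(values, labels):
    """C4.5-style criterion: information gain divided by split information."""
    n = len(labels)
    groups = partition(values, labels)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups)
    return info_gain / split_info if split_info > 0 else 0.0


def gini_decrease(values, labels):
    """CART-style criterion: reduction in weighted Gini impurity."""
    n = len(labels)
    groups = partition(values, labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in groups)


# Toy data (hypothetical): one informative and one uninformative feature.
labels = ["buy", "buy", "skip", "skip"]
informative = ["high", "high", "low", "low"]
uninformative = ["a", "b", "a", "b"]

print(gain_ratio(informative, labels), gini_decrease(informative, labels))
print(gain_ratio(uninformative, labels), gini_decrease(uninformative, labels))
```

Both criteria rank the informative feature above the uninformative one here. They can disagree on features with many distinct values, which the split-information denominator in the gain ratio is designed to penalize.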

Smart Expert System: Large Language Models as Text Classifiers

Preprint

Full-text available

May 2024

Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers. The system simplifies the traditional text classification workflow, eliminating the need for extensive preprocessing and domain expertise. The performance of several LLMs, machine learning (ML) algorithms, and neural network (NN) based structures is evaluated on four datasets. Results demonstrate that certain LLMs surpass traditional methods in sentiment analysis, spam SMS detection and multi-label classification. Furthermore, it is shown that the system's performance can be further enhanced through few-shot or fine-tuning strategies, making the fine-tuned model the top performer across all datasets. Source code and datasets are available in this GitHub repository: https://github.com/yeyimilk/llm-zero-shot-classifiers.

Use of machine learning models in condition monitoring of abrasive belt in robotic arm grinding process

Article

Full-text available

May 2024
J INTELL MANUF

Although the aspects that affect the performance and the deterioration of abrasive belt grinding are known, wear prediction of abrasive belts in the robotic arm grinding process is still challenging. Massive wear of coarse grains on the belt surface has a serious impact on the integrity of the tool and it reduces the surface quality of the finished products. Conventional wear status monitoring strategies that use special tools result in the cessation of the manufacturing production process which sometimes takes a long time and is highly dependent on human capabilities. The erratic wear behavior of abrasive belts demands machining processes in the manufacturing industry to be equipped with intelligent decision-making methods. In this study, to maintain a uniform tool movement, an abrasive belt grinding is installed at the end-effector of a robotic arm to grind the surface of a mild steel workpiece. Simultaneously, accelerometers and force sensors are integrated into the system to record its vibration and forces in real-time. The vibration signal responses from the workpiece and the tool reflect the wear level of the grinding belt to monitor the tool’s condition. Intelligent monitoring of abrasive belt grinding conditions using several machine learning algorithms that include K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Decision Tree (DT) are investigated. The machine learning models with the optimized hyperparameters that produce the highest average test accuracy were found using the DT, Random Forest (RF), and XGBoost. Meanwhile, the lowest latency was obtained by DT and RF. A decision-tree-based classifier could be a promising model to tackle the problem of abrasive belt grinding prediction. The application of various algorithms will be a major focus of our research team in future research activities, investigating how we apply the selected methods in real-world industrial environments.

A novel code representation for detecting Java code clones using high-level and abstract compiled code representations

Article

Full-text available

May 2024
PLOS ONE

In software development, it’s common to reuse existing source code by copying and pasting, resulting in the proliferation of numerous code clones—similar or identical code fragments—that detrimentally affect software quality and maintainability. Although several techniques for code clone detection exist, many encounter challenges in effectively identifying semantic clones due to their inability to extract syntax and semantics information. Fewer techniques leverage low-level source code representations like bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools, like the Soot framework. Leveraging this combined representation, fifteen machine-learning models are trained to effectively detect code clones. Evaluation on a large dataset demonstrates the models’ efficacy in accurately identifying semantic clones. Among these classifiers, ensemble classifiers, such as the LightGBM classifier, exhibit exceptional accuracy. Linearly combining features enhances the effectiveness of the models compared to multiplication and distance combination techniques. The experimental findings indicate that the proposed method can outperform the current clone detection techniques in detecting semantic clones.

An WiFi CSI Signal Enhancement Framework For Activity Recognition Using Machine Learning Automatic Segmentation

Article

Full-text available

May 2024

Recently, human activity recognition based on wireless signals has become an active and promising research direction. Researchers have shown that machine learning (ML) models can accurately classify some activities of a person standing between the WiFi transmitter and receiver. However, the availability of public datasets is limited due to labor-intensive dataset collection. Moreover, an efficient signal segmentation algorithm is required for application in practical scenarios. This paper presented a signal enhancement framework for WiFi-based human activity recognition using ML-based signal segmentation. Specifically, we proposed a stable channel state information (CSI) collection platform based on stable USRP devices. Using this platform, we released a public dataset (WiAR-UIT) for various human activities to control smart home devices. To enhance the prediction accuracy as well as the converging ability of ML models, we proposed two algorithms for automatic signal segmentation. The first algorithm uses conventional signal processing procedures (SIGPRO-SEGM). The second algorithm is dataset-independent and based on a CNN model (ML-SEGM). Applying these segmentation algorithms to our dataset, the best performance of 99.2% accuracy is obtained. Moreover, the accuracy is improved by 35% for some ML models including K-nearest neighbors, support vector machine, decision tree, random forest, and multi-layer perceptron. Finally, we have deployed a real-time client–server application using the above segmentation algorithms to emphasize the potential and practicality of the proposed research direction.

UAS-Guided Analysis of Electric and Magnetic Field Distribution in High-Voltage Transmission Lines (Tx) and Multi-Stage Hybrid Machine Learning Models for Battery Drain Estimation

Article

Full-text available

Jan 2024

Unmanned aerial systems/vehicles (UAS/UAVs) are widely employed for inspecting high-voltage (HV) Tx lines, characterized by elevated electric (E) and magnetic (H) fields. Operating on batteries, these UAVs are equipped with various electrical sensors, microprocessors, and motors, all susceptible to E/H field effects. This paper explores the distribution of E/H fields in multiple HVTx lines and a microwave tower. Data was collected from one 250 kV DC, four AC Tx lines (69 kV, 230 kV, 345 kV, and 500 kV), and one microwave tower, utilizing DJI UAVs (M2EA, M30, and M300) equipped with onboard setups. Measurements included E field in V/m, H field in mG, Battery voltage in V, Battery current in A, Battery percentage, Battery Temperature in °F, latitude, and longitude. Preliminary findings highlight larger E/H field levels within AC Tx lines than DC Tx lines. The paper discusses conditions influencing E/H field strength during UAV operation. Additionally, a proposed multi-staged random forest regressor (RFR) and k-nearest neighbor (KNN) hybrid machine learning (ML) model forecasts UAV battery drain. Results indicate that the hybrid RFR and KNN model yields lower MAPE values compared to standalone models.
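The abstract above compares models by MAPE (mean absolute percentage error). As an illustrative sketch, the metric can be computed as follows. The function name and the sample values are hypothetical and are not measurements from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Raises ValueError if the inputs differ in length, are empty,
    or contain a zero actual value (the ratio is undefined there).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical battery-percentage readings and model forecasts.
actual = [100.0, 80.0, 50.0]
predicted = [90.0, 84.0, 50.0]
score = mape(actual, predicted)
print(score)
```

A lower value means the forecasts are closer to the observed values in relative terms, which is the sense in which the hybrid model is reported to do better than the standalone models.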

A Benchmark of Facial Recognition Pipelines and Co-Usability Performances of Modules

Article

Full-text available

Mar 2024

Researchers from leading technology companies, prestigious universities worldwide, and the open-source community have made substantial strides in the field of facial recognition studies in recent years. Experiments indicate that facial recognition approaches have not only achieved but surpassed human-level accuracy. A contemporary facial recognition process comprises four key stages: detection, alignment, representation, and verification. Presently, the focus of facial recognition research predominantly centers on the representation stage within the pipelines. This study conducted experiments exploring alternative combinations of nine state-of-the-art facial recognition models, six cutting-edge face detectors, three distance metrics, and two alignment modes. The co-usability performances of implementing and adapting these modules were assessed to precisely gauge the impact of each module on the pipeline. Theoretical and practical findings from the study aim to provide optimal configuration sets for facial recognition pipelines.

PREDICTING CUSTOMER CHURN IN THE TELECOMMUNICATION INDUSTRY USING MACHINE LEARNING ALGORITHMS: Performance comparison with logistic regression, random forest, and gradient boosting techniques.

Article

Full-text available

Aug 2022
MACH LEARN

This research embarked on a comprehensive analysis of customer churn prediction in the telecommunication sector using various machine learning algorithms. Primarily, the study concentrated on three algorithms: Logistic Regression, Random Forest, and Gradient Boosting. The performance of these algorithms was gauged on empirical data, revealing varied results. Logistic Regression offered a fundamental approach, often serving as a benchmark in churn prediction tasks. Meanwhile, the ensemble techniques, Random Forest and Gradient Boosting, showcased their prowess in handling large data with many predictors, often outperforming simpler models in intricate tasks. Furthermore, this study delved deep into hyperparameter tuning to amplify the accuracy of the Gradient Boosting and Random Forest algorithms. The results illustrated subtle performance enhancements, albeit the trade-offs in precision and recall became evident. Notably, the Gradient Boosting Classifier, when fine-tuned, displayed an accuracy of approximately 80%, with feature importance highlighting 'Contract', 'tenure', and 'MonthlyCharges' as significant predictors. In contrast, the Logistic Regression algorithm manifested consistent performance, making it a reliable option, albeit lacking the sophistication of ensemble methods. This investigation reaffirms the notion posited by numerous scholars, such as Moro et al. (2014) and Barakat et al. (2020), emphasizing the dynamic nature of machine learning algorithms in predicting customer churn. In conclusion, while each algorithm has its merits, their efficacious application rests heavily on understanding the underlying data and the specific business context. Future directions suggest delving into more advanced algorithms and further feature engineering to bolster prediction accuracy.

Mathematical Programming for Data Mining

Chapter

Mar 2024

Machine Learning-based OWC Diagnosis Using Real Measured Data from Wave Power Plants

Conference Paper

Dec 2023

Using artificial intelligence to rapidly identify microplastics pollution and predict microplastics environmental behaviors

Article

Jun 2024

Food Crop Disease Identification system using ML

Conference Paper

Dec 2023

Outlier Summarization via Human Interpretable Rules

Article

May 2024

Outlier detection is crucial for preventing financial fraud, network intrusions, and device failures. Users often expect systems to automatically summarize and interpret outlier detection results to reduce human effort and convert outliers into actionable insights. However, existing methods fail to effectively assist users in identifying the root causes of outliers, as they only pinpoint data attributes without considering outliers in the same subspace may have different causes. To fill this gap, we propose STAIR, which learns concise and human-understandable rules to summarize and explain outlier detection results with finer granularity. These rules consider both attributes and associated values. STAIR employs an interpretation-aware optimization objective to generate a small number of rules with minimal complexity for strong interpretability. The learning algorithm of STAIR produces a rule set by iteratively splitting the large rules and is optimal in maximizing this objective in each iteration. Moreover, to effectively handle high dimensional, highly complex data sets that are hard to summarize with simple rules, we propose a localized STAIR approach, called L-STAIR. Taking data locality into consideration, it simultaneously partitions data and learns a set of localized rules for each partition. Our experimental study on many outlier benchmark datasets shows that STAIR significantly reduces the complexity of the rules required to summarize the outlier detection results, thus more amenable for humans to understand and evaluate.

An Exploratory Analysis on Gender-Related Dropout Students in Distance Learning Higher Education using Machine Learning

Conference Paper

May 2024

Context: School dropout in distance learning has become a growing concern in higher education. Private institutions exhibit a 33.6% dropout rate, while public institutions show a slightly lower rate at 31.2%, with an upward trend. Problem: Studies focus on categorical indicators of lack of time, students' personal lives, the educational institution, and course instructors. However, research is still needed to explicitly focus on identifying patterns related to gender with students abandoning courses. Solution: Identifying gender-related patterns among indicators leading to dropout in 36 distance learning undergraduate courses. Theory: Our study incorporated Social Learning Theory alongside Social Cognitive Theory. Social Learning Theory provided insights into how academic performance metrics influence student dropout rates. Social Cognitive Theory also examined the relationship between students' personal factors, including gender and marital status, and their learning behaviors. Method: The research conducted is descriptive with a quantitative approach. An experiment was conducted to categorize and identify the most relevant features influencing dropout using machine learning. Results: The results provide patterns for investigated aspects, highlighting women in most analyses. Time-related characteristics exhibit a higher correlation with dropout. Features related to student academic performance and university campus location play a crucial role in classifying a student as a potential dropout, according to the XGBoost classifier, yielding the best performance results. Conclusion: These analyses offer an understanding of factors influencing distance learning dropout, drawing parallels with gender-related situations influencing dropout decisions. This allows for adopting preventive and personalized measures to enhance student retention and improve the academic experience.

Just Change on Change: Adaptive Splitting Time for Decision Trees in Data Stream Classification

Conference Paper

May 2024

Explainable AI for cybersecurity automation, intelligence and trustworthiness in digital twin: Methods, taxonomy, challenges and prospects

Article

Full-text available

May 2024

Enhanced migrating birds optimization algorithm for optimization problems in different domains

Article

Full-text available

May 2024
ANN OPER RES

Migrating birds optimization algorithm is a promising metaheuristic algorithm recently introduced to the optimization community. In this study, we propose a superior version of the migrating birds optimization algorithm by hybridizing it with the simulated annealing algorithm which is one of the most popular metaheuristics. The new algorithm, called MBOx, is compared with the original migrating birds optimization and four well-known metaheuristics, including the simulated annealing, differential evolution, genetic algorithm and recently proposed harris hawks optimization algorithm. The extensive experiments are conducted on problem instances from both discrete and continuous domains; feature selection problem, obstacle neutralization problem, quadratic assignment problem and continuous functions. On problems from discrete domain, MBOx outperforms the original MBO and others by up to 20.99%. On the continuous functions, it is observed that MBOx does not lead the competition but takes the second position. As a result, MBOx provides a significant performance improvement and therefore, it is a promising solver for computational optimization problems.

Models: Overview on Predictive Models

Chapter

May 2024

Arthur Charpentier

In this chapter, we give an overview on predictive modeling, used by actuaries. Historically, we moved from relatively homogeneous portfolios to tariff classes, and then to modern insurance, with the concept of “premium personalization.” Modern modeling techniques are presented, starting with econometric approaches, before presenting machine-learning techniques.

Convergence rates of oblique regression trees for flexible function libraries

Article

Apr 2024
ANN STAT

Memory from nonsense syllables to novels: A survey of retention

Article

Full-text available

May 2024

Memory has been the subject of scientific study for nearly 150 years. Because a broad range of studies have been done, we can now assess how effective memory is for a range of materials, from simple nonsense syllables to complex materials such as novels. Moreover, we can assess memory effectiveness for a variety of durations, anywhere from a few seconds up to decades later. Our aim here is to assess a range of factors that contribute to the patterns of retention and forgetting under various circumstances. This was done by taking a meta-analytic approach that assesses performance across a broad assortment of studies. Specifically, we assessed memory across 256 papers, involving 916 data sets (e.g., experiments and conditions). The results revealed that exponential-power, logarithmic, and linear functions best captured the widest range of data compared with power and hyperbolic-power functions. Given previous research on this topic, it was surprising that the power function was not the best-fitting function most often. Contrary to what would be expected, a substantial amount of data also revealed either stable memory over time or improvement. These findings can be used to improve our ability to model and predict the amount of information retained in memory. In addition, this analysis of a large set of memory data provides a foundation for expanding behavioral and neuroimaging research to better target areas of study that can inform the effectiveness of memory.

Application of machine learning in delineating groundwater contamination in present and climate change scenarios

Article

May 2024

Fuzzy three-way rule learning and its classification methods

Article

Apr 2024
FUZZY SET SYST

A Review on the Impact of Data Representation on Model Explainability

Article

Apr 2024

Mostafa Haghir Chehreghani

In recent years, advanced machine learning and artificial intelligence techniques have gained popularity due to their ability to solve problems across various domains with high performance and quality. However, these techniques are often so complex that they fail to provide simple and understandable explanations for the outputs they generate. To address this issue, the field of explainable artificial intelligence has recently emerged. On the other hand, most data generated in different domains are inherently structural; that is, they consist of parts and relationships among them. Such data can be represented using either a simple data-structure or form, such as a vector, or a complex data-structure, such as a graph. The effect of this representation form on the explainability and interpretability of machine learning models is not extensively discussed in the literature. In this survey paper, we review efficient algorithms proposed for learning from inherently structured data, emphasizing how their representation form affects the explainability of learning models. A conclusion of our literature review is that using complex forms or data-structures for data representation improves not only the learning performance, but also the explainability and transparency of the model.

Bigmart Sales Prediction Based on Voting Classifier Algorithm and Linear Regression

Article

Apr 2024

Suresh. B

Machine Learning is a technology that allows machines to become more accurate in predicting outcomes without being explicitly programmed for it. The basic premise of machine learning is to build models and deploy algorithms that can receive input data and use statistical analysis to predict an output while modifying outputs as the new data becomes available. These models can be used in different areas and trained to match the expectations so that accurate steps can be taken to achieve the organization’s target. In this paper, the case of Big Mart Shopping Centre has been discussed to predict the sales of different types of items and for understanding the effects of different factors on the sales of different items. Taking various features of a dataset collected for Big Mart, and the methodology followed for building a predictive model, results with high levels of accuracy are generated, and these observations can be used to take decisions to improve sales. Keywords: Machine Learning, Sales Prediction, Big Mart, Voting classifier algorithm, Linear Regression.
