Egyptian Informatics Journal 24 (2023) 100386
Available online 31 July 2023
1110-8665/© 2023 THE AUTHORS. Published by Elsevier B.V. on behalf of Faculty of Computers and Artificial Intelligence, Cairo University. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Full length article
Reliable prediction of software defects using Shapley interpretable machine
learning models
Yazan Al-Smadi a, Mohammed Eshtay b, Ahmad Al-Qerem a, Shadi Nashwan c,*, Osama Ouda c, A.A. Abd El-Aziz d,e

a Department of Computer Science, Faculty of Information Technology, Zarqa University, Zarqa 13110, Jordan
b Luminus Technical University College, Amman 11118, Jordan
c Computer Science Department, Computer and Information Sciences College, Jouf University, Sakaka 72388, Saudi Arabia
d Information System Department, Computer and Information Sciences College, Jouf University, Sakaka 72388, Saudi Arabia
e Information Systems & Technology Department, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt
ARTICLE INFO
Keywords:
Software Defect Prediction
Feature importance
Machine learning
Model interpretation
Shapley Additive Explanation
ABSTRACT
Predicting defect-prone software components can play a significant role in allocating relevant testing resources to fault-prone modules and hence increasing the business value of software projects. Most of the current software defect prediction studies utilize traditional supervised machine learning algorithms to predict defects in software applications. The software datasets utilized in such studies are imbalanced, and therefore the reported results cannot be reliably used to judge their performance. Moreover, it is important to explain the output of machine learning models employed in fault-prediction techniques to determine the contribution of each utilized feature to the model output. In this paper, we propose a new framework for predicting software defects utilizing eleven machine learning classifiers over twelve different datasets. For feature selection, we employ four different nature-inspired search algorithms, namely, particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization. Moreover, we make use of the synthetic minority oversampling technique (SMOTE) to address the problem of data imbalance. Furthermore, we utilize the Shapley additive explanation model for highlighting the most determinative features. The obtained results demonstrate that gradient boosting, stochastic gradient boosting, decision trees, and categorical boosting outperform the other tested models with over 90% accuracy and ROC-AUC. Additionally, we found that the ant colony optimization technique outperforms the other tested feature selection techniques.
1. Introduction
Nowadays, the globe has seen tremendous growth in the application of technology in a variety of essential industries. As a consequence, software quality is the most critical part of software system development. Software testing, on the other hand, is a well-known quality assurance activity that aims to evaluate the internal and external quality aspects of a given software module, subsystem, or component using predefined criteria under certain conditions [1]. The process of ensuring high software quality is inextricably linked to the presence of software flaws; a reliable software system is deemed to be a good one [2]. For many years, artificial intelligence (AI) technologies [3] have been used to face several challenges in software quality, including forecasting software flaws at early stages, predicting software size and cost, and estimating software development effort.
Software organizations are seeing significant expansion and an increase in the demand for software projects in numerous medical, military, and industrial domains. However, to the best of our knowledge, it is difficult to produce faultless software projects. Therefore, the process of detecting and controlling software defects is critical to a project's success or failure. A software defect, according to the International Software Testing Qualifications Board (ISTQB) [4], is a problem in a specific software component or system that causes the component or system to fail to execute its necessary functions. Most software defects are caused by fundamental development flaws such as insufficient coding and design experience, confusing software requirements and documentation, insufficient test cases, and poor team communication [5].
* Corresponding author.
E-mail address: shadi_naswan@ju.edu.sa (S. Nashwan).
https://doi.org/10.1016/j.eij.2023.05.011
Received 10 October 2022; Received in revised form 29 April 2023; Accepted 24 May 2023
Software companies rely heavily on the process of detecting software defects. However, for many years, companies manually spotted faults by classifying reported problems as defects and then utilizing test cases to determine whether the problem was replicated, delayed, or not [6]. Nonetheless, because of the extended cycle, this strategy is costly, needs a lot of human labor and time, and relies heavily on developers' experience [7]. The practice of finding defects in specific software modules using machine learning theories and methodologies is known as software defect prediction [8]. Machine learning, on the other hand, is an artificial intelligence subfield that aims to offer computers the capacity to learn tasks by fitting mathematical functions to large amounts of data to generate various rules for data prediction and classification [9]. However, predicting software defects, in contrast to manual approaches, is an automated approach that, in its early stages, has the advantages of improving software design structure, improving the testing process, lowering development costs, providing a powerful approach for software planning and scheduling, and increasing software reliability [10].
In general, as shown in Fig. 1, the process of predicting software defects involves numerous consecutive steps [11]. The approach begins by relying strongly on the usage of open-source software projects [12] or software metrics [13] as employed by prior researchers. However, software metrics extraction technologies such as CodeMR may be employed with software projects [14]. Data preprocessing, on the other hand, is a critical stage in machine learning workflows; it is described as the process of repairing data with missing values (e.g., null values), cleaning noisy data, balancing data, selecting features, and scaling data towards more significant practices [15]. Finally, fitting machine learning and deep learning approaches may be used to make predictions.
Several researchers have proposed alternative ways of predicting software defects over many years. Deep learning approaches [16], decision trees [17], random forests [18], and boosting ensemble-based methods [19] are among the approaches used. In this paper, we propose a new framework for predicting software defects. The proposed methodology evaluates the use of eleven machine learning classifiers, including logistic regression (LR), K-nearest neighbors (KNN), decision trees (DT), random forests (RF), support vector machine (SVM), adaptive boosting (AdaBoost), gradient boosting (GB), stochastic gradient boosting (SGB), extreme gradient boosting (XGBoost), categorical boosting (CatBoost), and stacked generalization, over twelve datasets from the National Aeronautics and Space Administration (NASA). In addition, four nature-inspired algorithms are used for feature selection: particle swarm optimization (PSO), genetic algorithm (GA), harmony algorithm (HA), and ant colony optimization (ACO). However, due to the NASA dataset distribution, we use the synthetic minority oversampling technique (SMOTE) to balance the data. Also, we apply the Shapley additive explanation (SHAP) library to explore how much a single data sample and a series of sequential observations contribute to the prediction process.
The key contributions are as follows: (1) employing several classical and modern statistical, bagging, boosting, and stacking classifiers to compare their performance; (2) applying oversampling techniques to compensate for imbalanced classes; (3) utilizing nature-inspired algorithms for feature selection; (4) using the Shapley additive explanations library to explain the outputs of the predictive models and demonstrate how much a single observation may add to the prediction process.
The rest of the paper is structured as follows: Section 2 showcases a variety of related studies conducted in the field. Section 3 illustrates the proposed software defect prediction framework, the datasets used, the feature selection approaches, and the employed classifiers. Section 4 presents the evaluation metrics used and the hyperparameter tuning process. Section 5 discusses the experimental results and shows classifier output explanations using SHAP. Finally, the conclusion and future directions are presented in Section 6.
2. Related works
Many experimental, investigational, empirical, and comparative research approaches have recently been proposed in the domain of software defect prediction. In [7], using seven datasets obtained from the National Aeronautics and Space Administration, six machine learning classifiers were tested: logistic regression, random forest, naive Bayes, gradient boosting, support vector machine, and neural networks. The results showed that neural networks offer the best results, with an accuracy of 93% while utilizing 10-fold cross-validation. In [20], deep learning and bio-inspired feature selection methods were used: particle swarm optimization (PSO) was applied to improve neural network performance by reducing strongly correlated features. In addition, four NASA datasets were gathered and preprocessed with min-max scaling. Considering the results, an area under the curve (AUC) of 0.92 was found to outperform the others. However, the problem of
imbalanced data distribution remains. In [21], firefly and grey wolf bio-inspired search-based algorithms were employed to select the most relevant features. Furthermore, utilizing the largest NASA datasets (JM1, PC5, and MC1), two classifiers were evaluated: support vector machine and random forest. Using the Waikato Environment for Knowledge Analysis (WEKA), the results revealed that employing a support vector machine with the wolf algorithm yields the best results, with an accuracy of 99.3% when compared to others. Also, a receiver operating characteristic (ROC) of 51% demonstrated that the classifiers still perform essentially random classification. Besides that, imbalanced classes were not remedied. The study in [22] investigated the use of four feature selection filter techniques and fourteen search-based algorithms based on correlation and consistency feature subset selection to assess the performance of four classifiers: naive Bayes, decision trees, logistic regression, and k-nearest neighbors. In addition, five NASA datasets were gathered and preprocessed. The results show that applying filter approaches outperforms the others, particularly when leveraging information gain (IG). Also, when compared to other search methods, feature selection approaches based on exhaustive search have the highest effect. Nonetheless, the classifiers'
Fig. 1. General Software Defects Prediction Process.
performance was unstable and varied from one dataset to another. The study in [23] used a firefly search-based algorithm (FA) for feature selection. Three machine learning classifiers were used to evaluate the utility of FA for selecting the best features and enhancing model performance: support vector machine, naive Bayes, and k-nearest neighbors. When compared to the original findings, the support vector machine produced the best results and improved accuracy by 4.53%. The study in [24] proposed a new method comprising the use of several deep learning approaches for software defect prediction, utilizing the genetic algorithm (GA) for feature selection and particle swarm optimization (PSO) for data clustering. Moreover, five NASA datasets were collected and cleaned of missing values. The results revealed that deep neural networks outperform the others, with an accuracy of 98.47% using 10-fold cross-validation.
Several studies recommend combining numerous feature approaches into a single list rather than using them individually. The study in [25] introduced a novel method to handle the problem of high-dimensionality data and filter method ranking in software defect prediction by combining various feature filter methods: chi-square, information gain, and relief. The suggested method compares the outcomes by assessing the effectiveness of decision trees (DT) and naive Bayes (NB). In addition, nine NASA datasets were used and sanitized. The findings show that the suggested method outperforms the employment of individual methods, with NB accuracy increasing by 6.73% and DT accuracy increasing by 1.87%.
In [26], the hyperparameters of ten machine learning classifiers were fine-tuned and optimized over several variants to boost their discriminative power. The proposed method applies best-first feature subset selection, in three search directions, to four NASA datasets. The datasets were preprocessed by resampling and normalizing, and the results were evaluated using three evaluation metrics: F-measure, accuracy, and the Matthews correlation coefficient (MCC). The proposed method results in 91.86% accuracy, 75% F-measure, and 66% MCC. Using ensemble learning, the study in [27] set up four classifiers (random forest, decision trees, support vector machine, and logistic regression) as base learners using bagging and boosting ensemble-based methods. Moreover, ten NASA datasets were preprocessed and balanced using the synthetic minority oversampling technique; however, no feature selection techniques were performed. The experimental results revealed that the RF base learner, RF with AdaBoost, and RF with bagging outperform the others, with an accuracy of 97%. In [28], boosting, bagging, stacking, and voting ensemble-based methods were applied to four classifiers: DT, KNN, NN, and sequential minimal optimization (SMO). Eleven NASA datasets were also obtained and preprocessed, and a greedy search-based algorithm was used to minimize highly correlated features. The findings of the ensemble learning techniques indicated that boosted SMO yields the best results when compared to others, with an accuracy of 88.2%, while voting produces an accuracy of 87.9%.
Other studies have used open-source software project datasets to predict software defects. In [29], an experimental study was carried out to investigate the usage of three machine learning classifiers (LR, KNN, and NB) across nine class-level open-source software projects with 33 distinct versions. Furthermore, a text matching method was used to remove duplication from the datasets. Utilizing the process metrics, the findings indicated that LR had the highest AUC, of 88%. Soumi Ghosh et al. [30] proposed a new method for predicting software defects based on non-linear manifold methods. The Apache, Eclipse, Safe, and Zxing open-source software project datasets were collected. Furthermore, four classifiers were used in the experiment: Bayesian belief network, NB, DT, and KNN. When compared to the others employing the stochastic proximity embedding non-linear approach, the Bayesian belief network had the highest accuracy, of 80.53%. Several studies have used various methodologies to predict software defects; Table 1 summarizes the related works. In this paper, we propose a new framework for predicting software defects based on meta-heuristic methods and eleven machine learning classifiers. In addition, we employ the SHAP library for model explanations and SMOTE as a resampling technique for the data balancing problem.
3. Proposed methodology
In the sections that follow, we outline the proposed framework for software defect prediction and emphasize its key processes and their goals. The designed framework for predicting software defects attempts to improve prediction accuracy and performance by eliminating highly correlated features and addressing the imbalanced data problem. Additionally, the proposed framework's main purpose is to explain the outcomes of several machine learning models; it is useful to understand how much a single observation adds to the prediction process. As evaluation metrics in this work, classification measures such as testing accuracy and ROC-AUC were used. The proposed framework is shown in Fig. 2. To assure data authenticity, the essential processes of dataset preparation were used. The synthetic minority oversampling technique was used for data balancing since it does not replicate data records. For feature selection, this study employs four
Table 1
Summary of Related Works.

Reference | Datasets | Classifiers | Data Balancing | Meta-Heuristic | Results
[7] | 7 NASA datasets | LR, NB, GB, SVM, RF, ANN | No | No | ANN 93% and GB 89.5% validation accuracy.
[20] | 4 NASA datasets | ANN | No | Binary PSO | Best AUC of 92.9% over all datasets.
[21] | 3 NASA datasets | SVM, RF | No | Firefly and grey wolf | SVM with grey wolf yields the best results, with an accuracy of 99.3%.
[22] | 5 NASA datasets | NB, DT, LR, KNN | No | 14 search-based algorithms and 4 filter-based methods | Filter methods outperform others, particularly information gain.
[23] | KC1 NASA dataset | SVM, NB, KNN | No | Firefly | SVM has the best accuracy improvement, by 4.53%.
[24] | 5 NASA datasets | FNN, RNN, ANN, DNN | No | GA and PSO | DNN has the best results compared to others, with an accuracy of 98.4%.
[25] | 9 NASA datasets | NB, DT | No | No | NB accuracy increased by 6.73% and DT accuracy by 1.87%.
[26] | 4 NASA datasets | NB, ANN, RBF, SVM, KNN, KStar, OneR, DT, RF, PART | No | No | The proposed method results in 91.86% accuracy, 75% F-measure, and 66% MCC.
[27] | 10 NASA datasets | RF, LR, DT, SVM | SMOTE | No | RF and RF with AdaBoost outperform others, with an accuracy of 97%.
[28] | 11 NASA datasets | DT, KNN, ANN, SMO | No | No | Boosted SMO yields the best results when compared to others, with an accuracy of 88.2%.
[29] | 9 open-source software projects | LR, KNN, NB | No | No | LR has the best outcome, with an AUC of 88%.
[30] | 4 open-source software projects | NB, DT, KNN, Bayesian belief network | No | No | Bayesian belief network had the highest accuracy, of 80.53%.
meta-heuristic search-based algorithms: particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization. Eleven machine learning classifiers were constructed and evaluated using testing sets of 20% over the twelve NASA datasets. Finally, we employ Shapley additive explanations, a well-designed tool that aids in the explanation of prediction models by highlighting the most valuable features of the predictive models [31]. It is also a good visualization tool for showing how much a single observation or a set of sequential data samples affects the prediction process by calculating SHAP values. The SHAP model explanation is shown in Fig. 3.
3.1. Data collection and analysis
The National Aeronautics and Space Administration dataset is made up of thirteen datasets that describe various software components. In this study, we employ twelve NASA MDP datasets to predict software defects; the KC4 dataset was omitted owing to missing attributes and values. The data were obtained from an online repository, which can be found at https://figshare.com/collections/NASA_MDP_Software_Defects_Data_Sets/4054940/1. The major predictors for defect prediction are software metrics such as the McCabe, Halstead, miscellaneous, and lines-of-code count metrics. Table 2 shows a description of the datasets. Furthermore, the datasets were checked for missing values and data balance; they proved to be free of missing values, noisy data, and missing attributes.
3.2. Data balancing
Imbalanced data, also known as imbalanced target classes, is a data distribution problem that causes machine learning classifiers to be biased toward one category over others, negatively affecting prediction [32]. NASA datasets, on the other hand, exhibit a wide distribution gap across target classes. As a result, this study relies on SMOTE for data balancing. SMOTE is an oversampling technique that generates synthetic data samples to grow the minority class: for a minority sample, it locates the K nearest neighbors, computes the difference between the sample and a chosen neighbor, and multiplies that difference by a random value between 0 and 1 to place the new synthetic sample [33].
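As an illustration, the following minimal sketch (not the authors' code; the file name and target column are assumptions) shows how SMOTE from the imbalanced-learn package can be applied to one of the NASA datasets, oversampling only the training split:

```python
# Minimal sketch of the SMOTE balancing step (illustrative, not the authors'
# code; the file name and target column are assumptions).
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("CM1.csv")                                 # hypothetical file name
X, y = df.drop(columns=["Defective"]), df["Defective"]      # hypothetical target column

# Hold out 20% for testing, as in the proposed framework.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample only the training split so the held-out test set stays untouched.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(y_train.value_counts(), y_res.value_counts(), sep="\n")
```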
Fig. 2. Overview of the Proposed Software Defects Prediction Framework.
Fig. 3. Shapley additive explanations Model.
Table 2
NASA MDP D'' Version Datasets Description.

The twelve datasets (CM1, JM1, KC1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, PC5) draw on four categories of software metrics; the fraction in parentheses indicates how many of the twelve datasets include each metric.

McCabe: cyclomatic complexity (12/12), cyclomatic density (10/12), design complexity (12/12), essential complexity (12/12).

Halstead: content, difficulty, effort, error estimate, length, level, programming time, volume, number of operands, number of operators, number of unique operands, and number of unique operators (each 12/12).

Misc. metrics: branch count (12/12); call pairs, condition count, decision count, design density, edge count, essential density, parameter count, maintenance severity, modified condition count, multiple condition count, node count, normalized cyclomatic complexity, and percent comments (each 10/12); decision density (8/12); global data complexity and global data density (4/12 each).

LOC counts: LOC blank (11/12); LOC code and comment, LOC comments, LOC executable, and LOC total (each 12/12); number of lines (10/12).

Dataset | Features | Instances | Defective modules | Non-defective modules | Language | Lines of code (thousands)
CM1 | 37 | 344 | 42 | 302 | C | 17
JM1 | 21 | 9,593 | 1,759 | 7,834 | C | 457
KC1 | 21 | 2,096 | 325 | 1,771 | C++ | 43
KC3 | 39 | 200 | 36 | 164 | Java | 8
MC1 | 38 | 9,277 | 68 | 9,209 | C++ | 66
MC2 | 39 | 127 | 44 | 83 | C++ | 6
MW1 | 37 | 264 | 27 | 237 | C | 8
PC1 | 37 | 759 | 61 | 698 | C | 26
PC2 | 36 | 1,585 | 16 | 1,569 | C | 25
PC3 | 37 | 1,125 | 140 | 985 | C | 36
PC4 | 37 | 1,399 | 178 | 1,221 | C | 30
PC5 | 38 | 17,001 | 503 | 16,498 | C++ | 162
3.3. Feature selection
The process of selecting the most informative predictors from the existing features by removing highly correlated features is known as feature selection. In this paper, we apply four bio-inspired search-based feature selection algorithms.
3.3.1. Particle swarm optimization (PSO)
PSO is a nature-inspired optimization algorithm influenced by the behavior of bird flocks. It is a population-based algorithm in which each bird is referred to as a particle and a group of birds is referred to as a swarm (i.e., a population). The core notion of PSO is that each particle represents a candidate solution to the problem. Following random initialization, the location of each particle is updated, resulting in new positions: the best previous position $P_i^k$ and the best global position $P_g^k$, as shown in Fig. 4, where $r_1$ and $r_2$ represent randomly generated values in [0, 1], and $c_1$ and $c_2$ represent constant values that weight the pull toward $P_i$ (the best previous position of the $i$th particle) and $P_g$ (the best position found by all particles) [34].
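For reference, the standard PSO velocity and position updates that the symbols in Fig. 4 correspond to are (a standard formulation; the paper's exact variant may differ):

$$v_i^{k+1} = w\,v_i^k + c_1 r_1 \left(P_i^k - x_i^k\right) + c_2 r_2 \left(P_g^k - x_i^k\right), \qquad x_i^{k+1} = x_i^k + v_i^{k+1},$$

where $x_i^k$ and $v_i^k$ are the position and velocity of the $i$th particle at iteration $k$, and $w$ is an inertia weight.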
3.3.2. Genetic algorithm (GA)
A genetic algorithm is a population-based meta-heuristic used to address optimization problems. It is a chromosome-inspired search-based algorithm. GA passes through the following stages: initializing the population, applying the fitness function, selection, and reproduction [35]. To begin, random individuals are initiated, each of which carries parameters known as genes, which together form a chromosome. The core idea behind GA is that each individual offers a solution to the problem. Each individual is evaluated with a fitness score using the fitness function to determine whether or not the individual is suitable for reproduction. Following the selection process, the crossover function can be used to generate new individuals [11]. Fig. 5 shows the GA process.
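As an illustrative sketch (not the authors' implementation; the population size, probabilities, and wrapped classifier are assumptions), GA-based feature selection over binary masks can look like this:

```python
# Illustrative GA feature-selection sketch. Individuals are binary masks over
# the columns of a numpy feature matrix X; fitness is the cross-validated
# accuracy of a classifier trained on the masked subset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def fitness(mask, X, y):
    if not mask.any():                       # empty subsets are unusable
        return 0.0
    clf = DecisionTreeClassifier(random_state=42)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

def ga_select(X, y, pop_size=20, generations=20, cx_prob=0.6, mut_prob=0.05):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5    # random initial chromosomes
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the fittest half
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = a.copy()
            if rng.random() < cx_prob:       # one-point crossover
                cut = rng.integers(1, n)
                child[cut:] = b[cut:]
            flip = rng.random(n) < mut_prob  # bit-flip mutation
            child[flip] = ~child[flip]
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()]              # best feature mask found
```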
3.3.3. Harmony search algorithm (HA)
Harmony search is a population-based meta-heuristic algorithm inspired by the musical practice of finding the best harmonies. The entire point of HA is to emulate the musicians' approach to discover the optimal solution. HA passes through multiple phases: initialization of parameters, initialization of harmony memory, improvisation of a new harmony, and updating of the harmony memory [36]. Following the initialization of the basic parameters (the harmony memory (HM) size, the harmony memory consideration rate (HMCR), and the pitch adjustment rate (PAR)), the harmony memory matrix is built and filled at random with values of the possible variables. Each row in the matrix indicates a potential solution. Relying on HMCR and PAR, additional vectors are generated. Finally, if the new harmony vectors outperform the old ones, they are preserved as the best solutions in the HM [37]. The HA pseudocode is shown in Fig. 6.
3.3.4. Ant colony optimization (ACO)
In several disciplines, ACO is one of the most widely utilized meta-heuristic search-based algorithms. It is an improved version of Marco Dorigo's original algorithm, the ant system, which was introduced in 1992 [38]. ACO was inspired by the foraging activity of ant colonies. The goal of ACO is to locate the shortest path between the ant nest and the food supplies; each route represents a possible solution. After the pathways have been initialized, and with the assistance of ant pheromones, which are organic chemical substances, the ants converge on the shortest path to the nest, which represents the ideal solution. Many more models have been combined with and enhanced by ACO; to solve the challenge of autonomous surface vehicles, Dimitrios [39] suggested an updated version of ACO based on fuzzy logic. The ACO pseudocode is shown in Fig. 7.
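For reference, the standard ant-system pheromone update that underlies ACO is (the standard form; the chaotic variant used in this paper's experiments may differ):

$$\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij} + \sum_{k=1}^{m}\Delta\tau_{ij}^{k}, \qquad \Delta\tau_{ij}^{k} = \begin{cases} Q/L_k, & \text{if ant } k \text{ traverses edge } (i,j),\\ 0, & \text{otherwise,} \end{cases}$$

where $\rho$ is the evaporation rate, $m$ the number of ants, $L_k$ the length of ant $k$'s path, and $Q$ a constant.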
3.4. Classification models
In this paper, we propose a new software defect prediction framework comprised of the following eleven machine learning models: a statistical classifier (logistic regression), the K-nearest neighbors approach, a support vector machine-based classifier, the decision trees approach (J48), and numerous ensemble methods (stacking, random forest, adaptive boosting, gradient boosting, stochastic gradient boosting, extreme gradient boosting, and categorical boosting).
3.4.1. Logistic regression
A logistic regression model, like a linear regression model, predicts a dependent variable by examining the relationship between independent data variables. However, it is used for true/false binary predictions rather than continuous predictions [40]. Furthermore, it fits an S-shaped curve using the sigmoid function, where $h_\theta(x) = 1/(1 + e^{-\theta^T x})$, causing the predicted value to range between 0 and 1 and giving the probability that y = 1 based on the value of x. A high value of $\theta^T x$ indicates a high probability that y = 1, while $h_\theta(x) = 0.5$ indicates a 50% likelihood that y = 1.
3.4.2. K-nearest neighbors
KNN [41] is a distance-based method that determines the distance between training and test samples. The KNN classifier is a lazy algorithm, since it merely stores the data samples during the learning stage and defers the actual computation to the evaluation phase. Notwithstanding, KNN passes through several phases: begin by determining the value of K (the number of nearest samples), then calculate the distance between the training samples and the new data sample using a distance function. The distance values are then sorted and the nearest samples identified based on the value of K, and the prediction is determined by a majority vote over their class values.
3.4.3. Decision trees
DT [42] is a tree-based classifier that splits the attributes of the training data into parts. In DT, predictions are made through a series of decisions based on questions that are closely relevant to the prediction goal. It is a non-parametric approach that does not necessitate a mapping function between predictors and target data. The quality of the candidate decision values at each depth level is measured by impurity; the Gini index function, on the other hand, may be used to measure the impurity of the tree.
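For reference, the Gini index of a node whose samples fall into classes with proportions $p_i$ takes the standard form

$$\text{Gini} = 1 - \sum_{i} p_i^2.$$

For example, a node holding 80% non-defective and 20% defective modules has a Gini index of $1 - (0.8^2 + 0.2^2) = 0.32$, while a pure node scores 0.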
3.4.4. Random forests
RF [43] is a bagging ensemble-based classifier. It distributes the training data across several decision trees as its base learners. RF is also a non-parametric model. A restricted random number of rows is chosen from the data population, and a decision tree is built for each segment to generate classification output; an individual decision tree might be bigger than the others. A majority vote in RF acts as a meta-classifier, i.e., a classification decision is made based on the judgments of the base learners. It supports numeric and categorical data types, as well as complicated predictor functions.
3.4.5. Support vector machine
A support vector machine [44] is a classifier that uses hyperplanes to find the best decision boundary in a dimensional space based on the number of features. Because there are numerous candidate hyperplanes, the goal of the support vector machine is to pick the ideal hyperplane with the largest margins among data samples to improve classification accuracy. Both linear and non-linear classification can be done. Generally, a support vector machine performs better and is less prone to overfitting when the class distribution is known. However,
when maximum margins are employed, new data may be appropriately fit and categorized.
3.4.6. Adaptive boosting
AdaBoost [45] is an ensemble-based boosting classifier. In contrast to decision trees and random forests, AdaBoost combines several shallow trees, each of which contains a single node and two leaves, forming what is known as a stump; a collection of stumps is a forest of stumps. Stumps are weak classification learners, and AdaBoost's entire concept is to combine numerous weak learners for classification. The sequence of stump creation is critical in AdaBoost, since each produced tree influences the output of the following one. Each data sample is weighted (w) to determine the order of stumps, initially with w = 1/(total number of samples). The sample weights are then updated repeatedly as $w_{\text{new}} = w_{\text{old}} \cdot e^{\pm\alpha}$, where $\alpha$ is the stump's "amount of say". Also, the Gini index may be used to rank stumps: the lower the Gini value, the more essential the stump. Fig. 8 shows the AdaBoost pseudocode.
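In the standard AdaBoost formulation, which the weight update above follows, the amount of say of a stump with weighted classification error $\varepsilon$ is

$$\alpha = \frac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon},$$

and a sample's weight is multiplied by $e^{\alpha}$ if the stump misclassified it and by $e^{-\alpha}$ otherwise, after which all weights are renormalized to sum to 1.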
3.4.7. Gradient boosting
GB [46] is a boosting ensemble-based algorithm for classification and regression that employs the idea of pseudo-residuals (PR). Unlike AdaBoost, gradient boosting begins with a single leaf that reflects the average value of the predicted classes rather than a stump. GB then creates fixed-size trees, similar to AdaBoost, where each tree can be larger than a stump; decision trees are used as base learners. The PR represent the differences between the actual and predicted values, and they may be determined by fitting a regression line under a loss function. In GB, prediction errors are reduced by repeatedly updating the PR values from tree to tree, making it a powerful prediction approach. Fig. 9 shows the GB pseudocode.
Fig. 4. Particle Swarm Optimization Architecture Illustration.
Fig. 5. Genetic Algorithm Pseudocode.
Fig. 6. Harmony Search Algorithm Pseudocode.
Fig. 7. Ant Colony Optimization pseudocode.
Fig. 8. Adaptive Boosting pseudocode.
3.4.8. Stochastic gradient boosting
Friedman [47] introduced the stochastic gradient boosting approach, an ensemble-based method inspired by a hybrid of the homogeneous (boosting and bagging) procedures. By weighting the base learners, it may be utilized for regression and classification. Decision trees are the most widely used SGB base learners, and each tree is trained on a randomly selected subset of the data sample records. Compared to bagging methods, SGB is an effective instrument for limiting the chance of overfitting by randomly eliminating a portion of the input data [48]. SGB has been utilized in a variety of areas for many years; Sérgio Godinho et al. [49] employed SGB for multi-image classification in their inquiry study.
3.4.9. Extreme gradient boosting
XGBoost [50] is an ensemble method based on boosting. It is a more advanced version of gradient boosting, created for large and intricate datasets. Unlike adaptive and gradient boosting, XGBoost employs unique regression trees, forming what are known as born-again trees. XGBoost tree formation, on the other hand, begins with a single leaf, much like gradient boosting. Furthermore, gradient boosting and regularization are critical steps in XGBoost. For many years, XGBoost has been one of the most effective boosting methods; in their experimental investigation, Ismail and Faisal [51] demonstrated that XGBoost outperforms classifiers such as RF, SVM, radial basis function neural network, and naive Bayes. It is simple to use and provides a high level of discriminative prediction accuracy, in addition to its ability to
Fig. 9. Gradient Boosting Pseudocode.
Fig. 10. ROC-AUC ranges and values.
Table 3
Boosting Hyperparameters Grid-Search Space.
Parameters Grid-Search Space
Number of estimators 50, 70, 90, 100, 150, 120, 180, 200
Learning rate 0.001, 0.01, 0.1, 1, 10
Loss function Deviance, exponential
Algorithms SAMME, SAMME.R
Fig. 11. NASA MDP Selected Features Using Meta-Heuristic Search-Based Algorithms (number of selected features per dataset, CM1 through PC5, for SMOTE+PSO, SMOTE+GA, SMOTE+HA, and SMOTE+ACO).
handle large data distribution with imbalanced target classes.
3.4.10. Categorical boosting
CatBoost is a more advanced kind of gradient boosting, proposed in 2018 by Anna Veronika Dorogush and the Yandex team [52]. It was created to be the optimal option for heterogeneous data, as opposed to bagging and stacking. Based on a boosting approach, it was designed to handle categorical and numerical features. CatBoost is an open-source package that allows for extremely rapid graphics processing unit (GPU) and central processing unit (CPU) calculations. It is very effective for small datasets and simple to utilize. It is also a decision tree-based method, which has the benefit of decreasing overfitting. CatBoost has long been a rival to the other gradient boosting methods; Jengei Hong [53] found that CatBoost outperformed the XGBoost and LightGBM implementations in his experimental investigation on predicting house pricing.
3.4.11. Ensemble stacking
Stacking [54] is a heterogeneous ensemble-based method. As opposed to bagging and boosting, stacking divides a machine learning model into two layers and proceeds in several stages: dataset gathering and feature extraction, base learner construction, and meta-learner creation. The base layer has N distinct classifiers. To begin, the training data is divided into K portions, and each base learner produces predictions based on the K folds. A meta dataset then integrates the prediction outcomes of the base learners, and a meta-classifier forms the second level of prediction, generating output based on the previous predictions. In this work, we employ RF and SVM as base learners and LR as the meta-classifier.
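A minimal scikit-learn sketch of this setup, with RF and SVM as base learners and logistic regression as the meta-classifier (the hyperparameters are illustrative assumptions, not the paper's tuned values):

```python
# Sketch of the stacked-generalization setup described above.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,   # K-fold predictions of the base learners form the meta dataset
)
# X_res/y_res and X_test/y_test as in the SMOTE sketch of Section 3.2.
stack.fit(X_res, y_res)
print(stack.score(X_test, y_test))
```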
4. Evaluation metrics and hyperparameters tuning
Several evaluation metrics may be used to evaluate machine learning classifiers, including accuracy, precision, recall, receiver operating characteristic (ROC), F-measure, and area under the curve (AUC). A
Table 5
Classifiers Testing Accuracy Comparison over JM1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.870766 0.690316 0.700403 0.794889 0.701076
KNN 0.936425 0.972091 0.972764 0.970746 0.972428
Decision trees 0.930693 0.956288 0.954943 0.954270 0.949227
Random Forest 0.915060 0.924681 0.920309 0.912239 0.914257
Ada Boost 0.915060 0.924681 0.920309 0.912239 0.914257
Gradient Boosting 0.955706 0.969401 0.970746 0.970746 0.971083
SGB 0.959354 0.975118 0.975118 0.975454 0.975118
XGBoost 0.818134 0.790182 0.459314 0.828850 0.823134
Stacking 0.925000 0.967720 0.968393 0.967720 0.967720
Cat Boost 0.957791 0.974781 0.974781 0.973436 0.973100
SVM 0.819177 0.968393 0.969738 0.969065 0.968729
Accuracy (AVG) 0.851 0.919 0.889 0.929 0.92
Table 6
Classifiers Testing Accuracy Comparison over KC1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.850000 0.865041 0.865041 0.860163 0.871545
KNN 0.828571 0.889431 0.889431 0.887805 0.887805
Decision trees 0.809524 0.856911 0.866667 0.879675 0.876423
Random Forest 0.852381 0.882927 0.886179 0.892683 0.891057
Ada Boost 0.852381 0.882927 0.886179 0.892683 0.891057
Gradient Boosting 0.866667 0.884553 0.884553 0.889431 0.884553
SGB 0.869048 0.894309 0.891057 0.899187 0.895935
XGBoost 0.847619 0.461789 0.577236 0.856911 0.866667
Stacking 0.890476 0.889251 0.889251 0.889251 0.889251
Cat Boost 0.876190 0.897561 0.897561 0.895935 0.895935
SVM 0.850000 0.886179 0.886179 0.884553 0.886179
Accuracy (AVG) 0.853 0.844 0.856 0.883 0.884
Table 4
Classifiers Testing Accuracy Comparison over CM1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.826087 0.867188 0.859375 0.789062 0.882812
KNN 0.884058 0.898438 0.875000 0.906250 0.890625
Decision trees 0.869565 0.906250 0.882812 0.914062 0.859375
Random Forest 0.884058 0.906250 0.906250 0.906250 0.906250
Ada Boost 0.884058 0.906250 0.906250 0.906250 0.906250
Gradient Boosting 0.913043 0.937500 0.921875 0.929688 0.937500
SGB 0.942029 0.921875 0.906250 0.937500 0.929688
XGBoost 0.884058 0.820312 0.882812 0.851562 0.632812
Stacking 0.857143 0.875000 0.890625 0.890625 0.890625
Cat Boost 0.884058 0.906250 0.898438 0.898438 0.906250
SVM 0.884058 0.914062 0.890625 0.914062 0.914062
Accuracy (AVG) 0.882 0.896 0.892 0.894 0.877
Fig. 12. The 10th Observation's Impact on the Predictive Models and the Highest Discriminative Power Features over the CM1 dataset. A) Shows call pairs as the most impactful feature in GB, with a 1.343 value compared to a 0.5 base value. B) Shows call pairs with a 1.343 value compared to a 0.49 base value in SGB. C) Shows lines of code comments with a 25.17 value compared to a 0.6 base value in CatBoost.
Table 7
Classifiers Testing Accuracy Comparison over KC3 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.825 0.860163 0.860163 0.919355 0.967742
KNN 0.775 0.887805 0.887805 0.903226 0.838710
Decision trees 0.925 0.873171 0.876423 0.903226 0.919355
Random Forest 0.800 0.889431 0.889431 0.935484 0.935484
Ada Boost 0.800 0.889431 0.889431 0.935484 0.935484
Gradient Boosting 0.875 0.889431 0.889431 0.919355 0.903226
SGB 0.875 0.897561 0.895935 0.935484 0.919355
XGBoost 0.825 0.855285 0.858537 0.919355 0.870968
Stacking 0.700 0.889251 0.889251 0.903226 0.903226
Cat Boost 0.900 0.895935 0.895935 0.935484 0.935484
SVM 0.825 0.884553 0.884553 0.919355 0.919355
Accuracy (AVG) 0.829 0.882 0.882 0.92 0.913
Table 8
Classifiers Testing Accuracy Comparison over MC1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.990841 0.950879 0.950879 0.950879 0.933296
KNN 0.994073 0.998046 0.998046 0.998046 0.997767
Decision trees 0.991379 0.998046 0.998325 0.998046 0.996930
Random Forest 0.994612 0.996651 0.996651 0.996651 0.996651
Ada Boost 0.994612 0.996651 0.996651 0.996651 0.996651
Gradient Boosting 0.993534 0.996930 0.996930 0.996930 0.997488
SGB 0.993534 0.996372 0.997767 0.996651 0.919355
XGBoost 0.992457 0.485906 0.991627 0.975440 0.891711
Stacking 0.995690 0.996094 0.996094 0.996094 0.996094
Cat Boost 0.994612 0.996093 0.996093 0.996093 0.996093
SVM 0.992457 0.995814 0.995814 0.995814 0.996372
Accuracy (AVG) 0.992 0.945 0.992 0.99 0.973
confusion matrix, on the other hand, is a table that contains a collection of characteristics that may be used to describe the performance of a classifier [55]. It may be used to visualize and compare the statistical performance of various algorithms.
In short, a confusion matrix has four key properties that may be utilized to derive different evaluation metrics, which are as follows:
True positive (TP): the number of defective modules that have been correctly classified as defective.
True negative (TN): the number of non-defective modules that have been correctly classified as non-defective.
False positive (FP): the number of non-defective modules that have been misclassified as defective.
False negative (FN): the number of defective modules that have been misclassified as non-defective.
In this study, we evaluate the proposed classifiers using testing accuracy and AUC. AUC is a performance metric that indicates how well a certain classifier performs classification by distinguishing between target classes [56]. The true positive rate (sensitivity) and the false positive rate (1 − specificity) are two threshold values with an inverse relationship: sensitivity increases as specificity decreases, and vice versa. The higher the AUC the better, as a higher AUC indicates that the classifier is not performing random classification. Fig. 10 shows the AUC ranges and values. 20% of the data was held out as a testing set to evaluate classification accuracy. The confusion matrix may be used to extract all of the statistical measures listed above as
follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{True positive rate (Sensitivity)} = \frac{TP}{TP + FN}$$

$$\text{False positive rate} = 1 - \text{Specificity} = \frac{FP}{TN + FP}$$
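A minimal sketch of how these measures can be computed with scikit-learn for some fitted classifier `clf` and the 20% held-out split (illustrative, not the authors' script):

```python
# Computing the two evaluation metrics used in this study (testing accuracy
# and ROC-AUC), plus TPR and FPR from the confusion matrix.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]            # probability of the defective class

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:", accuracy_score(y_test, y_pred))   # (TP+TN)/(TP+FP+TN+FN)
print("ROC-AUC :", roc_auc_score(y_test, y_score))
print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))
```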
In this work, we use the grid-search method for tuning and optimizing the classifiers' hyperparameters. Grid search is a strategy for choosing the optimal parameters by evaluating all possible combinations of the value entries in the search space [57]. The parameters of the boosting ensemble methods, including the number of estimators, the learning rate, the loss function, and the algorithm, were tuned over a set of values. Also, 5-
Table 9
Classifiers Testing Accuracy Comparison over MC2 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.730769 0.950879 0.542857 0.542857 0.714286
KNN 0.769231 0.998046 0.542857 0.542857 0.771429
Decision trees 0.576923 0.998046 0.657143 0.657143 0.771429
Random Forest 0.769231 0.996651 0.800000 0.857143 0.828571
Ada Boost 0.769231 0.996651 0.800000 0.857143 0.828571
Gradient Boosting 0.769231 0.996930 0.828571 0.828571 0.800000
SGB 0.692308 0.996930 0.800000 0.857143 0.857143
XGBoost 0.653846 0.973207 0.657143 0.685714 0.657143
Stacking 0.846154 0.996094 0.705882 0.705882 0.647059
Cat Boost 0.692308 0.996093 0.800000 0.800000 0.800000
SVM 0.692308 0.995814 0.685714 0.685714 0.685714
Accuracy (AVG) 0.723 0.99 0.71 0.728 0.759
Fig. 13. Shows 35 data samples' contributions using AdaBoost and DT over the MC2 dataset utilizing SMOTE and PSO. A) Shows the Halstead effort, call pairs, and Halstead difficulty metrics as the highest discriminative features using AdaBoost. B) Shows the call pairs, edge count, and Halstead difficulty metrics as the most affecting features over 35 data samples using DT.
Fig. 14. Shows 35 data samples' contributions using RF and SGB over the MC2 dataset utilizing SMOTE and PSO. A) Shows the global data density and call pairs metrics' effects using RF. B) Shows the Halstead effort, global data density, and call pairs metrics' effects using SGB.
Table 10
Classifiers Testing Accuracy Comparison over MW1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.924528 0.542857 0.542857 0.542857 0.824176
KNN 0.905660 0.542857 0.542857 0.542857 0.956044
Decision trees 0.886792 0.657143 0.657143 0.657143 0.879121
Random Forest 0.924528 0.828571 0.857143 0.828571 0.956044
Ada Boost 0.924528 0.828571 0.857143 0.828571 0.956044
Gradient Boosting 0.886792 0.800000 0.828571 0.800000 0.901099
SGB 0.849057 0.800000 0.800000 0.800000 0.923077
XGBoost 0.905660 0.685714 0.800000 0.742857 0.912088
Stacking 0.925926 0.705882 0.705882 0.705882 0.869565
Cat Boost 0.924528 0.800000 0.800000 0.800000 0.956044
SVM 0.905660 0.685714 0.685714 0.685714 0.934066
Accuracy (AVG) 0.905 0.715 0.733 0.72 0.915
Table 11
Classifiers Testing Accuracy Comparison over PC1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.914474 0.862816 0.862816 0.862816 0.894958
KNN 0.901316 0.945848 0.945848 0.945848 0.962185
Decision trees 0.914474 0.909747 0.916968 0.924188 0.941176
Random Forest 0.921053 0.938628 0.924188 0.938628 0.957983
Ada Boost 0.921053 0.938628 0.924188 0.938628 0.957983
Gradient Boosting 0.907895 0.942238 0.942238 0.935018 0.962185
SGB 0.901316 0.942238 0.945848 0.942238 0.957983
XGBoost 0.940789 0.837545 0.859206 0.902527 0.907563
Stacking 0.934211 0.927536 0.927536 0.927536 0.974790
Cat Boost 0.921053 0.942238 0.942238 0.942238 0.962185
SVM 0.921053 0.942238 0.942238 0.942238 0.949580
Accuracy (AVG) 0.917 0.92 0.927 0.933 0.953
fold cross-validation was employed. The grid-search parameter space for the boosting methods is shown in Table 3.
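As a hedged sketch (not the authors' exact script), the Table 3 space for gradient boosting can be wired into scikit-learn's GridSearchCV as follows; the "Algorithms" row (SAMME, SAMME.R) applies to AdaBoost and would be tuned analogously, and the loss names follow the scikit-learn releases in which "deviance" is valid:

```python
# Tuning a gradient boosting classifier over the Table 3 search space
# with 5-fold cross-validation.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 70, 90, 100, 120, 150, 180, 200],
    "learning_rate": [0.001, 0.01, 0.1, 1, 10],
    "loss": ["deviance", "exponential"],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_res, y_res)      # balanced training data from the SMOTE step
print(search.best_params_, search.best_score_)
```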
5. Experimental results and models explanations
The ndings of the proposed classiers for predicting software de-
fects are discussed in this section. SMOTE was utilized for data balance
after data preprocessing. Four Meta heuristics methods, on the other
hand, were tuned and used for feature selection. We employ the logistic
map function as a chaotic type and 20 initial particles with 20 iterations
in ACO. We utilized a population size of 20 with 20 iterations in PSO. In
GA, we employed a crossover probability of 6% with a maximum gen-
eration of 20. In addition, we employed the logistic map function in HA
with 20 iterations and population size.
Fig. 11 shows the number of features selected for each dataset. In
comparison to the others, GA has the most selected amount of features
whereas ACO comes next.
Tables 4-6 compare the accuracy of the classifiers on the CM1, JM1, and
Table 12
Classifiers Testing Accuracy Comparison over PC2 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.990536 0.862816 0.862816 0.862816 0.955684
KNN 0.990536 0.945848 0.945848 0.945848 0.996146
Decision trees 0.987382 0.909747 0.916968 0.909747 0.996146
Random Forest 0.990536 0.938628 0.935018 0.938628 0.996146
Ada Boost 0.990536 0.938628 0.935018 0.938628 0.996146
Gradient Boosting 0.990536 0.938628 0.935018 0.945848 0.996146
SGB 0.993691 0.945848 0.942238 0.945848 0.998073
XGBoost 0.990536 0.898917 0.675090 0.862816 0.445087
Stacking 0.987421 0.927536 0.927536 0.927536 0.992308
Cat Boost 0.990536 0.942238 0.942238 0.942238 0.996146
SVM 0.990536 0.942238 0.942238 0.942238 0.994220
Accuracy (AVG) 0.9902 0.932 0.909 0.929 0.9406
Fig. 15. Shows the most effective features over the PC1 dataset using ACO and SMOTE.
KC1 datasets before and after using the meta-heuristic feature selection algorithms. The average testing accuracy was computed to highlight the comparison of outcomes. Overall, the results showed an improvement in testing accuracy. GB, SGB, and CatBoost, in particular, outperform the others. This is not surprising, given that boosting methods decrease bias. In addition, by expanding the minority class samples, SMOTE increases the number of data samples.
However, XGBoost performs the worst when compared to the others utilizing PSO and GA on the JM1 and KC1 datasets, with testing accuracy ranging from 45 to 57 percent. The reason is most likely overfitting; in light of this, GridSearchCV was used to adjust the XGBoost hyperparameters. To the best of our understanding, boosting methods in general are prone to overfitting, particularly with deep trees. SMOTE with the features selected by PSO yields the best results on the CM1 dataset, with an average accuracy of 89.6%. The results on the JM1 dataset demonstrate that SMOTE and HA outperform the others, with an average testing accuracy of 92.9%. Also, on the KC1 dataset, SMOTE and ACO outperform the others, with an average accuracy of 88.4%.
Fig. 12 shows SHAP plots explaining the GB, SGB, and CatBoost results. The bee swarm plot was used to describe important features based on SHAP values, and the local force plot was used to demonstrate how a single observation contributed to the prediction models. We chose the tenth observation at random from the CM1 dataset. The call pairs metric pushes the prediction to the right in GB and SGB, with a value of 1.343 compared to the base value of 0.5.
However, the impact of call pairs on the prediction model decreases as the parameter count and design complexity metrics increase. Lines of code comments have the strongest impact in CatBoost at the 10th observation, at 25.17, relative to the base value of 0.6, whereas the design complexity metric has a negative impact. Overall, it can be noted that the lines of code comments, design density, and Halstead content metrics have the highest discriminative power for the predictive models.
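For reference, plots like those in Figs. 12-16 can be produced with the SHAP package along the following lines (a sketch, not the authors' code; `gb_model` stands for any fitted tree-based classifier, and for some model types `shap_values` is returned as a per-class list):

```python
import shap

explainer = shap.TreeExplainer(gb_model)
shap_values = explainer.shap_values(X_test)

# Local force plot: contribution of a single observation (index 10 here,
# mirroring the randomly chosen 10th observation above).
shap.force_plot(explainer.expected_value, shap_values[10], X_test.iloc[10],
                matplotlib=True)

# Bee swarm (summary) plot: global ranking of features by their SHAP values.
shap.summary_plot(shap_values, X_test)
```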
The comparative ndings using meta-heuristics and SMOTE across
the KC3, MC1, and MC2 datasets are shown in Tables 7-9. Aside from
general performance enhancement. SGB, Cat Boost, LR, and RF surpass
others using the HA algorithm in the KC3 dataset, with an average ac-
curacy of 92% compared to the original ndings of 82%. The MC1
Table 14
Classifiers Testing Accuracy Comparison over PC4 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.925000 0.862816 0.862816 0.862816 0.860262
KNN 0.875000 0.945848 0.945848 0.945848 0.938865
Decision trees 0.964286 0.924188 0.924188 0.924188 0.978166
Random Forest 0.928571 0.935018 0.931408 0.938628 0.958515
Ada Boost 0.928571 0.935018 0.931408 0.938628 0.958515
Gradient Boosting 0.975000 0.938628 0.935018 0.942238 0.973799
SGB 0.971429 0.942238 0.938628 0.945848 0.931507
XGBoost 0.860714 0.469314 0.758123 0.866426 0.871179
Stacking 0.921429 0.927536 0.927536 0.927536 0.890830
Cat Boost 0.971429 0.942238 0.942238 0.942238 0.982533
SVM 0.871429 0.942238 0.942238 0.942238 0.901747
Accuracy (AVG) 0.926 0.89 0.917 0.931 0.938
Table 15
Classifiers Testing Accuracy Comparison over PC5 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.963834 0.862816 0.862816 0.913023 0.955054
KNN 0.975889 0.945848 0.945848 0.991103 0.988802
Decision trees 0.973243 0.898917 0.909747 0.987728 0.989109
Random Forest 0.971479 0.942238 0.931408 0.978831 0.979291
Ada Boost 0.971479 0.942238 0.931408 0.978831 0.979291
Gradient Boosting 0.978536 0.938628 0.942238 0.990950 0.991256
SGB 0.976183 0.938628 0.938628 0.989262 0.931507
XGBoost 0.971479 0.862816 0.862816 0.969781 0.961190
Stacking 0.970588 0.927536 0.927536 0.987423 0.985276
Cat Boost 0.974125 0.942238 0.942238 0.986041 0.987421
SVM 0.970891 0.942238 0.942238 0.984814 0.982973
Accuracy (AVG) 0.973 0.928 0.927 0.984 0.977
Table 13
Classifiers Testing Accuracy Comparison over PC3 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE AC/SMOTE
LR 0.871111 0.862816 0.862816 0.862816 0.805479
KNN 0.871111 0.945848 0.945848 0.945848 0.928767
Decision trees 0.857778 0.916968 0.916968 0.898917 0.893151
Random Forest 0.875556 0.935018 0.942238 0.931408 0.920548
Ada Boost 0.875556 0.935018 0.942238 0.931408 0.920548
Gradient Boosting 0.906667 0.938628 0.938628 0.942238 0.906849
SGB 0.880000 0.935018 0.945848 0.942238 0.931507
XGBoost 0.235556 0.884477 0.873646 0.884477 0.857534
Stacking 0.902655 0.927536 0.927536 0.927536 0.939891
Cat Boost 0.875556 0.942238 0.942238 0.942238 0.920548
SVM 0.875556 0.942238 0.942238 0.942238 0.926027
Accuracy (AVG) 0.815 0.930 0.931 0.928 0.914
dataset, on the other hand, revealed that DT, KNN, and GB outperform the others employing GA, with an average accuracy of 99.2%. The MC1 dataset is regarded as one of the largest NASA datasets, with a total of 9277 instances, and we were able to retain the same accuracy by applying the GA algorithm thanks to the broad data dispersion over the training and testing sets. The findings on the MC2 dataset revealed that six of the eleven classifiers (DT, KNN, GB, SGB, AdaBoost, and RF) had the best accuracy by comparison. Additionally, SMOTE and PSO yield the best outcomes, with an average accuracy of 99%.
Figs. 13 and 14 show the SHAP explanations of the DT, AdaBoost, RF, and SGB models on the MC2 dataset using SMOTE and PSO; these classifiers have the best outcomes on the KC3, MC1, and MC2 datasets. Explaining how a single observation contributes to the prediction model is useful but insufficient. As a result, we employ the SHAP global force plot to demonstrate how a series of consecutive data samples contributes to the prediction model.
The plots show the contributions of 35 data samples. The x-axis in the global force plot reflects the number of data samples, while the y-axis shows the SHAP predicted values. In addition, we employ a bee swarm plot to emphasize the features with the highest discriminative power. For clarity, the most significant features across all classifiers are the call pairs, global data density, and essential density metrics.
The results of the classifiers using the meta-heuristic methods on the MW1, PC1, and PC2 datasets are shown in Tables 10-12. Overall, ACO delivers the best performance improvement compared to the others. The findings on the MW1 dataset demonstrate that KNN, RF, AdaBoost, and Cat Boost surpass the others using ACO, with an average accuracy of 91.5%. With only 264 instances in total, the MW1 dataset is one of the smallest NASA datasets. On the PC1 dataset, all search methods improve the overall results; however, ACO achieves the best results, with an average accuracy of 95.3%. It is also notable that KNN, GB, Cat Boost, and Stacking outperform the others. ACO likewise outperforms the others on the PC2 dataset, with an average testing accuracy of 94%. Unlike on PC1 and MW1, XGBoost performs the worst on PC2, most likely due to overfitting, whereas SGB surpasses the others. Fig. 15 shows the most effective features over the PC1 dataset utilizing KNN, DT, GB, RF, and Cat Boost with ACO and SMOTE.
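For illustration, the wrapper-style search that ACO performs over feature subsets (PSO, GA, and HA follow the same wrapper pattern with different update rules) can be sketched as follows. This is a deliberately simplified pheromone-per-feature scheme, assuming scikit-learn, and is not the authors' exact implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def aco_select(X, y, n_ants=10, n_iter=15, rho=0.2, seed=0):
    """Simplified ACO: pheromone per feature acts as an inclusion probability."""
    rng = np.random.default_rng(seed)
    pher = np.full(X.shape[1], 0.5)                 # initial pheromone
    best_mask, best_fit = None, -np.inf
    for _ in range(n_iter):
        it_mask, it_fit = None, -np.inf
        for _ in range(n_ants):
            mask = rng.random(X.shape[1]) < pher    # ant samples a feature subset
            if not mask.any():
                mask[rng.integers(X.shape[1])] = True
            # Fitness = 5-fold CV accuracy of a KNN wrapper on the subset.
            fit = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=5).mean()
            if fit > it_fit:
                it_mask, it_fit = mask, fit
        pher *= (1.0 - rho)                         # evaporation
        pher[it_mask] += rho                        # deposit on iteration-best subset
        pher = pher.clip(0.05, 0.95)
        if it_fit > best_fit:
            best_mask, best_fit = it_mask, it_fit
    return best_mask, best_fit

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=1)
mask, fit = aco_select(X, y)
print(mask.sum(), "features selected, CV accuracy:", round(fit, 3))
```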
Tables 13-15 show the findings before and after applying the metaheuristics and SMOTE on the PC3, PC4, and PC5 datasets, respectively. The findings on the PC3 dataset revealed an overall improvement, particularly when applying GA, which achieved an average testing accuracy of 93.1%. It is worth noting that KNN, SGB, and Stacking exceed the others in terms of accuracy. PSO also improved the accuracy, with an average of 93%. Compared with the original, the XGBoost accuracy improved significantly utilizing PSO and HA, reaching 88.4%. ACO achieved the best results on the PC4 dataset, with an average testing accuracy of 93.8%, followed by HA with an accuracy of 93.1%. Also, when comparing the metaheuristics, KNN, DT, SGB, and Cat Boost outperform the others, whereas XGBoost and LR perform poorly. Even though the PC5 dataset is the largest NASA dataset, with 17,001 instances in total, we found that using HA with SMOTE performs best, with an average accuracy of 98.4% compared to the original accuracy of 97.3%. ACO also showed a slight improvement of 0.4 percentage points (from 97.3% to 97.7%). Surprisingly, KNN outperforms the others when utilizing PSO, GA, and HA, whereas with ACO, KNN, DT, and GB performed best.
Fig. 16 shows the most effective features (software metrics) using KNN, AdaBoost, CatBoost, and DT with GA on the PC3 dataset. These classifiers have the best overall performance. The SHAP explanation plot reveals that the lines of code (blank), parameter count, and Halstead content metrics have the highest discriminative power.
Fig. 16. The most effective features utilizing KNN, AdaBoost, CatBoost, and DT with GA over the PC3 dataset.
Fig. 17 shows the ROC-AUC performance of the classifiers across the NASA datasets. Notably, GB, SGB, AdaBoost, RF, DT, and Cat Boost deliver the best performance compared to the others. This is not unexpected, given that these classifiers also achieved the highest overall accuracy scores. XGBoost and SVM likewise performed well. With AUC values above 90%, all classifiers performed best on the PC5, PC2, MC1, MW1, and PC4 datasets. In comparison to the others, SGB and Cat Boost performed the best, with SGB outperforming Cat Boost by a small margin.
Fig. 17. Classifiers' ROC-AUC comparison across the NASA datasets.
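The ROC-AUC comparison itself is straightforward to reproduce from predicted class probabilities. A minimal sketch, assuming scikit-learn and imbalanced-learn, with a synthetic dataset and a reduced classifier set standing in for the full experiment:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.25, random_state=7)
X_res, y_res = SMOTE(random_state=7).fit_resample(X_tr, y_tr)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=7),
    "GB": GradientBoostingClassifier(random_state=7),
}
for name, clf in classifiers.items():
    clf.fit(X_res, y_res)
    # ROC-AUC is computed from the positive-class probability scores.
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC-AUC = {auc:.3f}")
```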
6. Conclusion
Software defects decrease software quality in terms of efficiency, testability, and maintainability, and they can determine project success or failure. Therefore, predicting software defects at the early stages of development enhances software quality, decreases development and maintenance costs, and keeps the project on track. In this paper, we proposed a new framework for predicting software defects. Eleven machine learning classifiers from the statistical, boosting, bagging, and stacking families were applied to twelve NASA datasets. We also investigated the use of four search-based algorithms (particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization) for feature selection. Because the NASA datasets have a large distribution gap between the minority and majority classes, we employed SMOTE as a resampling technique for data balancing. In addition, we used the SHAP library to explain the results of the models and to highlight the most effective features. Testing accuracy and ROC-AUC were employed for model evaluation. The results revealed that the GB, SGB, DT, and Cat Boost classifiers achieve the best accuracy and performance compared to the others, and that the PC5 and MC1 datasets yield the best results. When comparing the meta-heuristic algorithms, we found that ant colony optimization delivers the best improvement in results, followed by the harmony algorithm. Finally, using SHAP, we showed that the call pairs, Halstead content, parameter count, and lines of code (comments) metrics had the highest discriminative power. As a future direction, we intend to compare the meta-heuristic algorithms with filter methods for feature selection and to compare the performance of SMOTE with other resampling techniques.
CRediT authorship contribution statement
Yazan Al-Smadi: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Validation, Writing - review & editing. Mohammed Eshtay: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Validation, Writing - review & editing. Ahmad Al-Qerem: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Validation, Writing - review & editing. Shadi Nashwan: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Validation, Writing - review & editing. Osama Ouda: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Validation, Writing - review & editing. A.A. Abd El-Aziz: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Validation, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.