Egyptian Informatics Journal 24 (2023) 100386
Available online 31 July 2023
1110-8665/© 2023 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Computers and Artificial Intelligence, Cairo University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Full length article
Reliable prediction of software defects using Shapley interpretable machine learning models

Yazan Al-Smadi a, Mohammed Eshtay b, Ahmad Al-Qerem a, Shadi Nashwan c,*, Osama Ouda c, A.A. Abd El-Aziz d,e

a Department of Computer Science, Faculty of Information Technology, Zarqa University, Zarqa 13110, Jordan
b Luminus Technical University College, Amman 11118, Jordan
c Computer Science Department, Computer and Information Sciences College, Jouf University, Sakaka 72388, Saudi Arabia
d Information System Department, Computer and Information Sciences College, Jouf University, Sakaka 72388, Saudi Arabia
e Information Systems & Technology Department, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt
ARTICLE INFO
Keywords:
Software Defect Prediction
Feature importance
Machine learning
Model interpretation
Shapley Additive Explanation
ABSTRACT
Predicting defect-prone software components can play a significant role in allocating relevant testing resources to fault-prone modules and hence increasing the business value of software projects. Most current software defect prediction studies use traditional supervised machine learning algorithms to predict defects in software applications. The software datasets used in such studies are imbalanced, so the reported results cannot be reliably used to judge their performance. Moreover, it is important to explain the output of the machine learning models employed in fault-prediction techniques to determine the contribution of each feature to the model output. In this paper, we propose a new framework for predicting software defects using eleven machine learning classifiers over twelve different datasets. For feature selection, we employ four different nature-inspired search algorithms, namely particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization. Moreover, we make use of the synthetic minority oversampling technique (SMOTE) to address the problem of data imbalance. Furthermore, we use the Shapley additive explanation model to highlight the most determinative features. The obtained results demonstrate that gradient boosting, stochastic gradient boosting, decision trees, and categorical boosting outperform the other tested models with over 90% accuracy and ROC-AUC. Additionally, we found that the ant colony optimization technique outperforms the other tested feature selection techniques.
1. Introduction
Nowadays, the globe has seen tremendous growth and application of technology in a variety of essential industries. As a consequence, software quality is the most critical part of software system development. Software testing, on the other hand, is a well-known quality assurance activity that aims to evaluate the internal and external quality aspects of a given software module, subsystem, or component using predefined criteria under certain conditions [1]. The process of ensuring high software quality is inextricably linked to the presence of software flaws; a reliable software system is deemed to be a good one [2]. For many years, artificial intelligence (AI) technologies [3] have been used to address several challenges in software quality, including forecasting software flaws at early stages, predicting software size and cost, and estimating software development effort.
Software organizations are seeing significant expansion and an increase in the demand for software projects in numerous medical, military, and industrial domains. However, to the best of our knowledge, it is difficult to produce faultless software projects. Therefore, the process of detecting and controlling software defects is critical to a project's success or failure. A software defect, according to the International Software Testing Qualifications Board (ISTQB) [4], is a problem in a specific software component or system that causes the component or system to fail to execute its necessary functions. Most software defects are caused by fundamental development flaws such as insufficient coding and design experience, confusing software requirements and documentation, insufficient test cases, and poor team communication [5].
* Corresponding author.
E-mail address: shadi_naswan@ju.edu.sa (S. Nashwan).
https://doi.org/10.1016/j.eij.2023.05.011
Received 10 October 2022; Received in revised form 29 April 2023; Accepted 24 May 2023
Software companies rely heavily on the process of detecting software defects. For many years, however, companies spotted faults manually by classifying reported problems as defects and then using test cases to determine whether the problem was replicated, delayed, or not [6]. Because of the extended cycle, this strategy is costly, requires a lot of human labor and time, and relies heavily on developers' experience [7].
The practice of finding defects in specific software modules using machine learning theories and methodologies is known as software defect prediction [8]. Machine learning is an artificial intelligence subfield that aims to give computers the capacity to learn tasks by fitting mathematical functions to large amounts of data to generate rules for data prediction and classification [9]. In contrast to manual approaches, predicting software defects is an automated approach that, applied in the early stages, has the advantages of improving software design structure, improving the testing process, lowering development costs, providing a powerful approach for software planning and scheduling, and increasing software reliability [10].
In general, as shown in Fig. 1, the process of predicting software defects involves numerous consecutive steps [11]. The approach begins by relying on open-source software projects [12] or software metrics [13], as employed by prior researchers. Software metrics extraction technologies such as CodeMR may be used with software projects [14]. Data preprocessing, on the other hand, is a critical stage in machine learning workflows; it is described as the process of repairing missing values (e.g., null values), cleaning noisy data, balancing data, selecting features, and scaling data toward more significant practices [15]. Finally, machine learning and deep learning approaches may be fitted to make predictions.
Several researchers have proposed alternative ways of predicting software defects over many years. Deep learning approaches [16], decision trees [17], random forests [18], and boosting ensemble-based methods [19] are among the approaches used. In this paper, we propose a new framework for predicting software defects. The proposed methodology evaluates the use of eleven machine learning classifiers, including logistic regression (LR), K-nearest neighbors (KNN), decision trees (DT), random forests (RF), support vector machine (SVM), adaptive boosting (AdaBoost), gradient boosting (GB), stochastic gradient boosting (SGB), extreme gradient boosting (XGBoost), categorical boosting (CatBoost), and stacked generalization, over twelve datasets from the National Aeronautics and Space Administration (NASA). In addition, four nature-inspired algorithms are used for feature selection: particle swarm optimization (PSO), genetic algorithm (GA), harmony algorithm (HA), and ant colony optimization (ACO). Due to the NASA dataset distribution, we use the synthetic minority oversampling technique (SMOTE) to balance the data. We also apply the Shapley additive explanation (SHAP) library to explore how much a single data sample or a series of sequential observations contributes to the prediction process.
The key contributions are as follows: (1) employing several classical and modern statistical, bagging, boosting, and stacking classifiers to compare their performance; (2) applying oversampling techniques to compensate for imbalanced classes; (3) utilizing nature-inspired algorithms for feature selection; (4) using the Shapley additive explanations library to explain the outputs of predictive models and demonstrate how much a single observation may add to the prediction process.
The rest of the paper is structured as follows: Section 2 showcases a variety of related studies conducted in the field. Section 3 illustrates the proposed software defect prediction framework, the datasets used, the feature selection approaches, and the employed classifiers. Section 4 presents the evaluation metrics and the hyperparameter tuning process. Section 5 discusses the experimental results and explains the classifier outputs using SHAP. Finally, the conclusion and future directions are given in Section 6.
2. Related works
Many experimental, investigational, empirical, and comparative research approaches have recently been proposed in the domain of software defect prediction. In [7], six machine learning classifiers were tested on seven datasets obtained from the National Aeronautics and Space Administration: logistic regression, random forest, naive Bayes, gradient boosting, support vector machine, and neural networks. The results showed that neural networks offer the best results, with an accuracy of 93% using 10-fold cross-validation. In [20], deep learning and bio-inspired feature selection methods were used. Particle swarm optimization (PSO) was used to improve neural network performance by removing strongly correlated features. In addition, four NASA datasets were gathered and preprocessed with min-max scaling. An area under the curve (AUC) of 0.92 was found to outperform the other results; however, the problem of imbalanced data distribution remained. In [21], firefly and wolf bio-inspired search-based algorithms were employed to select the most relevant features. Using the largest NASA datasets (JM1, PC5, and MC1), two classifiers were evaluated: support vector machine and random forest. Using the Waikato Environment for Knowledge Analysis (WEKA), the results revealed that a support vector machine with the wolf algorithm yields the best results, with an accuracy of 99.3% when compared to the others. However, a receiver operating characteristic (ROC) value of 51% demonstrated that the classifiers still performed close to random classification. Besides that, the imbalanced classes were not remedied. The study in [22] investigated the use of four filter-based feature selection techniques and fourteen search-based algorithms based on correlation and consistency feature subset selection to assess the performance of four classifiers: naive Bayes, decision trees, logistic regression, and k-nearest neighbors. In addition, five NASA datasets were gathered and preprocessed. The results show that applying filter approaches outperforms the others, particularly when leveraging information gain (IG). Also, compared to other search methods, feature selection approaches based on exhaustive search have the greatest effect. Nonetheless, the classifiers' performance was unstable and varied from one dataset to another.
Fig. 1. General Software Defects Prediction Process.
The study in [23] used a firefly search-based algorithm (FA) for feature selection. Three machine learning classifiers were used to evaluate the utility of FA for selecting the best features and enhancing model performance: support vector machine, naive Bayes, and k-nearest neighbors. Compared to the original findings, the support vector machine achieved the best results and improved accuracy by 4.53%. The study in [24] proposed a new method comprising several deep learning approaches for software defect prediction, using the genetic algorithm (GA) for feature selection and particle swarm optimization (PSO) for data clustering. Five NASA datasets were collected and cleaned of missing values. The results revealed that deep neural networks outperform the others, with an accuracy of 98.47% using 10-fold cross-validation.
Several studies recommend combining numerous feature selection approaches into a single list rather than using them individually. The study in [25] introduced a novel method to handle the problem of high-dimensional data and filter-method ranking in software defect prediction by combining various feature filter methods: chi-square, information gain, and relief. The suggested method compares the outcomes by assessing the effectiveness of decision trees (DT) and naive Bayes (NB). In addition, nine NASA datasets were used and sanitized. The findings show that the suggested method outperforms the use of individual methods, with NB accuracy increasing by 6.73% and DT accuracy by 1.87%.
In [26], the hyperparameters of ten machine learning classifiers were fine-tuned and optimized over several variants to boost their discriminative power. The proposed method applies Best First feature subset selection in three search directions to four NASA datasets. Datasets were preprocessed by resampling and normalizing. Results were evaluated using three evaluation metrics: F-measure, accuracy, and Matthews's correlation coefficient (MCC). The proposed method achieved 91.86% accuracy, 75% F-measure, and 66% MCC. Using ensemble learning, the study in [27] set up four classifiers (random forest, decision trees, support vector machine, and logistic regression) as base learners using bagging and boosting ensemble-based methods. Ten NASA datasets were preprocessed and balanced using the synthetic minority oversampling technique, but no feature selection techniques were performed. The experimental results revealed that the RF base learner, RF with AdaBoost, and RF with bagging outperform the others with an accuracy of 97%. In [28], boosting, bagging, stacking, and voting ensemble-based methods were applied to four classifiers: DT, KNN, NN, and sequential minimal optimization (SMO). Eleven NASA datasets were obtained and preprocessed, and a greedy search-based algorithm was used to minimize highly correlated features. The findings indicated that boosted SMO yields the best results when compared to the others, with an accuracy of 88.2%, while voting produces an accuracy of 87.9%.
Other studies have used open-source software project datasets to predict software defects. In [29], an experimental study was carried out to investigate the use of three machine learning classifiers (LR, KNN, and NB) across nine class-level open-source software projects with 33 distinct versions. Furthermore, a text matching method was used to remove duplication from the datasets. Using process metrics, the findings indicated that LR had the highest AUC of 88%. Ghosh et al. [30] proposed a new method for predicting software defects based on non-linear manifold methods. The Apache, Eclipse, Safe, and Zxing open-source software project datasets were collected, and four classifiers were used in the experiment: Bayesian belief network, NB, DT, and KNN. Using the stochastic proximity embedding non-linear approach, the Bayesian belief network had the highest accuracy of 80.53%. Several studies used various methodologies to predict software defects; Table 1 summarizes the related works. In this paper, we propose a new framework for predicting software defects based on meta-heuristic methods and eleven machine learning classifiers. In addition, we employ the SHAP library for model explanations and SMOTE as a resampling technique for the data balancing problem.
3. Proposed methodology
In the sections that follow, we outline the proposed framework for software defect prediction and emphasize its key processes and goals. The designed framework attempts to improve prediction accuracy and performance by eliminating highly correlated features and addressing the imbalanced data problem. Additionally, the framework's main purpose is to explain the outcomes of several machine learning models; it is useful to understand how much a single observation adds to the prediction process. As evaluation metrics in this work, classification measures such as testing accuracy and ROC-AUC were used. The proposed framework is shown in Fig. 2. To assure data authenticity, the essential dataset preparation processes were applied. The synthetic minority oversampling technique was used for data balancing since it does not replicate data records. For feature selection, this study employs four meta-heuristic search-based algorithms: particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization.
Table 1
Summary of Related Works.

| Reference | Datasets | Classifiers | Data Balancing | Meta-Heuristic | Results |
|---|---|---|---|---|---|
| [7] | 7 NASA datasets | LR, NB, GB, SVM, RF, ANN | No | No | ANN 93% and GB 89.5% validation accuracy. |
| [20] | 4 NASA datasets | ANN | No | Binary PSO | Best AUC of 92.9% over all datasets. |
| [21] | 3 NASA datasets | SVM, RF | No | Firefly and gray wolf | SVM and gray wolf yield the best results with an accuracy of 99.3%. |
| [22] | 5 NASA datasets | NB, DT, LR, KNN | No | 14 search-based algorithms and 4 filter-based methods | Filter methods outperform others, particularly information gain. |
| [23] | KC1 NASA dataset | SVM, NB, KNN | No | Firefly | SVM has the best accuracy improvement, by 4.53%. |
| [24] | 5 NASA datasets | FNN, RNN, ANN, DNN | No | GA and PSO | DNN has the best results compared to others, with an accuracy of 98.4%. |
| [25] | 9 NASA datasets | NB, DT | No | No | NB accuracy increased by 6.73% and DT by 1.87%. |
| [26] | 4 NASA datasets | NB, ANN, RBF, SVM, KNN, KStar, OneR, DT, RF, PART | No | No | The proposed method results in 91.86% accuracy, 75% F-measure, and 66% MCC. |
| [27] | 10 NASA datasets | RF, LR, DT, SVM | SMOTE | No | RF and RF with AdaBoost outperform others with an accuracy of 97%. |
| [28] | 11 NASA datasets | DT, KNN, ANN, SMO | No | No | SMO yields the best results when compared to others, with an accuracy of 88.2%. |
| [29] | 9 open-source software projects | LR, KNN, NB | No | No | LR has the best outcome with an AUC of 88%. |
| [30] | 4 open-source software projects | NB, DT, KNN, Bayesian belief network | No | No | Bayesian belief network had the highest accuracy of 80.53%. |
Eleven machine learning classifiers were constructed and evaluated using 20% testing sets over twelve NASA datasets. Finally, we employ Shapley additive explanations, a well-designed tool that aids in the explanation of prediction models by highlighting the most valuable features of the predictive models [31]. It is also a good visualization tool to show how much a single observation or a set of sequential data samples can affect the prediction process by calculating SHAP values. The SHAP model explanation is shown in Fig. 3.
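As an illustration of the kind of explanation used here, the following is a minimal sketch of explaining a tree-based classifier with the SHAP library; the data and feature names are hypothetical stand-ins, not the actual NASA MDP columns.

```python
# A minimal sketch of explaining a tree-based defect classifier with SHAP.
# The data is synthetic; the feature names are made-up software metrics.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["loc_total", "cyclomatic_complexity", "call_pairs"])
y = (X["cyclomatic_complexity"] + 0.5 * X["call_pairs"] > 0).astype(int)

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)      # model-specific explainer for trees
shap_values = explainer.shap_values(X)     # per-feature contribution per sample

shap.summary_plot(shap_values, X)          # bee swarm plot (global view)
shap.force_plot(explainer.expected_value,  # local force plot for one observation
                shap_values[9], X.iloc[9], matplotlib=True)
```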
3.1. Data collection and analysis
The National Aeronautics and Space Administration dataset is made up of thirteen datasets that describe various software components. In this study, we employ twelve NASA MDP datasets to predict software defects; the KC4 dataset was omitted owing to missing attributes and values. The data were obtained from an online repository, which can be found at https://figshare.com/collections/NASA_MDP_Software_Defects_Data_Sets/4054940/1. The major predictors for defect prediction are software metrics such as McCabe, Halstead, miscellaneous, and lines-of-code counts. Table 2 shows a description of the datasets. The datasets were preprocessed to clean missing values and balance the data; they proved to be free of missing values, noisy data, and missing attributes.
3.2. Data balancing
Imbalanced data, also known as imbalanced target classes, is a data distribution problem that causes machine learning classifiers to be biased toward one category over the others, negatively affecting prediction [32]. The NASA datasets exhibit a wide distribution gap across target classes. As a result, this study relies on SMOTE for data balancing. SMOTE is an oversampling technique that generates synthetic data samples to increase the number of minority-class samples by locating the k nearest neighbors of a minority sample, computing the distance to a selected neighbor, and multiplying it by a random value between 0 and 1 to place the new sample [33].
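As a rough sketch, SMOTE can be applied as follows, assuming the imbalanced-learn package; the data is synthetic and stands in for a NASA MDP feature matrix and defect labels.

```python
# A minimal SMOTE sketch: oversample the minority class with k = 5 neighbors.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                  # heavily imbalanced classes

smote = SMOTE(k_neighbors=5, random_state=0)  # k nearest minority neighbors
X_res, y_res = smote.fit_resample(X, y)       # synthetic minority samples added
print("after:", Counter(y_res))               # classes now balanced
```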
Fig. 2. Overview of the Proposed Software Defects Prediction Framework.
Fig. 3. Shapley additive explanations Model.
Table 2
NASA MDP D' Version Datasets Description.

Metric availability varies across the twelve datasets:
- McCabe: cyclomatic complexity, design complexity, and essential complexity (all twelve datasets); cyclomatic density (ten datasets).
- Halstead: content, difficulty, effort, error estimate, length, level, programming time, volume, number of operands, number of operators, number of unique operands, and number of unique operators (all twelve datasets).
- Miscellaneous: branch count (all twelve); decision density (eight); global data complexity and global data density (four); call pairs, condition count, decision count, design density, edge count, essential density, parameter count, maintenance severity, modified condition count, multiple condition count, node count, normalized cyclomatic complexity, and percent comments (ten each).
- LOC counts: LOC code and comment, LOC comments, LOC executable, and LOC total (all twelve); LOC blank (eleven); number of lines (ten).

| Dataset | Features | Instances | Defective | Non-defective | Language | KLOC |
|---|---|---|---|---|---|---|
| CM1 | 37 | 344 | 42 | 302 | C | 17 |
| JM1 | 21 | 9593 | 1759 | 7834 | C | 457 |
| KC1 | 21 | 2096 | 325 | 1771 | C++ | 43 |
| KC3 | 39 | 200 | 36 | 164 | Java | 8 |
| MC1 | 38 | 9277 | 68 | 9209 | C++ | 66 |
| MC2 | 39 | 127 | 44 | 83 | C++ | 6 |
| MW1 | 37 | 264 | 27 | 237 | C | 8 |
| PC1 | 37 | 759 | 61 | 698 | C | 26 |
| PC2 | 36 | 1585 | 16 | 1569 | C | 25 |
| PC3 | 37 | 1125 | 140 | 985 | C | 36 |
| PC4 | 37 | 1399 | 178 | 1221 | C | 30 |
| PC5 | 38 | 17,001 | 503 | 16,498 | C++ | 162 |
3.3. Feature selection
Feature selection is the process of choosing the most informative predictors among the existing features by removing highly correlated features. In this paper, we apply four bio-inspired search-based feature selection algorithms.
3.3.1. Particle swarm optimization (PSO)
PSO is a nature-inspired optimization algorithm influenced by the behavior of birds. It is a population-based algorithm in which each bird is referred to as a particle and a group of birds is referred to as a swarm (i.e., a population). The core notion of PSO is that each particle represents a candidate solution to the problem. Following random initialization, the location of each particle is modified, yielding new positions guided by the best previous position P_i^k of the particle and the best global position P_g^k of the swarm, as shown in Fig. 4, where r1 and r2 are randomly generated values in [0, 1], and c1 and c2 are constants that weight the pull toward P_i (the best previous position of the i-th particle) and P_g (the best position found by all particles) [34].
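A minimal binary-PSO feature-selection sketch follows; it assumes one common formulation (a 0/1 mask per particle with a sigmoid transfer function), not the authors' exact implementation, and uses synthetic data with a decision-tree fitness.

```python
# Binary PSO for feature selection: positions are 0/1 masks, velocities are
# updated with pbest/gbest terms, and a sigmoid maps velocity to selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, mask.astype(bool)], y, cv=3).mean()

n_particles, n_iter, w, c1, c2 = 20, 20, 0.7, 1.5, 1.5
pos = rng.integers(0, 2, size=(n_particles, X.shape[1]))   # binary positions
vel = rng.normal(size=pos.shape)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(int)  # sigmoid transfer
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit                              # update personal bests
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()                # update global best

print("selected features:", np.flatnonzero(gbest))
```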
3.3.2. Genetic algorithm (GA)
A genetic algorithm is a population-based meta-heuristic used to address optimization problems. It is a chromosome-inspired search-based algorithm. GA passes through the following stages: initializing the population, applying the fitness function, selection, and reproduction [35]. To begin, random individuals are created, each of which carries parameters known as genes, which form chromosomes. The core idea behind GA is that each individual offers a solution to the problem. Each individual is evaluated with a fitness score using the fitness function to determine whether or not it is suitable for reproduction. Following the selection process, the crossover function can be used to generate new individuals [11]. Fig. 5 shows the GA process.
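The sketch below shows one common GA formulation for feature selection, under the same assumptions as the PSO example (0/1 masks, cross-validated decision-tree fitness); tournament selection, one-point crossover, and bit-flip mutation are illustrative choices, not the paper's exact settings.

```python
# A minimal GA for feature selection: chromosomes are 0/1 feature masks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, mask.astype(bool)], y, cv=3).mean()

pop_size, n_gen, p_cross, p_mut = 20, 20, 0.6, 0.05
pop = rng.integers(0, 2, size=(pop_size, X.shape[1]))

for _ in range(n_gen):
    fit = np.array([fitness(ind) for ind in pop])
    children = []
    while len(children) < pop_size:
        # Tournament selection: the fitter of two random individuals wins.
        a, b = rng.integers(pop_size, size=2), rng.integers(pop_size, size=2)
        p1 = pop[a[fit[a].argmax()]].copy()
        p2 = pop[b[fit[b].argmax()]].copy()
        if rng.random() < p_cross:                     # one-point crossover
            cut = rng.integers(1, X.shape[1])
            p1[cut:], p2[cut:] = p2[cut:].copy(), p1[cut:].copy()
        for child in (p1, p2):
            flip = rng.random(X.shape[1]) < p_mut      # bit-flip mutation
            child[flip] ^= 1
            children.append(child)
    pop = np.array(children[:pop_size])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```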
3.3.3. Harmony search algorithm (HA)
Harmony search is a population-based meta-heuristic algorithm inspired by the musical practice of finding the best harmonies. The entire point of HA is to emulate the musicians' approach to discover the optimal solution. HA passes through multiple phases: initialization of parameters, initialization of harmony memory, improvisation of a new harmony, and updating of harmony memory [36]. Following the initialization of the basic parameters (harmony memory (HM) size, harmony memory consideration rate (HMCR), and pitch adjustment rate (PAR)), the harmony memory matrix is constructed and filled at random with values of the decision variables. Each row in the matrix indicates a potential solution. Relying on HMCR and PAR, additional vectors are generated. Finally, if the new harmony vectors outperform the old ones, they are preserved as the best solutions in the HM [37]. The HA pseudocode is shown in Fig. 6.
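A minimal binary adaptation of harmony search for feature selection is sketched below, assumed for illustration rather than taken from the paper: each bit of a new harmony is drawn from memory with probability HMCR, pitch-adjusted (flipped) with probability PAR, and the new harmony replaces the worst one in memory when it scores better.

```python
# A minimal binary harmony-search sketch for feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, mask.astype(bool)], y, cv=3).mean()

hm_size, n_iter, hmcr, par = 20, 100, 0.9, 0.3
memory = rng.integers(0, 2, size=(hm_size, X.shape[1]))     # harmony memory
scores = np.array([fitness(h) for h in memory])

for _ in range(n_iter):
    new = np.empty(X.shape[1], dtype=int)
    for j in range(X.shape[1]):
        if rng.random() < hmcr:                             # memory consideration
            new[j] = memory[rng.integers(hm_size), j]
            if rng.random() < par:                          # pitch adjustment
                new[j] ^= 1
        else:                                               # random selection
            new[j] = rng.integers(0, 2)
    score = fitness(new)
    worst = scores.argmin()
    if score > scores[worst]:                               # update harmony memory
        memory[worst], scores[worst] = new, score

print("selected features:", np.flatnonzero(memory[scores.argmax()]))
```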
3.3.4. Ant colony optimization (ACO)
ACO is one of the most widely utilized meta-heuristic search-based algorithms across several disciplines. It is an improved version of Marco Dorigo's original algorithm, the ant system, which was introduced in 1992 [38]. ACO was inspired by the foraging activity of ant colonies. The goal of ACO is to locate the shortest path between the ant nest and the food supply; each route represents a possible solution. After the paths have been initialized, and with the assistance of ant pheromones, which are organic chemical substances, the ants converge on the shortest path to the nest, which represents the ideal solution. Many models have been enhanced with ACO; to solve the challenge of autonomous surface vehicles, Dimitrios [39] suggested an updated version of ACO based on fuzzy logic. The ACO pseudocode is shown in Fig. 7.
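The sketch below uses a deliberately simplified per-feature pheromone model of ACO for feature selection, assumed for illustration only (real ACO variants build richer path constructions): each ant picks features with probability driven by pheromone, and pheromone evaporates and is reinforced in proportion to ant fitness.

```python
# A simplified ACO-style feature-selection sketch with one trail per feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, mask.astype(bool)], y, cv=3).mean()

n_ants, n_iter, evaporation = 20, 20, 0.1
pheromone = np.full(X.shape[1], 0.5)                   # one trail per feature
best_mask, best_fit = None, -1.0

for _ in range(n_iter):
    # Selection probability proportional to pheromone, kept in (0.05, 0.95).
    prob = np.clip(pheromone / pheromone.max(), 0.05, 0.95)
    masks = (rng.random((n_ants, X.shape[1])) < prob).astype(int)
    fits = np.array([fitness(m) for m in masks])
    pheromone *= (1 - evaporation)                     # evaporation
    for m, f in zip(masks, fits):                      # reinforcement by fitness
        pheromone += evaporation * f * m / n_ants
    if fits.max() > best_fit:
        best_fit, best_mask = fits.max(), masks[fits.argmax()].copy()

print("selected features:", np.flatnonzero(best_mask))
```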
3.4. Classification models
In this paper, we propose a new software defect prediction framework comprising the following eleven machine learning models: a statistical classifier (logistic regression), the K-nearest neighbors approach, a support vector machine-based classifier, the decision trees approach (J48), and numerous ensemble methods (stacking, random forest, adaptive boosting, gradient boosting, stochastic gradient boosting, extreme gradient boosting, and categorical boosting).
3.4.1. Logistic regression
A logistic regression model, like a linear regression model, predicts a dependent variable by examining the relationship between independent data variables. However, it is used for true/false binary predictions rather than continuous predictions [40]. It fits an s-shaped curve using the sigmoid function h_θ(x) = 1 / (1 + e^(−θᵀx)), causing the predicted value to range between 0 and 1 and giving the probability that y = 1 for the given x. A high value of θᵀx indicates a high probability that y = 1, whereas a predicted value of 0.5 indicates a 50% likelihood that y = 1.
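A small numeric illustration of this sigmoid mapping follows; the weight and input values are made up for illustration.

```python
# h_theta(x) = 1 / (1 + e^{-theta^T x}) mapped onto a made-up example.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta, x = np.array([0.8, -0.4]), np.array([2.0, 1.0])
p = sigmoid(theta @ x)   # theta^T x = 1.2
print(p)                 # ~0.769: predicted probability that y = 1
```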
3.4.2. K-nearest neighbors
KNN [41] is a distance-based method that determines the distance between training and test samples. The KNN classifier is a lazy algorithm, since it stores the data samples during the learning stage and does its work during the evaluation phase. Notwithstanding, KNN passes through several phases: begin by determining the value of K (the number of nearest samples), then calculate the distance between the training samples and the new data sample using a distance function. The distance values are then sorted, the nearest samples are identified based on the value of K, and the prediction is determined by the majority of their class values.
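A compact from-scratch sketch of those phases follows, with synthetic stand-in data: compute distances, sort, take the K nearest labels, and vote.

```python
# Minimal KNN: Euclidean distances, K nearest labels, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)        # distance function
    nearest = np.argsort(dists)[:k]                        # K closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.8, 8.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([7.5, 7.9])))  # -> 1
```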
3.4.3. Decision trees
DT [42] is a tree-based classifier that divides the attributes of the training data into partitions. In DT, predictions are created by making judgments based on a series of questions that are closely relevant to the prediction's aim. It is a non-parametric approach that does not necessitate a mapping function between predictors and target data. The process of picking the optimal decision values at each depth level is guided by an impurity measure; the Gini index function may be used to measure the impurity of the tree.
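A short sketch of the Gini impurity measure mentioned above, G = 1 − Σ_c p_c², computed over the class proportions at a node:

```python
# Gini impurity of a node's labels: 0 for a pure node, 0.5 for a 50/50 split.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions at the node
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))              # 0.0  -> pure node
print(gini([0, 0, 1, 1]))              # 0.5  -> maximally impure binary node
```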
3.4.4. Random forests
RF [43] is a bagging ensemble-based classifier. It distributes the training data over several decision trees as the base learners. RF is also a non-parametric model. A restricted random number of rows is chosen from the data population, and a decision tree is built for each segment to generate classification output; an individual decision tree may be bigger than the others. A majority vote in RF acts as a meta-classifier: the classification decision is made based on the judgments of the base learners. RF supports numeric and categorical data types, as well as complicated predictor functions.
3.4.5. Support vector machine
A support vector machine [44] is a classifier that uses hyperplanes to find the best decision boundary in a dimensional space defined by the number of features. Because there are numerous candidate hyperplanes, the goal of the support vector machine is to pick the ideal hyperplane with the largest margins between data samples to improve classification accuracy. Both linear and non-linear classification can be done. Generally, a support vector machine performs better and is less prone to overfitting when the class distribution is known; when maximum margins are employed, new data may be appropriately fit and classified.
3.4.6. Adaptive boosting
AdaBoost [45] is an ensemble-based boosting classifier. In contrast to decision trees and random forests, several shallow trees are combined sequentially in AdaBoost. Each of these contains a single node and two leaves, forming what is known as a stump; a collection of stumps is a stump forest. Stumps are weak classification learners, and AdaBoost's entire concept is to combine numerous weak learners for classification. The sequence of stump creation is critical in AdaBoost, since each produced tree influences the output of the following one. Each data sample is weighted to determine the order of stumps, starting from w = 1 / (total number of samples). The sample weight is then updated repeatedly by w_new = w_old · e^(±α), where α is the stump's amount of say. Also, the Gini index may be used to rank stumps: the lower the Gini value, the more essential the stump. Fig. 8 shows the AdaBoost pseudocode.
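A small numeric illustration of this weight update follows, with made-up values: misclassified samples are up-weighted by e^(+α) and correctly classified ones down-weighted by e^(−α).

```python
# One AdaBoost weight update on five samples with a made-up stump error.
import numpy as np

n = 5
w = np.full(n, 1.0 / n)                    # initial weights: 1 / total samples
err = 0.2                                  # weighted error of the stump
alpha = 0.5 * np.log((1 - err) / err)      # "amount of say" of the stump
correct = np.array([True, True, True, True, False])

w *= np.exp(np.where(correct, -alpha, alpha))  # w_new = w_old * e^{+/- alpha}
w /= w.sum()                                   # renormalize to sum to 1
print(np.round(w, 3))                          # the misclassified sample now carries weight 0.5
```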
3.4.7. Gradient boosting
GB [46] is a boosting ensemble-based algorithm for classification and regression that employs the idea of pseudo residuals (PR). Unlike AdaBoost, gradient boosting begins with a single leaf that reflects the average value of the predicted target rather than a stump. GB then creates fixed-size trees, similar to AdaBoost, except that each tree can be larger than a stump. Decision trees are used as base learners in GB. The difference between the actual and predicted values is represented by the PR, which may be determined by fitting a regression line as the loss function. In GB, prediction errors are reduced by repeatedly updating the PR values from tree to tree, making it a powerful prediction approach. Fig. 9 shows the GB pseudocode.
Fig. 4. Particle Swarm Optimization Architecture Illustration.
Fig. 5. Genetic Algorithm Pseudocode.
Fig. 6. Harmony Search Algorithm Pseudocode.
Fig. 7. Ant Colony Optimization pseudocode.
Fig. 8. Adaptive Boosting pseudocode.
3.4.8. Stochastic gradient boosting
Friedman [47] introduced the stochastic gradient boosting approach, an ensemble-based method inspired by a hybrid of the homogeneous boosting and bagging procedures. By weighting the base learners, it may be used for regression and classification. Decision trees are the most widely used SGB base learners, and each tree is trained by selecting a random subset of the data sample records. Compared to bagging methods, SGB is an effective instrument for limiting the chance of overfitting by randomly eliminating a portion of the input data [48]. SGB has been utilized in a variety of areas for many years; Sérgio Godinho et al. [49] employed SGB for multi-image classification in their inquiry study.
3.4.9. Extreme gradient boosting
XGBoost [50] is an ensemble method based on boosting. It is a more advanced version of gradient boosting, created for large and intricate datasets. Unlike adaptive and gradient boosting, XGBoost employs unique regression trees, forming what are known as born-again trees. XGBoost tree formation, on the other hand, begins with a single leaf, much like gradient boosting. Furthermore, gradient boosting and regularization are critical steps in XGBoost. For many years, XGBoost has been one of the most effective boosting methods. In their experimental investigation, Ismail and Faisal [51] demonstrated that XGBoost outperforms other classifiers such as RF, SVM, radial basis function neural networks, and naive Bayes. It is simple to use and provides a high level of discriminative prediction accuracy, in addition to its ability to handle large data distributions with imbalanced target classes.
Fig. 9. Gradient Boosting Pseudocode.
Fig. 10. ROC-AUC ranges and values.
Table 3
Boosting Hyperparameters Grid-Search Space.

| Parameters | Grid-Search Space |
|---|---|
| Number of estimators | 50, 70, 90, 100, 120, 150, 180, 200 |
| Learning rate | 0.001, 0.01, 0.1, 1, 10 |
| Loss function | deviance, exponential |
| Algorithms | SAMME, SAMME.R |
Fig. 11. NASA MDP Selected Features Using Meta-Heuristic Search-Based Algorithms (number of selected features per dataset, CM1 through PC5, for SMOTE+PSO, SMOTE+GA, SMOTE+HA, and SMOTE+ACO).
3.4.10. Categorical boosting
CatBoost is a more advanced kind of gradient boosting, proposed in 2018 by Anna Veronika Dorogush and the Yandex corporate team [52]. It was created to be the optimal option for heterogeneous data, as opposed to bagging and stacking. Based on a boosting approach, it was designed to handle categorical and numerical features. CatBoost is an open-source package that allows for extremely rapid graphics processing unit (GPU) and central processing unit (CPU) calculations. It is very effective for small datasets and simple to use. It is also a decision-tree-based method, which has the benefit of decreasing overfitting. CatBoost has long been a rival to other gradient boosting methods; Jengei Hong [53] found that CatBoost outperformed the XGBoost and LightGBM implementations in his experimental investigation on predicting house pricing.
3.4.11. Ensemble stacking
Stacking [54] is a heterogeneous ensemble-based method. As opposed to bagging and boosting, stacking divides a machine learning model into two layers. Stacking proceeds in several stages: dataset gathering and feature extraction, base learner construction, and meta learner creation. The base layer has N distinct classifiers. To begin, the training data is divided into K portions, and each base learner produces predictions based on the K folds. A meta dataset then integrates the prediction outcomes of the base learners, and a meta classifier forms the second level of prediction, generating output based on the previous predictions. In this work, we employ RF and SVM as base learners and LR as the meta classifier.
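A minimal sketch of this setup (RF and SVM as base learners, logistic regression as the meta classifier) is shown below using scikit-learn's StackingClassifier; the data is a synthetic stand-in.

```python
# Two-layer stacking: RF + SVM base learners, logistic regression meta learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # meta learner over base predictions
    cv=5)                                   # K-fold split builds the meta dataset
print(stack.fit(X_tr, y_tr).score(X_te, y_te))
```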
4. Evaluation metrics and hyperparameters tuning
Several evaluation metrics may be used to evaluate machine learning classifiers, including accuracy, precision, recall, the receiver operating characteristic (ROC), F-measure, and the area under the curve (AUC). A confusion matrix, on the other hand, is a table that contains a collection of characteristics that may be used to describe the performance of a classifier [55]. It may be used to visualize and compare the statistical performance of various algorithms.
Table 5
Classifiers Testing Accuracy Comparison over JM1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.870766 0.690316 0.700403 0.794889 0.701076
KNN 0.936425 0.972091 0.972764 0.970746 0.972428
Decision trees 0.930693 0.956288 0.954943 0.954270 0.949227
Random Forest 0.915060 0.924681 0.920309 0.912239 0.914257
Ada Boost 0.915060 0.924681 0.920309 0.912239 0.914257
Gradient Boosting 0.955706 0.969401 0.970746 0.970746 0.971083
SGB 0.959354 0.975118 0.975118 0.975454 0.975118
XGBoost 0.818134 0.790182 0.459314 0.828850 0.823134
Stacking 0.925000 0.967720 0.968393 0.967720 0.967720
Cat Boost 0.957791 0.974781 0.974781 0.973436 0.973100
SVM 0.819177 0.968393 0.969738 0.969065 0.968729
Accuracy (AVG) 0.851 0.919 0.889 0.929 0.92
Table 6
Classifiers Testing Accuracy Comparison over KC1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.850000 0.865041 0.865041 0.860163 0.871545
KNN 0.828571 0.889431 0.889431 0.887805 0.887805
Decision trees 0.809524 0.856911 0.866667 0.879675 0.876423
Random Forest 0.852381 0.882927 0.886179 0.892683 0.891057
Ada Boost 0.852381 0.882927 0.886179 0.892683 0.891057
Gradient Boosting 0.866667 0.884553 0.884553 0.889431 0.884553
SGB 0.869048 0.894309 0.891057 0.899187 0.895935
XGBoost 0.847619 0.461789 0.577236 0.856911 0.866667
Stacking 0.890476 0.889251 0.889251 0.889251 0.889251
Cat Boost 0.876190 0.897561 0.897561 0.895935 0.895935
SVM 0.850000 0.886179 0.886179 0.884553 0.886179
Accuracy (AVG) 0.853 0.844 0.856 0.883 0.884
Table 4
Classifiers Testing Accuracy Comparison over CM1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.826087 0.867188 0.859375 0.789062 0.882812
KNN 0.884058 0.898438 0.875000 0.906250 0.890625
Decision trees 0.869565 0.906250 0.882812 0.914062 0.859375
Random Forest 0.884058 0.906250 0.906250 0.906250 0.906250
Ada Boost 0.884058 0.906250 0.906250 0.906250 0.906250
Gradient Boosting 0.913043 0.937500 0.921875 0.929688 0.937500
SGB 0.942029 0.921875 0.906250 0.937500 0.929688
XGBoost 0.884058 0.820312 0.882812 0.851562 0.632812
Stacking 0.857143 0.875000 0.890625 0.890625 0.890625
Cat Boost 0.884058 0.906250 0.898438 0.898438 0.906250
SVM 0.884058 0.914062 0.890625 0.914062 0.914062
Accuracy (AVG) 0.882 0.896 0.892 0.894 0.877
Fig. 12. The 10th Observation Impact on Predictive Models and Highest Discriminative Power Features over the CM1 dataset. A) Shows call pairs as the most impactful feature in GB, with a value of 1.343 compared to a 0.5 base value. B) Shows call pairs with a value of 1.343 compared to a 0.49 base value in SGB. C) Shows lines-of-code comments with a value of 25.17 compared to a 0.6 base value in CatBoost.
Table 7
Classifiers Testing Accuracy Comparison over KC3 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.825 0.860163 0.860163 0.919355 0.967742
KNN 0.775 0.887805 0.887805 0.903226 0.838710
Decision trees 0.925 0.873171 0.876423 0.903226 0.919355
Random Forest 0.800 0.889431 0.889431 0.935484 0.935484
Ada Boost 0.800 0.889431 0.889431 0.935484 0.935484
Gradient Boosting 0.875 0.889431 0.889431 0.919355 0.903226
SGB 0.875 0.897561 0.895935 0.935484 0.919355
XGBoost 0.825 0.855285 0.858537 0.919355 0.870968
Stacking 0.700 0.889251 0.889251 0.903226 0.903226
Cat Boost 0.900 0.895935 0.895935 0.935484 0.935484
SVM 0.825 0.884553 0.884553 0.919355 0.919355
Accuracy (AVG) 0.829 0.882 0.882 0.92 0.913
Table 8
Classifiers Testing Accuracy Comparison over MC1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.990841 0.950879 0.950879 0.950879 0.933296
KNN 0.994073 0.998046 0.998046 0.998046 0.997767
Decision trees 0.991379 0.998046 0.998325 0.998046 0.996930
Random Forest 0.994612 0.996651 0.996651 0.996651 0.996651
Ada Boost 0.994612 0.996651 0.996651 0.996651 0.996651
Gradient Boosting 0.993534 0.996930 0.996930 0.996930 0.997488
SGB 0.993534 0.996372 0.997767 0.996651 0.919355
XGBoost 0.992457 0.485906 0.991627 0.975440 0.891711
Stacking 0.995690 0.996094 0.996094 0.996094 0.996094
Cat Boost 0.994612 0.996093 0.996093 0.996093 0.996093
SVM 0.992457 0.995814 0.995814 0.995814 0.996372
Accuracy (AVG) 0.992 0.945 0.992 0.99 0.973
In short, a confusion matrix has four key properties that may be used to derive different evaluation metrics, which are as follows:
• True positive (TP): the number of defective modules that have been correctly classified as defective.
• True negative (TN): the number of non-defective modules that have been correctly classified as non-defective.
• False positive (FP): the number of non-defective modules that have been misclassified as defective.
• False negative (FN): the number of defective modules that have been misclassified as non-defective.
In this study, we evaluate the proposed classifiers using testing accuracy and AUC. AUC is a performance metric that indicates how well a certain classifier performs classification by distinguishing between target classes [56]. The true positive rate (sensitivity) and the false positive rate (1 − specificity) are two threshold-dependent values with an inverse relationship: sensitivity increases as specificity decreases, and vice versa. The higher the AUC, the better, indicating that the classifier is not performing random classification. Fig. 10 shows the AUC ranges and values. 20% of the data was held out as a testing set to evaluate classification accuracy. All of the statistical measures listed above may be extracted from the confusion matrix as follows:
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
• True positive rate (sensitivity) = TP / (TP + FN)
• False positive rate = 1 − specificity = FP / (TN + FP)
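A short sketch deriving these measures from a confusion matrix follows, assuming scikit-learn and made-up predictions.

```python
# Accuracy, sensitivity, and false positive rate from a confusion matrix.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)                  # true positive rate
fpr = fp / (tn + fp)                          # 1 - specificity
print(accuracy, sensitivity, fpr, roc_auc_score(y_true, y_pred))
```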
In this work, we use the Grid-Search method for tuning and optimizing the classifiers' hyperparameters. Grid-Search is a search strategy for choosing the optimal parameters by evaluating all of the possible value combinations in the search space [57]. Boosting ensemble method parameters, including the number of estimators, learning rate, loss function, and algorithm, were tuned based on a set of values, and 5-fold cross-validation was employed. The grid-search parameter space for boosting methods is shown in Table 3.
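As a rough sketch, the Table 3 search space can be explored with scikit-learn's GridSearchCV as below (estimators and learning rate shown; the loss and algorithm entries apply to specific boosting classifiers); the data is synthetic.

```python
# Grid search over part of the Table 3 space with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    "n_estimators": [50, 70, 90, 100, 120, 150, 180, 200],
    "learning_rate": [0.001, 0.01, 0.1, 1, 10],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```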
Table 9
Classifiers Testing Accuracy Comparison over MC2 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.730769 0.950879 0.542857 0.542857 0.714286
KNN 0.769231 0.998046 0.542857 0.542857 0.771429
Decision trees 0.576923 0.998046 0.657143 0.657143 0.771429
Random Forest 0.769231 0.996651 0.800000 0.857143 0.828571
Ada Boost 0.769231 0.996651 0.800000 0.857143 0.828571
Gradient Boosting 0.769231 0.996930 0.828571 0.828571 0.800000
SGB 0.692308 0.996930 0.800000 0.857143 0.857143
XGBoost 0.653846 0.973207 0.657143 0.685714 0.657143
Stacking 0.846154 0.996094 0.705882 0.705882 0.647059
Cat Boost 0.692308 0.996093 0.800000 0.800000 0.800000
SVM 0.692308 0.995814 0.685714 0.685714 0.685714
Accuracy (AVG) 0.723 0.99 0.71 0.728 0.759
Fig. 13. Shows the contribution of 35 data samples using AdaBoost and DT over the MC2 dataset utilizing SMOTE and PSO. A) Shows the Halstead effort, call pairs, and Halstead difficulty metrics as the highest discriminative features using AdaBoost. B) Shows the call pairs, edge count, and Halstead difficulty metrics as the most influential features over 35 data samples using DT.
Fig. 14. Shows the contribution of 35 data samples using RF and SGB over the MC2 dataset utilizing SMOTE and PSO. A) Shows the effects of the global data density and call pairs metrics using RF. B) Shows the effects of the Halstead effort, global data density, and call pairs metrics using SGB.
Table 10
Classifiers Testing Accuracy Comparison over MW1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.924528 0.542857 0.542857 0.542857 0.824176
KNN 0.905660 0.542857 0.542857 0.542857 0.956044
Decision trees 0.886792 0.657143 0.657143 0.657143 0.879121
Random Forest 0.924528 0.828571 0.857143 0.828571 0.956044
Ada Boost 0.924528 0.828571 0.857143 0.828571 0.956044
Gradient Boosting 0.886792 0.800000 0.828571 0.800000 0.901099
SGB 0.849057 0.800000 0.800000 0.800000 0.923077
XGBoost 0.905660 0.685714 0.800000 0.742857 0.912088
Stacking 0.925926 0.705882 0.705882 0.705882 0.869565
Cat Boost 0.924528 0.800000 0.800000 0.800000 0.956044
SVM 0.905660 0.685714 0.685714 0.685714 0.934066
Accuracy (AVG) 0.905 0.715 0.733 0.72 0.915
Table 11
Classifiers Testing Accuracy Comparison over PC1 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.914474 0.862816 0.862816 0.862816 0.894958
KNN 0.901316 0.945848 0.945848 0.945848 0.962185
Decision trees 0.914474 0.909747 0.916968 0.924188 0.941176
Random Forest 0.921053 0.938628 0.924188 0.938628 0.957983
Ada Boost 0.921053 0.938628 0.924188 0.938628 0.957983
Gradient Boosting 0.907895 0.942238 0.942238 0.935018 0.962185
SGB 0.901316 0.942238 0.945848 0.942238 0.957983
XGBoost 0.940789 0.837545 0.859206 0.902527 0.907563
Stacking 0.934211 0.927536 0.927536 0.927536 0.974790
Cat Boost 0.921053 0.942238 0.942238 0.942238 0.962185
SVM 0.921053 0.942238 0.942238 0.942238 0.949580
Accuracy (AVG) 0.917 0.92 0.927 0.933 0.953
5. Experimental results and models explanations
The findings of the proposed classifiers for predicting software defects are discussed in this section. SMOTE was utilized for data balancing after data preprocessing, and four meta-heuristic methods were tuned and used for feature selection. In ACO, we employed the logistic map function as the chaotic type and 20 initial particles with 20 iterations. In PSO, we utilized a population size of 20 with 20 iterations. In GA, we employed a crossover probability of 6% with a maximum of 20 generations. In addition, we employed the logistic map function in HA with 20 iterations and a population size of 20.
Fig. 11 shows the number of features selected for each dataset. In comparison to the others, GA selected the largest number of features, with ACO next.
Tables 4-6 compare the accuracy of the classifiers on the CM1, JM1, and KC1 datasets before and after applying the meta-heuristic feature selection algorithms.
Table 12
Classifiers Testing Accuracy Comparison over PC2 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.990536 0.862816 0.862816 0.862816 0.955684
KNN 0.990536 0.945848 0.945848 0.945848 0.996146
Decision trees 0.987382 0.909747 0.916968 0.909747 0.996146
Random Forest 0.990536 0.938628 0.935018 0.938628 0.996146
Ada Boost 0.990536 0.938628 0.935018 0.938628 0.996146
Gradient Boosting 0.990536 0.938628 0.935018 0.945848 0.996146
SGB 0.993691 0.945848 0.942238 0.945848 0.998073
XGBoost 0.990536 0.898917 0.675090 0.862816 0.445087
Stacking 0.987421 0.927536 0.927536 0.927536 0.992308
Cat Boost 0.990536 0.942238 0.942238 0.942238 0.996146
SVM 0.990536 0.942238 0.942238 0.942238 0.994220
Accuracy (AVG) 0.9902 0.932 0.909 0.929 0.9406
Fig. 15. Shows the most effective features over the PC1 dataset using ACO and SMOTE.
The testing accuracy averages were computed to ease comparison of the outcomes. Overall, the results showed an improvement in testing accuracy, with GB, SGB, and CatBoost outperforming the others. This is not surprising, given that boosting methods reduce bias at the cost of increased variance, and SMOTE increases the number of data samples by expanding the minority class.
However, XGBoost performs the worst when compared to the others when utilizing PSO and GA on the JM1 and KC1 datasets, with testing accuracy ranging from 45 to 57 percent; the reason is most likely overfitting. In light of this, GridSearchCV was used to adjust the XGBoost hyperparameters. To the best of our understanding, boosting methods in general are prone to overfitting, particularly with deep trees. SMOTE with the features selected by PSO yields the best results on the CM1 dataset, with an average accuracy of 89.6%. The results on the JM1 dataset demonstrate that SMOTE and HA outperform the others, with an average testing accuracy of 92.9%. Also, on the KC1 dataset, SMOTE and ACO outperform the others, with an average accuracy of 88.4%.
Fig. 12 shows SHAP plots that explain the GB, SGB, and CatBoost results. The bee swarm plot was used to describe the important features based on SHAP values, and the local force plot was used to demonstrate how a single observation contributed to the prediction models. We chose the tenth observation at random from the CM1 dataset. The call pairs metric pushes the prediction to the right in GB and SGB, with a value of 1.343 compared to the base value of 0.5.
However, the impact of call pairs on the prediction model decreases as the parameter count and design complexity metrics increase. Lines-of-code comments have the strongest impact in CatBoost at the 10th observation, at 25.17 relative to the base value of 0.6, whereas the design complexity metric has a negative impact. Overall, it can be noted that the lines-of-code comments, design density, and Halstead content metrics have the highest discriminative power in the predictive models.
The comparative findings using the meta-heuristics and SMOTE across the KC3, MC1, and MC2 datasets are shown in Tables 7-9. Aside from the general performance enhancement, SGB, CatBoost, LR, and RF surpass the others using the HA algorithm on the KC3 dataset, with an average accuracy of 92% compared to the original finding of 82.9%.
Table 14
Classifiers Testing Accuracy Comparison over PC4 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.925000 0.862816 0.862816 0.862816 0.860262
KNN 0.875000 0.945848 0.945848 0.945848 0.938865
Decision trees 0.964286 0.924188 0.924188 0.924188 0.978166
Random Forest 0.928571 0.935018 0.931408 0.938628 0.958515
Ada Boost 0.928571 0.935018 0.931408 0.938628 0.958515
Gradient Boosting 0.975000 0.938628 0.935018 0.942238 0.973799
SGB 0.971429 0.942238 0.938628 0.945848 0.931507
XGBoost 0.860714 0.469314 0.758123 0.866426 0.871179
Stacking 0.921429 0.927536 0.927536 0.927536 0.890830
Cat Boost 0.971429 0.942238 0.942238 0.942238 0.982533
SVM 0.871429 0.942238 0.942238 0.942238 0.901747
Accuracy (AVG) 0.926 0.89 0.917 0.931 0.938
Table 15
Classifiers Testing Accuracy Comparison over PC5 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.963834 0.862816 0.862816 0.913023 0.955054
KNN 0.975889 0.945848 0.945848 0.991103 0.988802
Decision trees 0.973243 0.898917 0.909747 0.987728 0.989109
Random Forest 0.971479 0.942238 0.931408 0.978831 0.979291
Ada Boost 0.971479 0.942238 0.931408 0.978831 0.979291
Gradient Boosting 0.978536 0.938628 0.942238 0.990950 0.991256
SGB 0.976183 0.938628 0.938628 0.989262 0.931507
XGBoost 0.971479 0.862816 0.862816 0.969781 0.961190
Stacking 0.970588 0.927536 0.927536 0.987423 0.985276
Cat Boost 0.974125 0.942238 0.942238 0.986041 0.987421
SVM 0.970891 0.942238 0.942238 0.984814 0.982973
Accuracy (AVG) 0.973 0.928 0.927 0.984 0.977
Table 13
Classifiers Testing Accuracy Comparison over PC3 Dataset.
Classifiers Pure PSO/SMOTE GA/SMOTE HA/SMOTE ACO/SMOTE
LR 0.871111 0.862816 0.862816 0.862816 0.805479
KNN 0.871111 0.945848 0.945848 0.945848 0.928767
Decision trees 0.857778 0.916968 0.916968 0.898917 0.893151
Random Forest 0.875556 0.935018 0.942238 0.931408 0.920548
Ada Boost 0.875556 0.935018 0.942238 0.931408 0.920548
Gradient Boosting 0.906667 0.938628 0.938628 0.942238 0.906849
SGB 0.880000 0.935018 0.945848 0.942238 0.931507
XGBoost 0.235556 0.884477 0.873646 0.884477 0.857534
Stacking 0.902655 0.927536 0.927536 0.927536 0.939891
Cat Boost 0.875556 0.942238 0.942238 0.942238 0.920548
SVM 0.875556 0.942238 0.942238 0.942238 0.926027
Accuracy (AVG) 0.815 0.930 0.931 0.928 0.914
The MC1 dataset, on the other hand, revealed that DT, KNN, and GB outperform the others when employing GA, with an average accuracy of 99.2%. The MC1 dataset is regarded as one of the largest NASA datasets, with a total of 9277 instances, and we were able to retain the same accuracy by applying the GA algorithm thanks to the broad data dispersion over the training and testing sets. The findings on the MC2 dataset revealed that six of the eleven classifiers, namely DT, KNN, GB, SGB, AdaBoost, and RF, had the best accuracy by comparison. Additionally, SMOTE and PSO yield the best outcomes, with an average accuracy of 99%.
Figs. 13 and 14 show the SHAP explanations of the DT, AdaBoost, RF, and SGB models on the MC2 dataset using SMOTE and PSO. These classifiers have the best outcomes on the KC3, MC1, and MC2 datasets. Explaining how a single observation adds to the prediction model is useful but insufficient on its own; we therefore employ the SHAP global force plot to demonstrate how a series of consecutive data samples contributes to the prediction model.
The plots show the contributions of 35 data samples. The x-axis in the global force plot reflects the number of data samples, while the y-axis shows the SHAP predicted values. In addition, we employ a bee swarm plot to emphasize the features with the highest discriminative power. For clarity, the most significant features across the classifiers are the call pairs, global data density, and essential density metrics.
The results of the classifiers using the meta-heuristic methods on the MW1, PC1, and PC2 datasets are shown in Tables 10-12. Overall, ACO delivers the best performance improvement when compared to the others. The findings on the MW1 dataset demonstrate that KNN, RF, AdaBoost, and CatBoost surpass the others using ACO, with an average accuracy of 91.5%. With only 264 total instances, the MW1 dataset is considered one of the smallest NASA datasets. We demonstrate an overall improvement in results using all search methods on the PC1 dataset; ACO has the best results when compared to the others, with an average accuracy of 95.3%, and KNN, GB, CatBoost, and stacking notably outperform the others. ACO also outperforms the others on the PC2 dataset, with an average testing accuracy of 94%. However, unlike on PC1 and MW1, XGBoost performs the worst on PC2, which is most likely due to overfitting; nonetheless, SGB surpasses the others. Fig. 15 shows the most effective features over the PC1 dataset utilizing KNN, DT, GB, RF, and CatBoost with ACO and SMOTE.
Tables 13-15 show the findings before and after using the meta-heuristics and SMOTE on the PC3, PC4, and PC5 datasets, respectively. The findings on the PC3 dataset revealed an overall improvement, particularly when applying GA, which yields an average testing accuracy of 93.1%; it is worth noting that KNN, SGB, and stacking exceed the others in terms of accuracy. PSO also improved accuracy, with an average of 93%. In comparison to the original, XGBoost accuracy improved significantly when utilizing PSO and HA, reaching 88.4%. ACO has the best results on the PC4 dataset, with an average testing accuracy of 93.8%, followed by HA with 93.1%. Also, when comparing the meta-heuristics, KNN, DT, SGB, and CatBoost outperform the others, while XGBoost and LR perform poorly. Even though the PC5 dataset is the largest NASA dataset, with 17,001 total instances, we found that using HA and SMOTE performs best, with an average accuracy of 98.4% compared to the original accuracy of 97.3%. ACO also showed a slight improvement, of 0.4 percentage points. Surprisingly, KNN outperforms the others when utilizing PSO, GA, and HA; with ACO, the KNN, DT, and GB performed best.
Fig. 16 shows the most effective features (software metrics) using KNN, AdaBoost, CatBoost, and DT with GA on the PC3 dataset. These classifiers have the best overall performance. The SHAP explanation plot reveals that the lines of code blank, parameters count, and Halstead content metrics have the highest discriminative power.
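The same ranking can be obtained numerically rather than visually by averaging absolute SHAP values per feature; a short sketch follows, reusing the shap_values array from the earlier sketch, with feature_names as an illustrative placeholder for the dataset's metric names.

# Rank features by mean absolute SHAP value (global importance).
# Assumes shap_values from the earlier sketch; feature_names is an
# illustrative placeholder for the dataset's metric names.
import numpy as np

feature_names = [f"metric_{i}" for i in range(shap_values.shape[1])]
mean_abs = np.abs(shap_values).mean(axis=0)
for i in np.argsort(mean_abs)[::-1][:5]:     # top five features
    print(f"{feature_names[i]}: {mean_abs[i]:.4f}")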
Fig. 17 shows the ROC-AUC performance of the classifiers across the NASA datasets. Notably, GB, SGB, AdaBoost, RF, DT, and Cat Boost have the best performance compared to the others, which is unsurprising given that these classifiers achieved the greatest overall accuracy scores. XGBoost and SVM also performed respectably. All classifiers performed best on the PC5, PC2, MC1, MW1, and PC4 datasets, each with an AUC higher than 90%.
In comparison to the others, SGB and Cat Boost performed the best, with SGB outperforming Cat Boost by a small margin.
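For reference, the ROC-AUC values behind such a comparison can be computed from predicted class probabilities, as in the brief sketch below, which again assumes the fitted model and test split from the earlier sketches.

# Minimal sketch: ROC-AUC from predicted probabilities.
# Assumes clf, X_test, and y_test from the earlier sketch.
from sklearn.metrics import roc_auc_score

proba = clf.predict_proba(X_test)[:, 1]      # P(defective)
print("ROC-AUC:", round(roc_auc_score(y_test, proba), 3))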
6. Conclusion
Software defects decrease software quality in terms of efficiency, testability, and maintainability, and they can ultimately determine project success or failure. Therefore, predicting software defects at the early stages of development enhances software quality, decreases development and maintenance costs, and keeps the project on track. In this paper, we proposed a new framework for predicting software defects. Eleven machine learning classifiers from the statistical, boosting, bagging, and stacking families were applied to twelve NASA datasets. We also investigated the use of four search-based algorithms (particle swarm optimization, genetic algorithm, harmony algorithm, and ant colony optimization) for feature selection. Because the NASA datasets have a large distribution gap between the minority and majority classes, we employed SMOTE as a resampling technique for data balancing. In addition, we used the SHAP library to explain the results of the models and highlight the most effective features. Testing accuracy and ROC-AUC were employed for model evaluation. The results revealed that the GB, SGB, DT, and Cat Boost classifiers have the best accuracy and performance compared to the others, and that the PC5 and MC1 datasets yielded the best results. When comparing the meta-heuristic algorithms, we found that ant colony optimization delivered the best improvement in results, followed by the harmony algorithm. Finally, using SHAP, we showed that the call pairs, Halstead content, parameters count, and lines of code comments metrics had the highest discriminative power. As a future direction, we intend to compare the meta-heuristic algorithms with filter methods for feature selection and to compare the performance of SMOTE with other resampling techniques.
CRediT authorship contribution statement
Yazan Al-Smadi: Conceptualization, Methodology, Software, Data
curation, Writing – original draft, Investigation, Validation, Writing –
review & editing. Mohammed Eshtay: Conceptualization, Methodol-
ogy, Software, Data curation, Writing – original draft, Investigation,
Validation, Writing – review & editing. Ahmad Al-Qerem: Conceptu-
alization, Methodology, Software, Data curation, Writing – original
draft, Investigation, Validation, Writing – review & editing. Shadi
Nashwan: Conceptualization, Methodology, Software, Data curation,
Writing – original draft, Investigation, Validation, Writing – review &
editing. Osama Ouda: Conceptualization, Methodology, Software, Data
curation, Writing – original draft, Investigation, Validation, Writing –
review & editing. A.A. Abd El-Aziz: Conceptualization, Methodology,
Software, Data curation, Writing – original draft, Investigation, Vali-
dation, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
[1] Jöckel L, et al. Towards a common testing terminology for software engineering and data science experts. In: International Conference on Product-Focused Software Process Improvement. Springer; 2021.
[2] Quyoum A, Dar M-U-D, Quadri SMK. Improving software reliability using software
engineering approach- a review. Int J Computer Applications 2010;10(5):41–7.
[3] Zhang D, Tsai JJP. Machine learning and software engineering. Software Quality Journal 2003;11(2):87–119.
[4] Graham D, Black R, Van Veenendaal E. Foundations of software testing: ISTQB certification. Cengage Learning; 2021.
[5] Asghar B, Awan I, Bhatti AM. Process Improvement through Reduction in Software
Defects using Six Sigma Methods. IEEE; 2018.
[6] Kessentini M, et al. Search-based design defects detection by example. In: International Conference on Fundamental Approaches to Software Engineering. Springer; 2011.
[7] Yadav V, Singh R, Yadav V. Estimation Model for enhanced predictive object point
metric in OO software size estimation using deep learning. IAJIT 2023;20(3).
[8] Helm JM, et al. Machine learning and artificial intelligence: definitions, applications, and future directions. Current Reviews in Musculoskeletal Medicine 2020;13(1):69–76.
[9] Özakıncı R, Tarhan A. Early software defect prediction: a systematic map and review. Journal of Systems and Software 2018;144:216–39.
[10] Pachouly J, et al. A systematic literature review on software defect prediction using artificial intelligence: datasets, data validation methods, approaches, and tools. Engineering Applications of Artificial Intelligence 2022;111:104773.
[11] Chen X, et al. Software defect number prediction: unsupervised vs supervised
methods. Information and Software Technology 2019;106:161–81.
[12] Xu Z, et al. Software defect prediction based on kernel PCA and weighted extreme
learning machine. Information and Software Technology 2019;106:182–200.
[13] Moshin Reza S, et al. Performance analysis of machine learning approaches in
software complexity prediction. Springer; 2021.
[14] Huang J, et al. An empirical analysis of data preprocessing for machine learning-
based software cost estimation. Information and Software Technology 2015;67:
108–27.
[15] Liang H, et al. Seml: a semantic LSTM model for software defect prediction. IEEE
Access 2019;7:83812–24.
[16] Wu Y, et al. LIMCR: less-informative majorities cleaning rule based on Naïve Bayes for imbalance learning in software defect prediction. Applied Sciences 2020;10(23):8324.
[17] Catolino G, Di Nucci D, Ferrucci F. Cross-project just-in-time bug prediction for
mobile apps: An empirical assessment. IEEE; 2019.
[18] Gao K. The use of under- and oversampling within ensemble feature selection and classification for software quality prediction. Int J Reliability Quality and Safety Engineering 2014;21(01):1450004.
[19] Malhotra R, Shakya A, Ranjan R, Banshi R. Software defect prediction using binary particle swarm optimization with binary cross entropy as the fitness function. J. Phys.: Conf. Ser. 2021;1767(1):012003.
[20] Alauthman M, Aldweesh A, Al-qerem A, Aburub F, Al-Smadi Y, Abaker AM, et al.
Tabular data generation to improve classication of liver disease diagnosis. Appl
Sci 2023;13(4):2678.
[21] Balogun AO, et al. Performance analysis of feature selection methods in software
defect prediction: a search method approach. Appl Sci 2019;9(13):2764.
[22] Anbu M, Anandha Mala GS. Feature selection using firefly algorithm in software defect prediction. Cluster Computing 2019;22(5):10925–34.
[23] Ayon SI. Neural network based software defect prediction using genetic algorithm
and particle swarm optimization. IEEE; 2019.
[24] Balogun AO, et al. Empirical analysis of rank aggregation-based multi-filter feature selection methods in software defect prediction. Electronics 2021;10(2):179.
[25] Ali U, Aftab S, Iqbal A, Nawaz Z, Salman Bashir M, Anwaar Saeed M. Software
defect prediction using variant based ensemble learning and feature selection
techniques. Int J Modern Education & Computer Science 2020;12(5):29–40.
[26] Alsaeedi A, Khan MZ. Software defect prediction using supervised machine learning and ensemble techniques: a comparative study. Journal of Software Engineering and Applications 2019;12(5):85–100.
[27] Balogun AO, et al. Software defect prediction using ensemble learning: an ANP
based evaluation method. FUOYE J Eng Tech 2018;3(2):50–5.
[28] Yu Q, et al. Process metrics for software defect prediction in object-oriented
programs. IET Software 2020;14(3):283–92.
[29] Ghosh S, Rana A, Kansal V. A nonlinear manifold detection based model for software defect prediction. Procedia Computer Science 2018;132:581–94.
[30] Ghosh S, et al. A benchmarking framework using nonlinear manifold detection
techniques for software defect prediction. Int J Comput Sci Eng 2020;21(4):
593–614.
[31] Lundberg SM, et al. From local explanations to global understanding with
explainable AI for trees. Nature machine intelligence 2020;2(1):56–67.
[32] Kaur H, Pannu HS, Malhi AK. A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Computing Surveys 2019;52(4):1–36.
[33] Fernández A, et al. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research 2018;61:863–905.
[34] Dai H-P, Chen D-D, Zheng Z-S. Effects of random values for particle swarm optimization algorithm. Algorithms 2018;11(2):23.
[35] Katoch S, et al. A review on genetic algorithm: past, present, and future.
Multimedia Tools and Applications 2021;80(5):8091–126.
[36] Ala'a A, et al. Comprehensive review of the development of the harmony search algorithm and its applications. IEEE Access 2019;7:14233–45.
[37] Abualigah L, Diabat A, Geem ZW. A comprehensive survey of the harmony search algorithm in clustering applications. Applied Sciences 2020;10(11):3827.
[38] Dorigo M, Birattari M, Stützle T. Ant colony optimization. IEEE Computational Intelligence Magazine 2006;1(4):28–39.
[39] Lyridis DV. An improved ant colony optimization algorithm for unmanned surface vehicle local path planning with multi-modality constraints. Ocean Engineering 2021;241:109890.
[40] Wang Q, et al. Overview of logistic regression model analysis and application. Chin
J Prev Med 2019;53(9):955–60.
[41] Abu Alfeilat HA, et al. Effects of distance measure choice on k-nearest neighbor classifier performance: a review. Big Data 2019;7(4):221–48.
[42] Patel HH, Prajapati P. Study and analysis of decision tree based classification algorithms. International Journal of Computer Sciences and Engineering 2018;6(10):74–8.
[43] Buskirk TD. Surveying the forests and sampling the trees: an overview of classification and regression trees and random forests with applications in survey research. Survey Practice 2018;11(1):1–13.
[44] Pisner DA, Schnyer DM. Support vector machine. In: Machine learning. Academic
Press Elsevier; 2020. p. 101–21. https://doi.org/10.1016/B978-0-12-815739-
8.00006-7.
[45] Ferreira AJ, Figueiredo MA. Boosting algorithms: a review of methods, theory, and applications. In: Ensemble machine learning: methods and applications; 2012. p. 35–85.
[46] Natekin A, Knoll A. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics 2013;7:21.
[47] Friedman JH. Stochastic gradient boosting. Computational Statistics & Data Analysis 2002;38(4):367–78.
[48] Shin Y. Application of stochastic gradient boosting approach to early prediction of safety accidents at construction site. Advances in Civil Engineering 2019;2019.
[49] Godinho S, Guiomar N, Gil A. Estimating tree canopy cover percentage in a
mediterranean silvopastoral systems using Sentinel-2A imagery and the stochastic
gradient boosting algorithm. Int J Remote Sensing 2018;39(14):4640–62.
[50] Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.
[51] Babajide Mustapha I, Saeed F. Bioactive molecule prediction using extreme gradient boosting. Molecules 2016;21(8):983.
[52] Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363; 2018. https://doi.org/10.48550/arXiv.1810.11363.
[53] Hong J. An application of XGBoost, LightGBM, CatBoost algorithms on house price appraisal system. Housing Finance Research 2020;4:33–64. https://doi.org/10.52344/hfr.2020.4.0.33.
[54] Alauthman M, Al-qerem A, Sowan B, Alsarhan A, Eshtay M, Aldweesh A, et al.
Enhancing small medical dataset classication performance using GAN.
Informatics 2023;10(1):28.
[55] Handelman GS, et al. Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. Am J Roentgenol 2019;212(1):38–43.
[56] Al-qerem A, Al-Naymat G, Alhasan M, Al-Debei M. Default prediction model: the significant role of data engineering in the quality of outcomes. Int Arab J Inf Technol 2020;17(4A):635–44.
[57] Alibrahim H, Ludwig SA. Hyperparameter optimization: comparing genetic algorithm against grid search and Bayesian optimization. In: 2021 IEEE Congress on Evolutionary Computation (CEC). IEEE; 2021.