
Software Risk Prediction at Requirement and Design Phase : An Ensemble Machine Learning Approach

Authors:
Yibeltal Assefa
Software Engineering
Kombolcha Institute of Technology
Wollo University
yibebdu21@gmail.com
Esubalew Alemneh
ICT4D Research Center
Bahir Dar Institute of Technology
Bahir Dar University
esubalew.alemneh@bdu.edu.et
Shegaw Nibret
ICT4D Research Center
Bahir Dar Institute of Technology
Bahir Dar University
ze.nibret12@gmail.com
Abebaw Worku
Software Engineering
College of Natural and Computational Science
Mekdela University
abebawworku10@gmail.com
Abstract—Software development is a highly structured process
that involves the creation and maintenance of a particular system,
ranging from simple applications to complex enterprise software.
Despite following a well-defined process, unforeseen events can
occur at any stage of the SDLC that may impact the software
development process, leading to losses or failures in software
development. Software projects inherently involve risks, and no
software development project is immune to these risks. Identify-
ing and predicting such risks accurately is a challenge in software
project development. To address this challenge, this study aims
to develop a software risk prediction model using homogeneous
ensemble machine learning algorithms. These algorithms were
selected due to their proven effectiveness in handling complex
datasets and their ability to achieve high prediction accuracy.
We used an experimental research methodology to develop a software risk prediction model. The methodology involved collecting requirement- and design-phase risk datasets from publicly available repositories such as Zenodo and the Harvard education dataset. These datasets were then used to train and validate the machine learning algorithms. Our study achieved prediction accuracies of 98.67%, 97.3%, 96.0%, and 96.0% for Gradient Boosting, Random Forest, AdaBoost, and Bagging, respectively, each using decision trees as the homogeneous base learner. Using these four homogeneous ensemble machine learning algorithms, we developed software risk predictive models. Ultimately, Gradient Boosting was selected to construct our risk predictive model due to its superior performance and ability to handle complex data.
By employing this model, software development organizations
can improve their ability to identify and mitigate risks, thereby
improving the quality and reliability of their software products.
Index Terms—ensemble machine learning algorithms, require-
ments phase, design phase, software risk prediction.
I. INTRODUCTION
The process of software development is a systematic method
to develop software. It involves the development and main-
tenance of software [17]. There is always the possibility of
unexpected events occurring during the Software Development
Life Cycle (SDLC) that may result in loss or failure in software
projects [1]. Incompleteness and omissions of requirements
caused many software hazards as software risks could be
generated due to incomplete and unclear requirements [2]. The
risks stem from various risk factors that arise in an assortment of activities across the SDLC [3].
A risk is an uncertain event occurring in the SDLC process that can lead to potential losses of software in most organizations. It increases most software
projects’ failure rate [5]. Therefore, risk assessment early
in the life cycle of software development is very important
[4]. Software risks arise as a consequence of various factors
such as inadequate resources, limited skills, and insufficient
information. Risk management focuses on the identification of
risks and appropriate treatment of them. Projects have individ-
ual risks or overall risks. Risk management methods specify
search procedures for information gathering, organization and
interpretation to simplify complex decisions under conditions
of bounded rationality [6].
Therefore, developing automation systems using machine
learning is highly essential to support risk management man-
agers. Machine learning studies how to automatically learn to
make accurate predictions, classification and clustering based
on past observations. There are different types of Machine
Learning (ML) techniques [7]. There is no software devel-
opment project which is free from risks. There are some
software products that fail due to the negligence of addressing
risks at the early stage of development. Previous research has
explored the domain of software product risk prediction but
the accuracy of previous findings was low, leaving potential for improvement. One approach to enhance the accuracy of risk prediction results is to use artificial intelligence algorithms. Ensemble learning is one such approach: it combines the strengths of multiple machine learning algorithms to improve overall predictive performance. Therefore, there is still room for improvement in the risk prediction of software products using advanced ensemble machine learning techniques.
979-8-3503-2848-6/23/$31.00 ©2023 IEEE
Khalid [9] conducted a study on predicting risk through
artificial intelligence using machine learning algorithms, fo-
cusing on nonfinancial firms in Pakistan. The study utilized
various techniques, including random forest, decision tree,
naive Bayes, and KNN, to assess and predict risks based on
a dataset from the finance sector. However, this research did
not consider the risks associated with software development.
The study aimed to explore the possibility of predicting risks
by applying machine learning algorithms, but its scope was
limited to the finance sector only. The study could not provide
insights into software development risk prediction, which is
essential in the current software-driven business environment.
Therefore, our research aims to address this gap by examining
the performance of various machine learning algorithms to
predict software development risks.
The scholarly article authored by Otoom [3] focuses on
developing an ensemble model for predicting risks in software
requirements. The article proposes an ensemble classifier that
combines AdaBoostM1 and J48 algorithms, called the ABMJ
model. Although this model is shown to be effective in predict-
ing software risks associated with requirements, it overlooks
the potential risks that may arise during the design stage of
software development. Hence, there exists a gap in the research
on software risk prediction that aims to address the project risk
that can emerge from the design stage. This research aims
to fill this gap by proposing a more comprehensive approach
that considers the risks associated with both requirement and
design stages of software development.
Filippetto [10] created a model for predicting risks in
software project management based on similarity analysis of
context histories. However, the study’s case study approach
had limitations, as it only examined a small number of cases
and did not employ automated prediction techniques. The
model relied on manual analysis of historical cases, with the
evaluation mechanism heavily reliant on expert evaluation.
This approach was limited by its dependence on individual
expertise and the lack of generalizability of the results to other
cases. Also, the study did not explain the evaluation process
used, making it difficult to assess the validity and reliability
of the results. Therefore, a more systematic and automated
approach to risk prediction in software project management is
necessary.
Another study, referenced as [11], focuses on predicting
the risk percentage in software projects by training machine
learning classifiers. The study utilizes various machine learn-
ing algorithms, including SVM, k-NN, RF, and ANN-MLP,
and trains them using the collected data. However, the dataset
used in the study is obtained through a questionnaire, which
may not be entirely reliable since it depends on the user’s
expertise. Moreover, the accuracy of the results obtained using
this dataset is low and requires improvement
In addition to this, previous researches [5], [8], [9], [10] are limited in specifying the stages of risk occurrence during dataset collection. These studies primarily focus on risks that arise during the requirement stage, neglecting the risks associated with the design stage. Therefore, it is crucial to expand the scope of the dataset to include risks occurring at both the requirement and design stages.
II. RELATED WORK
This section aims to provide an extensive review of relevant literature on software risk prediction. One specific objective of this review is to identify gaps in the context of predicting software project risk at the early stages of development. To achieve this goal, we have reviewed previous research and literature on software risk prediction. We have explored a wide range of sources, including academic journals, conference proceedings, and books, to gather the most relevant and up-to-date information on the topic.
Several research studies have been conducted in the field of risk prediction for software products. However, certain gaps exist in the literature that require attention. The following scholarly articles are the most relevant and recent publications that have been reviewed in detail.
Akumba's study [5] aimed to predict software risk during the requirement stage. However, their research did not consider other critical software development stages such as the design stage. Consequently, the risk assessment they provided was incomplete, leaving potential risks that could harm the software project unaddressed. In their study, Akumba employed the Naïve Bayes machine learning algorithm but did not explore the use of other predictive algorithms that may yield better results. Furthermore, the authors did not explain the rationale behind their choice of the Naïve Bayes algorithm. This approach has a weakness when a class label and a specific attribute value never occur together, leading to a frequency-based probability estimate of zero. When all the probabilities are multiplied, this problem could significantly impact the predictive model. Our research aims to address these gaps by experimenting with other predictive algorithms to improve the accuracy of the predictive model. We will use datasets from both the requirement and design stages of software development to identify risks at an early stage.
The research paper by [8] delved into risk prediction applied to global software development, utilizing various machine learning methods including logistic regression, decision tree (DT), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes. The findings showed an accuracy of 89%, which, although impressive, could still be further improved. It is noteworthy that the paper focused on global software development, a topic that poses unique challenges and risks requiring careful consideration. Despite applying various machine learning algorithms, the paper did not explore ensemble methods such as bagging, boosting, or stacking that could enhance the accuracy of the resulting risk prediction models.
Dillibabu's paper [12] presents an innovative approach for software risk prediction, which combines fuzzy-based TOPSIS, ANFIS MCDM, and fuzzy decision-making trial and evaluation laboratory methods. The study utilizes the NASA 93 dataset, which includes values from 93 software projects. The results of the research show that the proposed approach has high accuracy in predicting software risk factors. However, the paper notes that there is still room for accuracy improvement. Therefore, the use of machine learning approaches could potentially enhance accuracy and improve the decision-making process for predicting software risk factors. Overall, this integrated approach provides a promising framework for software risk prediction and can be further developed in future studies.
Iftikhar [13] conducted a study focused on Risk Prediction
in Global Software Development using Artificial Neural Net-
work (ANN). The research employed various ANN techniques,
namely Levenberg–Marquardt, Bayesian Regularization, and
Scaled Conjugate Gradient, to predict risks. The accuracy of
the model was measured using Mean Squared Error (MSE),
with a resulting value of 2.157 MSE. However, this value
indicates a low level of accuracy, highlighting the need for
improvement. To enhance the model’s performance, it is
recommended to expand the sample data set by including data
from multiple companies. Additionally, employing random
data collection methods would aid in generalizing the model’s
predictions.
In general, while conducting a review of the related liter-
ature, it is apparent that there exists a discernible gap that
must be addressed. Specifically, this pertains to the data set
and the techniques that are utilized to improve the accuracy
of predictions, thereby mitigating the risks associated with
various projects and software products. The need to develop
a more precise approach to prediction is of paramount impor-
tance, and it is one that cannot be overstated. Failure to do so
may result in a wide range of potential risks, which may have
adverse implications for the project’s success and the resulting
software products. As such, it is imperative that researchers
and practitioners alike focus their efforts on identifying and
implementing more advanced techniques that can help to fill
this critical gap in the existing literature.
III. RESEARCH METHODOLOGY
A. Research Design
The study utilizes an experimental research design, em-
ploying a scientific approach to investigate the relationship
between variables. Two sets of variables are utilized, with
the first serving as a constant to measure the variances in
the second variable [13]. In our particular case, the dependent
variable is the software risk level. The reason behind choosing
the experimental research design is to explore potential cause-
and-effect relationships and gain a clear understanding of how
certain variables impact others. This approach allows us to
assess the effects of various factors on the software risk level
and determine causal connections between them.
B. Proposed architecture
The architecture of a system provides a high-level view
of its key components, their relationships, and the way they
interact with one another [15]. The proposed architecture of our work is shown in Figure 1 below.
Fig. 1. Proposed architecture of our predictive model.
1) Software risk dataset: We collected our dataset from software repositories such as Zenodo and the Harvard education dataset. From Zenodo we obtained requirement risk datasets containing risk data belonging to the requirement stage (functional, non-functional, and domain requirements) and the design stage (database design and user interface design). This dataset has 400 instances and 14 attributes; we also added 100 instances collected from the Harvard education dataset, which helps train our model further.
2) Data preprocessing: The data preprocessing phase in research projects is often overlooked and underestimated by researchers [16]. It is a critical step that involves examining the quality of the data before developing a model. We performed operations such as data cleaning, data integration, data transformation, handling data imbalance, and feature engineering to preprocess our dataset.
3) Data cleaning: One common challenge in data cleaning is handling data that contains missing values, as these missing values can impact the quality and accuracy of the data analysis. To address this, various methods can be employed during the data cleaning process. One approach is to avoid including irrelevant data or attributes in the analysis, as these can introduce noise and inconsistencies into the results. In our case, since our dataset is small and has only a few missing
values, we manually fill in those missing values. It is worth
noting that our dataset is now free from any missing values.
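As a sketch of this manual gap-filling step in code form: the snippet below fills the few numeric gaps with a column median and categorical gaps with the most frequent value. The column names are invented for illustration and are not the study's actual schema.

```python
import pandas as pd

# Toy frame standing in for the risk dataset; column names are assumptions.
df = pd.DataFrame({
    "risk_probability": [0.7, None, 0.4, 0.9],
    "risk_category": ["requirement", "design", None, "design"],
})

# Numeric gaps get the column median, categorical gaps the most frequent value.
df["risk_probability"] = df["risk_probability"].fillna(df["risk_probability"].median())
df["risk_category"] = df["risk_category"].fillna(df["risk_category"].mode()[0])
```

With only a handful of gaps, as in our dataset, such imputation leaves the overall distribution essentially unchanged.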
4) Outliers Handling: We detected outliers using a boxplot and addressed them using Interquartile Range (IQR) scores. The general guideline is that any data point falling outside the range (Q1 − 1.5 × IQR) to (Q3 + 1.5 × IQR) is considered an outlier. Outliers appear as values that lie far from the main box, as depicted in Figure 2 below.
Fig. 2. The dataset with outliers, represented in a box plot.
To eliminate the outliers, we employed the IQR method. This technique involves computing the first quartile (Q1), the third quartile (Q3), and the IQR for each numeric column using the quantile() function. The cleaned dataset is shown in Figure 3 below.
Fig. 3. The dataset without outliers, represented in a box plot.
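The IQR filtering described above can be sketched as follows; this is a minimal pandas version assuming a numeric DataFrame, not the authors' exact code.

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose value in any numeric column falls outside
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for that column."""
    cleaned = df.copy()
    for col in cleaned.select_dtypes(include="number").columns:
        q1 = cleaned[col].quantile(0.25)   # first quartile
        q3 = cleaned[col].quantile(0.75)   # third quartile
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        cleaned = cleaned[cleaned[col].between(lower, upper)]
    return cleaned
```

For example, a column of mostly small values with one extreme entry loses only the row containing that extreme value.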
5) Data transformation: In this study, we employ data transformation techniques to mathematically alter the values of variables. As part of data transformation, we apply data normalization and standardization to tackle modeling challenges and enhance the effectiveness of our analyses. These techniques involve modifying the
scale and distribution of data to ensure compatibility and
comparability across different attributes.
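As a minimal sketch (assuming NumPy arrays with one attribute per column), the two transformations mentioned above can be written as:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    # Rescale each column to the [0, 1] range.
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def standardize(x: np.ndarray) -> np.ndarray:
    # Shift each column to zero mean and unit variance (z-score).
    return (x - x.mean(axis=0)) / x.std(axis=0)
```

Normalization bounds every attribute to a common range, while standardization centers each attribute, making attributes with very different scales comparable.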
IV. EXPERIMENTS
This section entails various essential tasks, including
data description, development of predictive models, evaluation
of predictive models, and validating the research question
responses.
A. Data Description
This research study was conducted on a dataset comprising
400 instances to make predictions about software risk. After
preprocessing the data, 14 features were considered for this
study.
The software risk levels were categorized into five distinct
classes. Out of the total 400 instances, there were 175 instances
classified as risk level 2, 95 instances classified as risk level 3,
60 instances classified as risk level 4, 50 instances classified
as risk level 1, and 15 instances classified as risk level 5. It is
important to note that these classifications were made prior to
addressing the issue of imbalanced data. The comprehensive
results of the study can be observed in Figure 4 below.
Fig. 4. The distribution of risk in our dataset.
From this figure, we can see that risks belonging to the second severity level are the most numerous in our dataset, followed by the third, fourth, first, and fifth severity levels.
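Since the class counts above are imbalanced (175/95/60/50/15), one simple way to rebalance before training is random oversampling. The sketch below uses scikit-learn's `resample` on a hypothetical `risk_level` column; it illustrates the general technique, not the authors' exact rebalancing procedure.

```python
import pandas as pd
from sklearn.utils import resample

def oversample(df: pd.DataFrame, label: str = "risk_level") -> pd.DataFrame:
    """Resample every class (with replacement) up to the majority class count."""
    majority = df[label].value_counts().max()
    parts = [
        resample(group, replace=True, n_samples=majority, random_state=42)
        for _, group in df.groupby(label)
    ]
    return pd.concat(parts, ignore_index=True)
```

After this step every risk level contributes the same number of instances, so the classifiers are not biased toward the dominant level-2 class.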
B. Predictive Model Development for Software Risk
Upon completing the necessary data pre-processing steps,
we proceeded to employ various suitable homogeneous en-
semble machine learning techniques to construct a predictive
model for software project risk. These techniques included
Adaboost with a decision tree, Gradient Boosting with a
decision tree, Bagging with a decision tree, and Bagging with
random forest.
In our study, we employed the following homogeneous
ensemble machine learning methods:
1) Adaboost: This method combines multiple weak learners, specifically decision trees, to iteratively improve the model's predictive accuracy by focusing on misclassified instances.
2) Gradient Boosting: Gradient Boosting also utilizes de-
cision trees as weak learners. However, it differs in the way
it assigns weights to the misclassified instances, aiming to
minimize the overall loss function.
3) Bagging with Decision Trees: Bagging, short for Bootstrap Aggregation, involves creating multiple subsets of the
training data through bootstrap sampling. Each subset is used
to train a separate decision tree, and the final prediction is
obtained through aggregation.
4) Random Forest with Decision Tree: Random Forest
is an extension of Bagging that further enhances performance
by introducing randomness in the feature selection process for
each decision tree. This randomness helps to reduce overfitting
and improve generalization.
C. Dataset splitting
To evaluate the performance of our homogeneous ensemble
machine learning algorithms on unseen data, we utilized the
train-test split technique. In this research work we used 25 percent of the total dataset (a 0.25 test split), which yielded 100 instances for testing our model. By averaging the performance metrics across multiple iterations with different random splits, we obtain a more reliable assessment of our models' performance.
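The 0.25 split described above corresponds to the following call; the synthetic data merely stands in for the 400-instance, 14-feature risk dataset and is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 instances, 14 features, 5 risk levels.
X, y = make_classification(
    n_samples=400, n_features=14, n_informative=8, n_classes=5, random_state=42
)

# Hold out 25% of the data for testing, preserving the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

Repeating the split with different `random_state` values and averaging the resulting metrics gives the more reliable assessment mentioned above.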
D. Experimental Setup
To create a prediction model for software risk, a few experiments were conducted to identify the best classification
model and extract sample-relevant rules. Four experiments
were conducted using ensemble machine learning methods
along with 14 selected features deemed risk factors. The
goal was to utilize the most appropriate ensemble machine
learning methods and relevant features to construct an effective
prediction model for software risk.
1) Random Forest with Decision Tree: The aim of our
experiment was to create a predictive model for software
project risk using the Random Forest algorithm and decision
tree. Our results indicate that Random Forest outperformed
other algorithms in terms of accuracy, achieving an impressive
97.3%.
2) Gradient boosting with Decision Tree: In this study,
our objective was to build a predictive model for assessing
the risk of software projects. The confusion matrix generated
by gradient boosting is indicated in the following figure.
Specifically, when evaluating the accuracy metric, gradient
boosting achieved an impressive accuracy rate of 98.67%.
3) AdaBoosting with Decision Tree: In this research study,
our aim was to develop a predictive model to assess the
risk associated with software projects using the AdaBoost-
ing machine learning algorithm. The figure below indicates
our confusion matrix representation. We achieved a 96.0%
accuracy rate using AdaBoosting, indicating that our model
can effectively predict the likelihood of software project risks.
Generally, the results are summarized in Table I below.
TABLE I
EXPERIMENTAL RESULT OF ALL ALGORITHMS

Evaluation metric | Random Forest | AdaBoosting | Gradient Boosting | Bagging
Accuracy          | 97.3%         | 96.0%       | 98.67%            | 96.0%
Precision         | 97.8%         | 96.6%       | 99.1%             | 96.4%
Recall            | 97.3%         | 96.0%       | 98.67%            | 96.0%
F1 Score          | 97.34%        | 96.06%      | 98.8%             | 96.02%
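The metrics in the table can be reproduced from a model's predictions with scikit-learn; the toy labels below are stand-ins for real model output, and weighted averaging is an assumption about how the multi-class scores were aggregated.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy ground truth and predictions over the five risk levels.
y_true = [1, 2, 2, 3, 4, 5, 2, 3]
y_pred = [1, 2, 2, 3, 4, 5, 2, 2]

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
    "Recall": recall_score(y_true, y_pred, average="weighted"),
    "F1 Score": f1_score(y_true, y_pred, average="weighted"),
}
```

With weighted averaging, recall coincides with accuracy, which matches the identical Accuracy and Recall rows in the table.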
E. Bagging Algorithm with Decision Tree
The aim of this study was to create a predictive model
for evaluating the risk associated with software projects using
the Bagging ensemble algorithm. The results showed that the
Bagging algorithm performed well in terms of accuracy.
F. Discussion of the Result
The predictive model for software project risk was constructed using 400 instances, as mentioned earlier. This study aimed to identify the factors that hold significant influence over the risk levels in software projects. Using a heat map, we identified that the priority and probability of risk consistently emerged as the predominant features among the targeted attributes. We conducted four experiments using homogeneous ensemble machine learning methods: Gradient Boosting, AdaBoost, Bagging, and Random Forest. These experiments were designed to determine the most effective approach for predicting software project risks. The results revealed the following accuracies: 98.67% for Gradient Boosting, 97.3% for Random Forest, and 96.0% for both Bagging and AdaBoost. Based on these findings, it can be concluded that Gradient Boosting is the most appropriate homogeneous ensemble machine learning algorithm for developing a predictive model capable of accurately predicting software risk levels from a software risk dataset. These results provide valuable insights into the selection of ensemble machine learning methods for predicting software project risks. The
overall result is shown in Figure 5 below.
Fig. 5. Result representation using a bar graph for different algorithms.
Generally, the findings of these experiments revealed that the gradient
boosting with decision tree homogeneous ensemble machine
learning algorithm achieved an exceptional accuracy rate of
98.67%. This level of accuracy in software risk prediction
surpasses the results reported in previous related work. By
selecting the gradient boosting algorithm as the foundation
for the predictive model, the study ensures a high level of
accuracy in predicting software project risks.
V. C ONCLUSION
The process of software development is a systematic method
to develop software. It involves the development and main-
tenance of software. There is always the possibility of unex-
pected events occurring during the Software Development Life
Cycle (SDLC) that may result in loss or failure in software
development. Because of the nature of risks and software projects, there are no tools to identify risks and predict risk severity at an early stage of development. The purpose of
this study is to support risk management and managers in the
automation system required by developing predictive models
using ensembled machine learning. We used homogenous
ensembled machine learning algorithms to predict software
project risk level at requirement and design stage.
We used four different homogeneous ensemble machine learning algorithms to develop our software risk predictive model: Random Forest, AdaBoost, Gradient Boosting, and Bagging, which achieved accuracies of 97.3%, 96.0%, 98.67%, and 96.0%, respectively. In addition, we used a software risk dataset from software repositories such as Zenodo to train our algorithms. The dataset contains 14 attributes and 400 instances. Based on the accuracy results, the Gradient Boosting algorithm was selected for developing the final predictive model.
This study used an experimental research methodology to compare different algorithms and select the best-performing one for building our predictive model. Accordingly, we performed four experiments with homogeneous ensemble machine learning algorithms and finally selected the Gradient Boosting algorithm, which scored 98.67% predictive performance.
REFERENCES
[1] Mahmud, Mahmudul Hoque. ”Software Risk Prediction: Systematic
Literature Review on Machine Learning Techniques.” Applied Sciences
12.22 (2022): 11694.
[2] Bhukya, Shankar Nayak, and Suresh Pabboju. ”Software engineering:
risk features in requirement engineering.” Cluster Computing 22 (2019):
14789-14801.
[3] Qureshi, Muhammad Shahroz Gul, Bilal Khan, and Muhammad Arshad.
”ML-Based Model for Risk Prediction in Software Requirements.”
International Journal of Technology Diffusion (IJTD) 13.1 (2022): 1-
17.
[4] Bhukya, Shankar Nayak, and Suresh Pabboju. ”Software engineering:
risk features in requirement engineering.” Cluster Computing 22 (2019):
14789-14801.
[5] Akumba, Beatrice O. ”A Predictive Risk Model for Software Projects’
Requirement Gathering Phase.” International Journal of Innovative Sci-
ence and Research Technology 5 (2020): 231-236.
[6] Sarigiannidis, Lazaros, and Prodromos D. Chatzoglou. "Software development project risk management: A new conceptual framework." Journal of Software Engineering and Applications 4.05 (2011): 293.
[7] Dasgupta, Ariruna, and Asoke Nath. ”Classification of machine learning
algorithms.” International Journal of Innovative Research in Advanced
Engineering (IJIRAE) 3.3 (2016): 6-11.
[8] "Risk Prediction using Machine Learning Techniques in the Domain of Global Software Development: A Review." 5.1 (2023): 7-15.
[9] Khalid, Shamsa. ”Predicting Risk through Artificial Intelligence Based
on Machine Learning Algorithms: A Case of Pakistani Nonfinancial
Firms.” Complexity 2022 (2022).
[10] Filippetto, Alexsandro Souza, Robson Lima, and Jorge Luis Victória Barbosa. "A risk prediction model for software project management based on similarity analysis of context histories." Information and Software Technology 131 (2021): 106497.
[11] Gouthaman, P., and Suresh Sankaranarayanan. "Prediction of Risk Percentage in Software Projects by Training Machine Learning Classifiers." Computers & Electrical Engineering 94 (2021): 107362.
[12] Suresh, K., and R. Dillibabu. ”An integrated approach using IF-TOPSIS,
fuzzy DEMATEL, and enhanced CSA optimized ANFIS for software
risk prediction.” Knowledge and Information Systems 63.7 (2021): 1909-
1934.
[13] Iftikhar, Asim. "Risk prediction by using artificial neural network in global software development." Computational Intelligence and Neuroscience 2021 (2021).
[14] Sweet, S. A., and K. A. Grace-Martin. "Modeling relationships of multiple variables with linear regression." Data Analysis with SPSS: A First Course in Applied Statistics (2012): 161-188.
[15] Eeles, Peter. "What is a software architecture?" IBM. Retrieved March 21 (2006): 2007.
[16] Ampomah, Ernest Kwame, Zhiguang Qin, and Gabriel Nyame. "Evaluation of tree-based ensemble machine learning models in predicting stock price direction of movement." Information 11.6 (2020): 332.
[17] Assefa, Yibeltal, et al. ”Software Effort Estimation using Machine
learning Algorithm.” 2022 International Conference on Information and
Communication Technology for Development for Africa (ICT4DA).
IEEE, 2022.
... Risks can be categorized using different criteria, including their origin, type, or impact on the project [6]. Usually, this classification of risks is done manually, a practice that the personal judgment of the risk analyst or the project manager might influence [7]. That process can introduce biases and variability in risk management, leading to inconsistent risk evaluations and unpreparedness for critical issues. ...
Article
Full-text available
Software requirements are the most critical phase focused on documenting, eliciting, and maintaining the stakeholders' requirements. Risk identification and analysis are preemptive actions designed to anticipate and prepare for potential issues. Usually, this classification of risks is done manually, a practice that the personal judgment of the risk analyst or the project manager might influence. Machine learning (ML) techniques were proposed to predict the risk level in software requirements. The techniques used were logistic regression (LR), multilayer perceptron (MLP) neural network, support vector machine (SVM), decision tree (DT), naive bayes, and random forest (RF). Each model was trained and tested using cross-validation with k-folds, each with its respective parameters, to provide optimal results. Finally, they were compared based on precision, accuracy, and recall metrics. Statistical tests were performed to determine if there were significant differences between the different ML techniques used to classify risks. The results concluded that the DT and RF are the techniques that best predict the risk level in software requirements.
Article
Full-text available
Software risk prediction is the most sensitive and crucial activity of the SDLC. It may lead to the success or failure of the project. The requirement gathering stage is the most important and challenging stage of the SDLC. The risks should be tackled at this stage and saved to be used in future projects. However, a model is proposed for the prediction of software requirement risks using the requirement risk dataset and ML classification. This research study proposed a model for risk prediction in software requirements that will be evaluated using several evaluation measures (e.g., precision, F-measure, MCC, recall, and accuracy). For the completion of this study, the dataset is taken from Zenodo repository. The model is evaluated using ML techniques. After the finding and analysis of results, DT shows best performance with accuracy of 99%.
Article
Full-text available
The Software Development Life Cycle (SDLC) includes the phases used to develop software. During the phases of the SDLC, unexpected risks might arise due to a lack of knowledge, control, and time. The consequences are severe if the risks are not addressed in the early phases of the SDLC. This study aims to conduct a Systematic Literature Review (SLR) and acquire concise knowledge of Software Risk Prediction (SRP) from scientific articles published from 2007 to 2022. Furthermore, we conducted a qualitative analysis of published articles on SRP. Some of the key findings include: (1) 16 articles are examined in this SLR to represent the outline of SRP; (2) Machine Learning (ML)-based detection models were highly efficient and significant in terms of performance; (3) very few studies received excellent scores in the quality analysis. As part of this SLR, we summarized and consolidated previously published SRP studies to identify the practices from prior research. This SLR will pave the way for further research in SRP and guide both researchers and practitioners.
Article
Full-text available
AI (artificial intelligence) is a significant technological advancement that has everyone buzzing about its incredible potential. The current research study evaluates the influence of supervised artificial intelligence techniques, i.e., machine learning techniques, on the nonfinancial firms of Pakistan and focuses on the practical application of AI techniques for the accurate prediction of corporate risks, which in turn will lead to the automation of corporate risk management. So, in this study, we used financial ratios for accurate risk assessment and for the automation of corporate risk management by developing machine learning algorithms using the random forest, decision tree, naïve Bayes, and KNN techniques. A secondary data collection technique was used. For this purpose, we collected annual data of nonfinancial companies in Pakistan for the period from 2006 to 2020, and the data were analyzed and tested using Python. Our results show that AI techniques can accurately predict risk with minimal error, and among all the techniques used, random forest outperforms the rest.
Article
Full-text available
The term risk is defined as the potential future harm that may arise due to some present action. Risk management in software engineering addresses the various future harms that could affect the software due to minor or unnoticed mistakes in a software development project or process. Several types of risk analysis can be used. Basically, risk analysis identifies the high-risk elements of a project in software engineering; it also provides ways of detailing the impact of risk mitigation strategies. Risk analysis has also been found to be most important in the software design phase, to evaluate the criticality of the system. The main purpose of risk analysis is to understand the risks better and to verify and correct the attributes. A successful risk analysis includes important elements such as problem definition, problem formulation, and data collection. Some of the requirement risks are poor definition of requirements, incomplete requirements, and lack of testing. The likelihood of events affecting a goal can be evaluated from evidence of the goal's satisfaction or denial, which can be obtained through the Tropos goal model. The original Tropos model is modified to meet the risk assessment requirements of requirements engineering. An event is considered a risk based on its likelihood value. Relations are defined between multiple goals and events, which identify the necessity of a particular goal. In order to analyze the risk in achieving particular goals, a set of candidate solutions is generated, which can then be evaluated based on the risk affinitive value. Three risk parameters are used to compute the risk affinitive value: (1) low, (2) medium, and (3) high. The risk parameters and cost analysis clearly evaluate the affinity of an event to a particular set of goals.
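The mapping from an event's likelihood to the three risk parameters (low / medium / high) can be illustrated with a minimal sketch. The thresholds below are assumptions chosen for illustration; they are not values taken from the Tropos-based model itself.

```python
# Illustrative sketch: classify a goal-threatening event's likelihood
# into the three risk parameters mentioned above (thresholds assumed).
def risk_parameter(likelihood: float) -> str:
    """Map a likelihood in [0, 1] to 'low', 'medium', or 'high'."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be in [0, 1]")
    if likelihood < 0.33:
        return "low"
    if likelihood < 0.66:
        return "medium"
    return "high"

print(risk_parameter(0.2))   # low
print(risk_parameter(0.5))   # medium
print(risk_parameter(0.9))   # high
```

In the paper's model, the likelihood itself would come from the satisfaction/denial evidence propagated through the goal graph, and the final affinity would also factor in cost.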
Article
Full-text available
The demand for global software development is growing. The nonavailability of software experts in one place or country is the reason for the increase in the scope of global software development. Software developers located in different parts of the world, with the diversified skills necessary for the successful completion of a project, play a critical role in the field of software development. Using the skills and expertise of software developers around the world, one could get any component developed or any IT-related issue resolved. The best software skills and tools are dispersed across the globe, but integrating these skills and tools and making them work to solve real-world problems is a challenging task. The discipline of risk management offers alternative strategies to manage the risks that software experts face in today's competitive world. This research is an effort to predict the risks related to time, cost, and resources that are faced by distributed teams in a global software development environment. To examine the relative effect of these factors, neural network approaches, namely Levenberg-Marquardt, Bayesian Regularization, and Scaled Conjugate Gradient, have been implemented to predict the responses of risks related to project time, cost, and resources involved in global software development. A comparative analysis of these three algorithms is also performed to determine which achieves the highest accuracy. The findings of this study showed that Bayesian Regularization performed very well in terms of the MSE (validation) criterion compared with the Levenberg-Marquardt and Scaled Conjugate Gradient approaches.
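The comparison criterion used above, validation MSE across several network training algorithms, can be sketched as follows. The three named algorithms are MATLAB Neural Network Toolbox training functions with no direct scikit-learn equivalents, so this sketch uses `MLPRegressor` solvers purely as stand-ins to illustrate the comparison procedure on synthetic data.

```python
# Sketch: ranking trained networks by validation MSE (the study's criterion).
# The solvers below are stand-ins, NOT the MATLAB algorithms named above.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=2)
y = (y - y.mean()) / y.std()  # standardize the target so all solvers behave

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=2)

results = {}
for solver in ("lbfgs", "adam", "sgd"):
    net = MLPRegressor(hidden_layer_sizes=(16,), solver=solver,
                       max_iter=2000, random_state=2).fit(X_tr, y_tr)
    results[solver] = mean_squared_error(y_va, net.predict(X_va))

best = min(results, key=results.get)
print("lowest validation MSE:", best)
```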
Article
Full-text available
A successful project is determined by its effective performance and by the prioritization of all unavoidable software project risks. In this paper, risk evaluation in software projects is performed by developing a new hybridized fuzzy-based risk evaluation framework. During the decision-making process, this proposed scheme determines and ranks all the significant project risks. Software project risks are better assessed with the incorporation of intuitionistic fuzzy-based TOPSIS, adaptive neuro-fuzzy inference system-based multi-criteria decision making (ANFIS MCDM), and fuzzy decision making trial and evaluation laboratory methods. In order to attain accurate software risk estimation, the ANFIS parameters are adjusted with the help of an enhanced crow search algorithm (ECSA). The ECSA is combined with the ANFIS approach to free solutions stuck in local optima while adopting only small changes for the adjustment of the ANFIS parameters. The NASA 93 dataset, with 93 software project values, was used to conduct the experimental validation. Experimental outcomes show that software project risks were evaluated accurately and effectively using the proposed integrated fuzzy-based framework.
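The core of the TOPSIS step mentioned above can be sketched in plain (crisp) form. The intuitionistic-fuzzy extension and the ANFIS/ECSA components are beyond this illustration, and the decision matrix and weights below are made-up examples.

```python
# Sketch of plain TOPSIS ranking: alternatives are scored by their
# relative closeness to an ideal solution.
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives; benefit[j] is True if criterion j is larger-is-better."""
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the weights.
    v = (m / np.linalg.norm(m, axis=0)) * np.asarray(weights, dtype=float)
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti, axis=1)
    return d_neg / (d_pos + d_neg)  # closeness: higher = better alternative

# Three candidate risk responses scored on cost (lower is better) and coverage.
scores = topsis([[250, 0.80],
                 [400, 0.95],
                 [300, 0.60]],
                weights=[0.5, 0.5], benefit=[False, True])
print(scores.argmax())  # index of the best-ranked alternative
```

In the paper's framework, the crisp matrix would be replaced by intuitionistic fuzzy numbers and the weights would come from the decision-making methods rather than being fixed by hand.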
Article
Full-text available
The initial stage of the software development lifecycle is the requirement gathering and analysis phase. Predicting risk at this phase is crucial because cost and effort can be saved while improving the quality and efficiency of the software to be developed. The datasets for software requirements risk prediction were adopted in this paper to predict the risk levels across software projects and to ascertain the attributes that contribute to the recognized risk in the software projects. A supervised machine learning technique, the Naïve Bayes classifier, was used to predict the risk across the projects. The model was able to predict the risks across the projects, and the performance metrics of the risk attributes were evaluated. The model predicted four (4) projects as Catastrophic, eleven (11) as High, eighteen (18) as Moderate, thirty-three (33) as Low, and seven (7) as Insignificant. The overall confusion matrix statistics on the risk-level predictions by the model showed an accuracy of 98% with a confidence interval (CI) of 95% and a Kappa of 97%.
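The confusion-matrix statistics quoted above (accuracy and Kappa over the five risk levels) can be computed as sketched below. The label sample here is tiny and made up; it is not the study's data.

```python
# Sketch: accuracy and Cohen's kappa over multi-level risk predictions,
# computed from true vs. predicted labels (made-up example labels).
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

levels = ["Insignificant", "Low", "Moderate", "High", "Catastrophic"]
y_true = ["Low", "Low", "Moderate", "High", "Catastrophic", "Moderate", "Low"]
y_pred = ["Low", "Low", "Moderate", "High", "Catastrophic", "Moderate", "Moderate"]

# Rows = true level, columns = predicted level, in the order given by `levels`.
print(confusion_matrix(y_true, y_pred, labels=levels))
print("accuracy:", round(accuracy_score(y_true, y_pred), 3))
print("kappa:", round(cohen_kappa_score(y_true, y_pred), 3))
```

Kappa corrects the raw accuracy for agreement expected by chance, which is why the study reports it alongside accuracy.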
Article
Recently, software project failures have been increasing due to lack of planning and budget constraints. In this regard, identifying a suitable software model with consideration of risk factors is imperative. Therefore, this study investigates the key software models utilized in the industry through interaction with software development experts and a literature survey. In this study, 15 standard indicators were chosen, and a survey was conducted through a questionnaire. The major performance metrics taken into consideration are network, security, software, machine learning, internet of things, and application programming interface. We proposed a novel framework for the dataset received through the questionnaire, in which machine learning classifiers were applied and risk predictions for each of the identified software models were produced. Using this outcome, software product managers can identify the appropriate software model according to the software requirements, along with a risk prediction percentage.
Article
Context: Risk event management has become strategic in project management, where uncertainties are inevitable. In this sense, concepts of ubiquitous computing, such as contexts, context histories, and mobile computing, can assist proactive project management.
Objective: This paper proposes a computational model for reducing the probability of project failure through the prediction of risks. The purpose of the study is to present a model that assists teams in identifying and monitoring risks at different points in the life cycle of projects. The work's scientific contribution is the use of context histories to infer risk recommendations for new projects.
Method: The research conducted a case study in a software development company, applied in two scenarios. The first involved two teams that assessed the use of the prototype during the implementation of 5 projects. The second considered 17 completed projects to assess the recommendations made by the Átropos model, comparing them with the risks in the original projects. In this scenario, Átropos used 70% of each project's execution as learning for the risk recommendations generated for the same projects; the scenario thus assessed whether the recommended risks were contained in the remaining 30% of the executed projects. As context histories, we used a database with 153 software projects from a financial company.
Results: A project team of 18 professionals assessed the recommendations, yielding 73% acceptance and 83% accuracy when compared with projects already being executed. The results demonstrated a high percentage of acceptance of the risk recommendations compared with other models that do not use the characteristics and similarities of projects.
Conclusion: The results show the applicability of risk recommendation to new projects based on the similarity analysis of context histories. This study applies inferences on context histories to the development and planning of projects, focusing on risk recommendation. With recommendations that consider the characteristics of each new project, the manager starts with a larger set of information and can make more assertive project plans.
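The similarity-driven recommendation idea above, suggesting to a new project the risks observed in the most similar historical projects, can be sketched as follows. This is an illustrative sketch, not the Átropos model itself: the project feature vectors, past risks, and similarity threshold are all invented.

```python
# Sketch: recommend risks to a new project from the risks of
# sufficiently similar historical projects (all data invented).
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

history = {  # project -> (feature vector, risks observed in that project)
    "P1": (np.array([0.9, 0.1, 0.5]), {"scope creep", "budget overrun"}),
    "P2": (np.array([0.2, 0.8, 0.4]), {"staff turnover"}),
    "P3": (np.array([0.8, 0.2, 0.6]), {"scope creep", "schedule slip"}),
}

def recommend(new_project, threshold=0.95):
    """Union of risks from all historical projects above the threshold."""
    risks = set()
    for vector, past_risks in history.values():
        if cosine(new_project, vector) >= threshold:
            risks |= past_risks
    return sorted(risks)

print(recommend(np.array([0.85, 0.15, 0.55])))
```

In the Átropos model, the "feature vectors" would come from context histories of 153 real projects, and the team would then accept or reject each recommended risk.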