ArticlePDF Available

A systematic literature review on student performance predictions

November 2021
International Journal of Advanced Technology and Engineering Exploration 8(84)

November 2021
8(84)

DOI:10.19101/IJATEE.2021.874521

Authors:

Hasnah Nawang

Universiti Sultan Zainal Abidin | UniSZA

Amir fazamin

Universiti Sultan Zainal Abidin | UniSZA

Prediction of student performance in educational institutions is a major topic of debate among researchers in efforts to improve teaching and learning. Effective prediction techniques and features would help educators and teachers design appropriate teaching content to help learners study according to predicted outcomes. The purpose of this paper is to present a systematic literature review on predictions of students’ performance in higher education institutions and secondary schools using Machine Learning, Educational Data Mining, and Learning Analytics methodologies. The review used in this study was designed to: i) provide an overview of techniques and algorithms used to predict students' performance; and ii) identify the features that have the greatest impact on students' performance. This paper also outlined several future insights in terms of applying hybrid techniques to educational datasets in order to improve accuracy in predicting students’ performance. © 2021, Accent Social and Welfare Society. All rights reserved.

Databases used to search for papers

…

Selected papers for study review

…

Groups of features used in the studies

…

Random forest details

…

Support vector machine details

…

Figures - uploaded by Hasnah Nawang

Content may be subject to copyright.

Content uploaded by Hasnah Nawang

Content may be subject to copyright.

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

ISSN (Print): 2394-5443 ISSN (Online): 2394-7454

http://dx.doi.org/10.19101/IJATEE.2021.874521

1441

A systematic literature review on student performance predictions

Hasnah Nawang1*, Mokhairi Makhtar2 and Wan Mohd Amir Fazamin Wan Hamzah3

Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Terengganu, Malaysia1

School of Computer Science, Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Tembila

Campus, Terengganu, Malaysia2

School of Information Technology, Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Tembila

Campus, Terengganu, Malaysia3

Received: 31-July-2021; Revised: 01-November-2021; Accepted: 06-November-2021

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1.Introduction

Nowadays, many high school and higher educational

systems generate a large number of student

information through the learning management system,

examination data, students’ activities, library system,

etc. [1]. This situation leads to increases in the volume

and types of educational data in every institution.

Machine learning (ML), learning analytics (LA), and

data mining (DM) approaches have been widely used

on educational data to predict students’ performance.

These approaches have shown that several techniques

and algorithms are useful in understanding this

domain, which is poorly accessed by human

capability.

Students’ success has become an important metric to

higher educational institutes as well as secondary level

schools. In higher education institutions, students’

performance plays a vital role in determining their job

success [2]. Good academic performance assures

employers of a candidate’s quality and reliability.

*Author for correspondence

In order to build and apply a predictive model, features

that correlate with the value to be predicted must be

collected and processed. There are many features

affecting students’ performance, and they can be

divided into a few groups, such as students’ previous

education, students’ e-learning activity, demographics

features [3], students’ social network information [4],

behaviour variables [5], external assessment, extra-

curricular activities and academic performance [6],

school design [7], parental involvement, etc. [8]. As

there are numerous features and approaches are

utilised to forecast students' performance, this study

provides a thorough evaluation of student performance

predictions for high/secondary schools and higher-

level institutions in terms of the most influential

attributes and methods regularly employed by

researchers.

Either in university/college or secondary school,

students’ performance prediction had drawn much

attention not only amongst educators but the machine

learning community as well. Predicting students’

performance has become a challenging task due to the

increased amount of data contained in educational

systems [9, 10]. The students' interest in learning are

Review Article

Abstract

Prediction of student performance in educational institutions is a major topic of debate among researchers in efforts to

improve teaching and learning. Effective prediction techniques and features would help educators and teachers design

appropriate teaching content to help learners study according to predicted outcomes. The purpose of this paper is to present

a systematic literature review on predictions of students’ performance in higher education institutions and secondary

schools using Machine Learning, Educational Data Mining, and Learning Analytics methodologies. The review used in

this study was designed to: i) provide an overview of techniques and algorithms used to predict students' performance; and

ii) identify the features that have the greatest impact on students' performance. This paper also outlined several future

insights in terms of applying hybrid techniques to educational datasets in order to improve accuracy in predicting students’

performance.

Keywords

Educational data mining, Machine learning, Learning analytics, Students, performance prediction.

Hasnah Nawang et al.

1442

varies. Some may be able to learn independently using

e-learning, while others may need to study with the

assistance of educators who prepare their teaching

materials based on the needs of the students, while

others may perform adequately in their examinations

with the full facilities provided in school or

institutions, and those with a higher family income

may be able to find outside sources to fund extra-

classes for their children's educational needs. The

variety of this interest has motivated the authors to

identify what features mostly contribute to learning

outcomes together with which suitable machine

learning techniques that will later assist institutions

administration in coming up with solutions in offering

the best education environment to their students.

Educators, on the other hand, will be able to construct

instructional content based on students' acceptance

and lead students' learning progress.

In the recent years, with the advances of the

application of technologies to forecasting students’

performance, there are still gaps to be filled in order to

discover most commonly used features and

techniques. Despite of other review papers on student

performance exist, few of them emphasize on the

importance of combining multiple groups of features

and techniques in order to enhance the accuracy

percentage. Previous literature review on students’

performance [2, 11] has discussed this topic on general

and did not emphasize on the impact of hybrid

approaches that able to help in optimizing the

accuracy. Thus, we aim to comprehensively map,

analyze and review the articles that have been

published in 2016 to 2020.

2.Research methodology

The conduct of this review paper had followed the

recommended procedures on performing a systematic

review as provided by Kitchenham et al. [12]. The

suggested procedure for a review is divided into three

stages, which are:

1. Planning the review

2. Conducting the review

3. Reporting the review

2.1Planning the review

This systematic review was conducted due to the need

to summarise the latest five years of research on

student performance prediction. As the first phase of

the review is planning, it is thus necessary to identify

the importance of this review. This systematic review

was proposed in order to support the objectives of this

study which are:

1. To summarise existing methodologies used in

student performance prediction.

2. To identify the most common features used in

predicting students’ performance.

3. To identify the most common algorithms used to

predict students’ performance.

4. To identify the gaps in previous research

After the objectives of the research were identified, the

most important activity in a systematic review

protocol which is to formulate the research questions

was then conducted. Based on Kitchenham, 2009

recommendations [12], the construction of research

questions must consider three viewpoints of the study

criteria, which are population, intervention, and

outcomes. The following are details for each study

criteria: 1) population: high school and higher

educational institutes; 2) intervention: methods,

algorithms, and techniques for prediction; and 3)

outcome: best features or variables used as

performance predictors and successful prediction

techniques or approaches. Hence, the following study

criteria have led to the mapping of this study research

questions (RQ):

1. What are the features commonly used by

researchers to predict students’ performance?

2. Which methods that are commonly used to

investigate students’ performance?

3. What are the best algorithms or techniques used in

student performance prediction?

2.2Conducting the review

2.2.1Dataset

To ensure the adaptation of relevant papers in this

review, papers were collected from four databases

namely IEEE Xplore, ScienceDirect, SpringerLink,

and EBSCOhost. The four databases are as illustrated

in Table 1. The search was conducted in September

2021.

Table 1 Databases used to search for papers

No.

Database name

No of papers

IEEEXplore

ScienceDirect

Springer Link

EbcoHost

2.2.2Search strategy

The search strategy was used to identify relevant

papers to the review. The following phrases - Machine

learning, educational data mining, learning analytics,

students’ performance analysis, predictive model,

students’ performance prediction, students’ academic

performance, school, high school, higher education -

were used to ensure that the selected papers only

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

1443

discussed machine learning, educational data mining,

and learning analytics approaches. The search terms

have been constructed by identifying the techniques of

information extraction, level of education, and

category of prediction. We used Boolean operators

like AND, OR, and NOT in our search strings.

2.2.3Selection criteria

In this review, the study selection criteria were

planned in order to identify the primary studies based

on the following parts: 1) studies that used machine

learning, educational data mining or learning analytics

technique(s) for predictive modelling; 2) studies in

peer-reviewed journals or conference proceedings

written in English; and 3) studies that investigated the

prediction of students’ performance at higher

educational or high school levels. This systematic

review was performed on studies published from the

years 2016 to 2020.

2.2.4The inclusion and exclusion criteria

The main objective is to include as many articles as

possible starting from 2016 to 2021 that are related to

the research questions. Then, certain criteria were set

to identify whether an article should be included or not

in the analysis.

Inclusion criteria

1. Research papers on Students Performance

Prediction

2. Papers from 2016 to 2020 era.

3. Papers written in English.

4. Not a review paper

Exclusion Criteria

1. Studies not using ML or DM techniques

2. Duplicates paper

3. Irrelevant titles, abstract and keywords

4. Not in peer-reviewed

Figure 1 and illustrates the PRISMA flowchart used

as a guide during the selection process. As discussed

in Section 2.2.1, four databases, which are IEEE

Xplore, EBSCOhost, ScienceDirect, and

SpringerLink, were used to search for papers during

this systematic review. Ten items were retrieved from

IEEE Xplore, 26 items were from EBSCOhost, 60

items were from ScienceDirect, and 50 items were

from SpringerLink, making a total of 146 records.

Firstly, published papers that match the search strings

were identified. Secondly, papers were selected based

on the title, abstract, and keywords relevant to the

eligible criteria. The final selection was done by

reading the full content of the papers. This study

selected 40 articles for the subsequent review process

as shown in Figure 1. The first screening removed 40

records due to problems such as duplicate papers,

papers not written in English, and papers that were not

peer-reviewed, leaving a total of 106 records. The

second screening removed 30 papers due to irrelevant

titles, abstract, and keywords used, bringing the new

total to 76 records. Finally, the full text paper reading

process excluded another 36 papers, making a total

number of 40 papers remaining for analysis. These

papers were tabulated based on their Paper Id and

author/s name in Table 2.

Figure 1 PRISMA flowchart for research extraction

and direction

Table 2 Selected papers for study review

Paper Id

Author

Paper Id

Author

[13]

P21

[33]

[14]

P22

[34]

[15]

P23

[35]

[16]

P24

[36]

[17]

P25

[37]

[18]

P26

[38]

[19]

P27

[39]

[20]

P28

[40]

[21]

P29

[41]

P10

[22]

P30

[42]

P11

[23]

P31

[43]

P12

[24]

P32

[44]

P13

[25]

P33

[45]

P14

[26]

P34

[46]

P15

[27]

P35

[47]

P16

[28]

P36

[48]

P17

[29]

P37

[49]

P18

[30]

P38

[50]

Hasnah Nawang et al.

1444

Paper Id

Author

Paper Id

Author

P19

[31]

P39

[51]

P20

[32]

P40

[52]

3.Results

This section will discuss the on the analysis of

features, methods and algorithms used in this review

study. There are two main factors that contribute to

success in student performance prediction, which are

the features or attributes of educational data and the

techniques or algorithms used to explore educational

data in order to make predictions and patterns [53].

3.1(RQ:i) What are the features commonly used by

researchers to predict students’ performance?

Different researchers have found different attributes

that contribute to prediction accuracy in their studies.

However, some attributes carry the same meaning but

are differently labelled, and those attributes can be

divided into a few groups, such as students’ previous

education, students’ e-learning activities,

demographic features, students’ social network

information, behaviour variables, extra-curricular

activities, school design, academic performance,

parental involvement, etc.

During the full text reading process, some features

were identified and grouped for this review. Six

groups of features were detected and consisted of

Demographic Features, E-Learning Features, Social

Network Features, School Design Features, Academic

Performance Features, and Previous Education

Features.

There are 40 primary studies included in this review.

As shown in Figure 2, amongst the findings, academic

performance group features had the highest percentage

of recurrence in terms of researchers using them to

predict students’ performance with the 87.5%

followed by previous education group with 47.5% and

demographic features with 0.5% difference that is

47%. While the recurrence of percentage for social

network features is 10% and the lowest is the school

design features with 5% of recurrence. From the

percentage, it shows that most researchers find

academic performance features [11], such as GPA to

be significant predictors of students’ performance.

Figure 3 show the percentage of studies using one or

more number and Table 3 groups the features found

into the six categories.

Figure 4 illustrated the list of attributes used in this

study based on their feature groups.

Table 4 will go detailed on the best attributes discover

by the researchers in this review.

Using more than one category of features seems

popular in predicting students’ performance as shown

in Figure 2, 47% researchers in this review used the

combination of two different groups of features in

their study, followed by three combination of features

groups with 23% and 10% of the study used four

combination of features groups. As for only one

category of feature group has been used has recorded

20% of study and none of the study in this review had

use five or six combination of features group. It can be

concluded that, the importance of using more than one

category of features cannot be neglected as 80 percent

from the studies have used more than one group

category features in forecasting students’ performance

whether in higher level institutions or high school.

Table 3 Groups of features used in the studies

Category Id

Features category description

Demographic Features

E-Learning Features

Social Network Features

School Design Features

Academic Performance Features

Previous Education Features

As tabulated in Table 4, there are different attributes

that have been identify as the best predictors in

predicting students’ performance. The academic

performance features have recorded the highest

frequency of best attributes as illustrated in Figure 2.

However, the findings differ from previous review

article [11] that stated GPA is the most influential

attribute to student performance prediction, as only

two study that are [16, 43] out of 18 studies from the

academic performance category that discovered that,

and the majority of other studies ended up finding that

students' grade and score in examinations, quizzes, and

tests are the most impacted attributes to students'

performance. This is because academic performance is

mainly measured through the GPA, grades and score.

As this review also focusing on the high school and

secondary school level, the measurement to the

students’ performance not only depends on the GPA.

Previous education category feature has become the

second most influential attributes as 8 of the studies

found that Scholastic admission test score, high school

marks and grade and National Examination score are

able to make the highest impact on students’

performance with the recurrence percentage is 47.5.

Previous education features in predicting students’

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

1445

performance are essentially important as it represents

benchmark of students’ academic achievement [21]. It

is commonly defined as score or grades obtained by

students in the past level of education such as high

school or university admission score which aids in

understanding the consistency of students'

performance.

As for the e-learning features, a total of 11 studies used

the e-learning features combine with other category of

features, and the findings have shown that 7 studies

have recognized the most impacted attributes are the

number of raise hands and logins, submission status

for assignment and homework given, announcement

view and students’ participation in discussion or

forums are impacted the accuracy of students’

performance prediction. Furthermore, [19] identify

that best variable in e-learning features are those that

related to exercise and homework rather than students’

participation in forum and discussions [17, 32]. For

demographic features, although the recurrence

percentage is the quite high among others that is 45

percent, however the best attribute is not as promising

as the academic performance features. only 3 studies

found the demographic features such as gender, caste,

father and mother education, father and mother

occupation, family income and family size do have

greater impact to the prediction [16, 22, 34]. While

[49] analysis has discovered that there is no corelation

between students is first child or not in predictive

model.

Although not many studies emphasize on using the

school design features, however this review has able

to identify two studies that has discovered that the best

attributes in their research is school size [3] and the

percentage of lecturer attendance [10]. Therefore, it is

important for other researchers to look beyond the

common attributes such as academic performance as

the educators and school facilities also tend to

impacted students’ performance. While the social

network features able to identify list of attributes such

as List of webs visited; visits duration, time spend on

movies online, time spend on reading online are able

to contribute to the students’ performance prediction

[24, 30, 47].

Figure 4 illustrates the features that have been used in

this review study. The features were divided based on

six groups that have been discussed above. Some

features are redundant due to their different labels.

However, this study had identified those redundant

features and relabelled them.

Figure 2 Most common used attributes in predicting students’ performance based on groups category

Table 4 Best attributes in predicting students’ performance

Features category

Best attributes

Methods

Paper Id

Academic

performance

features

Students’ grade and score in examination, test,

quizzes and assignment, GPA, internal

assessment on courses and subjects, students’

attendance marks.

Classification

P4,P6,P8,P9,P11,P14,P15,P16,

P1,P20,P25,P26,P28,P32,P40

Classification and

Clustering

P19

Regression

P26, P31

Major course change

Classification

P30

Previous education

features

National examination score, Scholastic

admission test score, high school grades and

score

Classification

P1,P2, P8, P10, P21, P23

Regression

Clustering

P2, P34

27.5

10 5

87.5

47.5

100

C1 C2 C3 C4 C5 C6

Recurrencnce Percentage

Attribute Groups' Names

Hasnah Nawang et al.

1446

Features category

Best attributes

Methods

Paper Id

Demographic

Features

Gender, caste, father and mother education,

fathers and mother occupation, family income

Classification

P4,P14,P22,P37

E-learning features

Number of raise hand, number of logins to the

online class, announce view and participation in

forum and discussions, submission of exercises

and homework, hours spent on material.

Classification

P5, P7, P12, P13, P24

Students’ course marks and grade, attendance

marks.

Classification and

clustering

P29

Classification

P18

Social Network

Features

List of webs visited; visits duration, time spend

on movies online, time spend on reading online.

Classification

P12, P18

Clustering

P35

School Design

Features

School size

Regression

Lecturer attendance percentage

Classification

P28

Figure 3 The percentage of studies using one or more number of features categories in their study

Figure 4 Some of features used in this study based on their groupings

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

1447

3.2(RQ:ii) Which methods that are commonly used

to investigate students’ performance?

Figure 5 shows different researchers have applied

various techniques such as classification, regression

and clustering to predict students’ performance. The

goal of classification is to accurately predict the target

class for each case in the data [54], while regressions

are used to identify the relationship between

dependent variables and independent variables [55].

Different from classification and regression, clustering

is an unsupervised classification process that is used to

group objects into classes of similar objects. The

classification method is the most commonly used

method in predicting student performance, as this

review discovered that 30 out of 40 studies used the

classification method rather than clustering and

regression, each of which had only 5 studies. This is

because the classification method or also known as

supervised learning use labeled data to train the

algorithms are computationally less complex compare

to other methods. These labelled data also is used to

train techniques by learn over time and finally

accurately classify data or make predictions.

Figure 5 The number of students’ performance prediction method in the review

3.3(RQ:iii) What are the best algorithms or

techniques used in student performance

prediction?

The techniques used in this review paper are

numerous. This implies that there may be multiple

options for implementing prediction algorithms.

Furthermore, several models are often used in the

same paper to make comparisons in order to find the

best model suitable with their dataset. Random Forest

(RF), Decision Tree (DT), Naïve Bayes (NB), Support

Vector Machine (SVM), Artificial Neural Network

(ANN), Logistic Regression (LR), and K-Nearest

Neighbour (KNN) are among the algorithms

frequently used by researchers to predict students’

performance. A brief explanation on results based on

algorithms or techniques to predict students’

performance will be discussed in the next section.

3.3.1Random forest (RF)

Random Forest (RF) is a supervised ensemble

machine learning approach for classification,

regression, and other tasks that operate by constructing

a number of decision trees during the training time and

producing the output of the class, which is the mode of

the classes of the individual trees [56, 57]. This review

identified 18 out of the 40 studies that had tested the

Random Forest algorithm on their dataset for

prediction. From the total of 18, 6 studies showed the

Random Forest algorithm to have the highest accuracy

beating other algorithms in the prediction of students

at risk [17, 22, 24, 25] or students’ dropout [43, 44].

Details of the feature categories and accuracy levels

achieved according to paper ID are shown in Table 5.

Table 5 Random forest details

Paper Id

Features category

Accuracy

94%

P12

C1, C2, C6

71.6%

P13

C2, C5

84%

P14

C1, C5

99%

P31

73%

P32

C1, C5

86.6%

3.3.2Support vector machine (SVM)

Social network analysts argue that causation is not

located in individuals, but in the social structure.

Social network analysis examines the structure and

classification regression clustering

The number of mehods used in the study

Machine learning Methods

Hasnah Nawang et al.

1448

composition of ties in a given network and provides

insights into its structural characteristics [58]. In

education, SVM algorithms have been proven to be

helpful in monitoring students’ interactions and

participation in online courses. It has been

acknowledged to be among the most reliable and

accurate algorithms in most Machine Learning

applications [54]. SVM comes in second for the

number of highest accuracy rates achieved in this

study. 5 over 19 papers used SVM to predict academic

success [13], academic performance [19, 34, 45, 50]

and students’ pass rate [35]. For details on the

accuracy rates achieved and feature categories used in

these studies, refer Table 6.

Table 6 Support vector machine details

Paper Id

Features Category

Accuracy

C1, C3, C5, C6

70.4%

C5, C6

P22

C1, C5. C6

76.3%

P23

C1, C3, C5, C6

92.6%

P33

C1, C5

75.4%

3.3.3Artificial neural network (ANN)

Artificial Neural Network is a mathematical model

inspired by biological neural networks. A neural

network consists of an interconnected group of

artificial neurons, and it processes information using a

connectionist approach to computation. The network

usually learns the connection weights from available

training patterns. Performance is improved over time

by iteratively updating the weights in the network [59].

From the total of 11 papers reviewed that used the

ANN algorithm, 3 of them used ANN to forecast

students’ performance [48, 51, 52] and predict student

course assessment course as tabular in Table 7.

Recorded that ANN was able to achieve the highest

accuracy [41].

Table 7 Artificial neural network details

Paper Id

Features category

Accuracy

P29

80.4%

P36

75%

P39

C5, C6

P40

80%

3.3.4Logistic regression (LR) and linear regression

(LNR)

Linear Regression predicts a continuous numeric

output from a linear combination of attributes, while

Logistic Regression predicts the odds of two or more

outcomes, allowing for categorical predictions [8]. 4

among 11 papers were recorded in Table 8 that have

been used Logistic Regression techniques with their

accuracy achieved. While Table 9 shows the details of

accuracy and features category used using Linear

Regression algorithm.

Table 8 Logistic regression details

Paper Id

Features category

Accuracy

C1, C4, C5, C6

67%

P11

C5, C6

51.9%

P15

89.15%

P24

C1, C5

88.8%

Table 9 Linear regression details

Paper Id

Features category

Accuracy

P20

C1, C2, C5

50%

P21

C4, C6

60.24%

P30

C1, C5.C6

3.3.5Decision tree

The Decision Tree classification technique is

performed in two stages, which are 1) tree building,

and 2) pruning [60]. The internal nodes of the tree

represent conditions, the external nodes or the leaves

represent class labels, while branches from the internal

nodes represent outcomes of the tests or conditions

[61]. Decision Tree (DT) is one of the famous

algorithms used for predictive modelling on

educational data [62]. 19 studies reviewed used the

Decision Tree algorithm, and 3 of them succeeded to

achieve the highest rate of accuracy when competing

with other algorithms. For example, [26] used

academic performance and previous education

features to predict students’ performance in

intermediate and secondary schools. Decision Tree

also gave good accuracy in identifying high-risk

students who need timely help to complete their

studies as discovered by [30] who used the

combination of e-learning, social network, and

academic performance features to conduct their work.

Table 10 Decision tree details

Paper Id

Features category

Accuracy

P10

C5, C6

96.6%

P18

C2, C3, C5

91.9%

P37

C1, C5

94%

3.3.6Naïve bayes (NB)

The Naive Bayes classifier simplifies learning by

assuming that features are independent of given class

and provide probabilistic interpretations of

classifications [63]. Although independence is

generally a poor assumption, in practice, Naive Bayes

often competes well with other sophisticated

classifiers. [18] identified that the Naïve Bayes

algorithm has better accuracy in predicting the

performance of junior high school students, while [21]

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

1449

used Naïve Bayes to analyse undergraduate students’

performance.

Table 11 Naïve bayes details

Paper Id

Features category

Accuracy

69%

C5, C6

83.65%

3.3.7Hybrid algorithms

In prediction models, the challenging task is to choose

the effective techniques that could produce satisfying

predictive accuracy [64]. Some researchers introduced

the hybrid approach of combining a few machine

learning algorithms together to achieve maximum

accuracy. Hybrid approaches are defined as

incorporating a number of possible machine learning

algorithms in order to achieve better performance than

any examined single learning algorithms [65]. In this

review, five papers used hybrid algorithm approaches

to predict student performance [15, 38, 20, 50], and to

analyse students at risk in a course [28]. Details of the

algorithm integrations are illustrated in Table 12.

Table 12 Hybrid algorithms details

Paper Id

Features category

Accuracy

C1, C4, C5, C6

C1, C6

82.3%

P16

85%

P26

C1, C5

98.96%

P38

C1,C5

97.93

Other than the algorithms mentioned above, the ID3,

K-Nearest Neighbour (KNN), Gradient Boosting

(GB), Association Rule Mining (ARM), J48, RMSR,

and Fuzzy Network (FN) algorithms were found to

have achieved high accuracy rates in only one paper

reviewed. Details on the accuracy rates based on the

different techniques used are tabulated in Table 13.

Table 13 Other algorithms details

Paper Id

Features category

Algorithms

Accuracy

C1, C5, C6

RMSR

94%

P17

C5, C6

ID3

80%

P25

C5, C6

80%

P28

C2, C5

KNN

89%

P27

ARM

67.3%

P34

C1, C5, C6

GBoost

87.9%

P35

C2, C5, C6

J48

94.7%

Details of the algorithms frequency in terms of as a

best predictor or frequency used for comparison in

order to find more accurate results in each study are

illustrated in Figure 6 below. Figure 6 shows that

Decision Tree is the most commonly used algorithm

in the study, with 19 out of 40 studies using it to

determine the best accuracy for educational datasets.

However, Random Forest placed first in terms of

highest percentage obtained because it provided the

highest accuracy in 6 over 18 studies or a total of

33.5% when compared to others. When it comes to the

highest accuracy that each algorithm can achieve,

Random Forest once again demonstrated the best

performance with 99% accuracy results [26], putting

this algorithm in first place among others. However,

SVM also performs as good as Random Forest as 19

papers have tested SVM as a single or comparison

algorithms and SVM able to gain 5 over 19 best

accuracies compare to Random Forest. While other

algorithms that obtained high frequency in the chosen

algorithms are ANN, NB, KNN and LR.

Figure 6 The Algorithms frequency details as the best predictor or used as a comparisons techniques in finding the

best accuracy

Frequency

Algorithms

Frequency as the best predictor

Frequency used

Hasnah Nawang et al.

1450

4.Conclusion

Performance prediction has evolved into a useful

research topic that assists educators, academics,

policymakers, and management in improving the

teaching and learning process. This paper presents a 5-

year systematic review of attributes used in student

performance prediction made via machine learning,

educational data mining, and learning analytics

approaches, as well as their applicability in the context

of student performance prediction. Analysis on 40

papers included in this review brought about great

discussions on the varieties of features and algorithms

that can impact student performance prediction.

The majority of features used in predicting students'

performance are from the academic feature category,

which includes students' grade and score in

examinations, tests, quizzes, and assignments, GPA,

internal assessment on courses and subjects, and

attendance marks. According to the findings of this

review, academic performance features able to

outperform other categories of features in terms of the

best attributes for students’ performance prediction.

Another finding from this review is that the

examination marks and score are the best attributes

compare to GPA as highlighted by [11]. This is due to

the fact that prediction of student performance has

been widely adopted in the educational sector, not only

at the higher levels of educational attainment, but also

in high school, middle school, and secondary school.

This will be a strong foundation for educators and

administrators to be able to monitor their students'

academic progress while also establishing a suitable

approach based on their students' strengths.

Looking deeper into the frequency of techniques used

by researchers in predicting student performance,

Decision Tree, Random Forest, SVM, ANN, and NB

show a strong competition of most common used

techniques. However, among the 40 papers reviewed,

the Random Forest algorithm had the best

performance at 99 percent when using demographic

and academic performance features [26], as well as the

majority frequency in terms of highest accuracy in

student performance prediction. RF could be used for

both regression and classification method and even

though it is handling high-dimensional data, the lesser

time for processing make it better than Decision Tree

algorithm [51].

The hybrid approach which combined three

algorithms: AODE, IBK, and J48, scored the second

highest prediction accuracy at 98.96%. Previous

research on student performance prediction suggest

that algorithm combinations can help gain better

accuracy, but they do not emphasize on the importance

of incorporating algorithms to improve accuracy

results. Not only has the integration of multiple

features resulted in improved accuracy prediction, but

the combination of different techniques has also

contributed to improved accuracy results [66]. These

results show that the selection of combined features

categories together with suitable algorithms can affect

the accuracy levels obtained when analysing students’

performance. However, the significance of using

hybrid algorithms in student performance prediction

requires further investigation as most hybrid approach

has the capacity to produce competitive performance

when compared with related methods. Finally, it is

hoped that academic forecasting research on the

educational system can help students in improving

their academic performance, and educators to

understand their students' needs. Additionally, the

findings can help the educational management to

design more efficient curricula for better education

adaptation.

Limitations

Since there is no evidence in favour of smaller classes,

most articles are concerned about overfitting. As a

result, the use of data sampling in the application of

data mining to educational datasets should indeed be

outlined in order to overcome the overfitting and

underfitting problem. Other issues that need to be

highlighted in this review, there are too limited papers

that study on the school design and social network

features as mostly researchers are emphasizing on the

academic performance and demographic features.

Hopefully, more researchers will implement more

categories of features in the future in order to find the

best attributes suitable for adaptation to specific

algorithms.

Acknowledgment

This research was supported by the Research Management

and Innovation Centre, Universiti Sultan Zainal Abidin

(UniSZA).

Conflicts of interest

The authors have no conflicts of interest to declare.

References

[1] Gaftandzhieva S, Docheva M, Doneva R. A

comprehensive approach to learning analytics in

Bulgarian school education. Education and Information

Technologies. 2021; 26(1):145-63.

[2] Alyahyan E, Düştegör D. Predicting academic success

in higher education: literature review and best practices.

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

1451

International Journal of Educational Technology in

Higher Education. 2020; 17(1):1-21.

[3] Ferreira SA, Andrade A. Academic analytics: Anatomy

of an exploratory essay. Education and Information

Technologies. 2016; 21(1):229-43.

[4] Najafabadi MM, Villanustre F, Khoshgoftaar TM,

Seliya N, Wald R, Muharemagic E. Deep learning

applications and challenges in big data analytics.

Journal of Big Data. 2015; 2(1):1-21.

[5] Khatib KC, Kamble TD, Chendake BR, Sonavane GN.

Social media data mining for sentiment analysis.

International Research Journal of Engineering and

Technology. 2016; 3(04):373-6.

[6] Ozdemir D, Opseth HM, Taylor H. Leveraging learning

analytics for student reflection and course evaluation.

Journal of Applied Research in Higher Education.

2019; 12(1):27-37.

[7] Nuutila K, Tuominen H, Tapola A, Vainikainen MP,

Niemivirta M. Consistency, longitudinal stability, and

predictions of elementary school students' task interest,

success expectancy, and performance in mathematics.

Learning and Instruction. 2018; 56:73-83.

[8] Lang C, Siemens G, Wise A, Gasevic D. Handbook of

learning analytics. New York, NY, USA: SOLAR,

Society for Learning Analytics and Research; 2017.

[9] Nawang H, Makhtar M, Shamsudin SN. Classification

model and analysis on students’ performance. Journal

of Fundamental and Applied Sciences. 2017;

9(6S):869-85.

[10] Tsai YS, Gasevic D. Learning analytics in higher

education---challenges and policies: a review of eight

learning analytics policies. In proceedings of the

seventh international learning analytics & knowledge

conference 2017(pp. 233-42).

[11] Shahiri AM, Husain W. A review on predicting

student's performance using data mining techniques.

Procedia Computer Science. 2015; 72:414-22.

[12] Kitchenham B, Brereton OP, Budgen D, Turner M,

Bailey J, Linkman S. Systematic literature reviews in

software engineering–a systematic literature review.

Information and Software Technology. 2009; 51(1):7-

15.

[13] Gil PD, da Cruz Martins S, Moro S, Costa JM. A data-

driven approach to predict first-year students’ academic

success in higher education institutions. Education and

Information Technologies. 2021; 26(2):2165-90.

[14] Qazdar A, Er-Raha B, Cherkaoui C, Mammass D. A

machine learning algorithm framework for predicting

students performance: a case study of baccalaureate

students in Morocco. Education and Information

Technologies. 2019; 24(6):3577-89.

[15] Costa-Mendes R, Oliveira T, Castelli M, Cruz-Jesus F.

A machine learning approximation of the 2015

Portuguese high school student grades:a hybrid

approach. Education and Information Technologies.

2021; 26(2):1527-47.

[16] Baars GJ, Stijnen T, Splinter TA. A model to predict

student failure in the first year of the undergraduate

medical curriculum. Health Professions Education.

2017; 3(1):5-14.

[17] Youssef M, Mohammed S, Hamada EK, Wafaa BF. A

predictive approach based on efficient feature selection

and learning algorithms’ competition: Case of learners’

dropout in MOOCs. Education and Information

Technologies. 2019; 24(6):3591-618.

[18] Kostopoulos G, Kotsiantis S, Verykios VS. A prognosis

of junior high school students’ performance based on

active learning methods. In international conference on

brain function assessment in learning 2017 (pp. 67-76).

Springer, Cham.

[19] Moreno-Marcos PM, Pong TC, Munoz-Merino PJ,

Kloos CD. Analysis of the factors influencing learners’

performance prediction with learning analytics. IEEE

Access. 2020; 8:5264-82.

[20] Al-Obeidat F, Tubaishat A, Dillon A, Shah B.

Analyzing students’ performance using multi-criteria

classification. Cluster Computing. 2018; 21(1):623-32.

[21] Asif R, Merceron A, Ali SA, Haider NG. Analyzing

undergraduate students' performance using educational

data mining. Computers & Education. 2017; 113:177-

94.

[22] Yousafzai BK, Hayat M, Afzal S. Application of

machine learning and data mining in predicting the

performance of intermediate and secondary education

level student. Education and Information Technologies.

2020; 25(6):4677-97.

[23] Adekitan AI, Noma-Osaghae E. Data mining approach

to predicting the performance of first year student in a

university using the admission requirements. Education

and Information Technologies. 2019; 24(2):1527-43.

[24] Azcona D, Hsiao IH, Smeaton AF. Detecting students-

at-risk in computer programming classes with learning

analytics from students’ digital footprints. User

Modeling and User-Adapted Interaction. 2019;

29(4):759-88.

[25] Akçapınar G, Hasnine MN, Majumdar R, Flanagan B,

Ogata H. Developing an early-warning system for

spotting at-risk students by using eBook interaction

logs. Smart Learning Environments. 2019; 6(1):1-15.

[26] Hussain S, Dahan NA, Ba-Alwib FM, Ribata N.

Educational data mining and analysis of students’

academic performance using WEKA. Indonesian

Journal of Electrical Engineering and Computer

Science. 2018; 9(2):447-59.

[27] Adekitan AI, Salau O. The impact of engineering

students' performance in the first three years on their

graduation result using educational data mining.

Heliyon. 2019; 5(2):e01250.

[28] Marbouti F, Diefes-Dux HA, Madhavan K. Models for

early prediction of at-risk students in a course using

standards-based grading. Computers & Education.

2016; 103:1-5.

[29] Altujjar Y, Altamimi W, Al-Turaiki I, Al-Razgan M.

Predicting critical courses affecting students

performance: a case study. Procedia Computer Science.

2016; 82:65-71.

[30] Zhou Q, Quan W, Zhong Y, Xiao W, Mou C, Wang Y.

Predicting high-risk students using Internet access logs.

Knowledge and Information Systems. 2018; 55(2):393-

413.

Hasnah Nawang et al.

1452

[31] Aydoğdu Ş. Predicting student final performance using

artificial neural networks in online learning

environments. Education and Information

Technologies. 2020; 25(3):1913-27.

[32] Karlos S, Kostopoulos G, Kotsiantis S. Predicting and

interpreting students’ grades in distance higher

education through a semi-regression method. Applied

Sciences. 2020; 10(23):8413.

[33] Iqbal MS, Luo B. Prediction of educational institution

using predictive analytic techniques. Education and

Information Technologies. 2019; 24(2):1469-83.

[34] Zohair LM. Prediction of student’s performance by

modelling small dataset size. International Journal of

Educational Technology in Higher Education. 2019;

16(1):1-8.

[35] Ma X, Zhou Z. Student pass rates prediction using

optimized support vector machine and decision tree. In

8th annual computing and communication workshop

and conference (CCWC) 2018 (pp. 209-15). IEEE.

[36] Hashim AS, Awadh WA, Hamoud AK. Student

performance prediction model based on supervised

machine learning algorithms. In IOP conference series:

materials science and engineering 2020 (pp. 1-19). IOP

Publishing.

[37] Hamsa H, Indiradevi S, Kizhakkethottam JJ. Student

academic performance prediction model using decision

tree and fuzzy genetic algorithm. Procedia Technology.

2016; 25:326-32.

[38] Pandey M, Taruna S. Towards the integration of

multiple classifier pertaining to the Student's

performance prediction. Perspectives in Science. 2016;

8:364-6.

[39] Badr G, Algobail A, Almutairi H, Almutery M.

Predicting students’ performance in university courses:

a case study and tool in KSU mathematics department.

Procedia Computer Science. 2016; 82:80-9.

[40] Akçapınar G, Altun A, Aşkar P. Using learning

analytics to develop early-warning system for at-risk

students. International Journal of Educational

Technology in Higher Education. 2019; 16(1):1-20.

[41] Hussain M, Zhu W, Zhang W, Abidi SM, Ali S. Using

machine learning to predict student difficulties from

learning session data. Artificial Intelligence Review.

2019; 52(1):381-407.

[42] Zheng G, Fancsali SE, Ritter S, Berman S. Using

instruction-embedded formative assessment to predict

state summative test scores and achievement levels in

mathematics. Journal of Learning Analytics. 2019;

6(2):153-74.

[43] Rovira S, Puertas E, Igual L. Data-driven system to

predict academic grades and dropout. PLoS One. 2017;

12(2):e0171207.

[44] Rodríguez-Muñiz LJ, Bernardo AB, Esteban M, Díaz I.

Dropout and transfer paths: What are the risky profiles

when analyzing university persistence with machine

learning techniques?. Plos One. 2019; 14(6):1-21.

[45] Francis BK, Babu SS. Predicting academic

performance of students using a hybrid data mining

approach. Journal of Medical Systems. 2019; 43(6):1-

[46] Aiken JM, De Bin R, Hjorth-Jensen M, Caballero MD.

Predicting time to graduation at a large enrollment

American university. Plos One. 2020;

15(11):e0242334.

[47] Hussain M, Zhu W, Zhang W, Abidi SM. Student

engagement predictions in an e-learning system and

their impact on student course assessment scores.

Computational Intelligence and Neuroscience. 2018.

[48] Czibula G, Mihai A, Crivei LM. S PRAR: a novel

relational association rule mining classification model

applied for academic performance prediction. Procedia

Computer Science. 2019; 159:20-9.

[49] Matzavela V, Alepis E. Decision tree learning through

a predictive model for student academic performance in

intelligent M-learning environments. Computers and

Education: Artificial Intelligence. 2021; 2:100035.

[50] Viloria A, López JR, Leyva DM, Vargas-Mercado C,

Hernández-Palma H, Llinas NO, et al. Data mining

techniques and multivariate analysis to discover

patterns in university final researches. Procedia

Computer Science. 2019; 155:581-6.

[51] Deng H, Wang X, Guo Z, Decker A, Duan X, Wang C,

et al. Performancevis: visual analytics of student

performance data from an introductory chemistry

course. Visual Informatics. 2019; 3(4):166-76.

[52] Çetinkaya A, Baykan ÖK. Prediction of middle school

students' programming talent using artificial neural

networks. Engineering Science and Technology, an

International Journal. 2020; 23(6):1301-7.

[53] Mokhairi M, Nawang H, Wan SN. Analysis on students

performance using naïve. J. Theor. Appl. Inf. Technol.

2017; 31(16):3993-4000.

[54] Hu H, Zhang G, Gao W, Wang M. Big data analytics

for MOOC video watching behavior based on Spark.

Neural Computing and Applications. 2020;

32(11):6481-9.

[55] Slater S, Joksimović S, Kovanovic V, Baker RS,

Gasevic D. Tools for educational data mining: a review.

Journal of Educational and Behavioral Statistics. 2017;

42(1):85-106.

[56] Breiman L. Random forests. Machine Learning. 2001;

45(1):5-32.

[57] YUSUF A. Prediction of students’performance in E-

learning environment using random forest (Doctoral

dissertation, Universiti Teknologi Malaysia).

[58] Noble WS. What is a support vector machine?. Nature

biotechnology. 2006; 24(12):1565-7.

[59] Gupta N. Artificial neural network. Network and

Complex Systems. 2013; 3(1):24-8.

[60] Hamoud A, Hashim AS, Awadh WA. Predicting

student performance in higher education institutions

using decision tree analysis. International Journal of

Interactive Multimedia and Artificial Intelligence.

2018; 5:26-31.

[61] Zulfiker MS, Kabir N, Biswas AA, Chakraborty P,

Rahman MM. Predicting students’ performance of the

private universities of Bangladesh using machine

learning approaches. International Journal of Advanced

Computer Science and Applications. 2020; 11(3):672-

International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)

1453

[62] Sivakumar S, Venkataraman S, Selvaraj R. Predictive

modeling of student dropout indicators in educational

data mining using improved decision tree. Indian

Journal of Science and Technology. 2016; 9(4):1-5.

[63] Rish I. An empirical study of the naive Bayes classifier.

In IJCAI workshop on empirical methods in artificial

intelligence 2001 (pp. 41-6).

[64] Sokkhey P, Okazaki T. Hybrid machine learning

algorithms for predicting academic performance.

International Journal of Advanced Computer Science

and Applications. 2020; 11(1):32-41.

[65] Dole L, Rajurkar J. A Decision support system for

predicting student performance. International Journal

of Innovative Research in Computer and

Communication Engineering. 2014; 2(12):7232-7.

[66] Mohamad M, Makhtar M, Abd Rahman MN. The

reconstructed heterogeneity to enhance ensemble

neural network for large data. In international

conference on soft computing and data mining 2016

(pp. 447-55). Springer, Cham.

Hasnah Nawang earned a BSc in

Computer Science from Universiti Putra

Malaysia in 2006 and an MSc in

Computer Science from Universiti

Sultan Zainal Abidin in Terengganu,

Malaysia, in 2018. She is currently a

PhD candidate at the Department of

Computer Science in the Faculty of

Computing and Informatics at Universiti Sultan Zainal

Abidin in Terengganu, Malaysia. Since 2007, she has also

worked as a secondary school teacher in the Department of

Mathematics and Computer Science. Machine Learning,

Data Mining, and Deep Learning are among her current

research interests.

Email: hasnah.nawang@gmail.com

Dr. Mokhairi Makhtar received his

PhD from University of Bradford,

United Kingdom in 2012. He is

currently a Professor in the Department

of Computer Science, Universiti Sultan

Zainal Abidin, Terengganu, Malaysia.

His current research interests include

Machine Learning, Ensemble Method,

Data Mining, Soft Computing, Timetabling and

Optimisation, Natural Languange Processing, E-Learning

and Deep Learning.

Email: mokhairi@unisza.edu.my

Dr. Wan Mohd Amir Fazamin Wan

Hamzah received his PhD from

Universiti Malaysia Terengganu,

Malaysia. He is currently a lecturer in

Universiti Sultan Zainal Abidin,

Terengganu, Malaysia. His research

interests include Learning Analytics,

Gamification, e-Learning and Cloud

Computing.

Email: amirfazamin@unisza.edu.my

Appendix I

S. No.

Abbreviations

Descriptions

ANN

Artificial Neural Network

ARM

Association Rule Mining

Data Mining

Decision Tree

Fuzzy Network

GBoost

Gradient Boosting

ID3

Iterative Dichotomiser 3

J48

J48 algorithm

KNN

Nearest Neighbor

Learning Analytics

LNR

Linear Regression

Logistic Regression

Naïve Bayes

Machine Learning

Random Forest

SVM

Support Vector Machine

Authos Photo

Authors Photo

Auth’s Photo

Comparative analysis of classification algorithm evaluations to predict secondary school students’ achievement in core and elective subjects

Article

Full-text available

Apr 2022

Many researchers in educational data mining (EDM) have explored various machine learning techniques in order to predict students’ performance. However, the most daunting challenge in classification modelling is selecting the most effective algorithm with the highest accuracy. A study was conducted using datasets from two Malaysian premier secondary schools, Maktab Rendah Sains Mara (MRSM) Kuala Berang and Kuala Terengganu. The purpose of this study is to respond to two key questions; the first is to examine which algorithm is the best in predicting secondary students’ achievement in core and elective subjects, while the second is to study whether the same features and algorithms are capable of predicting academic performance based on students’ first semester achievement. To do so, this study analysed the effectiveness of six different classification algorithms, which are naïve Bayes (NB), random forest (RF), k-nearest neighbour (kNN), support vector machine (SVM), sequential minimal optimization (SMO), and logistic regression (LGR). Each model’s prediction accuracy was evaluated using 10-fold cross validation in order to identify the best model. The results showed that the RF model outperformed other models in terms of accuracy, precision, recall, and F1-Measure. With most algorithms achieving significant accuracy levels for both core and elective subjects’ dataset. It is concluded that the prediction of secondary school students' achievement can begin as early as the first semester using RF for core and elective subjects with biology dataset. The accuracy obtained was 96.7% and 97.5%, respectively for the core and elective subjects.

Comparative analysis of classification algorithm evaluations to predict secondary school students’ achievement in core and elective subjects

Article

May 2022

Hasnah Nawang

Many researchers in educational data mining (EDM) have explored various machine learning techniques in order to predict students’ performance. However, the most daunting challenge in classification modelling is selecting the most effective algorithm with the highest accuracy. A study was conducted using datasets from two Malaysian premier secondary schools, Maktab Rendah Sains Mara (MRSM) Kuala Berang and Kuala Terengganu. The purpose of this study is to respond to two key questions; the first is to examine which algorithm is the best in predicting secondary students’ achievement in core and elective subjects, while the second is to study whether the same features and algorithms are capable of predicting academic performance based on students’ first semester achievement. To do so, this study analysed the effectiveness of six different classification algorithms, which are naïve Bayes (NB), random forest (RF), knearest neighbour (kNN), support vector machine (SVM), sequential minimal optimization (SMO), and logistic regression (LGR). Each model’s prediction accuracy was evaluated using 10-fold cross validation in order to identify the best model. The results showed that the RF model outperformed other models in terms of accuracy, precision, recall, and F1- Measure. With most algorithms achieving significant accuracy levels for both core and elective subjects’ dataset. It is concluded that the prediction of secondary school students' achievement can begin as early as the first semester using RF for core and elective subjects with biology dataset. The accuracy obtained was 96.7% and 97.5%, respectively for the core and elective subjects.

Predicting the impact of internet usage on students’ academic performance using machine learning techniques in Bangladesh perspective

Article

Full-text available

Mar 2024

Education systems have significantly changed with the emergence of the internet. It has a significant impact on how students learn things. Nevertheless, its impact can also be contradicting. Internet addiction can slowly poison the minds of our youths and stand in the way of pursuing their goals. Although Bangladesh has internet connectivity across the country, its potential could be more utilized, particularly in the educational sector. Therefore, proper analysis of the effects of the internet on students, as well as determining the prominent factors relevant to the internet, is a necessary task. In addition, predicting students' academic performance can help determine the changes that must be incorporated to improve the educational system. Hence, this research analyzes the effects of internet usage on students' academic progress and then predicts the students' performance using distinct machine learning (ML) algorithms. The data were collected through an offline survey from Noakhali, Bangladesh. The collected data is preprocessed to select the most relevant features. The preprocessed data were fed into ML algorithms to investigate their behaviors. We have employed logistic regression, decision tree, random forest, and Naïve Bayes algorithms to see their classification performance on our dataset. To minimize the overfitting issue, k-fold cross-validation and hyperparameter optimization have been applied. The results were presented in two parts—exploratory data analysis and classification. Exploratory data analysis shows that the main purpose of internet usage is education and entertainment for school students, social media and entertainment for college students, and education and social media for university students. School and university students browse the internet mainly for academic purposes, whereas college students browse mainly for non-academic purposes. Students prefer to browse the internet at night. For all schools, colleges, and universities, students with better results generally visited websites like Google and YouTube. Students with moderate or bad results generally spent time on social media platforms (mainly Facebook and WhatsApp). Then, the results of the numerical analysis performed with classification algorithms are presented. Results indicate that random forest gives the maximum score in our dataset in all sectors, like accuracy, precision, recall, and f1 score. It gives a maximum of 85% accuracy on the test set. Logistic regression gives the second-best score of 69%. The practical applications and policy recommendations for Bangladesh's education sector are also discussed. The output of this work can contribute to building a policy on internet usage. In this way, it is possible to make the students more concentrative on their education and learning.

Deep Learning Techniques for Predicting Student’s Academic Performance on Virtual Learning Environments: A Review

Preprint

Full-text available

Jan 2024

With advances in Artificial Intelligence (AI) and the increasing volume of online educational data, Deep Learning techniques have played a critical role in predicting student’s academic performance. Recent developments have assisted instructors in determining the strengths and weaknesses of student’s academic achievement. This understanding will benefit from adopting the necessary interventions to assist students in improving their performance, helping at-risk students, and preventing dropout rates. In this review, 46 studies between 2019 and 2023 that apply one or more Deep Learning (DL) techniques, either alone or in combination with Machine Learning (ML) or Ensemble Learning techniques, have been reviewed and analyzed. Moreover, the review has examined utilizing datasets from public (MOOCs), private (LMSs), and other platforms. Four categories were used to group the features: demographic, previous and current academic performance, and learning behavior/activity features. Therefore, the analysis has demonstrated that the DNNs and CNN-LSTM models were the most commonly used techniques. Moreover, it has been noted that the studies that have used DL techniques such as CNNs, DNNs, and LSTMs, performed well by achieving high prediction accuracy above 90%; other studies achieved accuracy ranging between 60% and 90%. For datasets used within the reviewed studies, even though 44% of the studies have used LMSs datasets, it is found that OULAD was the most used dataset from MOOCs. For the grouped features, the results of the analysis indicate that the best features for prediction are the learning behavior and activity features, which outperform other feature categories.

How Machine Learning (ML) is Transforming Higher Education: A Systematic Literature Review

Article

Full-text available

Apr 2023

In the last decade, artificial intelligence (AI), machine learning (ML) and learning data analytics have been introduced with great effect in the field of higher education. However, despite the potential benefits for higher education institutions (HIE´s) of these emerging technologies, most of them are still in the early stages of adoption of these technologies. Thus, a systematic literature review (SLR) on the literature published over the last 5 years on potential applications of machine learning in higher education is necessary. Following the PRISMA guidelines, out of the 1887 initially identified SCOPUS-indexed publications on the topic, 171 articles were selected for review. To screen the abstracts and titles of each citation, Rayyan QCRI was used. VOSViewer, a software tool for constructing and visualizing bibliometric networks, and Microsoft Excel were used to generate charts and figures. The findings show that the most widely researched application of ML in higher education is related to the prediction of academic performance and employability of students. The implications will be invaluable for researchers and practitioners to explore how ML and AI technologies ,in the era of ChatGPT, can be used in universities without jeopardizing academic integrity.

Deep Biased Matrix Factorization for Student Performance Prediction

Article

Full-text available

Apr 2023

In universities that use the academic credit system, selecting elective courses is a crucial task that can have a significant impact on a student's academic performance. Students who perform poorly in their courses may receive formal warnings or even face expulsion from the university. Thus, a well-designed study plan from a course recommendation system can play an essential role in achieving good academic performance. Additionally, early warnings regarding challenging courses can help students better prepare and improve their chances of success. Therefore, predicting student performance is a vital component of both the course recommendation system and the academic advisor's role. To this end, numerous studies have addressed the prediction of student performance using various approaches such as association rules, machine learning, and recommender systems. More recently, personalized machine learning approaches, particularly the matrix factorization technique, have been used in the course recommendation system. However, the accuracy of these approaches in predicting student performance still needs improvement. To address this issue, this study proposes an approach called Deep Biased Matrix Factorization, which carries out deep factorization via multi-layer to enhance prediction accuracy. Experimental results on an educational dataset have demonstrated that the proposed approach can significantly improve the accuracy of student performance prediction. By using this approach, universities can better recommend elective courses to their students as well as predict student performance, which can help them make informed decisions and achieve better academic outcomes.

The power of Deep Learning techniques for predicting student performance in Virtual Learning Environments: A systematic literature review

Article

Jun 2024

Predicting Student Grade Point Average: Comparison of Machine Learning Regression Algorithms

Conference Paper

Dec 2023

Student Course Grade Prediction Using the Random Forest Algorithm: Analysis of Predictors' Importance

Article

Sep 2023

A systematic review of the literature on machine learning application of determining the attributes influencing academic performance

Article

Full-text available

Mar 2023

Machine learning (ML) is one way that can help decipher the intricate relationship between students' data and their performance. When implemented correctly in learning environments, machine learning will improve our knowledge of fundamental processes by simplifying the identification, extraction, and evaluation of underlying factors that affect student learning and levels of achievement. This study employed the experimental research approach using binary classification techniques based on the six-step knowledge discovery process (KDP) model. Five classifiers were used within the Rapid Miner's 9.10.010 educational environment as both experimental and analytical tool. The dataset comprised of 2334 records, 17 attributes with one class variable (students’ semester average score) inclusive. Twenty different tests were conducted. The experiments' results were evaluated using 10-fold cross-validation and ratio split validation with bootstrap sampling. The Random Forest algorithm (RF), Rule Induction methods (RI), Naive Bayes (NB), Logistic Regression (LR) and Deep Learning (DL) algorithms were used in the experiment. The experimental results demonstrated that the RF method outperforms the other four techniques in all six-evaluation metrics that were employed for the selection process with the accuracy being 93.96%. According to the RF classifier model, the mother's and father's education levels are two recognized demographic factors per this study that significantly influence pre-tertiary students’ academic achievement. This study has significantly reduced the gap in practical knowledge observed in the literature by introducing an intervention scheme for respective student's requiring intensive or minimal academic interventions in its prediction procedure.

Decision tree learning through a Predictive Model for Student Academic Performance in Intelligent M-Learning environments

Article

Full-text available

Oct 2021
COMPUT EDUC

In the area of machine learning and data science, decision tree learning is considered as one of the most popular classification techniques. Therefore, a decision tree algorithm generates a classification and predictive model, which is simple to understand and interpret, easy to display graphically, and capable to handle both numerical and categorical data. The intelligent m-learning systems, enjoy recently an explosive growth of interest, for more effective education and adaptive learning tailored to each student's learning abilities. The goal of this paper is to further improve personalization in student academic performance, that includes dynamic tests with a predictive model. One major objective of this research is to create adaptive dynamic tests for assessing student academic performance, while constantly comparing the results of the assessment which exhibits the individual student profile, with the results of the decision tree's algorithm which formulates a predictive model for students' knowledge level, according to the weights of the decision tree.

Predicting and Interpreting Students’ Grades in Distance Higher Education through a Semi-Regression Method

Article

Full-text available

Nov 2020

Multi-view learning is a machine learning app0roach aiming to exploit the knowledge retrieved from data, represented by multiple feature subsets known as views. Co-training is considered the most representative form of multi-view learning, a very effective semi-supervised classification algorithm for building highly accurate and robust predictive models. Even though it has been implemented in various scientific fields, it has not adequately used in educational data mining and learning analytics, since the hypothesis about the existence of two feature views cannot be easily implemented. Some notable studies have emerged recently dealing with semi-supervised classification tasks, such as student performance or student dropout prediction, while semi-supervised regression is uncharted territory. Therefore, the present study attempts to implement a semi-regression algorithm for predicting the grades of undergraduate students in the final exams of a one-year online course, which exploits three independent and naturally formed feature views, since they are derived from different sources. Moreover, we examine a well-established framework for interpreting the acquired results regarding their contribution to the final outcome per student/instance. To this purpose, a plethora of experiments is conducted based on data offered by the Hellenic Open University and representative machine learning algorithms. The experimental results demonstrate that the early prognosis of students at risk of failure can be accurately achieved compared to supervised models, even for a small amount of initially collected data from the first two semesters. The robustness of the applying semi-supervised regression scheme along with supervised learners and the investigation of features’ reasoning could highly benefit the educational domain.

Predicting time to graduation at a large enrollment American university

Article

Full-text available

Nov 2020
PLOS ONE

The time it takes a student to graduate with a university degree is mitigated by a variety of factors such as their background, the academic performance at university, and their integration into the social communities of the university they attend. Different universities have different populations, student services, instruction styles, and degree programs, however, they all collect institutional data. This study presents data for 160,933 students attending a large American research university. The data includes performance, enrollment, demographics, and preparation features. Discrete time hazard models for the time-to-graduation are presented in the context of Tinto’s Theory of Drop Out. Additionally, a novel machine learning method: gradient boosted trees, is applied and compared to the typical maximum likelihood method. We demonstrate that enrollment factors (such as changing a major) lead to greater increases in model predictive performance of when a student graduates than performance factors (such as grades) or preparation (such as high school GPA).

A data-driven approach to predict first-year students' academic success in higher education institutions

Article

Full-text available

Mar 2021
Educ Inform Tech

This study presents a data mining approach to predict academic success of the first-year students. A dataset of 10 academic years for first-year bachelor's degrees from a Portuguese Higher Institution (N = 9652) has been analysed. Features' selection resulted in a characterising set of 68 features, encompassing socio-demographic, social origin, previous education, special statutes and educational path dimensions. We proposed and tested three distinct course stage data models based on entrance date, end of the first and second curricular semesters. A support vector machines (SVM) model achieved the best overall performance and was selected to conduct a data-based sensitivity analysis. The previous evaluation performance, study gaps and age-related features play a major role in explaining failures at entrance stage. For subsequent stages, current evaluation performance features unveil their predictive power. Suggested guidelines include to provide study support groups to risk profiles and to create monitoring frameworks. From a practical standpoint, a data-driven decision-making framework based on these models can be used to promote academic success.

A machine learning approximation of the 2015 Portuguese high school student grades: A hybrid approach

Article

Full-text available

Mar 2021
Educ Inform Tech

This article uses an anonymous 2014–15 school year dataset from the Directorate-General for Statistics of Education and Science (DGEEC) of the Portuguese Ministry of Education as a means to carry out a predictive power comparison between the classic multilinear regression model and a chosen set of machine learning algorithms. A multilinear regression model is used in parallel with random forest, support vector machine, artificial neural network and extreme gradient boosting machine stacking ensemble implementations. Designing a hybrid analysis is intended where classical statistical analysis and artificial intelligence algorithms are blended to augment the ability to retain valuable conclusions and well-supported results. The machine learning algorithms attain a higher level of predictive ability. In addition, the stacking appropriateness increases as the base learner output correlation matrix determinant increases and the random forest feature importance empirical distributions are correlated with the structure of p-values and the statistical significance test ascertains of the multiple linear model. An information system that supports the nationwide education system should be designed and further structured to collect meaningful and precise data about the full range of academic achievement antecedents. The article concludes that no evidence is found in favour of smaller classes.

Prediction of middle school students' programming talent using artificial neural networks

Article

Full-text available

Aug 2020

Nowadays, the softwarization and virtualization of resources and services rapidly continue, and along with reading and writing, programming is going to be one of the basic human ability. Thus, the detection of skilled programmers at an early age has become important for economies to strengthen their workforce and compete globally. The current technological momentum shows that when the middle school students of today reach the 2030s, the demand for advanced programming skills will be rapidly increased, expanding as high as 90% between 2016 and 2030. Thus, the identification of these skilled people at an early age is important. Accordingly, this study focused on predicting middle school students’ programming aptitude using artificial neural network (ANN) algorithms. A participant survey was developed and applied to middle school students consisting of fifth, sixth, and seventh graders from Konya Science Center, Turkey. After the completion of the survey, the participants then took the 20-level Classic Maze course (CMC) on Code.org. The participants’ final scores in the CMC were calculated based on the level they completed and the lines of codes they wrote. The best results were obtained using the Bayesian regularization algorithm: Training-R = 9.72284e−1; Test-R = 9.12687e−1, and All-R = 9.597e−1. The results show that ANN is an appropriate machine learning method that can forecast participants’ skills, such as analytical thinking, problem-solving, and programming aptitude.

A comprehensive approach to learning analytics in Bulgarian school education

Article

Full-text available

Jan 2021
Educ Inform Tech

Many educational institutions use a large number of information systems to automate their activities for different stakeholders’ groups – learning management systems, student diary, library system, digital repositories, management system, etc. This leads to a significant increase in the volume and variety of data that can be captured, stored, and harnessed to improve student learning and school effectiveness. Educational institutions are realising that, with the help of technology, they are collecting data that could be very valuable when properly analysed, aligned with learning outcomes, and integrated into a tighter feedback loop with stakeholders. In addition, the analysis of data can help their managers to take data-driven decision making at all levels of educational institutions. The paper presents a comprehensive approach to Learning Analytics in the field of Secondary Education from the perspective of all different stakeholders, which aims to improve its methods of approaching and analysing learning data. On the basis of a literature review in the field and an investigation of requirements for quality evaluation of learning in school education, the corresponding stakeholder groups are identified - students, teachers, class teachers, managers, parents, inspectors and 6 models (1 model per each stakeholder group) for data collection and personalized and meaningful analysis are proposed for the needs of Learning Analytics. Each model consists of measurable indicators allowing the relevant stakeholder to track data for students’ learning or training for different purposes, e.g. monitoring, analysis, forecast, intervention, recommendations, etc., but finally to improve the quality of learning and teaching processes. The proposed models are evaluated by the representatives of 4 stakeholder groups – students, teachers, class teachers, parents.

Application of machine learning and data mining in predicting the performance of intermediate and secondary education level student

Article

Full-text available

Nov 2020
Educ Inform Tech

The presented work is a student marks and grade prediction system using supervised machine learning techniques, the system is developed on the historic performance of students. The data used in this research is collected from Federal Board of Intermediate and Secondary Education Islamabad Pakistan, there are 7 regions in FBISE i.e. Punjab, Sindh, Khyber Pakhtunkhwa, Balochistan, Azad Jammu and Kashmir and overseas. The aims of this work is to analyze the education quality which is closely tightened with the sustainable development goals. The implementation of the system has produced an excess of data which must be processed suitably to gain more valuable information that can be more useful for future development and planning. Student marks and grade prediction from their historic academic data is a popular and useful application in educational data mining, so it is becoming a valuable source of information which can be used in different manners to improve the education quality in the country. Related work shows that several method for academic grade prediction are developed for the betterment of teaching and administrative staff of an educational organizational system. In our proposed methodology, the obtained data is preprocessed to improve the quality of data, the labeled student historic data (29 optimal attributes) is used to train decision tree classifier and regression model. The classification system will predict the grade while the regression model will predict the marks, finally the results obtained from both the model are analyzed. The obtain results show the effectiveness and importance of machine learning technology in predicating the students performance.

Predicting Students' Performance of the Private Universities of Bangladesh using Machine Learning Approaches

Article

Full-text available

Mar 2020

Every year thousands of students get admitted into different universities in Bangladesh. Among them, a large number of students complete their graduation with low scoring results which affect their careers. By predicting their grades before the final examination, they can take essential measures to ameliorate their grades. This article has proposed different machine learning approaches for predicting the grade of a student in a course, in the context of the private universities of Bangladesh. Using different features that affect the result of a student, seven different classifiers have been trained, namely: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression, Decision Tree, AdaBoost, Multilayer Perceptron (MLP), and Extra Tree Classifier for classifying the students' final grades into four quality classes: Excellent, Good, Poor, and Fail. Afterwards, the outputs of the base classifiers have been aggregated using the weighted voting approach to attain better results. And here this study has achieved an accuracy of 81.73%, where the weighted voting classifier outperforms the base classifiers.

Hybrid Machine Learning Algorithms for Predicting Academic Performance

Article

Full-text available

Jan 2020

The large volume of data and its complexity in educational institutions require the sakes from informative technologies. In order to facilitate this task, many researchers have focused on using machine learning to extract knowledge from the education database to support students and instructors in getting better performance. In prediction models, the challenging task is to choose the effective techniques which could produce satisfying predictive accuracy. Hence, in this work, we introduced a hybrid approach of principal component analysis (PCA) as conjunction with four machines learning (ML) algorithms: random forest (RF), C5.0 of decision tree (DT), and naïve Bayes (NB) of Bayes network and support vector machine (SVM), to improve the performances of classification by solving the misclassification problem. Three datasets were used to confirm the robustness of the proposed models. Through the given datasets, we evaluated the classification accuracy and root mean square error (RSME) as evaluation metrics of the proposed models. In this classification problem, 10-fold cross-validation was proposed to evaluate the predictive performance. The proposed hybrid models produced very prediction results which shown itself as the optimal prediction and classification algorithms.

A systematic literature review on student performance predictions

Abstract and Figures

Recommended publications

Big data framework for students' academic performance prediction: A systematic literature review

Comparative analysis of classification algorithm evaluations to predict secondary school students’ a...

Adaptive Deep Learning Technique to Predict Student’s Graduation Results

Global challenges of students dropout: A prediction model development using machine learning algorit...

RHEM: A Robust Hybrid Ensemble Model for Students’ Performance Assessment on Cloud Computing Course