A systematic literature review on student performance predictions

Abstract

Prediction of student performance in educational institutions is a major topic of debate among researchers in efforts to improve teaching and learning. Effective prediction techniques and features would help educators and teachers design appropriate teaching content to help learners study according to predicted outcomes. The purpose of this paper is to present a systematic literature review on predictions of students’ performance in higher education institutions and secondary schools using Machine Learning, Educational Data Mining, and Learning Analytics methodologies. The review used in this study was designed to: i) provide an overview of techniques and algorithms used to predict students' performance; and ii) identify the features that have the greatest impact on students' performance. This paper also outlined several future insights in terms of applying hybrid techniques to educational datasets in order to improve accuracy in predicting students’ performance. © 2021, Accent Social and Welfare Society. All rights reserved.
International Journal of Advanced Technology and Engineering Exploration, Vol 8(84)
ISSN (Print): 2394-5443 ISSN (Online): 2394-7454
http://dx.doi.org/10.19101/IJATEE.2021.874521
Hasnah Nawang1*, Mokhairi Makhtar2 and Wan Mohd Amir Fazamin Wan Hamzah3
Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Terengganu, Malaysia1
School of Computer Science, Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Tembila
Campus, Terengganu, Malaysia2
School of Information Technology, Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Tembila
Campus, Terengganu, Malaysia3
Received: 31-July-2021; Revised: 01-November-2021; Accepted: 06-November-2021
©2021 Hasnah Nawang et al. This is an open access article distributed under the Creative Commons Attribution (CC BY) License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Nowadays, many high schools and higher education
institutions generate a large amount of student
information through learning management systems,
examination data, students’ activities, library systems,
etc. [1]. This leads to increases in the volume
and types of educational data in every institution.
Machine learning (ML), learning analytics (LA), and
data mining (DM) approaches have been widely used
on educational data to predict students’ performance.
These approaches have shown that several techniques
and algorithms are useful for understanding this
domain, which is difficult to analyse by human
effort alone.
Students’ success has become an important metric for
higher education institutions as well as secondary
schools. In higher education institutions, students’
performance plays a vital role in determining their job
success [2]. Good academic performance assures
employers of a candidate’s quality and reliability.
*Author for correspondence
In order to build and apply a predictive model, features
that correlate with the value to be predicted must be
collected and processed. There are many features
affecting students’ performance, and they can be
divided into a few groups, such as students’ previous
education, students’ e-learning activity, demographics
features [3], students’ social network information [4],
behaviour variables [5], external assessment, extra-
curricular activities and academic performance [6],
school design [7], parental involvement, etc. [8]. As
numerous features and approaches are
utilised to forecast students' performance, this study
provides a thorough evaluation of student performance
predictions for high/secondary schools and
higher-level institutions in terms of the most influential
attributes and the methods regularly employed by
researchers.
Keywords
Educational data mining, Machine learning, Learning analytics, Students' performance prediction.
Whether in university/college or secondary school,
students’ performance prediction has drawn much
attention not only amongst educators but in the machine
learning community as well. Predicting students’
performance has become a challenging task due to the
increased amount of data contained in educational
systems [9, 10]. Students' interest in learning
varies. Some may be able to learn independently using
e-learning, while others may need to study with the
assistance of educators who prepare their teaching
materials based on the needs of the students. Still
others may perform adequately in their examinations
with the full facilities provided in schools or
institutions, and families with a higher income
may be able to fund extra classes from outside
sources for their children's educational needs. This
variety has motivated the authors to
identify which features contribute most to learning
outcomes and which machine
learning techniques are suitable, so as to assist
institutional administrations in offering
the best educational environment to their students.
Educators, on the other hand, will be able to construct
instructional content based on students' acceptance
and guide students' learning progress.
In recent years, despite advances in the
application of technologies to forecasting students’
performance, there are still gaps to be filled in
identifying the most commonly used features and
techniques. Although other review papers on student
performance exist, few of them emphasise the
importance of combining multiple groups of features
and techniques in order to enhance prediction
accuracy. Previous literature reviews on students’
performance [2, 11] discussed this topic in general
and did not emphasise the impact of hybrid
approaches that can help in optimising
accuracy. Thus, we aim to comprehensively map,
analyse and review the articles
published from 2016 to 2020.
2. Research methodology
This review followed the
recommended procedures for performing a systematic
review provided by Kitchenham et al. [12]. The
suggested procedure for a review is divided into three
stages, which are:
1. Planning the review
2. Conducting the review
3. Reporting the review
2.1 Planning the review
This systematic review was conducted because of the need
to summarise the latest five years of research on
student performance prediction. As the first phase of
the review is planning, it is necessary to identify
the importance of this review. This systematic review
was proposed to support the objectives of this
study, which are:
1. To summarise existing methodologies used in
student performance prediction.
2. To identify the most common features used in
predicting students’ performance.
3. To identify the most common algorithms used to
predict students’ performance.
4. To identify the gaps in previous research.
After the objectives of the research were identified, the
most important activity in a systematic review
protocol, formulating the research questions,
was conducted. Based on the 2009 recommendations of
Kitchenham et al. [12], the construction of research
questions must consider three viewpoints of the study
criteria, which are population, intervention, and
outcomes. The details of each study
criterion are: 1) population: high school and higher
educational institutes; 2) intervention: methods,
algorithms, and techniques for prediction; and 3)
outcome: best features or variables used as
performance predictors and successful prediction
techniques or approaches. These study
criteria led to the following research
questions (RQ):
1. What are the features commonly used by
researchers to predict students’ performance?
2. Which methods are commonly used to
investigate students’ performance?
3. What are the best algorithms or techniques used in
student performance prediction?
2.2 Conducting the review
2.2.1 Dataset
To ensure that relevant papers were included in this
review, papers were collected from four databases,
namely IEEE Xplore, ScienceDirect, SpringerLink,
and EBSCOhost. The four databases are listed
in Table 1. The search was conducted in September
2021.
Table 1 Databases used to search for papers
| No. | Database name | No. of papers |
|-----|---------------|---------------|
| 1   | IEEE Xplore   | 10            |
| 2   | ScienceDirect | 60            |
| 3   | SpringerLink  | 50            |
| 4   | EBSCOhost     | 26            |
2.2.2 Search strategy
The search strategy was used to identify papers
relevant to the review. The following phrases were used
to ensure that the selected papers only
discussed machine learning, educational data mining,
and learning analytics approaches: machine
learning, educational data mining, learning analytics,
students’ performance analysis, predictive model,
students’ performance prediction, students’ academic
performance, school, high school, and higher education.
The search terms
were constructed by identifying the techniques of
information extraction, level of education, and
category of prediction. We used Boolean operators
such as AND, OR, and NOT in our search strings.
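As an illustration of how the phrases above might be combined with Boolean operators, the following Python sketch builds one query string. The exact search strings used in the review are not reported, so the grouping of terms into technique, prediction-category and education-level groups, and the helper names `or_group` and `build_query`, are assumptions made for illustration only.

```python
# Hypothetical sketch: the review does not report its exact search strings.
TECHNIQUE_TERMS = ["machine learning", "educational data mining", "learning analytics"]
PREDICTION_TERMS = [
    "students' performance analysis",
    "predictive model",
    "students' performance prediction",
    "students' academic performance",
]
LEVEL_TERMS = ["school", "high school", "higher education"]


def or_group(terms):
    """Quote each term, join the terms with OR, and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(groups, excluded=()):
    """AND the OR-groups together, then append NOT clauses for excluded terms."""
    query = " AND ".join(or_group(g) for g in groups)
    for term in excluded:
        query += f' NOT "{term}"'
    return query


query = build_query([TECHNIQUE_TERMS, PREDICTION_TERMS, LEVEL_TERMS])
print(query)
```

Each database has its own query syntax, so a string built this way would normally need adapting before use in a specific search interface.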
2.2.3 Selection criteria
In this review, the study selection criteria were
planned in order to identify the primary studies based
on the following requirements: 1) studies that used machine
learning, educational data mining or learning analytics
technique(s) for predictive modelling; 2) studies in
peer-reviewed journals or conference proceedings
written in English; and 3) studies that investigated the
prediction of students’ performance at higher
educational or high school levels. This systematic
review was performed on studies published from
2016 to 2020.
2.2.4 The inclusion and exclusion criteria
The main objective is to include as many articles as
possible from 2016 to 2020 that are related to
the research questions. Certain criteria were then set
to decide whether an article should be included
in the analysis.
Inclusion criteria
1. Research papers on students' performance
prediction.
2. Papers published from 2016 to 2020.
3. Papers written in English.
4. Papers that are not review papers.
Exclusion criteria
1. Studies not using ML or DM techniques.
2. Duplicate papers.
3. Papers with irrelevant titles, abstracts, and keywords.
4. Papers that are not peer-reviewed.
Figure 1 illustrates the PRISMA flowchart used
as a guide during the selection process. As discussed
in Section 2.2.1, four databases, which are IEEE
Xplore, EBSCOhost, ScienceDirect, and
SpringerLink, were used to search for papers during
this systematic review. Ten items were retrieved from
IEEE Xplore, 26 from EBSCOhost, 60
from ScienceDirect, and 50
from SpringerLink, making a total of 146 records.
Firstly, published papers that matched the search strings
were identified. Secondly, papers were selected based
on whether the title, abstract, and keywords were relevant to the
eligibility criteria. The final selection was done by
reading the full content of the papers. This study
selected 40 articles for the subsequent review process,
as shown in Figure 1. The first screening removed 40
records due to problems such as duplicate papers,
papers not written in English, and papers that were not
peer-reviewed, leaving a total of 106 records. The
second screening removed 30 papers due to irrelevant
titles, abstracts, and keywords, bringing the new
total to 76 records. Finally, the full-text reading
process excluded another 36 papers, leaving a total
of 40 papers for analysis. These
papers are tabulated by Paper Id and
author(s) in Table 2.
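The screening arithmetic described above can be checked with a short Python sketch. The counts are taken from the text; the variable names and the stage labels are illustrative assumptions.

```python
# Records retrieved per database, as reported in the text.
identified = {"IEEE Xplore": 10, "EBSCOhost": 26, "ScienceDirect": 60, "SpringerLink": 50}

# Records removed at each screening stage, as reported in the text.
exclusions = [
    ("first screening (duplicates, non-English, not peer-reviewed)", 40),
    ("second screening (irrelevant title, abstract, keywords)", 30),
    ("full-text reading", 36),
]

remaining = sum(identified.values())
flow = [("records identified", remaining)]
for stage, removed in exclusions:
    remaining -= removed  # records left after this stage
    flow.append((stage, remaining))

for stage, count in flow:
    print(f"{stage}: {count}")
```

Running the sketch reproduces the sequence of totals given in the text, from the records identified down to the papers retained for analysis.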
Figure 1 PRISMA flowchart for research extraction
and direction
Table 2 Selected papers for study review
| Paper Id | Paper Id | Author |
|----------|----------|--------|
| P1       | P21      | [33]   |
| P2       | P22      | [34]   |
| P3       | P23      | [35]   |
| P4       | P24      | [36]   |
| P5       | P25      | [37]   |
| P6       | P26      | [38]   |
| P7       | P27      | [39]   |
| P8       | P28      | [40]   |
| P9       | P29      | [41]   |
| P10      | P30      | [42]   |
| P11      | P31      | [43]   |
| P12      | P32      | [44]   |
| P13      | P33      | [45]   |
| P14      | P34      | [46]   |
| P15      | P35      | [47]   |
| P16      | P36      | [48]   |
| P17      | P37      | [49]   |
| P18      | P38      | [50]   |
| P19      | P39      | [51]   |
| P20      | P40      | [52]   |
3. Results
This section discusses the analysis of the
features, methods and algorithms used in the reviewed
studies. Two main factors contribute to
success in student performance prediction:
the features or attributes of educational data, and the
techniques or algorithms used to explore educational
data in order to make predictions and find patterns [53].
3.1 (RQ:i) What are the features commonly used by
researchers to predict students’ performance?
Different researchers have found different attributes
that contribute to prediction accuracy in their studies.
However, some attributes carry the same meaning but
are differently labelled, and those attributes can be
divided into a few groups, such as students’ previous
education, students’ e-learning activities,
demographic features, students’ social network
information, behaviour variables, extra-curricular
activities, school design, academic performance,
parental involvement, etc.
During the full text reading process, some features
were identified and grouped for this review. Six
groups of features were detected and consisted of
Demographic Features, E-Learning Features, Social
Network Features, School Design Features, Academic
Performance Features, and Previous Education
Features.
There are 40 primary studies included in this review.
As shown in Figure 2, the academic
performance feature group had the highest
recurrence among researchers predicting
students’ performance, at 87.5%,
followed by the previous education group with 47.5% and
demographic features with 45%.
The recurrence for social
network features is 10%, and the lowest is for school
design features at 5%. These
percentages show that most researchers find
academic performance features [11], such as GPA, to
be significant predictors of students’ performance.
Figure 3 shows the percentage of studies using one or
more feature categories, and Table 3 groups the features found
into the six categories.
Figure 4 illustrates the attributes used in the reviewed
studies, based on their feature groups.
Table 4 details the best attributes discovered
by the researchers in this review.
Using more than one category of features is
popular in predicting students’ performance. As shown
in Figure 3, 47% of the studies in this review used a
combination of two different groups of features,
followed by combinations of three feature
groups with 23%, and 10% of the studies used combinations of four
feature groups. Only one
feature group was used in
20% of the studies, and none of the studies in this review
used a combination of five or six feature groups. It can be
concluded that the importance of using more than one
category of features cannot be neglected, as 80 percent
of the studies used more than one feature
category in forecasting students’ performance,
whether in higher-level institutions or high schools.
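The percentages discussed above can be cross-checked with a small Python sketch. The recurrence percentages are the values plotted in Figure 2 and the combination shares are those given in the text; the implied study counts are an inference from 40 primary studies and are not reported directly in the review.

```python
TOTAL_STUDIES = 40

# Recurrence percentage per feature group, as plotted in Figure 2.
recurrence_pct = {"C1": 45.0, "C2": 27.5, "C3": 10.0, "C4": 5.0, "C5": 87.5, "C6": 47.5}

# Implied number of studies per group (inferred here, not reported in the review).
implied_counts = {g: round(p / 100 * TOTAL_STUDIES) for g, p in recurrence_pct.items()}

# Share of studies by number of feature groups combined, as given in the text.
combination_pct = {1: 20, 2: 47, 3: 23, 4: 10, 5: 0, 6: 0}

# Share of studies that combined more than one feature group.
multi_group_pct = sum(p for n, p in combination_pct.items() if n > 1)

print(implied_counts)
print(f"Studies using more than one feature group: {multi_group_pct}%")
```

Because a study can use several feature groups, the recurrence percentages are not expected to sum to 100, whereas the combination shares are.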
Table 3 Groups of features used in the studies
| Category Id | Features category description |
|-------------|-------------------------------|
| C1          | Demographic Features          |
| C2          | E-Learning Features           |
| C3          | Social Network Features       |
| C4          | School Design Features        |
| C5          | Academic Performance Features |
| C6          | Previous Education Features   |
As tabulated in Table 4, different attributes
have been identified as the best predictors of
students’ performance. The academic
performance features recorded the highest
frequency of best attributes, as illustrated in Figure 2.
However, the findings differ from a previous review
article [11], which stated that GPA is the most influential
attribute for student performance prediction: only
two studies [16, 43] out of the 18 studies in the
academic performance category reached that conclusion,
and the majority of the other studies found that
students' grades and scores in examinations, quizzes, and
tests are the most influential attributes for students'
performance. This is because academic performance is
mainly measured through GPA, grades and scores.
As this review also focuses on the high school and
secondary school level, the measurement of
students’ performance does not depend only on GPA.
The previous education category is the
second most influential, as 8 of the studies
found that Scholastic admission test scores, high school
marks and grades, and National Examination scores
have the greatest impact on students’
performance; its recurrence percentage is 47.5%.
Previous education features are
important in predicting students’
performance because they represent a
benchmark of students’ academic achievement [21]. They
are commonly defined as scores or grades obtained by
students at a past level of education, such as high
school or university admission scores, which aid in
understanding the consistency of students'
performance.
As for the e-learning features, a total of 11 studies used
them in combination with other categories of
features, and 7 of these studies
found that the most influential attributes are the
number of raised hands and logins, the submission status
of assignments and homework, announcement
views, and students’ participation in discussions or
forums. Furthermore, [19] identified
that the best variables among the e-learning features are those
related to exercises and homework rather than students’
participation in forums and discussions [17, 32]. For
demographic features, although the recurrence
percentage is quite high at 45
percent, the best attributes are not as promising
as those of the academic performance features. Only 3 studies
found that demographic features such as gender, caste,
father's and mother's education, father's and mother's
occupation, family income and family size have a
greater impact on the prediction [16, 22, 34]. Meanwhile,
the analysis in [49] found no correlation
between whether a student is a first child and
the outcome of the predictive
model.
Although not many studies emphasise
school design features, this review was able
to identify two studies that found the best
attributes in their research to be school size [3] and the
percentage of lecturer attendance [10]. Therefore, it is
important for other researchers to look beyond
common attributes such as academic performance, as
educators and school facilities also tend to
affect students’ performance. For the social
network features, attributes such
as the list of websites visited, visit duration, time spent on
online movies, and time spent on reading online were found
to contribute to students’ performance prediction
[24, 30, 47].
Figure 4 illustrates the features that have been used in
the reviewed studies. The features were divided into the
six groups discussed above. Some
features were redundant because the same feature carried
different labels; this study identified those redundant
features and relabelled them.
Figure 2 Most commonly used attributes in predicting students’ performance based on groups category
(x-axis: attribute groups' names; y-axis: recurrence percentage; values: C1 45, C2 27.5, C3 10, C4 5, C5 87.5, C6 47.5)
Table 4 Best attributes in predicting students’ performance
| Features category | Best attributes | Methods | Paper Id |
|---|---|---|---|
| Academic performance features | Students’ grade and score in examination, test, quizzes and assignment, GPA, internal assessment on courses and subjects, students’ attendance marks. | Classification | P4, P6, P8, P9, P11, P14, P15, P16, P1, P20, P25, P26, P28, P32, P40 |
| | | Classification and clustering | P19 |
| | | Regression | P26, P31 |
| | Major course change | Classification | P30 |
| Previous education features | National examination score, Scholastic admission test score, high school grades and score | Classification | P1, P2, P8, P10, P21, P23 |
| | | Regression | P2 |
| | | Clustering | P2, P34 |
| Demographic features | Gender, caste, father and mother education, father and mother occupation, family income | Classification | P4, P14, P22, P37 |
| E-learning features | Number of raise hand, number of logins to the online class, announce view and participation in forum and discussions, submission of exercises and homework, hours spent on material. | Classification | P5, P7, P12, P13, P24 |
| | Students’ course marks and grade, attendance marks. | Classification and clustering | P29 |
| | | Classification | P18 |
| Social network features | List of webs visited; visits duration, time spend on movies online, time spend on reading online. | Classification | P12, P18 |
| | | Clustering | P35 |
| School design features | School size | Regression | P3 |
| | Lecturer attendance percentage | Classification | P28 |
Figure 3 The percentage of studies using one or more number of features categories in their study
Figure 4 Some of features used in this study based on their groupings
3.2 (RQ:ii) Which methods are commonly used
to investigate students’ performance?
Figure 5 shows that researchers have applied
various techniques such as classification, regression
and clustering to predict students’ performance. The
goal of classification is to accurately predict the target
class for each case in the data [54], while regression
is used to identify the relationship between
dependent variables and independent variables [55].
Unlike classification and regression, clustering
is an unsupervised process that is used to
group objects into classes of similar objects.
Classification is the most commonly used
method in predicting student performance: this
review found that 30 out of 40 studies used
classification, whereas clustering and
regression were each used in only 5 studies. This is
attributed to the fact that classification, a form of
supervised learning, trains algorithms on labelled data
and is computationally less complex than
the other methods. The labelled data are used to
train the techniques, which learn over time and eventually
classify data or make predictions accurately.
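To make the distinction between the three method families concrete, the following Python sketch applies a minimal version of each to the same made-up toy data. The data, the nearest-centroid classifier, the least-squares line and the one-dimensional two-means clustering are illustrative choices and are not taken from any of the reviewed studies.

```python
from statistics import mean

# Made-up toy data: weekly study hours, final score, and pass/fail label.
hours = [2, 3, 4, 8, 9, 10]
scores = [40, 45, 50, 70, 75, 80]
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]


def fit_centroids(xs, ys):
    """Classification (supervised): one centroid per labelled class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in sorted(set(ys))}


def classify(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def fit_line(xs, ys):
    """Regression (supervised): ordinary least squares for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def two_means(xs, iterations=10):
    """Clustering (unsupervised): 1-D k-means with k = 2; no labels are used."""
    lo, hi = min(xs), max(xs)
    for _ in range(iterations):
        left = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        right = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return left, right


centroids = fit_centroids(hours, labels)
slope, intercept = fit_line(hours, scores)
clusters = two_means(hours)
print(classify(centroids, 7), slope, intercept, clusters)
```

The classifier and the regression both need a target column (labels or scores), whereas the clustering step only sees the study hours and recovers the two groups on its own.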
Figure 5 The number of students’ performance prediction methods in the review
3.3 (RQ:iii) What are the best algorithms or
techniques used in student performance
prediction?
The techniques used in the reviewed papers are
numerous, which implies that there are multiple
options for implementing prediction algorithms.
Furthermore, several models are often used in the
same paper to make comparisons in order to find the
model best suited to the dataset. Random Forest
(RF), Decision Tree (DT), Naïve Bayes (NB), Support
Vector Machine (SVM), Artificial Neural Network
(ANN), Logistic Regression (LR), and K-Nearest
Neighbour (KNN) are among the algorithms
frequently used by researchers to predict students’
performance. The results for each
algorithm or technique used to predict students’
performance are discussed in the following subsections.
3.3.1 Random forest (RF)
Random Forest (RF) is a supervised ensemble
machine learning approach for classification,
regression, and other tasks that operates by constructing
a number of decision trees at training time and
outputting the class that is the mode of
the classes predicted by the individual trees [56, 57]. This review
identified 18 out of the 40 studies that tested the
Random Forest algorithm on their dataset for
prediction. Of these 18, 6 studies showed the
Random Forest algorithm to have the highest accuracy,
beating other algorithms in the prediction of students
at risk [17, 22, 24, 25] or students’ dropout [43, 44].
Details of the feature categories and accuracy levels
achieved according to paper ID are shown in Table 5.
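The voting step described above, in which the forest outputs the mode of the classes predicted by its trees, can be sketched in Python as follows. The three "trees" are hand-written stand-ins with hypothetical features and thresholds; a real random forest would learn each tree from a bootstrap sample of the training data.

```python
from collections import Counter


# Hand-written stand-ins for trained trees (hypothetical features and thresholds).
def tree_exam(student):
    return "pass" if student["exam_score"] >= 50 else "at risk"


def tree_attendance(student):
    return "pass" if student["attendance"] >= 0.8 else "at risk"


def tree_logins(student):
    return "pass" if student["logins_per_week"] >= 3 else "at risk"


def forest_predict(trees, student):
    """Return the mode (most common class) of the individual tree predictions."""
    votes = Counter(tree(student) for tree in trees)
    return votes.most_common(1)[0][0]


TREES = [tree_exam, tree_attendance, tree_logins]  # odd number of trees avoids ties
student = {"exam_score": 62, "attendance": 0.7, "logins_per_week": 5}
prediction = forest_predict(TREES, student)
print(prediction)
```

Here two of the three stand-in trees vote for the same class, so that class becomes the forest's output even though one tree disagrees.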
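Table 5 Random forest details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P5       | C2                | 94%      |
| P12      | C1, C2, C6        | 71.6%    |
| P13      | C2, C5            | 84%      |
| P14      | C1, C5            | 99%      |
| P31      | C5                | 73%      |
| P32      | C1, C5            | 86.6%    |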
3.3.2 Support vector machine (SVM)
Social network analysts argue that causation is not
located in individuals, but in the social structure.
Social network analysis examines the structure and
composition of ties in a given network and provides
insights into its structural characteristics [58]. In
education, SVM algorithms have proven to be
helpful in monitoring students’ interactions and
participation in online courses. SVM has been
acknowledged to be among the most reliable and
accurate algorithms in most machine learning
applications [54]. SVM comes second in the
number of highest accuracy rates achieved in this
study. It gave the best accuracy in 5 of the 19 papers
that tested it, which predicted academic
success [13], academic performance [19, 34, 45, 50]
and students’ pass rate [35]. For details on the
accuracy rates achieved and feature categories used in
these studies, refer to Table 6.
Table 6 Support vector machine details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P1       | C1, C3, C5, C6    | 70.4%    |
| P7       | C5, C6            | -        |
| P22      | C1, C5, C6        | 76.3%    |
| P23      | C1, C3, C5, C6    | 92.6%    |
| P33      | C1, C5            | 75.4%    |
3.3.3 Artificial neural network (ANN)
Artificial Neural Network is a mathematical model
inspired by biological neural networks. A neural
network consists of an interconnected group of
artificial neurons, and it processes information using a
connectionist approach to computation. The network
usually learns the connection weights from available
training patterns. Performance is improved over time
by iteratively updating the weights in the network [59].
Of the 11 reviewed papers that used the
ANN algorithm, 3 used ANN to forecast
students’ performance [48, 51, 52] and to predict students’
course assessment, as tabulated in Table 7.
ANN was also reported to achieve the highest
accuracy in [41].
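The idea of learning connection weights by iterative updates can be illustrated with a single artificial neuron in Python. This is a minimal sketch under simplifying assumptions: one sigmoid neuron, per-sample gradient descent on the log-loss, and made-up binary inputs (assignment submitted, class attended) with a "pass" target only when both are 1. The networks used in the reviewed studies have more neurons and layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of one sigmoid neuron for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=5000, lr=0.5):
    """Learn the weights by repeatedly nudging them against the prediction error."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = predict(weights, bias, inputs) - target  # log-loss gradient term
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]
            bias -= lr * error
    return weights, bias


# Made-up toy patterns: (assignment submitted, class attended) -> passed.
SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_neuron(SAMPLES)
for inputs, target in SAMPLES:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

Each pass over the training patterns moves the weights a little further in the direction that reduces the error, which is the sense in which performance improves over time.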
Table 7 Artificial neural network details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P29      | C2                | 80.4%    |
| P36      | C5                | 75%      |
| P39      | C5, C6            | 76%      |
| P40      | C2                | 80%      |
3.3.4 Logistic regression (LR) and linear regression
(LNR)
Linear Regression predicts a continuous numeric
output from a linear combination of attributes, while
Logistic Regression predicts the odds of two or more
outcomes, allowing for categorical predictions [8].
Table 8 records 4 of the 11 papers that
used Logistic Regression techniques, together with the
accuracy achieved, while Table 9 shows the
accuracy and feature categories for the Linear
Regression algorithm.
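The difference between the two models can be shown with a short Python sketch for the two-outcome case. Both use the same linear combination of attributes; linear regression reports it directly, while logistic regression treats it as log-odds and converts it to a probability. The coefficients and feature values below are hypothetical and are not fitted to any dataset.

```python
import math


def linear_predict(weights, intercept, features):
    """Linear regression: continuous output from a linear combination of attributes."""
    return intercept + sum(w * x for w, x in zip(weights, features))


def logistic_predict(weights, intercept, features):
    """Logistic regression: the linear combination is read as log-odds."""
    log_odds = linear_predict(weights, intercept, features)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for (quiz average, attendance rate).
features = [3.0, 4.0]
score = linear_predict([2.0, 0.5], 10.0, features)          # a numeric score
probability = logistic_predict([0.4, 0.2], -1.5, features)  # a probability of passing
odds = probability / (1.0 - probability)
print(score, round(probability, 3), round(odds, 3))
```

A categorical prediction is then obtained by thresholding the probability, for example predicting "pass" when it exceeds 0.5.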
Table 8 Logistic regression details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P4       | C1, C4, C5, C6    | 67%      |
| P11      | C5, C6            | 51.9%    |
| P15      | C5                | 89.15%   |
| P24      | C1, C5            | 88.8%    |
Table 9 Linear regression details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P20      | C1, C2, C5        | 50%      |
| P21      | C4, C6            | 60.24%   |
| P30      | C1, C5, C6        | -        |
3.3.5 Decision tree
The Decision Tree classification technique is
performed in two stages: 1) tree building,
and 2) pruning [60]. The internal nodes of the tree
represent conditions, the external nodes or leaves
represent class labels, and the branches from the internal
nodes represent outcomes of the tests or conditions
[61]. Decision Tree (DT) is one of the best-known
algorithms used for predictive modelling on
educational data [62]. 19 of the reviewed studies used the
Decision Tree algorithm, and in 3 of them it
achieved the highest accuracy when compared
with other algorithms. For example, [26] used
academic performance and previous education
features to predict students’ performance in
intermediate and secondary schools. Decision Tree
also gave good accuracy in identifying high-risk
students who need timely help to complete their
studies, as found by [30], who used a
combination of e-learning, social network, and
academic performance features.
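The tree structure described above (conditions at internal nodes, class labels at leaves, and branches for test outcomes) can be represented directly in Python. The tree below is written by hand with hypothetical features and thresholds; it is not learned from data and does not include the building or pruning stages.

```python
# Internal nodes are dicts holding a test; leaves are plain class-label strings.
# Features and thresholds are hypothetical, chosen only for illustration.
TREE = {
    "feature": "exam_score",
    "threshold": 50,
    "below": "at risk",
    "at_or_above": {
        "feature": "attendance",
        "threshold": 0.8,
        "below": "at risk",
        "at_or_above": "on track",
    },
}


def predict(node, student):
    """Follow the branch chosen by each test until a leaf (class label) is reached."""
    while isinstance(node, dict):
        passed = student[node["feature"]] >= node["threshold"]
        node = node["at_or_above"] if passed else node["below"]
    return node


print(predict(TREE, {"exam_score": 72, "attendance": 0.9}))
print(predict(TREE, {"exam_score": 72, "attendance": 0.6}))
```

A learning algorithm would choose the features and thresholds automatically, and pruning would remove branches that do not improve predictions on unseen data.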
Table 10 Decision tree details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P10      | C5, C6            | 96.6%    |
| P18      | C2, C3, C5        | 91.9%    |
| P37      | C1, C5            | 94%      |
3.3.6 Naïve Bayes (NB)
The Naïve Bayes classifier simplifies learning by
assuming that features are independent given the class,
and it provides probabilistic interpretations of
classifications [63]. Although independence is
generally a poor assumption, in practice Naïve Bayes
often competes well with more sophisticated
classifiers. [18] found that the Naïve Bayes
algorithm has better accuracy in predicting the
performance of junior high school students, while [21]
used Naïve Bayes to analyse undergraduate students’
performance.
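The independence assumption can be seen in a minimal categorical Naïve Bayes written in Python. The sketch multiplies a class prior by one likelihood per feature and uses add-one (Laplace) smoothing, which is an assumption added here to avoid zero probabilities. The toy rows (prior grade level, homework submitted) and labels are made up for illustration.

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def posterior(model, row):
    """Return normalised class probabilities for one row."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(row):
            # Independence assumption: multiply one smoothed likelihood per feature.
            score *= (value_counts[(i, label)][value] + 1) / (n + len(values_seen[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Made-up toy data: (prior grade level, homework submitted) -> outcome.
rows = [("high", "yes"), ("high", "yes"), ("high", "no"),
        ("low", "no"), ("low", "no"), ("low", "yes")]
labels = ["pass", "pass", "pass", "fail", "fail", "fail"]
model = train_nb(rows, labels)
probs = posterior(model, ("high", "yes"))
print(probs)
```

The output is a probability per class, which is the probabilistic interpretation of the classification mentioned above.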
Table 11 Naïve Bayes details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P6       | C5                | 69%      |
| P9       | C5, C6            | 83.65%   |
3.3.7 Hybrid algorithms
In prediction models, the challenging task is to choose
effective techniques that can produce satisfactory
predictive accuracy [64]. Some researchers have introduced
a hybrid approach that combines several machine
learning algorithms to achieve maximum
accuracy. Hybrid approaches are defined as
incorporating a number of machine learning
algorithms in order to achieve better performance than
any single learning algorithm examined [65]. In this
review, five papers used hybrid algorithm approaches
to predict student performance [15, 20, 38, 50] and to
analyse students at risk in a course [28]. Details of the
hybrid algorithms are given in Table 12.
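One simple way of combining several algorithms is weighted soft voting, in which the pass probabilities produced by different models are averaged. The Python sketch below illustrates this under stated assumptions: the three models are hand-written stand-ins with hypothetical rules and coefficients, and soft voting is only one possible hybrid scheme; the reviewed papers may integrate their algorithms differently.

```python
import math


# Stand-in models returning a probability of passing (hypothetical rules and coefficients).
def rule_model(student):
    return 0.9 if student["exam_score"] >= 50 else 0.2


def logistic_model(student):
    return 1.0 / (1.0 + math.exp(-(0.1 * student["exam_score"] - 5.0)))


def attendance_model(student):
    return student["attendance"]


def soft_vote(models, weights, student):
    """Weighted average of the probabilities produced by the individual models."""
    return sum(w * m(student) for m, w in zip(models, weights)) / sum(weights)


MODELS = [rule_model, logistic_model, attendance_model]
student = {"exam_score": 50, "attendance": 0.6}
combined = soft_vote(MODELS, [1, 1, 1], student)
print(round(combined, 3), "pass" if combined >= 0.5 else "at risk")
```

With equal weights this reduces to a plain average; unequal weights let a more trusted model contribute more to the combined prediction.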
Table 12 Hybrid algorithms details
| Paper Id | Features category | Accuracy |
|----------|-------------------|----------|
| P3       | C1, C4, C5, C6    | -        |
| P8       | C1, C6            | 82.3%    |
| P16      | C5                | 85%      |
| P26      | C1, C5            | 98.96%   |
| P38      | C1, C5            | 97.93%   |
Other than the algorithms mentioned above, the ID3,
K-Nearest Neighbour (KNN), Gradient Boosting
(GB), Association Rule Mining (ARM), J48, RMSR,
and Fuzzy Network (FN) algorithms were each found to
achieve a high accuracy rate in only one of the papers
reviewed. Details of the accuracy rates for the
different techniques are tabulated in Table 13.
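Table 13 Other algorithms details
| Paper Id | Features category | Algorithms | Accuracy |
|----------|-------------------|------------|----------|
| P2       | C1, C5, C6        | RMSR       | 94%      |
| P17      | C5, C6            | ID3        | 80%      |
| P25      | C5, C6            | FN         | 80%      |
| P28      | C2, C5            | KNN        | 89%      |
| P27      | C5                | ARM        | 67.3%    |
| P34      | C1, C5, C6        | GBoost     | 87.9%    |
| P35      | C2, C5, C6        | J48        | 94.7%    |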
Figure 6 shows how often each algorithm was used for
comparison and how often it emerged as the best
predictor in the reviewed studies.
Decision Tree is the most commonly used algorithm
in the review, with 19 out of 40 studies using it to
determine the best accuracy for educational datasets.
However, Random Forest ranked first in terms of the
share of studies in which it provided the
highest accuracy: 6 out of 18 studies, or
33.5%, when compared to the others. In terms of the
highest accuracy that each algorithm achieved,
Random Forest once again demonstrated the best
performance, with an accuracy of 99% [26], putting
this algorithm in first place. However,
SVM performed comparably to Random Forest: 19
papers tested SVM as a single or comparison
algorithm, and it gave the best
accuracy in 5 of them. Other
algorithms that were frequently chosen
are ANN, NB, KNN and LR.
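The comparison between Random Forest and SVM rests on simple proportions, which can be recomputed from the counts quoted above. The short sketch below does this; the dictionary layout and the function name best_rate are illustrative choices, and only the two algorithms for which both counts are stated in the text are included.

```python
# Counts quoted in the text: (times best predictor, times used).
counts = {
    "Random Forest": (6, 18),
    "SVM": (5, 19),
}


def best_rate(best, used):
    """Percentage of studies in which the algorithm gave the best accuracy."""
    return 100.0 * best / used


rates = {name: round(best_rate(b, u), 1) for name, (b, u) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
# Random Forest: 33.3%
# SVM: 26.3%
```

Recomputed this way, Random Forest comes out at about 33.3%, marginally below the reported 33.5%. The small gap may be due to rounding or to counts that are not shown in the text.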
Figure 6 Frequency of each algorithm as the best predictor or as a comparison technique in finding the best accuracy
[Bar chart not reproduced. Axes: Algorithms and Frequency (0 to 25). Series: Frequency as the best predictor; Frequency used.]
4. Conclusion
Performance prediction has evolved into a useful
research topic that assists educators, academics,
policymakers, and management in improving the
teaching and learning process. This paper presents a
5-year systematic review of the attributes used in student
performance prediction via machine learning,
educational data mining, and learning analytics
approaches, as well as their applicability in this
context. Analysis of the 40
papers included in this review prompted extensive
discussion of the variety of features and algorithms
that can affect student performance prediction.
The majority of features used in predicting students'
performance are from the academic feature category,
which includes students' grades and scores in
examinations, tests, quizzes, and assignments, GPA,
internal assessment of courses and subjects, and
attendance marks. According to the findings of this
review, academic performance features
outperform other categories of features as the
best attributes for students’ performance prediction.
Another finding from this review is that
examination marks and scores are better attributes
than GPA, as highlighted by [11]. This is due to
the fact that prediction of student performance has
been widely adopted in the educational sector, not only
at the higher levels of educational attainment, but also
in high school, middle school, and secondary school.
These findings offer a strong foundation for educators and
administrators to monitor their students'
academic progress while also establishing a suitable
approach based on their students' strengths.
Looking deeper into the frequency of techniques used
by researchers in predicting student performance,
Decision Tree, Random Forest, SVM, ANN, and NB
compete closely as the most commonly used
techniques. However, among the 40 papers reviewed,
the Random Forest algorithm had the best
performance, at 99 percent, when using demographic
and academic performance features [26], and it was also
the algorithm that most frequently gave the highest accuracy in
student performance prediction. RF can be used for
both regression and classification, and even
when handling high-dimensional data, its shorter
processing time makes it preferable to the Decision Tree
algorithm [51].
The hybrid approach that combined three
algorithms, AODE, IBK, and J48, scored the second
highest prediction accuracy at 98.96%. Previous
research on student performance prediction suggests
that algorithm combinations can help gain better
accuracy, but it does not emphasize the importance
of incorporating algorithms to improve accuracy
results. Not only has the integration of multiple
features resulted in improved prediction accuracy, but
the combination of different techniques has also
contributed to improved accuracy results [66]. These
results show that the selection of combined feature
categories together with suitable algorithms can affect
the accuracy levels obtained when analysing students’
performance. However, the significance of using
hybrid algorithms in student performance prediction
requires further investigation, as most hybrid approaches
have the capacity to produce competitive performance
when compared with related methods. Finally, it is
hoped that academic forecasting research on the
educational system can help students improve
their academic performance and help educators
understand their students' needs. Additionally, the
findings can help educational management
design more efficient curricula for better education
adaptation.
Limitations
Since there is no evidence in favour of smaller classes,
most articles are concerned about overfitting. As a
result, the use of data sampling when applying
data mining to educational datasets should be
described in order to address the overfitting and
underfitting problems. Another issue
highlighted by this review is that too few papers
study school design and social network
features, as most researchers emphasize
academic performance and demographic features.
Hopefully, more researchers will include more
categories of features in the future in order to find the
best attributes suitable for adaptation to specific
algorithms.
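As an illustration of the kind of data sampling mentioned above, the following sketch applies random oversampling to a small training set. Several assumptions are made here: "smaller classes" is read as under-represented outcome classes, random oversampling is only one of many sampling strategies and is not attributed to any reviewed paper, and the data and the function name random_oversample are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    is as large as the largest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical training set: (exam score, attendance %) with few "fail" cases.
rows = [(78, 90), (85, 95), (66, 80), (91, 99), (72, 88), (40, 55), (35, 60)]
labels = ["pass", "pass", "pass", "pass", "pass", "fail", "fail"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Sampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the data used to estimate accuracy.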
Acknowledgment
This research was supported by the Research Management
and Innovation Centre, Universiti Sultan Zainal Abidin
(UniSZA).
Conflicts of interest
The authors have no conflicts of interest to declare.
References
[1] Gaftandzhieva S, Docheva M, Doneva R. A
comprehensive approach to learning analytics in
Bulgarian school education. Education and Information
Technologies. 2021; 26(1):145-63.
[2] Alyahyan E, Düştegör D. Predicting academic success
in higher education: literature review and best practices.
International Journal of Educational Technology in
Higher Education. 2020; 17(1):1-21.
[3] Ferreira SA, Andrade A. Academic analytics: Anatomy
of an exploratory essay. Education and Information
Technologies. 2016; 21(1):229-43.
[4] Najafabadi MM, Villanustre F, Khoshgoftaar TM,
Seliya N, Wald R, Muharemagic E. Deep learning
applications and challenges in big data analytics.
Journal of Big Data. 2015; 2(1):1-21.
[5] Khatib KC, Kamble TD, Chendake BR, Sonavane GN.
Social media data mining for sentiment analysis.
International Research Journal of Engineering and
Technology. 2016; 3(04):373-6.
[6] Ozdemir D, Opseth HM, Taylor H. Leveraging learning
analytics for student reflection and course evaluation.
Journal of Applied Research in Higher Education.
2019; 12(1):27-37.
[7] Nuutila K, Tuominen H, Tapola A, Vainikainen MP,
Niemivirta M. Consistency, longitudinal stability, and
predictions of elementary school students' task interest,
success expectancy, and performance in mathematics.
Learning and Instruction. 2018; 56:73-83.
[8] Lang C, Siemens G, Wise A, Gasevic D. Handbook of
learning analytics. New York, NY, USA: SOLAR,
Society for Learning Analytics and Research; 2017.
[9] Nawang H, Makhtar M, Shamsudin SN. Classification
model and analysis on students’ performance. Journal
of Fundamental and Applied Sciences. 2017;
9(6S):869-85.
[10] Tsai YS, Gasevic D. Learning analytics in higher
education---challenges and policies: a review of eight
learning analytics policies. In proceedings of the
seventh international learning analytics & knowledge
conference 2017(pp. 233-42).
[11] Shahiri AM, Husain W. A review on predicting
student's performance using data mining techniques.
Procedia Computer Science. 2015; 72:414-22.
[12] Kitchenham B, Brereton OP, Budgen D, Turner M,
Bailey J, Linkman S. Systematic literature reviews in
software engineering - a systematic literature review.
Information and Software Technology. 2009; 51(1):7-
15.
[13] Gil PD, da Cruz Martins S, Moro S, Costa JM. A data-
driven approach to predict first-year students’ academic
success in higher education institutions. Education and
Information Technologies. 2021; 26(2):2165-90.
[14] Qazdar A, Er-Raha B, Cherkaoui C, Mammass D. A
machine learning algorithm framework for predicting
students performance: a case study of baccalaureate
students in Morocco. Education and Information
Technologies. 2019; 24(6):3577-89.
[15] Costa-Mendes R, Oliveira T, Castelli M, Cruz-Jesus F.
A machine learning approximation of the 2015
Portuguese high school student grades: a hybrid
approach. Education and Information Technologies.
2021; 26(2):1527-47.
[16] Baars GJ, Stijnen T, Splinter TA. A model to predict
student failure in the first year of the undergraduate
medical curriculum. Health Professions Education.
2017; 3(1):5-14.
[17] Youssef M, Mohammed S, Hamada EK, Wafaa BF. A
predictive approach based on efficient feature selection
and learning algorithms’ competition: Case of learners’
dropout in MOOCs. Education and Information
Technologies. 2019; 24(6):3591-618.
[18] Kostopoulos G, Kotsiantis S, Verykios VS. A prognosis
of junior high school students’ performance based on
active learning methods. In international conference on
brain function assessment in learning 2017 (pp. 67-76).
Springer, Cham.
[19] Moreno-Marcos PM, Pong TC, Munoz-Merino PJ,
Kloos CD. Analysis of the factors influencing learners’
performance prediction with learning analytics. IEEE
Access. 2020; 8:5264-82.
[20] Al-Obeidat F, Tubaishat A, Dillon A, Shah B.
Analyzing students’ performance using multi-criteria
classification. Cluster Computing. 2018; 21(1):623-32.
[21] Asif R, Merceron A, Ali SA, Haider NG. Analyzing
undergraduate students' performance using educational
data mining. Computers & Education. 2017; 113:177-
94.
[22] Yousafzai BK, Hayat M, Afzal S. Application of
machine learning and data mining in predicting the
performance of intermediate and secondary education
level student. Education and Information Technologies.
2020; 25(6):4677-97.
[23] Adekitan AI, Noma-Osaghae E. Data mining approach
to predicting the performance of first year student in a
university using the admission requirements. Education
and Information Technologies. 2019; 24(2):1527-43.
[24] Azcona D, Hsiao IH, Smeaton AF. Detecting students-
at-risk in computer programming classes with learning
analytics from students’ digital footprints. User
Modeling and User-Adapted Interaction. 2019;
29(4):759-88.
[25] Akçapınar G, Hasnine MN, Majumdar R, Flanagan B,
Ogata H. Developing an early-warning system for
spotting at-risk students by using eBook interaction
logs. Smart Learning Environments. 2019; 6(1):1-15.
[26] Hussain S, Dahan NA, Ba-Alwib FM, Ribata N.
Educational data mining and analysis of students’
academic performance using WEKA. Indonesian
Journal of Electrical Engineering and Computer
Science. 2018; 9(2):447-59.
[27] Adekitan AI, Salau O. The impact of engineering
students' performance in the first three years on their
graduation result using educational data mining.
Heliyon. 2019; 5(2):e01250.
[28] Marbouti F, Diefes-Dux HA, Madhavan K. Models for
early prediction of at-risk students in a course using
standards-based grading. Computers & Education.
2016; 103:1-5.
[29] Altujjar Y, Altamimi W, Al-Turaiki I, Al-Razgan M.
Predicting critical courses affecting students
performance: a case study. Procedia Computer Science.
2016; 82:65-71.
[30] Zhou Q, Quan W, Zhong Y, Xiao W, Mou C, Wang Y.
Predicting high-risk students using Internet access logs.
Knowledge and Information Systems. 2018; 55(2):393-
413.
[31] Aydoğdu Ş. Predicting student final performance using
artificial neural networks in online learning
environments. Education and Information
Technologies. 2020; 25(3):1913-27.
[32] Karlos S, Kostopoulos G, Kotsiantis S. Predicting and
interpreting students’ grades in distance higher
education through a semi-regression method. Applied
Sciences. 2020; 10(23):8413.
[33] Iqbal MS, Luo B. Prediction of educational institution
using predictive analytic techniques. Education and
Information Technologies. 2019; 24(2):1469-83.
[34] Zohair LM. Prediction of student’s performance by
modelling small dataset size. International Journal of
Educational Technology in Higher Education. 2019;
16(1):1-8.
[35] Ma X, Zhou Z. Student pass rates prediction using
optimized support vector machine and decision tree. In
8th annual computing and communication workshop
and conference (CCWC) 2018 (pp. 209-15). IEEE.
[36] Hashim AS, Awadh WA, Hamoud AK. Student
performance prediction model based on supervised
machine learning algorithms. In IOP conference series:
materials science and engineering 2020 (pp. 1-19). IOP
Publishing.
[37] Hamsa H, Indiradevi S, Kizhakkethottam JJ. Student
academic performance prediction model using decision
tree and fuzzy genetic algorithm. Procedia Technology.
2016; 25:326-32.
[38] Pandey M, Taruna S. Towards the integration of
multiple classifier pertaining to the Student's
performance prediction. Perspectives in Science. 2016;
8:364-6.
[39] Badr G, Algobail A, Almutairi H, Almutery M.
Predicting students’ performance in university courses:
a case study and tool in KSU mathematics department.
Procedia Computer Science. 2016; 82:80-9.
[40] Akçapınar G, Altun A, Aşkar P. Using learning
analytics to develop early-warning system for at-risk
students. International Journal of Educational
Technology in Higher Education. 2019; 16(1):1-20.
[41] Hussain M, Zhu W, Zhang W, Abidi SM, Ali S. Using
machine learning to predict student difficulties from
learning session data. Artificial Intelligence Review.
2019; 52(1):381-407.
[42] Zheng G, Fancsali SE, Ritter S, Berman S. Using
instruction-embedded formative assessment to predict
state summative test scores and achievement levels in
mathematics. Journal of Learning Analytics. 2019;
6(2):153-74.
[43] Rovira S, Puertas E, Igual L. Data-driven system to
predict academic grades and dropout. PLoS One. 2017;
12(2):e0171207.
[44] Rodríguez-Muñiz LJ, Bernardo AB, Esteban M, Díaz I.
Dropout and transfer paths: What are the risky profiles
when analyzing university persistence with machine
learning techniques?. Plos One. 2019; 14(6):1-21.
[45] Francis BK, Babu SS. Predicting academic
performance of students using a hybrid data mining
approach. Journal of Medical Systems. 2019; 43(6):1-
5.
[46] Aiken JM, De Bin R, Hjorth-Jensen M, Caballero MD.
Predicting time to graduation at a large enrollment
American university. Plos One. 2020;
15(11):e0242334.
[47] Hussain M, Zhu W, Zhang W, Abidi SM. Student
engagement predictions in an e-learning system and
their impact on student course assessment scores.
Computational Intelligence and Neuroscience. 2018.
[48] Czibula G, Mihai A, Crivei LM. S PRAR: a novel
relational association rule mining classification model
applied for academic performance prediction. Procedia
Computer Science. 2019; 159:20-9.
[49] Matzavela V, Alepis E. Decision tree learning through
a predictive model for student academic performance in
intelligent M-learning environments. Computers and
Education: Artificial Intelligence. 2021; 2:100035.
[50] Viloria A, López JR, Leyva DM, Vargas-Mercado C,
Hernández-Palma H, Llinas NO, et al. Data mining
techniques and multivariate analysis to discover
patterns in university final researches. Procedia
Computer Science. 2019; 155:581-6.
[51] Deng H, Wang X, Guo Z, Decker A, Duan X, Wang C,
et al. Performancevis: visual analytics of student
performance data from an introductory chemistry
course. Visual Informatics. 2019; 3(4):166-76.
[52] Çetinkaya A, Baykan ÖK. Prediction of middle school
students' programming talent using artificial neural
networks. Engineering Science and Technology, an
International Journal. 2020; 23(6):1301-7.
[53] Mokhairi M, Nawang H, Wan SN. Analysis on students
performance using naïve. J. Theor. Appl. Inf. Technol.
2017; 31(16):3993-4000.
[54] Hu H, Zhang G, Gao W, Wang M. Big data analytics
for MOOC video watching behavior based on Spark.
Neural Computing and Applications. 2020;
32(11):6481-9.
[55] Slater S, Joksimović S, Kovanovic V, Baker RS,
Gasevic D. Tools for educational data mining: a review.
Journal of Educational and Behavioral Statistics. 2017;
42(1):85-106.
[56] Breiman L. Random forests. Machine Learning. 2001;
45(1):5-32.
[57] Yusuf A. Prediction of students’ performance in E-
learning environment using random forest (Doctoral
dissertation, Universiti Teknologi Malaysia).
[58] Noble WS. What is a support vector machine?. Nature
biotechnology. 2006; 24(12):1565-7.
[59] Gupta N. Artificial neural network. Network and
Complex Systems. 2013; 3(1):24-8.
[60] Hamoud A, Hashim AS, Awadh WA. Predicting
student performance in higher education institutions
using decision tree analysis. International Journal of
Interactive Multimedia and Artificial Intelligence.
2018; 5:26-31.
[61] Zulfiker MS, Kabir N, Biswas AA, Chakraborty P,
Rahman MM. Predicting students’ performance of the
private universities of Bangladesh using machine
learning approaches. International Journal of Advanced
Computer Science and Applications. 2020; 11(3):672-
9.
[62] Sivakumar S, Venkataraman S, Selvaraj R. Predictive
modeling of student dropout indicators in educational
data mining using improved decision tree. Indian
Journal of Science and Technology. 2016; 9(4):1-5.
[63] Rish I. An empirical study of the naive Bayes classifier.
In IJCAI workshop on empirical methods in artificial
intelligence 2001 (pp. 41-6).
[64] Sokkhey P, Okazaki T. Hybrid machine learning
algorithms for predicting academic performance.
International Journal of Advanced Computer Science
and Applications. 2020; 11(1):32-41.
[65] Dole L, Rajurkar J. A Decision support system for
predicting student performance. International Journal
of Innovative Research in Computer and
Communication Engineering. 2014; 2(12):7232-7.
[66] Mohamad M, Makhtar M, Abd Rahman MN. The
reconstructed heterogeneity to enhance ensemble
neural network for large data. In international
conference on soft computing and data mining 2016
(pp. 447-55). Springer, Cham.
Hasnah Nawang earned a BSc in
Computer Science from Universiti Putra
Malaysia in 2006 and an MSc in
Computer Science from Universiti
Sultan Zainal Abidin in Terengganu,
Malaysia, in 2018. She is currently a
PhD candidate at the Department of
Computer Science in the Faculty of
Computing and Informatics at Universiti Sultan Zainal
Abidin in Terengganu, Malaysia. Since 2007, she has also
worked as a secondary school teacher in the Department of
Mathematics and Computer Science. Machine Learning,
Data Mining, and Deep Learning are among her current
research interests.
Email: hasnah.nawang@gmail.com
Dr. Mokhairi Makhtar received his
PhD from University of Bradford,
United Kingdom in 2012. He is
currently a Professor in the Department
of Computer Science, Universiti Sultan
Zainal Abidin, Terengganu, Malaysia.
His current research interests include
Machine Learning, Ensemble Method,
Data Mining, Soft Computing, Timetabling and
Optimisation, Natural Language Processing, E-Learning
and Deep Learning.
Email: mokhairi@unisza.edu.my
Dr. Wan Mohd Amir Fazamin Wan
Hamzah received his PhD from
Universiti Malaysia Terengganu,
Malaysia. He is currently a lecturer in
Universiti Sultan Zainal Abidin,
Terengganu, Malaysia. His research
interests include Learning Analytics,
Gamification, e-Learning and Cloud
Computing.
Email: amirfazamin@unisza.edu.my
Appendix I
S. No.    Abbreviation    Description
1         ANN             Artificial Neural Network
2         ARM             Association Rule Mining
3         DM              Data Mining
4         DT              Decision Tree
5         FN              Fuzzy Network
6         GBoost          Gradient Boosting
7         ID3             Iterative Dichotomiser 3
8         J48             J48 algorithm
9         KNN             K-Nearest Neighbour
10        LA              Learning Analytics
11        LNR             Linear Regression
12        LR              Logistic Regression
13        NB              Naïve Bayes
14        ML              Machine Learning
15        RF              Random Forest
16        SVM             Support Vector Machine
... The limitations of previous research focused on which algorithm may better predict secondary students' achievement, specifically in core and elective subjects. Motivated researchers to explore this scenario, so that their teachers can cater to their needs by grouping them in different groups based on their profiling [14]. This study also aims to look at whether the same variables and algorithms used in the classification of core and elective subjects could be able to predict academic performance based on students' first semester achievement. ...
... This study discovered that the aforesaid features that are widely employed in students' performance prediction are separated into many groups; demographic, e-learning, social network, school design, academic performance and previous education features [14]. However, there has been little research previously conducted to study either core subjects or elective subjects that are designed to contribute to the students' performance prediction. ...
Article
Full-text available
Many researchers in educational data mining (EDM) have explored various machine learning techniques in order to predict students’ performance. However, the most daunting challenge in classification modelling is selecting the most effective algorithm with the highest accuracy. A study was conducted using datasets from two Malaysian premier secondary schools, Maktab Rendah Sains Mara (MRSM) Kuala Berang and Kuala Terengganu. The purpose of this study is to respond to two key questions; the first is to examine which algorithm is the best in predicting secondary students’ achievement in core and elective subjects, while the second is to study whether the same features and algorithms are capable of predicting academic performance based on students’ first semester achievement. To do so, this study analysed the effectiveness of six different classification algorithms, which are naïve Bayes (NB), random forest (RF), k-nearest neighbour (kNN), support vector machine (SVM), sequential minimal optimization (SMO), and logistic regression (LGR). Each model’s prediction accuracy was evaluated using 10-fold cross validation in order to identify the best model. The results showed that the RF model outperformed other models in terms of accuracy, precision, recall, and F1-Measure. With most algorithms achieving significant accuracy levels for both core and elective subjects’ dataset. It is concluded that the prediction of secondary school students' achievement can begin as early as the first semester using RF for core and elective subjects with biology dataset. The accuracy obtained was 96.7% and 97.5%, respectively for the core and elective subjects.
... The limitations of previous research focused on which algorithm may better predict secondary students' achievement, specifically in core and elective subjects. Motivated researchers to explore this scenario, so that their teachers can cater to their needs by grouping them in different groups based on their profiling [14]. This study also aims to look at whether the same variables and algorithms used in the classification of core and elective subjects could be able to predict academic performance based on students' first semester achievement. ...
... This study discovered that the aforesaid features that are widely employed in students' performance prediction are separated into many groups; demographic, e-learning, social network, school design, academic performance and previous education features [14]. However, there has been little research previously conducted to study either core subjects or elective subjects that are designed to contribute to the students' performance prediction. ...
Article
Many researchers in educational data mining (EDM) have explored various machine learning techniques in order to predict students’ performance. However, the most daunting challenge in classification modelling is selecting the most effective algorithm with the highest accuracy. A study was conducted using datasets from two Malaysian premier secondary schools, Maktab Rendah Sains Mara (MRSM) Kuala Berang and Kuala Terengganu. The purpose of this study is to respond to two key questions; the first is to examine which algorithm is the best in predicting secondary students’ achievement in core and elective subjects, while the second is to study whether the same features and algorithms are capable of predicting academic performance based on students’ first semester achievement. To do so, this study analysed the effectiveness of six different classification algorithms, which are naïve Bayes (NB), random forest (RF), knearest neighbour (kNN), support vector machine (SVM), sequential minimal optimization (SMO), and logistic regression (LGR). Each model’s prediction accuracy was evaluated using 10-fold cross validation in order to identify the best model. The results showed that the RF model outperformed other models in terms of accuracy, precision, recall, and F1- Measure. With most algorithms achieving significant accuracy levels for both core and elective subjects’ dataset. It is concluded that the prediction of secondary school students' achievement can begin as early as the first semester using RF for core and elective subjects with biology dataset. The accuracy obtained was 96.7% and 97.5%, respectively for the core and elective subjects.
... It has been proved that using ML approaches is essential for identifying students who are at risk of experiencing academic difficulties and dropping out, as well as for enhancing overall student performance. The goal of Nawang et al. (2021) is to provide an SLR on student achievement predictions in higher secondary schools and colleges of higher study utilizing ML, EDM, and learning analytic approaches. The study also presented several potential future findings on the application of hybrid approaches to academic datasets in order to enhance the precision of performance prediction for students. ...
Article
Full-text available
Education systems have significantly changed with the emergence of the internet. It has a significant impact on how students learn things. Nevertheless, its impact can also be contradicting. Internet addiction can slowly poison the minds of our youths and stand in the way of pursuing their goals. Although Bangladesh has internet connectivity across the country, its potential could be more utilized, particularly in the educational sector. Therefore, proper analysis of the effects of the internet on students, as well as determining the prominent factors relevant to the internet, is a necessary task. In addition, predicting students' academic performance can help determine the changes that must be incorporated to improve the educational system. Hence, this research analyzes the effects of internet usage on students' academic progress and then predicts the students' performance using distinct machine learning (ML) algorithms. The data were collected through an offline survey from Noakhali, Bangladesh. The collected data is preprocessed to select the most relevant features. The preprocessed data were fed into ML algorithms to investigate their behaviors. We have employed logistic regression, decision tree, random forest, and Naïve Bayes algorithms to see their classification performance on our dataset. To minimize the overfitting issue, k-fold cross-validation and hyperparameter optimization have been applied. The results were presented in two parts—exploratory data analysis and classification. Exploratory data analysis shows that the main purpose of internet usage is education and entertainment for school students, social media and entertainment for college students, and education and social media for university students. School and university students browse the internet mainly for academic purposes, whereas college students browse mainly for non-academic purposes. Students prefer to browse the internet at night. 
For all schools, colleges, and universities, students with better results generally visited websites like Google and YouTube. Students with moderate or bad results generally spent time on social media platforms (mainly Facebook and WhatsApp). Then, the results of the numerical analysis performed with classification algorithms are presented. Results indicate that random forest gives the maximum score in our dataset in all sectors, like accuracy, precision, recall, and f1 score. It gives a maximum of 85% accuracy on the test set. Logistic regression gives the second-best score of 69%. The practical applications and policy recommendations for Bangladesh's education sector are also discussed. The output of this work can contribute to building a policy on internet usage. In this way, it is possible to make the students more concentrative on their education and learning.
... In Nawang et al (2021) systematic literature review, the authors discussed the use of the EDM, LA, and ML to predict student's academic performance in secondary schools and higher educational institutions. This review provided an overview of the methods and algorithms that are used in the prediction and identified the features that have the most significant effects on their performance. ...
Preprint
Full-text available
With advances in Artificial Intelligence (AI) and the increasing volume of online educational data, Deep Learning techniques have played a critical role in predicting student’s academic performance. Recent developments have assisted instructors in determining the strengths and weaknesses of student’s academic achievement. This understanding will benefit from adopting the necessary interventions to assist students in improving their performance, helping at-risk students, and preventing dropout rates. In this review, 46 studies between 2019 and 2023 that apply one or more Deep Learning (DL) techniques, either alone or in combination with Machine Learning (ML) or Ensemble Learning techniques, have been reviewed and analyzed. Moreover, the review has examined utilizing datasets from public (MOOCs), private (LMSs), and other platforms. Four categories were used to group the features: demographic, previous and current academic performance, and learning behavior/activity features. Therefore, the analysis has demonstrated that the DNNs and CNN-LSTM models were the most commonly used techniques. Moreover, it has been noted that the studies that have used DL techniques such as CNNs, DNNs, and LSTMs, performed well by achieving high prediction accuracy above 90%; other studies achieved accuracy ranging between 60% and 90%. For datasets used within the reviewed studies, even though 44% of the studies have used LMSs datasets, it is found that OULAD was the most used dataset from MOOCs. For the grouped features, the results of the analysis indicate that the best features for prediction are the learning behavior and activity features, which outperform other feature categories.
... On topic of student performance predictions which includes predicting students grades (tests, exams, quizzes, assignments, GPA) and attendance, (Nawang et al., 2021) analyzed 40 papers and found that this application of ML has been widely adopted in higher education. Regarding the most widely used techniques by researchers, the authors found that while Decision Tree, Random Forest, SVM, Artificial Neural Network (ANN) and Naïve Bayes (NB) were the most widely used techniques and it was Random Forest that proved to be the most accurate technique in predicting student performance. ...
Article
Full-text available
In the last decade, artificial intelligence (AI), machine learning (ML) and learning data analytics have been introduced with great effect in the field of higher education. However, despite the potential benefits for higher education institutions (HIE´s) of these emerging technologies, most of them are still in the early stages of adoption of these technologies. Thus, a systematic literature review (SLR) on the literature published over the last 5 years on potential applications of machine learning in higher education is necessary. Following the PRISMA guidelines, out of the 1887 initially identified SCOPUS-indexed publications on the topic, 171 articles were selected for review. To screen the abstracts and titles of each citation, Rayyan QCRI was used. VOSViewer, a software tool for constructing and visualizing bibliometric networks, and Microsoft Excel were used to generate charts and figures. The findings show that the most widely researched application of ML in higher education is related to the prediction of academic performance and employability of students. The implications will be invaluable for researchers and practitioners to explore how ML and AI technologies ,in the era of ChatGPT, can be used in universities without jeopardizing academic integrity.
... A wide range of data mining techniques has been employed for this purpose, each with its own strengths and limitations. A survey was conducted to provide a comprehensive overview of the intelligent models and paradigms used in education [12]. The survey identifies various challenges in predicting student performance, such as the high dimensionality of educational data, class imbalance, and the lack of labelled data. ...
In universities that use the academic credit system, selecting elective courses is a crucial task that can have a significant impact on a student's academic performance. Students who perform poorly in their courses may receive formal warnings or even face expulsion from the university. Thus, a well-designed study plan from a course recommendation system can play an essential role in achieving good academic performance. Additionally, early warnings regarding challenging courses can help students better prepare and improve their chances of success. Therefore, predicting student performance is a vital component of both the course recommendation system and the academic advisor's role. To this end, numerous studies have addressed the prediction of student performance using various approaches such as association rules, machine learning, and recommender systems. More recently, personalized machine learning approaches, particularly the matrix factorization technique, have been used in the course recommendation system. However, the accuracy of these approaches in predicting student performance still needs improvement. To address this issue, this study proposes an approach called Deep Biased Matrix Factorization, which carries out deep factorization via multi-layer to enhance prediction accuracy. Experimental results on an educational dataset have demonstrated that the proposed approach can significantly improve the accuracy of student performance prediction. By using this approach, universities can better recommend elective courses to their students as well as predict student performance, which can help them make informed decisions and achieve better academic outcomes.
Machine learning (ML) is one way to help decipher the intricate relationship between students' data and their performance. When implemented correctly in learning environments, machine learning will improve our knowledge of fundamental processes by simplifying the identification, extraction, and evaluation of underlying factors that affect student learning and levels of achievement. This study employed the experimental research approach using binary classification techniques based on the six-step knowledge discovery process (KDP) model. Five classifiers were used within the RapidMiner 9.10.010 educational environment as both experimental and analytical tool. The dataset comprised 2334 records and 17 attributes, including one class variable (students’ semester average score). Twenty different tests were conducted. The experiments' results were evaluated using 10-fold cross-validation and ratio split validation with bootstrap sampling. The Random Forest (RF), Rule Induction (RI), Naive Bayes (NB), Logistic Regression (LR) and Deep Learning (DL) algorithms were used in the experiment. The experimental results demonstrated that the RF method outperforms the other four techniques in all six evaluation metrics employed for the selection process, with an accuracy of 93.96%. According to the RF classifier model, the mother's and father's education levels are two demographic factors identified in this study that significantly influence pre-tertiary students’ academic achievement. This study has significantly reduced the gap in practical knowledge observed in the literature by introducing into its prediction procedure an intervention scheme for students requiring intensive or minimal academic interventions.
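The 10-fold cross-validation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the RapidMiner procedure the authors used: the function names `k_fold_indices` and `cross_val_splits` are hypothetical, the folds are contiguous rather than shuffled or stratified, and only the record count (2334) is taken from the abstract.

```python
def k_fold_indices(n_samples, k=10):
    """Split the indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample each.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 2334 is the record count reported in the abstract; everything else is illustrative.
splits = list(cross_val_splits(2334, k=10))
```

Each record appears in exactly one test fold, so every record is used for evaluation once and for training nine times. In practice the indices would usually be shuffled (and often stratified by class) before splitting.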
In the area of machine learning and data science, decision tree learning is considered one of the most popular classification techniques. A decision tree algorithm generates a classification and predictive model that is simple to understand and interpret, easy to display graphically, and capable of handling both numerical and categorical data. Intelligent m-learning systems have recently enjoyed an explosive growth of interest, as a route to more effective education and adaptive learning tailored to each student's learning abilities. The goal of this paper is to further improve personalization in the assessment of student academic performance through dynamic tests combined with a predictive model. One major objective of this research is to create adaptive dynamic tests for assessing student academic performance, while constantly comparing the results of the assessment, which reflect the individual student profile, with the results of the decision tree algorithm, which formulates a predictive model for students' knowledge level according to the weights of the decision tree.
Multi-view learning is a machine learning approach aiming to exploit the knowledge retrieved from data represented by multiple feature subsets known as views. Co-training is considered the most representative form of multi-view learning, a very effective semi-supervised classification algorithm for building highly accurate and robust predictive models. Even though it has been implemented in various scientific fields, it has not been adequately used in educational data mining and learning analytics, since the hypothesis about the existence of two feature views cannot be easily implemented. Some notable studies have emerged recently dealing with semi-supervised classification tasks, such as student performance or student dropout prediction, while semi-supervised regression is uncharted territory. Therefore, the present study attempts to implement a semi-supervised regression algorithm for predicting the grades of undergraduate students in the final exams of a one-year online course, which exploits three independent and naturally formed feature views, since they are derived from different sources. Moreover, we examine a well-established framework for interpreting the acquired results regarding their contribution to the final outcome per student/instance. For this purpose, a plethora of experiments is conducted based on data offered by the Hellenic Open University and representative machine learning algorithms. The experimental results demonstrate that the early prognosis of students at risk of failure can be accurately achieved compared to supervised models, even for a small amount of initially collected data from the first two semesters. The robustness of the applied semi-supervised regression scheme, along with supervised learners and the investigation of features’ reasoning, could highly benefit the educational domain.
The time it takes a student to graduate with a university degree is influenced by a variety of factors such as their background, their academic performance at university, and their integration into the social communities of the university they attend. Different universities have different populations, student services, instruction styles, and degree programs; however, they all collect institutional data. This study presents data for 160,933 students attending a large American research university. The data includes performance, enrollment, demographics, and preparation features. Discrete time hazard models for the time-to-graduation are presented in the context of Tinto’s Theory of Drop Out. Additionally, a novel machine learning method, gradient boosted trees, is applied and compared to the typical maximum likelihood method. We demonstrate that enrollment factors (such as changing a major) lead to greater increases in the model's performance in predicting when a student graduates than performance factors (such as grades) or preparation factors (such as high school GPA).
This study presents a data mining approach to predict the academic success of first-year students. A dataset covering 10 academic years of first-year bachelor's degrees from a Portuguese Higher Institution (N = 9652) has been analysed. Feature selection resulted in a characterising set of 68 features, encompassing socio-demographic, social origin, previous education, special statutes and educational path dimensions. We proposed and tested three distinct course stage data models based on entrance date, end of the first curricular semester and end of the second curricular semester. A support vector machines (SVM) model achieved the best overall performance and was selected to conduct a data-based sensitivity analysis. Previous evaluation performance, study gaps and age-related features play a major role in explaining failures at the entrance stage. For subsequent stages, current evaluation performance features unveil their predictive power. Suggested guidelines include providing study support groups for at-risk profiles and creating monitoring frameworks. From a practical standpoint, a data-driven decision-making framework based on these models can be used to promote academic success.
This article uses an anonymous 2014–15 school year dataset from the Directorate-General for Statistics of Education and Science (DGEEC) of the Portuguese Ministry of Education to compare the predictive power of the classic multilinear regression model with that of a chosen set of machine learning algorithms. A multilinear regression model is used in parallel with random forest, support vector machine, artificial neural network and extreme gradient boosting machine stacking ensemble implementations. The intent is a hybrid analysis in which classical statistical analysis and artificial intelligence algorithms are blended to strengthen the ability to draw valuable conclusions and well-supported results. The machine learning algorithms attain a higher level of predictive ability. In addition, the appropriateness of stacking increases as the determinant of the base learner output correlation matrix increases, and the empirical distributions of random forest feature importance are correlated with the structure of p-values and the statistical significance test findings of the multiple linear model. An information system that supports the nationwide education system should be designed and further structured to collect meaningful and precise data about the full range of academic achievement antecedents. The article concludes that no evidence is found in favour of smaller classes.
Nowadays, the softwarization and virtualization of resources and services continue rapidly, and along with reading and writing, programming is going to be one of the basic human abilities. Thus, the detection of skilled programmers at an early age has become important for economies to strengthen their workforce and compete globally. The current technological momentum shows that when the middle school students of today reach the 2030s, the demand for advanced programming skills will have increased rapidly, expanding by as much as 90% between 2016 and 2030. Accordingly, this study focused on predicting middle school students’ programming aptitude using artificial neural network (ANN) algorithms. A participant survey was developed and applied to middle school students consisting of fifth, sixth, and seventh graders from Konya Science Center, Turkey. After completing the survey, the participants took the 20-level Classic Maze course (CMC) on Code.org. The participants’ final scores in the CMC were calculated based on the level they completed and the lines of code they wrote. The best results were obtained using the Bayesian regularization algorithm: Training-R = 0.972284; Test-R = 0.912687; and All-R = 0.9597. The results show that ANN is an appropriate machine learning method that can forecast participants’ skills, such as analytical thinking, problem-solving, and programming aptitude.
Many educational institutions use a large number of information systems to automate their activities for different stakeholder groups: learning management systems, student diary, library system, digital repositories, management system, etc. This leads to a significant increase in the volume and variety of data that can be captured, stored, and harnessed to improve student learning and school effectiveness. Educational institutions are realising that, with the help of technology, they are collecting data that could be very valuable when properly analysed, aligned with learning outcomes, and integrated into a tighter feedback loop with stakeholders. In addition, the analysis of data can help their managers to make data-driven decisions at all levels of educational institutions. The paper presents a comprehensive approach to Learning Analytics in the field of Secondary Education from the perspective of all the different stakeholders, which aims to improve its methods of approaching and analysing learning data. On the basis of a literature review in the field and an investigation of requirements for quality evaluation of learning in school education, the corresponding stakeholder groups are identified (students, teachers, class teachers, managers, parents, inspectors), and 6 models (1 model per stakeholder group) for data collection and personalized and meaningful analysis are proposed for the needs of Learning Analytics. Each model consists of measurable indicators allowing the relevant stakeholder to track data on students’ learning or training for different purposes, e.g. monitoring, analysis, forecast, intervention, recommendations, etc., ultimately to improve the quality of learning and teaching processes. The proposed models are evaluated by representatives of 4 stakeholder groups: students, teachers, class teachers, parents.
The presented work is a student marks and grade prediction system using supervised machine learning techniques; the system is developed on the historic performance of students. The data used in this research is collected from the Federal Board of Intermediate and Secondary Education (FBISE), Islamabad, Pakistan. There are 7 regions in FBISE, i.e. Punjab, Sindh, Khyber Pakhtunkhwa, Balochistan, Azad Jammu and Kashmir and overseas. The aim of this work is to analyze education quality, which is closely tied to the sustainable development goals. The implementation of the system has produced an excess of data which must be processed suitably to gain more valuable information that can be useful for future development and planning. Student marks and grade prediction from historic academic data is a popular and useful application in educational data mining, so it is becoming a valuable source of information which can be used in different ways to improve education quality in the country. Related work shows that several methods for academic grade prediction have been developed for the benefit of the teaching and administrative staff of an educational organizational system. In our proposed methodology, the obtained data is preprocessed to improve its quality, and the labeled historic student data (29 optimal attributes) is used to train a decision tree classifier and a regression model. The classification system predicts the grade while the regression model predicts the marks; finally, the results obtained from both models are analyzed. The obtained results show the effectiveness and importance of machine learning technology in predicting students' performance.
Every year thousands of students are admitted into different universities in Bangladesh. Among them, a large number of students graduate with low-scoring results, which affects their careers. If their grades are predicted before the final examination, they can take essential measures to improve them. This article proposes different machine learning approaches for predicting the grade of a student in a course, in the context of the private universities of Bangladesh. Using different features that affect the result of a student, seven different classifiers have been trained, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression, Decision Tree, AdaBoost, Multilayer Perceptron (MLP), and Extra Tree Classifier, to classify the students' final grades into four quality classes: Excellent, Good, Poor, and Fail. Afterwards, the outputs of the base classifiers have been aggregated using the weighted voting approach to attain better results. The study achieved an accuracy of 81.73%, with the weighted voting classifier outperforming the base classifiers.
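The weighted voting step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `weighted_vote`, the example predictions, and the weights are all hypothetical, and the abstract does not say how the weights were derived (validation accuracy of each base classifier is one common choice).

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per base classifier.
    weights:     one non-negative weight per base classifier.
    Ties are broken in favour of the label seen first.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical outputs of seven base classifiers for a single student.
base_predictions = ["Good", "Excellent", "Good", "Poor", "Excellent", "Excellent", "Good"]
# Hypothetical weights, e.g. each classifier's validation accuracy.
base_weights = [0.80, 0.70, 0.75, 0.60, 0.65, 0.70, 0.78]

final_grade = weighted_vote(base_predictions, base_weights)
```

Here "Good" and "Excellent" each receive three votes, and the weights decide the outcome: the three classifiers voting "Good" carry a larger combined weight, so the ensemble predicts "Good".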
The large volume of data and its complexity in educational institutions require the support of information technologies. To facilitate this task, many researchers have focused on using machine learning to extract knowledge from educational databases to support students and instructors in achieving better performance. In prediction models, the challenging task is to choose effective techniques that can produce satisfactory predictive accuracy. Hence, in this work, we introduce a hybrid approach that uses principal component analysis (PCA) in conjunction with four machine learning (ML) algorithms, namely random forest (RF), the C5.0 decision tree (DT), naïve Bayes (NB) of Bayes network, and support vector machine (SVM), to improve classification performance by addressing the misclassification problem. Three datasets were used to confirm the robustness of the proposed models. On the given datasets, we used classification accuracy and root mean square error (RMSE) as evaluation metrics for the proposed models. In this classification problem, 10-fold cross-validation was used to evaluate predictive performance. The proposed hybrid models produced very good prediction results, showing themselves to be effective prediction and classification algorithms.
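The two evaluation metrics named in the abstract above, classification accuracy and root mean square error (RMSE), can be computed as in the following minimal sketch. The function names and the sample values are illustrative assumptions and do not come from the study.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: class labels encoded as integers.
true_labels = [1, 0, 1, 1]
predicted_labels = [1, 1, 1, 0]

acc = accuracy(true_labels, predicted_labels)
err = rmse(true_labels, predicted_labels)
```

For integer-encoded class labels, RMSE penalises a prediction more the further its code lies from the true code, whereas accuracy treats every misclassification equally.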