Predict Students’ Academic Performance based on their Assessment Grades and Online Activity Data

(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 4, 2020
185 | P a g e
www.ijacsa.thesai.org
Amal Alhassan1, Bassam Zafar2
Information Systems Department, FCIT
King Abdulaziz University, Jeddah, Saudi Arabia
Ahmed Mueen3
CIT Department, Faculty of Applied Studies
King Abdulaziz University, Jeddah, Saudi Arabia
Abstract—The ability to predict students’ academic performance is critical for any educational institution that aims to improve its students’ learning process and achievement. Although the student performance prediction problem has been studied widely, it remains a challenging and complex issue for educational institutions because of the many features that affect students’ learning and achievement in courses. Moreover, the use of web-based learning systems in education provides opportunities to study how students learn and which learning behaviors lead them to success. The main objective of this research was to investigate the impact of assessment grades and online activity data in the Learning Management System (LMS) on students’ academic performance. Based on classification, one of the most commonly used data mining techniques for prediction, five algorithms were applied: decision tree, random forest, sequential minimal optimization, multilayer perceptron, and logistic regression. Experimental results revealed that assessment grades are the most important features affecting students’ academic performance. Moreover, prediction models that included assessment grades alone or in combination with activity data perform better than models based on activity data alone. Finally, the random forest algorithm performs best for predicting student academic performance, followed by the decision tree.
Keywords—Predict student performance; learning management system; data mining; educational data mining; classification model
I. INTRODUCTION
Educational data mining (EDM) is an emerging field in data mining that aims to transform the data accumulated in educational systems into information that helps educational institutions make informed decisions [1]. EDM uses data mining tools and techniques in the education field to analyze student performance, predict outcomes so that students at risk of academic failure can be helped, and provide feedback to faculties and instructors to improve student outcomes and the learning process [2]. Most previous works have demonstrated the effectiveness of data mining in addressing various educational issues, and student performance prediction is one of the most important issues studied with data mining techniques.
Moreover, the growing use of the internet in education has produced a new context called web-based learning, delivered through the learning management system (LMS). An LMS is a web-based application for managing online learning: it allows an educational institution to manage students, monitor their participation, and track their progress through the system [3]. An LMS can provide accurate insight into students’ online activity and learning behavior because all data related to students’ actions and events are monitored and recorded [4]. These data can be used to analyze students’ learning behavior and to create prediction models for their performance.
Predicting student performance is a crucial issue for every educational institution that aims to improve students’ performance and their learning process. Based on the prediction output, an institution can support students identified as low performing. Although predicting students’ performance is widely studied, it remains a challenging and complex process because performance is influenced by many features, such as demographic, social, academic, economic, and other environmental features [5, 6]. Understanding these features makes it possible to control their impact on student performance.
The main objective of this research is to investigate the impact of assessment and activity features from the LMS on students’ academic performance. Based on classification, one of the most common data mining techniques for prediction, five algorithms are applied to predict students’ performance: decision tree (J48), random forest (RF), sequential minimal optimization (SMO), multilayer perceptron (MLP), and logistic regression (Logistic).
The rest of this paper is structured in six sections. Section 2 reviews related work. Section 3 introduces concepts and definitions related to this research. Section 4 explains the methodology followed to predict students’ academic performance and to identify the important features that affect it. Section 5 discusses the experimental results in comparison with previous works. Section 6 presents the conclusion and limitations of this study, and Section 7 provides insights into future work.
II. LITERATURE REVIEW
In recent years, many researchers have studied whether features extracted from a Learning Management System (LMS) can be used as predictors of students’ academic performance. In [7], the researcher investigated which behavioral indicators from the LMS are important for predicting student outcomes in online courses, and identified indicators that reflect regular study as features that can be used for prediction. Other researchers investigated the impact of students’ participation in an online discussion forum on their academic performance [8, 9]. In [8], the authors used qualitative, quantitative, and social-network forum indicators to predict student performance, while in [9] students’ performance was
predicted based on participation in the discussion forum and the students’ academic records (e.g., assignments, quizzes, and exams).
Moreover, the impact of students’ online activity on their academic performance has been studied in different forms. In [10], the researcher examined four features of student activity on Moodle: the number of viewed files, exchanged messages, completed quizzes, and pieces of content created by the student. In [12], the performance of 22 students was predicted based on their academic records and the time they spent on Moodle. In [11], researchers considered online assessment data as indicators of student activity, examining activity data from the LMS in the form of assessments and exams to improve student engagement in blended learning. Another study predicted students’ performance based on enrollment data and activity on the LMS [6]; it considered the heterogeneity of different student sub-groups, predicting performance from important enrollment data (e.g., gender and attendance type) and the level of online activity.
Other studies examined different feature sets for predicting student academic performance rather than using all features in the dataset. In [13], the authors investigated the influence of feature sets such as course features, student features, behavioral features, and past performance in the course. In [14, 15], the authors examined the impact of demographic, academic, and behavioral feature sets, along with extra features related to parents’ participation in the learning process and student absence days. Furthermore, other researchers proposed the use of sub-groups (or sub-datasets) to construct effective prediction models: in [6], the students’ dataset was divided into sub-datasets using enrollment and activity data to predict academic performance, and in [16] sub-datasets built from enrollment, first-term, and second-term data were used to predict student dropout at academic institutions.
Many works have employed feature selection algorithms to create effective classification models by excluding irrelevant and redundant features from the dataset [9, 3, 17, 18]. Feature selection algorithms can be divided into two basic groups: filter and wrapper methods. Different feature selection algorithms have been applied and compared in past works. In [19], a comparative study evaluated the performance of a classification algorithm before and after applying filter-based methods. In [20], the performance of different filter-based methods was compared for predicting students’ performance in the final exam, and in [21] researchers evaluated and compared different filter and wrapper methods on a dataset gathered for predicting students’ grades in the final examination.
Classification is one of the most common data mining techniques for prediction. It is a supervised learning process that predicts the class label of a target variable for a given dataset. In a classification model, the dataset is partitioned into two sets: a training set for the learning process and a test set for evaluating the classifier. Several classification algorithms have been used in previous works to predict students’ academic performance [22]. In this research, five classification algorithms are used: Decision Tree (J48) [23], Random Forest (RF) [24, 2, 12], Sequential Minimal Optimization (SMO) [13], Multilayer Perceptron (MLP) [25], and Logistic Regression (Logistic) [26]. These algorithms were chosen because of their demonstrated effectiveness for predicting students’ performance in previous works.
Decision Tree (J48) is widely used for classification. J48 implements the C4.5 algorithm for constructing a decision tree. The model resembles a tree structure and consists of three types of nodes: root, internal, and leaf nodes. The method recursively partitions the training set into subsets using the best features selected by a merit criterion until termination, which occurs when all instances in a subset belong to one class label [27]. Random Forest (RF) constructs multiple decision trees instead of a single tree. The trees are built from different samples and features selected randomly from the dataset to form the forest; the prediction of each tree is collected, and the final result is selected by a voting process [28, 29].
Sequential Minimal Optimization (SMO) is an optimization technique for training a support vector machine (SVM) [6]. SMO performs classification by finding the linear hyperplane that best separates the classes. It can also handle non-linear classification problems by using the kernel technique to map the low-dimensional data space into a higher dimension in which the data become separable [30]. Logistic regression (Logistic) is used for prediction in classification problems based on the concept of probability. It differs from linear regression by using the logistic function instead of a linear function to map predictions to probabilities; the probability of a binary dependent variable is predicted from a set of independent variables [30, 29]. Multilayer Perceptron (MLP) is a multilayer network of interconnected neurons arranged in three layers: input, hidden, and output. MLP uses the sigmoid function in the hidden and output layers to predict probabilities [30], and it learns during training by iteratively adjusting the weights with backpropagation until a sufficiently good output is attained [31].
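The paper runs these five algorithms in WEKA. As a hedged illustration only, their closest scikit-learn equivalents can be set up as follows; the synthetic dataset merely stands in for the non-public student data, and the class names are scikit-learn's, not WEKA's:

```python
# Sketch: scikit-learn analogues of the five WEKA classifiers used
# in the paper (J48 ~ DecisionTreeClassifier, SMO ~ linear SVC, etc.).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 241 records, imbalanced classes like the paper's data.
X, y = make_classification(n_samples=241, n_features=19,
                           weights=[0.2, 0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

classifiers = {
    "J48 (C4.5-like)": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SMO (SVM)": SVC(kernel="linear"),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "Logistic": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: held-out accuracy = {acc:.3f}")
```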
Previous works have studied the impact of different features on students’ academic performance, but few have focused on the combined impact of assessment and activity data. Moreover, most previous works used the whole dataset to construct the prediction models; such comprehensive models may be unhelpful for identifying the effect of individual feature groups on student performance. This work therefore contributes by investigating the impact of assessment and activity data both jointly and separately. Sub-datasets, as used in [6], are employed to create prediction sub-models instead of a single model on the whole dataset. This work differs from [6] by studying other features related to students’ assessment data and their online activity in the form of course access and mobile course access measurements. Additionally, two different families of feature selection methods are applied to identify the most important features affecting students’ academic performance. Finally, the performance of the created prediction models is evaluated and compared.
III. BACKGROUND
A. Imbalanced Class Distribution
An imbalanced class distribution occurs in a classification problem when the number of instances in one class is significantly lower than in the other. The class with few instances is called the minority class, while the class with many instances is the majority class. Machine learning algorithms perform best when the classes in the dataset are approximately balanced; applying them to an unbalanced dataset biases the results toward the majority class [30]. Several solutions have been proposed in previous works to handle imbalance in a dataset [32]. This research examined feature selection and sampling algorithms as ways of addressing the imbalance problem.
B. Feature Selection
Feature selection is considered one of the most important data pre-processing steps and is frequently used in previous works to identify the relevant features as a subset of the original features in the dataset [3]. The subset produced by feature selection allows classifiers to reach optimal performance and can also help with imbalanced class distributions [32, 33]. This research considered two families of feature selection methods: filter and wrapper. A filter method uses a ranking technique to score the features; the highly ranked features are passed to the classifier, while the remaining features are excluded from the dataset [3]. A wrapper method instead selects a subset of features using an induction algorithm as a "black box" to search for a good subset [20]; the accuracy of the induced model, estimated with standard accuracy-estimation techniques, guides the search.
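As a hedged illustration of the two families (not the paper's WEKA setup), a filter method and a wrapper method can be sketched with scikit-learn on synthetic stand-in data:

```python
# Sketch: filter selection (rank features by a score) versus wrapper
# selection (a classifier scores candidate subsets, analogous to
# Wrapper-J48 with a decision tree as the "black box").
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=241, n_features=19,
                           n_informative=6, random_state=0)

# Filter: rank every feature by mutual information, keep the top six.
filt = SelectKBest(mutual_info_classif, k=6).fit(X, y)
print("filter picks :", sorted(filt.get_support(indices=True)))

# Wrapper: greedy forward search, scored by cross-validated tree accuracy.
wrap = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                 n_features_to_select=6, cv=5).fit(X, y)
print("wrapper picks:", sorted(wrap.get_support(indices=True)))
```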
C. Sampling
Sampling (or resampling) is a technique that artificially resamples the dataset to balance the number of instances in the classes [34]. It is a data pre-processing step and can be carried out in two ways: under-sampling the majority class or over-sampling the minority class.
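The two resampling directions can be sketched with scikit-learn's `resample` utility on a synthetic imbalanced dataset (an illustration only; the paper used WEKA's Resample, SpreadSubsample, and SMOTE filters):

```python
# Sketch: random over-sampling of the minority class and random
# under-sampling of the majority class.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
y = np.array([0] * 40 + [1] * 200)       # imbalanced: 40 minority vs 200 majority
X = rng.normal(size=(240, 3))

minority, majority = X[y == 0], X[y == 1]

# Over-sampling: draw minority instances with replacement up to 200.
minority_up = resample(minority, replace=True, n_samples=200, random_state=0)

# Under-sampling: drop majority instances down to 40 without replacement.
majority_down = resample(majority, replace=False, n_samples=40, random_state=0)

print(len(minority_up), len(majority_down))
```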
D. Environment
All algorithms used in this research were run in the Waikato Environment for Knowledge Analysis (WEKA), developed at the University of Waikato in New Zealand [31]. WEKA is a Java-based software tool that provides numerous algorithms for machine learning and data mining applications.
E. Performance Evaluation Measures
This research used several evaluation measures from the literature to evaluate and compare the performance of the classification models, given in (1) through (7) below. TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative instances.

Accuracy [35]: the most common measure of classifier performance; the ratio of correctly classified instances to the total number of instances.

    Accuracy = (TP + TN) / (TP + TN + FP + FN)                        (1)

Precision [36]: evaluates model exactness; the ratio of true positive instances among all instances classified as positive by the classifier.

    Precision = TP / (TP + FP)                                        (2)

Recall [36]: evaluates model completeness; the ratio of positive instances correctly classified by the classifier.

    Recall = TP / (TP + FN)                                           (3)

F-measure [36]: the harmonic mean of precision and recall; commonly used to compare the performance of different classifiers.

    F-measure = (2 x Precision x Recall) / (Precision + Recall)       (4)

Area under the ROC curve (AUC) [35]: evaluates the ability of a classification model to distinguish between classes; it summarizes the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) as the area under the ROC curve.

    AUC = integral of TPR d(FPR), taken over FPR from 0 to 1          (5)

Kappa value [6, 29]: measures the accuracy of the classifier relative to the expected accuracy of a random classifier, where P0 is the observed accuracy and Pe the expected chance accuracy.

    Kappa = (P0 - Pe) / (1 - Pe)                                      (6)

Root Mean Squared Error (RMSE) [17]: compares prediction errors by evaluating the difference between the actual value y_i and the predicted value yhat_i over N instances.

    RMSE = sqrt( (1/N) * sum_i (y_i - yhat_i)^2 )                     (7)
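On toy labels and scores, all seven measures can be computed with scikit-learn (a hedged sketch; the paper computed them inside WEKA):

```python
# Sketch: the seven evaluation measures on hand-made predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             mean_squared_error, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])           # actual classes
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])           # predicted classes
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.2, 0.3, 0.6, 0.1])  # P(class 1)

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP+TN)/total
print("precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("f-measure:", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))     # needs scores, not labels
print("kappa    :", cohen_kappa_score(y_true, y_pred))
print("RMSE     :", np.sqrt(mean_squared_error(y_true, y_prob)))
```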
IV. RESEARCH METHODOLOGY
To predict students’ academic performance, the methodology suggested in this research follows five main phases: data collection, data pre-processing, sub-dataset generation, classification algorithm application, and evaluation (see Fig. 1).
A. Dataset
The student data used in this research were obtained from the Deanship of E-Learning and Distance Education at King Abdulaziz University. The data comprise 241 records of undergraduate students gathered from six courses delivered from 2017 to 2019 in the Department of Information Systems, Faculty of Computing and Information Technology. The data include assessment grades and activity data on Blackboard. All student data were extracted from the Learning Management System (LMS) into several Excel files: one file for students’ activities on Blackboard and 26 files for the assessment grade data.
Fig. 1. Method Suggested for Predicting Students’ Academic Performance.
During the data collection process, the file containing students’ activity and their IDs was merged with the files containing student IDs and the corresponding assessment grades. The data were then cleaned by deleting fields with few entries or only zero values. After that, data transformation was performed; this critical step converts the data from the format of the source file to the format of the destination file. In this case, the resulting Excel file was converted into CSV format and then into ARFF format to be compatible with the WEKA data mining tool.
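The CSV-to-ARFF step can be sketched in a few lines of Python (WEKA can also do the conversion itself). The column names below come from Table I, but the helper `csv_to_arff` is a hypothetical name, not part of the paper's pipeline:

```python
# Sketch: convert a CSV string with numeric features and a nominal
# class column into WEKA's ARFF format.
import csv
import io

def csv_to_arff(csv_text, relation, class_values):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for name in header[:-1]:                       # all but last are numeric
        lines.append(f"@attribute {name} numeric")
    lines.append(f"@attribute {header[-1]} {{{','.join(class_values)}}}")
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

sample = "Assign_Mark,Final_Exam,Class\n9,38,High\n3,21,Low\n"
print(csv_to_arff(sample, "students", ["Low", "High"]))
```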
The features extracted from the LMS include students’ assessment grades and measurements of their online activity on Blackboard. These features fall into three major groups: assessment grades, course access measurements, and mobile course access measurements. The features and their types are described in Table I.
At King Abdulaziz University (KAU), student performance is assessed using a course grading system in which each course is marked out of 100, distributed over the midterm exams, the final exam, and course work (e.g., quizzes, assignments, projects, and lab work). The final mark earned by a student in the course corresponds to a letter grade [37]. Hence, in this classification problem, students are classified as low performing if they earned grades D+, D, or F, and high performing if they earned grades A+, A, B+, B, C+, or C in the course.
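The labelling rule above can be sketched as a small function (hedged; the names below are illustrative, not taken from the paper's pipeline):

```python
# Sketch: map a KAU letter grade to the binary performance class
# used as the target variable in this study.
LOW = {"D+", "D", "F"}
HIGH = {"A+", "A", "B+", "B", "C+", "C"}

def performance_class(grade):
    if grade in HIGH:
        return "High"
    if grade in LOW:
        return "Low"
    raise ValueError(f"unknown grade: {grade!r}")

print([performance_class(g) for g in ["A", "C+", "D", "F"]])
# ['High', 'High', 'Low', 'Low']
```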
B. Data Pre-Processing
Data pre-processing is an essential phase before applying the classification algorithms. In this research, the pre-processing phase comprises two steps: feature selection and sampling. The results of the feature selection and sampling algorithms are then compared to find the better approach for handling the imbalance in the dataset and enhancing the accuracy of the classification algorithms.
Feature selection is applied to select the subset of features with the greatest impact on student academic performance. Moreover, the subset produced by feature selection allows classifiers to reach optimal performance and can help with the imbalanced class distribution in the dataset [32, 33]. Therefore, six different filter and wrapper methods were applied to the student dataset. Three filter methods were applied: Correlation Attribute Evaluation, Information Gain Attribute Evaluation, and CFS Subset Evaluation [19, 20]. In addition, three popular machine learning algorithms, Decision Tree (J48), Naive Bayes (NB), and K-Nearest Neighbor (IBK in WEKA), were used to implement the wrapper method [21]. The results of these six feature selection algorithms show that assessment grades are the most important features affecting student academic performance.

The Correlation and Information Gain algorithms give the same high ranking to six features: assignments mark, final exam, second midterm exam, lab mark, quizzes mark, and first midterm exam. The CFS subset method selects four highly influential features: assignments mark, quizzes mark, second midterm exam, and final exam.
TABLE I. DATASET DESCRIPTION WITH FEATURES AND THEIR TYPES

Category              Feature                  Description                                    Type
Assessment data       Assign_Mark              Assignments mark                               Numeric
                      Quiz_Mark                Quizzes mark                                   Numeric
                      MidTerm_1                First midterm exam mark                        Numeric
                      MidTerm_2                Second midterm exam mark                       Numeric
                      Lab_Mark                 Lab work mark                                  Numeric
                      Final_Exam               Final exam mark                                Numeric
Course access data    Crse_Access              Number of course accesses                      Numeric
                      Crse_Item_Access         Number of course item accesses                 Numeric
                      Crse_Interaction         Number of course interactions                  Numeric
                      Crse_Item_Interaction    Number of course item interactions             Numeric
                      Crse_Item_Mins           Time spent on course items in minutes          Numeric
                      Content_Access           Number of content accesses                     Numeric
                      Assessment_Access        Number of assessment accesses                  Numeric
Mobile course         Mob_Crse_Access          Number of course accesses via mobile           Numeric
access data           Mob_Crse_Item_Access     Number of course item accesses via mobile      Numeric
                      Mob_Crse_Interaction     Number of course interactions via mobile       Numeric
                      Mob_Crse_Access_Mins     Time spent on course access via mobile,        Numeric
                                               in minutes
                      Mob_Content_Access       Number of content accesses via mobile          Numeric
                      Mob_Assessment_Access    Number of assessment accesses via mobile       Numeric
The subsets produced by the wrapper methods show that the Wrapper-J48 algorithm selects two features: assignments mark and final exam. The Wrapper-NB subset includes four features: assignments mark, first midterm exam, final exam, and assessment access. The Wrapper-IBK algorithm selects only one feature, the assignments mark, as the most important feature.

For sampling, three algorithms were applied to the student dataset: random over-sampling of the minority class (Resample), random under-sampling of the majority class (SpreadSubsample), and the synthetic minority over-sampling technique (SMOTE), which have been used in [30, 34].
1) Comparison and evaluation results: To compare the feature selection and sampling algorithms, each was applied together with the five classification algorithms J48, RF, SMO, MLP, and Logistic. The performance of these algorithms was evaluated and compared using 10-fold cross-validation and the accuracy metric. The evaluation and comparison results are presented in Table II and Fig. 2.

The results in Table II and Fig. 2 show that both feature selection and sampling algorithms improve the performance of the classifiers. Among the feature selection algorithms, the subset produced by Wrapper-J48 attains the highest accuracy, 98.42, when classified with the J48 algorithm, while the CFS subset obtains the second highest accuracy, 97.38, when classified with MLP. Also, Wrapper-IBK achieves the highest accuracy for the RF algorithm (97.10), the CFS subset achieves the highest accuracy for Logistic (95.27), and Wrapper-NB achieves the highest accuracy for SMO (93.53).
TABLE II. ACCURACY RESULTS FOR FEATURE SELECTION AND SAMPLING ALGORITHMS

Produced dataset      J48      RF       SMO      MLP      Logistic
Original dataset      95.51    96.35    93.32    95.31    93.16

Feature selection
Correlation           96.01    96.76    93.40    96.88    95.02
InfoGain              96.01    96.76    93.40    96.88    95.02
CFS Subset            96.05    96.46    93.28    97.38    95.27
Wrapper-J48           98.42    97.05    93.28    97.09    94.19
Wrapper-NB            97.25    96.67    93.53    95.84    94.65
Wrapper-IBK           95.35    97.10    92.87    95.43    92.83

Sampling
Resample              98.75    99.17    -        98.08    97.04
SpreadSubsample       -        -        -        -        -
SMOTE                 -        -        96.17    -        -
Fig. 2. Accuracy Results for Feature Selection and Sampling Algorithms.
Thus, no single feature selection algorithm obtains the best accuracy for all classifiers. However, the subset produced by Wrapper-J48 performs better than the other subsets, achieving accuracy above 97.00 when classified with J48, RF, and MLP.

The sampling results in Table II and Fig. 2 show that the Resample algorithm obtains the highest accuracy values of 98.75, 99.17, 98.08, and 97.04 for J48, RF, MLP, and Logistic, respectively, while SMO obtains its highest accuracy, 96.17, with SMOTE. The SpreadSubsample algorithm obtains the worst accuracy results, even worse than the original dataset.

Across both feature selection and sampling, the Resample algorithm achieves the highest accuracy with all classifiers except SMO, which performs best with SMOTE; SMO with Resample does not perform poorly, but its performance is better with SMOTE.

Therefore, the Resample algorithm is used to balance the dataset and create more accurate prediction models of students’ performance. SMO is excluded, and the remaining four classification algorithms, J48, RF, MLP, and Logistic, are used to create the prediction models of students’ academic performance.
C. Generate Sub-Datasets
To investigate the impact of student assessment grades and activity data jointly and separately, the students’ dataset is partitioned into six sub-datasets based on the three major feature groups in Table I. The generated sub-datasets are described in Table III.
D. Predicting Students’ Academic Performance
After resampling and generating the sub-datasets, a sub-model is constructed on each sub-dataset listed in Table III. Additionally, a base model is constructed using all features so that the performance of the sub-models can be compared against it. These prediction models are created with the four classification algorithms J48, RF, MLP, and Logistic.
TABLE III. STUDENT DATASET AND SUB-DATASET DESCRIPTIONS

Dataset / sub-dataset                   Description
All features                            Full student dataset, including assessment grades and Blackboard activity data.
Assessment only                         Assessment grades without Blackboard activity data.
Assessment + course access              Assessment grades with course access measurements.
Assessment + mobile course access       Assessment grades with mobile course access measurements.
Course access + mobile course access    All Blackboard activity data without assessment grades.
Course access only                      Course access measurements only.
Mobile course access only               Mobile course access measurements only.
The performance of the base model and sub-models is evaluated and compared using several evaluation measures, as in [6]. Models are trained and tested with the 10-fold cross-validation method [3]: the dataset is divided into ten equal subsets, and the process runs ten times, each time training on 90% of the instances and testing on the remaining 10%, with a different test subset in each iteration. The average over the ten runs is reported as the final result.
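The 10-fold protocol described above can be sketched with scikit-learn (an illustration on synthetic stand-in data, not the paper's WEKA run):

```python
# Sketch: 10-fold cross-validation. Each fold trains on 90% of the
# instances and tests on the held-out 10%; the mean score is reported.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=250, n_features=19, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
sizes = [(len(tr), len(te)) for tr, te in cv.split(X, y)]
print("first fold (train, test):", sizes[0])   # (225, 25)

scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, scoring="accuracy")
print("mean 10-fold accuracy:", round(scores.mean(), 3))
```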
E. Results
To evaluate and compare the performance of the prediction models, precision, recall, F-measure, Kappa value, and area under the ROC curve (AUC) are measured first; then, as complements to these measures, accuracy and the root mean squared error (RMSE) are measured [6]. All evaluation measures are computed with the 10-fold cross-validation method for all classifiers, and the best results are boldfaced.
1) Evaluating sub-models against the base model on precision, recall, F-measure, Kappa value, and AUC: Table IV shows the results of precision, recall, F-measure, Kappa value, and area under the ROC curve (AUC) achieved by the J48, RF, MLP, and Logistic classifiers for the base model and the sub-models. The results in Table IV show that the sub-models created with the "assessment only", "assessment + course access", and "assessment + mobile course access" features and the base model achieve the same high performance with the random forest and J48 classifiers. For these models, random forest achieves the highest results of 0.99, 0.98, and 1 for F-measure, Kappa, and AUC, respectively, while J48 achieves the second highest results of 0.99, 0.98, and 0.99.
TABLE IV. RESULTS OF FIVE EVALUATION METRICS (PRECISION, RECALL, F-MEASURE, KAPPA, AND AUC) FOR ALL DATASETS WITH THE FOUR CLASSIFICATION ALGORITHMS

J48
Dataset                                 Precision   Recall   F-measure   Kappa   AUC
All features                            -           -        -           -       0.99
Assessment only                         -           -        -           -       0.99
Assessment + course access              -           -        -           -       0.99
Assessment + mobile course access       -           -        -           -       0.99
Course access + mobile course access    -           -        -           -       0.94
Course access only                      -           -        -           -       0.93
Mobile course access only               -           -        -           -       0.88

RF
All features                            -           -        -           -       1
Assessment only                         -           -        -           -       1
Assessment + course access              -           -        -           -       1
Assessment + mobile course access       -           -        -           -       1
Course access + mobile course access    -           -        -           -       1
Course access only                      -           -        -           -       0.99
Mobile course access only               -           -        -           -       0.95

MLP
All features                            -           -        -           -       0.99
Assessment only                         -           -        -           -       0.99
Assessment + course access              -           -        -           -       0.98
Assessment + mobile course access       -           -        -           -       0.99
Course access + mobile course access    -           -        -           -       0.88
Course access only                      -           -        -           -       0.87
Mobile course access only               -           -        -           -       0.81

Logistic
All features                            -           -        -           -       0.97
Assessment only                         -           -        -           -       0.98
Assessment + course access              -           -        -           -       0.98
Assessment + mobile course access       -           -        -           -       0.97
Course access + mobile course access    -           -        -           -       0.80
Course access only                      -           -        -           -       0.75
Mobile course access only               -           -        -           -       0.69
Moreover, the sub-model that generated based on
"assessment + mobile course access" features outperforms to
its base model and other sub-models when using MLP and
Logistic algorithms. This sub-model with MLP algorithm
achieves results higher than other models in terms of precision
and kappa value of 0.99 and 0.97 respectively. Also, this sub-
model with logistic algorithm obtains results higher than other
models in terms of precision, recall, f-measure and kappa value
of 0.98, 0.98, 0.98 and 0.95, respectively.
However, the sub-models built from activity data only ("course access only", "mobile course access only", and "course access + mobile course access") perform worse than the base model. The base model outperforms these activity-only sub-models, achieving values above 0.97, 0.94, and 0.97 for F-measure, kappa, and AUC, respectively, across all classifiers.
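The kappa statistic used throughout these comparisons can be illustrated with a short sketch; the confusion-matrix counts below are invented for the example, not taken from the study's data.

```python
# Minimal sketch (toy counts, not the study's data): deriving observed
# agreement (accuracy) and Cohen's kappa from a 2x2 confusion matrix
# for a binary pass/fail prediction.
tp, fn, fp, tn = 50, 5, 3, 42
n = tp + fn + fp + tn

observed = (tp + tn) / n                  # observed agreement = accuracy
# agreement expected by chance, from the row/column marginal totals
p_pass = ((tp + fn) / n) * ((tp + fp) / n)
p_fail = ((tn + fp) / n) * ((tn + fn) / n)
expected = p_pass + p_fail

kappa = (observed - expected) / (1 - expected)
print(round(observed, 2), round(kappa, 3))   # 0.92 0.839
```

A kappa near 0.94, as the base model achieves, therefore means agreement far beyond what the class distribution alone would produce by chance.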
Among the activity-only sub-models, the "course access only" sub-model with the random forest classifier achieves the best results, obtaining 0.97, 0.94, and 0.99 for F-measure, kappa, and AUC, respectively. It is followed by the sub-model built from all activity data ("course access + mobile course access") with random forest, which outperforms the activity sub-models created with the J48, MLP, and Logistic algorithms, obtaining 0.96, 0.93, and 1.00 for F-measure, kappa, and AUC, respectively.
2) Evaluate and compare the performance of sub-models to the base model based on accuracy and root mean squared error (RMSE): Table V shows the accuracy and RMSE achieved by the J48, RF, MLP, and Logistic classifiers for the base model and the sub-models.
The results in Table V show that the base model, built from "all features" and classified with random forest, is superior to all other classifiers and models, achieving the highest accuracy of 99.17. In addition, with random forest, the sub-models built from the "assessment only", "assessment + course access", and "assessment + mobile course access" features obtain accuracy values close to 99.00 and a low RMSE of 0.06.
However, the two sub-models based on the "assessment only" and "assessment + mobile course access" features with the J48 classifier outperform their base model and the other sub-models by achieving the lowest RMSE of 0.04 with an accuracy of 98.92. Moreover, the sub-model built from the "assessment + mobile course access" features with the MLP classifier outperforms its base model and the other sub-models, with a higher accuracy of 98.38 and a lower RMSE of 0.07. Likewise, the sub-model based on the same features with the Logistic classifier outperforms its base model, achieving a higher accuracy of 97.54.
TABLE V. RESULTS OF ACCURACY AND ROOT MEAN SQUARED ERROR (RMSE) FOR ALL DATASETS WITH THE FOUR CLASSIFICATION ALGORITHMS

Classifier  Dataset                               Accuracy  RMSE
J48         All features                          98.75     0.05
            Assessment only                       98.92     0.04
            Assessment + course access            98.75     0.05
            Assessment + mobile course access     98.92     0.04
            Course access + mobile course access  93.13     0.23
            Course access only                    91.42     0.27
            Mobile course access only             81.13     0.35
RF          All features                          99.17     0.07
            Assessment only                       99.00     0.06
            Assessment + course access            99.04     0.06
            Assessment + mobile course access     99.00     0.06
            Course access + mobile course access  96.33     0.17
            Course access only                    96.75     0.18
            Mobile course access only             84.58     0.30
MLP         All features                          98.08     0.08
            Assessment only                       97.25     0.12
            Assessment + course access            97.71     0.11
            Assessment + mobile course access     98.38     0.07
            Course access + mobile course access  86.21     0.33
            Course access only                    84.63     0.34
            Mobile course access only             71.00     0.42
Logistic    All features                          97.04     0.12
            Assessment only                       90.92     0.20
            Assessment + course access            97.08     0.16
            Assessment + mobile course access     97.54     0.13
            Course access + mobile course access  75.00     0.43
            Course access only                    71.21     0.45
            Mobile course access only             62.79     0.47
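The two measures in Table V can be reproduced with a short sketch on invented values: accuracy counts correct decisions at a 0.5 probability threshold, while the RMSE reported by WEKA-style classifier evaluation is computed over the predicted class probabilities rather than the hard labels.

```python
# Hedged sketch of Table V's measures on toy data (not the study's values):
# accuracy from thresholded predictions, and RMSE over the predicted
# probability of the positive class (for two classes this matches the
# probability-based RMSE WEKA reports for classifiers).
import math

y_true = [1, 0, 1, 1, 0, 1]              # actual class (1 = pass, 0 = fail)
p_pass = [0.9, 0.2, 0.6, 0.8, 0.4, 0.3]  # predicted probability of "pass"

correct = sum((p >= 0.5) == (t == 1) for t, p in zip(y_true, p_pass))
accuracy = 100 * correct / len(y_true)   # Table V reports accuracy in percent

rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, p_pass))
                 / len(y_true))
print(round(accuracy, 2), round(rmse, 2))
```

This also shows why the two columns can disagree slightly, as for RF in Table V: a model can classify almost every instance correctly yet still accumulate RMSE from probabilities that sit away from 0 or 1.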
V. DISCUSSION
This research investigated the impact of assessment and activity data on students' academic performance. To this end, the students' dataset was analyzed using different feature selection algorithms to identify the important features affecting academic performance. The base model and sub-models based on assessment and activity features, jointly and separately, were then constructed. Finally, the performance of the classification algorithms was compared to find the best algorithm for classifying student performance.
The feature selection results revealed that the important features affecting student performance are the assessment grades, especially the assignment marks and the final exam. This corroborates the findings of [11, 13], which concluded that student performance is significantly influenced by assessment data. In [13], four feature sets were compared: student characteristics, course characteristics, LMS features, and past performance including assessment grades. Their results demonstrated that student characteristics and assessment grades had a larger impact on student performance than the other feature sets, while the authors of [11] found a strong correlation between assessment and examination grades and students' final grades.
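The intuition behind such feature rankings can be sketched with information gain, one common selection criterion; the records below are invented for illustration and are not the study's dataset, which was processed with WEKA's feature selection algorithms.

```python
import math

# Invented toy records: (assignment level, course-access level, pass/fail).
records = [("high", "yes", 1), ("high", "no", 1), ("high", "yes", 1),
           ("low", "yes", 0), ("low", "no", 0), ("low", "no", 0)]
labels = [r[2] for r in records]

def entropy(ys):
    n = len(ys)
    return -sum((ys.count(v) / n) * math.log2(ys.count(v) / n)
                for v in set(ys))

def info_gain(col):
    # class entropy minus the weighted entropy after splitting on column col
    gain = entropy(labels)
    for v in set(r[col] for r in records):
        subset = [r[2] for r in records if r[col] == v]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

# assignment separates pass/fail perfectly; course access barely helps
print(round(info_gain(0), 3), round(info_gain(1), 3))
```

A feature that cleanly separates the classes, like the assignment level here, scores near the full class entropy, which mirrors why assessment grades dominate the ranking while activity features score lower.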
For the base and sub-models, the experimental results showed that the base model built from the "all features" dataset and classified with the random forest algorithm outperforms the other prediction models, obtaining the best results on all performance evaluation measures, notably the best accuracy of 99.17. Moreover, sub-models that include the assessment grades, separately or jointly with activity data, obtain better results than sub-models that rely on the activity data alone. Additionally, the findings reveal that the sub-model built from the "assessment + mobile course access" features performs well in predicting student academic performance.
The researchers in [6] reported that their sub-models outperformed the base model, and highlighted the effectiveness of using students' sub-datasets to predict academic performance. The present results support this to some extent, in terms of the usefulness of investigating sub-datasets to predict students' academic performance and to assess the impact of different features on their success in courses. However, the results revealed that the base model based on all features and the sub-models that include assessment data, separately or jointly with activity features, all achieved high performance, which indicates that both the base model and the sub-models perform well in predicting students' academic performance.
Regarding the impact of assessment and activity data, the results showed that prediction models including assessment grades, separately or jointly with activity data, yield superior results compared to models based on activity data alone. This indicates that assessment grades significantly affect students' performance, while activity data alone have less impact. This corroborates the findings of the researchers in [11], who revealed a strong relationship between students' online activities in the form of assessments and exams and their final course grades. Their finding indicates the importance of assessment data in predicting students' achievement in a course, as well as the usefulness of investigating students' online activity to assess its impact on academic achievement.
Furthermore, the researchers in [13] reached a similar conclusion; they found that past performance in the course (including assessment grades) and student characteristics have a greater impact on student performance, while LMS features have a lower impact. The experimental results support this: activity data alone have a lower impact on student performance than the assessment grades. However, assessment and activity data together enhance the accuracy of the prediction model, which demonstrates the importance of including the assessment grades alongside activity data in a prediction model of students' academic performance.
However, the researchers in [11] concentrated on online assessments alone as indicators of student activity. Others, in [12], investigated only one feature of online activity data, namely the time a student spent on Moodle, although Moodle (or any LMS) provides many more features that can be investigated. Moreover, the dataset used in [12] included only 22 instances, a very small number compared to the datasets used in previous works. In contrast, this work studied more features of student online activity than those examined in [11, 12], using a dataset of 241 instances, and examined the impact of students' online activity in other forms, such as course access and mobile course access measurements, alongside the assessment grades.
For the classification algorithms, the experimental results revealed that the random forest algorithm performs better than the other classification algorithms. This accords with the findings reported in [12, 24, 2], which also found that random forest outperforms other classification algorithms for student performance prediction using different features such as personal, academic, and activity data. Moreover, in this experiment the random forest algorithm delivered the highest performance results for the base and sub-models, followed by the decision tree algorithm with the second-highest results. As random forest does not provide interpretable results, the decision tree can be considered more useful.
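To make the interpretability point concrete, the sketch below uses scikit-learn rather than the WEKA tooling used in the study, with invented features and marks: a single decision tree can be printed as readable if/then rules, something the hundreds of trees in a random forest cannot offer.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy records: [assignment mark, final exam mark] -> pass (1) / fail (0)
X = [[9, 80], [8, 75], [7, 70], [3, 40], [2, 35], [4, 45]]
y = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as human-readable if/then rules,
# e.g. a single threshold on one feature separating pass from fail
rules = export_text(tree, feature_names=["assignment", "final_exam"])
print(rules)
```

An instructor can read such rules directly (for example, which assignment-mark threshold predicts failure), which is the practical advantage the decision tree retains despite its slightly lower accuracy.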
VI. CONCLUSION
This research investigated the impact of assessment and activity data on students' academic performance. For this purpose, different feature selection algorithms were used to identify the important features affecting students' academic performance, and prediction models were constructed from assessment and activity data, jointly and separately, using four classification algorithms: decision tree, random forest, multilayer perceptron, and logistic regression.
The feature selection results revealed that the most important features affecting student academic performance are the assessment data, especially the assignment marks and the final exam. For the prediction models, the results demonstrated that both the base model and the sub-models perform well in predicting students' academic performance. Random forest outperformed the other classifiers, achieving the highest accuracy for both the base model and the sub-models, followed by the decision tree. As random forest does not provide understandable output, the decision tree can be considered more useful.
Furthermore, prediction models that included assessment data, separately or jointly with activity data, performed better than models based on activity data alone. This indicates that assessment data significantly affect student performance, while activity data have a lower impact. However, assessment and activity data together enhance the accuracy of the prediction model, so it is important to include assessment data alongside activity data in a prediction model of students' academic performance.
However, certain limitations were observed in this research. The experiment was conducted using data of students from a single department at one faculty, and the dataset had only 241 records and 19 features. The results might differ for a larger dataset with other features, and other data mining algorithms might achieve more accurate results.
VII. FUTURE WORK
In future work, this study can be extended to predict students' academic performance using data from other faculties and departments in order to generalize the results. Further work may also visualize and interpret the decision tree results to obtain understandable output that helps support low-performing students. Moreover, the same features can be used with other data mining techniques, such as regression to predict a student's final grade in a course, or association rule mining to detect relationships between students' final grades and their assessment and activity data.
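As a minimal sketch of the regression direction mentioned above, a least-squares line can map a single assessment feature to a final course grade; all numbers here are invented for illustration.

```python
# Minimal one-variable least-squares sketch: predict a final course grade
# from an assignment mark. All values are invented, not the study's data.
xs = [4, 5, 6, 7, 8, 9]          # assignment mark
ys = [50, 58, 61, 72, 80, 88]    # final course grade

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

predicted = slope * 7 + intercept   # grade predicted for an assignment mark of 7
print(round(slope, 2), round(intercept, 2), round(predicted, 1))
```

Unlike the classification models above, such a regression would output a numeric grade rather than a pass/fail class, which is the extension the future work proposes.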
REFERENCES
[1] A. AL-Malaise, A. Malibari and M. Alkhozae, “Students performance
prediction system using multi agent data mining technique”,
International Journal of Data Mining & Knowledge Management
Process, vol. 4, no. 5, pp. 01-20, 2014.
[2] S. Hussain, N. Abdulaziz Dahan, F. Ba-Alwi and N. Ribata,
“Educational data mining and analysis of students’ academic
performance using WEKA”, Indonesian Journal of Electrical
Engineering and Computer Science, vol. 9, no. 2, p. 447, 2018.
[3] E. Amrieh, T. Hamtini and I. Aljarah, “Mining educational data to
predict student’s academic performance using ensemble methods”,
International Journal of Database Theory and Application, vol. 9, no. 8,
pp. 119-136, 2016.
[4] P. Shayan and M. Zaanen, “Predicting student performance from their
behavior in learning management systems”, International Journal of
Information and Education Technology, vol. 9, no. 5, pp. 337-341, 2019.
[5] N. A. Yassein, R. G. M. Helali, and S. B. Mohomad, “Predicting student
academic performance in KSA using data mining techniques,” Journal
of Information Technology & Software Engineering, vol. 07, no. 05,
2017.
[6] S. Helal, J. Li, L. Liu, E. Ebrahimie, S. Dawson, D. Murray and Q.
Long, “Predicting academic performance by considering student
heterogeneity”, Knowledge-Based Systems, vol. 161, pp. 134-146, 2018.
[7] J. You, “Identifying significant indicators using LMS data to predict
course achievement in online learning”, The Internet and Higher
Education, vol. 29, pp. 23-30, 2016.
[8] C. Romero, M. López, J. Luna and S. Ventura, “Predicting students'
final performance from participation in on-line discussion forums”,
Computers & Education, vol. 68, pp. 458-472, 2013.
[9] A. Mueen, B. Zafar and U. Manzoor, “Modeling and predicting
students’ academic performance using data mining techniques”,
International Journal of Modern Education and Computer Science, vol.
8, no. 11, pp. 36-42, 2016.
[10] N. Z. Zacharis, “Classification and regression trees (CART) for
predictive modeling in blended learning” , International Journal of
Intelligent Systems and Applications, vol. 10, no. 3, pp. 19, 2018.
[11] M. Ayub, H. Toba, M. Wijanto and S. Yong, “Modelling online
assessment in management subjects through educational data mining”,
in 2017 International Conference on Data and Software Engineering
(ICoDSE), 2017.
[12] R. Hasan, S. Palaniappan, A. Abdul Raziff, S. Mahmood and K. Sarker,
“Student academic performance prediction by using decision tree
algorithm”, in 2018 4th International Conference on Computer and
Information Sciences (ICCOINS), Kuala Lumpur, Malaysia, 2018.
[13] R. Conijn, A. Kleingeld, U. Matzat, C. Snijders and M. Zaanen,
“Influence of course characteristics, student characteristics, and behavior
in learning management systems on student performance”, in 30th
Conference on Neural Information Processing Systems (NIPS 2016),
Barcelona, Spain, 2016.
[14] M. H. Rahman and M. R. Islam, “Predict Students Academic
Performance and Evaluate the Impact of Different Attributes on the
Performance Using Data Mining Techniques”, 2017 2nd International
Conference on Electrical & Electronic Engineering (ICEEE), 2017.
[15] B. Francis and S. Babu, “Predicting academic performance of students
using a hybrid data mining approach”, Journal of Medical Systems, vol.
43, no. 6, 2019.
[16] G. Bilquise, S. Abdallah and T. Kobbaey, “Predicting student retention
among a homogeneous population using data mining”, in Proceedings of
the International Conference on Advanced Intelligent Systems and
Informatics 2019. AISI 2019, 2020, pp. 35-46.
[17] S. Hussain, N. Abdulaziz Dahan, F. Ba-Alwi and N. Ribata,
“Educational data mining and analysis of students’ academic
performance using WEKA”, Indonesian Journal of Electrical
Engineering and Computer Science, vol. 9, no. 2, p. 447, 2018.
[18] P. Kumari, P. Jain and R. Pamula, “An efficient use of ensemble
methods to predict students academic performanc”, in 2018 4th
International Conference on Recent Advances in Information
Technology (RAIT), 2018, pp. 1-6.
[19] S. Gnanambal, M. Thangaraj, V. Meenatchi and V. Gayathri,
“Classification algorithms with attribute selection: an evaluation study
using WEKA”, Int. J. Advanced Networking and Applications, vol. 09,
no. 06, pp. 3640-3644, 2018.
[20] C. Anuradha and T. Velmurugan, “Performance evaluation of feature
selection algorithms in educational data mining”, International Journal
of Data Mining Techniques and Applications, vol. 5, no. 2, pp. 131-139,
2016.
[21] A. Acharya and D. Sinha, “Application of feature selection methods in
educational data mining”, International Journal of Computer
Applications, vol. 103, no. 2, pp. 34-38, 2014.
[22] A. Kumar, R. Selvam and K. Kumar, “Review on prediction algorithms
in educational data mining”, International Journal of Pure and Applied
Mathematics, vol. 118, no. 8, pp. 531-537, 2018.
[23] M. Al-Saleem, N. Al-Kathiry, S. Al-Osimi, and G. Badr, “Mining
educational data to predict students’ academic performance,” Machine
Learning and Data Mining in Pattern Recognition Lecture Notes in
Computer Science, pp. 403-414, 2015.
[24] K. T. S. Kasthuriarachchi, S. R. Liyanage, and C. M. Bhatt, “A data
mining approach to identify the factors affecting the academic success of
tertiary students in Sri Lanka”, Lecture Notes on Data Engineering and
Communications Technologies Software Data Engineering for Network
eLearning Environments, pp. 179-197, 2018.
[25] F. Widyahastuti and V. Tjhin, “Predicting students performance in final
examination using linear regression and multilayer perceptron”, in 2017
10th International Conference on Human System Interactions (HSI),
2017, pp. 188-192.
[26] M. Bucos and B. Drăgulescu, “Predicting student success using data
generated in traditional educational environments”, TEM Journal, vol. 7,
no. 3, pp. 617-625, 2018.
[27] N. Bhargav, G. Sharma, R. Bhargava and M. Mathuria, “Decision tree
analysis on J48 algorithm for data mining”, International Journal of
Advanced Research in Computer Science and Software Engineering,
vol. 3, no. 6, 2013.
[28] M. Khaldy and C. Kambhampati, “Resampling imbalanced class and the
effectiveness of feature selection methods for heart failure dataset”,
International Robotics & Automation Journal, vol. 4, no. 1, pp. 1-10,
2018.
[29] C. Zabriskie, J. Yang, S. DeVore and J. Stewart, “Using machine
learning to predict physics course outcomes”, Physical Review Physics
Education Research, vol. 15, no. 2, 2019.
[30] A. Verma, “Evaluation of classification algorithms with solutions to
class imbalance problem on bank marketing dataset using WEKA”,
International Research Journal of Engineering and Technology (IRJET),
vol. 06, no. 03, pp. 54-60, 2019.
[31] R. Arora and S. Suman, “Comparative analysis of classification
algorithms on different datasets using WEKA”, International Journal of
Computer Applications, vol. 54, no. 13, pp. 21-25, 2012.
[32] R. Longadg, S. S. Dongre and L. Malik, “Class imbalance problem in
data mining: review”, International Journal of Computer Science and
Network (IJCSN), vol. 2, no. 1, pp. 1305.1707, 2013.
[33] M. Koutina and K. Kermanidis, “Predicting postgraduate students’
performance using machine learning techniques”, Artificial intelligence
applications and innovations, pp. 159-168, 2011.
[34] P. Yıldırım, “Pattern classification with imbalanced and multiclass data
for the prediction of albendazole adverse event outcomes”, Procedia
Computer Science, vol. 83, pp. 1013-1018, 2016.
[35] A. Tharwat, “Classification assessment methods”, Applied Computing
and Informatics, 2018.
[36] M. Hossin and M. N. Sulaiman, “A review on evaluation metrics for data
classification evaluations”, International Journal of Data Mining &
Knowledge Management Process, vol. 5, no. 2, pp. 01-11, 2015.
[37] Undergraduate catalog, 2018, pp. 11-12. [Online]. Accessed: Mar. 26, 2020. Available: https://fcit.kau.edu.sa/aims/templates/catalog18-19-online1.pdf.
... The traditional method of evaluating students focuses only on students' past achievements, but this lacks the ability to predict students' future development. Therefore, it is necessary to change the way in which grades for future academic performance are predicted; this is one of the most important things that all educational institutions must seek to strengthen and develop within their education administration [3]. ...
... It is also possible to sort students into groups according to their level in such a way that an instructor can determine a way to deal with each group according to its level, as in [8] and [10]. Many authors have used data mining in prediction in different areas, including [1][2][3][4][5], [6], [11], [20][21] and [23]. ...
Article
In this study, the academic performance of students from the E-Commerce department at Palestine Technical University – Kadoorie is predicted using a Markov chains model and educational data mining. Based on the complete data regarding the achievements of the students from the 2016 cohort of students obtained from the university’s admissions and registration department, a Markov chain is built, in which the states are divided according to the semester average of the student, and the ratio of students in each state is calculated in the long run. The results obtained are compared with the data from the 2015 cohort, which demonstrates the efficiency of the Markov chains model. For educational data mining, the classification technique is applied, and the decision tree algorithm is used to predict the academic performance of the students, generalizing results with an accuracy of 41.67%.
... The supervised learning strategy is the one that is typically adopted in machine learning since it primarily relies on the many academic variables of students to forecast whether or not students will succeed [18]. Numerous supervised machine learning methods exist including random forests [19], support vector machines [20], naive bayes [21], K-nearest neighbors (KNN) [22], linear discriminant analysis, decision tree [23], logistic regression, classification and regression techniques, and so on. In the study of [19], the regression way known as logistic regression works effectively when the outcome is binary. ...
... Numerous supervised machine learning methods exist including random forests [19], support vector machines [20], naive bayes [21], K-nearest neighbors (KNN) [22], linear discriminant analysis, decision tree [23], logistic regression, classification and regression techniques, and so on. In the study of [19], the regression way known as logistic regression works effectively when the outcome is binary. The ultimate grade of the student, whether eligible or rejected, is predicted by several academic characteristics. ...
Conference Paper
Full-text available
Student performance prediction is a complex problem in which a computer predicts student's future performance while they engage in their studies. To enable effective educational interventions during a course, accurate early forecasts of a student's performance is essential. The review proposed a silent method and the latest data for capturing student behaviour is classroom video monitoring. For student performance prediction dynamics to be understood and subsequently improved, it is essential to understand students' attention spans and what kinds of behaviours may advise the absence of attention. The ability to detect student attention is one of the features of computer vision systems used to monitor classrooms. This review included the pertinent Educational Data Mining literature on identifying dropouts and students at risk from classroom data. For example, during class hours students are to concentrate, focus and give all their attention in class. A machine learning approach can be used to rate each student's concentration with the help of computer vision algorithms since most studies use student data from colleges such as quiz marks and exam scores. The assessment outcomes revealed that a variety of Machine Learning (ML) and computer vision techniques are developed to understand and address the main problems, including guessing about to-fail learners and student dropouts. It has been validated that machine learning systems are important for detecting at-risk students and dropout frequencies which help for improving student success. To find the basics that can have an impact on how well students perform academically machine learning algorithms can be used to study a student’s daily routine such as their study habits, social interactions, and extracurricular activities. Since it applies the most relevant advanced methods for each specific task, this student concentration score using computer vision offers the ideal answer for monitoring classrooms. 
The integration of facial recognition algorithms and machine learning provides an assuring approach for monitoring classroom engagement and enhancing educational interventions.
... However, most of these studies have been conducted using a logistic regression model. In recent years, many machine learning algorithms have been found to outperform the logistic regression models in terms of prediction (Alhassan et al., 2020). Therefore, machine learning-based methods have become potential tools for the study of the prediction of CT concepts using students' cognitive, affective, and demographic characteristics. ...
... In a comparison of several machine learning methods, Singh et al. (2016) showed that the decision tree model was more accurate than the Naïve Bayes model in predicting primary school students' academic performance. Alhassan et al. (2020) found that assignment marks and final examination grades were important features in predicting students' academic performance. Compared with other models such as decision trees, neural networks, and logistic regression models, they found that the random forest model was the most accurate in predicting student performance. ...
... The study emphasized Decision Trees and Random forests' common use in solving classification and regression issues. The investigation (Alhassan et al. 2020) highlighted the influence of assessment grades on academic performance, demonstrating that models incorporating these grades or combining them with activity data outperform those relying solely on activities. The Random Forest algorithm shows strong predictive abilities, closely followed by Decision Trees in forecasting student academic success. ...
Article
Full-text available
The precision of predicting student performance in Mathematics is pivotal for educational advancement, relying heavily on advanced machine learning (ML) methodologies. This predictive approach involves analyzing comprehensive datasets, emphasizing academic records, demographics, and various educational metrics, with a particular focus on Mathematics. Techniques like classification, regression analysis, decision trees, and neural networks yield highly accurate projections, aiding in timely interventions crucial for supporting students navigating mathematical complexities. These algorithms optimize resource allocation within educational institutions, identifying and aiding students requiring extra assistance in their mathematical pursuits. This study pioneers the enhancement of predictive capabilities in Mathematics through the innovative integration of the K-Nearest Neighbor Classification (KNNC) model with 2 novel optimization techniques: the Honey Badger Algorithm (HBA) and the Arithmetic Optimization Algorithm (AOA). By harnessing these cutting-edge ML and bio-inspired algorithms, the research is dedicated to pushing the boundaries of precision and reliability in forecasting, with a specific focus on elevating educational outcomes within the domain of Mathematics. The outcomes obtained for G1 and G3 reveal that the KNHB model exhibited outstanding performance in predicting and categorizing G3. It achieved remarkable Accuracy and Precision of 0.921 and 0.92, respectively. Moreover, the KNHB proved to be the most precise predictor for G1 value prediction, with Accuracy and Precision scores of 0.899% and 0.9%, respectively, in the prediction task.
... One prominent area of research revolves around the application of various machine learning algorithms for student performance prediction (Shetu et al., 2021;VeeraManickam et al., 2019;Xue & Niu, 2023). Another line of research focuses on feature selection techniques to improve prediction models in the field of Educational Data Mining (EDM) (Estrera et al., 2017;Ramaswami & Bhaskaran, 2009;Febro, 2019;Zaffar et al., 2018;Alhassan et al., 2020;Mythili & Shanavas, 2014). Despite the existing literature on student performance prediction and feature selection, certain limitations persist. ...
Article
Full-text available
This study addresses the crucial issue of predicting student performance in educational data mining (EDM) by proposing an Adaptive Dimensionality Reduction Algorithm (ADRA). ADRA efficiently reduces the dimensionality of student data, encompassing various academic, demographic, behavioral, social, and health-related features. It achieves this by iteratively selecting the most relevant features based on a combined normalized mean rank of five feature ranking methods. This reduction in dimensionality enhances the performance of predictive models and provides valuable insights into the key factors influencing student performance. The study evaluates ADRA using four different student performance datasets and six machine learning algorithms, comparing it to three existing dimensionality reduction methods. The results show that ADRA achieves an average dimensionality reduction factor of 6.2 while maintaing comprable accuracy with other mehtods.
... While the current landscape of educational data mining underscores the importance of integrating feature selection with prediction modeling, a noticeable research gap exists. Many studies have explored either feature selection or predictive modeling in isolation, leaving a dearth of comprehensive approaches that bridge these two aspects cohesively [46]. This research gap highlights the imperative need for innovative methodologies, such as the Adaptive Feature Selection Algorithm (AFSA), which holds the promise of amalgamating feature selection with predictive modeling. ...
Article
Full-text available
Educational Data Mining (EDM) is used to ameliorate the teaching and learning process by analyzing and classifying data that can be applied to predict the students’ academic performance, and students’ dropout rate, as well as instructors’ performance. The prediction of student performance is complicated by the vast and diverse range of variables from academic records to behavioral and health metrics. In this paper, we have introduced a new Adaptive Feature Selection Algorithm (AFSA) by amalgamating an ensemble approach for initial feature ranking with normalized mean ranking from five distinct methods to enhance robustness. The proposed method iteratively selects the best features by adjusting its threshold based on each feature’s rank to ensure significant contributions to model accuracy and also effectively reduces dataset complexity. We have tested the performance of the proposed feature selection algorithm using five machine learning classifiers: Logistic Regression (LR), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Naïve Bayes (NB) classifier, and Decision Tree (DT) classifier on four student performance datasets. The experimental results highlight the proposed method significantly decreases feature count by an average feature reduction factor of 5.7, significantly streamlining datasets while maintaining competitive cross-validation accuracy, marking it as a valuable tool in the field of educational data analytics.
... According to their analysis results, the random forest algorithm was recommended for predicting students' academic performance. [2] Ramraj S. et al. performed a quantitative comparison of the accuracy and speed of the XGBoost algorithm on a multi-threaded single system against Gradient Boosting, using different data sets. ...
Article
Full-text available
Education is an important factor in measuring a nation's wealth, and it directly affects the country's future development. All children at all levels in Sri Lanka are entitled to free education up to the university level. During that period, students must face two essential examinations to complete their senior secondary education. Under the Sri Lankan education schema, students select one subject stream to begin key stage 2 of their senior secondary education. Most students select that subject stream without thinking deeply, and that decision may or may not lead them to a good future. A prediction system called 'Subject Stream Prediction' predicts the appropriate subject stream for beginning senior secondary education based on students' previous examination results, their skills, and the preferred working area for their target career. If a student is not satisfied with one predicted answer, the model proposes ten appropriate subject streams, along with relevant jobs and the educational and technical qualifications needed for those careers, based on the above features. I performed a performance analysis of four machine learning algorithms, comparing their accuracy levels to select the algorithm best suited to predicting the suitable subject stream. That analysis demonstrates that the 'Random Forest Classifier' algorithm gives the highest accuracy (72).
Article
Full-text available
Accurately predicting student performance remains a significant challenge in the educational sector. Identifying students who need additional support early can significantly impact their academic outcomes. This study aims to develop an intelligent solution for predicting student performance using supervised machine learning algorithms, focusing on addressing the limitations of existing prediction models and enhancing prediction accuracy. This work employed three supervised machine learning algorithms: Random Forest, Extra Trees, and K-Nearest Neighbors. The research methodology comprised data collection, preprocessing, feature identification, model construction, and evaluation. The paper utilized a dataset comprising 24,000 training instances and 6,000 testing instances, applying various preprocessing techniques for data optimization. The Extra Trees algorithm achieved the highest accuracy (98.15%), followed by Random Forest (94.03%) and K-Nearest Neighbors (91.65%). All algorithms demonstrated high precision and recall. Notably, K-Nearest Neighbors exhibited exceptional computational efficiency, with a training time of 0.00 seconds. This study proposes an efficient model for predicting student performance. The high accuracy and efficiency of the proposed system highlight its potential for application in educational data mining. The findings aim to improve student success rates in educational institutions by enabling timely and appropriate interventions.
Article
Full-text available
The use of machine learning and data mining techniques across many disciplines has exploded in recent years with the field of educational data mining growing significantly in the past 15 years. In this study, random forest and logistic regression models were used to construct early warning models of student success in introductory calculus-based mechanics (Physics 1) and electricity and magnetism (Physics 2) courses at a large eastern land-grant university. By combining in-class variables such as homework grades with institutional variables such as cumulative GPA, we can predict if a student will receive less than a “B” in the course with 73% accuracy in Physics 1 and 81% accuracy in Physics 2 with only data available in the first week of class using logistic regression models. The institutional variables were critical for high accuracy in the first four weeks of the semester. In-class variables became more important only after the first in-semester examination was administered. The student’s cumulative college GPA was consistently the most important institutional variable. Homework grade became the most important in-class variable after the first week and consistently increased in importance as the semester progressed; homework grade became more important than cumulative GPA after the first in-semester examination. Demographic variables including gender, race or ethnicity, and first generation status were not important variables for predicting course grade.
Article
Full-text available
Data mining offers strong techniques for different sectors, including education. In the education field, research is developing rapidly due to the huge amount of student information, which can be used to discover valuable patterns pertaining to students' learning behavior. Educational institutions can utilize educational data mining to examine students' performance, supporting the institution in recognizing how students perform. In data mining, classification is a familiar technique that has been implemented widely to determine the performance of students. In this study, a new prediction algorithm for evaluating students' academic performance has been developed based on both classification and clustering techniques, and it has been tested on a real-time basis with student datasets from various academic disciplines of higher educational institutions in Kerala, India. The results prove that the hybrid algorithm combining clustering and classification approaches yields results that are far superior in terms of accuracy in predicting the academic performance of the students.
Article
Full-text available
Educational Data Mining (EDM) techniques offer unique opportunities to discover knowledge from data generated in educational environments. These techniques can assist tutors and researchers to predict future trends and behavior of students. This study examines the possibility of only using traditional, already available, course report data, generated over years by tutors, to apply EDM techniques. Based on five algorithms and two cross-validation methods we developed and evaluated five classification models in our experiments to identify the one with the best performance. A time segmentation approach and specific course performance attributes, collected in a classical manner from course reports, were used to determine students' performance. The models developed in this study can be used early in identifying students at risk and allow tutors to improve the academic performance of the students. By following the steps described in this paper other practitioners can revive their old data and use it to gain insight for their classes in the next academic year.
Article
Full-text available
Classification techniques have been applied to many applications in various fields of science. There are several ways of evaluating classification algorithms, and such metrics and their significance must be interpreted correctly when comparing different learning algorithms. Most of these measures are scalar metrics, while some are graphical methods. This paper introduces a detailed overview of classification assessment measures, with the aim of providing the basics of these measures and showing how they work, serving as a comprehensive source for researchers interested in this field. The overview starts by defining the confusion matrix in binary and multi-class classification problems. Many classification measures are then explained in detail, and the influence of balanced and imbalanced data on each metric is presented. An illustrative example shows (1) how to calculate these measures in binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Moreover, graphical measures such as Receiver Operating Characteristic (ROC), Precision-Recall, and Detection Error Trade-off (DET) curves are presented in detail. Additionally, in a step-by-step approach, different numerical examples demonstrate the preprocessing steps of plotting ROC, PR, and DET curves.
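The scalar measures surveyed above all derive from the four cells of a binary confusion matrix. A minimal sketch, with illustrative counts rather than any dataset from these studies:

```python
# Derive the standard scalar metrics from binary confusion-matrix
# counts: true positives, false positives, false negatives, true negatives.

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical counts for a binary classifier on 100 instances.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

On imbalanced data, accuracy alone is misleading; that is why the paper stresses metrics such as specificity and F1 alongside it.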
Article
Full-text available
Real datasets have many shortcomings that pose challenges to machine learning; high dimensionality and imbalanced class prevalence are two important ones. Classification is negatively impacted by imbalanced data, and high dimensionality can lead to suboptimal classifier performance. In this paper, we explore and analyse different feature selection methods for a clinical dataset that suffers from high dimensionality and class imbalance. The aim is to investigate the effect of imbalanced data on feature selection by applying the feature selection methods to select a subset of the original data and then resampling the dataset. In addition, we resampled the dataset to apply the feature selection methods on balanced classes and compare the results with the original data. Random Forest and J48 techniques were used to evaluate the efficacy of the samples. The experiments confirm that resampling the imbalanced classes yields a significant increase in classification performance for both classification methods, Random Forest and J48. Furthermore, the measure most affected by balancing is specificity, which increases sharply for all methods. Moreover, the subsets selected from the balanced data improve performance only for information gain, while degrading the performance of the other methods. Keywords— Clinical Data; Imbalance Class; Feature Selection; Oversampling; Under-sampling; Resampling.
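One of the resampling strategies this line of work relies on is random oversampling: duplicating minority-class samples until class counts match. A toy stdlib sketch (the rows and labels are invented for illustration, and this is a stand-in for library implementations, not the paper's exact procedure):

```python
# Minimal random oversampling: duplicate minority-class rows (sampled
# with replacement) until every class reaches the majority-class count.
import random

def oversample(rows, label_index=-1, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw extra minority samples with replacement.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

# Hypothetical imbalanced data: 3 "neg" rows vs. 1 "pos" row.
data = [[0.2, "neg"], [0.4, "neg"], [0.6, "neg"], [0.9, "pos"]]
balanced = oversample(data)
counts = {label: sum(1 for r in balanced if r[-1] == label)
          for label in {"neg", "pos"}}
print(counts)  # both classes now have 3 rows
```

Undersampling works the same way in reverse, discarding majority-class rows down to the minority count; the paper compares both directions.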
Article
The capacity to predict student academic outcomes is of value for any educational institution aiming to improve student performance and persistence. Based on the generated predictions, students identified as being at risk in terms of retention or performance can be provided support in a more timely manner. This study creates different classification models for predicting student performance, using data collected from an Australian university. The data include student enrolment details as well as the activity data generated from the university learning management system (LMS). The enrolment data contain student information such as socio-demographic features, university admission basis (e.g. via entry exam or past experience) and attendance type (e.g. full-time vs. part-time). The LMS data record student engagement with their online learning activities. An important contribution of this study is the consideration of student heterogeneity in constructing the predictive models. This is based on the observation that students with different socio-demographic features or study modes may exhibit varying learning motivations. The experiments validated the hypothesis that models trained on instances from student sub-populations outperform those constructed using all data instances. Furthermore, the experiments revealed that considering both enrolment and course activity features aids in identifying vulnerable students more precisely. The experiments determined that no individual method exhibits superior performance in all aspects. However, the rule-based and tree-based methods generate models with higher interpretability, making them more useful for designing effective student support.
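The sub-population idea above — one model per student subgroup rather than a single global model — can be sketched with a deliberately trivial "classifier". The majority-class predictor and the attendance-type split below are illustrative placeholders, not the study's actual models or features:

```python
# Fit one (trivial) model per student subgroup and compare it with a
# single global model. The majority-class predictor stands in for any
# real classifier; the subgroup key stands in for enrolment features.
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_per_subgroup(records):
    """records: list of (subgroup, label). Returns subgroup -> prediction."""
    groups = {}
    for subgroup, label in records:
        groups.setdefault(subgroup, []).append(label)
    return {sg: majority_label(labels) for sg, labels in groups.items()}

# Hypothetical data: full-time students mostly pass, part-time mostly fail.
train = [("full-time", "pass")] * 6 + [("full-time", "fail")] * 2 \
      + [("part-time", "fail")] * 5 + [("part-time", "pass")] * 3

global_model = majority_label([label for _, label in train])
subgroup_models = fit_per_subgroup(train)
print(global_model, subgroup_models)
```

The global model predicts "pass" for everyone, while the per-subgroup models recover the distinct part-time pattern — the heterogeneity effect the study exploits.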
Conference Paper
Applying data mining techniques in an educational setting can discover hidden knowledge and patterns that support decision-making processes for improving the educational system. In e-learning systems, or web-based education, students' behavioral (SB) features play an important role, reflecting students' interactivity with the e-learning system. The aim of this paper is to show the importance of SB features; for this task we collected an educational dataset from a learning management system (LMS). Feature analysis was performed on the dataset, followed by a data preprocessing phase, an important step in the knowledge discovery process. On the preprocessed dataset, classification was performed using the classifiers Decision Tree (ID3), Naïve Bayes, K-Nearest Neighbor, and Support Vector Machines to predict students' academic performance. The accuracy of the proposed model was improved using ensemble methods; we applied Bagging, Boosting, and a Voting algorithm, which are common ensemble methods. Using ensemble methods, we obtained better results, demonstrating the reliability of the proposed model.