Indonesian Journal of Electrical Engineering and Computer Science
Vol. 9, No. 2, February 2018, pp. 447~459
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v9.i2.pp447-459
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
Educational Data Mining and Analysis of Students’ Academic
Performance Using WEKA
Sadiq Hussain1, Neama Abdulaziz Dahan2, Fadl Mutaher Ba-Alwi3, Najoua Ribata4
1Examination Branch, Dibrugarh University, India
2,3Department of Computer Science, Sana’a University, Sana’a, Yemen
4Lirosa Laboratory, Abdelmalek Essaâdi University, Tetuan, Morocco
Article Info
ABSTRACT
Article history:
Received Sep 2, 2016
Revised Dec 25, 2016
Accepted Jan 11, 2017
In this competitive scenario of the educational system, higher education institutes use data mining tools and techniques to improve students' academic performance and to prevent dropout. The authors collected data from three colleges of Assam, India. The data consist of socio-economic, demographic and academic information on three hundred students, described by twenty-four attributes. Four classification methods, the J48, PART, Random Forest and Bayes Network classifiers, were used, with WEKA as the data mining tool. The most influential attributes were selected using the tool; in our dataset, the internal assessment attribute of the continuous evaluation process has the highest impact on the students' final semester results. The results showed that Random Forest outperforms the other classifiers in terms of accuracy and classifier errors. The Apriori algorithm was also applied to mine association rules among all the attributes, and the best rules are reported.
Keywords:
Educational data mining
Classification algorithms
WEKA
Students’ academic performance
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Sadiq Hussain,
Examination Branch, Dibrugarh University, India
Email: sadiq@dibru.ac.in
1. INTRODUCTION
Data Mining (DM) is a young, promising and active field of computer science. Owing to the vast amounts of data now available and the urgent need to convert such data into useful information and knowledge, data mining has attracted great interest in the information industry and in society in recent years [1]. DM focuses on the extraction of hidden knowledge from data warehouses, data marts and other repositories; large volumes of data are useless without proper utilization.
Data mining is sometimes also called knowledge discovery in databases (KDD). The two are similar in many respects but differ on an essential point: DM seeks a subset Di of a dataset D that satisfies a logical formula within the scope of Di, whereas KDD pursues knowledge discovery even when the logical formula covers all of the data [2]. The main feature of both data mining and knowledge discovery is to derive common expressions of characteristics that are shared by all elements in a set [2]. KDD and DM techniques are used to extract useful information from large amounts of data in databases [3, 4]. The result of applying DM algorithms to a given or manually generated dataset can be called rule discovery [5]. There are two main types of such rules: production rules and association rules. According to Quinlan [6], production rules are a common formalism for expressing knowledge in expert systems, and decision tree rules can also be transformed into production rules [6]. Association rules were first proposed to find relationships among sales of different items through the analysis of large transaction data [7]. DM has been applied in many fields; one of them is educational data mining (EDM).
Educational data mining is an emerging field within data mining. In this competitive world, educational institutions also use data mining tools to explore and analyze student performance, to predict results so that dropout can be prevented and both strong and academically weak performers receive attention, to give feedback to faculties and instructors, to visualize data, and to assess the learning process more effectively. The quality of education needs to be improved, and educational data mining is a tool for this improvement. Modern educational institutes need data mining for their strategies and future plans. Students' performance depends on various factors: personal, social, economic and other environmental ones [8, 9]. The top-level authorities of educational institutes may use the experimental results to understand trends and behaviours in students' performance, which may lead to the design of new pedagogical strategies [10].
There are a number of classification algorithms: Decision Tree, Neural Network, Naïve Bayes, K-Nearest Neighbour, Random Forest, AdaBoost, Support Vector Machines, etc. [11]. In this research, the authors use some of them, namely the J48, BayesNet, PART and Random Forest classification algorithms, for mining students' academic performance. The Apriori algorithm, an unsupervised method and one of the most popular algorithms for association rule mining, was additionally used to reveal hidden rules in our dataset [12]. The authors compared the algorithms on accuracy to select the best-performing algorithm for the task.
Classification is one of the predictive tasks [1] and is the most commonly used data mining
technique in predicting the students’ performance in educational institutes [11, 13, 14]. Several attributes
were considered in our study. To find the high influence attributes, feature selection was conducted first.
Feature selection removes the unnecessary attributes from the dataset to extract useful and meaningful
information. It makes the mining process faster, valuable and meaningful. In the study, students’ end
semester percentage is selected as the dependent parameter. The percentages are categorized as ‘Best’, ‘Very
Good’, ‘Good’, ‘Pass’, ‘Fail’. The data mining tool used for the study was WEKA (Waikato Environment for
Knowledge Analysis). WEKA is an open-source tool written in Java that is widely used by data miners [15]. It implements most common machine learning algorithms and can also visualize their results.
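As a concrete illustration of how WEKA can be driven programmatically, the following minimal Java sketch loads the student records and sets the class attribute. It assumes the standard WEKA 3.8 Java API; the file name students.arff and the lower-case attribute name esp (as shown later in Figure 2) are assumptions about the exported dataset, not part of the original study's scripts.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadStudents {
    public static void main(String[] args) throws Exception {
        // Load the student records (hypothetical file name) exported in ARFF format.
        Instances data = new DataSource("students.arff").getDataSet();
        // The end semester percentage (esp) is the dependent/class attribute.
        data.setClassIndex(data.attribute("esp").index());

        System.out.println("Instances : " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Class     : " + data.classAttribute().name());
    }
}
```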
The paper is organized as follows: Section 2 presents a review of the related literature; Section 3 introduces the classifier evaluation and error measurement techniques used in this research; Section 4 describes the application of the data mining algorithms to the selected dataset; Section 5 presents the experimental results and the association rule mining; and Section 6 concludes the work.
2. LITERATURE REVIEW
Ahmad et al. [16] designed a framework to predict the academic performance of first-year bachelor students of a computer science course. The dataset contained eight years of data, from the July 2006/2007 intake to the July 2013/2014 intake, covering various aspects of the students' records, including previous academic records, family background and demographics. Three classifiers, namely Decision Tree, Naïve Bayes and a Rule-Based classifier, were applied to predict the academic performance of the students. The experiments showed that the Rule-Based classifier was the best among the classifiers, with an accuracy of 71.3%, and the model predicted the first-year students' level of success. Sumitha et al. [17] developed a data model to predict students' future learning outcomes using a dataset of senior students. They compared data mining classification algorithms and found that the J48 algorithm was best suited to the task on their data. Khasanah et al. [18] conducted a study showing that highly influential attributes must be selected carefully to predict student performance and that feature selection can be applied before classification for this purpose. The student data came from the Department of Industrial Engineering, Universitas Islam Indonesia. They used Bayesian Network and Decision Tree algorithms for the classification and prediction of student performance. The feature selection methods showed that students' attendance and Grade Point Average in the first semester topped the list of features. When the accuracy rate was considered, the Bayesian Network outperformed the Decision Tree classifier in their case. Nichat et al. [19] built classification models using decision tree and artificial neural network techniques. They used several attributes to assess the strengths and weaknesses of the students in order to improve their performance.
Hilal Almarabeh [20] used the WEKA tool to evaluate the performance of university students. He found that the accuracy of the classifier algorithms depends on the size and nature of the data. The author used Naïve Bayes, Bayesian Network, Neural Network, ID3 and J48 classification techniques and found that the Bayesian Network outperforms the others in terms of accuracy. Amjad Abu Saa [21] worked out a qualitative model to analyze student performance based on students' personal and social factors, theoretically exploring various factors of students' performance in the field of higher education.
Pedro Strecht et al. [10] predicted students' results (pass/fail) and their grades. They used classification models for the students' results and regression models for the prediction of the grades.
They carried out the experiments using data of students enrolled in 700 courses at the University of Porto. They used decision trees and SVM for classification, while SVM, Random Forest and AdaBoost.R2 were best suited for the regression analysis. The classification models were able to extract useful patterns, but the regression models were not able to beat a simple baseline. Fahim Sikder et al. [13] used the Cumulative Grade Point Average (CGPA) to predict students' yearly performance. The dataset came from the student records of Bangabandhu Sheikh Mujibur Rahman Science and Technology University. The authors used a neural network technique for prediction and compared the predicted values with the students' real CGPAs.
3. CLASSIFIER EVALUATION AND ERROR MEASUREMENT TECHNIQUES
The performance measures are derived from the confusion matrix [22]. A confusion matrix is formed from the four outcomes of binary classification. In binary classification, the dataset usually has two labels, positive (P) and negative (N). The outcomes are true positive (TP), i.e. a correct positive prediction; true negative (TN), i.e. a correct negative prediction; false positive (FP), i.e. an incorrect positive prediction; and false negative (FN), i.e. an incorrect negative prediction.
a. Sensitivity (Recall or True Positive Rate)
Recall is the number of correct positive classifications divided by the total number of positives:

Recall = TP / (TP + FN)    (1)

b. Precision
Precision is the number of correct positive classifications divided by the total number of positive classifications:

Precision = TP / (TP + FP)    (2)

c. F-score
The F-score is the harmonic mean of precision and recall:

F-score = (2 × Precision × Recall) / (Precision + Recall)    (3)

d. Accuracy [23]
Accuracy is the number of all correct classifications divided by the total number of cases:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / (P + N)    (4)
The following error measures are used for the classification methods.

e. Mean Absolute Error (MAE) [24]
MAE estimates how far the predictions or forecasts differ from the actual values:

MAE = (1/n) Σ_{i=1..n} |p_i − a_i|    (5)

where n is the number of predictions, p_i is the predicted value and a_i is the actual value, so |p_i − a_i| is the absolute error.

f. Root Mean Square Error (RMSE) [24]
RMSE evaluates the differences between the predicted values and the actual observed values:

RMSE = sqrt( (1/n) Σ_{i=1..n} (a_i − p_i)² )    (6)

where a_i is the observed (actual) value and p_i is the modelled (predicted) value at time/place i.
g. Relative Absolute Error (RAE) [20] [15]
RAE is the ratio of the total absolute error to the total absolute error of a trivial predictor that always forecasts the average of the actual values:

RAE = ( Σ_{i=1..n} |p_i − a_i| ) / ( Σ_{i=1..n} |a_i − ā| )    (7)

where p_i is the forecast value, a_i is the actual value and ā is the average of the actual values.

h. Root Relative Squared Error (RRSE) [20] [15]
RRSE is the square root of the total squared error normalized by the total squared error of the same trivial mean predictor:

RRSE = sqrt( ( Σ_{i=1..n} (p_i − a_i)² ) / ( Σ_{i=1..n} (a_i − ā)² ) )    (8)
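All of these measures can be read directly from WEKA's Evaluation object once a classifier has been cross-validated. The sketch below assumes the WEKA 3.8 Java API and a hypothetical students.arff file; the weighted (per-class averaged) variants of recall, precision and F-score are shown because our class attribute has five labels.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("students.arff").getDataSet(); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));   // 10-fold cross-validation

        System.out.println(eval.toMatrixString("Confusion matrix"));   // TP/FP/TN/FN counts per class
        System.out.printf("Accuracy : %.2f%%%n", eval.pctCorrect());                // Eq. (4)
        System.out.printf("Recall   : %.3f%n", eval.weightedRecall());              // Eq. (1)
        System.out.printf("Precision: %.3f%n", eval.weightedPrecision());           // Eq. (2)
        System.out.printf("F-score  : %.3f%n", eval.weightedFMeasure());            // Eq. (3)
        System.out.printf("MAE      : %.4f%n", eval.meanAbsoluteError());           // Eq. (5)
        System.out.printf("RMSE     : %.4f%n", eval.rootMeanSquaredError());        // Eq. (6)
        System.out.printf("RAE      : %.2f%%%n", eval.relativeAbsoluteError());     // Eq. (7), in percent
        System.out.printf("RRSE     : %.2f%%%n", eval.rootRelativeSquaredError());  // Eq. (8), in percent
        System.out.printf("Kappa    : %.4f%n", eval.kappa());
    }
}
```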
3.1. Avoiding Bias in Algorithm Selection
There have been many studies assessing student academic performance and predicting student dropout and job prospects [25]. The goal of such studies is to improve the quality of education in higher educational institutes. Most of the studies consider the grade point average (GPA) [26, 27] as the response variable, while the explanatory variables vary. In our study, we used the final semester percentage as the response variable, since a grading system has not yet been introduced at the undergraduate level in most courses in Assam.
Various classification methods have also been applied in student academic performance studies [16, 20]. The accuracies reported in these studies vary with the dataset: some found decision trees to be the best among the classification algorithms, whereas others found that the Bayes Network performed better.
The authors applied four classification methods one by one until an accuracy of 99% was reached, in the case of random forest. The first method used was the Bayesian Network (BN). Almarabeh [20] analyzed the performance of students of King Saud bin Abdulaziz University for Health Sciences and found that BN was the best-suited classification method. Bayesian networks use directed acyclic graphs to depict the dependencies among random variables: random variables are represented as nodes, and if two nodes are connected by an arc, the corresponding variables are dependent on each other. BN has been used for bi-directional inference since the 1980s and is also used for reasoning under uncertainty.
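A minimal sketch of training such a network in WEKA is shown below; it assumes the WEKA 3.8 Java API, a hypothetical students.arff file, and the tool's default settings (a K2 local structure search with a simple estimator).

```java
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesNetDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("students.arff").getDataSet(); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        BayesNet bn = new BayesNet(); // defaults assumed: K2 structure search, SimpleEstimator
        bn.buildClassifier(data);
        System.out.println(bn);       // prints the learned network structure and probability tables
    }
}
```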
The authors then tried the rule-based classification technique available as PART in WEKA. Ahmad et al. [16] also used this technique and found the rule-based classifier to be the best for student academic performance assessment among Naïve Bayes, decision tree and rule-based classifiers. PART is a rule-based classifier that combines the separate-and-conquer method with the divide-and-conquer strategy. It builds a partial decision tree on the available set of records and then creates a rule from that tree; after discarding the decision tree and deleting the records covered by the rule, it builds another partial decision tree, repeating the process iteratively.
The authors then used the decision tree classification method. Patil et al. [28] established that the decision tree algorithm performs better than Naïve Bayes methods. The advantage of the decision tree classifier is that the tree can be easily visualized, understood and interpreted by users [29]. It performs well with both numerical and categorical variables. A decision tree has a tree-like structure that starts at a root node and ends at leaf nodes, which makes it one of the most powerful and popular classifiers. WEKA implements the C4.5 decision tree as the J48 classification method.
The authors used the random forest classifier as their next attempt. Random forests (RF) [11] reduce overfitting, bias and variance, so RF is more accurate and robust. RF is based on the bagging algorithm: each tree is built on a bootstrap resample of the training data, and a random subset of the explanatory variables is considered at each split, so the partitions are not always made on the same dominant variable. RF grows many individual decision trees from the training set and is good at predicting the target values.
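The bagging behaviour described above maps onto a handful of settings in WEKA's implementation. The sketch below assumes the WEKA 3.8 Java API and the hypothetical students.arff file; the number of trees and the seed are illustrative choices, not values taken from the original experiments.

```java
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("students.arff").getDataSet(); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        rf.setNumIterations(100); // number of bagged random trees (illustrative; 100 is the WEKA 3.8 default)
        rf.setSeed(1);            // fixes the bootstrap sampling so runs are reproducible
        rf.buildClassifier(data);
        System.out.println(rf);
    }
}
```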
4. APPLYING DATA MINING ALGORITHMS TO THE SELECTED DATASET
The dataset contained 300 instances with 24 attributes. The proposed framework is shown in Figure
1 below.
4.1. Data Preprocessing phase
The data for this research were collected from three colleges, namely Duliajan College, Doomdooma College and Digboi College of Assam, India. Initially, data on twenty-four attributes were collected. As the student's name does not carry any predictive significance, it was removed from the list of attributes. The attribute "marks in practical paper" was also removed in the pre-processing phase because of the large number of missing values. Finally, twenty-two attributes were retained after data cleaning. Table 1 shows the selected attributes with their possible values.
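In WEKA, this kind of attribute dropping can be done with the unsupervised Remove filter, as in the hedged sketch below (WEKA 3.8 Java API assumed; the file name and the column positions of the two dropped attributes are hypothetical).

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        Instances raw = new DataSource("students_raw.arff").getDataSet(); // hypothetical export of the 24 raw attributes

        Remove remove = new Remove();
        remove.setAttributeIndices("1,10"); // hypothetical 1-based positions of "name" and "marks in practical paper"
        remove.setInputFormat(raw);
        Instances cleaned = Filter.useFilter(raw, remove);

        System.out.println("Attributes after cleaning: " + cleaned.numAttributes()); // expected: 22
    }
}
```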
Figure 1: Framework for Students’ Academic Performance Classification
Table 1: Dataset Description

Attribute | Description | Values
GE | Gender | Male, Female
CST | Caste | General, SC, ST, OBC, MOBC
TNP | Class X Percentage | Best (>= 80%), Very Good (>= 60% and < 80%), Good (>= 45% and < 60%), Pass (>= 30% and < 45%), Fail (< 30%)
TWP | Class XII Percentage | Same categories as TNP
IAP | Internal Assessment Percentage | Same categories as TNP
ESP | End Semester Percentage | Same categories as TNP
ARR | Whether the student has back or arrear papers | Yes, No
MS | Marital Status | Married, Unmarried
LS | Lived in Town or Village | Town, Village
AS | Admission Category | Free, Paid
FMI | Family Monthly Income (in INR) | Very High (>= 30000), High (>= 20000 and < 30000), Above Medium (>= 10000 and < 20000), Medium (>= 5000 and < 10000), Low (< 5000)
FS | Family Size | Large (> 12), Average (>= 6 and < 12), Small (< 6)
FQ | Father Qualification | IL (Illiterate), UM (Under Class X), 10, 12, Degree, PG
MQ | Mother Qualification | Same categories as FQ
FO | Father Occupation | Service, Business, Retired, Farmer, Others
MO | Mother Occupation | Service, Business, Retired, Farmer, Others
NF | Number of Friends | Large, Average, Small (same cut-offs as Family Size)
SH | Study Hours | Good (>= 6 hours), Average (>= 4 hours), Poor (< 2 hours)
SS | Student School attended at Class X level | Govt., Private
ME | Medium | Eng, Asm, Hin, Ben
TT | Home to College Travel Time | Large (>= 2 hours), Average (>= 1 hour), Small (< 1 hour)
ATD | Class Attendance Percentage | Good (>= 80%), Average (>= 60% and < 80%), Poor (< 60%)
Descriptions of some of the attributes of the dataset
CST: This is the caste of the student. The possible values of this attribute are 'G' (General or unreserved category), 'SC' (Scheduled Caste), 'ST' (Scheduled Tribe), 'OBC' (Other Backward Classes) and 'MOBC' (Minorities and Other Backward Classes). These categories are based on the Indian Constitution.
TNP: This is the percentage attained by the student in Class X; the examination is called the HSLC Examination in Assam, India. The authors categorized the results as Best, Very Good, Good, Pass and Fail. 'Best' is assigned when the student secured 80% or more (termed the Star percentage); 'Very Good' when the student secured 60% or more but less than 80% (60% or more is usually termed First Division or Class in most examinations); 'Good' when the student secured 45% or more but less than 60% (in most universities in Assam this is called Second Division or Class); 'Pass' when the student secured 30% or more but less than 45%; and 'Fail' when the student secured less than 30%. The same categorization applies to TWP (Class XII percentage secured by the student), IAP (internal assessment percentage secured by the student at degree level (10+2+3)) and ESP (end semester examination percentage secured by the student at degree level). ESP is the response variable.
IAP (internal assessment percentage secured by the student at degree level (10+2+3)): Internal assessment is part of the continuous evaluation process. It comprises sessional examinations, surprise tests, assignments, field work, quizzes, etc. It is categorized in the same way as TNP, TWP and ESP.
ARR: This is categorized as 'Yes' or 'No' and records whether the student had any failed paper in any of the previous semesters.
ME: This is categorized as 'Eng' (English), 'Asm' (Assamese), 'Hin' (Hindi) and 'Ben' (Bengali); Assamese, Hindi and Bengali are modern Indian languages. It is the medium of instruction, i.e. the language in which the students were taught and appeared in the examinations.
FQ: The possible values of this attribute are 'Il' (illiterate), 'Um' (under Class X level), '10' (passed the Class X examination), '12' (passed the Class XII examination), 'Degree' (passed a Bachelor of Arts, Science or Commerce examination) and 'PG' (passed a Master of Arts, Science or Commerce examination). It is the educational qualification of the student's father. MQ stands for the mother's qualification and has the same possible values.
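The mapping of raw percentages to the categorical labels used for TNP, TWP, IAP and ESP can be expressed as a small helper, sketched below. The class name, method name and exact label strings are hypothetical illustrations of the coding rules described above, not part of the original data preparation.

```java
public class GradeCoder {
    /** Maps a raw percentage to the categorical label used for TNP, TWP, IAP and ESP. */
    static String categorize(double percentage) {
        if (percentage >= 80) return "Best";
        if (percentage >= 60) return "Very Good";
        if (percentage >= 45) return "Good";
        if (percentage >= 30) return "Pass";
        return "Fail";
    }

    public static void main(String[] args) {
        System.out.println(categorize(72.5)); // prints "Very Good"
    }
}
```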
4.2 Feature Selection
Using WEKA, feature selection discovers the most influential attributes using correlation-based attribute evaluation, gain-ratio attribute evaluation, information-gain attribute evaluation, relief attribute evaluation and symmetrical-uncertainty attribute evaluation. Correlation-based attribute evaluation is a greedy search method, while the others are rank search methods [18]. Using these feature selection methods, a total of eleven attributes were found to be highly influential. The selected attributes are shown in bold in Table 2; they were used for classification, and the other attributes were removed. The end semester percentage (esp) is the response variable. Figure 2 shows the data in the ARFF format, and a WEKA code sketch for this step is given after Figure 2.
Table 2: Attribute Selection using feature selection methods

Feature Selection Method | High Influence Attributes
Correlation-based Attribute Evaluation | arr, iap, tnp, as, twp, sh, me, fs, nf, atd, fo, fmi, fq, tt, ss
Gain-Ratio Attribute Evaluation | iap, ms, arr, tnp, twp, as, me, sh, atd, fmi, fq, nf, fo, mq, fs
Information-Gain Attribute Evaluation | iap, tnp, twp, arr, fmi, as, fq, me, atd, sh, fo, mq, nf, cst, tt
Relief Attribute Evaluation | iap, tnp, arr, tnp, nf, as, atd, me, fo, sh, fmi, fs, ls, ge, tt
Symmetrical Uncertainty Attribute Evaluation | iap, tnp, twp, arr, as, me, fmi, atd, sh, fq, fo, mq, nf, fs, tt
Figure 2: Data File in arff format
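As referenced above, the ranking produced by any of these evaluators can be reproduced programmatically. The sketch below assumes the WEKA 3.8 Java API and the hypothetical students.arff file, and uses the information-gain evaluator as one representative of the five methods listed in Table 2; the other evaluators (GainRatioAttributeEval, ReliefFAttributeEval, SymmetricalUncertAttributeEval, CorrelationAttributeEval) can be swapped in the same way.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("students.arff").getDataSet(); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);                  // esp is the class attribute

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // one of the five evaluation methods
        selector.setSearch(new Ranker());                   // rank all attributes by merit
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());     // ranked attribute list, cf. Table 2
    }
}
```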
4.3 Specifying the selected algorithms
After feature selection, the classification algorithms were applied. There are various classification methods: Decision Tree, Neural Network, Naïve Bayes, K-Nearest Neighbour, Random Forest, AdaBoost, Support Vector Machines, etc. [13]. The authors used specific algorithms available in the WEKA program for mining the academic performance of the students: the J48, PART, BayesNet and Random Forest classification algorithms. According to the WEKA algorithm specifications [30], J48 generates a pruned or unpruned C4.5 decision tree. PART uses a divide-and-conquer mechanism to build a partial C4.5 decision tree in each iteration and makes the best leaf into a rule, thereby generating a PART decision list. BayesNet learns a Bayes network using various search algorithms and quality measures, and provides the data structures (network structure, conditional probability distributions, etc.) and facilities common to Bayes network learning algorithms. Random Forest is a group of unpruned classification or regression trees created using bootstrap samples of the training data and random feature selection in tree induction, finally constructing a forest of random trees [30, 31]. The authors then compared the algorithms on accuracy to select the best-performing algorithm for the task.
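The comparison can also be scripted directly against the WEKA API. The sketch below is a hedged illustration (WEKA 3.8 Java API assumed, hypothetical file name for the dataset with the selected attributes): it runs 10-fold cross-validation for the four classifiers and prints their accuracies, which is the kind of comparison reported in Table 3.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("students_selected.arff").getDataSet(); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);                           // esp is the class attribute

        Classifier[] models = { new J48(), new PART(), new BayesNet(), new RandomForest() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1)); // 10-fold cross-validation
            System.out.printf("%-12s accuracy = %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```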
5. EXPERIMENTS AND RESULTS
5.1 Classification Results:
The stage is set for the experiments. WEKA offers various classification algorithms; the authors used the J48, BayesNet, PART and Random Forest classification methods available in WEKA. These are supervised learning algorithms that are trained on the training data and then assessed on the test data [20]. Figure 3 shows the comparison of the four classifiers.
J48 Classifier: This classifier generates a decision tree based on the C4.5 algorithm, which was developed by Ross Quinlan [20]. The resulting tree is shown in Figure 6.
BayesNet Classifier: This classifier delivers high accuracy on large databases while keeping the computational time low. A Bayesian network models conditional dependencies using a directed graph [20].
Random Forest Classifier: This classifier uses the bootstrap sampling method on the training dataset to construct many unpruned classification trees. In the testing phase, the outputs of all the trees, each built on a random selection of features, are aggregated to provide the final prediction [32]. Its behaviour is illustrated in Figures 7 and 8.
PART Classifier: This rule-learning classifier combines the divide-and-conquer strategy with the separate-and-conquer strategy. It builds a partial decision tree on the current set of instances and creates a rule from that decision tree [33].
There are 300 student records from three different colleges with 12 selected attributes. Table 3 shows the
performance of the 4 classification methods based on their accuracy.
Table 3: Comparison of different classifiers based on accuracy

Classifier | Accuracy | Correctly Classified Instances | Incorrectly Classified Instances
Random Forest | 99% | 297 | 3
PART | 74.33% | 223 | 77
J48 | 73% | 219 | 81
BayesNet | 65.33% | 196 | 104
Based on the accuracy of the four classifiers, Random Forest has more correctly classified instances than the other classification methods, with an accuracy of 99%. Figures 4 and 5 show that the Random Forest classifier also has the minimum errors in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE) when compared with the other methods.
Figure 3. Comparison of Classifiers
Figure 4: MAE and RMSE Metrics
Figure 5: RAE and RRSE Metrics
Figure 6. J48 Tree Visualization
The Kappa statistic value is 0.9859, which indicates almost perfect agreement between the predicted and the actual classes, far beyond what would be expected by chance. Hence, this model may be used for the prediction of the students' final semester percentage.
The authors also compared the random forest classifier with and without feature selection; the classifier with feature selection outperforms the one without. Table 4 shows the comparison.
Figure 7: Random Forest Visualization of Cost Curve of ‘Best’ Class of ‘end semester percentage’ attribute
Table 4: Comparison of the Random Forest classifier with and without the selected attributes

Classifier | Accuracy | Correctly Classified Instances | Incorrectly Classified Instances
Random Forest (with 12 selected attributes) | 99% | 297 | 3
Random Forest (with all attributes) | 84.33% | 233 | 67
Figure 8. Random Forest Visualization of Cost/Benefit Analysis for ‘Good’ Class of ‘end semester
percentage’ attribute
5.2 Association Rule Results
Association rules are used to analyze the data and uncover frequent if/then patterns. The most important relationships are identified using support and confidence criteria. An association rule comprises two parts: an antecedent (the if part) and a consequent (the then part). The Apriori algorithm is the most frequently used algorithm in correlation-based data mining work [12]. Using WEKA, we applied the Apriori algorithm to our dataset with a minimum support of 0.6 (180 instances), a minimum metric (confidence) of 0.9, and 8 cycles performed; a WEKA API sketch reproducing this configuration follows the rule list. The best rules found are shown below:
1. ls=V 240 ==> ms=Unmarried 240 <conf:(1)> lift:(1) lev:(0) [0] conv:(0.8)
2. ls=V mo=Housewife 213 ==> ms=Unmarried 213 <conf:(1)> lift:(1) lev:(0) [0] conv:(0.71)
3. fs=Small 202 ==> ms=Unmarried 202 <conf:(1)> lift:(1) lev:(0) [0] conv:(0.67)
4. as=Free 191 ==> ms=Unmarried 191 <conf:(1)> lift:(1) lev:(0) [0] conv:(0.64)
5. fs=Small mo=Housewife 182 ==> ms=Unmarried 182 <conf:(1)> lift:(1) lev:(0) [0] conv:(0.61)
6. ls=V ss=Govt 181 ==> ms=Unmarried 181 <conf:(1)> lift:(1) lev:(0) [0] conv:(0.6)
7. mo=Housewife 269 ==> ms=Unmarried 268 <conf:(1)> lift:(1) lev:(-0) [0] conv:(0.45)
8. ss=Govt 221 ==> ms=Unmarried 220 <conf:(1)> lift:(1) lev:(-0) [0] conv:(0.37)
9. mo=Housewife ss=Govt 200 ==> ms=Unmarried 199 <conf:(0.99)> lift:(1) lev:(-0) [0] conv:(0.33)
10. me=Asm 193 ==> ms=Unmarried 192 <conf:(0.99)> lift:(1) lev:(-0) [0] conv:(0.32)
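The sketch below shows how the run above can be reproduced through the WEKA 3.8 Java API (the file name is hypothetical). Lowering setLowerBoundMinSupport to 0.1 gives the second configuration reported next.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineRules {
    public static void main(String[] args) throws Exception {
        // Apriori in WEKA requires all-nominal data; the categorical student dataset satisfies this.
        Instances data = new DataSource("students.arff").getDataSet(); // hypothetical path

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.6); // 0.6 of 300 instances = 180, as in the first run
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.setNumRules(10);              // report the ten best rules
        apriori.buildAssociations(data);

        System.out.println(apriori);          // prints the rules with their conf/lift/lev/conv values
    }
}
```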
The experiment was repeated with only the selected attributes. This time the minimum support was 0.1 (30 instances), the minimum metric (confidence) was 0.9, and 18 cycles were performed. The best rules found are shown below:
1. esp=Fail 32 ==> arr=Y 32 <conf:(1)> lift:(1.97) lev:(0.05) [21] conv:(15.79)
2. fmi=Low fo=Farmer me=Asm 32 ==> as=Free 32 <conf:(1)> lift:(1.57) lev:(0.04) [15] conv:(11.63)
3. arr=Y fo=Farmer nf=Average 31 ==> as=Free 31 <conf:(1)> lift:(1.57) lev:(0.04) [15] conv:(11.26)
4. twp=Good iap=Good arr=Y fo=Farmer 32 ==> me=Asm 31 <conf:(0.97)> lift:(1.51) lev:(0.03) [33]
conv:(5.71)
5. tnp=Good fmi=Low me=Asm 31 ==> as=Free 30 <conf:(0.97)> lift:(1.52) lev:(0.03) [33] conv:(5.63)
6. arr=Y nf=Average me=Asm 50 ==> as=Free 48 <conf:(0.96)> lift:(1.51) lev:(0.05) [13] conv:(6.06)
7. twp=Good iap=Good as=Free fo=Farmer 44 ==> me=Asm 42 <conf:(0.95)> lift:(1.48) lev:(0.05) [14]
conv:(5.23)
8. fmi=Low fo=Farmer 43 ==> as=Free 41 <conf:(0.95)> lift:(1.5) lev:(0.05) [14] conv:(5.21)
9. iap=Good arr=Y fo=Farmer 43 ==> as=Free 41 <conf:(0.95)> lift:(1.5) lev:(0.05) [14] conv:(5.21)
10. iap=Good arr=Y fo=Farmer me=Asm 40 ==> as=Free 38 <conf:(0.95)> lift:(1.49) lev:(0.04) [18]
conv:(4.84)
6. CONCLUSION AND FUTURE WORK
The students' academic performance was evaluated based on academic and personal data collected from three colleges in Assam, India. The total number of records was 300, with 24 attributes. Two attributes were dropped during data cleaning and, using feature selection, 12 highly influential attributes were selected. Four different classification algorithms were then applied: J48, PART, BayesNet and Random Forest. The data mining tool used in the experiment was WEKA 3.8. Based on the accuracy and the classification errors, one may conclude that the Random Forest classification method was the best-suited algorithm for this dataset. The Apriori algorithm was applied to the dataset using WEKA to find some of the best association rules. As future work, the data may be extended to cover extra-curricular activities and technical skills of the students and mined with different classification algorithms to predict student performance. The authors are also interested in working on data of student assessments for each course, to learn what kind of student succeeds in what kind of course. This may indicate which courses suit each cluster of students sharing the same characteristics, provide various multidimensional summary reports and help redefine pedagogical learning paths.
ACKNOWLEDGEMENTS
The authors acknowledge the Principals of Digboi, Doomdooma and Duliajan Colleges for their support in collecting the student data and analyzing it to obtain the desired results. The authors would also like to acknowledge Prof. Alak Kr. Buragohain, Vice-
Chancellor of Dibrugarh University and Prof. G.C. Hazarika, Department of Mathematics, Dibrugarh
University for their inspiring words and guidance.
REFERENCES
1. Han, J., J. Pei, and M. Kamber, Data mining: concepts and techniques. 2 ed. 2011: Elsevier.
2. Ohsuga, S., Difference Between Data Mining And Knowledge Discovery --A View To Discovery From Knowledge-
Processing, in Granular Computing, 2005 IEEE International Conference on. 2005, IEEE. p. 6.
3. Ba-Alwi, F.M. and H.M. Hintaya, Comparative Study for Analysis the Prognostic in Hepatitis Data: Data Mining
Approach. International Journal of Scientific & Engineering Research, 2013. 4(8): p. 6.
4. Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, From Data Mining to Knowledge Discovery in Databases.
American Association for Artificial Intelligence Magazine, 1996. 17: p. 18.
5. Bhatnagar, V., A.S. Al-Hegami, and N. Kumar, A Hybrid Approach for Quantification of Novelty in Rule Discovery.
International Journal of Computer, Electrical, Automation, Control and Information Engineering, 2007. 1(4): p. 4.
6. Quinlan, J.R., Generating Production Rules From Decision Trees. ijcai, 1987. 87: p. 4.
7. Ba-Alwi, F.M., Discovery of novel association rules based on genetic algorithms. British Journal of Mathematics &
Computer Science, 2014. 4(23): p. 17.
8. Hijazi, S.T. and S.M.M.R. Naqvi, Factors Affecting Students’ Performance, A Case Of Private Colleges.
Bangladesh e-Journal of Sociology, 2006. 3(1): p. 10.
9. Bhardwaj, B.K. and S. Pal, Data Mining: A prediction for performance improvement using classification.
International Journal of Computer Science and Information Security, 2011. 9(4): p. 5.
10. Strecht, P., et al., A Comparative Study of Classification and Regression Algorithms for Modelling Students'
Academic Performance. Proceedings of the 8th International Conference on Educational Data Mining, 2015: p. 3.
11. Dekker, G.W., M. Pechenizkiy, and J.M. Vleeshouwers, Predicting students drop out: A case study. EDM ’09-
Educational Data Mining 2009: 2nd International Conference on Educational Data Mining, 2009. 2: p. 10.
12. Shrivastava, A.K. and R.N. Panda, Implementation of Apriori Algorithm using WEKA. KIET International Journal of
Intelligent Computing and Informatics, 2014. 1(1): p. 4.
13. Sikder, M.F., M.J. Uddin, and S. Halder, Predicting Students Yearly Performance using Neural Network: A Case
Study of BSMRSTU. 5th International Conference on Informatics, Electronics and Vision (ICIEV), 2016. 5: p. 6.
14. Millán, E., T. Loboda, and J.L. Pérez-de-la-Cruz, Bayesian networks for student model engineering. Computers and
Education. Elsevier Ltd, 2010. 55(4): p. 20.
15. Kabakchieva, D., Predicting Student Performance by Using Data Mining Methods for Classification. Cybernatics
and Information Technologies, 2013. 13(1): p. 12.
16. Ahmad, F., N.H. Ismail, and A. Abdulaziz, The Prediction of Students’ Academic Performance Using Classification
Data Mining Techniques. Applied Mathematical Sciences, 2015. 9(129): p. 12.
17. Sumitha, R. and E.S. Vinothkumar, Prediction of Students Outcome Using Data Mining Techniques. International
Journal of Scientific Engineering and Applied Science (IJSEAS), 2016. 2(6): p. 8.
18. Khasanah, A.U. and Harwati, A Comparative Study to Predict Student’s Performance Using Educational Data
Mining Techniques. IOP Conf. Series: Materials Science and Engineering, 2017. 215(012036): p. 7.
19. Nichat, A.A. and D.A.B. Raut, Analysis of Student Performance Using Data Mining Technique. International
Journal of Innovative Research in Computer and Communication Engineering, 2017: p. 5.
20. Almarabeh, H., Analysis of Students’ Performance by Using Different Data Mining Classifiers. I.J. Modern
Education and Computer Science, 2017. 9(8): p. 9-15.
21. Saa, A.A., Educational Data Mining & Students’ Performance Prediction. (IJACSA) International Journal of
Advanced Computer Science and Applications, 2016. 7(5): p. 9.
22. Willmott, C.J. and K. Matsuura, Advantages of the mean absolute error (MAE) over the root mean square error
(RMSE) in assessing average model performance. Climate research, 2005. 30(1): p. 4.
23. Hyndman, R.J. and A.B. Koehler, Another look at measures of forecast accuracy. International Journal of
Forecasting, 2006. 22(4): p. 10.
24. Pontius Jr., R.G., O. Thontteh, and H. Chen, Components of information for multiple resolution comparison between maps that share a real variable. Environmental and Ecological Statistics, 2008. 15(2): p. 32.
25. Roth, P.L., et al., Meta-analyzing the relationship between grades and job performance. Journal of Applied
Psychology, 1996. 81: p. 8.
26. Kuncel, N.R., S.A. Hezlett, and D.S. Ones, Academic performance, career potential, creativity, and job
performance: Can one construct predict them all? Journal of Personality and Social Psychology, 2004. 86(1): p.
13.
27. Kuncel, N.R., et al., A meta-analysis of the Pharmacy College Admission Test (PCAT) and grade predictors of
pharmacy student success. American Journal of Pharmaceutical Education, 2005. 69(3): p. 8.
28. Patil, T. and S.S. Sherekar, Performance Analysis of Naïve Bayes and J4.8 Classification Algorithm for data
classification. International Journal of Computer Science and Applications, 2013. 6(2): p. 5.
29. Anuradha, C. and T. Velmurugan, A Comparative Analysis on the Evaluation of Classification Algorithms in the
Prediction of Students Performance. Indian Journal of Science and Technology, 2015. 8(15): p. 12.
30. The University of Waikato, WEKA documentation.
31. Svetnik, V., et al., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR
Modeling. Journal of chemical information and computer sciences, 2003. 43(6): p. 12.
32. Agrawal, S., S.K. Vishwakarma, and A.K. Sharma, Using Data Mining Classifier for Predicting Student’s
Performance in UG level. International Journal of Computer Applications, 2017. 172(8): p. 6.
33. Tan, P.-N., M. Steinbach, and V. Kumar, Chapter 5: Rule-Based Classifiers, in Introduction to Data Mining, A.
Nordman, Editor. 2005, PEARSON EDUCATION: US.