Data 2024, 9, x. https://doi.org/10.3390/xxxxx www.mdpi.com/journal/data
Article
Predicting Academic Success of College Students Using
Machine Learning Techniques
Jorge Humberto Guanin-Fajardo 1, Javier Guaña-Moya 2,*, and Jorge Casillas 3
1 Facultad de Ciencias de la Ingeniería, Universidad Técnica Estatal de Quevedo, Quevedo 120508, Ecuador; jorgeguanin@uteq.edu.ec
2 Facultad de Ingeniería, Pontificia Universidad Católica del Ecuador, Quito 170525, Ecuador
3 Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain; casillas@decsai.ugr.es
* Correspondence: eguana953@puce.edu.ec; Tel.: +593-995000484
Abstract: College context and academic performance are important determinants of academic success; using students' prior experience with machine learning techniques to predict academic success before the end of the first year reinforces college self-efficacy. Dropout prediction is related to student retention and has been studied extensively in recent work; however, there is little literature on predicting academic success with educational machine learning. For this reason, the CRISP-DM methodology was applied to extract relevant knowledge and features from the data. The dataset examined consists of 6690 records and 21 variables with academic and socioeconomic information. Preprocessing techniques and classification algorithms were analyzed. The area under the curve was used to measure the effectiveness of the algorithms; XGBoost reached an AUC of 87.75% and correctly classified eight out of ten cases, while the decision tree improved interpretability, covering seven out of ten cases with ten rules. Given the gaps identified in the study, and since on-time completion of college consolidates college self-efficacy, creating intervention and support strategies to retain students is a priority for decision makers. Assessing the fairness and discrimination of the algorithms was the main limitation of this work. In the future, we intend to apply the extracted knowledge and study its influence on university management.
Keywords: educational data mining; machine learning; educational analysis; higher education; academic success
1. Introduction
Higher education has developed a fundamental role due to the versatility and complexity of today's world, which has led to the rapid growth of scientific literature dedicated to predicting academic success or the risk of student dropout [1–7]. Higher education institutions and their traditional role of knowledge dissemination have changed; innovation in new knowledge, especially with the irruption of artificial intelligence [8], and the training of qualified professionals make many of them interact in different areas of society. In fact, their missions of teaching, research, and the ability to share and transfer this knowledge constitute central functions of their academic and cultural activity, with the aim of improving the level of knowledge in society. They have the important role of transmitting knowledge, skills, and values to students to create competitive professionals in society. Therefore, channeling students towards academic success is transcendental, as higher education institutions (HEIs) must continue the work undertaken and further deepen their involvement, significance, and service capacity in relation to the social, cultural, and economic framework [9]. Thus, the prediction of academic success from past information on students who have successfully completed their university studies has become a tool of interest for educational managers, since it allows them to strengthen decisions and build
Citation: Guanin-Fajardo, J.H.; Guaña-Moya, J.; Casillas, J. Predicting Academic Success of College Students Using Machine Learning Techniques. Data 2024, 9, x. https://doi.org/10.3390/xxxxx
Academic Editor: Antonio Sarasa Cabezuelo
Received: 31 January 2024; Revised: 13 April 2024; Accepted: 15 April 2024; Published: 19 April 2024
Copyright: © 2024 by the authors. Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
improvement alternatives or educational policies. ICT is one of the most widely used alternatives today, especially machine learning.
Hence, advances in machine learning techniques, along with other areas of study, are precursors to educational data mining. In higher education, the academic success of students is statistically measured by the graduation rate, which is defined as the ratio of graduating students to entering students. In fact, ref. [10] states that it is possible to think about student success more broadly by studying endogenous and exogenous factors in the student environment. Thus, the constant need to be effective in fostering students' academic success has led to the customization of machine learning to achieve specific predictive models that provide useful information.
In the last decade, many studies have focused on investigative works that address the problems of performance, dropout, and academic success in university students. As detailed in [11–14], the authors emphasize that university dropout or failure converges with students from disadvantaged social strata who project university dropout behavior. To sustain university permanence, among their findings, the authors are inclined to consider that extra-university activities that guarantee retention should be strengthened. Therefore, early detection has become a tool for solving these problems. Academic history, university context (tangible and intangible resources), and other data were used as the input elements to predict the results [4]. For this purpose, qualitative and quantitative research methods have been used to solve these problems. More recently, multiple studies have employed data mining or machine learning techniques that, among other things, use algorithms and two well-known techniques to extract useful knowledge from data. The first technique, supervised classification, evaluates the data and predicts the target variable (class). The work of [6,15–17] has shown results related to supervised classification.
Similarly, in [18,19], using another approach based on supervised classification, the authors used a set of pre-selected algorithms that classify the data by applying a voting technique. Both approaches attempt to predict students' academic success or performance effectively. The second technique, unsupervised classification, is one in which the target variable is unknown and which focuses on finding hidden patterns among the data. In general, association rules are used to discover facts occurring within the data and are composed of two parts, antecedent and consequent; for example, the rule {A, B}⇒{C} means that, when A and B occur, then C occurs. In [20–22], the authors look for co-occurrences in the data by focusing on association rules and evaluating the rules with metrics such as support, confidence, and lift, among others.
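The support, confidence, and lift metrics mentioned above can be computed directly for a rule {A, B}⇒{C}; the following sketch uses a small hypothetical list of transactions purely for illustration:

```python
# Illustrative only: support, confidence, and lift for the toy rule
# {A, B} => {C} over hypothetical transactions.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"A", "B", "C"},
    {"B", "C"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"A", "B"}, {"C"}
sup = support(antecedent | consequent, transactions)   # P(A, B, C)
conf = sup / support(antecedent, transactions)         # P(C | A, B)
lift = conf / support(consequent, transactions)        # confidence / P(C)

print(round(sup, 3), round(conf, 3), round(lift, 3))   # -> 0.4 0.667 0.833
```

A lift below 1, as here, would mean that observing {A, B} actually makes C slightly less likely than its base rate.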
In the studies of [23–25], related to machine learning, a convergence of objectives and techniques applied in the data preprocessing stage was observed, covering feature reduction, data transformation, normalization, and instance selection, among others. At the same time, data balancing techniques and "black box" classification algorithms were analyzed. The synergy of these studies lies in the simplification of the predictive models obtained, given the high degree of complexity of the extracted knowledge, for which they used decision trees, since this technique simplifies the knowledge by means of the representation of rules of type (X⇒Y). To some extent, the methods applied are part of the KDD process proposed in [26]. However, data asymmetry is a typical problem in any area of study. Duplicity, ambiguity, and missing and overlapping data are frequent, especially in real-world problems. Indeed, in data mining classification techniques, problems present an unequal distribution of examples among classes (target variable), where one or more classes (minority classes) are underrepresented compared to the others (majority class) [27]. Commonly, the data balancing method defined by Chawla [28] is used in this type of problem. However, this work intends to fill the existing gap in balancing educational data by using different balancing methods for multiclass problems.
The approach of this study is similar to previous work described in [6,29–31], where similar tasks were performed with predictions in binary and multiclass settings. However, the main difference of our approach lies in the in-depth analysis of data balancing and
feature selection techniques to avoid biases in predictions. Using 53% fewer variables and improving accuracy by 10% over the preliminary results with the raw data, we not only built classification models to identify the relevant factors of college students' academic success, but also obtained a general model from the decision tree to achieve higher readability of the predictive model. In this way, we intend to provide additional guidance to academic decision makers. The open-license software used for this work was R [32], through a customized library to visualize, preprocess, and classify the data. The Python library scikit-learn [33] was used for data balancing.
The core of the work focuses on the study of machine learning techniques that predict academic success. This has allowed us to establish the objective of the work, which is to know in advance the factors that explain the academic success of students at the end of their first year of university. To do this, it has been necessary to pose research questions, since we intend to identify the factors that contribute to the academic success of students during their first year of college. This will allow us to examine the preprocessing techniques, the predictive model, the determinants of academic success and, of course, the visualization techniques that improve interpretation before and after obtaining the predictive model. In this sense, the following research questions were posed:
• RQ1: Which balancing and feature selection techniques are relevant for supervised classification algorithms?
• RQ2: Which predictive model best discriminates students' academic success?
• RQ3: Which factors are determinants of students' academic success?
Most studies on predicting academic success by machine learning have focused solely on finding a predictive model that is, to some extent, highly effective. In contrast, the work presented here, in line with RQ1, seeks the group of features that are most significant for the model and, on the other hand, also seeks a balanced training dataset, using different data balancing techniques and avoiding biases in the prediction. RQ2, in turn, aims to find the most effective predictive model using different supervised learning algorithms. Finally, RQ3 examines which variables were relevant in the predictive model achieved by the machine learning algorithms, in order to then obtain another model with a better interpretation for the decision maker.
The presented work differs, among other things, by the following contributions: (i) we unveil the effectiveness of educational data mining techniques to identify academically successful students early enough to act and reduce the failure rate; (ii) the impact of data preprocessing is analyzed; (iii) the important variables underlying the best-performing predictive model are unveiled. Thus, the presented work is associated with the works of [23,29,34], where the authors have examined the characteristics and impact of the best-performing algorithm. The rest of the paper is organized as follows: in Section 2, a literature review is carried out; in Section 3, the methodology used in this work is explained; in Section 4, the main results obtained by applying machine learning are presented; in Section 5, the discussion is presented; in Section 6, the relevant conclusions are drawn; in Section 7, the limitations are stated; and finally, in Section 8, future work is described.
2. Literature Review
In the cited literature, there are works related to the study of machine learning in higher education and its impact on the prediction of academic performance or success. In prediction, the purpose is to predict the target variable (class) of a dataset. The works cited in Table 1 employ supervised classification algorithms that focus on obtaining the predictive model.
Table 1. Summary of papers related to the prediction of academic performance or success of university students.

| Objective | Inst. 1 | Feat. 2 | Class | DPM 3 | Accuracy | Citation | Scope |
|---|---|---|---|---|---|---|---|
| Performance | 6948 | 55 | 2 | Data preprocessing methods | 82% | [35] | Higher Education |
| Performance | 3830 | 27 | 2 | Data transformation, discretization | 83% | [36] | Higher Education |
| Prediction | 1854 | 4 | 2 | – | 75% | [37] | Academic Success |
| Assessment | 731 | 12 | 2 | Feature extraction, imbalanced dataset | 78% | [6] | Higher Education |
| Achievement | 339 | 15 | 3 | Feature extraction | 69.3% | [23] | Higher Education |
| Performance | 32,593 | 31 | 4 | Feature extraction, imbalanced dataset | 72.73% | [38] | Higher Education |
| Prediction | 9652 | 68 | 2 | Feature extraction, imbalanced dataset | 75.43% | [24] | Higher Education |
| Prediction | 3225 | 57 | 2 | Feature extraction, imbalanced dataset | 79.5% | [28] | Higher Education |
| Prediction | 300 | 18 | 2 | Feature extraction | 63.33% | [34] | Higher Education |
| Prediction | 1491 | 13 | 2 | Feature extraction, imbalanced dataset | 75.78% | [5] | Higher Education |
| Prediction | 7936 | 29 | 2 | Feature extraction | 69.3% | [30] | Higher Education |
| Prediction | 4413 | – | 2 | – | – | [18] | Higher Education |
| Prediction | 6690 | 21 | 3 | Feature selection, instance selection, data balancing | 81% | Our proposal | Higher Education |

1 Number of instances. 2 Number of features. 3 Data preprocessing methods.
Among other works, the use of machine learning techniques to predict the success or failure of university courses or degrees stands out. The recommender system proposed by [35] suggests to computer science students the subjects they can take, in addition to predicting success or failure based on the previous experience of other university students. In that work, data preprocessing and example balancing techniques were applied. The preprocessed data were then used as input for the classification algorithms to learn, and the prediction model was obtained from the test data. The results achieved provide guidelines for university administrators to enhance educational quality. In this sense, the early provision of useful information to predict a given event in the student body is valuable. Hence, the study of academic performance is a relevant contribution in higher education. Helal [36] predicted the academic performance of the student body; the data used in that work were divided into groups, and each subgroup of data was evaluated with different classification algorithms to predict academic performance. The results suggest that external students and female students performed well in the prediction.
The work of Bertolini [29] set out to examine different classification algorithms to predict final exam grades with reasonable accuracy, considering midterm grades. Similarly, Alyahyan [23] proposed the use of decision trees to predict students' academic performance and generate an early warning when low performance is detected. Different decision tree approaches, as well as relevant feature extraction, were employed to obtain a simpler model for decision making by academic experts. In line with this, refs. [29,34] also examined high-impact features in the data to fit representative variables with respect to college retention and dropout, in order to develop interventions that help improve student academic success.
Similarly, in Beaulac [39], the prediction of the academic success of university students has been studied by applying the random forest and decision tree algorithms, the latter being very intuitive for decision making; the authors propose the use of these techniques to know whether, at the end of the first two semesters, the student will achieve the university degree. Their results indicate that there is a strong relationship between underperforming grades and the likelihood of succeeding in a degree program, although this does not necessarily indicate a causal connection.
Several of the related articles reveal the variety of work linked to improving the educational system. The approach of Guerrero-Higueras [7] stands out: it proposes the use of the Git version control system as an evaluation methodology, observing the frequency of use of the tool to help predict the student's academic success. The variables studied describe the student's ability with tasks related to the development of the computer science subject. This methodology differs from the rest given the adaptation of the Git version control platform and the issues specific to the computer science area.
The literature cited above emphasizes a gradual process of selecting features that achieve high accuracy in the algorithms and yield a simple, readable model. The lack of salient features prevents obtaining an effective prediction model, because of the ambiguity or irrelevance of the variables [40]. On the other hand, of significant importance is the reduction of outliers in the data due to duplicate observations or overlapping data [41–43]. All of this, of course, leads to the application of each stage suggested in the CRISP-DM methodology [26], which allows obtaining a reliable model at the end. The validity of the model obtained is checked with the performance metrics of the classification algorithms. Based on what has been presented in this section, it was observed that the literature focuses mainly on two fronts: identifying significant attributes to predict student performance, success, or failure in higher education, and finding the best prediction method to improve the accuracy of the predictive model achieved.
3. Materials and Methods
3.1. Context
The Institution of Higher Education (IES) is geographically located in the Municipality of Quevedo, Province of Los Ríos, Ecuador. Its coordinates are 1°00′46″ S 79°28′09″ W (−1.012778, −79.469167). According to the policies of the IES and its minimum requirements, each university course is taught in face-to-face mode and, in addition, each academic year of the university course must be passed. Each academic year consists of two academic cycles (semesters). Students must enroll in the university degree program and obtain grades in each subject, with a minimum grade of seven on a scale of zero to ten. As a result of the academic activities performed and their permanence in the university degree, the academic status of the student is determined (dependent variable/class). Academic statuses fall into three categories. The first is "Passed", when the student has completed and passed all academic courses. The second is "Change", when the student passes courses other than the initial degree. Finally, the third is "Dropout", when the student leaves the university completely.
3.2. Data Collection
Data collection was performed using SQL Server scripts. The data were extracted from the university's information system database server. The dataset used in this work consisted of two parts, student body and faculty, which were subsequently merged. It should be noted that the criterion for the merger was the classes taught in the first year by the faculty in the teaching process for the university degree. Thus, the first part of the information, referring to the students, dealt with academic and socioeconomic data, while that relating to the teaching staff referred to degrees obtained, age, and academic experience, among others. Among the diversity of professors in charge of university teaching of first-year students, there were full, associate, and occasional professors, totaling 286 professors selected for this study.
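The student–faculty merge described above can be sketched with pandas; the column names below are hypothetical stand-ins for the actual schema, with the first-year course acting as the join key:

```python
# Minimal sketch of the merge: attach the attributes of the professor
# who taught each first-year course to the student records.
# Column names are hypothetical, not the paper's actual schema.
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2],
    "course": ["Math I", "Physics I"],
    "grade": [8.2, 6.9],
})
faculty = pd.DataFrame({
    "course": ["Math I", "Physics I"],
    "professor_degree": ["MSc", "PhD"],
    "professor_age": [45, 51],
})

# Left join: keep every student record, add the matching professor data.
merged = students.merge(faculty, on="course", how="left")
print(merged.shape)  # -> (2, 5)
```

A left join is the natural choice here because every student record must survive the merge even if a course lacked faculty metadata.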
On the other hand, the number of regular students was 6690. Although the numbers of professors and students do not coincide, it should be clarified that a professor can teach different subjects. The students selected were those who were enrolled and had completed the first year of all university courses. In short, all of the above was framed within a retrospective of six complete academic years of each university degree, that is, ten calendar years. It should also be noted that any identifying reference to both faculty and students was eliminated to obtain an anonymous dataset. Among other things, the information extracted for this work had the endorsement and permission of the competent authority of the higher education institution, as detailed in the Institutional Review Board Statement section. The database with the raw data had 21 variables and 6690 records (see Appendix A, Table A1 for a description of the variables used).
So far, one of the main differences between machine learning (ML) algorithms and traditional statistical methods lies in their purpose: the former focus on the ability to capture complex relationships between features and make predictions as accurate as possible, while the latter, especially linear regression (LR), logistic regression (LOR), generalized mixed models, and relevance-based prediction, among others, aim at inferring relationships between variables. However, the key difference between traditional statistical approaches and ML is that, in ML, a model learns from examples rather than being programmed with rules. For a given task, examples are provided in the form of inputs (called features or attributes) and outputs (called labels or classes) [44,45].
In this work, we used the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology proposed by [26], which comprises six phases: understanding the problem, understanding the data, data preparation, modeling, evaluation, and implementation. Data preparation, or data preprocessing, is a stage that has gained importance and become a key stage, since its function is related to preparing the data; in other words, the objective is to reduce the complexity of the original dataset to obtain a readable predictive model with useful variables. Therefore, the work is based on the best practices for data preprocessing suggested in [46–48]. For this reason, Appendixes B and C detail the results of the various methods used for data preprocessing through feature filtering, instance selection, and class balancing. The main advantage of efficient data preprocessing was the transfer of suitable data to the classification algorithms for simple and accurate learning. First, the compacted data were cleaned and transformed and then analyzed with visualization techniques that allowed, among other things, the location of trajectories, overlaps, and data behavior. Second, the data were stratified into two subsets: training and test. Then, the training set was filtered for relevant instances and features, and the data were balanced using different methods. The balanced dataset was used as input for the classification algorithms, together with the test data, to obtain the predictive model. Finally, this model was evaluated with the metrics proposed in this work. Figure 1 shows the activities that were performed.
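The stratified train/test split step can be sketched with scikit-learn; the synthetic features and class proportions below are hypothetical stand-ins for the real 6690-record dataset:

```python
# Sketch of the stratified split: class proportions are preserved in
# both subsets. Data are synthetic; labels mirror the paper's classes.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.choice(["Passed", "Change", "Dropout"], size=1000,
               p=[0.35, 0.08, 0.57])  # hypothetical proportions

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)  # -> (700, 5) (300, 5)
```

Without `stratify=y`, a rare class such as "Change" could end up under-represented in the test set purely by chance.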
Figure 1. Diagram of activities performed. The processes conducted are described in four stages.
3.3. Metric Assessment
The metrics referred to in this section are used to evaluate the performance of the set of algorithms used to obtain predictive models. In Equation (4), the term $\alpha$ represents $P(T_p)$ = Sensitivity, and $(1-\beta)$ represents $P(T_n)$ = Specificity [49].

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$ (1)

$\mathrm{Sensitivity} = \mathrm{Recall} = \dfrac{TP}{TP + FN}$ (2)

$\mathrm{Specificity} = \dfrac{TN}{TN + FP}$ (3)

$\mathrm{AUC} = \dfrac{1}{2}\left(\alpha + (1 - \beta)\right)$ (4)

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (5)

$\text{Cohen's Kappa} = \dfrac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$ (6)

$\mathrm{LogLoss} = -\dfrac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{i,j}\,\log\!\left(p_{i,j}\right)$ (7)
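Equations (1)–(7) can be reproduced on a toy binary example; this is a hedged sketch in which the confusion-matrix terms and the sensitivity/specificity form of Equation (4) are computed explicitly, and the remaining metrics come from scikit-learn:

```python
# Toy binary example illustrating the metrics of Section 3.3.
from sklearn.metrics import (accuracy_score, precision_score,
                             cohen_kappa_score, log_loss, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3]  # predicted P(class 1)

# confusion_matrix with labels [0, 1] ravels as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                       # Eq. (2), alpha
specificity = tn / (tn + fp)                       # Eq. (3), 1 - beta
auc_balanced = 0.5 * (sensitivity + specificity)   # Eq. (4)

print(accuracy_score(y_true, y_pred),              # Eq. (1) -> 0.75
      precision_score(y_true, y_pred),             # Eq. (5) -> 0.75
      cohen_kappa_score(y_true, y_pred),           # Eq. (6) -> 0.5
      log_loss(y_true, y_prob))                    # Eq. (7)
```

On this toy data, TP = 3, TN = 3, FP = 1, FN = 1, so accuracy, sensitivity, specificity, and precision all equal 0.75.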
3.4. Data Exploration
The importance of data exploration is that it serves to understand the activity and behavior of the data. Visualization techniques were used to detect significant information in the data; specifically, variables were examined according to each category of the class using graphs (Figure 2).
[Figure 2 panels: (a) Pass, (b) Dropout, (c) Change.]
Figure 2. Undirected graph calculated from the correlation matrix (Pearson's method). Both the arcs and the adjacency matrix were filtered with cut-off points obtained from the weighted mean of the nodes (Pass = 0.0007804694, Dropout = 0.0061971, Change = 0.01684287). The graphs had weights associated with each of the arcs, and this weight fixed their density. Three groups of subfigures were separated according to the target variable (pass, dropout, change). Subfigure (a) showed three subgroups of variables (8, 5, 5) where a common variable overlaps. Cluster (b) showed three subgroups of variables (8, 3, 8); this subfigure lacks overlap. Group (c) showed four subgroups of variables (6, 7, 4, 2) overlapped by three common variables. On the other hand, red lines indicate a lower degree of association, while black lines and thickness indicate their strength of association.
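The graph construction behind Figure 2 can be sketched as follows; this is an illustrative simplification (synthetic data, and a mean absolute off-diagonal correlation as the cut-off, whereas the paper derives its cut-offs from a weighted mean of the nodes):

```python
# Sketch: build a correlation matrix, threshold it, and list the
# surviving edges of the resulting undirected graph.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 6))   # 200 students x 6 variables (synthetic)
data[:, 1] += 0.8 * data[:, 0]     # plant one strong association

corr = np.corrcoef(data, rowvar=False)
off_diag = np.abs(corr[~np.eye(6, dtype=bool)])
cutoff = off_diag.mean()           # illustrative cut-off, not the paper's

# Keep only edges whose absolute correlation exceeds the cut-off.
adjacency = (np.abs(corr) > cutoff) & ~np.eye(6, dtype=bool)
edges = [(i, j) for i in range(6) for j in range(i + 1, 6) if adjacency[i, j]]
print(edges)  # the planted pair (0, 1) survives the filter
```

In the paper, edge weights (correlation magnitudes) additionally drive the drawn line thickness and color.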
3.5. Data Preprocessing
The importance of data preprocessing lies in synthesizing the data and making them expeditious. This fact has an important consequence for classification algorithms, since the integrity of the data is gradually assessed by the hit rate, i.e., the number of true positives that the prediction algorithm can detect. Within this context, the aim is to obtain the set of features and instances that yield a reasonable hit rate. The data preprocessing revolves around different search strategies, such as sequential, random, and complete, that are proposed for this task. The evaluation criterion is set with filtering (distance, information, dependency, and consistency), hybrid, and wrapper methods [50–54].
The data preprocessing was divided into four phases. First, missing values in the data were replaced using the k-nearest neighbors algorithm KNN_MV [55]. Second, unrepresentative instances were excluded using the "NoiseFiltersR" algorithm. Third, feature selection was studied with different algorithms and functions that evaluated feature quality. Finally, data balancing was applied to avoid bias in the prediction model due to the small amount of minority class data.
3.6. Missing Values
Data in their original form contain inconsistencies and often have missing values. That is, when the value of a variable is not stored, it is considered missing data. Multiple techniques have been developed to replace missing values. In general, statistical techniques of central tendency are used: for numerical values, the mean or median; for nominal values, the mode. Another common technique is to remove the entire record from the dataset, although deletion can cause significant loss of information. These frequent techniques are easy to use and solve the problem of missing values; in data mining practice, however, there is a tendency to implement algorithms that solve this problem by examining the entire dataset. Specifically, in this work, we used the "rfImpute" function, which replaced missing values with a nearest-neighbor technique that takes the class (target variable) as reference.
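The paper performs imputation in R with "rfImpute"; as an illustrative Python analogue of nearest-neighbor imputation (not the authors' exact procedure), scikit-learn's `KNNImputer` fills a gap from the most similar rows:

```python
# Illustrative KNN imputation: the missing grade in row 2 is filled
# with the mean of its two nearest rows. Values are hypothetical.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [7.5, 1.0],
    [7.3, 1.2],
    [np.nan, 1.1],   # missing grade; closest to the two rows above
    [3.0, 9.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 0])  # -> 7.4 (mean of 7.5 and 7.3)
```

Unlike simple mean imputation, the filled value here reflects only the records that actually resemble the incomplete one.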
3.7. Instance Selection
Instance selection was also key in the data preprocessing, since poor-quality examples were eliminated using the NoiseFiltersR algorithm [41], which filtered out the 5% of examples that were not within the data standard. In other words, when a value is at an unusual distance from the rest of the values in the dataset, it is considered an outlier or noise.
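NoiseFiltersR implements classifier-based noise filters in R; as a hedged, simplified stand-in for the "drop the most anomalous 5%" idea, the sketch below uses an IsolationForest (a different but related outlier-removal technique) on synthetic data with planted outliers:

```python
# Illustrative outlier removal: flag ~5% of instances as anomalous and
# discard them. This approximates, not reproduces, NoiseFiltersR.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(95, 3)),    # regular instances
               rng.normal(8, 1, size=(5, 3))])    # 5 planted outliers

mask = IsolationForest(contamination=0.05,
                       random_state=0).fit_predict(X) == 1
X_clean = X[mask]
print(X_clean.shape)
```

The planted cluster far from the bulk of the data is what gets removed, mirroring the "unusual distance" criterion described above.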
3.8. Feature Selection
There is an important distinction to be made in this section, since the generality and accuracy of the predictive model will depend on the quality of the variables. Therefore, it is crucial to decide which variables are relevant to include in the study. For this, we used nine feature selection algorithms, among them "LasVegas-LVF", "Relief" [56], "selectKBest", "hillClimbing", "sequentialBackward", "sequentialFloatingForward", "deepFirst", "geneticAlgorithm", and "antColony". On the other hand, the algorithms used distinct functions to value the attributes. Among the functions, we had "mutualInformation" [57], "MDLC" [58], "determinationCoefficient" [59], "GainRatio" [60], "Gini Index" [61], and "roughsetConsistency" [62,63]. The group of algorithms used for the study of significant characteristics obtained subgroups of variables that have been evaluated and are shown in Table 2 and Appendix C, Table A3.
Table 2. Feature filtering by the "Relief" algorithm using different k and bestk filters. The smallest feature subset and the highest accuracy achieved by the C4.5 classification algorithm were established with the "bestk" filtering (10 variables).

| Filter | Variables | Value | Accuracy | Kappa | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| k = 9 | 11 | −0.002 | 0.75 | 0.56 | 0.83 | 0.85 | 0.79 | 0.83 | 0.80 |
| k = 7 | 11 | −0.001 | 0.76 | 0.5 | 0.82 | 0.85 | 0.78 | 0.82 | 0.80 |
| k = 5 | 11 | −0.003 | 0.74 | 0.52 | 0.80 | 0.83 | 0.77 | 0.80 | 0.78 |
| k = 3 | 14 | −0.001 | 0.76 | 0.56 | 0.82 | 0.85 | 0.79 | 0.82 | 0.80 |
| bestk | 10 | 0.062 | 0.79 | 0.62 | 0.85 | 0.87 | 0.81 | 0.85 | 0.83 |
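The filtering idea behind several of the selectors listed in Section 3.8 can be illustrated with scikit-learn's `SelectKBest` and a mutual-information criterion (cf. "selectKBest" and "mutualInformation"); the data below are synthetic, with one feature tied to the class and four pure-noise features:

```python
# Hedged illustration of filter-based feature selection: rank features
# by mutual information with the class and keep the best k.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(3)
y = rng.integers(0, 3, size=500)                  # 3 classes, as in the paper
informative = y + rng.normal(0, 0.3, size=500)    # strongly tied to the class
noise = rng.normal(size=(500, 4))                 # 4 irrelevant features
X = np.column_stack([informative, noise])

selector = SelectKBest(mutual_info_classif, k=1).fit(X, y)
print(selector.get_support())  # only the informative column is kept
```

Filter methods like this score each feature independently of any classifier, which is what makes them cheap enough to compare across many search strategies.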
3.9. Data Balancing
Sample balancing is another important step in data preprocessing. Currently, there are several techniques for data balancing or resampling, implemented here with Python 3.9 and its scikit-learn ecosystem [33]. In this work, the following techniques have been studied: oversampling, combined, undersampling, and ensemble. The first used the methods "SMOTE" [28] and "KMeansSMOTE" (k-means clustering followed by SMOTE oversampling) [64]. The second used both "SMOTE-ENN" (SMOTE oversampling followed by edited-nearest-neighbors cleaning) and "SMOTE-Tomek" (SMOTE oversampling followed by Tomek-links removal) [65]. The third technique was subsampling with the "RUS" method [66]. Finally, the ensemble technique used "EasyEnsemble" [67] and "Bagging". Specifically, new balanced training datasets were generated from the initial training set, in which the different techniques and methods were used to balance the data (see Table 3).
Table 3. Distribution of data per class using different data balancing techniques, along with the corresponding imbalance ratio (IR) between the majority and minority classes. A higher IR indicates a more severe class imbalance problem.

| Algorithm used | Dropout | Change | Pass | Overall | IR |
|---|---|---|---|---|---|
| Original data (no algorithm) | 3,346 | 466 | 2,080 | 5,892 | 7.180 |
| Over (SMOTE) | 2,826 | 5,652 | 8,478 | 16,956 | 3 |
| Over (KMeansSMOTE) | 5,655 | 8,481 | 2,829 | 16,965 | 2.997 |
| Combined (SMOTE-ENN) | 5,365 | 2,822 | 4,164 | 12,351 | 1.901 |
| Combined (SMOTE-Tomek) | 5,360 | 2,826 | 7,894 | 16,080 | 1.472 |
| Under (RUS) | 355 | 1,065 | 710 | 2,130 | 3 |
| Under (TomekLinks) | 2,439 | 4,229 | 3,874 | 10,542 | 1.733 |
| Ensembles (EasyEnsemble) | 2,826 | 5,017 | 4,662 | 12,505 | 1.775 |
| Ensembles (Bagging) | 2,826 | 5,017 | 4,662 | 12,505 | 1.775 |
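SMOTE's core idea, generating synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbors, can be sketched in plain NumPy. This is a deliberate simplification, not the imbalanced-learn implementations used in the paper:

```python
# Simplified SMOTE-style oversampling: each synthetic point lies on the
# segment between a minority example and one of its k nearest
# minority neighbours.
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from X_min[i] to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_like(X_minority, n_new=10)
print(X_syn.shape)  # -> (10, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new examples stay inside the minority region rather than duplicating existing records, which is what lowers the IR without exact copies.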
3.10. Classification Algorithms
The use of supervised classification techniques aims to achieve a prediction model that is highly accurate. Hence, several algorithms have been created that use different mathematical models to achieve this. In this section, we detail the types of algorithms and provide a brief description of how each works.
• Decision Trees: Consist of building a tree structure in which each branch represents
a question about an attribute. New branches are created according to the answers to
the question until the leaves of the tree are reached (where the structure ends). The leaf
nodes indicate the predicted class; see [35].
• Support Vector Machine (SVM): A relatively simple supervised machine learning
algorithm used in regression- or classification-related problems. It is most often used
for classification, although it can also be applied to regression. Basically, SVM creates
a hyperplane with boundaries between data types; in a two-dimensional space, this
hyperplane is nothing more than a line. In SVM, each datum in the dataset is plotted
in an N-dimensional space, where N is the number of features/attributes of the data;
see [68].
• Neural Network: Multilayer perceptrons (MLPs) are the best known and most widely
used type of neural network. They consist of neuron-like units with multiple inputs and
an output. Each of these units forms a weighted sum of its inputs, to which a constant
term is added. This sum is then passed through a nonlinearity, usually called an
activation function. Most of the time, the units are interconnected in such a way that
they form no loop; see [69].
• Random Forest: A combination of tree predictors, where each tree depends on the
values of a random vector sampled independently and with the same distribution for
all trees in the forest. The use of random feature selection to split each node produces
error rates that compare favorably with "AdaBoost" but are more robust with respect
to noise. The internal estimates monitor error, strength, and correlation, and are
used to show the response to increasing the number of features used in the split.
Internal estimates are also used to measure the importance of variables; see [70].
• Gradient Boosting Machine: Gradient boosting is a machine learning technique used
to solve regression or classification problems, which builds a predictive model in the
form of decision trees. It develops a general gradient descent "boosting" paradigm
for additive expansions based on any fitting criterion. Gradient boosting of regression
trees produces competitive, very robust, and interpretable regression and classification
procedures, especially suitable for mining not-so-clean data; see [71].
• XGBoost: XGBoost is a distributed and optimized gradient boosting library designed
to be highly efficient, flexible, and portable. It implements machine learning algorithms
under the gradient boosting framework. XGBoost provides parallel tree boosting
(also known as GBDT or GBM) that solves many data science problems in a fast and
accurate manner; see [72].
• Bagging: Predictor bagging is a method of generating multiple versions of a predictor
and using them to obtain an aggregate predictor. Bagging averages the versions
when predicting a numerical outcome and performs plural voting when predicting
a class. Multiple versions are formed by making bootstrap replicas of the learning set
and using them as new learning sets. Tests on real and simulated datasets show that
bagging can provide a substantial increase in accuracy; see [73].
• Naïve Bayes: A probabilistic machine learning model used for classification tasks.
The core of the classifier is based on Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B),
which gives the probability of A occurring given that B has occurred. Here, B is the
evidence, and A is the hypothesis. The assumption made is that the predictors/features
are independent; that is, the presence of a particular feature does not affect the
others; see [74].
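Most of the algorithm families listed above are available in scikit-learn, the library named in the methodology. The sketch below fits one representative of each on synthetic data as a stand-in for the real student dataset; XGBoost is omitted because it ships as a separate package, and all model settings are illustrative defaults, not the study's configuration:

```python
# Hedged sketch: one scikit-learn estimator per algorithm family described above,
# fitted on synthetic 3-class data (the study's dataset is not publicly available).
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM (RBF)": SVC(kernel="rbf", random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)  # accuracy on the held-out 25%
    print(f"{name}: accuracy={acc:.3f}")
```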
4. Results
In response to the research questions posed, different data preprocessing algorithms
have been employed to reduce the dimensionality of the dataset, so that the classification
algorithms obtain a simple and accurate predictive model. In the following sections, we
first study data preprocessing for feature selection. Second, we study data balancing using
different balancing algorithms and, finally, the results using the metrics calculated from
the confusion matrix, with which the performance of the algorithms was evaluated.
4.1. Data Preprocessing
4.1.1. Feature Selection
Prior to preprocessing, the dataset was separated into two parts: 75% of the total was
selected for training data, and the other 25% for testing. The latter were used to evaluate
the predictive model achieved by the classification algorithms, while the training set was
subjected to preprocessing techniques to reduce dimensionality and obtain adequate data.
In this sense, the work has focused on achieving simplicity and improving the accuracy
of the predictive model, for which different feature selection methods and filters have been
configured. Table 2 shows the results of the algorithm that obtained the fewest features;
the rest of the runs of other algorithms and their results can be found in Appendix C.
In line with the cited works, the studies of [15,28] examined relevant features in the
data to improve the predictive model. Table 2 presents the results for the pre-selected
feature set, where each evaluative filter and method rated the variables according to the
performance metric. Specifically, the Relief method together with the "best-k" evaluative
filter achieved better efficiency, i.e., higher accuracy with fewer variables. Based on these
results, a new dataset with the new characteristics was established and used as input data
for the data balancing phase described in the next section.
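The filter-style selection step above can be sketched with scikit-learn. ReliefF itself lives in third-party packages (e.g., skrebate); in this minimal sketch, SelectKBest with mutual information stands in for the "best-k" filter, on synthetic data with the same 21-variable width as the study's dataset:

```python
# Sketch of filter-based feature selection with a 75%/25% split, as in the study.
# SelectKBest + mutual information is an assumed stand-in for Relief + best-k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=21, n_informative=6,
                           random_state=0)

# 75% training / 25% testing; the selector is fitted on training data only,
# so the test set plays no part in choosing the features.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X_tr, y_tr)
print(selector.get_support(indices=True))  # indices of the 10 retained features
print(selector.transform(X_tr).shape)      # (375, 10): reduced training set
```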
4.1.2. Data Balancing
Data balancing is fundamental to classification algorithms, since the disparity of
examples between one class and another can lead to bias in the prediction model. There
are two common techniques for data balancing. The first is oversampling, in which the
data are balanced up to the number of examples in the majority class. The second is to
reduce the other classes to the number of examples in the minority class. Both techniques,
although not very efficient, are useful for obtaining preliminary results, since the
redistribution of the data depends on the judgment and experience of the data analyst.
To some extent, this personalized judgment is avoided by the intervention of algorithms
that perform data balancing; the algorithms augment, reduce, or equalize the examples
depending on the technique applied. Accordingly, Table 3 shows the data imbalance index
for each of the algorithms used. Each algorithm generated a new balanced dataset that
was used to train the classification algorithms.
4.2. Classification Algorithms
In this section, we examine the effectiveness of the set of classification algorithms
proposed for this work, which is framed as a multiclass problem, that is, a dependent
variable (class) with three types of outputs: Dropout, Change, and Pass. For this reason,
and as is common in supervised classification problems, two datasets have been used: the
first, for the algorithms to learn and obtain a prediction model; and the second, to evaluate
the effectiveness of the model obtained. Hence, we worked with two types of analysis: the
first with the original data (without data preprocessing) and the second with the different
datasets generated from the preprocessing techniques used.
It is difficult not to appreciate the importance of data preprocessing, as it provides
classification algorithms with balanced and clean datasets. Obtaining the predictive model
requires the algorithm to learn from the provided data (training set), as the effectiveness
of the model depends on it. Therefore, for the algorithm to achieve adequate learning,
10-fold cross-validation (CV) was applied; this approach randomly subdivided the
training set into 10 folds of approximately equal size, with each fold in turn serving as
the test section while the remaining folds were used for training, so that at the end of
training the mean prediction across folds was obtained. To check what was learned by
the algorithms, the metrics proposed in the methodology section were used, which helped
to discriminate the most effective predictive models. While effectiveness is fundamental
to evaluate the predictive model, the comprehensibility of the model obtained is also
important, since the experts evaluate the simplicity of the model.
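The cross-validation protocol described above can be sketched in a few lines with scikit-learn; the estimator and synthetic data below are illustrative stand-ins:

```python
# Minimal sketch of stratified 10-fold cross-validation, as described in the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)

# 10 folds of approximately equal size; each fold is held out once for testing
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# The reported performance is the mean across the 10 folds
print(f"mean accuracy over 10 folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```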
Here, we present the best result of the classification algorithms, which was achieved
using the dataset balanced by the "EasyEnsemble" algorithm, together with the performance
assessment of the classifiers using the ROC curve presented in Figure 3. The rest of the
results with the different datasets derived from the application of the data balancing
algorithms are presented in Appendix B, Table A2.
In view of the results, Table 4 (raw data) and Table 5 (preprocessed data) show
differences in the performance of the algorithms. Negative differences of −0.0214 and
−0.0222 for precision and AUC, respectively, are evident. This negative effect between raw
data and preprocessed data is a consequence of preprocessing, so data preprocessing should
be interpreted not as a contradictory process but as an improvement of the predictive model
achieved with fewer variables from the original set. Therefore, the advantage of applying
data preprocessing has been observed.
Figure 3. Performance of the group of algorithms, shown by plotting the area under the ROC curve
(AUC). On the ordinate axis is the true positive rate, and on the abscissa axis the false positive rate.
Classifier lines above the diagonal (dashed line) represent good classification results (better than
random), while those below represent bad results (worse than random). The best performance in
classifying the test data examples was obtained by the XGBoost algorithm; two algorithms had an
AUC above 0.87, and the rest performed below 0.86. This performance clearly indicates the
effectiveness of the predictive model against the test set.
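For a three-class problem such as this one, a single AUC figure like those in Figure 3 is typically obtained by one-vs-rest averaging over the classes. A hedged sketch with scikit-learn, on synthetic stand-in data and an illustrative classifier:

```python
# Sketch of multiclass AUC via one-vs-rest macro averaging in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Class-probability estimates on the test set are required for multiclass AUC
proba = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(f"macro one-vs-rest AUC: {auc:.4f}")
```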
Table 4. Preliminary results for the original dataset, omitting data preprocessing.

Algorithms     Accuracy  Kappa   Sensitivity  Specificity  Precision  Recall  F1      AUC     LogLoss
XGBoost        0.8133    0.6617  0.8492       0.8861       0.8456     0.8492  0.8462  0.8997  0.3736
RandomForest   0.8163    0.6664  0.8523       0.8873       0.8428     0.8523  0.8468  0.8978  NA
Gbm            0.8062    0.6473  0.8460       0.8800       0.8352     0.8460  0.8401  0.8930  0.3925
Bagging        0.8008    0.6379  0.8423       0.8769       0.8291     0.8423  0.8351  0.8781  NA
C4.5           0.7822    0.6039  0.8378       0.8642       0.8033     0.8378  0.8193  0.8308  NA
NaiveBayes     0.6549    0.3847  0.5215       0.8025       0.7622     0.5215  0.5059  0.8168  NA
SvmRadial      0.7284    0.4934  0.7781       0.8218       0.7673     0.7781  0.7709  0.7973  NA
SvmPoly        0.7165    0.4687  0.7571       0.8132       0.7685     0.7571  0.7616  0.7754  0.5484
MLP            0.6895    0.4501  0.7673       0.8143       0.7471     0.7673  0.7511  0.7621  0.5378
Table 5. Evaluation results of the predictive models obtained by the classification algorithms. The
training set was balanced with the "EasyEnsemble" technique. Model validation was performed on
the test dataset. The data were sorted according to the AUC column.

Algorithms     Accuracy  Kappa   Sensitivity  Specificity  Precision  Recall  F1      AUC     LogLoss
XGBoost        0.7949    0.6299  0.8425       0.8753       0.8214     0.8425  0.8306  0.8775  6.3430
RandomForest   0.7925    0.6269  0.8444       0.8747       0.8205     0.8444  0.8305  0.8744  NA
Gbm            0.7752    0.5923  0.8318       0.8605       0.8043     0.8318  0.8171  0.8606  5.6340
Bagging        0.7752    0.5933  0.8268       0.8617       0.8088     0.8268  0.8168  0.8591  NA
C4.5           0.7644    0.5803  0.8334       0.8594       0.7964     0.8334  0.8110  0.8249  NA
SvmPoly        0.6861    0.4094  0.7347       0.7919       0.7466     0.7347  0.7384  0.7679  4.1072
SvmRadial      0.6814    0.4073  0.7460       0.7920       0.7321     0.7460  0.7377  0.7676  NA
MLP            0.6539    0.4059  0.7620       0.8013       0.7462     0.7620  0.7360  0.7446  3.2832
NaiveBayes     0.6389    0.3850  0.6348       0.8022       0.7879     0.6348  0.6442  0.8018  6.3015
It should be noted that the logloss was lower with the original data than with the
preprocessed data. The increase with the latter was attributed to the smaller imbalance
between classes; that is, the smaller the imbalance between classes, the greater the logloss,
owing to the changed proportion of observations in the minority class. Table 3 shows the
imbalance index of the original set and of the dataset preprocessed with "EasyEnsemble"
(column IR: 7.18 and 1.775, respectively).
In Table 6, the confusion matrix of the best-scoring algorithm (XGBoost) explains the
values predicted on the test dataset by the prediction model obtained by the algorithm.
First, the type II (β) error was analyzed along the actual classes: (a) the "Dropout" class
had 868 actual cases, of which 741 were correctly predicted and 127 were classified as
"Pass"; (b) the "Change" class had 126 cases, of which 115 were correct and 11 were
classified as "Pass"; (c) of the 679 cases in the "Pass" class, 474 were correct, four were
classified as "Change", and 201 as "Dropout". Secondly, the type I (α) error was analyzed
along the predicted classes: (a) the "Dropout" class had 942 predicted cases, of which 741
were correct and 201 were actually "Pass"; (b) the "Change" class had 119 predicted cases,
of which 115 were correct and four were actually "Pass"; (c) the "Pass" class had 612
predicted cases, of which 474 were correct, 11 were actually "Change", and 127 were
actually "Dropout".
Table 6. Confusion matrix of the XGBoost algorithm. Here, the actual values (rows) are shown
versus the values predicted by the classifier (columns).

Actual \ Prediction   Dropout   Change   Pass    Total   Error Type II (β)
Dropout               741       0        127     868     0.8536
Change                0         115      11      126     0.9126
Pass                  201       4        474     679     0.6980
Total                 942       119      612     1673    µ = 0.8214
Error Type I (α)      0.7866    0.9663   0.7745          µ = 0.8431
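The per-class rates in Table 6 follow directly from the published counts: the diagonal of the matrix divided by the row totals gives the per-class recall (the rate reported alongside the type II error), and divided by the column totals gives the per-class precision (alongside the type I error). A short NumPy sketch reproduces them up to rounding:

```python
import numpy as np

# Confusion matrix from Table 6: rows = actual, columns = predicted
# class order: Dropout, Change, Pass
cm = np.array([[741,   0, 127],
               [  0, 115,  11],
               [201,   4, 474]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / actual (row) totals
precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted (column) totals

print(np.round(recall, 4))     # ≈ [0.8537 0.9127 0.6981], Table 6 rates per class
print(np.round(precision, 4))  # ≈ [0.7866 0.9664 0.7745]
print(cm.sum())                # 1673 test examples in total
```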
Overall, a more efficient predictive model was obtained with the XGBoost classification
algorithm. The work of [75] highlights that the random forest algorithm obtained a better
result in accuracy (ACC: 0.81) using only 10 features of the original dataset, pointing out
the importance of improving academic performance and increasing the graduation rate of
the students of the educational center. Consequently, it is necessary to consider that, as the
accuracy of the model increases, its complexity also needs to remain explainable. In this
context, we looked for a way to apply a simple and readable method. The decision tree
provides a simple rule-based model that improves comprehensibility; although less
efficient, it is very easy to interpret. Figure 4 shows the decision tree generated from the
training data and Figure 5 shows the important variables.
Figure 4. The decision tree drawn is based on the rules obtained. The nodes represent the class. The
three decimal values within each node represent the probability of each class with respect to the
evaluation of the rule. In turn, the total percentage of cases covered by the rule (cover) is shown.
Below the node, the condition of the rule is displayed.
Figure 5. The importance of a variable is calculated by summing the decrease in error when splitting
on that variable. Thus, the higher the value, the more the variable contributes to improving the
model; the values are bounded between 0 and 1.
4.3. Statistical Comparison of Several Classifiers
Formally, statistical significance is defined as a probability measure to assess
experiments or studies. Ronald Fisher promoted the use of the null hypothesis [76],
establishing a significance threshold of 0.05 (1/20) to determine the validity of the results
obtained in empirical tests. In this way, it is ensured that the results do not arise from
chance coincidences. In the work of Demšar [77], the statistical significance of different
classification algorithms on real-world datasets was validated by different empirical tests.
In this context, the nonparametric Friedman and Wilcoxon tests were used; both are
suitable for this type of analysis because neither relies on a normal distribution of the data
or on homogeneity of variances, making them appropriate for studies with real,
unmanipulated data.
Prior to the calculation of the nonparametric tests, the results matrix of the group of
algorithms and the datasets was organized, using the area under the curve (AUC; see
Appendix D, Table A6) as the metric. The significance threshold was set at 0.05 for the
Friedman and Wilcoxon tests to determine whether there were significant differences
between more than two dependent groups. For the empirical tests, we used the null
hypothesis H0: there are no significant differences between the groups of algorithms, and
the alternative hypothesis Ha: there is at least one significant difference between the groups
of algorithms. The Friedman test yielded a chi-square (χ2) of 52.305 with 8 degrees of
freedom and a p-value of 1.47 × 10−8 (see Appendix D, Table A4). Since the p-value was
below the threshold, the null hypothesis was rejected and the alternative hypothesis was
accepted, confirming the existence of significant differences.
Given these differences, a test was then performed for each pair of algorithms using
the Wilcoxon test as a post hoc to the Friedman test; the resulting p-values are shown in
Table 7.
Table 7. Wilcoxon signed rank test.

            XGBoost  RF       Gbm     Bagging  C4.5    NaiveBayes  SvmRadial  SvmPoly
RF          0.018    -
Gbm         0.018    0.063 *  -
Bagging     0.018    0.018    0.018   -
C4.5        0.018    0.018    0.018   0.018    -
NaiveBayes  0.018    0.018    0.018   0.018    0.612 * -
SvmRadial   0.018    0.018    0.018   0.018    0.028   0.018       -
SvmPoly     0.018    0.018    0.018   0.018    0.028   0.018       0.398 *    -
MLP         0.018    0.018    0.018   0.018    0.043   0.018       0.091 *    0.128 *
* Fail to reject the null hypothesis (p > 0.05).
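The statistical protocol above (a Friedman omnibus test over the AUC scores of several algorithms across datasets, followed by pairwise Wilcoxon signed-rank tests) can be sketched with SciPy. The AUC matrix below is illustrative random data, not the study's Appendix D results:

```python
# Sketch of the Friedman + Wilcoxon post hoc protocol with SciPy.
# The score matrix is synthetic; rows = datasets, columns = algorithms.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
auc = np.clip(rng.normal([0.87, 0.86, 0.77], 0.01, size=(8, 3)), 0.0, 1.0)

# Omnibus test: are the three algorithms' AUC distributions distinguishable?
stat, p = friedmanchisquare(auc[:, 0], auc[:, 1], auc[:, 2])
print(f"Friedman: chi2={stat:.3f}, p={p:.4f}")

# Post hoc pairwise comparisons only if the omnibus test rejects H0 at 0.05
if p < 0.05:
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        _, w_p = wilcoxon(auc[:, i], auc[:, j])
        print(f"algorithms {i} vs {j}: Wilcoxon p={w_p:.4f}")
```

With more than two algorithms, the pairwise p-values are usually also adjusted for multiple comparisons (e.g., Holm correction), a refinement Demšar [77] discusses.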
According to the results, no significant differences were found for RF vs. Gbm (0.063),
C4.5 vs. NaiveBayes (0.612), SVMRadial vs. SVMPoly (0.398), SVMRadial vs. MLP (0.091),
and SVMPoly vs. MLP (0.128), since these p-values exceed the 0.05 threshold; all remaining
pairs differed significantly (see Appendix D, Table A5 for detailed results). In [78–80],
opinions on statistics and significance tests have been discussed, because they are often
misused, either by misinterpretation or by overemphasizing their results. It should be
stated that statistical tests provide some assurance of the validity and non-randomness of
the results [77].
5. Discussion
This paper explores and discusses three research questions related to machine learning
techniques that are applied to achieve a predictive model with greater accuracy and
readability, in addition to the study of factors that lead to the academic success of university
students when they finish the first course. The answers to the questions posed are detailed
below.
RQ1: Which balancing and feature selection techniques are relevant for supervised
classification algorithms? In general, it is evident that as the number of variables increases,
the accuracy of the model increases, and so does its complexity, since the classification
algorithms improve performance while the readability of the model decreases. In this
regard, the work of Alwarthan [24] applies recursive feature elimination (RFE) with the
Pearson correlation coefficient, RFE with mutual information, and GA to find relevant
features, in addition to class balancing using SMOTE-TomekLink to build the final
prediction model. The relevant variables were related to English courses and GPA, as well
as students' social variables. Alwarthan [24] used 68 features and achieved 93% accuracy
with the initial results, while feature filtering detected 44 relevant variables and 90%
accuracy. They also analyzed eight relevant characteristics that achieved 77% accuracy;
these variables were directly related to the academic performance of the student body.
In [6], filtering of characteristics using the Gini index was proposed, from which seven
characteristics were selected, achieving 79% accuracy using the random forest algorithm.
These results were very similar to ours, but far from explainable, due to the bias derived
from the imbalance of the data. In the proposal made in this study, different data processing
techniques were used to obtain an expedited dataset. On the one hand, the instance
filtering method was considered to reduce duplicate or noisy observations by 5%. On the
other hand, for feature group filtering, six methods were used and five filters were applied,
with which an accuracy between 58% and 78% was achieved. When applying the "ReliefF"
method, 10 features were obtained with an accuracy of 79% (algorithm C4.5). By contrast,
the datasets analyzed in the literature presented accuracy values below 84% with 32
features on average. The difference from what is proposed in this work is greater than 5%
in accuracy, which is initially attractive; however, handling 22 additional features generates
a robust but poorly explainable model for decision support.
Consequently, data balancing as part of data preprocessing was crucial to achieving a
robust predictive model. The literature reviewed generally posits data balancing as a step
prior to feature filtering. The approach taken here is to obtain a filtered dataset (instances
and features) and then apply data balancing. Among the best classification accuracies
achieved by the data balancing methods, a range between 73% and 79% was obtained. The
"EasyEnsemble" method obtained the best accuracy, AUC, and logloss. The latter was far
from that of the original data, as the imbalance rate was high. For example, the imbalance
rates (IR) of the original data (7.35 IR) for undergraduate academic statuses (dropout,
change, and pass) were 57%, 7%, and 36%, while for the balanced data (1.75 IR) they were
23%, 40%, and 37% with synthetic observations. The accuracy of the XGBoost model with
balanced data was approximately 80%. In summary, the proposed data preprocessing
made the dataset unbiased and the predictive model simple and explainable.
RQ2: Which predictive model best discriminates students' academic success? Currently,
several supervised algorithms are used to predict different educational outcomes in higher
education. Specifically, the best discrimination was performed by the XGBoost algorithm.
This criterion was based first on the values collected with the predictive model, where the
accuracy was 79.49% and the AUC was 87.75%. Sensitivity was 84.25%, which indicated
the rate of positive examples that the algorithm was able to classify, while specificity was
87.53% for negative examples. Next, the logloss metric, which measures computational
cost, was 0.3736 with an imbalance rate of 7.18 on the original dataset. However, the logloss
value rose to 6.34 with the preprocessed dataset and an imbalance rate of 1.775; i.e., lower
computational cost and a higher data imbalance rate were inversely proportional to the
performance of the predictive model. Although the predictive model obtained using
XGBoost is poorly explainable due to its high complexity, it performed better at classifying
examples from the test set. Explainability was obtained when the decision tree was applied
to the training set to produce a predictive model based on rules (If, Then) that is readable
for decision makers.
Similarly, [6,16–19,24,75] converge in their predictions on higher education data using
classifiers such as Random Forest (RF), SVM, neural networks, and decision trees. Likewise,
linear or logistic regression has been used to obtain predictive models that detect failure,
success, or academic performance early enough [1,81], or, in turn, semi-supervised learning
to obtain patterns in students who managed to pass the courses for a university degree [22].
Since the main objective is to achieve attractive and reliable accuracies, accuracy
undoubtedly always comes hand in hand with the quantity and quality of the data. For
example, Gil [38] obtained accuracy rates with random forest of 77%, 91%, and 94% with
30, 44, and 68 features, respectively, evidencing the positive correlation between number
of features and accuracy. That said, in our results, accuracies very close to 80% were
achieved with only 10 features and a completely readable model (10 rules).
RQ3: Which factors are determinants of students' academic success? As part of the
development of this study, variables that play a significant role in the academic success of
students were found. Specifically, the variables ChangeDegree, RateApproval, Average,
and Degree were determinants for the prediction model obtained. These findings are close
to the results obtained by Alturki [34], where individual results from the third and fourth
semesters were examined, with accuracies of 63.33% (six variables) and 92.6% (nine
variables), respectively. The influential variables were grade point average, number of
credits taken, and academic assessment performance, applying feature selection for each
academic semester. Similarly, Alyahyan [23] identified variables related to GPA and key
subjects that detect student performance early enough. As detailed by Beaulac [39] in their
study, a first group of variables was associated with undergraduate degree completion,
whereas a second group was related to the type of major. In summary, first-year students
opt for computer- and English-related subjects to reach their academic achievement, i.e.,
characteristics related to academic performance.
Specifically, data preprocessing provides as input an expedited dataset for classification
algorithms to achieve an adequate predictive model. Although the results in the reviewed
literature resemble ours, and these can be improved by inducing endogenous or exogenous
variables into the model to achieve more optimal results, the results can also be inflated by
over-fitting parameters in the algorithms. It is also worth mentioning that, for example,
Ismanto [82] obtained an RF prediction model with an accuracy higher than 90% without
preprocessing the data, which resulted in a complex predictive model with poor
explainability. Therefore, even if the model obtains the highest accuracy, the prediction
bias can also grow if the parameters are over-fitted or the data preprocessing phase is
omitted.
Kaushik [83] has defined feature selection as increasing the quality of the data to
facilitate better results, in accordance with the proposed set of techniques for feature
selection in educational data. What is applied in this paper fits with Kaushik's perspective.
It is important to anticipate early enough, and with general quality characteristics, to take
effective countermeasures, providing timely warnings to students to achieve academic
success. In this way, the percentage of underachieving students can be reduced, and
appropriate counseling and intervention can be provided to them by the college.
The results provide conclusive support for anticipating college completion [84–86],
which is essential to assist students in the learning process and ensure their academic
success. Taking advantage of the fact that predictions made early enough by machine
learning can reveal possible difficulties or improvements from students' historical data,
their effective use requires building specific strategies [84]. Consequently, the knowledge
obtained from the data can be leveraged, for example, in constant monitoring or continuous
tracking that acts as a tool to assess progress in academic performance, class attendance,
extracurricular activities, and other key indicators [87]. Other strategies include
personalized tutorial support or intervention plans, remediation, and other resources for
students who have demonstrated compelling needs [88,89]. Machine learning, along with
other data analysis techniques, offers valuable suggestions for targeted interventions for
the benefit of students, with the goal of helping them achieve academic success in the
shortest possible time. The results presented support the authenticity of the analyses
performed, as the information is not based on mere coincidences, but on real data. In this
context, significance tests were performed using statistical methods such as the
nonparametric Friedman and Wilcoxon tests, which are widely recognized for comparing
the performance of machine learning algorithms [77,90,91]. Although these tests are not
recommended for a comprehensive study, due to the need to conform to other assumptions,
some authors have deepened their analysis and proposed alternatives to these tests [92,93].
In summary, significance tests are essential for a solid and objective interpretation of the
results obtained.
6. Conclusions
In response to the research questions, the effectiveness of the prediction model lies in
the good practice applied in the data preprocessing phase; hence, obtaining an expedited
dataset is crucial. Unlike the methodologies reviewed in the literature, our applied
methodology avoided bias in the accuracy rates of the predictive model, as well as in the
academic status (class). In fact, both the robust predictive model achieved by means of
XGBoost and the simplified decision tree model proved to be effective. The simplified
predictive model was able to detect students with high potential for academic success in
seven out of ten cases, while the robust model detected them in eight out of ten cases. The
simplification and explainability of the model were based on a set of rules obtained from
the decision tree used, to make them understandable and provide them to academic experts
as suggestions for decision making. Overall, this study provides valuable information on
the factors underlying college students' academic success expectations and highlights the
importance of effective data preprocessing and model simplification techniques for making
accurate, meaningful, and understandable predictions about college students' academic
success.
7. Limitations
The main limitation of this work was the absence of variables that would help obtain
consistent measurements in the classification algorithms in terms of gender, scholarships,
and financial aid, since it is important to evaluate equity and discrimination aspects in the
decisions made by the algorithms used to build the predictive model.
8. Future Work
Looking ahead, we intend to explore how the knowledge extracted in this work, and
the university practices applied with this knowledge, can influence classroom management,
with the aim of improving students' academic outcomes and reducing the disparity in
educational opportunities. To this end, we propose studies related to (i) examining how
the personalization of predictive models can be adapted to the phenotype (characteristics)
of the student body, where the objective is to examine the use of fuzzy logic to handle
uncertainty flexibly and how the fuzzy model can manage the university context;
(ii) designing early warning systems to intervene early and prevent failure or dropout; and
(iii) other approaches, such as longitudinal studies, that aid evaluation and effectiveness
over time to adjust the models as needed.
Author Contributions: Individual contributions: J.H.G.-F. and J.C.: conceptualization, methodology,
software, validation, formal analysis, research, writing—original draft, writing—review and editing,
visualization, supervision; J.G.-M.: resources, writing—review and editing, visualization, support,
project administration. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: The work is supported as part of the UTEQ-FOCICYT IX-2023/29 project “Factors that affect the completion of time to degree and affect the retention of UTEQ students”, approved by the twentieth resolution of the Honorable University Council dated 14 February 2023. The study maintains the confidentiality of the information stipulated in the Organic Law for the Protection of Personal Data of the Republic of Ecuador, in addition to applying the “Code of Ethics for Officials and Servants Designated or Contracted by the Universidad Técnica Estatal de Quevedo” approved by the Honorable University Council on 6 September 2011. Therefore, the research group has declared this research approved for publication in any journal with the document CERT-ETHIC-001-2023.
Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset is not publicly available but can be obtained from the corresponding author upon reasonable request.
Acknowledgments: We would like to express our deep appreciation to the authorities of the IES for authorizing and allowing access to, exploration of, and analysis of the information, and especially for the support provided by the project “Factors that influence the completion of the time to degree and affect the retention of students at UTEQ”, headed by Javier Guaña-Moya and Efraín Díaz Macías. The work is supported as part of the UTEQ-FOCICYT IX-2023/29 project.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A
This section presents the information used in the work. The dataset consists of data such as degree program, class attendance, students’ academic performance, and socioeconomic information. Each variable is numerical or categorical, as indicated.
Table A1. Description of the dataset used for the study.

| Variable Name | Values | Description | Type |
|---|---|---|---|
| Faculty | 1–5 | Names of the faculties. | Categorical |
| Degree | 1–27 | Names of the university degrees. | Categorical |
| Sex | 1. Male, 2. Female | Sex of students. | Categorical |
| AgeEntrance | 16–50 | Age at entrance to university. | Numeric |
| Support | 1. Public, 2. Private | Type of financial support of the high school where the student completed high school. | Categorical |
| Localization | 1. Local, 2. Outside of Quevedo, 3. Other Province | The geographical area of the school where the student finished high school. | Categorical |
| AveragePre | 0–10 | Average of the grades of the university leveling program (pre-university/admission/selectivity). | Numeric |
| AttendancePre | 0–100 | Pre-university attendance percentage. | Numeric |
| Average | 0–10 | Average of the subjects taken in the first year. | Numeric |
| Attendance | 0–100 | Average of the student’s attendance percentage in all subjects enrolled; must meet the minimum attendance percentage of 70%. | Numeric |
| TimeApproval | 1–3 | Number of enrollments used by the student to pass the first course. | Numeric |
| RateApproval | 0–3 | Weighting of the effort in the exams to pass the subjects; the first exam (recovery) has a value of 0.25, while the second one has a value of 0.75. | Numeric |
| CounterDegree | 0–2 | The number of college courses in which the student was enrolled. | Numeric |
| StructureFamily | 1. I am independent; 2. Only with mom; 3. Only with dad; 4. Both parents; 5. Couple; 6. Other relative | Variable associated with the student’s family structure. | Categorical |
| Job | 1. Does not work; 2. Full time; 3. Part-time; 4. Part-time by the hour; 5. Occasionally | This variable is linked to the student’s work or occupational situation. | Categorical |
| Financing | 1. Family support, with 1 or 2 children studying; 2. Self-employed (own account); 3. Family support, with more than three children studying; 4. Loan, scholarship, or current credit | This variable is related to the student’s economic disposition to pay for the academic year. | Categorical |
| Zone | 1. Outside of Quevedo; 2. Urban; 3. Slum; 4. Rural | Describes the geographic district where the student lives. | Categorical |
| Income | 1. More than $400; 2. Between $399 and $200; 3. Between $199 and $100; 4. Less than or equal to $99 | Monthly cash income (approximate) of the family nucleus. | Categorical |
| Housing | 1. Own housing; 2. Rental; 3. Mortgaged; 4. Borrowed | This variable is related to the usufruct of the housing where the student and his family live. | Categorical |
| ChangeDegree | 1. Yes, 2. No | This variable describes whether the student has changed degrees when repeating the first year. | Categorical |
| Class | 1. Dropout, 2. Change, 3. Pass | Variable with the student’s academic status at the end of the university degree. | Categorical |
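The schema above maps naturally onto a typed data frame. The following is a minimal pandas sketch (column names follow Table A1 with spaces removed; the two example rows are invented purely for illustration):

```python
import pandas as pd

# Two invented example rows following the Table A1 schema (illustration only).
df = pd.DataFrame({
    "Faculty": [1, 3], "Degree": [12, 5], "Sex": [1, 2],
    "AgeEntrance": [18, 22], "AveragePre": [7.2, 6.1],
    "Average": [7.8, 5.9], "Attendance": [85.0, 68.0],
    "Class": [3, 1],  # 1 = Dropout, 2 = Change, 3 = Pass
})

# Columns marked "Categorical" in Table A1 become pandas categories;
# the rest stay numeric, matching the "Type" column.
categorical = ["Faculty", "Degree", "Sex", "Class"]
df[categorical] = df[categorical].astype("category")
print(df.dtypes)
```

Typing the columns this way lets encoders and classifiers treat coded labels (e.g., Faculty 1–5) as categories rather than as ordered magnitudes.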
Appendix B
Table A2 presents the results of the metrics computed for the group of classification algorithms. The results in this appendix correspond to complementary trainings, as six different balancing techniques were used to generate new datasets with which effective predictive models were trained. The techniques applied balancing methods based on oversampling, undersampling, and combined balancing built on the SMOTE algorithm. The “EasyEnsemble” data balancing performed best and has been presented in the Results section as part of the data input for the group of classification algorithms used to obtain the predictive model.
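The core idea behind the SMOTE-based oversamplers listed above (creating synthetic minority examples by interpolating between a minority sample and one of its k nearest minority neighbors) can be sketched in a few lines of NumPy. This is a simplified illustration of the technique, not the library implementation used in the study:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between a random minority sample and one of its k nearest minority
    neighbors (the core SMOTE idea, simplified for illustration)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))     # pick a minority sample...
        j = rng.choice(neighbors[i])     # ...and one of its k neighbors
        gap = rng.random()               # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment joining two existing minority samples, oversampling densifies the minority region instead of duplicating records; the KMeans, Tomek, and ENN variants used in the study differ only in where they generate or clean such points.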
Table A2. Performance results of the classification algorithms that trained and tested the predictive models using new datasets constructed with the data balancing algorithms.

| Bal. | Algorithm | Acc. | Kappa | Sensi. | Speci. | Preci. | Recall | F1 | AUC | LogLoss |
|---|---|---|---|---|---|---|---|---|---|---|
| SMOTE | XGBoost | 0.7878 | 0.6214 | 0.8472 | 0.8740 | 0.8237 | 0.8472 | 0.8318 | 0.8743 | 6.7890 |
| | RF | 0.7812 | 0.6118 | 0.8418 | 0.8720 | 0.8143 | 0.8418 | 0.8229 | 0.8671 | |
| | Gbm | 0.7723 | 0.5984 | 0.8446 | 0.8679 | 0.8105 | 0.8446 | 0.8205 | 0.8575 | 5.0183 |
| | Bagging | 0.7687 | 0.5887 | 0.8315 | 0.8633 | 0.8045 | 0.8315 | 0.8135 | 0.8546 | |
| | C4.5 | 0.7639 | 0.5771 | 0.8248 | 0.8577 | 0.8008 | 0.8248 | 0.8101 | 0.7936 | |
| | SvmPoly | 0.6999 | 0.4740 | 0.7819 | 0.8242 | 0.7618 | 0.7819 | 0.7631 | 0.7640 | 5.8172 |
| | SvmRadial | 0.6970 | 0.4681 | 0.7835 | 0.8215 | 0.7587 | 0.7835 | 0.7630 | 0.7649 | |
| | MLP | 0.6545 | 0.4190 | 0.7779 | 0.8087 | 0.7552 | 0.7779 | 0.7350 | 0.7512 | 5.0928 |
| | NaiveBayes | 0.6198 | 0.3802 | 0.7478 | 0.7990 | 0.7817 | 0.7478 | 0.6957 | 0.8040 | 5.0538 |
| KMeans.SMOTE | XGBoost | 0.7956 | 0.6366 | 0.8565 | 0.8802 | 0.8262 | 0.8565 | 0.8367 | 0.8702 | 6.3079 |
| | RF | 0.7794 | 0.6080 | 0.8420 | 0.8702 | 0.8125 | 0.8420 | 0.8226 | 0.8600 | |
| | Gbm | 0.7693 | 0.5929 | 0.8396 | 0.8660 | 0.8060 | 0.8396 | 0.8160 | 0.8515 | 5.1901 |
| | Bagging | 0.7681 | 0.5860 | 0.8228 | 0.8620 | 0.8049 | 0.8228 | 0.8103 | 0.8467 | |
| | C4.5 | 0.7663 | 0.5828 | 0.8259 | 0.8605 | 0.8036 | 0.8259 | 0.8113 | 0.7979 | |
| | SvmPoly | 0.6946 | 0.4613 | 0.7751 | 0.8187 | 0.7548 | 0.7751 | 0.7584 | 0.7616 | 5.8277 |
| | SvmRadial | 0.6892 | 0.4499 | 0.7717 | 0.8139 | 0.7492 | 0.7717 | 0.7551 | 0.7591 | |
| | MLP | 0.6712 | 0.4229 | 0.7703 | 0.8045 | 0.7353 | 0.7703 | 0.7452 | 0.7424 | 4.9865 |
| | NaiveBayes | 0.6067 | 0.3644 | 0.7505 | 0.7933 | 0.7804 | 0.7505 | 0.6862 | 0.7970 | 5.0905 |
| SMOTE.Tomek | XGBoost | 0.7914 | 0.6278 | 0.8474 | 0.8766 | 0.8241 | 0.8474 | 0.8320 | 0.8665 | 6.5445 |
| | Bagging | 0.7747 | 0.5970 | 0.8269 | 0.8656 | 0.8090 | 0.8269 | 0.8148 | 0.8468 | |
| | Gbm | 0.7741 | 0.6029 | 0.8430 | 0.8705 | 0.8137 | 0.8430 | 0.8199 | 0.8577 | 4.9046 |
| | RF | 0.7717 | 0.5922 | 0.8295 | 0.8639 | 0.8050 | 0.8295 | 0.8139 | 0.8562 | |
| | C4.5 | 0.7579 | 0.5639 | 0.8088 | 0.8526 | 0.7947 | 0.8088 | 0.8001 | 0.7623 | |
| | SvmPoly | 0.6975 | 0.4722 | 0.7822 | 0.8242 | 0.7627 | 0.7822 | 0.7619 | 0.7634 | 5.7663 |
| | SvmRadial | 0.6910 | 0.4579 | 0.7749 | 0.8182 | 0.7550 | 0.7749 | 0.7565 | 0.7633 | |
| | MLP | 0.6724 | 0.4459 | 0.7885 | 0.8182 | 0.7631 | 0.7885 | 0.7483 | 0.7592 | 4.8305 |
| | NaiveBayes | 0.6372 | 0.4053 | 0.7622 | 0.8079 | 0.7805 | 0.7622 | 0.7104 | 0.7959 | |
| SMOTE.ENN | XGBoost | 0.7478 | 0.5573 | 0.8216 | 0.8542 | 0.7933 | 0.8216 | 0.7988 | 0.8335 | 5.9690 |
| | Gbm | 0.7406 | 0.5481 | 0.8239 | 0.8519 | 0.7923 | 0.8239 | 0.7965 | 0.8230 | 5.2251 |
| | RF | 0.7394 | 0.5492 | 0.8285 | 0.8534 | 0.7962 | 0.8285 | 0.7972 | 0.8192 | |
| | Bagging | 0.7352 | 0.5387 | 0.8179 | 0.8485 | 0.7908 | 0.8179 | 0.7926 | 0.8165 | |
| | C4.5 | 0.7310 | 0.5274 | 0.8109 | 0.8429 | 0.7828 | 0.8109 | 0.7888 | 0.7548 | |
| | SvmRadial | 0.6880 | 0.4590 | 0.7846 | 0.8198 | 0.7594 | 0.7846 | 0.7589 | 0.7511 | |
| | SvmPoly | 0.6880 | 0.4598 | 0.7809 | 0.8206 | 0.7608 | 0.7809 | 0.7568 | 0.7490 | 5.5081 |
| | MLP | 0.6665 | 0.4398 | 0.7875 | 0.8169 | 0.7661 | 0.7875 | 0.7436 | 0.7650 | 4.7577 |
| | NaiveBayes | 0.6186 | 0.3810 | 0.7615 | 0.7989 | 0.7428 | 0.7615 | 0.6858 | 0.7764 | 4.2434 |
| RUS | Gbm | 0.7346 | 0.5346 | 0.8188 | 0.8455 | 0.7861 | 0.8188 | 0.7938 | 0.8102 | 5.1165 |
| | XGBoost | 0.7328 | 0.5344 | 0.8205 | 0.8466 | 0.7886 | 0.8205 | 0.7931 | 0.8173 | 4.4343 |
| | RF | 0.7304 | 0.5303 | 0.8187 | 0.8450 | 0.7868 | 0.8187 | 0.7914 | 0.8153 | |
| | Bagging | 0.7197 | 0.5062 | 0.8034 | 0.8346 | 0.7731 | 0.8034 | 0.7813 | 0.7962 | |
| | C4.5 | 0.6987 | 0.4954 | 0.8131 | 0.8387 | 0.7957 | 0.8131 | 0.7667 | 0.7764 | |
| | SvmRadial | 0.6629 | 0.4081 | 0.7634 | 0.7992 | 0.7249 | 0.7634 | 0.7368 | 0.7360 | |
| | SvmPoly | 0.6605 | 0.4040 | 0.7622 | 0.7976 | 0.7275 | 0.7622 | 0.7374 | 0.7348 | 4.3219 |
| | MLP | 0.6402 | 0.3871 | 0.7610 | 0.7950 | 0.7320 | 0.7610 | 0.7251 | 0.7304 | 2.9982 |
| | NaiveBayes | 0.6031 | 0.3575 | 0.7428 | 0.7907 | 0.7773 | 0.7428 | 0.6827 | 0.7841 | |
Appendix C
This appendix presents the results of feature filtering using the different methods proposed in the study. Each method, according to its nature, filtered the group of variables that best represented the data. Each resulting group of variables was then evaluated with the C4.5 classification algorithm.
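As an illustration of this filter-then-evaluate pipeline, the sketch below combines a SelectKBest mutual-information filter (k = 9, mirroring several rows of Table A3) with an entropy-based decision tree, scikit-learn’s closest analogue to C4.5. Since the study’s dataset is not public, a bundled scikit-learn dataset stands in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset: the study's student dataset is not publicly available.
X, y = load_breast_cancer(return_X_y=True)

# Filter step (mutual information, keep k = 9 features) followed by an
# entropy-based decision tree, scikit-learn's closest analogue to C4.5.
pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=9),
    DecisionTreeClassifier(criterion="entropy", random_state=0),
)
acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
print(f"Mean accuracy with 9 filtered features: {acc:.3f}")
```

Running the selector inside the pipeline ensures the filter is refit on each training fold, so the cross-validated accuracy reflects the whole filter-plus-classifier procedure rather than a selection leaked from the full dataset.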
Table A3. Feature selection results used for the evaluation of the best group of variables. The best group of variables was selected by the ReliefFbestK algorithm.
| Filter | Var. | Method | Value | Acc. | Kappa | Sensi. | Speci. | Preci. | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Roughset consistency | 11 | Las Vegas | 1.00 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56 |
| | 9 | SelectKBest | 0.02 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 8 | HillClimbing | 1.00 | 0.62 | 0.21 | 0.41 | 0.73 | | 0.41 | |
| | 9 | Sequential Backward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 9 | Sequential Floating Forward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 10 | Genetic Algorithm | 1.00 | 0.67 | 0.34 | 0.47 | 0.78 | | 0.47 | |
| | 20 | AntColony | 1.00 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| Determination coefficient | 13 | Las Vegas | 0.48 | 0.72 | 0.46 | 0.60 | 0.82 | 0.70 | 0.60 | 0.62 |
| | 9 | SelectKBest | 0.06 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 20 | HillClimbing | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| | 20 | Sequential Backward | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| | 20 | Sequential Floating Forward | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| | 20 | Genetic Algorithm | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| | 20 | AntColony | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| Gini index | 11 | Las Vegas | 1.00 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56 |
| | 9 | SelectKBest | 0.51 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 8 | HillClimbing | 1.00 | 0.62 | 0.21 | 0.41 | 0.73 | | 0.41 | |
| | 9 | Sequential Backward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 9 | Sequential Floating Forward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 11 | Genetic Algorithm | 1.00 | 0.62 | 0.21 | 0.41 | 0.73 | | 0.41 | |
| | 20 | AntColony | 1.00 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| Mutual information | 12 | Las Vegas | 1.27 | 0.72 | 0.46 | 0.56 | 0.82 | 0.69 | 0.56 | 0.58 |
| | 9 | SelectKBest | 0.16 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 6 | HillClimbing | 1.27 | 0.58 | 0.09 | 0.37 | 0.69 | | 0.37 | |
| | 8 | Sequential Backward | 1.27 | 0.62 | 0.21 | 0.41 | 0.73 | | 0.41 | |
| | 8 | Sequential Floating Forward | 1.27 | 0.62 | 0.21 | 0.41 | 0.73 | | 0.41 | |
| | 4 | Genetic Algorithm | 1.27 | 0.67 | 0.34 | 0.47 | 0.78 | | 0.47 | |
| | 20 | AntColony | 1.27 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 |
| Gain ratio | 7 | Las Vegas | 0.10 | 0.59 | 0.15 | 0.39 | 0.71 | | 0.39 | |
| | 9 | SelectKBest | 0.13 | 0.67 | 0.36 | 0.47 | 0.78 | | 0.47 | |
| | 7 | HillClimbing | 0.10 | 0.59 | 0.15 | 0.39 | 0.71 | | 0.39 | |
| | 11 | Sequential Backward | 0.10 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56 |
| | 11 | Sequential Floating Forward | 0.10 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56 |
| | 1 | Genetic Algorithm | 0.10 | 0.59 | 0.15 | 0.39 | 0.71 | | 0.39 | |
| | 19 | AntColony | 0.10 | 0.72 | 0.48 | 0.60 | 0.82 | 0.71 | 0.60 | 0.62 |
Appendix D
This section presents the results of the nonparametric Friedman and Wilcoxon tests
performed. For this purpose, the value of the AUC metric was used. The calculation was
performed using the R statistical program. Table A4 presents the values obtained from the
calculation of the Friedman test. Table A5 presents the matrix of the Wilcoxon test results,
both the Z-value on the left and the p-value on the right. Table A6 is the matrix used for
the calculation of the tests.
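The Friedman and pairwise Wilcoxon calculations can be reproduced directly from the AUC matrix in Table A6. The sketch below uses SciPy instead of R, so the Wilcoxon p-values may differ slightly from those in Table A5 (SciPy applies an exact method for samples this small, whereas the reported values follow a normal approximation):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# AUC per dataset (rows) and classifier (columns), taken from Table A6.
# Columns: XGBoost, RF, Gbm, Bagging, C4.5, NaiveBayes, SvmRadial, SvmPoly, MLP
auc = np.array([
    [0.8997, 0.8978, 0.8930, 0.8781, 0.8308, 0.8168, 0.7973, 0.7754, 0.7621],
    [0.8775, 0.8744, 0.8606, 0.8591, 0.8249, 0.8018, 0.7676, 0.7679, 0.7446],
    [0.8743, 0.8671, 0.8575, 0.8546, 0.7936, 0.8040, 0.7649, 0.7640, 0.7512],
    [0.8702, 0.8600, 0.8515, 0.8467, 0.7979, 0.7970, 0.7591, 0.7616, 0.7424],
    [0.8665, 0.8562, 0.8577, 0.8468, 0.7623, 0.7959, 0.7633, 0.7634, 0.7592],
    [0.8335, 0.8192, 0.8230, 0.8165, 0.7548, 0.7764, 0.7511, 0.7490, 0.7650],
    [0.8173, 0.8153, 0.8102, 0.7962, 0.7764, 0.7841, 0.7360, 0.7348, 0.7304],
])

# Friedman test across the classifier columns (one AUC per dataset each).
stat, p = friedmanchisquare(*auc.T)
print(f"Friedman chi-square = {stat:.4f}, p = {p:.4e}")

# Pairwise Wilcoxon signed-rank test, e.g. XGBoost vs. Random Forest.
w, p_pair = wilcoxon(auc[:, 0], auc[:, 1])
print(f"XGBoost vs. RF: p = {p_pair:.4f}")
```

The Friedman statistic obtained this way (about 52.30, p about 1.47 × 10−8) matches the value reported under Table A4.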
Table A4. Average rankings of the algorithms.

| Algorithm | Ranking |
|---|---|
| XGBoost | 1.0000 |
| RandomForest | 2.2857 |
| Gbm | 2.7143 |
| Bagging | 4.0000 |
| C4.5 | 6.0000 |
| NaiveBayes | 5.4286 |
| SvmRadial | 7.4286 |
| SvmPoly | 7.5714 |
| MLP | 8.5714 |

Friedman statistic considering reduction performance (distributed according to chi-square with 8 degrees of freedom): 52.3048. p-value computed by the Friedman test: 1.4745 × 10−8.
Table A5. Z score and significance of the Wilcoxon test (Z/p-value within the table).

| Algorithms | XGBoost | RF c | Gbm | Bagging | C4.5 | NaiveBayes | SvmRadial | SvmPoly |
|---|---|---|---|---|---|---|---|---|
| RF | −2.366 a/0.018 | – | | | | | | |
| Gbm | −2.366 a/0.018 | −1.859 a/0.063 * | – | | | | | |
| Bagging | −2.371 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | – | | | | |
| C4.5 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | – | | | |
| NaiveBayes | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −0.507 b/0.612 * | – | | |
| SvmRadial | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.197 a/0.028 | −2.366 a/0.018 | – | |
| SvmPoly | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.197 a/0.028 | −2.366 a/0.018 | −0.845 a/0.398 * | – |
| MLP | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.028 a/0.043 | −2.366 a/0.018 | −1.690 a/0.091 * | −1.521 a/0.128 * |

a Based on positive rankings. b Based on negative rankings. c Random Forest. * The null hypothesis is not rejected (p > 0.05).
Table A6. AUC values obtained with the different classifiers and each dataset.

| DataSet | XGBoost | RF | Gbm | Bagging | C45 | NaiveBayes | SvmRadial | SvmPoly | MLP |
|---|---|---|---|---|---|---|---|---|---|
| RawData | 0.8997 | 0.8978 | 0.8930 | 0.8781 | 0.8308 | 0.8168 | 0.7973 | 0.7754 | 0.7621 |
| EasyEnsemble | 0.8775 | 0.8744 | 0.8606 | 0.8591 | 0.8249 | 0.8018 | 0.7676 | 0.7679 | 0.7446 |
| SMOTE | 0.8743 | 0.8671 | 0.8575 | 0.8546 | 0.7936 | 0.8040 | 0.7649 | 0.7640 | 0.7512 |
| KmeansSMOTE | 0.8702 | 0.8600 | 0.8515 | 0.8467 | 0.7979 | 0.7970 | 0.7591 | 0.7616 | 0.7424 |
| SMOTETomek | 0.8665 | 0.8562 | 0.8577 | 0.8468 | 0.7623 | 0.7959 | 0.7633 | 0.7634 | 0.7592 |
| SMOTEENN | 0.8335 | 0.8192 | 0.8230 | 0.8165 | 0.7548 | 0.7764 | 0.7511 | 0.7490 | 0.7650 |
| RUS | 0.8173 | 0.8153 | 0.8102 | 0.7962 | 0.7764 | 0.7841 | 0.7360 | 0.7348 | 0.7304 |
References
1. Realinho, V.; Machado, J.; Baptista, L.; Martins, M.V. Predicting Student Dropout and Academic Success. Data 2022, 7, 146.
https://doi.org/10.3390/data7110146.
2. Ortiz-Lozano, J.M.; Rua-Vieites, A.; Bilbao-Calabuig, P.; Casadesús-Fa, M. University student retention: Best time and data to
identify undergraduate students at risk of dropout. Innov. Educ. Teach. Int. 2018, 57, 74–85.
https://doi.org/10.1080/14703297.2018.1502090.
3. Urbina-Nájera, A.B.; Téllez-Velázquez, A.; Barbosa, R.C. Patterns to Identify Dropout University Students with Educational Data Mining. Rev. Electron. De Investig. Educ. 2021, 23, 1–15. https://doi.org/10.24320/REDIE.2021.23.E29.3918.
4. Lopes Filho, J.A.B.; Silveira, I.F. Early detection of students at dropout risk using administrative data and machine learning. RISTI—Rev. Iber. De Sist. E Tecnol. De Inf. 2021, 40, 480–495.
5. Guanin-Fajardo, J.H.; Barranquero, J.C. Contexto universitario, profesores y estudiantes: Vínculos y éxito académico. Rev.
Iberoam. De Educ. 2022, 88, 127–146. https://doi.org/10.35362/rie8814733.
6. Zeineddine, H.; Braendle, U.; Farah, A. Enhancing prediction of student success: Automated machine learning approach. Com-
put. Electr. Eng. 2020, 89, 106903. https://doi.org/10.1016/j.compeleceng.2020.106903.
7. Guerrero-Higueras, M.; Llamas, C.F.; González, L.S.; Fernández, A.G.; Costales, G.E.; González, M.C. Academic Success Assessment through Version Control Systems. Appl. Sci. 2020, 10, 1492. https://doi.org/10.3390/app10041492.
8. Rafik, M. Artificial Intelligence and the Changing Roles in the Field of Higher Education and Scientific Research. In Artificial
Intelligence in Higher Education and Scientific Research. Bridging Human and Machine: Future Education with Intelligence; Springer:
Singapore, 2023; pp. 35–46. https://doi.org/10.1007/978-981-19-8641-3_3.
9. BOE. BOE-A-2023-7500 Ley Orgánica 2/2023, de 22 de marzo, del Sistema Universitario. 2023. Available online:
https://www.boe.es/buscar/act.php?id=BOE-A-2023-7500 (accessed on 23 March 2024).
10. Guney, Y. Exogenous and endogenous factors influencing students’ performance in undergraduate accounting modules. Ac-
count. Educ. 2009, 18, 51–73. https://doi.org/10.1080/09639280701740142.
11. Tamada, M.M.; Giusti, R.; Netto, J.F.d.M. Predicting Students at Risk of Dropout in Technical Course Using LMS Logs. Electron-
ics 2022, 11, 468. https://doi.org/10.3390/electronics11030468.
12. Contini, D.; Cugnata, F.; Scagni, A. Social selection in higher education. Enrolment, dropout and timely degree attainment in
Italy. High. Educ. 2017, 75, 785–808. https://doi.org/10.1007/s10734-017-0170-9.
13. Costa, E.B.; Fonseca, B.; Santana, M.A.; De Araújo, F.F.; Rego, J. Evaluating the effectiveness of educational data mining tech-
niques for early prediction of students' academic failure in introductory programming courses. Comput. Hum. Behav. 2017, 73,
247–256. https://doi.org/10.1016/j.chb.2017.01.047.
14. Márquez-Vera, C.; Cano, A.; Romero, C.; Noaman, A.Y.M.; Fardoun, H.M.; Ventura, S. Early dropout prediction using data
mining: A case study with high school students. Expert Syst. 2015, 33, 107–124. https://doi.org/10.1111/exsy.12135.
15. Fernández, A.; del Río, S.; Chawla, N.V.; Herrera, F. An insight into imbalanced Big Data classification: Outcomes and chal-
lenges. Complex Intell. Syst. 2017, 3, 105–120. https://doi.org/10.1007/s40747-017-0037-9.
16. Rodríguez-Hernández, C.F.; Musso, M.; Kyndt, E.; Cascallar, E. Artificial neural networks in academic performance prediction:
Systematic implementation and predictor evaluation. Comput. Educ. Artif. Intell. 2021, 2, 100018.
https://doi.org/10.1016/j.caeai.2021.100018.
17. Contreras, L.E.; Fuentes, H.J.; Rodríguez, J.I. Academic performance prediction by machine learning as a success/failure indi-
cator for engineering students. Form. Univ. 2020, 13, 233–246.
18. Hassan, H.; Anuar, S.; Ahmad, N.B.; Selamat, A. Improve student performance prediction using ensemble model for higher
education. In Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2019; Volume 318, pp.
217–230.
19. Bolón-Canedo, V.; Alonso-Betanzos, A. Ensembles for feature selection: A review and future trends. Inf. Fusion 2018, 52, 1–12.
https://doi.org/10.1016/j.inffus.2018.11.008.
20. Meghji, A.F.; Mahoto, N.A.; Unar, M.A.; Shaikh, M.A. The role of knowledge management and data mining in improving edu-
cational practices and the learning infrastructure. Mehran Univ. Res. J. Eng. Technol. 2020, 39, 310–323.
https://doi.org/10.22581/muet1982.2002.08.
21. Crivei, L.; Czibula, G.; Ciubotariu, G.; Dindelegan, M. Unsupervised learning based mining of academic data sets for students’
performance analysis. In Proceedings of the SACI 2020—IEEE 14th International Symposium on Applied Computational In-
telligence and Informatics, Proceedings, Timisoara, Romania, 21–23 May 2020; Volume 17, pp. 11–16.
22. Guanin-Fajardo, J.; Casillas, J.; Chiriboga-Casanova, W. Semisupervised learning to discover the average scale of graduation of
university students. Rev. Conrado 2019, 15, 291–299.
23. Alyahyan, E.; Düştegör, D. Decision trees for very early prediction of student’s achievement. In Proceedings of the 2020 2nd International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 13–15 October 2020; pp. 1–7.
24. Alwarthan, S.; Aslam, N.; Khan, I.U. An Explainable Model for Identifying At-Risk Student at Higher Education. IEEE Access
2022, 10, 107649–107668. https://doi.org/10.1109/access.2022.3211070.
25. Adekitan, A.I.; Noma-Osaghae, E. Data mining approach to predicting the performance of first year student in a university
using the admission requirements. Educ. Inf. Technol. 2018, 24, 1527–1543. https://doi.org/10.1007/s10639-018-9839-7.
26. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Pro-
ceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 2–4 August
1996; pp. 82–88.
27. Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods
and applications. Expert Syst. Appl. 2017, 73, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035.
28. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
29. Bertolini, R.; Finch, S.J.; Nehm, R.H. Enhancing data pipelines for forecasting student performance: Integrating feature selection
with crossvalidation. Int. J. Educ. Technol. High. Educ. 2021, 18, 44.
30. Febro, J.D. Utilizing Feature Selection in Identifying Predicting Factors of Student Retention. Int. J. Adv. Comput. Sci. Appl. 2019,
10, 269–274. https://doi.org/10.14569/ijacsa.2019.0100934.
31. Ghaemi, M.; Feizi-Derakhshi, M.-R. Feature selection using Forest Optimization Algorithm. Pattern Recognit. 2016, 60, 121–129.
https://doi.org/10.1016/j.patcog.2016.05.012.
32. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing:
Vienna, Austria, 2020.
33. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg,
V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
34. Alturki, S.; Alturki, N.; Stuckenschmidt, H. Using Educational Data Mining to Predict Students’ Academic Performance For
Applying Early Interventions. J. Inf. Technol. Educ. JITE. Innov. Pract. IIP 2021, 20, 121–137.
35. Fernández-García, A.J.; Rodríguez-Echeverría, R.; Preciado, J.C.; Manzano, J.M.C.; Sánchez-Figueroa, F. Creating a recommender system to support higher education students in the subject enrollment decision. IEEE Access 2020, 8, 189069–189088.
36. Helal, S.; Li, J.; Liu, L.; Ebrahimie, E.; Dawson, S.; Murray, D.J.; Long, Q. Predicting academic performance by considering
student heterogeneity. Knowl.-Based Syst. 2018, 161, 134–146. https://doi.org/10.1016/j.knosys.2018.07.042.
37. Yağci, M. Educational data mining: Prediction of students’ academic performance using machine learning algorithms. Smart
Learn. Environ. 2022, 9, 1–19.
38. Gil, P.D.; Martins, S.d.C.; Moro, S.; Costa, J.M. A data-driven approach to predict first-year students’ academic success in higher
education institutions. Educ. Inf. Technol. 2020, 26, 2165–2190. https://doi.org/10.1007/s10639-020-10346-6.
39. Beaulac, C.; Rosenthal, J.S. Predicting University Students’ Academic Success and Major Using Random Forests. Res. High. Educ.
2019, 60, 1048–1064. https://doi.org/10.1007/s11162-019-09546-y.
40. Fernandes, E.R.; de Carvalho, A.C. Evolutionary inversion of class distribution in overlapping areas for multiclass imbalanced
learning. Inf. Sci. 2019, 494, 141–154.
41. Morales, P.; Luengo, J.; García, L.P.F.; Lorena, A.C.; de Carvalho, A.C.P.L.F.; Herrera, F. The NoiseFiltersR Package: Label Noise Preprocessing in R. R J. 2017, 9, 219–228.
42. Zeng, X.; Martinez, T. A noise filtering method using neural networks. In Proceedings of the IEEE International Workshop on
Soft Computing Techniques in Instrumentation and Measurement and Related Applications (SCIMA2003), Provo, UT, USA, 17
May 2003; pp. 26–31.
43. Verbaeten, S.; Assche, A. Ensemble methods for noise elimination in classification problems. In Multiple Classifier Systems. MCS
2003; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003; pp. 317–325.
44. Ali, A.; Jayaraman, R.; Azar, E.; Maalouf, M. A comparative analysis of machine learning and statistical methods for evaluating building performance: A systematic review and future benchmarking framework. Build. Environ. 2024, 252, 111268. https://doi.org/10.1016/j.buildenv.2024.111268.
45. Rajula, H.S.R.; Verlato, G.; Manchia, M.; Antonucci, N.; Fanos, V. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina 2020, 56, 455. https://doi.org/10.3390/medicina56090455.
46. García, S.; Luengo, J.; Herrera, F. Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowl.-Based Syst. 2016, 98, 1–29.
47. Cruz, R.M.O.; Sabourin, R.; Cavalcanti, G.D.C. Dynamic classifier selection: Recent advances and perspectives. Inf. Fusion 2018, 41, 195–216.
48. Yadav, S.K.; Pal, S. Data Mining: A Prediction for Performance Improvement of Engineering Students using Classification. arXiv
2012, arXiv:1203.3832. https://doi.org/10.48550/arXiv.1203.3832.
49. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997,
30, 1145–1159. https://doi.org/10.1016/s0031-3203(96)00142-2.
50. Nájera, A.B.U.; de la Calleja, J.; Medina, M.A. Associating students and teachers for tutoring in higher education using clustering
and data mining. Comput. Appl. Eng. Educ. 2017, 25, 823–832. https://doi.org/10.1002/cae.21839.
51. Kononenko, I. Estimating Attributes: Analysis and Extensions of RELIEF. In European Conference on Machine Learning; Springer:
Berlin/Heidelberg, Germany, 1994; pp. 171–182.
52. Liu, H.; Setiono, R. Feature selection and classification: A probabilistic wrapper approach. In Proceedings of the 9th Interna-
tional Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEAAIE´96), Fuku-
oka, Japan, 4–7 June 1996; pp. 419–424.
53. Zhu, Z.; Ong, Y.-S.; Dash, M. Wrapper–Filter Feature Selection Algorithm Using a Memetic Framework. IEEE Trans. Syst. Man
Cybern. Part B 2007, 37, 70–76. https://doi.org/10.1109/tsmcb.2006.883267.
54. Liu, H.; Yu, L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng.
2005, 17, 491–502. https://doi.org/10.1109/tkde.2005.66.
55. Batista, G.E.A.P.A.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell.
2003, 17, 519–533. https://doi.org/10.1080/713827181.
56. Kira, K.; Rendell, L. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the AAAI'92:
Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; pp. 129–134.
57. Qian, W.; Shu, W. Mutual information criterion for feature selection from incomplete data. Neurocomputing 2015, 168, 210–220.
https://doi.org/10.1016/j.neucom.2015.05.105.
58. Sheinvald, J.; Dom, B.; Niblack, W. A modeling approach to feature selection. In Proceedings of the 10th International Confer-
ence on Pattern Recognition, Atlantic City, NJ, USA, 16–21 June 1990; Volume i, pp. 535–539.
59. Coefficient of Determination. In The Concise Encyclopedia of Statistics; Springer: New York, NY, USA, 2008; pp. 88–91.
https://doi.org/10.1007/978-0-387-32833-1_62.
60. Quinlan, J. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
61. Ceriani, L.; Verme, P. The origins of the Gini index: Extracts from Variabilità e Mutabilità (1912) by Corrado Gini. J. Econ. Inequal.
2011, 10, 421–443. https://doi.org/10.1007/s10888-011-9188-x.
62. Pawlak, Z. Imprecise Categories, Approximations and Rough Sets; Springer: Dordrecht, The Netherlands, 1991; Volume 19, pp. 9–32.
63. Wang, D.; Zhang, Z.; Bai, R.; Mao, Y. A hybrid system with filter approach and multiple population genetic algorithm for feature
selection in credit scoring. J. Comput. Appl. Math. 2018, 329, 307–321. https://doi.org/10.1016/j.cam.2017.04.036.
64. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and
SMOTE. Inf. Sci. 2018, 465, 1–20. https://doi.org/10.1016/j.ins.2018.06.056.
65. Batista, G.E.; Bazzan, A.L.; Monard, M.C. Balancing training data for automated annotation of keywords: A case study. WOB
2003, 3, 10–18.
66. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772.
67. Liu, X.-Y.; Wu, J.; Zhou, Z.-H. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans. Syst. Man Cybern. Part B
2008, 39, 539–550. https://doi.org/10.1109/tsmcb.2008.2007853.
68. Hearst, M.A. Support vector machines. IEEE Intell. Syst. 1998, 13, 18–28.
69. Almeida, L.B. C1. 2 multilayer perceptrons. In Handbook of Neural Computation; Oxford University Press: New York, NY, USA,
1997; pp. 1–30.
70. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
71. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
https://doi.org/10.1214/aos/1013203451.
72. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
73. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
74. Webb, G.I. Naïve Bayes. Encycl. Mach. Learn. 2010, 15, 713–714.
75. Shetu, S.F.; Saifuzzaman, M.; Moon, N.N.; Sultana, S.; Yousuf, R. Student’s performance prediction using data mining technique
depending on overall academic status and environmental attributes. In Advances in Intelligent Systems and Computing; Springer:
Berlin/Heidelberg, Germany, 2021; Volume 1166, pp. 757–769.
76. Fisher, R.A. The Design of Experiments; Oliver & Boyd: Edinburgh, UK, 1935.
77. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. Available online:
http://jmlr.org/papers/v7/demsar06a.html (accessed on 9 April 2024).
78. Cohen, J. The earth is round (p < .05). Am. Psychol. 1994, 49, 997–1003. https://doi.org/10.1037/0003-066X.49.12.997.
79. Schmidt, F.L. Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers.
Psychol. Methods 1996, 1, 115–129.
80. Harlow, L.L.; Mulaik, S.A.; Steiger, J.H. (Eds.) Multivariate Applications Book Series. In What If There Were No Significance Tests?
Lawrence Erlbaum Associates Publishers: Mahwah, NJ, USA, 1997.
81. Al-Fairouz, E.I.; Al-Hagery, M.A. Students Performance: From Detection of Failures and Anomaly Cases to the Solutions-Based
Mining Algorithms. Int. J. Eng. Res. Technol. 2020, 13, 2895–2908. https://doi.org/10.37624/ijert/13.10.2020.2895-2908.
82. Ismanto, E.; Ghani, H.A.; Saleh, N.I.B.M. A comparative study of machine learning algorithms for virtual learning environment
performance prediction. IAES Int. J. Artif. Intell. 2023, 12, 1677–1686. https://doi.org/10.11591/ijai.v12.i4.pp1677-1686.
83. Kaushik, Y.; Dixit, M.; Sharma, N.; Garg, M. Feature Selection Using Ensemble Techniques. In Futuristic Trends in Network and
Communication Technologies; FTNCT 2020. Communications in Computer and Information Science; Springer: Singapore, 2021;
Volume 1395, pp. 288–298. https://doi.org/10.1007/978-981-16-1480-4_25.
84. Mayer, A.-K.; Krampen, G. Information literacy as a key to academic success: Results from a longitudinal study. Commun. Com-
put. Inf. Sci. 2016, 676, 598–607. https://doi.org/10.1007/978-3-319-52162-6_59.
85. Harackiewicz, J.M.; Barron, K.E.; Tauer, J.M.; Elliot, A.J. Predicting success in college: A longitudinal study of achievement goals
and ability measures as predictors of interest and performance from freshman year through graduation. J. Educ. Psychol. 2002,
94, 562–575. https://doi.org/10.1037/0022-0663.94.3.562.
86. Meier, Y.; Xu, J.; Atan, O.; van der Schaar, M. Predicting Grades. IEEE Trans. Signal Process. 2015, 64, 959–972.
https://doi.org/10.1109/tsp.2015.2496278.
87. Lord, S.M.; Ohland, M.W.; Orr, M.K.; Layton, R.A.; Long, R.A.; Brawner, C.E.; Ebrahiminejad, H.; Martin, B.A.; Ricco, G.D.;
Zahedi, L. MIDFIELD: A Resource for Longitudinal Student Record Research. IEEE Trans. Educ. 2022, 65, 245–256.
https://doi.org/10.1109/te.2021.3137086.
88. Tompsett, J.; Knoester, C. Family socioeconomic status and college attendance: A consideration of individual-level and school-
level pathways. PLoS ONE 2023, 18, e0284188. https://doi.org/10.1371/journal.pone.0284188.
89. Ma, Y.; Cui, C.; Nie, X.; Yang, G.; Shaheed, K.; Yin, Y. Pre-course student performance prediction with multi-instance multi-
label learning. Sci. China Inf. Sci. 2018, 62, 29101. https://doi.org/10.1007/s11432-017-9371-y.
90. Berrar, D. Confidence curves: An alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn.
2017, 106, 911–949. https://doi.org/10.1007/S10994-016-5612-6/FIGURES/12.
91. Berrar, D.; Lozano, J.A. Significance tests or confidence intervals: Which are preferable for the comparison of classifiers? J. Exp.
Theor. Artif. Intell. 2013, 25, 189–206. https://doi.org/10.1080/0952813x.2012.680252.
92. García, S.; Herrera, F. An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Com-
parisons. J. Mach. Learn. Res. 2008, 9, 2677–2694.
93. Biju, V.G.; Prashanth, C. Friedman and Wilcoxon Evaluations Comparing SVM, Bagging, Boosting, K-NN and Decision Tree
Classifiers. J. Appl. Comput. Sci. Methods 2017, 9, 23–47. https://doi.org/10.1515/jacsm-2017-0002.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual au-
thor(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.