Article
Predicting Academic Success of College Students Using
Machine Learning Techniques
Jorge Humberto Guanin-Fajardo 1, Javier Guaña-Moya 2,*, and Jorge Casillas 3
1 Facultad de Ciencias de la Ingeniería, Universidad Técnica Estatal de Quevedo, Quevedo 120508, Ecuador;
jorgeguanin@uteq.edu.ec
2 Facultad de Ingeniería, Pontificia Universidad Católica del Ecuador, Quito 170525, Ecuador
* Correspondence: eguana953@puce.edu.ec; Tel.: +593-995000484
3 Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain;
casillas@decsai.ugr.es
Abstract: College context and academic performance are important determinants of academic success; using machine learning techniques on students' prior records to predict academic success before the end of the first year reinforces college self-efficacy. Dropout prediction is related to student retention and has been studied extensively in recent work; however, there is little literature on predicting academic success using educational machine learning. For this reason, the CRISP-DM methodology was applied to extract relevant knowledge and features from the data. The dataset examined consists of 6690 records and 21 variables with academic and socioeconomic information. Preprocessing techniques and classification algorithms were analyzed. The area under the curve was used to measure the effectiveness of the algorithms; XGBoost had an AUC of 87.75% and correctly classified eight out of ten cases, while the decision tree improved interpretability with ten rules and was correct in seven out of ten cases. Recognizing the gaps in the study, and that on-time completion of college consolidates college self-efficacy, creating intervention and support strategies to retain students is a priority for decision makers. Assessing the fairness and discrimination of the algorithms was the main limitation of this work. In the future, we intend to apply the extracted knowledge and learn about its influence on university management.
Keywords: educational data mining; machine learning; educational analysis; higher education;
academic success
1. Introduction
Higher education plays a fundamental role in the versatility and complexity of today's world, which has led to the rapid growth of scientific literature dedicated to predicting academic success or the risk of student dropout [1–7]. Higher education institutions have moved beyond their traditional role of knowledge dissemination; innovation in new knowledge, especially with the irruption of artificial intelligence [8], and the training of qualified professionals lead many of them to interact with different areas of society. In fact, their missions of teaching, research, and the ability to share and transfer knowledge constitute central functions of their academic and cultural activity, with the aim of improving the level of knowledge in society. They have the important role of transmitting knowledge, skills, and values to students to create competitive professionals in society. Therefore, channeling students towards academic success is transcendental, as HEIs must continue the work undertaken and further deepen their involvement, significance, and service capacity in relation to the social, cultural, and economic framework [9]. Thus, the prediction of academic success with past information of students who have successfully completed their university studies has become a tool of interest for educational managers, since it allows them to strengthen decisions and build
improvement alternatives or educational policies. ICT is one of the most widely used alternatives today, especially machine learning.
Hence, advances in machine learning techniques, along with other areas of study, are precursors to educational data mining. In higher education, the academic success of students is statistically measured by the graduation rate, which is defined as the total number of students graduating among the total number of entering students. In fact, ref. [10] states that it is possible to think about student success more broadly by studying endogenous and exogenous factors in the student environment. Thus, the constant need to be effective in fostering students' academic success has led to the customization of machine learning to achieve specific predictive models that provide useful information.
In the last decade, many studies have focused on investigative works that address the problems of performance, dropout, and academic success in university students. As detailed in [11–14], the authors emphasize that university dropout or failure converges with students from disadvantaged social strata who project university dropout behavior. To sustain university permanence, among their findings, the authors are inclined to consider that extra-university activities that guarantee retention should be strengthened. Therefore, early detection has become a tool for solving these problems. Academic history, university context (tangible and intangible resources), and other data were used as the input elements to predict the results [4]. For this purpose, qualitative and quantitative research methods have been used to solve these problems. More recently, multiple studies have employed data mining or machine learning techniques that, among other things, use algorithms and two well-known techniques to extract useful knowledge from data. The first technique, supervised classification, evaluates the data and predicts the target variable (class). The work of [6,15–17] has shown results related to supervised classification.
Similarly, in [18,19], using another approach based on supervised classification, the authors used a set of pre-selected algorithms that classify the data by applying a voting technique. Both approaches attempt to predict students' academic success or performance effectively. The second technique, unsupervised classification, is one in which the target variable is unknown and which focuses on finding hidden patterns among the data. In general, association rules are used to discover facts occurring within the data and are composed of two parts, antecedent and consequent; for example, the rule {A, B} → {C} means that, when A and B occur, then C occurs. In [20–22], the authors look for the occurrence of data by focusing on association rules and evaluating the rules with metrics such as support, confidence, and lift, among others.
In the studies of [23–25], related to machine learning, a convergence of objectives and techniques applied in the data preprocessing stage was observed, including feature reduction, data transformation, normalization, and instance selection, among others. At the same time, data balancing techniques and "black box" classification algorithms were analyzed. The synergy of these studies lies in the simplification of the predictive models obtained, given the high degree of complexity of the extracted knowledge, for which they used decision trees, since this technique simplifies the knowledge by means of the representation of rules of type (X → Y). To some extent, the methods applied are part of the KDD process proposed in [26]. However, data asymmetry is a typical problem in any area of study. Duplicity, ambiguity, and missing and overlapping data are frequent, especially in authentic problems. Indeed, in data mining classification techniques, problems present an unequal distribution of examples among classes (target variable), where one or more classes (minority class) are underrepresented compared to the others (majority class) [27]. Commonly, the data balancing method defined by Chawla [28] is used in this type of problem. However, this work intends to fill the existing gap of data balancing with educational data by using different balancing methods for multiclass problems.
The approach of this study is similar to previous work described in [6,29–31], where similar tasks were performed with predictions on binary and multiclass targets. However, the main difference with our approach lies in the in-depth analysis of data balancing and
feature selection techniques to avoid biases in predictions. Using 53% fewer variables and improving accuracy by 10% over the preliminary results with the raw data, we not only built classification models to identify the relevant factors of college students' academic success, but also obtained a general model from the decision tree to achieve higher readability of the predictive model. In this way, we intend to provide additional guidance to academic decision makers. The open license software used for this work was R [32], through a customized library to visualize, preprocess, and classify the data. The Python library scikit-learn [33] was used for data balancing.
The core of the work focuses on the study of machine learning techniques that predict academic success. This has allowed us to establish the objective of the work, which is to know in advance the factors that explain the academic success of students at the end of their first year of university. To do this, it has been necessary to pose research questions, since we intend to identify the factors that contribute to the academic success of students during their first year of college. This will allow us to examine the preprocessing techniques, the predictive model, the determinants of academic success and, of course, the visualization techniques that improve interpretation before and after obtaining the predictive model. In this sense, the following research questions were posed:
RQ1: Which balancing and feature selection technique is relevant for supervised classification algorithms?
RQ2: Which predictive model best discriminates students’ academic success?
RQ3: Which factors are determinants of students’ academic success?
Most studies on predicting academic success with machine learning have focused solely on finding a predictive model that is, to some extent, highly effective. In contrast, the work presented here, in line with RQ1, seeks the group of features that are most significant for the model and, on the other hand, also seeks a balanced training dataset, using different data balancing techniques and avoiding biases in the prediction. RQ2, in turn, aims to find the most effective predictive model using different supervised learning algorithms. Finally, RQ3 examines which variables were relevant in the predictive model achieved by the machine learning algorithms, in order to then obtain another model with a better interpretation for the decision maker.
The presented work differs, among other things, by the following contributions: (i) we unveil the effectiveness of educational data mining techniques to identify academically successful students early enough to act and reduce the failure rate; (ii) the impact of data preprocessing is analyzed; (iii) the important variables underlying the best-performing predictive model are unveiled. Thus, the presented work is associated with the works of [23,29,34], where the authors have examined the characteristics and impact of the best-performing algorithm. The rest of the paper is organized as follows: in Section 2, a literature review is carried out; in Section 3, the methodology used in this work is explained; in Section 4, the main results obtained by applying machine learning are presented; in Section 5, the discussion is presented; in Section 6, the relevant conclusions are drawn; in Section 7, the limitations are acknowledged; and, finally, in Section 8, future work is described.
2. Literature Review
In the cited literature, there are works related to the study of machine learning in higher education and its impact on the prediction of academic performance or success. In prediction, the purpose is to predict the target variable (class) of a dataset. The works cited in Table 1 employ supervised classification algorithms that focus on obtaining the predictive model.
Table 1. Summary of papers related to the prediction of academic performance or success of university students.

Objective | Inst. 1 | Feat. 2 | DPM 3 | Accuracy | Citation | Scope
Performance | 6948 | 55 | Data preprocessing methods | 82% | [35] | Higher Education
Performance | 3830 | 27 | Data transformation, Discretization | 83% | [36] | Higher Education
Prediction | 1854 | 4 | — | 75% | [37] | Academic Success
Assessment | 731 | 12 | Extraction Feature, Imbalanced Dataset | 78% | [6] | Higher Education
Achievement | 339 | 15 | Extraction Feature | 69.3% | [23] | Higher Education
Performance | 32,593 | 31 | Extraction Feature, Imbalanced Dataset | 72.73% | [38] | Higher Education
Prediction | 9652 | 68 | Extraction Feature, Imbalanced Dataset | 75.43% | [24] | Higher Education
Prediction | 3225 | 57 | Extraction Feature, Imbalanced Dataset | 79.5% | [28] | Higher Education
Prediction | 300 | 18 | Extraction Feature | 63.33% | [34] | Higher Education
Prediction | 1491 | 13 | Extraction Feature, Imbalanced Dataset | 75.78% | [5] | Higher Education
Prediction | 7936 | 29 | Extraction Feature | 69.3% | [30] | Higher Education
Prediction | 4413 | — | — | — | [18] | Higher Education
Prediction | 6690 | 21 | Selection Feature, Selection Instance, Data Imbalanced | 81% | Our proposal | Higher Education

1 Number of instances. 2 Number of features. 3 Data preprocessing methods.
Among other works, the use of machine learning techniques to predict the success or failure of university courses or degrees stands out. The recommender system proposed by [35] suggests to computer science students the subjects they can take, in addition to predicting success or failure based on the previous experience of other university students. In that work, data preprocessing and example balancing techniques were applied. Then, the preprocessed data were used as input for the classification algorithms to learn and obtain the prediction model from the test data. The results achieved provide guidelines for university administrators to enhance educational quality. In this sense, the early provision of useful information to predict a given event in the student body is valuable. Hence, the study of academic performance is a relevant contribution in higher education. Helal [36] predicted the academic performance of the student body; the data used in that work were divided into groups, and each subgroup of data was evaluated with different classification algorithms to predict academic performance. Their results suggest that external students and female students performed well in the prediction.
The work of Bertolini [29] set out to examine different classification algorithms to predict final exam grades with reasonable accuracy, considering midterm grades. Similarly, Alyahyan [23] proposed the use of decision trees to predict students' academic performance and generate an early warning when low performance is detected. Different decision tree approaches as well as relevant feature extraction were employed to obtain a simpler model for decision making by academic experts. In line with this, refs. [29,34] also examined high-impact features in the data to fit representative variables with respect to college retention and dropout, in order to develop interventions that help improve student academic success.
Similarly, in Beaulac [39], the prediction of the academic success of university students has been studied by applying the random forest and decision tree algorithms, the latter being very intuitive for decision making; the authors propose the use of these techniques to know whether, at the end of the first two semesters, the student would achieve the university degree. Their results indicated that there is a strong relationship between underperforming grades and the likelihood of succeeding in a degree program, although this did not necessarily indicate a causal connection.
Several of the related articles reveal the variety of work linked to improving the educational system. The approach of Guerrero-Higueras [7] stands out, proposing the use of the GIT version control system as an evaluation methodology, where the frequency and use of the tool help predict the student's academic success. The variables studied describe the student's ability with tasks related to the development of the computer science subject. This methodology differs from the rest given the adaptation of the GIT version control platform and the issues specific to the computer science area.
The literature cited above emphasizes a gradual approach to selecting features that yield high accuracy in the algorithms and a simple, readable model. The lack of salient features prevents obtaining an effective prediction model because of the ambiguity or irrelevance of the variables [40]. On the other hand, of significant importance is the reduction of outliers in the data due to duplicate observations or overlapping data [41–43]. It is understood, of course, that all of this leads to the application of each stage suggested in the CRISP-DM methodology [26], which allows obtaining a reliable model at the end. The validity of the model obtained is checked by the performance metrics of the classification algorithms. Based on what has been presented in this section, it was observed in the literature that the work focuses mainly on two fronts: identifying significant attributes to predict student performance, success, or failure in higher education, and finding the best prediction method to improve the accuracy of the predictive model achieved.
3. Materials and Methods
3.1. Context
The Institution of Higher Education (IES) is geographically located in the Municipality of Quevedo, Province of Los Ríos, Ecuador. Its coordinates are 1°00′46″ S 79°28′09″ W (−1.012778, −79.469167). According to the policies of the IES and its minimum requirements, each university course is taught in face-to-face mode, and each academic year of the university course must be passed. In this case, each academic year consists of two academic cycles (semesters). Students must enroll in the university degree program and obtain grades in each subject, with a minimum grade of seven on a scale of zero to ten. As a result of the academic activities performed and their permanence in the university degree, the academic status of the student body is determined (dependent variable/class). Academic statuses are established in three categories. The first is "Passed", when the student has completed and passed all academic courses. The second is "Change", when the student passes courses other than the initial degree. And, finally, the third is "Dropout", when the student leaves the university completely.
3.2. Data Collection
Data collection was performed using SQL Server scripts. The data were extracted from the university's information system database server. The dataset used in this work consisted of two parts, student body and faculty, which were subsequently merged. The criterion for the merger was the classes taught in the first year by the faculty in the teaching process for the university degree. Thus, the first part of the information, referring to the students, dealt with academic and socioeconomic data, while that relating to the teaching staff referred to degrees obtained, age, and academic experience, among others. Among the diversity of professors in charge of university teaching of first-year students, there were full, associate, and occasional professors, totaling 286 professors selected for this study.
On the other hand, the number of regular students was 6690. Although the number of professors and students does not coincide, it should be clarified that a professor can teach different subjects. The students selected were those who were enrolled and had completed the first year of all university courses. In short, all of the above was framed within a retrospective of six complete academic years of each university degree, that is, ten calendar years. It should also be noted that any identifying reference to both faculty and students was eliminated to obtain an anonymous dataset. The information extracted for this work had the endorsement and permission of the competent authority of the higher education institution, as detailed in the Institutional Review Board Statement section. The database with the raw data had 21 variables and 6690 records (see Appendix A, Table A1 for a description of the variables used).
So far, one of the main differences between machine learning (ML) algorithms and traditional statistical methods lies in their purpose: the former focus on the ability to capture complex relationships between features and make predictions as accurate as possible, while the latter, especially linear regression (LR), logistic regression (LOR), generalized mixed models, and relevance-based prediction, among others, aim at inferring relationships between variables. However, the key difference between traditional statistical approaches and ML is that, in ML, a model learns from examples rather than being programmed with rules. For a given task, examples are provided in the form of inputs (called features or attributes) and outputs (called labels or classes) [44,45].
In this work, we used the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology proposed by [26], which comprises six phases: understanding the problem, understanding the data, data preparation, modeling, evaluation, and implementation. Data preparation, or data preprocessing, is a stage that gained importance and became key, since its function is to reduce the complexity of the original dataset in order to obtain a readable predictive model with useful variables. Therefore, the work is based on the best practices for data preprocessing suggested in [46–48]. For this reason, Appendixes B and C detail the results of the various methods used for data preprocessing through feature filtering, instance selection, and class balancing. The main advantage of efficient data preprocessing was the transfer of suitable data to the classification algorithms for simple and accurate learning. First, the compacted data were cleaned and transformed and then analyzed with visualization techniques that allowed, among other things, the location of trajectories, overlaps, and data behavior. Second, the data were stratified into two subsets: training and test. Then, the training set was filtered for relevant instances and features, and the data were balanced using different methods. The already balanced dataset was used as input for the classification algorithms, together with the test data used to evaluate the predictive model obtained. Finally, this model was evaluated with the metrics proposed in this work. Figure 1 shows the activities that were performed.
Figure 1. Diagram of activities performed. The processes conducted are described in four stages.
3.3. Metric Assessment
The metrics referred to in this section are used to evaluate the performance of the set of algorithms used to obtain the predictive models. In Equation (4), the term α represents P(Tp) = Sensitivity, and (1 − β) represents P(Tn) = Specificity [49].

Accuracy = (TP + TN) / (TP + FN + FP + TN)    (1)

Sensitivity = Recall = TP / (TP + FN)    (2)

Specificity = TN / (TN + FP)    (3)

AUC = (1/2) (α + (1 − β))    (4)

Precision = TP / (TP + FP)    (5)

Cohen's Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e))    (6)

LogLoss = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_{i,j} log(p_{i,j})    (7)
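As an illustration only (the paper computes these metrics in R; the toy labels and probabilities below are invented), Equations (1)–(7) can be obtained for a multiclass problem with scikit-learn as follows. Specificity, Equation (3), is derived from the confusion matrix rather than from a dedicated function.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             cohen_kappa_score, roc_auc_score, log_loss)

# Invented ground truth, hard predictions, and per-class probabilities
# (probability columns ordered as Change, Dropout, Pass).
y_true = ["Pass", "Dropout", "Change", "Pass", "Dropout", "Pass"]
y_pred = ["Pass", "Dropout", "Pass", "Pass", "Change", "Pass"]
y_prob = np.array([[0.1, 0.2, 0.7], [0.1, 0.8, 0.1], [0.3, 0.2, 0.5],
                   [0.2, 0.1, 0.7], [0.4, 0.3, 0.3], [0.1, 0.1, 0.8]])
labels = ["Change", "Dropout", "Pass"]

print("Accuracy :", accuracy_score(y_true, y_pred))                     # Eq. (1)
print("Recall   :", recall_score(y_true, y_pred, average="macro"))      # Eq. (2)
print("Precision:", precision_score(y_true, y_pred, average="macro"))   # Eq. (5)
print("Kappa    :", cohen_kappa_score(y_true, y_pred))                  # Eq. (6)
print("AUC (ovr):", roc_auc_score(y_true, y_prob,
                                  multi_class="ovr", labels=labels))    # Eq. (4)
print("LogLoss  :", log_loss(y_true, y_prob, labels=labels))            # Eq. (7)
```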
3.4. Data Exploration
The importance of data exploration is that it serves to understand the activity and behavior of the data. Visualization techniques were used to detect significant information in the data; specifically, variables were examined according to each category of the class using graphs (Figure 2).
Figure 2. Undirected graphs calculated from the correlation matrix (Pearson's method), one per class: (a) Pass, (b) Dropout, (c) Change. Both the arcs and the adjacency matrix were filtered with cut-off points obtained from the weighted mean of the nodes (Pass = 0.0007804694, Dropout = 0.0061971, Change = 0.01684287). The graphs had weights associated with each of the arcs, and this weight fixed their density. Three groups of subfigures were separated according to the target variable (Pass, Dropout, Change). Subfigure (a) showed three subgroups of variables (8, 5, 5) where a common variable overlaps. Cluster (b) showed three subgroups of variables (8, 3, 8); this subfigure lacks overlap. Group (c) showed four subgroups of variables (6, 7, 4, 2) overlapped by three common variables. On the other hand, red lines indicate a lower degree of association, while black lines and thickness indicate their strength of association.
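The kind of correlation graph shown in Figure 2 can be sketched as follows; this is an assumed implementation (the authors used a customized R library), and the file name and cut-off value are placeholders.

```python
import pandas as pd
import networkx as nx

# Hypothetical per-class subset of the data (file name is a placeholder).
df = pd.read_csv("students_pass.csv")
corr = df.corr(method="pearson", numeric_only=True).abs()

cutoff = 0.0008  # e.g., the weighted mean of the nodes, as in Figure 2
G = nx.Graph()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        w = corr.loc[a, b]
        if w > cutoff:            # keep only arcs above the cut-off
            G.add_edge(a, b, weight=w)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "arcs retained")
```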
3.5. Data Preprocessing
The importance of data preprocessing lies in synthesizing the data and making them readily usable. This has an important consequence for classification algorithms, since the integrity of the data is gradually assessed by the hit rate, i.e., the number of true positives that the prediction algorithm can detect. Within this context, the aim is to obtain the set of features and instances that comes close to a reasonable hit rate. The problem around which data preprocessing revolves is the different search strategies, such as sequential, random, and complete, that are proposed for this task. The evaluation criterion is set with filtering (distance, information, dependency, and consistency), hybrid, and wrapper methods [50–54].
The data preprocessing was divided into four phases. First, missing values in the data were replaced using the k-nearest neighbors algorithm KNN_MV [55]. Second, unrepresentative instances were excluded using the "NoiseFiltersR" algorithm. Third, feature selection was studied with different algorithms and functions that evaluated feature quality. Finally, data balancing was applied to avoid bias in the prediction model due to the small amount of minority class data.
3.6. Missing Values
Data in their original form contain inconsistencies and often have missing values. That is, when the value of a variable is not stored, it is considered missing data. Multiple techniques have been developed to replace missing values. In general, statistical techniques of central tendency are used; for numerical values, the mean or median is used, while for nominal values, the mode is usually used. Another common technique is to remove the entire record from the dataset, although deletion can cause a significant loss of information. These frequent techniques are easy to use and solve the problem of missing values; however, in data mining practice, there is a tendency to implement algorithms that solve this problem by examining the entire dataset. Specifically, in this work, we used the "rfImpute" function, which replaced missing values by the nearest neighbor technique, taking the class (target variable) as a reference.
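The paper performs this step in R with the "rfImpute" function; as a rough, non-equivalent Python analogue, a k-nearest-neighbour imputer illustrates the same idea of filling gaps from similar records (toy numeric data, invented).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with missing entries (invented for illustration).
X = np.array([[7.5, 1.0],
              [np.nan, 0.0],
              [8.2, 1.0],
              [6.9, np.nan]])

imputer = KNNImputer(n_neighbors=2)      # replace NaNs with neighbour averages
print(imputer.fit_transform(X))
```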
3.7. Instance Selection
Instance selection was also key in the data preprocessing, since poor-quality examples were eliminated using the NoiseFiltersR algorithm [41], which filtered out the 5% of examples that were not within the data standard. In other words, when a value is at an unusual distance from the rest of the values in the dataset, it is considered an outlier or noise.
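The filter itself comes from the R package NoiseFiltersR, which has no direct scikit-learn equivalent; purely as a stand-in to illustrate removing roughly 5% of atypical instances, an IsolationForest can be used (this is not the authors' method).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy feature matrix

iso = IsolationForest(contamination=0.05, random_state=0)
keep = iso.fit_predict(X) == 1                 # -1 flags suspected noise/outliers
X_clean = X[keep]
print(X.shape[0] - X_clean.shape[0], "instances removed")
```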
3.8. Feature Selection
There is an important distinction to be made in this section, since the generality and accuracy of the predictive model depend on the quality of the variables. Therefore, it is crucial to decide which variables are relevant to include in the study. For this, we used nine feature selection algorithms, among them "LasVegas-LVF", "Relief" [56], "selectKBest", "hillClimbing", "sequentialBackward", "sequentialFloatingForward", "deepFirst", "geneticAlgorithm", and "antColony". On the other hand, the algorithms used distinct functions to value the attributes. Among these functions, we had "mutualInformation" [57], "MDLC" [58], "determinationCoefficient" [59], "GainRatio" [60], "Gini Index" [61], and "roughsetConsistency" [62,63]. The group of algorithms used for the study of significant characteristics obtained subgroups of variables that have been evaluated and are shown in Table 2 and Appendix C, Table A3; an illustrative code sketch follows Table 2.
Table 2. Feature filtering by the "Relief" algorithm using different k and bestk filters. The smallest feature subset and the highest accuracy achieved by the C4.5 classification algorithm were established with the "bestk" filtering (10 variables).

Filter | Variables | Value | Accuracy | Kappa | Sensitivity | Specificity | Precision | Recall | F1
k = 9 | 11 | −0.002 | 0.75 | 0.56 | 0.83 | 0.85 | 0.79 | 0.83 | 0.80
k = 7 | 11 | −0.001 | 0.76 | 0.5 | 0.82 | 0.85 | 0.78 | 0.82 | 0.80
k = 5 | 11 | −0.003 | 0.74 | 0.52 | 0.80 | 0.83 | 0.77 | 0.80 | 0.78
k = 3 | 14 | −0.001 | 0.76 | 0.56 | 0.82 | 0.85 | 0.79 | 0.82 | 0.80
bestk | 10 | 0.062 | 0.79 | 0.62 | 0.85 | 0.87 | 0.81 | 0.85 | 0.83
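As a hedged sketch of one filter combination named above ("selectKBest" scoring features by mutual information, keeping the 10 variables of the final "bestk" configuration), with toy data standing in for the 21-variable training set:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy data standing in for the 21-variable, three-class training set.
X, y = make_classification(n_samples=500, n_features=21, n_informative=8,
                           n_classes=3, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_sel = selector.fit_transform(X, y)            # keep the 10 top-ranked features
print(selector.get_support().nonzero()[0])      # indices of the retained variables
```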
3.9. Data Balancing
Sample balancing is another important step in data preprocessing. Currently, there are several techniques for data balancing or resampling available through Python 3.9 and its scikit-learn library [33]. In this work, the following techniques were studied: oversampling, combined, undersampling, and ensemble. The first used the methods "SMOTE" [28] and "KMeansSMOTE" (k-means clustering followed by oversampling with SMOTE) [64]. The second used both "SMOTE-ENN" and "SMOTE-Tomek" (oversampling with SMOTE combined with data cleaning) [65]. The third technique was subsampling with the "RUS" method [66]. Finally, the ensemble technique used "EasyEnsemble" [67] and "Bagging". Specifically, new balanced training datasets were generated from the initial training set, in which the different techniques and methods were used to balance the data (see Table 3; an illustrative sketch follows the table).
Table 3. Distribution of data per class using different data balancing techniques, along with the corresponding imbalance ratio (IR) between the majority and minority classes. A higher IR indicates a more severe class imbalance problem.

Algorithm Used | Dropout | Change | Pass | Overall | IR
Original data (no algorithm) | 3,346 | 466 | 2,080 | 5,892 | 7.180
Over (SMOTE) | 2,826 | 5,652 | 8,478 | 16,956 | 3
Over (KMeansSMOTE) | 5,655 | 8,481 | 2,829 | 16,965 | 2.997
Combined (SMOTE-ENN) | 5,365 | 2,822 | 4,164 | 12,351 | 1.901
Combined (SMOTE-Tomek) | 5,360 | 2,826 | 7,894 | 16,080 | 1.472
Under (RUS) | 355 | 1,065 | 710 | 2,130 | 3
Under (TomekLinks) | 2,439 | 4,229 | 3,874 | 10,542 | 1.733
Ensembles (EasyEnsemble) | 2,826 | 5,017 | 4,662 | 12,505 | 1.775
Ensembles (Bagging) | 2,826 | 5,017 | 4,662 | 12,505 | 1.775
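A hedged sketch of the balancing step follows. The paper cites scikit-learn; in current Python practice, SMOTE, SMOTE-ENN, SMOTE-Tomek, and RUS live in the companion imbalanced-learn package, and the class weights below are invented to mimic the Dropout/Change/Pass imbalance. Only the training partition would be resampled.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced three-class problem standing in for Dropout/Change/Pass.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.57, 0.07, 0.36],
                           random_state=0)
print("original:", Counter(y))

samplers = [("SMOTE", SMOTE(random_state=0)),
            ("SMOTE-ENN", SMOTEENN(random_state=0)),
            ("SMOTE-Tomek", SMOTETomek(random_state=0)),
            ("RUS", RandomUnderSampler(random_state=0))]

for name, sampler in samplers:
    X_bal, y_bal = sampler.fit_resample(X, y)   # each yields a new balanced training set
    print(name, Counter(y_bal))
```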
3.10. Classification Algorithms
The use of supervised classification techniques aims to achieve a prediction model that is highly accurate. Hence, several algorithms have been created that use different mathematical models to achieve this. In this section, we detail the types of algorithms and provide a brief description of how each works; a minimal training sketch follows the list.

Decision Trees: Consists of building a tree structure in which each branch represents a question about an attribute. New branches are created according to the answers to the question until reaching the leaves of the tree (where the structure ends). The leaf nodes indicate the predicted class; see [35].

Support Vector Machine (SVM): A relatively simple supervised machine learning algorithm used in regression or classification problems. In many cases, it is used for classification, although it can also be used for regression. Basically, SVM creates a hyperplane with boundaries between data types; in a two-dimensional space, this hyperplane is nothing more than a line. In SVM, each datum in the dataset is plotted in an N-dimensional space, where N is the number of features/attributes of the data; see [68].

Neural Network: Multilayer perceptrons (MLP) are the best known and most widely used type of neural network. They consist of neuron-like units with multiple inputs and an output. Each of these units forms a weighted sum of its inputs, to which a constant term is added. This sum is then passed through a nonlinearity, usually called an activation function. Most of the time, the units are interconnected in such a way that they form no loop; see [69].

Random Forest: A combination of tree predictors, where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The use of random feature selection to split each node produces error rates that compare favorably with "AdaBoost" but are more robust with respect to noise. The internal estimates control for error, strength, and correlation and are used to show the response to increasing the number of features used in the split. Internal estimates are also used to measure the importance of variables; see [70].

Gradient Boosting Machine: Gradient boosting is a machine learning technique used to solve regression or classification problems, which builds a predictive model in the form of decision trees. It develops a general gradient descent "boosting" paradigm for additive expansions based on any fitting criterion. Gradient boosting of regression trees produces competitive, very robust, and interpretable regression and classification procedures, especially suitable for the extraction of not-so-clean data; see [71].

XGBoost: XGBoost is a distributed and optimized gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate manner; see [72].

Bagging: Predictor bagging is a method of generating multiple versions of a predictor and using them to obtain an aggregate predictor. Bagging averages the versions when predicting a numerical outcome and performs plurality voting when predicting a class. Multiple versions are formed by making bootstrap replicas of the learning set and using them as new learning sets. Tests on real and simulated datasets show that bagging can provide a substantial increase in accuracy; see [73].

Naïve Bayes: A probabilistic machine learning model used for classification tasks. The core of the classifier is based on Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), which gives the probability of A occurring given that B has occurred. Here, B is the evidence, and A is the hypothesis. The assumption made is that the predictors/features are independent, that is, the presence of a particular feature does not affect the others; see [74].
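A minimal, self-contained sketch (hyperparameters are assumptions and the data are synthetic) showing how two of the models described above, XGBoost and a single decision tree, can be trained on a three-class problem:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic three-class data standing in for the preprocessed training set.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="mlogloss", random_state=0)
xgb.fit(X, y)                                   # gradient-boosted trees

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X, y)                                  # single, readable decision tree

print(xgb.predict(X[:5]), tree.predict(X[:5]))
```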
4. Results
In response to the research questions posed, different data preprocessing algorithms were employed to reduce the dimensionality of the dataset, so that the classification algorithms could obtain a simple and accurate predictive model. In the following sections, we first study data preprocessing for feature selection. Second, we study data balancing using different balancing algorithms and, finally, the results using the metrics calculated from the confusion matrix, with which the performance of the algorithms was evaluated.
4.1. Data Preprocessing
4.1.1. Feature Selection
Prior to preprocessing, the dataset was separated into two parts: 75% of the total was selected as training data, and the other 25% as test data. The latter were used to evaluate the predictive model achieved by the classification algorithms, while the training set was subjected to preprocessing techniques to reduce dimensionality and obtain adequate data. In this sense, the work has focused on achieving simplicity and improving the accuracy of the predictive model, for which different feature selection methods and filters were configured. Table 2 shows the results of the algorithm that obtained the fewest features; the remaining runs of other algorithms and their results can be found in Appendix C.
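A sketch of this stratified 75/25 split follows; the small DataFrame, the column subset, and the class column name "Status" are invented stand-ins (the actual variable names are listed in Appendix A, Table A1).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the 21-variable dataset; "Status" is the class column.
df = pd.DataFrame({
    "Average":      [8.1, 6.5, 7.9, 5.8, 9.0, 6.9, 7.2, 8.4, 7.7, 6.1, 8.8, 7.0],
    "RateApproval": [0.9, 0.4, 0.8, 0.3, 1.0, 0.5, 0.7, 0.9, 0.8, 0.2, 1.0, 0.6],
    "Status": ["Pass", "Dropout", "Pass", "Dropout", "Pass", "Change",
               "Dropout", "Pass", "Pass", "Dropout", "Pass", "Change"],
})

X = df.drop(columns=["Status"])
y = df["Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)   # 75% training / 25% test
print(len(X_train), "training rows,", len(X_test), "test rows")
```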
In line with the cited works, in particular the studies of [15,28], relevant features in the data were examined to improve the predictive model. Table 2 presents the results for the pre-selected feature set, where each evaluative filter and method rated the variables according to the performance metric. Specifically, the Relief method together with the "bestk" evaluative filter achieved the best efficiency, i.e., higher accuracy with fewer variables. Based on these results, a new dataset with the new characteristics was established and used as input data for the data balancing phase described in the next section.
4.1.2. Data Balancing
The importance of data balancing is fundamental to classification algorithms, since the disparity of examples between one class and another can lead to bias in the prediction model. There are two common techniques for data balancing. The first is oversampling, in which the data are balanced to the same number of examples as the majority class. The second is to reduce the other classes to the same number of examples as the minority class. Both techniques, although not very efficient, are useful for obtaining primary results, since the redistribution of the data is achieved with the judgment and experience of the data analyst. To some extent, this personalized judgment is avoided by the intervention of algorithms that perform data balancing. The algorithms augment, reduce, or equalize the examples depending on the technique applied. Accordingly, Table 3 shows the data imbalance index according to the algorithms used. Thus, each algorithm generated a new balanced dataset that was used to train the classification algorithms.
4.2. Classification Algorithms
In this section, we examine the effectiveness of the set of classification algorithms proposed for this work, which addresses a multiclass problem, that is, a dependent variable (class) with three types of outputs: Dropout, Change, and Passed. For this reason, and as is common in supervised classification problems, two datasets were used: the first for the algorithms to learn and obtain a prediction model, and the second to evaluate the effectiveness of the model obtained. Hence, we worked with two types of analysis: the first with the original data (without data preprocessing) and the second with the different datasets generated by the preprocessing techniques used.
It is difficult not to appreciate the importance of data preprocessing, as it provides classification algorithms with balanced and clean datasets. Obtaining the predictive model requires the algorithm to learn from the provided data (training set), as the effectiveness of the model depends on it. Therefore, for the algorithm to achieve adequate learning, k-fold cross-validation (CV) was applied; this approach randomly subdivided the training set into 10 folds of approximately equal size, and each fold, in turn, was fragmented into two sections: training and test. This was done so that, at the end of training, the mean prediction was obtained across the folds. On the other hand, to check what was learned by the algorithms, the metrics proposed in the methodology section were used, which helped to discriminate the most effective predictive models. While it is true that effectiveness is fundamental to evaluate the predictive model, the comprehensibility of the model obtained is also important, since experts evaluate the simplicity of the model.
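A sketch of the 10-fold cross-validation described above (synthetic data; the real runs used the balanced training sets and the R environment):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for a balanced training set.
X, y = make_classification(n_samples=1200, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = XGBClassifier(eval_metric="mlogloss", random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())   # mean prediction quality across the 10 folds
```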
Here, we present the best results of the classification algorithms, which were achieved using the dataset balanced by the "EasyEnsemble" algorithm, together with the performance assessment of the classifiers using the ROC curve presented in Figure 3. The rest of the results, with the different datasets derived from the application of the data balancing algorithms, are presented in Appendix B, Table A2.
In view of the results, Table 4 (raw data) and Table 5 (preprocessed data) show differences in the performance of the algorithms. Negative differences of −0.0214 and −0.0222 in precision and AUC, respectively, are evident. This negative effect between raw data and preprocessed data is a consequence of preprocessing, so data preprocessing should be interpreted not as a contradictory process but as an improvement of the predictive model obtained with fewer variables from the original set. Therefore, the advantage of applying data preprocessing has been observed.
Figure 3. Performance of the group of algorithms, shown by plotting the ROC curve and its associated AUC. On the ordinate axis is the true positive rate, and on the abscissa axis the false positive rate. The classifier lines above the diagonal (dashed line) represent good classification results (better than random), while those below represent bad results (worse than random). The best performance in classifying the test data examples was obtained by the XGBoost algorithm; two algorithms had an AUC above 0.87, and the rest performed below 0.86. This performance clearly indicates the effectiveness of the predictive model against the test set.
Table 4. Preliminary results for the original dataset, omitting data preprocessing.

Algorithms | Accuracy | Kappa | Sensitivity | Specificity | Precision | Recall | F1 | AUC | LogLoss
XGBoost | 0.8133 | 0.6617 | 0.8492 | 0.8861 | 0.8456 | 0.8492 | 0.8462 | 0.8997 | 0.3736
RandomForest | 0.8163 | 0.6664 | 0.8523 | 0.8873 | 0.8428 | 0.8523 | 0.8468 | 0.8978 | NA
Gbm | 0.8062 | 0.6473 | 0.8460 | 0.8800 | 0.8352 | 0.8460 | 0.8401 | 0.8930 | 0.3925
Bagging | 0.8008 | 0.6379 | 0.8423 | 0.8769 | 0.8291 | 0.8423 | 0.8351 | 0.8781 | NA
C4.5 | 0.7822 | 0.6039 | 0.8378 | 0.8642 | 0.8033 | 0.8378 | 0.8193 | 0.8308 | NA
NaiveBayes | 0.6549 | 0.3847 | 0.5215 | 0.8025 | 0.7622 | 0.5215 | 0.5059 | 0.8168 | NA
SvmRadial | 0.7284 | 0.4934 | 0.7781 | 0.8218 | 0.7673 | 0.7781 | 0.7709 | 0.7973 | NA
SvmPoly | 0.7165 | 0.4687 | 0.7571 | 0.8132 | 0.7685 | 0.7571 | 0.7616 | 0.7754 | 0.5484
MLP | 0.6895 | 0.4501 | 0.7673 | 0.8143 | 0.7471 | 0.7673 | 0.7511 | 0.7621 | 0.5378
Table 5. Evaluation results of the predictive models obtained by the classification algorithms. The training set was balanced with the "EasyEnsemble" technique. Model validation was performed on the test dataset. The data are sorted according to the AUC column.

Algorithms | Accuracy | Kappa | Sensitivity | Specificity | Precision | Recall | F1 | AUC | LogLoss
XGBoost | 0.7949 | 0.6299 | 0.8425 | 0.8753 | 0.8214 | 0.8425 | 0.8306 | 0.8775 | 6.3430
RandomForest | 0.7925 | 0.6269 | 0.8444 | 0.8747 | 0.8205 | 0.8444 | 0.8305 | 0.8744 | NA
Gbm | 0.7752 | 0.5923 | 0.8318 | 0.8605 | 0.8043 | 0.8318 | 0.8171 | 0.8606 | 5.6340
Bagging | 0.7752 | 0.5933 | 0.8268 | 0.8617 | 0.8088 | 0.8268 | 0.8168 | 0.8591 | NA
C4.5 | 0.7644 | 0.5803 | 0.8334 | 0.8594 | 0.7964 | 0.8334 | 0.8110 | 0.8249 | NA
SvmPoly | 0.6861 | 0.4094 | 0.7347 | 0.7919 | 0.7466 | 0.7347 | 0.7384 | 0.7679 | 4.1072
SvmRadial | 0.6814 | 0.4073 | 0.7460 | 0.7920 | 0.7321 | 0.7460 | 0.7377 | 0.7676 | NA
MLP | 0.6539 | 0.4059 | 0.7620 | 0.8013 | 0.7462 | 0.7620 | 0.7360 | 0.7446 | 3.2832
NaiveBayes | 0.6389 | 0.3850 | 0.6348 | 0.8022 | 0.7879 | 0.6348 | 0.6442 | 0.8018 | 6.3015
It should be noted that the logloss was lower with the original data than with the preprocessed data. The increase with the latter was due to the smaller imbalance between classes. That is, the smaller the imbalance between classes, the greater the logloss, due to the smaller proportion of observations in the minority class. Table 3 shows the imbalance index of the original set and of the dataset preprocessed with "EasyEnsemble" (column IR: 7.18 and 1.775, respectively).
In Table 6, the confusion matrix of the best-scoring algorithm (XGBoost) explains the predicted values for the test dataset under the prediction model obtained by the algorithm. First, the type II (β) error was analyzed, where (a) the "Dropout" class had 868 actual cases, of which 741 were correctly predicted and 127 were classified as "Pass"; (b) the "Change" class had 126 cases, of which 115 were correct and 11 were classified as "Pass"; (c) the "Pass" class had 679 cases, of which 474 were correct, four were classified as "Change", and 201 were classified as "Dropout". Second, the type I (α) error was analyzed, where (a) the class "Dropout" had 942 predicted cases, of which 741 were correct and 201 were actually "Pass"; (b) the class "Change" had 119 predicted cases, of which 115 were correct and four were actually "Pass"; (c) the class "Pass" had 612 predicted cases, of which 474 were correct, 11 were actually "Change", and 127 were actually "Dropout".
Table 6. Confusion matrix of the XGBoost algorithm. The actual values (rows) are shown versus the values predicted by the classifier (columns).

Actual \ Prediction | Dropout | Change | Pass | Total | Error Type II (β)
Dropout | 741 | 0 | 127 | 868 | 0.8536
Change | 0 | 115 | 11 | 126 | 0.9126
Pass | 201 | 4 | 474 | 679 | 0.6980
Total | 942 | 119 | 612 | 1673 | µ = 0.8214
Error Type I (α) | 0.7866 | 0.9663 | 0.7745 | | µ = 0.8431
Overall, a more efficient predictive model was obtained with the XGBoost classification algorithm. The work of [75] highlights that the random forest algorithm obtained a better result in accuracy (ACC: 0.81) using only 10 features of the original dataset, pointing out the importance of improving academic performance and increasing the graduation rate of the students of the educational center. Consequently, it must be considered that as the accuracy of the model increases, its complexity also needs to remain explainable. In this context, we looked for a way to apply a simple and readable method. The decision tree provides a simple rule-based model that improves comprehensibility. The use of the decision tree, although less efficient, is very easy to interpret. Figure 4 shows the decision tree generated from the training data, and Figure 5 shows the important variables; an illustrative sketch follows the figures.
Figure 4. The decision tree drawn is based on the rules obtained. The nodes represent the class. The three decimal values within a node represent the probability of each class with respect to the evaluation of the rule. In turn, the total percentage of cases covered by the rule (cover) is shown. Below the node, the condition of the rule is displayed.
Figure 5. The importance of a variable is calculated by summing the decrease in error when splitting on that variable. Thus, the higher the value, the more the variable contributes to improving the model; the values are bounded between 0 and 1.
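An illustrative sketch of extracting an If/Then rule view and variable importances from a shallow tree, analogous to Figures 4 and 5 (the data and parameters are assumptions; the authors worked in R):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the balanced training data.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
feature_names = [f"var_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=feature_names))        # If/Then rule view
print(dict(zip(feature_names, tree.feature_importances_)))   # importance per variable
```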
4.3. Statistical Comparison of Several Classifiers
Formally, statistical significance is defined as a probability measure to assess experiments or studies. Ronald Fisher promoted the use of the null hypothesis [76], establishing a significance threshold of 0.05 (1/20) to determine the validity of the results obtained in empirical tests. In this way, it is guaranteed that the results do not stem from chance coincidences. In the work of Demšar [77], the statistical significance of different classification algorithms on real-world datasets was validated by different empirical tests. In this context, the nonparametric Friedman and Wilcoxon tests were used; both are suitable for this type of analysis because they do not rely on the normal distribution of the data or on the homogeneity of variances, making them appropriate for studies with data of a real or unmanipulated nature.
Prior to the calculation of the nonparametric tests, the results matrix of the group of algorithms and the datasets was organized, using the area under the curve (AUC, see Appendix D, Table A6) as the metric. The significance threshold was set at 0.05 for the Friedman and Wilcoxon tests to determine whether there were significant differences between more than two dependent groups. To perform the empirical tests, we used the null hypothesis H0: there are no significant differences between the groups of algorithms, and the alternative hypothesis Ha: there is at least one significant difference between the groups of algorithms. The Friedman test yielded a chi-square (χ²) of 52.305 with 8 degrees of freedom and a p-value of 1.47 × 10−8 (see Appendix D, Table A4). Since the p-value was below the threshold, the null hypothesis was rejected and the alternative hypothesis was accepted, confirming the existence of significant differences. Next, a pairwise comparison of algorithms was performed using the Wilcoxon test to assess the significance of these differences.
The above analysis established that there were significant differences, so a test was performed for each pair of algorithms using the Wilcoxon test, which serves as a Friedman post hoc test; the p-values obtained are shown in Table 7.
Table 7. Wilcoxon signed-rank test p-values for each pair of algorithms.

 | XGBoost | RF | Gbm | Bagging | C4.5 | NaiveBayes | SvmRadial | SvmPoly
RF | 0.018 | - | | | | | |
Gbm | 0.018 | 0.063 * | - | | | | |
Bagging | 0.018 | 0.018 | 0.018 | - | | | |
C4.5 | 0.018 | 0.018 | 0.018 | 0.018 | - | | |
NaiveBayes | 0.018 | 0.018 | 0.018 | 0.018 | 0.612 * | - | |
SvmRadial | 0.018 | 0.018 | 0.018 | 0.018 | 0.028 | 0.018 | - |
SvmPoly | 0.018 | 0.018 | 0.018 | 0.018 | 0.028 | 0.018 | 0.398 * | -
MLP | 0.018 | 0.018 | 0.018 | 0.018 | 0.043 | 0.018 | 0.091 * | 0.128 *

* Null hypothesis not rejected (p > 0.05).
According to the results, no significant differences were found for RF vs. Gbm (0.063), C4.5 vs. NaiveBayes (0.612), SvmRadial vs. SvmPoly (0.398), SvmRadial vs. MLP (0.091), and SvmPoly vs. MLP (0.128), whereas all remaining pairs differed significantly (see Appendix D, Table A5 for detailed results). In [78–80], opinions on statistics and significance tests have been discussed, because they are often misused, either by misinterpretation or by overemphasizing their results. It should be stated that statistical tests provide some assurance of the validity and non-randomness of the results [77].
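For reference, the two tests can be reproduced with SciPy as sketched below; the AUC values here are invented placeholders, while the real matrix is the one in Appendix D, Table A6.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder AUC matrix: rows = balanced datasets, columns = algorithms
# (e.g., XGBoost, RandomForest, C4.5). Real values are in Appendix D, Table A6.
auc = np.array([[0.88, 0.87, 0.83],
                [0.86, 0.86, 0.82],
                [0.87, 0.85, 0.81],
                [0.85, 0.84, 0.80],
                [0.86, 0.85, 0.82],
                [0.87, 0.86, 0.81]])

stat, p = friedmanchisquare(*[auc[:, j] for j in range(auc.shape[1])])
print("Friedman chi-square:", stat, "p-value:", p)

if p < 0.05:   # post hoc: pairwise Wilcoxon signed-rank tests
    _, p_pair = wilcoxon(auc[:, 0], auc[:, 1])
    print("XGBoost vs. RandomForest:", p_pair)
```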
5. Discussion
This paper explores and discusses three research questions related to machine learning techniques applied to achieve a predictive model with greater accuracy and readability, in addition to the study of factors that lead to the academic success of university students when they finish the first year. The answers to the questions posed are detailed below.
RQ1: Which balancing and feature selection technique is relevant for supervised classification algorithms? In general, it is evident that as the number of variables increases, the accuracy of the model increases, and so does its complexity, since the classification algorithms improve performance although the readability of the model decreases. In this regard, Alwarthan [24] applied recursive feature elimination (RFE) with the Pearson correlation coefficient, RFE with mutual information, and a genetic algorithm to find relevant features, in addition to class balancing using SMOTE-TomekLink, to build the final prediction model. The relevant variables were related to English courses and GPA, as well as students' social variables. Alwarthan [24] used 68 features and achieved 93% accuracy with the initial results, while feature filtering detected 44 relevant variables and 90% accuracy. They further analyzed eight relevant characteristics that achieved 77% accuracy; these variables were directly related to the academic performance of the student body.
In [6], the filtering of characteristics using the Gini index was proposed, from which seven characteristics were selected, achieving 79% accuracy using the random forest algorithm. These results were very similar to ours, but far from being explainable, due to the bias derived from the imbalance of the data. In the proposal made in this study, different data processing techniques were used to obtain a streamlined dataset. On the one hand, the instance filtering method was considered to reduce duplicate or noisy observations by 5%. On the other hand, for feature group filtering, six methods were used and five filters were applied, with which an accuracy between 58% and 78% was achieved. In addition, when applying the "ReliefF" method, 10 features were obtained with an accuracy of 79% (algorithm C4.5). In the literature presented, by contrast, the analyzed datasets had accuracy values below 84% and 32 features on average. The difference from what is proposed in this work is greater than 5% in accuracy, which is initially attractive. However, handling 22 additional features generates a more complex and poorly explainable model for decision support.
Consequently, data balancing as part of data preprocessing was crucial to achieve a robust predictive model. The literature reviewed generally poses data balancing as a step prior to feature filtering. The approach taken here is to first obtain a filtered dataset (instances and features) and then apply data balancing. Among the best classification accuracies achieved by the data balancing methods, a range between 73% and 79% was obtained. The "EasyEnsemble" method obtained the best accuracy, AUC, and logloss. The latter was far from that of the original data, as the imbalance ratio was high. For example, the class proportions of the original data (7.35 IR) for the undergraduate academic statuses (dropout, change, and pass) were 57%, 7%, and 36%, while for the balanced data (1.75 IR), they were 23%, 40%, and 37% with synthetic observations. The accuracy of the XGBoost model with balanced data was approximately 80%. In summary, the proposed data preprocessing made the dataset unbiased and the predictive model simple and explainable.
RQ2: Which predictive model best discriminates students' academic success? Currently, several supervised algorithms are used to predict different educational outcomes in higher education. Specifically, the best discrimination was achieved by the XGBoost algorithm. This criterion was based first on the values collected with the predictive model, where the accuracy was 79.49% and the AUC was 87.75%. Sensitivity was 84.25%, which indicated the rate of positive examples that the algorithm was able to classify, while specificity was 87.53% for negative examples. Next, the logloss metric was 0.3736 with the original dataset, which has an imbalance ratio of 7.18; however, the logloss rose to 6.34 with the preprocessed dataset and an imbalance ratio of 1.775, i.e., the lower the class imbalance, the higher the logloss observed, without this implying worse performance of the predictive model. Although the predictive model obtained using XGBoost is poorly explainable due to its high complexity, it performed better at classifying examples from the test set. Explainability was obtained when the decision tree was applied to the training set to produce a predictive model based on rules (If, Then) that is readable for decision makers.
Similarly, [6,16–19,24,75] converge in their predictions on higher education data using classifiers such as random forest (RF), SVM, neural networks, and decision trees. Likewise, linear or logistic regression has been used to obtain predictive models that detect failure, success, or academic performance early enough [1,81], and semi-supervised learning has been used to find patterns among students who managed to pass the courses of a university degree [22]. Since the main objective is to achieve attractive and reliable accuracies, accuracy undoubtedly goes hand in hand with the quantity and quality of the data. For example, Gil [38] obtained accuracy rates with random forest of 77%, 91%, and 94% using 30, 44, and 68 features, respectively, evidencing the positive correlation between the number of features and accuracy. That said, in our results, accuracies very close to 80% were achieved with only 10 features and a completely readable model (10 rules).
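The readable rule set can be obtained along the following lines, a sketch under the same assumptions as the previous snippets; max_depth and min_samples_leaf are illustrative simplification settings, not the study's exact configuration.

```python
# Hedged sketch: fit a shallow decision tree on the 10 selected features and
# print its If-Then rules in plain text for decision makers.
from sklearn.tree import DecisionTreeClassifier, export_text

rule_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                   min_samples_leaf=50, random_state=0)
rule_tree.fit(X_train[selected], y_train)   # 'selected' from the feature-selection sketch
print(export_text(rule_tree, feature_names=list(selected)))
```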
RQ3: Which factors are determinants of students' academic success? During the development of this study, variables that play a significant role in students' academic success were found. Specifically, the variables ChangeDegree, RateApproval, Average, and Degree were determinants for the prediction model obtained. These findings are close to the results obtained by Alturki [34], who examined individual results from the third and fourth semesters, with accuracies of 63.33% (six variables) and 92.6% (nine variables), respectively; the influential variables were grade point average, number of credits taken, and academic assessment performance, applying feature selection for each academic semester. Similarly, Alyahyan [23] identified variables related to GPA and key subjects that detect student performance early enough. As detailed by Beaulac [39], a first group of variables was associated with undergraduate degree completion, whereas a second group was related to the type of major. In summary, first-year students opt for computer- and English-related subjects to reach their academic achievement, i.e., characteristics related to academic performance.
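For a quick check of which variables drive the fitted model, the importances of the XGBoost model from the earlier sketch can be inspected as follows; this is illustrative only, since the paper's determinant variables were identified through its own analysis.

```python
# Hedged sketch: rank the model's feature importances to compare with the
# reported determinants (ChangeDegree, RateApproval, Average, Degree).
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```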
Specically, data preprocessing provides as input an expedited dataset for classica-
tion algorithms to achieve an adequate predictive model. Although the results in the re-
viewed literature resemble ours, and these can be improved by inducing endogenous or
exogenous variables for the model to achieve more optimal results, the results can also be
improved by over-ing parameters in the algorithms. It is also worth mentioning that,
for example, Ismanto [82] obtained an RF prediction model with an accuracy higher than
90% without preprocessing the data, which resulted in a complex predictive model due
to its explainability. Therefore, even if the model obtains the highest accuracy, the predic-
tion bias can also be extended if the parameters are over-ed or the data preprocessing
phase is omied.
Kaushik [83] defined feature selection as increasing the quality of the data to facilitate better results, in line with the set of feature selection techniques proposed for educational data. What is applied in this paper fits Kaushik's perspective. It is important to anticipate early enough, and with good-quality characteristics, to take effective countermeasures and provide timely warnings to students so that they achieve academic success. In this way, the percentage of underachieving students can be reduced, and appropriate counseling and intervention can be provided to them by the college.
The results provide conclusive support for anticipating college completion [84–86], which is essential to assist students in the learning process and ensure their academic success. Taking advantage of the fact that predictions made early enough by machine learning can reveal possible difficulties or improvements from students' historical data, their effective use requires building specific strategies [84]. Consequently, the knowledge obtained from the data can be leveraged, for example, in constant monitoring or continuous tracking that acts as a tool to assess progress in academic performance, class attendance, extracurricular activities, and other key indicators [87]. Other strategies include personalized tutorial support or intervention plans, remediation, and other resources for students who have demonstrated compelling needs [88,89]. Machine learning, along with other data analysis techniques, offers valuable suggestions for targeted interventions for the benefit of students, with the goal of helping them achieve academic success in the shortest possible time. The results presented support the authenticity of the analyses performed, as the information is not based on mere coincidences but on real data. In this context, significance tests were performed using statistical methods such as the nonparametric Friedman and Wilcoxon tests, which are widely recognized for comparing the performance of machine learning algorithms [77,90,91]. Although these tests are not recommended as a comprehensive analysis on their own, due to the need to conform to other assumptions, some authors have deepened their analysis and proposed alternatives to these tests [92,93]. In summary, significance tests are essential for a solid and objective interpretation of the results obtained.
6. Conclusions
In response to the research questions, the effectiveness of the prediction model lies in the good practice followed in the data preprocessing phase; hence, obtaining an expeditious dataset is crucial. Unlike the methodologies reviewed in the literature, the methodology applied here avoided bias in the accuracy rates of the predictive model, as well as in the academic status (class). In fact, both the robust predictive model achieved with XGBoost and the simplified decision tree model proved to be effective. The simplified predictive model was able to detect students with high potential for academic success in seven out of ten cases, while the robust model detected them in eight out of ten cases. The simplification and explainability of the model were based on a set of rules obtained from the decision tree, making them understandable and allowing them to be provided to academic experts as suggestions for decision making. Overall, this study provides valuable information on the factors underlying college students' academic success expectations and highlights the importance of effective data preprocessing and model simplification techniques for making accurate, meaningful, and understandable predictions about college students' academic success.
7. Limitations
The main limitation of this work was the absence of variables that would allow consistent measurement by the classification algorithms of gender, scholarships, and financial aid, since it is important to evaluate equity and discrimination in the decisions made by the algorithms used to build the predictive model.
8. Future Work
Looking ahead, we intend to explore how the knowledge extracted in this work, and the university practices applied with it, can influence classroom management, with the aim of improving students' academic outcomes and reducing the disparity in educational opportunities. To this end, we propose studies related to (i) examining how predictive models can be personalized to the phenotype (characteristics) of the student body, using fuzzy logic to handle uncertainty flexibly and examining how a fuzzy model can manage the university context; (ii) designing early warning systems to intervene early and prevent failure or dropout; and (iii) other approaches, such as longitudinal studies, that support evaluation and effectiveness over time so that the models can be adjusted as needed.
Author Contributions: Individual contributions: J.H.G.-F. and J.C.: conceptualization, methodology, software, validation, formal analysis, research, writing—original draft preparation, writing—review and editing, visualization, supervision; J.G.-M.: resources, writing—review and editing, visualization, support, project administration. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: The work is supported as part of the UTEQ-FOCICYT IX-2023/29 project "Factors that affect the completion of time to degree and affect the retention of UTEQ students", approved by the twentieth resolution of the Honorable University Council dated 14 February 2023. The study maintains the confidentiality of the information as stipulated in the Organic Law for the Protection of Personal Data of the Republic of Ecuador, in addition to applying the "Code of Ethics for Officials and Servants Designated or Contracted by the Universidad Técnica Estatal de Quevedo", approved by the Honorable University Council on 6 September 2011. Therefore, the research group has declared this research approved for publication in any journal with the document CERT-ETHIC-001-2023.
Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset is not publicly available but can be obtained from the corresponding author upon reasonable request.
Acknowledgments: We would like to express our deep appreciation to the authorities of the IES for authorizing and allowing access to, exploration of, and analysis of the information, and especially for the support provided by the project "Factors that influence the completion of the time to degree and affect the retention of students at UTEQ", headed by Javier Guaña-Moya and Efraín Díaz Macías. The work is supported as part of the UTEQ-FOCICYT IX-2023/29 project.
Conicts of Interest: The authors declare no conicts of interest.
Appendix A
This section presents the information used in the work. The dataset consists of data such as degree program, class attendance, students' academic performance, and socioeconomic information; each variable is numerical or categorical as indicated.
Table A1. Description of the dataset used for the study.
Variable Names | Values | Description | Type
Faculty | 1–5 | Names of the faculties. | Categorical
Degree | 1–27 | Names of the university degrees. | Categorical
Sex | 1. Male, 2. Female | Sex of students. | Categorical
Age Entrance | 16–50 | Age at entrance to university. | Numeric
Support | 1. Public, 2. Private | Type of financial support of the high school where the student completed high school. | Categorical
Localization | 1. Local, 2. Outside of Quevedo, 3. Other Province | Geographical area of the school where the student finished high school. | Categorical
AveragePre | 0–10 | Average of the grades of the university leveling program (pre-university/admission/selectivity). | Numeric
AttendancePre | 0–100 | Pre-university attendance percentage. | Numeric
Average | 0–10 | Average of the subjects taught in the first year. | Numeric
Attendance | 0–100 | Average of the student's attendance percentage in all enrolled subjects; must meet the minimum attendance percentage of 70%. | Numeric
TimeApproval | 1–3 | Number of enrollments used by the student to pass the first course. | Numeric
RateApproval | 0–3 | Weighting of the effort in the exams to pass the subjects; the first exam (recovery) has a value of 0.25, while the second has a value of 0.75. | Numeric
CounterDegree | 0–2 | Number of college courses in which the student was enrolled. | Numeric
StructureFamily | 1. I am independent, 2. Only with mom, 3. Only with dad, 4. Both parents, 5. Couple, 6. Other relative | Variable associated with the student's family structure. | Categorical
Job | 1. Does not work, 2. Full time, 3. Part-time, 4. Part-time by the hour, 5. Occasionally | Student's work or occupational situation. | Categorical
Financing | 1. Family support, with 1 or 2 children studying; 2. Self-employed (own account); 3. Family support, with more than three children studying; 4. Loan, scholarship, or current credit | Student's economic means to pay for the academic year. | Categorical
Zone | 1. Outside of Quevedo, 2. Urban, 3. Slum, 4. Rural | Geographic district where the student lives. | Categorical
Income | 1. More than $400, 2. Between $399 and $200, 3. Between $199 and $100, 4. Less than or equal to $99 | Approximate monthly cash income of the family nucleus. | Categorical
Housing | 1. Own housing, 2. Rental, 3. Mortgaged, 4. Borrowed | Usufruct of the housing where the student and his family live. | Categorical
ChangeDegree | 1. Yes, 2. No | Whether the student changed degrees when repeating the first year. | Categorical
Class | 1. Dropout, 2. Change, 3. Pass | Student's academic status at the end of the university degree. | Categorical
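For readers reproducing the preprocessing, a minimal sketch of how the variables in Table A1 could be typed before modelling is given below; the column names and the file students.csv are assumptions based on this table, not the authors' actual export.

```python
# Hedged sketch: mark the Table A1 variables as categorical or numeric.
import pandas as pd

categorical = ["Faculty", "Degree", "Sex", "Support", "Localization",
               "StructureFamily", "Job", "Financing", "Zone", "Income",
               "Housing", "ChangeDegree", "Class"]
numeric = ["AgeEntrance", "AveragePre", "AttendancePre", "Average",
           "Attendance", "TimeApproval", "RateApproval", "CounterDegree"]

df = pd.read_csv("students.csv")                       # hypothetical dataset export
df[categorical] = df[categorical].astype("category")   # coded categories (1, 2, ...)
df[numeric] = df[numeric].apply(pd.to_numeric, errors="coerce")
```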
Appendix B
Table A2 presents the results of the metrics calculated for the group of classification algorithms. The results in this appendix correspond to complementary trainings, as six different balancing techniques were used to generate new datasets with which to train and achieve effective predictive models. Each technique applied balancing methods related to oversampling, undersampling, or combined balancing based on the SMOTE algorithm. The "EasyEnsemble" data balancing performed best among the methods and was presented in the Results section as part of the data supplied to the group of classification algorithms to obtain the predictive model.
Table A2. Performance results of the classification algorithms that trained and tested the predictive models using the new datasets constructed with the data balancing algorithms ("–" indicates a value not reported).

Balancing | Algorithm | Acc. | Kappa | Sensi. | Speci. | Preci. | Recall | F1 | AUC | LogLoss
SMOTE | XGBoost | 0.7878 | 0.6214 | 0.8472 | 0.8740 | 0.8237 | 0.8472 | 0.8318 | 0.8743 | 6.7890
SMOTE | RF | 0.7812 | 0.6118 | 0.8418 | 0.8720 | 0.8143 | 0.8418 | 0.8229 | 0.8671 | –
SMOTE | Gbm | 0.7723 | 0.5984 | 0.8446 | 0.8679 | 0.8105 | 0.8446 | 0.8205 | 0.8575 | 5.0183
SMOTE | Bagging | 0.7687 | 0.5887 | 0.8315 | 0.8633 | 0.8045 | 0.8315 | 0.8135 | 0.8546 | –
SMOTE | C4.5 | 0.7639 | 0.5771 | 0.8248 | 0.8577 | 0.8008 | 0.8248 | 0.8101 | 0.7936 | –
SMOTE | SvmPoly | 0.6999 | 0.4740 | 0.7819 | 0.8242 | 0.7618 | 0.7819 | 0.7631 | 0.7640 | 5.8172
SMOTE | SvmRadial | 0.6970 | 0.4681 | 0.7835 | 0.8215 | 0.7587 | 0.7835 | 0.7630 | 0.7649 | –
SMOTE | MLP | 0.6545 | 0.4190 | 0.7779 | 0.8087 | 0.7552 | 0.7779 | 0.7350 | 0.7512 | 5.0928
SMOTE | NaiveBayes | 0.6198 | 0.3802 | 0.7478 | 0.7990 | 0.7817 | 0.7478 | 0.6957 | 0.8040 | 5.0538
KMeans.SMOTE | XGBoost | 0.7956 | 0.6366 | 0.8565 | 0.8802 | 0.8262 | 0.8565 | 0.8367 | 0.8702 | 6.3079
KMeans.SMOTE | RF | 0.7794 | 0.6080 | 0.8420 | 0.8702 | 0.8125 | 0.8420 | 0.8226 | 0.8600 | –
KMeans.SMOTE | Gbm | 0.7693 | 0.5929 | 0.8396 | 0.8660 | 0.8060 | 0.8396 | 0.8160 | 0.8515 | 5.1901
KMeans.SMOTE | Bagging | 0.7681 | 0.5860 | 0.8228 | 0.8620 | 0.8049 | 0.8228 | 0.8103 | 0.8467 | –
KMeans.SMOTE | C4.5 | 0.7663 | 0.5828 | 0.8259 | 0.8605 | 0.8036 | 0.8259 | 0.8113 | 0.7979 | –
KMeans.SMOTE | SvmPoly | 0.6946 | 0.4613 | 0.7751 | 0.8187 | 0.7548 | 0.7751 | 0.7584 | 0.7616 | 5.8277
KMeans.SMOTE | SvmRadial | 0.6892 | 0.4499 | 0.7717 | 0.8139 | 0.7492 | 0.7717 | 0.7551 | 0.7591 | –
KMeans.SMOTE | MLP | 0.6712 | 0.4229 | 0.7703 | 0.8045 | 0.7353 | 0.7703 | 0.7452 | 0.7424 | 4.9865
KMeans.SMOTE | NaiveBayes | 0.6067 | 0.3644 | 0.7505 | 0.7933 | 0.7804 | 0.7505 | 0.6862 | 0.7970 | 5.0905
SMOTE.Tomek | XGBoost | 0.7914 | 0.6278 | 0.8474 | 0.8766 | 0.8241 | 0.8474 | 0.8320 | 0.8665 | 6.5445
SMOTE.Tomek | Bagging | 0.7747 | 0.5970 | 0.8269 | 0.8656 | 0.8090 | 0.8269 | 0.8148 | 0.8468 | –
SMOTE.Tomek | Gbm | 0.7741 | 0.6029 | 0.8430 | 0.8705 | 0.8137 | 0.8430 | 0.8199 | 0.8577 | 4.9046
SMOTE.Tomek | RF | 0.7717 | 0.5922 | 0.8295 | 0.8639 | 0.8050 | 0.8295 | 0.8139 | 0.8562 | –
SMOTE.Tomek | C4.5 | 0.7579 | 0.5639 | 0.8088 | 0.8526 | 0.7947 | 0.8088 | 0.8001 | 0.7623 | –
SMOTE.Tomek | SvmPoly | 0.6975 | 0.4722 | 0.7822 | 0.8242 | 0.7627 | 0.7822 | 0.7619 | 0.7634 | 5.7663
SMOTE.Tomek | SvmRadial | 0.6910 | 0.4579 | 0.7749 | 0.8182 | 0.7550 | 0.7749 | 0.7565 | 0.7633 | –
SMOTE.Tomek | MLP | 0.6724 | 0.4459 | 0.7885 | 0.8182 | 0.7631 | 0.7885 | 0.7483 | 0.7592 | 4.8305
SMOTE.Tomek | NaiveBayes | 0.6372 | 0.4053 | 0.7622 | 0.8079 | 0.7805 | 0.7622 | 0.7104 | 0.7959 | –
SMOTE.ENN | XGBoost | 0.7478 | 0.5573 | 0.8216 | 0.8542 | 0.7933 | 0.8216 | 0.7988 | 0.8335 | 5.9690
SMOTE.ENN | Gbm | 0.7406 | 0.5481 | 0.8239 | 0.8519 | 0.7923 | 0.8239 | 0.7965 | 0.8230 | 5.2251
SMOTE.ENN | RF | 0.7394 | 0.5492 | 0.8285 | 0.8534 | 0.7962 | 0.8285 | 0.7972 | 0.8192 | –
SMOTE.ENN | Bagging | 0.7352 | 0.5387 | 0.8179 | 0.8485 | 0.7908 | 0.8179 | 0.7926 | 0.8165 | –
SMOTE.ENN | C4.5 | 0.7310 | 0.5274 | 0.8109 | 0.8429 | 0.7828 | 0.8109 | 0.7888 | 0.7548 | –
SMOTE.ENN | SvmRadial | 0.6880 | 0.4590 | 0.7846 | 0.8198 | 0.7594 | 0.7846 | 0.7589 | 0.7511 | –
SMOTE.ENN | SvmPoly | 0.6880 | 0.4598 | 0.7809 | 0.8206 | 0.7608 | 0.7809 | 0.7568 | 0.7490 | 5.5081
SMOTE.ENN | MLP | 0.6665 | 0.4398 | 0.7875 | 0.8169 | 0.7661 | 0.7875 | 0.7436 | 0.7650 | 4.7577
SMOTE.ENN | NaiveBayes | 0.6186 | 0.3810 | 0.7615 | 0.7989 | 0.7428 | 0.7615 | 0.6858 | 0.7764 | 4.2434
RUS | Gbm | 0.7346 | 0.5346 | 0.8188 | 0.8455 | 0.7861 | 0.8188 | 0.7938 | 0.8102 | 5.1165
RUS | XGBoost | 0.7328 | 0.5344 | 0.8205 | 0.8466 | 0.7886 | 0.8205 | 0.7931 | 0.8173 | 4.4343
RUS | RF | 0.7304 | 0.5303 | 0.8187 | 0.8450 | 0.7868 | 0.8187 | 0.7914 | 0.8153 | –
RUS | Bagging | 0.7197 | 0.5062 | 0.8034 | 0.8346 | 0.7731 | 0.8034 | 0.7813 | 0.7962 | –
RUS | C4.5 | 0.6987 | 0.4954 | 0.8131 | 0.8387 | 0.7957 | 0.8131 | 0.7667 | 0.7764 | –
RUS | SvmRadial | 0.6629 | 0.4081 | 0.7634 | 0.7992 | 0.7249 | 0.7634 | 0.7368 | 0.7360 | –
RUS | SvmPoly | 0.6605 | 0.4040 | 0.7622 | 0.7976 | 0.7275 | 0.7622 | 0.7374 | 0.7348 | 4.3219
RUS | MLP | 0.6402 | 0.3871 | 0.7610 | 0.7950 | 0.7320 | 0.7610 | 0.7251 | 0.7304 | 2.9982
RUS | NaiveBayes | 0.6031 | 0.3575 | 0.7428 | 0.7907 | 0.7773 | 0.7428 | 0.6827 | 0.7841 | –
Appendix C
This table presents the results of the filtering of characteristics using the different methods proposed in the study. Each method, according to its nature, filtered the group of variables that best represented the data. The resulting group of variables was then evaluated with the C4.5 classification algorithm.
Table A3. Selection of characteristics used for the evaluation of the best group of variables. The best group of variables was selected by the ReliefFbestK algorithm ("–" indicates a value not reported).

Filter | Var. | Method | Value | Acc. | Kappa | Sensi. | Speci. | Preci. | Recall | F1
Roughset consistency | 11 | Las Vegas | 1.00 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56
Roughset consistency | 9 | SelectKBest | 0.02 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Roughset consistency | 8 | HillClimbing | 1.00 | 0.62 | 0.21 | 0.41 | 0.73 | – | 0.41 | –
Roughset consistency | 9 | Sequential Backward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Roughset consistency | 9 | Sequential Floating Forward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Roughset consistency | 10 | Genetic Algorithm | 1.00 | 0.67 | 0.34 | 0.47 | 0.78 | – | 0.47 | –
Roughset consistency | 20 | AntColony | 1.00 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Determination coefficient | 13 | Las Vegas | 0.48 | 0.72 | 0.46 | 0.60 | 0.82 | 0.70 | 0.60 | 0.62
Determination coefficient | 9 | SelectKBest | 0.06 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Determination coefficient | 20 | HillClimbing | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Determination coefficient | 20 | Sequential Backward | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Determination coefficient | 20 | Sequential Floating Forward | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Determination coefficient | 20 | Genetic Algorithm | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Determination coefficient | 20 | AntColony | 0.48 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Gini index | 11 | Las Vegas | 1.00 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56
Gini index | 9 | SelectKBest | 0.51 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Gini index | 8 | HillClimbing | 1.00 | 0.62 | 0.21 | 0.41 | 0.73 | – | 0.41 | –
Gini index | 9 | Sequential Backward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Gini index | 9 | Sequential Floating Forward | 1.00 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Gini index | 11 | Genetic Algorithm | 1.00 | 0.62 | 0.21 | 0.41 | 0.73 | – | 0.41 | –
Gini index | 20 | AntColony | 1.00 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Mutual information | 12 | Las Vegas | 1.27 | 0.72 | 0.46 | 0.56 | 0.82 | 0.69 | 0.56 | 0.58
Mutual information | 9 | SelectKBest | 0.16 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Mutual information | 6 | HillClimbing | 1.27 | 0.58 | 0.09 | 0.37 | 0.69 | – | 0.37 | –
Mutual information | 8 | Sequential Backward | 1.27 | 0.62 | 0.21 | 0.41 | 0.73 | – | 0.41 | –
Mutual information | 8 | Sequential Floating Forward | 1.27 | 0.62 | 0.21 | 0.41 | 0.73 | – | 0.41 | –
Mutual information | 4 | Genetic Algorithm | 1.27 | 0.67 | 0.34 | 0.47 | 0.78 | – | 0.47 | –
Mutual information | 20 | AntColony | 1.27 | 0.78 | 0.60 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82
Gain ratio | 7 | Las Vegas | 0.10 | 0.59 | 0.15 | 0.39 | 0.71 | – | 0.39 | –
Gain ratio | 9 | SelectKBest | 0.13 | 0.67 | 0.36 | 0.47 | 0.78 | – | 0.47 | –
Gain ratio | 7 | HillClimbing | 0.10 | 0.59 | 0.15 | 0.39 | 0.71 | – | 0.39 | –
Gain ratio | 11 | Sequential Backward | 0.10 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56
Gain ratio | 11 | Sequential Floating Forward | 0.10 | 0.68 | 0.39 | 0.53 | 0.79 | 0.65 | 0.53 | 0.56
Gain ratio | 1 | Genetic Algorithm | 0.10 | 0.59 | 0.15 | 0.39 | 0.71 | – | 0.39 | –
Gain ratio | 19 | AntColony | 0.10 | 0.72 | 0.48 | 0.60 | 0.82 | 0.71 | 0.60 | 0.62
Appendix D
This section presents the results of the nonparametric Friedman and Wilcoxon tests performed. For this purpose, the value of the AUC metric was used, and the calculation was performed with the R statistical program. Table A4 presents the values obtained from the Friedman test. Table A5 presents the matrix of the Wilcoxon test results, with the Z-value on the left and the p-value on the right of each cell. Table A6 is the matrix used for the calculation of the tests.
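The same tests can be reproduced outside R; the sketch below uses SciPy on a subset of the AUC values from Table A6 (three classifiers shown for brevity) and is illustrative rather than the exact script used in the study.

```python
# Hedged sketch: Friedman test across classifiers, then pairwise Wilcoxon
# signed-rank tests, using AUC values transcribed from Table A6.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

auc = {  # one value per dataset (RawData ... RUS), per classifier
    "XGBoost": [0.8997, 0.8775, 0.8743, 0.8702, 0.8665, 0.8335, 0.8173],
    "RF":      [0.8978, 0.8744, 0.8671, 0.8600, 0.8562, 0.8192, 0.8153],
    "Gbm":     [0.8930, 0.8606, 0.8575, 0.8515, 0.8577, 0.8230, 0.8102],
}

stat, p = friedmanchisquare(*auc.values())
print(f"Friedman: statistic={stat:.3f}, p={p:.4f}")

for a, b in combinations(auc, 2):
    w, pw = wilcoxon(auc[a], auc[b])
    print(f"Wilcoxon {a} vs {b}: p={pw:.3f}")
```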
Table A4. Average rankings of the algorithms.

Algorithm | Ranking
XGBoost | 0.9999999999999998
RandomForest | 2.2857142857142856
Gbm | 2.714285714285714
Bagging | 3.999999999999999
C4.5 | 5.999999999999999
NaiveBayes | 5.428571428571429
SvmRadial | 7.428571428571429
SvmPoly | 7.571428571428571
MLP | 8.571428571428571

Friedman statistic considering reduction performance (distributed according to chi-square with 8 degrees of freedom): 52.3047619047619. p-value computed by the Friedman test: 1.474479383034577 × 10^−8.
Table A5. Z score and significance from the Wilcoxon test (Z/p-value within table).

Algorithms | XGBoost | RF c | Gbm | Bagging | C4.5 | NaiveBayes | SvmRadial | SvmPoly
RF | −2.366 a/0.018 | - | | | | | |
Gbm | −2.366 a/0.018 | −1.859 a/0.063 * | - | | | | |
Bagging | −2.371 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | - | | | |
C4.5 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | - | | |
NaiveBayes | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −0.507 b/0.612 * | - | |
SvmRadial | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.197 a/0.028 | −2.366 a/0.018 | - |
SvmPoly | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.197 a/0.028 | −2.366 a/0.018 | −0.845 a/0.398 * | -
MLP | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.366 a/0.018 | −2.028 a/0.043 | −2.366 a/0.018 | −1.690 a/0.091 * | −1.521 a/0.128 *

a Based on positive rankings. b Based on negative rankings. c Random Forest. * Reject the null hypothesis.
Table A6. AUC values obtained with the different classifiers and datasets.

DataSet | XGBoost | RF | Gbm | Bagging | C45 | NaiveBayes | SvmRadial | SvmPoly | MLP
RawData | 0.8997 | 0.8978 | 0.8930 | 0.8781 | 0.8308 | 0.8168 | 0.7973 | 0.7754 | 0.7621
EasyEnsemble | 0.8775 | 0.8744 | 0.8606 | 0.8591 | 0.8249 | 0.8018 | 0.7676 | 0.7679 | 0.7446
SMOTE | 0.8743 | 0.8671 | 0.8575 | 0.8546 | 0.7936 | 0.8040 | 0.7649 | 0.7640 | 0.7512
KmeansSMOTE | 0.8702 | 0.8600 | 0.8515 | 0.8467 | 0.7979 | 0.7970 | 0.7591 | 0.7616 | 0.7424
SMOTETomek | 0.8665 | 0.8562 | 0.8577 | 0.8468 | 0.7623 | 0.7959 | 0.7633 | 0.7634 | 0.7592
SMOTEENN | 0.8335 | 0.8192 | 0.8230 | 0.8165 | 0.7548 | 0.7764 | 0.7511 | 0.7490 | 0.7650
RUS | 0.8173 | 0.8153 | 0.8102 | 0.7962 | 0.7764 | 0.7841 | 0.7360 | 0.7348 | 0.7304
References
1. Realinho, V.; Machado, J.; Baptista, L.; Martins, M.V. Predicting Student Dropout and Academic Success. Data 2022, 7, 146.
https://doi.org/10.3390/data7110146.
2. Ortiz-Lozano, J.M.; Rua-Vieites, A.; Bilbao-Calabuig, P.; Casadesús-Fa, M. University student retention: Best time and data to
identify undergraduate students at risk of dropout. Innov. Educ. Teach. Int. 2018, 57, 7485.
https://doi.org/10.1080/14703297.2018.1502090.
3. Urbina-Nájera, A.B.; Téllez-Velázquez, A.; Barbosa, R.C. Paerns to Identify Dropout University Students with Educational
Data Mining. Rev. Electron. De Investig. Educ. 2021, 23,1-15. hps://doi.org/10.24320/REDIE.2021.23.E29.3918.
4. Lopes Filho, J.A.B.; Silveira, I.F. Early detection of students at dropout risk using administrative data and machine learning. RISTI Rev. Iber. De Sist. E Tecnol. De Inf. 2021, 40, 480–495.
5. Guanin-Fajardo, J.H.; Barranquero, J.C. Contexto universitario, profesores y estudiantes: Vínculos y éxito académico. Rev.
Iberoam. De Educ. 2022, 88, 127146. https://doi.org/10.35362/rie8814733.
6. Zeineddine, H.; Braendle, U.; Farah, A. Enhancing prediction of student success: Automated machine learning approach. Com-
put. Electr. Eng. 2020, 89, 106903. https://doi.org/10.1016/j.compeleceng.2020.106903.
7. Guerrero-Higueras, M.; Llamas, C.F.; González, L.S.; Fernández, A.G.; Costales, G.E.; González, M..C. Academic Success As-
sessment through Version Control Systems. Appl. Sci. 2020, 10, 1492. https://doi.org/10.3390/app10041492.
8. Rafik, M. Artificial Intelligence and the Changing Roles in the Field of Higher Education and Scientific Research. In Artificial
Intelligence in Higher Education and Scientific Research. Bridging Human and Machine: Future Education with Intelligence; Springer:
Singapore, 2023; pp. 3546. https://doi.org/10.1007/978-981-19-8641-3_3.
9. BOE. BOE-A-2023-7500 Ley Orgánica 2/2023, de 22 de marzo, del Sistema Universitario. 2023. Available online:
https://www.boe.es/buscar/act.php?id=BOE-A-2023-7500 (accessed on 23 March 2024).
10. Guney, Y. Exogenous and endogenous factors influencing students’ performance in undergraduate accounting modules. Ac-
count. Educ. 2009, 18, 5173. https://doi.org/10.1080/09639280701740142.
11. Tamada, M.M.; Giusti, R.; Netto, J.F.d.M. Predicting Students at Risk of Dropout in Technical Course Using LMS Logs. Electron-
ics 2022, 11, 468. https://doi.org/10.3390/electronics11030468.
12. Contini, D.; Cugnata, F.; Scagni, A. Social selection in higher education. Enrolment, dropout and timely degree attainment in
Italy. High. Educ. 2017, 75, 785808. https://doi.org/10.1007/s10734-017-0170-9.
13. Costa, E.B.; Fonseca, B.; Santana, M.A.; De Araújo, F.F.; Rego, J. Evaluating the effectiveness of educational data mining tech-
niques for early prediction of students' academic failure in introductory programming courses. Comput. Hum. Behav. 2017, 73,
247256. https://doi.org/10.1016/j.chb.2017.01.047.
14. Márquez-Vera, C.; Cano, A.; Romero, C.; Noaman, A.Y.M.; Fardoun, H.M.; Ventura, S. Early dropout prediction using data
mining: A case study with high school students. Expert Syst. 2015, 33, 107124. https://doi.org/10.1111/exsy.12135.
15. Fernández, A.; del Río, S.; Chawla, N.V.; Herrera, F. An insight into imbalanced Big Data classification: Outcomes and chal-
lenges. Complex Intell. Syst. 2017, 3, 105120. https://doi.org/10.1007/s40747-017-0037-9.
16. Rodríguez-Hernández, C.F.; Musso, M.; Kyndt, E.; Cascallar, E. Artificial neural networks in academic performance prediction:
Systematic implementation and predictor evaluation. Comput. Educ. Artif. Intell. 2021, 2, 100018.
https://doi.org/10.1016/j.caeai.2021.100018.
17. Contreras, L.E.; Fuentes, H.J.; Rodríguez, J.I. Academic performance prediction by machine learning as a success/failure indi-
cator for engineering students. Form. Univ. 2020, 13, 233246.
18. Hassan, H.; Anuar, S.; Ahmad, N.B.; Selamat, A. Improve student performance prediction using ensemble model for higher
education. In Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2019; Volume 318, pp.
217230.
19. Bolón-Canedo, V.; Alonso-Betanzos, A. Ensembles for feature selection: A review and future trends. Inf. Fusion 2018, 52, 112.
https://doi.org/10.1016/j.inffus.2018.11.008.
20. Meghji, A.F.; Mahoto, N.A.; Unar, M.A.; Shaikh, M.A. The role of knowledge management and data mining in improving edu-
cational practices and the learning infrastructure. Mehran Univ. Res. J. Eng. Technol. 2020, 39, 310323.
https://doi.org/10.22581/muet1982.2002.08.
21. Crivei, L.; Czibula, G.; Ciubotariu, G.; Dindelegan, M. Unsupervised learning based mining of academic data sets for students’
performance analysis. In Proceedings of the SACI 2020IEEE 14th International Symposium on Applied Computational In-
telligence and Informatics, Proceedings, Timisoara, Romania, 2123 May 2020; Volume 17, pp. 1116.
22. Guanin-Fajardo, J.; Casillas, J.; Chiriboga-Casanova, W. Semisupervised learning to discover the average scale of graduation of
university students. Rev. Conrado 2019, 15, 291299.
23. Alyahyan, E.; Düşteargör, D. Decision trees for very early prediction of student’s achievement, In Proceedings of the 2020 2nd
International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 315 October 2020; pp. 17.
24. Alwarthan, S.; Aslam, N.; Khan, I.U. An Explainable Model for Identifying At-Risk Student at Higher Education. IEEE Access
2022, 10, 107649107668. https://doi.org/10.1109/access.2022.3211070.
25. Adekitan, A.I.; Noma-Osaghae, E. Data mining approach to predicting the performance of first year student in a university
using the admission requirements. Educ. Inf. Technol. 2018, 24, 15271543. https://doi.org/10.1007/s10639-018-9839-7.
26. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Pro-
ceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 24 August
1996; pp. 8288.
27. Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods
and applications. Expert Syst. Appl. 2017, 73, 220239. https://doi.org/10.1016/j.eswa.2016.12.035.
28. Chawla, N.; Bowyer, K. SMOTE: Synthetic Minority Over-sampling Technique Nitesh. J. Artif. Intell. Res. 2002, 16, 321357.
29. Bertolini, R.; Finch, S.J.; Nehm, R.H. Enhancing data pipelines for forecasting student performance: Integrating feature selection
with crossvalidation. Int. J. Educ. Technol. High. Educ. 2021, 18, 44.
30. Febro, J.D. Utilizing Feature Selection in Identifying Predicting Factors of Student Retention. Int. J. Adv. Comput. Sci. Appl. 2019,
10, 269274. https://doi.org/10.14569/ijacsa.2019.0100934.
31. Ghaemi, M.; Feizi-Derakhshi, M.-R. Feature selection using Forest Optimization Algorithm. Pattern Recognit. 2016, 60, 121129.
https://doi.org/10.1016/j.patcog.2016.05.012.
32. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing:
Vienna, Austria, 2020.
33. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg,
V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 28252830.
34. Alturki, S.; Alturki, N.; Stuckenschmidt, H. Using Educational Data Mining to Predict Students’ Academic Performance For
Applying Early Interventions. J. Inf. Technol. Educ. JITE. Innov. Pract. IIP 2021, 20, 121137.
35. Fernández-García, A.J.; Rodríguez-Echeverría, R.; Preciado, J.C.; Manzano, J.M.C.; Sánchez-Figueroa, F. Creating a recom-
mender system to support higher education students in the subject enrollment decisión. IEEE Access 2020, 8, 189069189088.
36. Helal, S.; Li, J.; Liu, L.; Ebrahimie, E.; Dawson, S.; Murray, D.J.; Long, Q. Predicting academic performance by considering
student heterogeneity. Knowl.-Based Syst. 2018, 161, 134146. https://doi.org/10.1016/j.knosys.2018.07.042.
37. Yağci, M. Educational data mining: Prediction of students’ academic performance using machine learning algorithms. Smart
Learn. Environ. 2022, 9, 119.
38. Gil, P.D.; Martins, S.d.C.; Moro, S.; Costa, J.M. A data-driven approach to predict first-year students’ academic success in higher
education institutions. Educ. Inf. Technol. 2020, 26, 21652190. https://doi.org/10.1007/s10639-020-10346-6.
39. Beaulac, C.; Rosenthal, J.S. Predicting University Students’ Academic Success and Major Using Random Forests. Res. High. Educ.
2019, 60, 10481064. https://doi.org/10.1007/s11162-019-09546-y.
40. Fernandes, E.R.; de Carvalho, A.C. Evolutionary inversion of class distribution in overlapping areas for multiclass imbalanced
learning. Inf. Sci. 2019, 494, 141154.
41. Morales, P.; Luengo, J.; García, L.P.F.; Lorena, A.C.; de Carvalho, A.C.P.L.F.; Herrera, F.; Ciencias, I.D.; Paulo, U.D.S.; Av, T.S.-
C.; Carlos, S.; et al. Noisefiltersr the noise-filtersr package. R J. 2017, 9, 18.
42. Zeng, X.; Martinez, T. A noise filtering method using neural networks, In Proceedings of the IEEE International Workshop on
Soft Computing Techniques in Instrumentation and Measurement and Related Applications (SCIMA2003), Provo, UT, USA, 17
May 2003; pp. 2631.
43. Verbaeten, S.; Assche, A. Ensemble methods for noise elimination in classification problems. In Multiple Classifier Systems. MCS
2003; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003; pp. 317325.
44. Ali, A.; Jayaraman, R.; Azar, E.; Maalouf, M. A comparative analysis of machine learning and statistical methods for evaluating
building performance: A systematic review and future benchmarking framework. J. Affect. Disord. 2024, 252, 111268.
https://doi.org/10.1016/j.buildenv.2024.111268.
45. Rajula, H.S.R.; Verlato, G.; Manchia, M.; Antonucci, N.; Fanos, V. Comparison of Conventional Statistical Methods with Machine
Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina 2020, 56, 455. https://doi.org/10.3390/medic-
ina56090455.
46. García, S.; Luengo, J.; Herrera, F. Tutorial on practical tips of the most influential data preprocessing algo-rithms in data mining.
Knowl.-Based Syst. 2016, 98, 129.
47. Cruz RM, O.; Sabourin, R.; Cavalcanti GD, C. Dynamic classifier selection: Recent advances and perspectives. Inf. Fusion 2018,
41, 195216.
48. Yadav, S.K.; Pal, S. Data Mining: A Prediction for Performance Improvement of Engineering Students using Classification. arXiv
2012, arXiv:1203.3832. https://doi.org/10.48550/arXiv.1203.3832.
49. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997,
30, 11451159. https://doi.org/10.1016/s0031-3203(96)00142-2.
50. Nájera, A.B.U.; de la Calleja, J.; Medina, M.A. Associating students and teachers for tutoring in higher education using clustering
and data mining. Comput. Appl. Eng. Educ. 2017, 25, 823832. https://doi.org/10.1002/cae.21839.
51. Kononenko, I. Estimating Attributes: Analysis and Extensions of RELIEF. In European Conference on Machine Learning; Springer:
Berlin/Heidelberg, Germany, 1994; pp. 171182.
52. Liu, H.; Setiono, R. Feature selection and classification: A probabilistic wrapper approach. In Proceedings of the 9th Interna-
tional Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEAAIE´96), Fuku-
oka, Japan, 47 June 1996; pp.419424.
53. Zhu, Z.; Ong, Y.-S.; Dash, M. WrapperFilter Feature Selection Algorithm Using a Memetic Framework. IEEE Trans. Syst. Man
Cybern. Part B 2007, 37, 7076. https://doi.org/10.1109/tsmcb.2006.883267.
54. Liu, H.; Yu, L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng.
2005, 17, 491502. https://doi.org/10.1109/tkde.2005.66.
55. Batista, G.E.A.P.A.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell.
2003, 17, 519533. https://doi.org/10.1080/713827181.
56. Kira, K.; Rendell, L. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the AAAI'92:
Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, USA, 1216 July 1992; pp. 129134.
57. Qian, W.; Shu, W. Mutual information criterion for feature selection from incomplete data. Neurocomputing 2015, 168, 210220.
https://doi.org/10.1016/j.neucom.2015.05.105.
58. Sheinvald, J.; Dom, B.; Niblack, W. A modeling approach to feature selection. In Proceedings of the 10th International Confer-
ence on Pattern Recognition, Atlantic City, NJ, USA, 1621 June 1990; Volume i, pp. 535539.
59. Coefficient of Determination. In The Concise Encyclopedia of Statistics; Springer: New York, NY, USA, 2008; pp. 8891.
https://doi.org/10.1007/978-0-387-32833-1_62.
60. Quinlan, J. Induction of decision trees. Mach. Learn. 1986, 1, 81106.
61. Ceriani, L.; Verme, P. The origins of the Gini index: Extracts from Variabilità e Mutabilità (1912) by Corrado Gini. J. Econ. Inequal.
2011, 10, 421443. https://doi.org/10.1007/s10888-011-9188-x.
62. Pawlak, Z. Imprecise Categories, Approximations and Rough Sets, Springer: Dordrecht, The Netherlands, 1991; Volume 19, pp. 9
32.
63. Wang, D.; Zhang, Z.; Bai, R.; Mao, Y. A hybrid system with filter approach and multiple population genetic algorithm for feature
selection in credit scoring. J. Comput. Appl. Math. 2018, 329, 307321. https://doi.org/10.1016/j.cam.2017.04.036.
64. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and
SMOTE. Inf. Sci. 2018, 465, 120. https://doi.org/10.1016/j.ins.2018.06.056.
65. Batista, G.E.; Bazzan, A.L.; Monard, M.C. Balancing training data for automated annotation of keywords: A case study. WOB
2003, 3, 1018.
66. Ivan, T. Two modifications of cnn. IEEE Trans. Syst. Man Commun. SMC 1976, 6, 769772.
67. Liu, X.-Y.; Wu, J.; Zhou, Z.-H. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans. Syst. Man Cybern. Part B
2008, 39, 539550. https://doi.org/10.1109/tsmcb.2008.2007853.
68. Hearst, M.A. Support vector machines. IEEE Intell. Syst. 1998, 13, 1828.
69. Almeida, L.B. C1. 2 multilayer perceptrons. In Handbook of Neural Computation; Oxford University Press: New York, NY, USA,
1997; pp. 130.
70. Breiman, L. Random forests. Ensemble Mach. Learn. 2001, 45, 532.
71. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 11891232.
https://doi.org/10.1214/aos/1013203451.
72. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 1317 August 2016; pp. 785794.
73. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123140.
74. Webb, G.I. Naïve Bayes. Encycl. Mach. Learn. 2010, 15, 713714.
75. Shetu, S.F.; Saifuzzaman, M.; Moon, N.N.; Sultana, S.; Yousuf, R. Student’s performance prediction using data mining technique
depending on overall academic status and environmental attributes. In Advances in Intelligent Systems and Computing; Springer:
Berlin/Heidelberg, Germany, 2021; Volume 1166, pp. 757769.
76. Fisher, R.A. The Design of Experiments; Oliver & Boyd: Thomas Oliver, NY, USA, 1935.
77. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 130. Available online:
http://jmlr.org/papers/v7/demsar06a.html (accessed on 9 April 2024).
78. Cohen, J. The earth is round (p < 0.05). Am. Psychol. 1994, 49, 997–1003. https://doi.org/10.1037/0003-066X.49.12.997.
79. Schmidt, F.L. Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers.
Psychol. Methods 1996, 1, 115129.
80. Harlow, L.L.; Mulaik, S.A.; Steiger, J.H. (Eds.) Multivariate Applications Book Series. In What If There Were No Significance Tests?
Lawrence Erlbaum Associates Publishers: Mahwah, NJ, USA, 1997.
81. Al-Fairouz, E.I.; Al-Hagery, M.A. Students Performance: From Detection of Failures and Anomaly Cases to the Solutions-Based
Mining Algorithms. Int. J. Eng. Res. Technol. 2020, 13, 28952908. https://doi.org/10.37624/ijert/13.10.2020.2895-2908.
82. Ismanto, E.; Ghani, H.A.; Saleh, N.I.B.M. A comparative study of machine learning algorithms for virtual learning environment
performance prediction. IAES Int. J. Artif. Intell. 2023, 12, 16771686. https://doi.org/10.11591/ijai.v12.i4.pp1677-1686.
83. Kaushik, Y.; Dixit, M.; Sharma, N.; Garg, M.. Feature Selection Using Ensemble Techniques. In Futuristic Trends in Network and
Communication Technologies; FTNCT 2020. Communications in Computer and Information Science; Springer: Singapore, 2021;
Volume 1395, pp. 288298. https://doi.org/10.1007/978-981-16-1480-4_25.
84. Mayer, A.-K.; Krampen, G. Information literacy as a key to academic success: Results from a longitudinal study. Commun. Com-
put. Inf. Sci. 2016, 676, 598607. https://doi.org/10.1007/978-3-319-52162-6_59.
85. Harackiewicz, J.M.; Barron, K.E.; Tauer, J.M.; Elliot, A.J. Predicting success in college: A longitudinal study of achievement goals
and ability measures as predictors of interest and performance from freshman year through graduation. J. Educ. Psychol. 2002,
94, 562575. https://doi.org/10.1037/0022-0663.94.3.562.
86. Meier, Y.; Xu, J.; Atan, O.; van der Schaar, M. Predicting Grades. IEEE Trans. Signal Process. 2015, 64, 959972.
https://doi.org/10.1109/tsp.2015.2496278.
87. Lord, S.M.; Ohland, M.W.; Orr, M.K.; Layton, R.A.; Long, R.A.; Brawner, C.E.; Ebrahiminejad, H.; Martin, B.A.; Ricco, G.D.;
Zahedi, L. MIDFIELD: A Resource for Longitudinal Student Record Research. IEEE Trans. Educ. 2022, 65, 245256.
https://doi.org/10.1109/te.2021.3137086.
88. Tompsett, J.; Knoester, C. Family socioeconomic status and college attendance: A consideration of individual-level and school-
level pathways. PLoS ONE 2023, 18, e0284188. https://doi.org/10.1371/journal.pone.0284188.
89. Ma, Y.; Cui, C.; Nie, X.; Yang, G.; Shaheed, K.; Yin, Y. Pre-course student performance prediction with multi-instance multi-
label learning. Sci. China Inf. Sci. 2018, 62, 29101. https://doi.org/10.1007/s11432-017-9371-y.
90. Berrar, D. Confidence curves: An alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn.
2017, 106, 911949. https://doi.org/10.1007/S10994-016-5612-6/FIGURES/12.
91. Berrar, D.; Lozano, J.A. Significance tests or confidence intervals: Which are preferable for the comparison of classifiers? J. Exp.
Theor. Artif. Intell. 2013, 25, 189206. https://doi.org/10.1080/0952813x.2012.680252.
92. García, S.; Herrera, F. An Extension on Statistical Comparisons of Classifiers over Multiple Data Sets for all Pairwise Com-
parisons. J. Mach. Learn. Res. 2008, 9, 26772694.
93. Biju, V.G.; Prashanth, C. Friedman and Wilcoxon Evaluations Comparing SVM, Bagging, Boosting, K-NN and Decision Tree
Classifiers. J. Appl. Comput. Sci. Methods 2017, 9, 2347. https://doi.org/10.1515/jacsm-2017-0002.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.