Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
Vol. 10, No. 2, June 2022, pp. 375~384
ISSN: 2089-3272, DOI: 10.52549/ijeei.v10i2.3730
Journal homepage: http://section.iaesonline.com/index.php/IJEEI/index
The Effect of Using Data Pre-Processing by Imputations in
Handling Missing Values
Abdelrahman Elsharif Karrar
College of Computer Science and Engineering, Taibah University, Saudi Arabia
Article Info
ABSTRACT
Article history:
Received Feb 19, 2022
Revised Apr 4, 2022
Accepted Apr 8, 2022
The evolution of big data analytics through machine learning and artificial
intelligence techniques has caused organizations in a wide range of sectors
including health, manufacturing, e-commerce, governance, and social welfare
to realize the value of massive volumes of data accumulating on web-based
repositories daily. This has led to the adoption of data-driven decision models;
for example, through sentiment analysis in marketing where producers leverage
customer feedback and reviews to develop customer-oriented products.
However, the data generated in real-world activities is subject to errors
resulting from inaccurate measurements or faulty input devices, which may
result in the loss of some values. Missing attribute/variable values make data
unsuitable for decision analytics due to noises and inconsistencies that create
bias. The objective of this paper was to explore the problem of missing data
and develop an advanced imputation model based on Machine Learning and
implemented on K-Nearest Neighbor (KNN) algorithm in R programming
language as an approach to handle missing values. The methodology used in
this paper relied on applying advanced machine learning algorithms with
high-level accuracy in pattern detection and predictive analytics to the existing
imputation techniques, which handle missing values by random replacement
or deletion. According to the results, the advanced imputation technique based
on machine learning models replaced missing values from a dataset with
89.5% accuracy. The experimental results showed that pre-processing by
imputation delivers high-level performance efficiency in handling missing
data values. These findings are consistent with the key idea of the paper, which is
to explore alternative imputation techniques for handling missing values to
improve the accuracy and reliability of decision insights extracted from
datasets.
Keyword:
Data Pre-Processing
Imputation Model
Machine Learning
k-Nearest Neighbor Algorithm
Missing Values
Copyright © 2022 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Abdelrahman Elsharif Karrar
College of Computer Science and Engineering, Taibah University, Saudi Arabia
Email: akarrar@taibahu.edu.sa
1. INTRODUCTION
Technological growth over the recent years has caused unprecedented growth in information and
communication capabilities and mobility, especially in the field of Artificial Intelligence, which enables smart
devices to perform automated functions even beyond human ability. Computer systems embedded with
advanced capabilities such as automation and predictive analytics have been adopted in a wide range of areas
such as transportation, manufacturing, healthcare, marketing, finance, and robotics. These applications
generate huge quantities of data, which presents significant growth opportunities in the field of Machine
Learning [1]. Machine learning is a subset of artificial intelligence in which specialized algorithms are designed
to analyze and extract insights from unstructured datasets. Machine learning utilizes mathematical models to
discover trends and make data-driven predictions concerning the variables under study. According to [2],
predictive modelling refers to the formulation of computerized algorithms to perform data analytics operations
and make forecasts about a problem by regression or classification.
The data generated from the devices and applications with AI and ML capabilities are stored in
dynamic servers and databases, which require routine updating and upgrading due to the complex nature of the
data and sources. Unstructured data from certain sources are subject to errors due to the possibility of storing
null values, which are also analyzed during decision-making processes. Data may also contain inconsistencies
and discrepancies, and noises, which may lead to unreliable output. Therefore, data cleansing is an essential
aspect of database management to minimize the risks of bias resulting from the storage inconsistencies,
especially for large data sets [3], [4]. In the process of data analytics by machine learning, shortage of data is
the most prevalent challenge faced in developing predictive models.
Mean imputation is one of the simplest methods for addressing the problem of missing data values in advanced analytics systems, although it tends to weaken the strength of association between variables, whereas regression imputation can inflate the strength of association and introduce ambiguity into the data values. Imputation under the Missing Completely at Random (MCAR) assumption, for example based on the K-Nearest Neighbors (K-NN) algorithm [5], is generally applied to the replacement of missing data values, although naive replacement increases the likelihood of misleading or incomplete results. Imputation techniques correct data inconsistencies by replacing the missing values with the mean of the observed values or the last observed value. The relative advantage of imputation based on ML algorithms over other existing methods for handling missing values [6] is that the replacement values rely on a combination of computational, mathematical, and statistical models rather than random parameters from the dataset. This makes it a comparatively more reliable, accurate, and applicable approach for data-driven decision-making, especially in high-precision fields such as manufacturing and healthcare.
1.1. Problem Statement
Structured and unstructured data collected from digital devices and computer systems in their real-
world applications may have missing values, noise, and other inconsistencies. Subjecting such datasets to
analytics increases the risks of inaccurate output leading to biased decisions and erroneous insights, which may
have adverse implications for the enterprise. The existing techniques for handling missing values by imputation rely on random sampling to estimate and replace the missing values, making them potentially inaccurate and unreliable, especially when applied to data-driven decision-making in high-precision fields such as medical surgery and manufacturing. Therefore, there is a need to adopt effective approaches based on computational, mathematical, and statistical models to address the problem of missing values in large datasets. This paper sought to apply the imputation technique to pre-process and classify data with missing values to reduce the risks of inconsistency or errors in decision-making processes.
1.2. Data Mining
Data mining [7] refers to a process in which valuable information is extracted from unstructured or
structured datasets stored in databases or cloud platforms. According to [8], data mining entails the extraction, classification, and transfer of various data through a series of processes including data cleaning, standardization, and testing. The process of data mining entails a sequence of steps, shown in Figure 1, which include the following:
1. Cleaning the data to remove inconsistencies and noise.
2. Integration to combine data from overlapping sources.
3. Selecting and retrieving data sets that are appropriate for analysis from the database.
4. Extracting useful data patterns from the database.
5. Identifying and evaluating the variable patterns representing knowledge based on the set parameters.
6. Presenting and representing knowledge through visualization and presentation techniques such as schemas [9].
Figure 1. Steps Involved in the Process of Data Mining
1.3. Pre-processing
Data pre-processing [10] entails a series of preparations aimed at converting the data into a format
that is easier to analyze, since most of the information collected from day-to-day activities is largely unstructured
and may be difficult to analyze due to missing values [11]. Pre-processing raw data involves various processes
including cleansing, integration, transformation, and reduction as illustrated in Figure 2;
Figure 2. Stages of Data Pre-processing
The objective of data cleaning is to identify and complete null values, fix incoherence, and
standardizing outliers to reduce noise. When the data sets are imported from storage resources such as servers
and databases, the first step of data cleaning [12] is merging the datasets with related information, which are
then rebuilt to recreate missing values and de-duplicated to normalize the sets. The next step is data verification
and enrichment, which entail confirming the validity and making further improvements to prepare the data for
analysis and decision making.
1.4. Missing Values
Missing Values (MVs) in machine learning refer to attribute values that are absent from a dataset because of errors arising in the input process, such as improper measurements or device failure [13]. A missing-value mechanism is used to determine whether a correlational link exists between the missing values and the other variables in a dataset. For instance, let X = (a, b) represent a dataset such that a denotes the observed values of X and b the missing values, and let Y be an indicator variable taking the value 1 when a value of X is observed and 0 when it is missing. The missingness mechanism can then be expressed as a model P(Y|X, Ø), where Ø represents the parameters of the missing-data process. The appropriate mechanism for handling the missing values depends on how Y relates to the variables contained in the dataset [14].
1.4.1. Mechanisms for Computing Missing Values
Missing Completely at Random (MCAR) Technique
The MCAR mechanism characterizes the missing values through Equation (1):
P(Y | X, Ø) = P(Y | Ø)     (1)
From (1), it is evident that the probability of a value being missing does not depend on either the observed or the missing variables contained in the dataset X. Therefore, the most feasible option for
dealing with missing values based on the MCAR approach is developing an algorithm that can randomly delete
values to normalize the dataset.
Missing at Random (MAR)
The MAR technique addresses the problem of missing data based on Equation (2):
P(Y | X, Ø) = P(Y | a, Ø)     (2)
From (2), the probability that a value is missing depends only on the observed values a, a subset of X, and not on the missing values themselves. This implies that the MAR approach systematically examines the rest of the dataset to find variables that covary with the incomplete attribute, and determines the missing values based on that covariance.
Missing Not at Random (MNAR)
The MNAR technique formulates a solution for missing values in a dataset based on Equation (3):
P(Y | X, Ø) = P(Y | a, b, Ø)     (3)
From (3), it is observed that the assumptions of the MAR mechanism are violated: the probability of a value being missing depends on the unobserved values b or on other unrecorded covariates within the dataset. For instance, when applied to tax computations, MNAR captures the case where missing data values depend on unobserved revenue declarations by the taxpayers [15]. The advantage of the MNAR formulation is that it separates data that was never provided from data that was incorrectly input due to measurement error.
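To make the three mechanisms concrete, the short R sketch below (not taken from the paper; the variable names and probabilities are illustrative) simulates missingness in a variable b whose observed companion is a, so that the only difference between the three cases is what the missingness probability depends on.
set.seed(1)
n <- 100
a <- rnorm(n)                    # fully observed covariate
b <- 2 * a + rnorm(n)            # variable that will receive missing values
p_mcar <- rep(0.1, n)            # MCAR: constant probability, independent of a and b
p_mar  <- plogis(-2 + 1.5 * a)   # MAR: probability depends only on the observed a
p_mnar <- plogis(-2 + 1.5 * b)   # MNAR: probability depends on the unobserved b itself
b_mcar <- ifelse(runif(n) < p_mcar, NA, b)
b_mar  <- ifelse(runif(n) < p_mar,  NA, b)
b_mnar <- ifelse(runif(n) < p_mnar, NA, b)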
1.5. Imputation as a Solution for Missing Data Values
Imputation is a computational technique that utilizes mathematical and statistical algorithms to resolve
the problem of missing values in a dataset by replacement without interfering with the attributes and values of
the entire dataset. Imputation approaches are broadly classified into traditional and advanced categories
depending on the applied method for replacing the missing values. Under the traditional imputation technique
[16], the problem is solved through pairwise and listwise deletion of the missing values, especially under the
condition of Missing Completely at Random (MCAR). Traditional imputation entails computational
procedures such as Multiple Imputation, Maximum Likelihood Imputation, Hot Deck Imputation, Mode
Imputation, and Mean Imputation. However, the advanced imputation approach relies on computational
intelligence to learn complex interdependencies in large data sets and determine the optimal method for
handling missing values based on the observed characteristics. Computational models such as Decision Trees,
Random Forest Algorithm, and k-Nearest Neighbor are implemented in advanced imputation.
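As a point of reference for the traditional category, the sketch below applies the mean rule to numeric columns and the mode rule to categorical columns of an arbitrary data frame df; it is a minimal illustration, not the paper's implementation.
impute_traditional <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # mean rule for numeric attributes
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    } else {
      # mode rule for categorical attributes
      tab <- table(df[[col]])
      df[[col]][is.na(df[[col]])] <- names(tab)[which.max(tab)]
    }
  }
  df
}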
The k-Nearest Neighbors (k-NN) algorithm, which is among the most efficient techniques for advanced imputation, conceptualizes missing data values as regression and pattern recognition problems. k-NN algorithms classify datasets based on the memory of the observed values and attributes rather than labeled vectors. The computational processes used to replace the missing values are based on the closest observed k neighbors in the training sets and the highest number of k iterations. The study [17] suggests that the individual variable classifications define the procedures used to determine the k nearest neighbors in each instance. K-NN imputation [18] is applied in the case of incomplete systems and unknown data distributions. The technique imputes missing data values by computing a distance metric between each incomplete record and its nearest k neighbors, producing estimates even for datasets that lack an appropriate overall mean or mode. The numerical values of the missing parameters are then predicted using the mean rule, while the missing categorical variables are computed using the mode rule, making it possible to efficiently handle the problem of missing values in large datasets.
This paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the proposed methodology for implementing an imputation technique based on the R programming language as a solution for handling missing values in datasets. Section 4 discusses the results of the experimental work. Section 5 presents the conclusion, and Section 6 provides recommendations for future studies.
2. RELATED WORK
A study [19] proposed a novel K-NN model for handling missing values through a two-stage training scheme based on the training data and missing data. This approach effectively handles missing instances in
heterogeneous data sets by computing Mutual Information (MI) weights between class labels and attributes of
the dataset. This ensures the imputation of missing data values to enhance classification performance,
especially in UCI data samples with varying rates of missing values. The performance efficiency of handling
missing values in biased datasets can be determined by successive simulations of continuous traits and
segmenting response variables through imputation procedures such as complete case analysis [14]. The model
performance is measured based on the degree of deviation/marginal error and the covariance between traits
and responses.
Findings from a research study [20] suggest that the adoption of advanced techniques for handling missing values delivers high-level performance by allowing multiple imputations on a single dataset. In one study, 20 samples were randomly selected from a Traumatic Brain Injury dataset. In 8 of the samples, one variable was deleted to create missing data, which was used to determine the technical performance efficiency of multiple imputation and single imputation methods [21]. The Multiple Imputation approach demonstrated higher effectiveness than single imputation in recovering the deleted variable, based on the comparison of estimated parameters [21].
According to [9], auto-encoder neural networks can be applied to the imputation of missing data to
innovatively predict missing values and automatically encode new files without missing values through a two-stage model. A more advanced imputation framework capable of imputing categorical and mixed continuous
variables through formal optimization, predicts missing values using mathematical models such as decision
trees, support vector machines, and closest k-neighbors [22]. The implementation of opti-impute generic
algorithm in this framework produces high-quality solutions due to the improved sampling accuracy in multiple
datasets obtained from the UCIML repository. This approach performs precise imputations of missing value
sets through predictive K-NN, mean-matching, and Bayesian techniques with low average absolute error
through cross-validated benchmarking [23].
According to [24], medical datasets with missing values may be difficult to impute as the null values
are often contained in categorical attributes hence complicating pre-processing stages. However, advanced
imputation techniques based on machine learning and decision-tree models are capable of effectively
identifying outliers and replacing the missing values through K-NN computations [25]. Outliers pose
significant risks of bias during statistical estimation procedures by increasing the likelihood of overstated or
understated decision outcomes hence the dependability of imputations techniques is a critical consideration.
According to [26], credal pattern classification models may be applied to the adaptive imputation of missing data based on the observed variables. This technique is founded on belief function theory, which implies that the missing data must be estimated before the observed datasets can be classified unambiguously and accurately, and it performs the imputation through self-organizing map (SOM) algorithms for pattern extraction. The algorithm classifies data patterns as either altered or original depending on the representation outcomes from the training classes.
3. METHODOLOGY
The objective of this section is to handle missing data values through the implementation of
imputation techniques based on IBK algorithms. The implementation entails manipulating unstructured
datasets and testing the performance efficiency and accuracy of imputation techniques in replacing the missing
values.
3.1. Dataset
A sample dataset obtained from Juba Insurance & Reinsurance Company was used in this project. An
arbitrary selection of 100 samples was prepared for imputation using the MAR, MNAR, and MCAR mechanisms to insert fictitious missing values, as shown in the table in Figure 4. The original dataset was then passed through a random data generation algorithm to model missing values across 4 categories, with the DocType attribute having an equal probability of missingness in each. The datasets were classified into 3 categories of missing rates (3%, 6%, and 10%), allowing the simulation of missing values from an attribute containing 4 classes, as shown in the table in Figure 3. A sketch of how such insertion could be coded is given after the figures.
Figure 3. The Original Dataset
Figure 4. Dataset with Missing Values
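A simple way to insert fictitious missing values at the three rates used here is sketched below. The data frame ins and the DocType column follow the paper's description; the helper function itself is an assumption about how the random insertion could be coded, and it draws the affected rows completely at random.
insert_missing <- function(data, column, rate) {
  idx <- sample(nrow(data), size = round(rate * nrow(data)))  # rows to blank out
  data[idx, column] <- NA
  data
}
ins_03 <- insert_missing(ins, "DocType", 0.03)   # 3% missing
ins_06 <- insert_missing(ins, "DocType", 0.06)   # 6% missing
ins_10 <- insert_missing(ins, "DocType", 0.10)   # 10% missing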
3.2. Imputation of the k-Nearest Neighbors (k-NN)
The next step is implementing the k-Nearest Neighbors algorithm to impute the missing data values
through pattern recognition after which non-parametric regression and classification are performed to pre-
process the dataset. The output contains data values classified into k-NN classes based on the plurality vote of the neighbors, in which objects are allocated to the closest neighbor class. Implementing algorithms such as Neighborhood Components Analysis and Large Margin k-NN allocates missing data values based on the closest neighbor classes [27]. Since imputation methods can be used to improve the classification performance of k-Nearest Neighbors, the ideal value of k is data dependent: larger values of k reduce the impact of noise on classification accuracy but produce less distinctive class boundaries. Therefore, the optimal value of k in each case can be estimated from the nearest training sets (starting from k = 1). After the imputation of missing values, an R programming model was developed to perform further classification of the missing values through the following steps, as illustrated in Figure 5:
Step 1: Organize the data into rows and columns representing records and attributes respectively
Step 2: Segment the dataset into two classes: ‘complete’ and ‘with MVs’
Step 3: Carry out normalization on the dataset
Step 4: Iteratively perform the imputation on the missing values independently
Step 5: Use K-NN algorithm to test variations between the new and original records
Step 6: Replace the missing values with the nearest attribute having the highest similarity
Step 7: Create a column with complete data values
Step 8: Apply the above procedure to impute missing values for all the records.
Figure 5. A Flowchart of the k-NN Imputation Procedure
The above proposed imputation technique utilizing the k-NN algorithm is based on adjusting the power (distance) parameters used to determine appropriate replacements for the missing data, hence the need to specify the attributes with null values and their nearest neighbors; a simplified sketch of this procedure is given below.
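The following is a simplified sketch of the eight-step procedure for a single categorical target column: the numeric predictors are normalized, the k nearest complete records are found for each incomplete record, and the missing entry is replaced by the neighbors' majority class (the mode rule). It assumes the numeric predictor columns are themselves complete, and it is illustrative rather than the paper's exact code.
knn_impute <- function(data, target, k = 5) {
  num_cols <- names(data)[sapply(data, is.numeric)]      # Step 1: predictor attributes
  scaled   <- scale(data[num_cols])                      # Step 3: normalization
  complete <- which(!is.na(data[[target]]))              # Step 2: 'complete' records
  missing  <- which(is.na(data[[target]]))               # Step 2: records 'with MVs'
  for (i in missing) {                                   # Step 4: impute each record in turn
    diff <- scaled[complete, , drop = FALSE] -
            matrix(scaled[i, ], nrow = length(complete), ncol = length(num_cols), byrow = TRUE)
    d   <- sqrt(rowSums(diff^2))                         # Step 5: distance to complete records
    nn  <- complete[order(d)[seq_len(k)]]                # k nearest neighbors
    tab <- table(data[[target]][nn])
    data[[target]][i] <- names(tab)[which.max(tab)]      # Step 6: mode rule replacement
  }
  data                                                   # Steps 7-8: completed column returned
}
Applied to the prepared dataset, a call such as ins_imputed <- knn_impute(ins_03, "DocType", k = 5) would return a completed copy of the DocType column.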
3.3. Programming Languages
The implementation of imputation technique for missing values in data sets requires Java and R
programming languages, which are used for statistical modelling and mathematical computation of relational
patterns among the variables.
3.4. Experimental Procedures
The first step of implementing the R model for imputation is preparing the work directory to read and
save the data file through the following commands:
setwd("~/R implementation")
ins <- read.csv("missing enc.csv")
The original dataset contains 100 instances, 7 attributes, and the null values shown in Figure 6. Since
DocType is recorded as an incomplete parameter with randomly missing values, the following R code is run to restructure the dataset:
Figure 6. Original Dataset Structure
The original dataset is then read and the missing DocType entries are set to NA to allow for the detection of null values using R code, as shown in the table in Figure 7:
Figure 7. The Dataset After Replacing Missing values with NA
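The exact recoding code appears only as a figure in the paper; one plausible form, assuming the missing DocType entries were exported as empty strings, is:
ins$DocType[ins$DocType == ""] <- NA   # treat empty strings in DocType as missing
str(ins)                               # inspect the restructured dataset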
A user-defined function is implemented to analyze missing values in the incomplete dataset using the
R-code shown in Figure 8;
Figure 8. The R-code
The R output for this command is shown in Figure 9;
Figure 9. The R Output
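The user-defined function is likewise only shown as a figure; a plausible equivalent that produces a per-column count of missing values is sketched below (the function name is an assumption).
count_missing <- function(data) {
  sapply(data, function(x) sum(is.na(x)))   # number of NA entries in each column
}
count_missing(ins)                           # DocType should report the 19 missing entries
mean(is.na(ins$DocType))                     # proportion of missing values in DocType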
Further analysis of the incomplete dataset using R-packages for pattern identification produced the
following output shown in Figure 10;
Figure 10. Further Analysis Using R-packages for Pattern Identification
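The text does not name the R packages used for the pattern identification; a common choice that produces a 1/0 pattern table like the one in Figure 10 is md.pattern from the mice package, assumed here.
library(mice)                  # assumed package for missing-data pattern tables
md.pattern(ins, plot = FALSE)  # rows give observed (1) / missing (0) patterns and their counts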
From Figure 10, it is observed that the observed and missing values are represented as the binary values 1 and 0, respectively. The number of observations made from the data file is shown in the first column, while the total number of variables with incomplete data is shown in the last column. A plot of the missing and complete data values is shown in Figure 11:
Figure 11. Complete and Incomplete Datasets
From the graph, it is observed that the initial output shows 81 samples with no missing values and 19
samples with missing values, which are further analyzed using the aggregate plot function in R as shown in the
Histogram (Figure 12).
Figure 12. Complete and Incomplete Datasets
From the histogram, it is observed that the dataset contained 19% missing values (shown in red) and 81% complete values (shown in blue) in the DocType attribute.
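The aggregation plot in Figure 12 resembles the output of the aggr function from the VIM package, which is assumed here; it reports the proportion of missing values per variable and the combinations in which they occur.
library(VIM)   # assumed package for the aggregation plot
aggr(ins, col = c("skyblue", "red"), numbers = TRUE, sortVars = TRUE)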
4. RESULTS AND DISCUSSION
The imputation technique based on the k-NN algorithm and its IBk implementation in the R programming language can effectively compute missing values in a dataset. The proposed technique applied multiple computational and statistical methods to the imputation approach for handling missing values that had been artificially created in a sample dataset. An insurance dataset was obtained and input into RStudio with the variables (Third-party, comprehensive, marine, and Fire + stolen) clearly defined. Running a series of pattern extraction and analytical functions in RStudio detected missing values from the dataset with 89.5% classification accuracy, as summarized in Figure 13:
Figure 13. Summary results of the incomplete dataset after imputation
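One hedged way to compute an accuracy figure of this kind, using the helpers sketched earlier, is to compare the imputed DocType values against the original values at the artificially masked positions. This illustrates the idea only; it is not necessarily how the reported 89.5% figure was produced.
masked   <- which(is.na(ins_03$DocType) & !is.na(ins$DocType))  # positions masked earlier
imputed  <- knn_impute(ins_03, "DocType", k = 5)                # impute the masked copy
accuracy <- mean(imputed$DocType[masked] == ins$DocType[masked])
accuracy                                                        # fraction recovered correctly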
The findings from this study are consistent with the results from similar studies [28], [29] which
showed that K-Nearest Neighbor (KNN) technique is a superior classification algorithm with high accuracy
when applied to the imputation of missing values in a dataset. The relative accuracy of this technique is based on factors such as weighted estimation, feature relevance to the specific datasets under study, and predictive detection of missing values based on a series of statistical analyses. Findings from a related study show that KNN-based
imputation of missing values in a dataset delivers 90% accuracy in the classification of missing values and
86.27% performance in the replacement of numerical values in relatively less computational time and error
compared to other techniques [30]. Imputation of missing values using KNN algorithm attains superior
performance in the experimental replacement of numerical values due to its unique ability to classify the
missing parameters and assign cluster ratios for each type, unlike other techniques that perform replacement in
whole datasets based on the normalized computation of mean absolute errors and root mean square error. A
study [31] observes that imputation based on computational and statistical models is recommended by scientists
due to its unique ability to determine the missing values by averaging a summarized likelihood function of the
entire dataset over a mathematically defined predictive distribution with considerably high precision.
The implementation of reverse data mining using the IBK classification algorithm has been effectively
demonstrated as a reliable solution for handling missing data values. The pre-processing implementation
functions replace missing values by computing the nearest neighbors with the highest similarity index. This
approach recreates the dataset with missing values into a complete dataset with accurately imputed variables and
attributes for use in data analytics and decision support.
RStudio increases the performance accuracy of the imputation technique by replacing the missing values in a dataset with several independently imputed values rather than a single randomly imputed value [32]. This reflects possible uncertainties that may arise due to technical errors with the imputation model. For
instance, if a regression model is applied to imputing the missing values, it is desired that imputations reflect
both sampling variability and uncertainties regarding the regression coefficients utilized in the model.
Independent modelling of the coefficients makes it possible to create a new set of imputed values for each
instance based on the coefficient distribution through multiple imputations. Therefore, RStudio made it possible to run standard analyses of the datasets with missing values to generate inferences, which are then combined to determine the most appropriate value of the missing data; a minimal sketch of this multiple-imputation workflow appears below.
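The package performing the multiple imputation is not named in the paper; a minimal sketch with the widely used mice package, assumed here, creates several independently imputed copies and extracts one completed dataset.
library(mice)                         # assumed multiple-imputation package
imp <- mice(ins, m = 5, seed = 123)   # five independently imputed datasets
completed <- complete(imp, 1)         # one completed copy for further analysis
# inferences from analyses run on each imputed copy can be pooled with pool()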
5. CONCLUSION
In conclusion, the objective of this study was to discuss and implement the imputation technique based
on R-programming language as a solution for handling missing values in data sets. The problem of missing
data may arise due to input or measurement errors, especially in our daily interactions with technology hence
impacting the quality of analytical insights from such data. The implementation of k-NN imputation method
based on the IBK classification algorithm proved a reliable approach to replacing missing data values with the
value of nearest attributes showing the highest similarity.
The experimental results also showed that pre-processing by imputation delivers high-level
performance efficiency in handling missing data values. These findings are consistent with the key idea and objective of the paper, which is to explore alternative imputation techniques for handling missing values to
improve the accuracy and reliability of decision insights extracted from datasets.
6. FUTURE WORK
While this paper presents an important knowledge framework for the future of data analytics and data-
driven decision-making, more research is needed to refine the imputation mechanisms for replacing missing
values to eliminate noise and inconsistencies, especially in the massive data generated from the Internet of
Things. Future work should focus on the development of automated imputation techniques based on machine
learning and artificial intelligence for improved efficiency in data pre-processing and analytics.
Since the dataset used in this study is very small, an extended study on a larger, commonly used benchmark dataset would be useful, especially for comparing results with other methods.
REFERENCES
[1] A. Nikitas, K. Michalakopoulou, E. T. Njoya and D. Karampatzakis, "Artificial Intelligence, Transport and the Smart
City: Definitions and Dimensions of a New Mobility Era," Sustainability, vol. 12, no. 7, p. 2789, 2020.
[2] G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning, with Applications in R,
vol. 2, New York, NY: Springer, 2021.
[3] Z. ÇETİNKAYA and F. HORASAN, "Decision Trees in Large Data Sets," International Journal of Engineering
Research and Development, vol. 13, no. 1, pp. 140-151, 2021.
[4] S. Chuprov, I. Viksnin, I. Kim, T. Melnikov, L. Reznik and I. Khokhlov, "Improving Knowledge Based Detection of
Soft Attacks Against Autonomous Vehicles with Reputation, Trust and Data Quality Service Models," in 2021 IEEE
International Conference on Smart Data Services, Chicago, IL, USA, 2021.
[5] S. Pei, H. Chen, F. Nie, R. Wang and X. Li, "Centerless Clustering: An Efficient Variant of K-means Based on K-
NN Graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[6] C. Fu, C. Xu, M. Xue, W. Liu and S. Yang, "Data-driven decision making based on evidential reasoning approach
and machine learning algorithms," Applied Soft Computing, vol. 110, no. 15, 2021.
[7] V. Singh and V. D. Kaushik, "Concepts of Data Mining and Process Mining," in Process Mining Techniques for
Pattern Recognition, CRC Press, 2022.
[8] C. Yuan and H. Yang, "Research on K-Value Selection Method of K-Means Clustering Algorithm," J, vol. 2, no. 2,
pp. 226-235, 2019.
[9] S. J. Choudhury and N. R. Pal, "Imputation of missing data with neural networks for classification," Knowledge-
Based Systems, vol. 182, 2019.
[10] B. C. Wesolowski, "Data Preprocessing and Data Manipulation," in From Data to Decisions in Music Education
Research, 1 ed., 2022.
[11] M. Umair, F. Majeed, M. Shoaib, M. Q. Saleem, M. S. Adrees, A. E. Karrar, S. Khurram, M. Shafiq and J.-G. Choi,
"Main Path Analysis to Filter Unbiased Literature," Intelligent Automation and Soft Computing, vol. 32, no. 2, pp.
1179-1194, 2022.
[12] N. Whitmore, "Data cleaning," in R for Conservation and Development Projects, Chapman and Hall/CRC, 2020.
[13] E. W. Steyerberg, "Missing Values," in Clinical Prediction Models. Statistics for Biology and Health, Cham,
Springer, 2019.
[14] T. F. Johnson, N. J. B. Isaac, A. Paviolo and M. González-Suárez, "Handling Missing Values in Trait Data," Global
Ecology and Biogeography, vol. 30, no. 1, pp. 51-62, 2020.
[15] C. Bonander and U. Strömberg, "Methods to handle missing values and missing individuals," European Journal of
Epidemiology, vol. 34, no. 1, 2019.
[16] A. E. Karrar, "Investigate the Ensemble Model by Intelligence Analysis to Improve the Accuracy of the Classification
Data in the Diagnostic and Treatment Interventions for Prostate Cancer," International Journal of Advanced
Computer Science and Applications, vol. 13, no. 1, pp. 181-188, 2022.
[17] Z. Hu and D. Du, "A new analytical framework for missing data imputation and classification with uncertainty:
Missing data imputation and heart failure readmission prediction," PLoS ONE, vol. 15, no. 9, pp. 1-15, 2020.
[18] A. K.S., R. Ramanathan and M. Jayakumar, "Impact of K-NN imputation Technique on Performance of Deep
Learning based DFL Algorithm," in 2021 Sixth International Conference on Wireless Communications, Signal
Processing and Networking, Chennai, India, 2021.
[19] A. Choudhury and M. R. Kosorok, "Missing Data Imputation for Classification Problems," arXiv, 2020.
[20] A. Yadav, A. Dubey, A. Rasool and N. Khare, "Data Mining Based Imputation Techniques to Handle Missing Values
in Gene Expressed Dataset," International Journal of Engineering Trends and Technology, vol. 69, no. 9, pp. 242-
250, 2021.
[21] A. E. Karrar, "A Novel Approach for Semi Supervised Clustering Algorithm," International Journal of Advanced
Trends in Computer Science and Engineering, vol. 6, no. 1, pp. 1-7, 2017.
[22] A. Orfanoudaki, A. Giannoutsou, S. Hashim, D. Bertsimas and R. C. Hagberg, "Machine learning models for mitral
valve replacement: A comparative analysis with the Society of Thoracic Surgeons risk score," Journal of Cardiac
Surgery, vol. 37, no. 1, pp. 18-28, 2022.
[23] D. Bertsimas, A. Orfanoudaki and C. Pawlowski, "Imputation of clinical covariates in time series," Machine
Learning, vol. 110, no. 1, pp. 185-248, 2021.
[24] B. M. Bai, N. Mangathayaru, B. P. Rani and S. Aljawarneh, "Mathura (MBI) - A Novel Imputation Measure for
Imputation of Missing Values in Medical Datasets," Recent Advances in Computer Science and Communications,
vol. 14, no. 5, pp. 1358-1369, 2021.
[25] D. Lee and K. Shin, "Robust Factorization of Real-world Tensor Streams with Patterns, Missing Values, and
Outliers," in 2021 IEEE 37th International Conference on Data Engineering, Chania, Greece, 2021.
[26] T. Siswantining, T. Anwar, D. Sarwinda and H. S. Al-Ash, "A Novel Centroid Initialization in Missing Value
Imputation towards Mixed Datasets.," Communications in Mathematical Biology and Neuroscience, vol. 2021, 2021.
[27] F. Yin and F. Shi, "A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model
to a Cluster Architecture," International Journal of Parallel Programming, vol. 50, no. 11, pp. 27-64, 2022.
[28] R. Pan, T. Yang, J. Cao, K. Lu and Z. Zhang, "Missing data imputation by K nearest neighbours based on grey
relational structure and mutual information," Applied Intelligence, vol. 43, no. 3, pp. 614-632, 2015.
[29] P. Keerin and T. Boongoen, "Improved KNN Imputation for Missing Values in Gene Expression Data," Computers,
Materials and Continua, vol. 70, no. 2, pp. 4009-4025, 2022.
[30] K. M. Fouad, M. M. Ismail, A. T. Azar and M. M. Arafa, "Advanced methods for missing values imputation based
on similarity learning," PeerJ Computer Science, vol. 7, 2021.
[31] M. Pampaka, G. Hutcheson and J. Williams, "Handling missing data: analysis of a challenging data set using multiple
imputation," International Journal of Research & Method in Education, vol. 39, no. 1, pp. 19-37, 2014
[32] J. C. Jakobsen, C. Gluud, J. Wetterslev and P. Winkel, "When and how should multiple imputation be used for
handling missing data in randomised clinical trials - A practical guide with flowcharts," BMC Medical Research
Methodology, vol. 17, pp. 162-171, 2017.
... These findings are consistent with other studies that have shown that imputation methods, including kNN, are reliable methods for missing values estimation and can help improve the classification performance [14] [35] [36] . A study realized by [36] randomly inserted missing values in an existing dataset to evaluate kNN imputation accuracy and obtained a 89.5% accuracy rate using this method. ...
... These findings are consistent with other studies that have shown that imputation methods, including kNN, are reliable methods for missing values estimation and can help improve the classification performance [14] [35] [36] . A study realized by [36] randomly inserted missing values in an existing dataset to evaluate kNN imputation accuracy and obtained a 89.5% accuracy rate using this method. Reference [14] compared different imputation techniques in a breast cancer dataset and kNN presented the highest accuracy averages to 4 out of 7 classifiers analyzed when compared to other imputation methods. ...
Conference Paper
Missing values and class imbalance are issues frequently found in databases from real-world scenarios, including cancer classification. Impacts on the performance of Machine Learning (ML) models can be observed if these issues are not properly addressed prior to the analysis. In this paper, a combined solution with missing data imputation using kNN and cluster-based undersampling using k-means is proposed, focusing on pancreatic cancer classification. Different data subsets were generated by combining different preprocessing methods and the performance was analyzed using a ML analysis pipeline from a previous study. This pipeline implements ten ML classifiers, including Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Network (ANN). All data subsets presented a significant improvement (p<0.05 with Students T-Test) in the performance of most ML algorithms when compared with the results obtained when the pipeline was first evaluated. Results suggest that kNN and k-means can be used in the data preprocessing phase to overcome missing values and class imbalance issues and improve the classification accuracy.
... Data preparation is a critical stage when working with problematic datasets in AR environments. In the proposed approach of HYNNA-PBM, we handle three ways of preprocessing: data cleaning, outliers removing, and taking missing values [47] [48] and [49]. Data cleaning will involve locating and removing anomalies in the first step. ...
Article
Full-text available
To improve how digital media art is delivered and experienced, this study proposed a novel way to combine the unique capabilities of 6G networks with the effective Hybrid Neural Network Augmented Physics-based Models (HYNNA-PBM). This research presents a system where HYNNA-PBM plays an essential role in augmenting AR applications’ realism, interaction, and involvement by using the exceptional speed, bandwidth, and low latency of 6G technologies combined with advanced network physics principles. The proposed method solves the essential difficulties in producing incredibly realistic AR environments, which include dynamic light rendering, real-time physical interactions and complex environmental simulations by effectively combining neural networks with physics-based models. This method improves the user experience with digital media art by allowing the smooth combination of virtual elements with the real world, significantly increasing AR simulations’ accuracy and efficiency. In addition to entertainment, this technology provides creative applications in design, education, and cultural preservation by allowing absorbing, cooperative experiences previously unavailable in education and protecting culture. The system’s ability to provide highly complete experiences is shown by experimental results displayed through digital media artworks. This highlights the innovative possibilities of combining HYNNA-PBM with 6G technology in AR applications. With the borders between the virtual and the real-world loss, this research opens up new opportunities for digital art and entertainment that provide never-before-seen levels of involvement and interest.
... This step involves resolving inconsistencies, such as differing units of measurement and temporal or spatial resolutions. Removing or correcting errors, such as missing values, outliers, and duplicates (Karrar et al., 2022). Techniques such as imputation, interpolation, and anomaly detection can be used to address these issues. ...
Article
Full-text available
The integration of Artificial Intelligence (AI) in sustainable accounting represents a transformative approach to enhancing the accuracy, efficiency, and comprehensiveness of environmental impact assessment and reporting. This paper explores the development of AI-driven models aimed at advancing sustainable accounting practices, focusing on environmental impact assessment and transparent reporting. AI technologies, particularly machine learning (ML) and natural language processing (NLP), play a pivotal role in automating and refining data collection, analysis, and reporting processes. These technologies enable the processing of vast amounts of heterogeneous data from multiple sources, including IoT sensors, satellite imagery, and corporate disclosures. By leveraging ML algorithms, organizations can identify patterns, predict trends, and assess the environmental impact of their operations with unprecedented precision. One of the key advantages of AI in sustainable accounting is its ability to enhance data accuracy and reliability. Traditional methods often suffer from manual errors and inconsistencies. AI models, however, can continuously learn and adapt, improving their accuracy over time. For instance, predictive analytics can forecast future environmental impacts based on historical data, allowing companies to implement proactive measures to mitigate adverse effects. Furthermore, AI facilitates real-time monitoring and reporting. IoT devices equipped with environmental sensors can stream data to AI systems, which process and analyze the information instantaneously. This capability is crucial for timely reporting and compliance with environmental regulations. Real-time data analytics also empower organizations to make informed decisions swiftly, optimizing their sustainability strategies and reducing their ecological footprint. Another significant contribution of AI is in enhancing transparency and accountability in environmental reporting. NLP algorithms can analyze and interpret regulatory texts, corporate reports, and public records, ensuring that organizations adhere to sustainability standards and guidelines. Additionally, AI can automate the generation of comprehensive and comprehensible sustainability reports, making them accessible to a broader audience, including stakeholders and regulators. Developing robust AI models for sustainable accounting involves several critical steps. Initially, data preprocessing is essential to clean and harmonize diverse datasets, ensuring quality input for AI algorithms. Next, model training and validation are conducted using historical and real-time data to refine predictive capabilities. Continuous model evaluation and adjustment are necessary to maintain accuracy and relevance in dynamic environmental contexts. Collaboration between AI experts, environmental scientists, and accounting professionals is paramount in this development process. Interdisciplinary teams can ensure that AI models are not only technically sound but also aligned with environmental science principles and accounting standards. This collaboration also fosters innovation, leading to the development of more sophisticated tools for environmental impact assessment and reporting. The adoption of AI-driven sustainable accounting models offers numerous benefits, including enhanced efficiency, accuracy, and compliance. However, challenges such as data privacy, algorithmic transparency, and the need for substantial initial investments must be addressed. 
Future research should focus on overcoming these obstacles and exploring the potential of emerging AI technologies, such as deep learning and blockchain, to further revolutionize sustainable accounting practices. AI holds significant promise for transforming sustainable accounting by improving environmental impact assessment and reporting. Through advanced data analytics, real-time monitoring, and enhanced transparency, AI can help organizations achieve their sustainability goals, ensuring a more sustainable future. The continuous development and refinement of AI models, supported by interdisciplinary collaboration, are essential for realizing these benefits and addressing the complex challenges of environmental sustainability. Keywords: Sustainable Accounting, Environmental Impact Assessment, AI, Developing Models, Reporting.
... The utilization of the weighted averaging technique in enhanced KNN is based on the assumption that assigning weights to instances that are more similar results in imputations that more accurately depict the fundamental patterns in the data. The incorporation of similarity measurements in the imputation process has been emphasized in (21) and the authors suggested that weighted estimation has significant impact in increasing the efficacy of imputation process (22). In the study (23), the authors have used Multivariate Imputation by Chained Equation (MICE) technique to mitigate the issue of missing data. ...
... Data pre-processing is an essential phase in the data analysis procedure, as it prepares raw data for subsequent processing and analysis. One common challenge in data pre-processing involves dealing with missing data (13) . K-Nearest Neighbors (KNN) imputation is one such technique for handling missing values. ...
... Validation aims to assess the filtered data's completeness and accuracy. Whereas imputation seeks to correct errors and enter missing values, either manually or automatically [16]. In this study, data preprocessing was carried out to transform the observed data into fuzzy sets. ...
Article
Full-text available
Diseases and pests of lowland rice are one of the factors that can cause a decrease in rice yields. Therefore, it is necessary to have a diagnostic system to identify diseases and pests of paddy rice from an early age based on damage symptoms. The process of diagnosis requires expertise, knowledge, and experience from experts. Therefore, this research tries to build an expert system that can diagnose diseases and pests of paddy rice early by applying the Fuzzy Inference System Takagi with the Sugeno method. Fuzzy Inference System Takagi forms fuzzy sets using implication functions (rules). Rule composition is obtained from a data set of relationships between regulations, where the affirmation (defuzzification) and input from defuzzification is a constant or linear equation. The Sugeno method is used to diagnose diseases and pests of rice plants based on the symptoms experienced. This research aims to help plant pest control officers diagnose diseases and pests of paddy rice plants from the symptoms that attack the rice. The testing technique used is system accuracy testing and Mean Opinion Score (MOS) testing. The MOS test was carried out by involving 30 respondents consisting of 10 farmers and 20 extension workers, where 4.27 was obtained on a scale of 5 which was categorized into a good system. while testing the accuracy obtained from testing the system on two experts on diseases and pests of Madura paddy rice plants in 30 different cases has resulted in an accuracy rate of 86.66%. The expert system built in this study was able to diagnose 13 diseases and pests of Madura paddy based on the knowledge of two experts on 38 symptoms, and the plan was feasible to use and categorized into a good system.
... b. Pre-processing data The data pre-processing steps in this study are divided into two parts: (a) preprocessing with mean imputation is a method to handle missing data in the dataset (Karrar, 2022). In this process, the missing values in a feature (column) will be replaced with the mean value of that feature. ...
Article
Full-text available
This research focuses on implementing the Random Forest and Grid Search algorithms for the early detection of diabetes mellitus, aiming to modernize and enhance medical practices using technology. The proposed model achieved an accuracy of 77.06%, a precision of 71.43%, a recall of 47.30%, and a misclassification error of 22.94%. Comparative analysis with other data mining algorithms, including Decision Tree, Random Forest without Grid Search, and Cat Boost, demonstrated that the Random Forest with Grid Search algorithm outperformed the others. By utilizing Grid Search, the accuracy of the Random Forest algorithm increased by 2.03%. These findings indicate the potential effectiveness of machine learning in early diabetes detection. While the research offers promising results, there are limitations in terms of the dataset size and the number of detection variables used. Future studies should explore larger datasets and alternative algorithms to further enhance accuracy and aid in the early detection of diabetes mellitus.
Chapter
The banking industry performs credit score analysis as an efficient credit risk assessment method to determine a customer’s creditworthiness. In the banking industry, machine learning could be used for a variety of uses involving data analysis. A method of data analysis that is capable of self-regulation has been made possible by the development of modern techniques, such as classification approaches. The classification method is a form of supervised learning in which the computer acquires knowledge from the provided input data and then utilizes it to classify the dataset, which is used for training purposes. This study presents a comparative analysis of the various machine learning algorithms that are utilized to evaluate credit risk. The methods are used by utilizing the German Credit dataset that was collected from Kaggle, which consists of 1,000 instances and 11 attributes, all of which are used to determine if transactions are good or bad. The findings of data analysis using Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, K-Nearest Neighbors Classifier, Decision Tree Classifier, Support Vector Machines, and Random Forest are compared and contrasted in this study. The findings demonstrated that the Random Forest algorithm forecasted credit risk effectively.KeywordsCredit RiskBankingMachine LearningPredictionFeatures
Chapter
The quality of healthcare outcomes is influenced by the reliability and accuracy of computational analysis techniques applied to clinical data. Advanced classification systems improve the precision and speed of medical diagnosis, providing critical decision insights for doctors and resource optimization. Classification techniques based on machine learning have been applied as effective and reliable non-surgical techniques for the efficacious diagnosis and treatment of heart disease patients. This paper demonstrates the application of the Mixed Feature Creation (MFC) approach to classify a heart-disease dataset from UCI Cleveland using techniques such as the Recursive Feature Elimination with Random Forest (RFE-RF) feature selection and Least Absolute Shrinkage and Selection Operator (LASSO). Parameters from each technique are optimized through grid-search and cross-validation methods. Further, classifier performance models are used to determine the classification techniques’ F1-scores, precision, sensitivity, specificity, and accuracy based on independent measures such as RMSE and execution time. The findings suggest that ML-driven classification algorism can be used to develop reliable predictive models for the accurate diagnosis of heart diseases.
Article
Full-text available
Class imbalance problem become greatest issue in data mining, imbalanced data appears in daily application, especially in the health care. This research aims at investigating the application of ensemble model by intelligence analysis to improving the classification accuracy of imbalanced data sets on prostate cancer. The primary requirements obtained for this study included the datasets, relevant tools for pre-processing to identify the missing values, models for attribute selection and cross validation, data resembling framework, and intelligent algorithms for base classification. Additionally, the ensemble model and meta-learning algorithms were acquired in preparation for performance evaluation by embedding feature selecting capabilities into the classification model. The experimental results led to the conclusion that the application of ensemble learning algorithm on resampled data sets provides highly accurate classification results on single classifier J48. The study further suggests that gain ratio and ranker techniques are highly effective for attribute selection in the analysis of prostate cancer data. The lowest error rate and optimal performance accuracy in the classification of imbalanced prostate cancer data is achieved using when Adaboost algorithm is combined with single classifier J48.
Article
Full-text available
Citations are references used by researchers to recognize the contributions of researchers in their articles. Citations can be used to discover hidden patterns in the research domain, and can also be used to perform various analyses in data mining. Citation analysis is a quantitative method to identify knowledge dissemination and influence papers in any research area. Citation analysis involves multiple techniques. One of the most commonly used techniques is Main Path Analysis (MPA). According to the specific use of MPA, it has evolved into various variants. Currently, MPA is carried out in different domains, but deep learning in the field of remote sensing has not yet been considered. In this paper, we have used three centrality attributes which are Degree, Betweenness and Close-ness centrality to automatically identify important papers by applying clustering method based on machine learning (i.e., K-means). In addition, the main path is drawn from important papers and compared with existing manual methods. In order to conduct experiments, a data set from Web of Science (WOS) has been established, which contains 538 papers in the field of deep learning. Compared with existing works, our method provides the most relevant papers on the main path.
Article
Full-text available
Background: Current Society of Thoracic Surgeons (STS) risk models for predicting outcomes of mitral valve surgery (MVS) assume a linear and cumulative impact of variables. We evaluated postoperative MVS outcomes and designed mortality and morbidity risk calculators to supplement the STS risk score. Methods: Data from the STS Adult Cardiac Surgery Database for MVS was used from 2008 to 2017. The data included 383,550 procedures and 89 variables. Machine learning (ML) algorithms were employed to train models to predict postoperative outcomes for MVS patients. Each model's discrimination and calibration performance were validated using unseen data against the STS risk score. Results: Comprehensive mortality and morbidity risk assessment scores were derived from a training set of 287,662 observations. The area under the curve (AUC) for mortality ranged from 0.77 to 0.83, leading to a 3% increase in predictive accuracy compared to the STS score. Logistic Regression and eXtreme Gradient Boosting achieved the highest AUC for prolonged ventilation (0.82) and deep sternal wound infection (0.78 and 0.77) respectively. EXtreme Gradient Boosting performed the best with an AUC of 0.815 for renal failure. For permanent stroke prediction all models performed similarly with an AUC around 0.67. The ML models led to improved calibration performance for mortality, prolonged ventilation, and renal failure, especially in cases of reconstruction/repair and replacement surgery. Conclusions: The proposed risk models complement existing STS models in predicting mortality, prolonged ventilation, and renal failure, allowing healthcare providers to more accurately assess a patient's risk of morbidity and mortality when undergoing MVS.
Article
Real-world data analysis and processing with data mining techniques frequently encounter observations that contain missing values, and the existence of missing values is the main challenge in mining datasets. Missing values in a dataset should be imputed to improve the accuracy and performance of data mining methods. Existing techniques use the k-nearest neighbors algorithm to impute missing values, but determining an appropriate k value can be a challenging task. Other existing imputation techniques are based on hard clustering algorithms; when records are not well separated, as in the case of missing data, hard clustering often provides a poor description tool. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method called KI is initially proposed, incorporating the k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through record similarity by using the k-nearest neighbors algorithm (kNN), and to improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records by using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method called FCKI is then proposed as an extension of KI; it integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because records can belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors, applying two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed using fifteen datasets of different sizes with varying missing ratios for three types of missing data (MCAR, MAR, and MNAR), all generated in this work. The proposed imputation techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
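Both ingredients of the KI method, neighbour-based and iterative imputation, exist as off-the-shelf R routines; the sketch below applies them separately rather than as the authors' hybrid, and the data frame incomplete and the value k = 5 are assumptions.

    # Illustrative sketch of the two building blocks combined by KI:
    # neighbour-based (kNN) imputation and iterative (chained-equations) imputation.
    library(VIM)    # kNN()
    library(mice)   # mice() / complete()

    knn_filled  <- kNN(incomplete, k = 5, imp_var = FALSE)        # kNN step
    iter_filled <- complete(mice(incomplete, m = 1, method = "pmm",
                                 printFlag = FALSE))              # iterative step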
Article
Although many clustering models have been proposed recently, k-means and the family of spectral clustering methods still draw a great deal of attention due to their simplicity and efficacy. We first review the unified framework of k-means and graph cut models, and then propose a clustering method called k-sums, in which a k-nearest neighbor (k-NN) graph is adopted. The main idea of k-sums is to directly minimize the sum of the distances between points in the same cluster. To deal with the situation where the graph is unavailable, we propose k-sums-x, which takes features as input. The computational and memory overheads of k-sums are both O(nk), indicating that it scales linearly with respect to the number of objects to group; moreover, the computational and memory costs are irrelevant to the product of the number of points and clusters. The computational and memory complexity of k-sums-x are both linear with respect to the number of points. To validate the advantages of k-sums and k-sums-x on facial datasets, extensive experiments were conducted on 10 synthetic datasets and 17 benchmark datasets. While having a low time complexity, k-sums achieves performance comparable with several state-of-the-art clustering methods.
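As a concrete reading of the objective described above, the sketch below evaluates the within-cluster pairwise-distance sum for a given assignment; it paraphrases the stated idea only and is not the paper's algorithm or its optimizer.

    # Illustrative sketch: sum of pairwise distances between points in the same cluster,
    # the quantity k-sums is described as minimizing (evaluation only, not the solver).
    k_sums_objective <- function(X, cluster) {
      D <- as.matrix(dist(X))                    # all pairwise Euclidean distances
      sum(sapply(unique(cluster), function(c) {
        idx <- which(cluster == c)
        sum(D[idx, idx]) / 2                     # each pair counted once
      }))
    }

    k_sums_objective(iris[, 1:4], kmeans(iris[, 1:4], centers = 3)$cluster)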
Chapter
In this book chapter, the authors illustrate the basic concepts of data mining and process mining in the field of education. Educational data mining (EDM) plays a very important role in educational institutions, where vast amounts of data must be handled to infer useful information. This is beneficial in many respects, such as assessing students' performance, forecasting expected results for upcoming semester exams from the available results of previous semesters, and determining undergraduates' tendencies toward pursuing further higher education at a university or institution. Many educational data mining strategies are available for supporting such educational decisions and helping institutions grow. This chapter discusses some of these strategies and the underlying basic concepts.
Article
The problem of missing values has long been studied by researchers working in data science and bioinformatics, especially in the analysis of gene expression data that facilitates early detection of cancer. Many attempts show improvements made by excluding samples with missing information from the analysis process, while others have tried to fill the gaps with possible values. While the former is simple, the latter safeguards against information loss. For that purpose, a neighbour-based (KNN) approach has proven more effective than other global estimators. This paper extends the idea further by introducing a new summarization method to the KNN model. It is the first study that applies the concept of the ordered weighted averaging (OWA) operator to such a problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbour-based models. Using different ratios of missing values from 1%–20% and a set of six published gene expression datasets, the experimental results suggest that the new methods usually provide more accurate estimates than the compared methods. Specific to missing rates of 5% and 20%, the best NRMSE scores averaged across datasets are 0.65 and 0.69, while the highest measures obtained by the existing techniques included in this study are 0.80 and 0.84, respectively.
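A minimal R sketch of the ordered-weighted-averaging idea in a KNN imputation setting; the linearly decreasing weights, the Euclidean distance, and the gene-by-sample matrix layout are assumptions made for illustration, not the weighting schemes proposed in the paper.

    # Illustrative sketch: impute expr[gene, sample] from the k most similar genes,
    # aggregating their values with an ordered weighted average (OWA) instead of a plain mean.
    owa_impute <- function(expr, gene, sample, k = 5) {
      cand <- which(!is.na(expr[, sample]) & seq_len(nrow(expr)) != gene)  # donor genes
      obs  <- setdiff(which(!is.na(expr[gene, ])), sample)                 # observed samples
      d    <- apply(expr[cand, obs, drop = FALSE], 1,
                    function(r) sqrt(sum((r - expr[gene, obs])^2, na.rm = TRUE)))
      near <- expr[cand[order(d)[1:k]], sample]        # values of the k closest genes
      w    <- rev(seq_len(k)) / sum(seq_len(k))        # linearly decreasing OWA weights
      sum(w * sort(near, decreasing = TRUE))           # ordered weighted average
    }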