Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
Vol. 10, No. 2, June 2022, pp. 375~384
ISSN: 2089-3272, DOI: 10.52549/ijeei.v10i2.3730
Journal homepage: http://section.iaesonline.com/index.php/IJEEI/index
The Effect of Using Data Pre-Processing by Imputations in
Handling Missing Values
Abdelrahman Elsharif Karrar
College of Computer Science and Engineering, Taibah University, Saudi Arabia
Article Info
ABSTRACT
Article history:
Received Feb 19, 2022
Revised Apr 4, 2022
Accepted Apr 8, 2022
The evolution of big data analytics through machine learning and artificial
intelligence techniques has caused organizations in a wide range of sectors
including health, manufacturing, e-commerce, governance, and social welfare
to realize the value of massive volumes of data accumulating on web-based
repositories daily. This has led to the adoption of data-driven decision models;
for example, through sentiment analysis in marketing where producers leverage
customer feedback and reviews to develop customer-oriented products.
However, the data generated in real-world activities is subject to errors
resulting from inaccurate measurements or faulty input devices, which may
result in the loss of some values. Missing attribute/variable values make data
unsuitable for decision analytics due to noises and inconsistencies that create
bias. The objective of this paper was to explore the problem of missing data
and develop an advanced imputation model based on Machine Learning and
implemented on K-Nearest Neighbor (KNN) algorithm in R programming
language as an approach to handle missing values. The methodology used in
this paper relied on applying advanced machine learning algorithms with
high-level accuracy in pattern detection and predictive analytics to the existing
imputation techniques, which handle missing values by random replacement
or deletion. According to the results, the advanced imputation technique based
on machine learning models replaced missing values from a dataset with
89.5% accuracy. The experimental results showed that pre-processing by
imputation delivers high-level performance efficiency in handling missing
data values. These findings are consistent with the key idea of the paper, which is
to explore alternative imputation techniques for handling missing values to
improve the accuracy and reliability of decision insights extracted from
datasets.
Keyword:
Data Pre-Processing
Imputation Model
Machine Learning
k-Nearest Neighbor Algorithm
Missing Values
Copyright © 2022 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Abdelrahman Elsharif Karrar
College of Computer Science and Engineering, Taibah University, Saudi Arabia
Email: akarrar@taibahu.edu.sa
1. INTRODUCTION
Technological growth over the recent years has caused unprecedented growth in information and
communication capabilities and mobility, especially in the field of Artificial Intelligence, which enables smart
devices to perform automated functions even beyond human ability. Computer systems embedded with
advanced capabilities such as automation and predictive analytics have been adopted in a wide range of areas
such as transportation, manufacturing, healthcare, marketing, finance, and robotics. These applications
generate huge quantities of data, which presents significant growth opportunities in the field of Machine
Learning [1]. Machine learning is a subset of artificial intelligence in which specialized algorithms are designed
to analyze and extract insights from unstructured datasets. Machine learning utilizes mathematical models to
discover trends and make data-driven predictions concerning the variables under study. According to [2],
predictive modelling refers to the formulation of computerized algorithms to perform data analytics operations
and make forecasts about a problem by regression or classification.
The data generated from the devices and applications with AI and ML capabilities are stored in
dynamic servers and databases, which require routine updating and upgrading due to the complex nature of the
data and sources. Unstructured data from certain sources are subject to errors due to the possibility of storing
null values, which are also analyzed during decision-making processes. Data may also contain inconsistencies
and discrepancies, and noises, which may lead to unreliable output. Therefore, data cleansing is an essential
aspect of database management to minimize the risks of bias resulting from the storage inconsistencies,
especially for large data sets [3], [4]. In the process of data analytics by machine learning, shortage of data is
the most prevalent challenge faced in developing predictive models.
Mean imputation is one of the simplest methods for addressing the problem of missing data values in advanced analytics systems, although it tends to weaken the strength of association between variables, whereas regression imputation can inflate the strength of association and introduce ambiguity into the data values. Imputation under the Missing Completely at Random (MCAR) assumption, for example based on the K-Nearest Neighbors (K-NN) algorithm [5], is generally applied to the replacement of missing data values, although naive replacement increases the likelihood of misleading or incomplete results. Imputation techniques correct data inconsistencies by replacing the missing values with the mean of the observed values or the last observed value. The relative advantage of imputation based on ML algorithms over other existing methods for handling missing values [6] is that the replacement values rely on a combination of computational, mathematical, and statistical models rather than random parameters from the dataset. This makes it a comparatively more reliable, accurate, and applicable approach for data-driven decision-making, especially in high-precision fields such as manufacturing and healthcare.
1.1. Problem Statement
Structured and unstructured data collected from digital devices and computer systems in their real-
world applications may have missing values, noise, and other inconsistencies. Subjecting such datasets to
analytics increases the risks of inaccurate output leading to biased decisions and erroneous insights, which may
have adverse implications for the enterprise. The existing techniques for handling missing values by imputation rely on random sampling to estimate and replace the missing values, making them potentially inaccurate and unreliable, especially when applied to data-driven decision-making in high-precision fields such as medical surgery and manufacturing. Therefore, there is a need to adopt effective approaches based on computational, mathematical, and statistical models to address the problem of missing values in large datasets. This paper sought to apply the imputation technique to pre-process and classify data with missing values to reduce the risks of inconsistency or errors in decision-making processes.
1.2. Data Mining
Data mining [7] refers to a process in which valuable information is extracted from unstructured or
structured datasets stored in databases or cloud platforms. According to [8], data mining entails the extraction, classification, and transfer of various data through a series of processes including data cleaning, standardization, and testing. The process of data mining entails a sequence of steps, shown in Figure 1, which include the following:
1. Cleaning the data to remove inconsistencies and noise.
2. Integration to combine data from overlapping sources.
3. Selecting and retrieving data sets that are appropriate for analysis from the database.
4. Extracting useful data patterns from the database.
5. Identifying and evaluating the variable patterns representing knowledge based on the set parameters.
6. Presenting and representing knowledge through visualization and presentation techniques such as schemas [9].
Figure 1. Steps Involved in the Process of Data Mining
1.3. Pre-processing
Data pre-processing [10] entails a series of preparations aimed at converting the data into a format
that is easier to analyze, since most of the information collected from day-to-day activities is largely unstructured
and may be difficult to analyze due to missing values [11]. Pre-processing raw data involves various processes
including cleansing, integration, transformation, and reduction as illustrated in Figure 2;
Figure 2. Stages of Data Pre-processing
The objective of data cleaning is to identify and complete null values, fix incoherence, and
standardizing outliers to reduce noise. When the data sets are imported from storage resources such as servers
and databases, the first step of data cleaning [12] is merging the datasets with related information, which are
then rebuilt to recreate missing values and de-duplicated to normalize the sets. The next step is data verification
and enrichment, which entail confirming the validity and making further improvements to prepare the data for
analysis and decision making.
1.4. Missing Values
Missing Values (MVs) in machine learning refer to attribute values that are absent from a dataset because of errors arising in the input process, such as improper measurements or device failure [13]. A missing-value mechanism is used to determine whether a correlational link exists between the missing values and the other variables in a dataset. For instance, let X = (a, b) represent a dataset such that a denotes the observed values of X and b the missing values, and let Y be an indicator variable taking the value 1 when a value of X is observed and 0 when it is missing. The missingness mechanism can then be expressed as a model P(Y|X, Ø), where Ø represents the parameters of the missing-data process. The appropriate mechanism for handling the missing values depends on how Y relates to the variables contained in the dataset [14].
1.4.1. Mechanisms for Computing Missing Values
Missing Completely at Random (MCAR) Technique
The MCAR mechanism characterizes the missing values through Equation (1):
P(Y | X, Ø) = P(Y | Ø)     (1)
From (1), it is evident that the probability of a value being missing does not depend on either the observed or the missing variables contained in the dataset X. Therefore, the most feasible option for
dealing with missing values based on the MCAR approach is developing an algorithm that can randomly delete
values to normalize the dataset.
Missing at Random (MAR)
The MAR technique addresses the problem of missing data based on Equation (2):
P(Y | X, Ø) = P(Y | a, Ø)     (2)
From (2), the probability that a value is missing depends only on the observed values a, a subset of X, and not on the missing values themselves. This implies that the MAR approach systematically examines the rest of the dataset to find variables that covary with the incomplete attribute, and determines the missing values based on that covariance.
Missing Not at Random (MNAR)
The MNAR technique formulates a solution for missing values in a dataset based on Equation (3):
P(Y | X, Ø) = P(Y | a, b, Ø)     (3)
From (3), it is observed that the assumptions of the MAR mechanism are violated: the probability of a value being missing depends on the unobserved values b or on other unrecorded covariates within the dataset. For instance, when applied to tax computations, MNAR captures the case where missing data values depend on unobserved revenue declarations by the taxpayers [15]. The advantage of the MNAR formulation is that it separates data that was never provided from data that was incorrectly input due to measurement error.
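To make the three mechanisms concrete, the short R sketch below (not taken from the paper; the variable names and probabilities are illustrative) simulates missingness in a variable b whose observed companion is a, so that the only difference between the three cases is what the missingness probability depends on.
set.seed(1)
n <- 100
a <- rnorm(n)                    # fully observed covariate
b <- 2 * a + rnorm(n)            # variable that will receive missing values
p_mcar <- rep(0.1, n)            # MCAR: constant probability, independent of a and b
p_mar  <- plogis(-2 + 1.5 * a)   # MAR: probability depends only on the observed a
p_mnar <- plogis(-2 + 1.5 * b)   # MNAR: probability depends on the unobserved b itself
b_mcar <- ifelse(runif(n) < p_mcar, NA, b)
b_mar  <- ifelse(runif(n) < p_mar,  NA, b)
b_mnar <- ifelse(runif(n) < p_mnar, NA, b)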
1.5. Imputation as a Solution for Missing Data Values
Imputation is a computational technique that utilizes mathematical and statistical algorithms to resolve
the problem of missing values in a dataset by replacement without interfering with the attributes and values of
the entire dataset. Imputation approaches are broadly classified into traditional and advanced categories
depending on the applied method for replacing the missing values. Under the traditional imputation technique
[16], the problem is solved through pairwise and listwise deletion of the missing values, especially under the
condition of Missing Completely at Random (MCAR). Traditional imputation entails computational
procedures such as Multiple Imputation, Maximum Likelihood Imputation, Hot Deck Imputation, Mode
Imputation, and Mean Imputation. However, the advanced imputation approach relies on computational
intelligence to learn complex interdependencies in large data sets and determine the optimal method for
handling missing values based on the observed characteristics. Computational models such as Decision Trees,
Random Forest Algorithm, and k-Nearest Neighbor are implemented in advanced imputation.
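As a point of reference for the traditional category, the sketch below applies the mean rule to numeric columns and the mode rule to categorical columns of an arbitrary data frame df; it is a minimal illustration, not the paper's implementation.
impute_traditional <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # mean rule for numeric attributes
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    } else {
      # mode rule for categorical attributes
      tab <- table(df[[col]])
      df[[col]][is.na(df[[col]])] <- names(tab)[which.max(tab)]
    }
  }
  df
}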
The k-Nearest Neighbors (k-NN) algorithm, which is among the most efficient techniques for advanced imputation, conceptualizes missing data values as regression and pattern recognition problems. k-NN algorithms classify datasets based on the memory of the observed values and attributes rather than labeled vectors. The computational processes used to replace the missing values are based on the closest observed k neighbors in the training sets and the highest number of k iterations. The study [17] suggests that the individual variable classifications define the procedures used to determine the k nearest neighbors in each instance. K-NN imputation [18] is applied in the case of incomplete systems and unknown data distributions. The technique imputes missing data values by computing a distance metric between each incomplete record and its nearest k neighbors, producing estimates even for datasets that lack an appropriate overall mean or mode. The numerical values of the missing parameters are then predicted using the mean rule, while the missing categorical variables are computed using the mode rule, making it possible to efficiently handle the problem of missing values in large datasets.
This paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the proposed methodology for implementing an imputation technique based on the R programming language as a solution for handling missing values in datasets. Section 4 discusses the results of the experimental work. Section 5 presents the conclusion, and Section 6 provides recommendations for future studies.
2. RELATED WORK
A study [19] proposed a novel K-NN model for handling missing values through a two-stage training scheme based on the training data and missing data. This approach effectively handles missing instances in
heterogeneous data sets by computing Mutual Information (MI) weights between class labels and attributes of
the dataset. This ensures the imputation of missing data values to enhance classification performance,
especially in UCI data samples with varying rates of missing values. The performance efficiency of handling
missing values in biased datasets can be determined by successive simulations of continuous traits and
segmenting response variables through imputation procedures such as complete case analysis [14]. The model
performance is measured based on the degree of deviation/marginal error and the covariance between traits
and responses.
Findings from a research study [20] suggest that the adoption of advanced techniques for handling missing values delivers high-level performance by allowing multiple imputations on a single dataset. In one study, 20 samples were randomly selected from a Traumatic Brain Injury dataset. In 8 of the samples, one variable was deleted to create missing data, which was used to determine the technical performance efficiency of multiple imputation and single imputation methods [21]. The Multiple Imputation approach demonstrated higher effectiveness than single imputation in recovering the deleted variable, based on the comparison of estimated parameters [21].
According to [9], auto-encoder neural networks can be applied to the imputation of missing data to
innovatively predict missing values and automatically encode new files without missing values through a two-stage model. A more advanced imputation framework capable of imputing categorical and mixed continuous
variables through formal optimization, predicts missing values using mathematical models such as decision
trees, support vector machines, and closest k-neighbors [22]. The implementation of opti-impute generic
algorithm in this framework produces high-quality solutions due to the improved sampling accuracy in multiple
datasets obtained from the UCIML repository. This approach performs precise imputations of missing value
sets through predictive K-NN, mean-matching, and Bayesian techniques with low average absolute error
through cross-validated benchmarking [23].
According to [24], medical datasets with missing values may be difficult to impute as the null values
are often contained in categorical attributes hence complicating pre-processing stages. However, advanced
imputation techniques based on machine learning and decision-tree models are capable of effectively
identifying outliers and replacing the missing values through K-NN computations [25]. Outliers pose
significant risks of bias during statistical estimation procedures by increasing the likelihood of overstated or
understated decision outcomes hence the dependability of imputations techniques is a critical consideration.
According to [26], credal pattern classification models may be applied to the adaptive imputation of missing data based on the observed variables. This technique is founded on belief function theory, which implies that the missing data must be estimated before the observed datasets can be classified unambiguously and accurately, and it performs the imputation through self-organizing map (SOM) algorithms for pattern extraction. The algorithm classifies data patterns as either altered or original depending on the representation outcomes from the training classes.
3. METHODOLOGY
The objective of this section is to handle missing data values through the implementation of
imputation techniques based on IBK algorithms. The implementation entails manipulating unstructured
datasets and testing the performance efficiency and accuracy of imputation techniques in replacing the missing
values.
3.1. Dataset
A sample dataset obtained from Juba Insurance & Reinsurance Company was used in this project. An
arbitrary selection of 100 samples was prepared for imputation using the MAR, MNAR, and MCAR mechanisms to insert fictitious missing values, as shown in the table in Figure 4. The original dataset was then passed through a random data generation algorithm to model missing values across 4 categories, with the DocType attribute having an equal probability of missingness in each. The datasets were classified into 3 categories of missing rates (3%, 6%, and 10%), allowing the simulation of missing values from an attribute containing 4 classes, as shown in the table in Figure 3. A sketch of how such insertion could be coded is given after the figures.
Figure 3. The Original Dataset
Figure 4. Dataset with Missing Values
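A simple way to insert fictitious missing values at the three rates used here is sketched below. The data frame ins and the DocType column follow the paper's description; the helper function itself is an assumption about how the random insertion could be coded, and it draws the affected rows completely at random.
insert_missing <- function(data, column, rate) {
  idx <- sample(nrow(data), size = round(rate * nrow(data)))  # rows to blank out
  data[idx, column] <- NA
  data
}
ins_03 <- insert_missing(ins, "DocType", 0.03)   # 3% missing
ins_06 <- insert_missing(ins, "DocType", 0.06)   # 6% missing
ins_10 <- insert_missing(ins, "DocType", 0.10)   # 10% missing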
3.2. Imputation of the k-Nearest Neighbors (k-NN)
The next step is implementing the k-Nearest Neighbors algorithm to impute the missing data values
through pattern recognition after which non-parametric regression and classification are performed to pre-
process the dataset. The output contains data values classified into k-NN classes based on the plurality vote of the neighbors, in which objects are allocated to the closest neighbor class. Implementing algorithms such as Neighborhood Components Analysis and Large Margin k-NN allocates missing data values based on the closest neighbor classes [27]. Since imputation methods can be used to improve the classification performance of k-Nearest Neighbors, the ideal value of k is data dependent: larger values of k reduce the impact of noise on classification accuracy but produce less distinctive class boundaries. Therefore, the optimal value of k in each case can be estimated from the nearest training sets (starting from k = 1). After the imputation of missing values, an R programming model was developed to perform further classification of the missing values through the following steps, as illustrated in Figure 5:
Step 1: Organize the data into rows and columns representing records and attributes respectively
Step 2: Segment the dataset into two classes: ‘complete’ and ‘with MVs’
Step 3: Carry out normalization on the dataset
Step 4: Iteratively perform the imputation on the missing values independently
Step 5: Use K-NN algorithm to test variations between the new and original records
Step 6: Replace the missing values with the nearest attribute having the highest similarity
Step 7: Create a column with complete data values
Step 8: Apply the above procedure to impute missing values for all the records.
Figure 5. A Flowchart of the k-NN Imputation Procedure
The above proposed imputation technique utilizing the k-NN algorithm is based on adjusting the power (distance) parameters used to determine appropriate replacements for the missing data, hence the need to specify the attributes with null values and their nearest neighbors; a simplified sketch of this procedure is given below.
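The following is a simplified sketch of the eight-step procedure for a single categorical target column: the numeric predictors are normalized, the k nearest complete records are found for each incomplete record, and the missing entry is replaced by the neighbors' majority class (the mode rule). It assumes the numeric predictor columns are themselves complete, and it is illustrative rather than the paper's exact code.
knn_impute <- function(data, target, k = 5) {
  num_cols <- names(data)[sapply(data, is.numeric)]      # Step 1: predictor attributes
  scaled   <- scale(data[num_cols])                      # Step 3: normalization
  complete <- which(!is.na(data[[target]]))              # Step 2: 'complete' records
  missing  <- which(is.na(data[[target]]))               # Step 2: records 'with MVs'
  for (i in missing) {                                   # Step 4: impute each record in turn
    diff <- scaled[complete, , drop = FALSE] -
            matrix(scaled[i, ], nrow = length(complete), ncol = length(num_cols), byrow = TRUE)
    d   <- sqrt(rowSums(diff^2))                         # Step 5: distance to complete records
    nn  <- complete[order(d)[seq_len(k)]]                # k nearest neighbors
    tab <- table(data[[target]][nn])
    data[[target]][i] <- names(tab)[which.max(tab)]      # Step 6: mode rule replacement
  }
  data                                                   # Steps 7-8: completed column returned
}
Applied to the prepared dataset, a call such as ins_imputed <- knn_impute(ins_03, "DocType", k = 5) would return a completed copy of the DocType column.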
3.3. Programming Languages
The implementation of imputation technique for missing values in data sets requires Java and R
programming languages, which are used for statistical modelling and mathematical computation of relational
patterns among the variables.
3.4. Experimental Procedures
The first step of implementing the R model for imputation is preparing the work directory to read and
save the data file through the following commands:
setwd("~/R implementation")
ins <- read.csv("missing enc.csv")
The original dataset contains 100 instances, 7 attributes, and the null values shown in Figure 6. Since
DocType is recorded as an incomplete parameter with randomly missing values, the following R code is run to restructure the dataset:
Figure 6. Original Dataset Structure
The original dataset is then read and the missing DocType entries are set to NA to allow for the detection of null values using R code, as shown in the table in Figure 7:
Figure 7. The Dataset After Replacing Missing values with NA
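The exact recoding code appears only as a figure in the paper; one plausible form, assuming the missing DocType entries were exported as empty strings, is:
ins$DocType[ins$DocType == ""] <- NA   # treat empty strings in DocType as missing
str(ins)                               # inspect the restructured dataset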
A user-defined function is implemented to analyze missing values in the incomplete dataset using the
R-code shown in Figure 8;
Figure 8. The R-code
The R output for this command is shown in Figure 9;
Figure 9. The R Output
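The user-defined function is likewise only shown as a figure; a plausible equivalent that produces a per-column count of missing values is sketched below (the function name is an assumption).
count_missing <- function(data) {
  sapply(data, function(x) sum(is.na(x)))   # number of NA entries in each column
}
count_missing(ins)                           # DocType should report the 19 missing entries
mean(is.na(ins$DocType))                     # proportion of missing values in DocType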
Further analysis of the incomplete dataset using R-packages for pattern identification produced the
following output shown in Figure 10;
Figure 10. Further Analysis Using R-packages for Pattern Identification
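The text does not name the R packages used for the pattern identification; a common choice that produces a 1/0 pattern table like the one in Figure 10 is md.pattern from the mice package, assumed here.
library(mice)                  # assumed package for missing-data pattern tables
md.pattern(ins, plot = FALSE)  # rows give observed (1) / missing (0) patterns and their counts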
From Figure 10, it is observed that the observed and missing values are represented as the binary values 1 and 0, respectively. The number of observations made from the data file is shown in the first column, while the total number of variables with incomplete data is shown in the last column. A plot of the missing and complete data values is shown in Figure 11:
Figure 11. Complete and Incomplete Datasets
From the graph, it is observed that the initial output shows 81 samples with no missing values and 19
samples with missing values, which are further analyzed using the aggregate plot function in R as shown in the
Histogram (Figure 12).
Figure 12. Complete and Incomplete Datasets
From the histogram, it is observed that the dataset contained 19% missing values (shown in red) and 81% complete values (shown in blue) in the DocType attribute.
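The aggregation plot in Figure 12 resembles the output of the aggr function from the VIM package, which is assumed here; it reports the proportion of missing values per variable and the combinations in which they occur.
library(VIM)   # assumed package for the aggregation plot
aggr(ins, col = c("skyblue", "red"), numbers = TRUE, sortVars = TRUE)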
4. RESULTS AND DISCUSSION
The imputation technique based on the k-NN algorithm and its IBk implementation in the R programming language can effectively compute missing values in a dataset. The proposed technique applied multiple computational and statistical methods to the imputation approach for handling missing values that had been artificially created in a sample dataset. An insurance dataset was obtained and input into RStudio with the variables (Third-party, comprehensive, marine, and Fire + stolen) clearly defined. Running a series of pattern extraction and analytical functions in RStudio detected missing values from the dataset with 89.5% classification accuracy, as summarized in Figure 13:
Figure 13. Summary results of the incomplete dataset after imputation
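One hedged way to compute an accuracy figure of this kind, using the helpers sketched earlier, is to compare the imputed DocType values against the original values at the artificially masked positions. This illustrates the idea only; it is not necessarily how the reported 89.5% figure was produced.
masked   <- which(is.na(ins_03$DocType) & !is.na(ins$DocType))  # positions masked earlier
imputed  <- knn_impute(ins_03, "DocType", k = 5)                # impute the masked copy
accuracy <- mean(imputed$DocType[masked] == ins$DocType[masked])
accuracy                                                        # fraction recovered correctly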
The findings from this study are consistent with the results from similar studies [28], [29] which
showed that K-Nearest Neighbor (KNN) technique is a superior classification algorithm with high accuracy
when applied to the imputation of missing values in a dataset. The relative accuracy of this technique is based on factors such as weighted estimation, feature relevance to the specific datasets under study, and predictive detection of missing values based on a series of statistical analyses. Findings from a related study show that KNN-based
imputation of missing values in a dataset delivers 90% accuracy in the classification of missing values and
86.27% performance in the replacement of numerical values in relatively less computational time and error
compared to other techniques [30]. Imputation of missing values using KNN algorithm attains superior
performance in the experimental replacement of numerical values due to its unique ability to classify the
missing parameters and assign cluster ratios for each type, unlike other techniques that perform replacement in
whole datasets based on the normalized computation of mean absolute errors and root mean square error. A
study [31] observes that imputation based on computational and statistical models is recommended by scientists
due to its unique ability to determine the missing values by averaging a summarized likelihood function of the
entire dataset over a mathematically defined predictive distribution with considerably high precision.
The implementation of reverse data mining using the IBK classification algorithm has been effectively
demonstrated as a reliable solution for handling missing data values. The pre-processing implementation
functions replace missing values by computing the nearest neighbors with the highest similarity index. This
approach recreates the dataset with missing values into a complete dataset with accurately imputed variables and
attributes for use in data analytics and decision support.
RStudio increases the performance accuracy of the imputation technique by replacing the missing values in a dataset with several independently imputed values rather than a single randomly imputed value [32]. This reflects possible uncertainties that may arise due to technical errors with the imputation model. For
instance, if a regression model is applied to imputing the missing values, it is desired that imputations reflect
both sampling variability and uncertainties regarding the regression coefficients utilized in the model.
Independent modelling of the coefficients makes it possible to create a new set of imputed values for each
instance based on the coefficient distribution through multiple imputations. Therefore, RStudio made it possible to run standard analyses of the datasets with missing values to generate inferences, which are then combined to determine the most appropriate value of the missing data; a minimal sketch of this multiple-imputation workflow appears below.
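The package performing the multiple imputation is not named in the paper; a minimal sketch with the widely used mice package, assumed here, creates several independently imputed copies and extracts one completed dataset.
library(mice)                         # assumed multiple-imputation package
imp <- mice(ins, m = 5, seed = 123)   # five independently imputed datasets
completed <- complete(imp, 1)         # one completed copy for further analysis
# inferences from analyses run on each imputed copy can be pooled with pool()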
5. CONCLUSION
In conclusion, the objective of this study was to discuss and implement the imputation technique based
on R-programming language as a solution for handling missing values in data sets. The problem of missing
data may arise due to input or measurement errors, especially in our daily interactions with technology hence
impacting the quality of analytical insights from such data. The implementation of k-NN imputation method
based on the IBK classification algorithm proved a reliable approach to replacing missing data values with the
value of nearest attributes showing the highest similarity.
The experimental results also showed that pre-processing by imputation delivers high-level
performance efficiency in handling missing data values. These findings are consistent with the key idea and objective of the paper, which is to explore alternative imputation techniques for handling missing values to
improve the accuracy and reliability of decision insights extracted from datasets.
6. FUTURE WORK
While this paper presents an important knowledge framework for the future of data analytics and data-
driven decision-making, more research is needed to refine the imputation mechanisms for replacing missing
values to eliminate noise and inconsistencies, especially in the massive data generated from the Internet of
Things. Future work should focus on the development of automated imputation techniques based on machine
learning and artificial intelligence for improved efficiency in data pre-processing and analytics.
Since the dataset used in this study is very small, an extended study on a larger, commonly used benchmark dataset would be useful, especially for comparing results with other methods.
REFERENCES
[1] A. Nikitas, K. Michalakopoulou, E. T. Njoya and D. Karampatzakis, "Artificial Intelligence, Transport and the Smart
City: Definitions and Dimensions of a New Mobility Era," Sustainability, vol. 12, no. 7, p. 2789, 2020.
[2] G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning, with Applications in R,
vol. 2, New York, NY: Springer, 2021.
[3] Z. ÇETİNKAYA and F. HORASAN, "Decision Trees in Large Data Sets," International Journal of Engineering
Research and Development, vol. 13, no. 1, pp. 140-151, 2021.
[4] S. Chuprov, I. Viksnin, I. Kim, T. Melnikov, L. Reznik and I. Khokhlov, "Improving Knowledge Based Detection of
Soft Attacks Against Autonomous Vehicles with Reputation, Trust and Data Quality Service Models," in 2021 IEEE
International Conference on Smart Data Services, Chicago, IL, USA, 2021.
[5] S. Pei, H. Chen, F. Nie, R. Wang and X. Li, "Centerless Clustering: An Efficient Variant of K-means Based on K-
NN Graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[6] C. Fu, C. Xu, M. Xue, W. Liu and S. Yang, "Data-driven decision making based on evidential reasoning approach
and machine learning algorithms," Applied Soft Computing, vol. 110, no. 15, 2021.
[7] V. Singh and V. D. Kaushik, "Concepts of Data Mining and Process Mining," in Process Mining Techniques for
Pattern Recognition, CRC Press, 2022.
[8] C. Yuan and H. Yang, "Research on K-Value Selection Method of K-Means Clustering Algorithm," J, vol. 2, no. 2,
pp. 226-235, 2019.
[9] S. J. Choudhury and N. R. Pal, "Imputation of missing data with neural networks for classification," Knowledge-
Based Systems, vol. 182, 2019.
[10] B. C. Wesolowski, "Data Preprocessing and Data Manipulation," in From Data to Decisions in Music Education
Research, 1 ed., 2022.
[11] M. Umair, F. Majeed, M. Shoaib, M. Q. Saleem, M. S. Adrees, A. E. Karrar, S. Khurram, M. Shafiq and J.-G. Choi,
"Main Path Analysis to Filter Unbiased Literature," Intelligent Automation and Soft Computing, vol. 32, no. 2, pp.
1179-1194, 2022.
[12] N. Whitmore, "Data cleaning," in R for Conservation and Development Projects, Chapman and Hall/CRC, 2020.
[13] E. W. Steyerberg, "Missing Values," in Clinical Prediction Models. Statistics for Biology and Health, Cham,
Springer, 2019.
[14] T. F. Johnson, N. J. B. Isaac, A. Paviolo and M. González-Suárez, "Handling Missing Values in Trait Data," Global
Ecology and Biogeography, vol. 30, no. 1, pp. 51-62, 2020.
[15] C. Bonander and U. Strömberg, "Methods to handle missing values and missing individuals," European Journal of
Epidemiology, vol. 34, no. 1, 2019.
[16] A. E. Karrar, "Investigate the Ensemble Model by Intelligence Analysis to Improve the Accuracy of the Classification
Data in the Diagnostic and Treatment Interventions for Prostate Cancer," International Journal of Advanced
Computer Science and Applications, vol. 13, no. 1, pp. 181-188, 2022.
[17] Z. Hu and D. Du, "A new analytical framework for missing data imputation and classification with uncertainty:
Missing data imputation and heart failure readmission prediction," PLoS ONE, vol. 15, no. 9, pp. 1-15, 2020.
[18] A. K.S., R. Ramanathan and M. Jayakumar, "Impact of K-NN imputation Technique on Performance of Deep
Learning based DFL Algorithm," in 2021 Sixth International Conference on Wireless Communications, Signal
Processing and Networking, Chennai, India, 2021.
[19] A. Choudhury and M. R. Kosorok, "Missing Data Imputation for Classification Problems," arXiv, 2020.
[20] A. Yadav, A. Dubey, A. Rasool and N. Khare, "Data Mining Based Imputation Techniques to Handle Missing Values
in Gene Expressed Dataset," International Journal of Engineering Trends and Technology, vol. 69, no. 9, pp. 242-
250, 2021.
[21] A. E. Karrar, "A Novel Approach for Semi Supervised Clustering Algorithm," International Journal of Advanced
Trends in Computer Science and Engineering, vol. 6, no. 1, pp. 1-7, 2017.
[22] A. Orfanoudaki, A. Giannoutsou, S. Hashim, D. Bertsimas and R. C. Hagberg, "Machine learning models for mitral
valve replacement: A comparative analysis with the Society of Thoracic Surgeons risk score," Journal of Cardiac
Surgery, vol. 37, no. 1, pp. 18-28, 2022.
[23] D. Bertsimas, A. Orfanoudaki and C. Pawlowski, "Imputation of clinical covariates in time series," Machine
Learning, vol. 110, no. 1, pp. 185-248, 2021.
[24] B. M. Bai, N. Mangathayaru, B. P. Rani and S. Aljawarneh, "Mathura (MBI) - A Novel Imputation Measure for
Imputation of Missing Values in Medical Datasets," Recent Advances in Computer Science and Communications,
vol. 14, no. 5, pp. 1358-1369, 2021.
[25] D. Lee and K. Shin, "Robust Factorization of Real-world Tensor Streams with Patterns, Missing Values, and
Outliers," in 2021 IEEE 37th International Conference on Data Engineering, Chania, Greece, 2021.
[26] T. Siswantining, T. Anwar, D. Sarwinda and H. S. Al-Ash, "A Novel Centroid Initialization in Missing Value
Imputation towards Mixed Datasets.," Communications in Mathematical Biology and Neuroscience, vol. 2021, 2021.
[27] F. Yin and F. Shi, "A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model
to a Cluster Architecture," International Journal of Parallel Programming, vol. 50, no. 11, pp. 27-64, 2022.
[28] R. Pan, T. Yang, J. Cao, K. Lu and Z. Zhang, "Missing data imputation by K nearest neighbours based on grey
relational structure and mutual information," Applied Intelligence, vol. 43, no. 3, pp. 614-632, 2015.
[29] P. Keerin and T. Boongoen, "Improved KNN Imputation for Missing Values in Gene Expression Data," Computers,
Materials and Continua, vol. 70, no. 2, pp. 4009-4025, 2022.
[30] K. M. Fouad, M. M. Ismail, A. T. Azar and M. M. Arafa, "Advanced methods for missing values imputation based
on similarity learning," PeerJ Computer Science, vol. 7, 2021.
[31] M. Pampaka, G. Hutcheson and J. Williams, "Handling missing data: analysis of a challenging data set using multiple
imputation," International Journal of Research & Method in Education, vol. 39, no. 1, pp. 19-37, 2014
[32] J. C. Jakobsen, C. Gluud, J. Wetterslev and P. Winkel, "When and how should multiple imputation be used for
handling missing data in randomised clinical trials - A practical guide with flowcharts," BMC Medical Research
Methodology, vol. 17, pp. 162-171, 2017.
... These findings are consistent with other studies that have shown that imputation methods, including kNN, are reliable methods for missing values estimation and can help improve the classification performance [14] [35] [36] . A study realized by [36] randomly inserted missing values in an existing dataset to evaluate kNN imputation accuracy and obtained a 89.5% accuracy rate using this method. ...
... These findings are consistent with other studies that have shown that imputation methods, including kNN, are reliable methods for missing values estimation and can help improve the classification performance [14] [35] [36] . A study realized by [36] randomly inserted missing values in an existing dataset to evaluate kNN imputation accuracy and obtained a 89.5% accuracy rate using this method. Reference [14] compared different imputation techniques in a breast cancer dataset and kNN presented the highest accuracy averages to 4 out of 7 classifiers analyzed when compared to other imputation methods. ...
Conference Paper
Missing values and class imbalance are issues frequently found in databases from real-world scenarios, including cancer classification. Impacts on the performance of Machine Learning (ML) models can be observed if these issues are not properly addressed prior to the analysis. In this paper, a combined solution with missing data imputation using kNN and cluster-based undersampling using k-means is proposed, focusing on pancreatic cancer classification. Different data subsets were generated by combining different preprocessing methods and the performance was analyzed using a ML analysis pipeline from a previous study. This pipeline implements ten ML classifiers, including Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Network (ANN). All data subsets presented a significant improvement (p<0.05 with Students T-Test) in the performance of most ML algorithms when compared with the results obtained when the pipeline was first evaluated. Results suggest that kNN and k-means can be used in the data preprocessing phase to overcome missing values and class imbalance issues and improve the classification accuracy.
... Data preparation is a critical stage when working with problematic datasets in AR environments. In the proposed approach of HYNNA-PBM, we handle three ways of preprocessing: data cleaning, outliers removing, and taking missing values [47] [48] and [49]. Data cleaning will involve locating and removing anomalies in the first step. ...
Article
Full-text available
To improve how digital media art is delivered and experienced, this study proposed a novel way to combine the unique capabilities of 6G networks with the effective Hybrid Neural Network Augmented Physics-based Models (HYNNA-PBM). This research presents a system where HYNNA-PBM plays an essential role in augmenting AR applications’ realism, interaction, and involvement by using the exceptional speed, bandwidth, and low latency of 6G technologies combined with advanced network physics principles. The proposed method solves the essential difficulties in producing incredibly realistic AR environments, which include dynamic light rendering, real-time physical interactions and complex environmental simulations by effectively combining neural networks with physics-based models. This method improves the user experience with digital media art by allowing the smooth combination of virtual elements with the real world, significantly increasing AR simulations’ accuracy and efficiency. In addition to entertainment, this technology provides creative applications in design, education, and cultural preservation by allowing absorbing, cooperative experiences previously unavailable in education and protecting culture. The system’s ability to provide highly complete experiences is shown by experimental results displayed through digital media artworks. This highlights the innovative possibilities of combining HYNNA-PBM with 6G technology in AR applications. With the borders between the virtual and the real-world loss, this research opens up new opportunities for digital art and entertainment that provide never-before-seen levels of involvement and interest.
... This step involves resolving inconsistencies, such as differing units of measurement and temporal or spatial resolutions. Removing or correcting errors, such as missing values, outliers, and duplicates (Karrar et al., 2022). Techniques such as imputation, interpolation, and anomaly detection can be used to address these issues. ...
Article
Full-text available
The integration of Artificial Intelligence (AI) in sustainable accounting represents a transformative approach to enhancing the accuracy, efficiency, and comprehensiveness of environmental impact assessment and reporting. This paper explores the development of AI-driven models aimed at advancing sustainable accounting practices, focusing on environmental impact assessment and transparent reporting. AI technologies, particularly machine learning (ML) and natural language processing (NLP), play a pivotal role in automating and refining data collection, analysis, and reporting processes. These technologies enable the processing of vast amounts of heterogeneous data from multiple sources, including IoT sensors, satellite imagery, and corporate disclosures. By leveraging ML algorithms, organizations can identify patterns, predict trends, and assess the environmental impact of their operations with unprecedented precision. One of the key advantages of AI in sustainable accounting is its ability to enhance data accuracy and reliability. Traditional methods often suffer from manual errors and inconsistencies. AI models, however, can continuously learn and adapt, improving their accuracy over time. For instance, predictive analytics can forecast future environmental impacts based on historical data, allowing companies to implement proactive measures to mitigate adverse effects. Furthermore, AI facilitates real-time monitoring and reporting. IoT devices equipped with environmental sensors can stream data to AI systems, which process and analyze the information instantaneously. This capability is crucial for timely reporting and compliance with environmental regulations. Real-time data analytics also empower organizations to make informed decisions swiftly, optimizing their sustainability strategies and reducing their ecological footprint. Another significant contribution of AI is in enhancing transparency and accountability in environmental reporting. NLP algorithms can analyze and interpret regulatory texts, corporate reports, and public records, ensuring that organizations adhere to sustainability standards and guidelines. Additionally, AI can automate the generation of comprehensive and comprehensible sustainability reports, making them accessible to a broader audience, including stakeholders and regulators. Developing robust AI models for sustainable accounting involves several critical steps. Initially, data preprocessing is essential to clean and harmonize diverse datasets, ensuring quality input for AI algorithms. Next, model training and validation are conducted using historical and real-time data to refine predictive capabilities. Continuous model evaluation and adjustment are necessary to maintain accuracy and relevance in dynamic environmental contexts. Collaboration between AI experts, environmental scientists, and accounting professionals is paramount in this development process. Interdisciplinary teams can ensure that AI models are not only technically sound but also aligned with environmental science principles and accounting standards. This collaboration also fosters innovation, leading to the development of more sophisticated tools for environmental impact assessment and reporting. The adoption of AI-driven sustainable accounting models offers numerous benefits, including enhanced efficiency, accuracy, and compliance. However, challenges such as data privacy, algorithmic transparency, and the need for substantial initial investments must be addressed. 
Future research should focus on overcoming these obstacles and exploring the potential of emerging AI technologies, such as deep learning and blockchain, to further revolutionize sustainable accounting practices. AI holds significant promise for transforming sustainable accounting by improving environmental impact assessment and reporting. Through advanced data analytics, real-time monitoring, and enhanced transparency, AI can help organizations achieve their sustainability goals, ensuring a more sustainable future. The continuous development and refinement of AI models, supported by interdisciplinary collaboration, are essential for realizing these benefits and addressing the complex challenges of environmental sustainability. Keywords: Sustainable Accounting, Environmental Impact Assessment, AI, Developing Models, Reporting.
... The utilization of the weighted averaging technique in enhanced KNN is based on the assumption that assigning weights to instances that are more similar results in imputations that more accurately depict the fundamental patterns in the data. The incorporation of similarity measurements in the imputation process has been emphasized in (21) and the authors suggested that weighted estimation has significant impact in increasing the efficacy of imputation process (22). In the study (23), the authors have used Multivariate Imputation by Chained Equation (MICE) technique to mitigate the issue of missing data. ...
... Data pre-processing is an essential phase in the data analysis procedure, as it prepares raw data for subsequent processing and analysis. One common challenge in data pre-processing involves dealing with missing data (13) . K-Nearest Neighbors (KNN) imputation is one such technique for handling missing values. ...
... Validation aims to assess the filtered data's completeness and accuracy. Whereas imputation seeks to correct errors and enter missing values, either manually or automatically [16]. In this study, data preprocessing was carried out to transform the observed data into fuzzy sets. ...
Article
Full-text available
Diseases and pests of lowland rice are one of the factors that can cause a decrease in rice yields. Therefore, it is necessary to have a diagnostic system to identify diseases and pests of paddy rice from an early age based on damage symptoms. The process of diagnosis requires expertise, knowledge, and experience from experts. Therefore, this research tries to build an expert system that can diagnose diseases and pests of paddy rice early by applying the Fuzzy Inference System Takagi with the Sugeno method. Fuzzy Inference System Takagi forms fuzzy sets using implication functions (rules). Rule composition is obtained from a data set of relationships between regulations, where the affirmation (defuzzification) and input from defuzzification is a constant or linear equation. The Sugeno method is used to diagnose diseases and pests of rice plants based on the symptoms experienced. This research aims to help plant pest control officers diagnose diseases and pests of paddy rice plants from the symptoms that attack the rice. The testing technique used is system accuracy testing and Mean Opinion Score (MOS) testing. The MOS test was carried out by involving 30 respondents consisting of 10 farmers and 20 extension workers, where 4.27 was obtained on a scale of 5 which was categorized into a good system. while testing the accuracy obtained from testing the system on two experts on diseases and pests of Madura paddy rice plants in 30 different cases has resulted in an accuracy rate of 86.66%. The expert system built in this study was able to diagnose 13 diseases and pests of Madura paddy based on the knowledge of two experts on 38 symptoms, and the plan was feasible to use and categorized into a good system.
... b. Pre-processing data The data pre-processing steps in this study are divided into two parts: (a) preprocessing with mean imputation is a method to handle missing data in the dataset (Karrar, 2022). In this process, the missing values in a feature (column) will be replaced with the mean value of that feature. ...
Article
Full-text available
This research focuses on implementing the Random Forest and Grid Search algorithms for the early detection of diabetes mellitus, aiming to modernize and enhance medical practices using technology. The proposed model achieved an accuracy of 77.06%, a precision of 71.43%, a recall of 47.30%, and a misclassification error of 22.94%. Comparative analysis with other data mining algorithms, including Decision Tree, Random Forest without Grid Search, and Cat Boost, demonstrated that the Random Forest with Grid Search algorithm outperformed the others. By utilizing Grid Search, the accuracy of the Random Forest algorithm increased by 2.03%. These findings indicate the potential effectiveness of machine learning in early diabetes detection. While the research offers promising results, there are limitations in terms of the dataset size and the number of detection variables used. Future studies should explore larger datasets and alternative algorithms to further enhance accuracy and aid in the early detection of diabetes mellitus.
Chapter
The banking industry performs credit score analysis as an efficient credit risk assessment method to determine a customer’s creditworthiness. In the banking industry, machine learning could be used for a variety of uses involving data analysis. A method of data analysis that is capable of self-regulation has been made possible by the development of modern techniques, such as classification approaches. The classification method is a form of supervised learning in which the computer acquires knowledge from the provided input data and then utilizes it to classify the dataset, which is used for training purposes. This study presents a comparative analysis of the various machine learning algorithms that are utilized to evaluate credit risk. The methods are used by utilizing the German Credit dataset that was collected from Kaggle, which consists of 1,000 instances and 11 attributes, all of which are used to determine if transactions are good or bad. The findings of data analysis using Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, K-Nearest Neighbors Classifier, Decision Tree Classifier, Support Vector Machines, and Random Forest are compared and contrasted in this study. The findings demonstrated that the Random Forest algorithm forecasted credit risk effectively.KeywordsCredit RiskBankingMachine LearningPredictionFeatures
Chapter
The quality of healthcare outcomes is influenced by the reliability and accuracy of computational analysis techniques applied to clinical data. Advanced classification systems improve the precision and speed of medical diagnosis, providing critical decision insights for doctors and resource optimization. Classification techniques based on machine learning have been applied as effective and reliable non-surgical techniques for the efficacious diagnosis and treatment of heart disease patients. This paper demonstrates the application of the Mixed Feature Creation (MFC) approach to classify a heart-disease dataset from UCI Cleveland using techniques such as the Recursive Feature Elimination with Random Forest (RFE-RF) feature selection and Least Absolute Shrinkage and Selection Operator (LASSO). Parameters from each technique are optimized through grid-search and cross-validation methods. Further, classifier performance models are used to determine the classification techniques’ F1-scores, precision, sensitivity, specificity, and accuracy based on independent measures such as RMSE and execution time. The findings suggest that ML-driven classification algorism can be used to develop reliable predictive models for the accurate diagnosis of heart diseases.
Article
Full-text available
Class imbalance problem become greatest issue in data mining, imbalanced data appears in daily application, especially in the health care. This research aims at investigating the application of ensemble model by intelligence analysis to improving the classification accuracy of imbalanced data sets on prostate cancer. The primary requirements obtained for this study included the datasets, relevant tools for pre-processing to identify the missing values, models for attribute selection and cross validation, data resembling framework, and intelligent algorithms for base classification. Additionally, the ensemble model and meta-learning algorithms were acquired in preparation for performance evaluation by embedding feature selecting capabilities into the classification model. The experimental results led to the conclusion that the application of ensemble learning algorithm on resampled data sets provides highly accurate classification results on single classifier J48. The study further suggests that gain ratio and ranker techniques are highly effective for attribute selection in the analysis of prostate cancer data. The lowest error rate and optimal performance accuracy in the classification of imbalanced prostate cancer data is achieved using when Adaboost algorithm is combined with single classifier J48.
Article
Full-text available
Citations are references used by researchers to recognize the contributions of researchers in their articles. Citations can be used to discover hidden patterns in the research domain, and can also be used to perform various analyses in data mining. Citation analysis is a quantitative method to identify knowledge dissemination and influence papers in any research area. Citation analysis involves multiple techniques. One of the most commonly used techniques is Main Path Analysis (MPA). According to the specific use of MPA, it has evolved into various variants. Currently, MPA is carried out in different domains, but deep learning in the field of remote sensing has not yet been considered. In this paper, we have used three centrality attributes which are Degree, Betweenness and Close-ness centrality to automatically identify important papers by applying clustering method based on machine learning (i.e., K-means). In addition, the main path is drawn from important papers and compared with existing manual methods. In order to conduct experiments, a data set from Web of Science (WOS) has been established, which contains 538 papers in the field of deep learning. Compared with existing works, our method provides the most relevant papers on the main path.
Article
Full-text available
Background: Current Society of Thoracic Surgeons (STS) risk models for predicting outcomes of mitral valve surgery (MVS) assume a linear and cumulative impact of variables. We evaluated postoperative MVS outcomes and designed mortality and morbidity risk calculators to supplement the STS risk score. Methods: Data from the STS Adult Cardiac Surgery Database for MVS was used from 2008 to 2017. The data included 383,550 procedures and 89 variables. Machine learning (ML) algorithms were employed to train models to predict postoperative outcomes for MVS patients. Each model's discrimination and calibration performance were validated using unseen data against the STS risk score. Results: Comprehensive mortality and morbidity risk assessment scores were derived from a training set of 287,662 observations. The area under the curve (AUC) for mortality ranged from 0.77 to 0.83, leading to a 3% increase in predictive accuracy compared to the STS score. Logistic Regression and eXtreme Gradient Boosting achieved the highest AUC for prolonged ventilation (0.82) and deep sternal wound infection (0.78 and 0.77) respectively. EXtreme Gradient Boosting performed the best with an AUC of 0.815 for renal failure. For permanent stroke prediction all models performed similarly with an AUC around 0.67. The ML models led to improved calibration performance for mortality, prolonged ventilation, and renal failure, especially in cases of reconstruction/repair and replacement surgery. Conclusions: The proposed risk models complement existing STS models in predicting mortality, prolonged ventilation, and renal failure, allowing healthcare providers to more accurately assess a patient's risk of morbidity and mortality when undergoing MVS.
Article
Real-world data analysis and processing with data mining techniques frequently encounter observations that contain missing values, and the existence of missing values is the main challenge in mining datasets. Missing values in a dataset should be imputed to improve the accuracy and performance of data mining methods. Existing techniques use the k-nearest neighbors algorithm to impute missing values, but determining an appropriate k value can be a challenging task. Other existing imputation techniques are based on hard clustering algorithms; when records are not well separated, as in the case of missing data, hard clustering often provides a poor description tool. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method called KI is initially proposed, incorporating the k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through record similarity by using the k-nearest neighbors algorithm (kNN), and to improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records by using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method called FCKI is then proposed as an extension of KI; it integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because records can belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors, applying two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed using fifteen datasets of different sizes with varying missing ratios for three types of missing data (MCAR, MAR, and MNAR), all generated in this work. The proposed imputation techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
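Both ingredients of the KI method, neighbour-based and iterative imputation, exist as off-the-shelf R routines; the sketch below applies them separately rather than as the authors' hybrid, and the data frame incomplete and the value k = 5 are assumptions.

    # Illustrative sketch of the two building blocks combined by KI:
    # neighbour-based (kNN) imputation and iterative (chained-equations) imputation.
    library(VIM)    # kNN()
    library(mice)   # mice() / complete()

    knn_filled  <- kNN(incomplete, k = 5, imp_var = FALSE)        # kNN step
    iter_filled <- complete(mice(incomplete, m = 1, method = "pmm",
                                 printFlag = FALSE))              # iterative step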
Article
Although many clustering models have been proposed recently, k-means and the family of spectral clustering methods still draw a great deal of attention due to their simplicity and efficacy. We first review the unified framework of k-means and graph cut models, and then propose a clustering method called k-sums, in which a k-nearest neighbor (k-NN) graph is adopted. The main idea of k-sums is to directly minimize the sum of the distances between points in the same cluster. To deal with the situation where the graph is unavailable, we propose k-sums-x, which takes features as input. The computational and memory overheads of k-sums are both O(nk), indicating that it scales linearly with respect to the number of objects to group; moreover, the computational and memory costs are irrelevant to the product of the number of points and clusters. The computational and memory complexity of k-sums-x are both linear with respect to the number of points. To validate the advantages of k-sums and k-sums-x on facial datasets, extensive experiments were conducted on 10 synthetic datasets and 17 benchmark datasets. While having a low time complexity, k-sums achieves performance comparable with several state-of-the-art clustering methods.
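As a concrete reading of the objective described above, the sketch below evaluates the within-cluster pairwise-distance sum for a given assignment; it paraphrases the stated idea only and is not the paper's algorithm or its optimizer.

    # Illustrative sketch: sum of pairwise distances between points in the same cluster,
    # the quantity k-sums is described as minimizing (evaluation only, not the solver).
    k_sums_objective <- function(X, cluster) {
      D <- as.matrix(dist(X))                    # all pairwise Euclidean distances
      sum(sapply(unique(cluster), function(c) {
        idx <- which(cluster == c)
        sum(D[idx, idx]) / 2                     # each pair counted once
      }))
    }

    k_sums_objective(iris[, 1:4], kmeans(iris[, 1:4], centers = 3)$cluster)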
Chapter
In this book chapter, the authors illustrate the basic concepts of data mining and process mining in the field of education. Educational data mining (EDM) plays a very important role in educational institutions, where vast amounts of data must be handled to infer useful information. This is beneficial in many respects, such as assessing students' performance, forecasting expected results for upcoming semester exams from the available results of previous semesters, and determining undergraduates' tendencies toward pursuing further higher education at a university or institution. Many educational data mining strategies are available for supporting such educational decisions and helping institutions grow. This chapter discusses some of these strategies and the underlying basic concepts.
Article
The problem of missing values has long been studied by researchers working in data science and bioinformatics, especially in the analysis of gene expression data that facilitates early detection of cancer. Many attempts show improvements made by excluding samples with missing information from the analysis process, while others have tried to fill the gaps with possible values. While the former is simple, the latter safeguards against information loss. For that purpose, a neighbour-based (KNN) approach has proven more effective than other global estimators. This paper extends the idea further by introducing a new summarization method to the KNN model. It is the first study that applies the concept of the ordered weighted averaging (OWA) operator to such a problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbour-based models. Using different ratios of missing values from 1%–20% and a set of six published gene expression datasets, the experimental results suggest that the new methods usually provide more accurate estimates than the compared methods. Specific to missing rates of 5% and 20%, the best NRMSE scores averaged across datasets are 0.65 and 0.69, while the highest measures obtained by the existing techniques included in this study are 0.80 and 0.84, respectively.
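A minimal R sketch of the ordered-weighted-averaging idea in a KNN imputation setting; the linearly decreasing weights, the Euclidean distance, and the gene-by-sample matrix layout are assumptions made for illustration, not the weighting schemes proposed in the paper.

    # Illustrative sketch: impute expr[gene, sample] from the k most similar genes,
    # aggregating their values with an ordered weighted average (OWA) instead of a plain mean.
    owa_impute <- function(expr, gene, sample, k = 5) {
      cand <- which(!is.na(expr[, sample]) & seq_len(nrow(expr)) != gene)  # donor genes
      obs  <- setdiff(which(!is.na(expr[gene, ])), sample)                 # observed samples
      d    <- apply(expr[cand, obs, drop = FALSE], 1,
                    function(r) sqrt(sum((r - expr[gene, obs])^2, na.rm = TRUE)))
      near <- expr[cand[order(d)[1:k]], sample]        # values of the k closest genes
      w    <- rev(seq_len(k)) / sum(seq_len(k))        # linearly decreasing OWA weights
      sum(w * sort(near, decreasing = TRUE))           # ordered weighted average
    }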