Generalized Approach 

Generalized Approach 

Source publication
Conference Paper
Full-text available
Missing values in datasets has been a continuous challenging issue in the field of data mining, data warehousing, machine learning and artificial intelligence. In this paper we introduce a toolkit for missing value imputation that brings all the famous algorithms under one window. This toolkit also facilitates the user to apply any two algorithms f...

Context in source publication

Context 1
... automation of business processes leads the computer science community to starts thinking how they can support business community to get benefit from the vast accumulated data. This also results in initiation of Business Intelligence (BI) which helps in providing latest information and used for competition analysis, market research, economical trends, consumer behavior, industry research, and geographical information analysis [1]. All these crucial analysis require quality data. Quality of data can be improved by using data cleansing, data integration and data transformation techniques [2]. If data contains missing values, noisy values and/or inconsistencies, then BI decisions will be affected from it. Our focus in this paper is on missing values which is a part of data cleansing process of data mining. Missing values in data sets are due to human mistakes, network data transfer problems, hardware malfunctioning or many others reasons [2]. Another reason for missing values with in a data set arises due to lack of response or erroneous response [3]. Wild values are also a cause for missing values. A value is called a wild value when we know for sure that the value is not correct. For example, a categorical variable having a numerical value. Punching errors or the recorder’s ignorance may be the reasons for this. The most common remedy in practice for wild values is to enter “nothing” in place of the wild value, thereby creating more missing data. Not only these, but unanswered checklists, questionnaires, skipped questions, inefficient data collection methods contribute to missing values in data sets [3, 4]. We present a missing value imputation toolkit that comprises several algorithms from the domain of classification, clustering, and association rule mining. Imputation is a class of procedures that fill in the missing values with the estimated one [5]. Focus of our system is to provide a set of existing missing value imputation techniques under one window, because different datasets are different in nature and hence require different type of algorithms for its pre-processing. Naïve Bayesian algorithm takes more imputation time on large datasets as compare to small datasets due to probability calculation for each attribute. We also provide the facility to data miner to use the combination of any two algorithms for handling missing values. Selection of algorithms are based on the nature of dataset and desire of data miner that what they require either better accuracy or quick imputation or combination of both. In system overview we explain one of the combinations of algorithms comprising of Frequent Pattern Growth [2] and K-Nearest Neighbor [7]. Starting with the toolkit overview firstly we load the data and after that we miss the values at random from the dataset. As shown in the Figure 1, the factors that are used to insert the missing values are; Minimum Value, Interval, and Repeat. Minimum value shows the minimum amount of missing values inserted in the data. Interval shows the gap by which the missing values will be inserted and repeat shows the number of times the process of missing values will be placed in the data. The purpose to miss the data is to impute the values and check the accuracy of the executed algorithm. On the other hand if the dataset already contains missing values then we just impute the missing values in the data with the selected algorithm. In this case we are unable to check the predicted accuracy. Liu et al. [6] discusses handling missing values using decision tree. Batista et al. [7] proposes solution for imputing missing values using KNN for different values of K. Shariq et al. elaborates a solution for Apriori full match and Apriori partial matching [8,9].We present the toolkit for missing value imputation that comprises of several techniques and combination of different approaches. We also proposed frequent pattern mining algorithm for imputation of missing values in combination with KNN algorithm. The block diagram shown in Figure 2 explains five steps which are executed in sequence. In step 3, shown by dotted line rectangle, we show the 7 different algorithms used in this toolkit. During the process either one or a combination of two algorithms is used at one time. Figure 3 shows adaptive part of our proposed system in which we combine an association rule mining algorithm along with a clustering algorithm to get better accuracy. In series we apply FP algorithm to get partially imputed missing values dataset and after that we apply the KNN algorithm to impute rest of the missing values. Input to FP algorithm is the datasets along with minimum support value. FP will generate the frequent patterns from dataset. After that FP tree is built from the generated frequent patterns. We use FP tree for the imputation of missing values. The way FP tree impute the missing values is elaborated in pseudo code. If the dataset still contains missing values more than a certain threshold or if data miner desires, this partially imputed dataset is passed to KNN algorithm. This will impute the remaining missing values. The idea to use two algorithms is to increase both the prediction accuracy and decrease the prediction time, because normally datasets involved in data mining contains huge amount of data so prediction time along with accuracy will be the major ...

Similar publications

Article
Full-text available
Objectives: This paper aims to evaluate the performance of the machine learning classifiers and identify the most suitable classifier for classifying sentiment value. The term “sentiment value” in this study is referring to the polarity (positive, negative or neutral) of the text. Methods/Analysis: This work applies machine learning classifiers fro...
Conference Paper
Full-text available
Traditional recommender systems utilize user and item profiles in order to predict ratings of unseen items. New users, items and ratings are continuously updated to the system, making data available for detection of changes in user preferences throughout the time. In this work the widely used user-neighborhood recommender system is extended by inco...