Missing Data Imputation using Genetic Algorithm for Supervised Learning

Data is an important asset for any organization to successfully run its business. Collected data often suffers from low quality: noise, incomplete records, missing values, etc. If the quality of the data is low, then the results of any data mining algorithm will also be low. In this paper, we propose a technique to deal with missing values. A genetic algorithm (GA) is used to estimate missing values in datasets. The GA generates optimal sets of missing values, and information gain (IG) is used as the fitness function to measure the performance of an individual solution. Our goal is to impute missing values in a dataset for better classification results. The technique works even better when there is a higher rate of missing values or incomplete information, along with a greater number of distinct values in the attributes/features having missing values. We compare our proposed technique with single imputation techniques and with the statistically based multiple imputation (MI) approach, using various benchmark classification techniques and different performance measures. We show that the proposed method outperforms other state-of-the-art missing data imputation techniques.
Keywords: genetic algorithm; information gain; missing data; supervised learning


I. INTRODUCTION
Data is available in every sphere of life and is collected and used for various purposes. Processing and analysis of the collected data usually provide useful insights and knowledge about the system that produced it. The field of data mining deals with mining useful information from raw data rather than using all of the data, which also contains unimportant information. Data mining is a collection of techniques for extracting previously unknown, useful, and understandable patterns from large databases. It integrates techniques from multiple disciplines such as database technology, machine learning, statistics, pattern recognition, neural networks, image processing, and data visualization. There is always a requirement for efficient and scalable data mining algorithms, and it is a subject of ongoing research [1].
The process of data mining is to extract information from data. The first step is to extract data from the database and then perform preprocessing steps on it. Data mining techniques are then used to extract data patterns. Evaluation and presentation mean representing the knowledge in a form that is understandable to users. The result is the empowerment of users with knowledge.
There are different data mining techniques, including supervised classification, association rule mining (market basket analysis), unsupervised clustering, web data mining, and regression. One important data mining technique is classification. The objective of classification is to build one or more models based on the training data that can correctly predict the class of test objects. Problems from a wide range of domains can be cast as classification problems [1]. Classification has several important applications in our lives [2][3][4][5]. Examples include customer behavior prediction, portfolio risk management, identifying suspects, medical applications, sports, and fraud detection. This research deals mainly with data preprocessing, evaluated on the basis of the classification technique of data mining.
One of the challenging problems is to transform a huge amount of data into accessible and actionable knowledge. This knowledge is utilized by domain experts for decision making. Therefore, the core focus is on the process of knowledge discovery in databases (KDD). KDD is defined as a nontrivial process of identifying and extracting implicit, previously unknown, and potentially useful information from data [1].
The collected data may contain several common deficiencies, such as missing values, non-discretized data, inconsistency, incompleteness, and noise. If data is not of high quality, it may hinder the discovery of useful patterns later in the process. The main purpose of the preprocessing step is to enhance the quality of the data used in the experiment. All data mining techniques are applicable once the data has been preprocessed, and the objective of preprocessing is simple: data collected from the real world is dirty and needs to be cleaned. The word dirty, from a data perspective, refers to the deficiencies described above. These issues can arise for various reasons, and they are addressed within the KDD process; different techniques proposed by various researchers are described later in this paper.
In this paper, we address an important area of data preprocessing, namely missing value imputation. Missing values in a dataset mislead the learning model. We propose a new approach based on GA and IG to impute the missing values. The proposed technique has been evaluated with different classification methods. It has a higher accuracy rate and is well suited for high-dimensional search spaces with a higher rate of missing values.
The rest of the paper is organized as follows. Section 2 describes the background of missing values, section 3 presents different classification algorithms, section 4 provides a detailed description of the proposed technique, section 5 presents experimentation results, and finally section 6 concludes the paper and gives some future directions.

A. Importance of Complete Data
In data mining, the focus is on extracting useful information from large amounts of data collected from various sources and on making decisions using such data. Decisions are made on the basis of scientific, business, and economic analysis of the available data. For example, sales and other information allow businesses and investors to evaluate and make critical decisions regarding their investments and their future outcomes, whereas advances in research are based on the discovery of knowledge from various experiments and measured parameters.
During fault detection and identification, it is observed that most data is corrupt or incomplete. Predictive models that take observed data as input are used in many decision-making processes; such models do not tolerate any incompleteness in the data provided for prediction and, as a result, normally break down. In many applications, simply ignoring the incomplete record is not an option, mainly because doing so can lead to biased results in statistical modeling or even damage in machine control [6]. Most decision-making tools, such as the commonly used neural networks, support vector machines, and many other computational intelligence techniques, cannot be used for decision making if the data is not complete. For this reason, it is often essential to base the decision on the available data [7].
The challenge that missing data poses to the decision-making process is more evident in on-line applications, where data has to be used almost instantly after being obtained. The biggest challenge is that standard computational intelligence techniques are not able to process input data with missing values and hence cannot perform classification or regression. Some of the reasons for missing data are sensor failures, omitted entries in databases, and non-response to questions in questionnaires. Many techniques have been reported in the literature to estimate missing data for specific applications [7]. Missing data may also follow an observable pattern. Exploring the pattern is important and may lead to the identification of cases and variables that affect the missing data [7,8]. A proper estimation method can be derived by identifying the variables that predict the pattern.

B. Missing Data Mechanisms
Missing data randomness is divided into three classes [9]: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) [5]. Techniques for handling missing data are described next, beginning with ignoring (discarding) the data.
Two core methods are used to discard data with missing values. One is complete case analysis, also known as listwise deletion. It is available in all statistical packages and is the default method in many programs. The other method is discarding instances or attributes. In this method, the level of missingness is determined for each instance and attribute, and the instances or attributes with high levels of missing data are deleted. Prior to deleting any attribute, it is vital to evaluate its relevance to the analysis. Both complete case analysis and discarding should be applied only if the data is missing completely at random. Missing data that is not missing completely at random contains non-random elements that may bias the results [9]. Deletion can introduce significant bias into the experimentation. In addition, the reduced sample size can significantly hamper the analysis. The rule of thumb for deleting instances is, if attributes have more than 5
1) Mean-fill approach: The most common technique in missing data imputation is to compute estimates of the values and then substitute these estimates for the missing entries. The focus of our work is the estimation of values and its comparison with the proposed technique. These estimates include simple statistical calculations, i.e., mean filling, zero filling, min replacement, and max replacement.
These estimation techniques are applied to datasets with missing values, and the results are observed in the form of classifier accuracy and other output measures such as precision, recall, F-measure, and area under the ROC curve. The additional measures are calculated because, if accuracy alone is not conclusive for a classifier, they can be examined to identify the better result.
Mean-fill is one of the most common statistical estimation approaches for filling in the missing values of attributes, and it is provided by various open-source data mining toolboxes and packages. Researchers also commonly use it as a comparison technique, and in the majority of cases it provides promising results. However, it is observed that this approach does not work very well for data with a large amount of missing information.

The mean of the attribute values is taken (for numeric values; for discrete values the mode is taken), and all the missing values in that particular attribute are replaced by it; the same is done for every attribute of the dataset. The min-fill, max-fill, zero-fill, and K-nearest neighbor [10] approaches are also commonly used. K-nearest neighbors are determined on the basis of some kind of distance between points. Its biggest disadvantage is that, since it looks for the most similar instances, the whole dataset must be searched. In addition, the choice of k and of the similarity measure greatly affects the result.
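The single-value fills described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function name `fill_missing`, the use of `None` to mark a missing entry, and the strategy names are assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def fill_missing(column, strategy="mean"):
    """Replace None entries in one attribute column with a single estimate.

    `None` marking a missing value and the strategy names are assumptions
    of this sketch.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    if strategy == "mean":        # numeric attributes
        estimate = mean(observed)
    elif strategy == "mode":      # discrete attributes
        estimate = Counter(observed).most_common(1)[0][0]
    elif strategy == "min":
        estimate = min(observed)
    elif strategy == "max":
        estimate = max(observed)
    elif strategy == "zero":
        estimate = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Every missing entry of the attribute receives the same estimate.
    return [estimate if v is None else v for v in column]


ages = [20, None, 30, 40, None]
colours = ["red", None, "blue", "red"]
print(fill_missing(ages, "mean"))
print(fill_missing(colours, "mode"))
```

Because every missing entry in an attribute receives the same value, these fills add no new distinct values to the attribute, which is one reason they tend to degrade as the share of missing entries grows.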

2) Multiple imputations (MI): MI is one of the most attractive methods for general-purpose handling of missing data in multivariate analysis. Rubin [9] described MI as a three-step process: imputation, analysis, and pooling.
The most challenging step is imputation, that is, the construction of the m completed datasets. This step accounts for the process that caused the missing data. First, sets of plausible values for the missing values are created using an appropriately chosen model that reflects the uncertainty due to the missing data. Each of these sets of plausible values is used to fill in the missing values and create a completed dataset. Typical problems are:
• Missingness could be related to the value of the information (e.g., people with higher incomes tend to skip income questions more often).
• Missing entries can appear anywhere in the data.
• The method used in the imputation step must foresee the intended complete-data analysis.
The repeated analysis step on the imputed data is actually somewhat simpler than the same analysis without imputation because there is no need to deal with the missing data. Each of these datasets can be analyzed using complete-data methods.
The pooling step consists of computing the mean over the m repeated analyses, its variance, and its confidence interval or p-value, and finally combining the results. In general, these computations are relatively simple.
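As a rough illustration of the pooling step, the sketch below combines m point estimates and their within-imputation variances using the combining rules commonly attributed to Rubin (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). The text above does not spell out the formula, so the choice of these rules, the function name `pool_estimates`, and the example numbers are assumptions; the confidence interval is omitted because it additionally needs a t-distribution quantile.

```python
from statistics import mean, variance


def pool_estimates(estimates, within_variances):
    """Pool m per-dataset estimates into one estimate and its total variance.

    estimates[i]        : point estimate from the i-th completed dataset
    within_variances[i] : squared standard error of that estimate
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    pooled = mean(estimates)              # mean over the m repeated analyses
    within = mean(within_variances)       # average within-imputation variance
    between = variance(estimates)         # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 completed datasets.
pooled, total = pool_estimates([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(pooled, total)
```

The between-imputation term is what distinguishes MI from a single fill: it carries the uncertainty caused by not knowing the missing values into the final variance.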
There are various ways to generate imputations. An implementation of MI for continuous multivariate data (NORM) is available in [12]. However, it is not necessarily true that any particular method will perform better for any particular empirical study. It is well known that methods for handling nonignorable missing data require the analyst to make assumptions about the model of missingness [11]. Recent overviews of NMAR modeling are given in [13,14,15]. Selection and pattern-mixture models are used for NMAR data; these models need more statistical formulas to impute the data. If the chosen model is incorrect, then an NMAR model may perform even less well than standard MAR methods [9]. Different types of weighting methods are also used for nonignorable missing data. Even though many methods are available, researchers often cannot use them because of a lack of familiarity and computational challenges, and instead opt for ad hoc approaches that may do more harm [7].
3) Auto-associative Neural Networks: An auto-associative neural network, also referred to as an autoencoder, is a specific neural network trained to recall its inputs [19]. Given a set of inputs, the network predicts these inputs as outputs and thus has the same number of output nodes as inputs. However, the hidden layer is characterized by a bottleneck, with fewer hidden nodes than output nodes.
The smaller hidden layer projects the inputs onto a smaller space, extracting linear and non-linear interrelationships, such as covariance and correlation, from the input space, and also removes redundant information [19]. This means that such networks can be used both to recall inputs and to estimate missing data.

III. CLASSIFICATION
Data mining learning models are categorized into two types. In the first, the class to which each training sample belongs is known during the learning stage; this is called labeled training data. Predictive models are built on the basis of this supervised learning data, whereas unlabeled data is used to test the model. One example is the classification method, in which class labels are known. The other type is unsupervised learning, where the class label of the training data is unknown. Here the training data is grouped according to similarity; clustering is an example of unsupervised learning on unlabeled data.
A fundamental aim of this research work in the field of classification is to preprocess the available data and make clean data available to the classifiers, so that highly accurate models can be learned from it. Another objective is to verify the correctness of the proposed technique on the basis of classification results. Decision Tree (C4.5), PART, NB-Tree, and RIPPER are among the most common classifiers used in the field of machine learning, and they are also used in this research [23,24,25,26].

IV. PROPOSED TECHNIQUE
We have used GA with IG for the imputation of missing values. The following subsections describe the proposed technique.

A. Genetic Algorithm
GAs are based on the evolutionary ideas of natural selection and genetics [16,17]. They are adaptive heuristic search algorithms inspired by Darwin's theory of evolution and the survival of the fittest: in a competition among individuals for resources, the fittest individuals dominate the weaker ones. GAs are today one of the important parts of evolutionary computing. Among random search methods employed to solve optimization problems, GAs represent an intelligent structure that is easy to implement.
For any particular problem, a GA evolves a good solution by mimicking the processes that nature uses, such as selection, crossover, mutation, and acceptance.
1) Operators of GA: GAs use genetic operators to maintain genetic diversity. It is important that genetic diversity, or variation, is maintained for the process of evolution. The genetic operators are modeled on their counterparts in natural genetics. The following operators are used in genetic algorithms.
1) Reproduction/Selection: Usually, the first operator applied to the population is reproduction: chromosomes are selected from the population to be parents for the crossover step and to produce offspring. According to Darwin's theory of the survival of the fittest, the best ones should survive and create new offspring. The reproduction operator is also called the selection operator because it basically extracts a subset of genes from the existing population based on some quality criterion. The fitness function is the quality measurement used to select the best subset of genes, as every gene carries some meaning.
2) Crossover/Recombination: This genetic operator is called crossover because it mates (combines) two parents (chromosomes) to produce a new offspring (chromosome). The most commonly used methods for selecting parents for crossover are:
• Roulette wheel selection.
• Steady state selection.
• Tournament selection.
The idea behind crossover is that, after mating chromosomes (parents) that were selected based on some function, the offspring (chromosomes) are expected to be fitter because they inherit the best characteristics of their parents. Crossover takes place during the evolution stage according to a user-defined crossover probability.
3) Mutation: Mutation occurs during the evolution stage according to a user-defined mutation probability. This probability is usually set to a fairly low value; 0.01 is a good first choice. Mutation is the genetic operator used to maintain genetic diversity from one generation of a population of chromosomes to the next.
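The three operators can be sketched as small Python functions. This is a generic textbook-style sketch under stated assumptions, not the authors' code: chromosomes are plain lists, `fitness` is any function returning a number, `domain` is the list of admissible gene values, and the function names are chosen for illustration.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Pick k individuals at random and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)   # requires length >= 2
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chromosome, domain, rate, rng):
    """Replace each gene by a random domain value with probability `rate`."""
    return [rng.choice(domain) if rng.random() < rate else gene
            for gene in chromosome]


rng = random.Random(42)
population = [[rng.choice([0, 1]) for _ in range(6)] for _ in range(10)]
mother = tournament_select(population, sum, k=3, rng=rng)
father = tournament_select(population, sum, k=3, rng=rng)
child_a, child_b = one_point_crossover(mother, father, rng)
print(mutate(child_a, [0, 1], rate=0.01, rng=rng))
```

A larger tournament size `k` increases selection pressure, while the low mutation rate keeps most inherited genes intact and only occasionally reintroduces lost values.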

B. Proposed Technique
This section provides details of the proposed technique along with the fitness function used.
1) General Description: GA is used for missing data imputation. The importance of missing data imputation varies from problem to problem; we use this technique to clean dirty data for classification problems. The missing values in the datasets are imputed using a GA, which is run for each attribute, with the attribute treated as a chromosome. We divide these chromosomes into frames for more accurate measures; these frames are explained by the example in the following section. The number of frames depends on the number of classes in the dataset, i.e., a dataset with n class labels yields n frames.
The flow chart in figure 2 describes the working of the proposed technique. Using the attribute instances, we first create an initial population of solutions of the size defined in the parameter section and evaluate the fitness of each solution. We then check the termination criterion, which is a maximum number of generations. For generation number 1, the initial size of the new population is 0. Individuals are selected randomly from the population, according to the tournament size, using tournament selection. The genetic operator to be applied to the selected individuals is chosen probabilistically, and crossover or mutation is performed on them based on the respective probability of selection. The result of the genetic operator is inserted into the new population. The population size is checked on every iteration: if it equals the maximum population size, a new generation starts and the termination criterion is checked; otherwise, new individuals continue to be selected from the current population. When a new generation starts, if the best-fitted individual of the previous population is fitter than the individuals of the current population, we keep that individual from the previous population in the current population (elitism = keep best).
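The generational loop just described can be sketched as follows. The population size (500), number of generations (100), and tournament size (6) follow the values reported in the experimentation section; `crossover_prob` and `mutation_rate` are assumed values, the function name `run_ga` is hypothetical, and the sketch produces one child per crossover for brevity.

```python
import random


def run_ga(fitness, domain, length, pop_size=500, generations=100,
           tournament_size=6, crossover_prob=0.9, mutation_rate=0.01, seed=0):
    """Generational GA with tournament selection and elitism (keep best)."""
    rng = random.Random(seed)

    def select():
        # Tournament: sample individuals at random and keep the fittest.
        return max(rng.sample(population, tournament_size), key=fitness)

    # Initial population of random solutions.
    population = [[rng.choice(domain) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Elitism: the best individual so far is carried over unchanged.
        new_population = [best[:]]
        while len(new_population) < pop_size:
            parent = select()
            # Choose the genetic operator probabilistically.
            if rng.random() < crossover_prob:
                mate = select()
                cut = rng.randint(1, length - 1)
                child = parent[:cut] + mate[cut:]
            else:
                child = [rng.choice(domain) if rng.random() < mutation_rate
                         else gene for gene in parent]
            new_population.append(child)
        population = new_population
        best = max(population, key=fitness)
    return best


# Toy problem: maximise the number of ones in an 8-gene chromosome.
best = run_ga(sum, [0, 1], length=8, pop_size=20, generations=30,
              tournament_size=3)
print(best, sum(best))
```

Because the best individual is copied into every new generation, the best fitness can never decrease from one generation to the next.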
To illustrate how GA improves data quality by imputing missing values based on estimation, consider the following example. A chromosome is split into n frames (sub-chromosomes), where n is the number of classes in the dataset. In the first generation, each frame is initialized independently of the other frames, within a restricted range: it must contain values taken by the attribute for its specific class. Merging all frames into one gives the valid structure of a chromosome in the population.
Each frame is treated as a full-fledged independent chromosome when genetic operators are applied to it. One-point crossover is used on each frame, so n crossover points are used for every chromosome. The mutation operator randomly mutates a number of genes depending on the mutation probability. Each gene belongs to a specific frame, so during mutation a gene value is replaced by a value from the set of domain values of that frame's class in the dataset.
1) Structure of Chromosome: The data illustrated above belongs to a two-class problem, so N = 2. The number of frames will therefore be two in each chromosome, as shown in the corresponding figure.
2) Operate on genetic operators:
1) Crossover: One-point crossover is performed. The chromosome in figure 4 (d-Result of crossover) shows the result of the crossover operation performed on it in the previous step.
2) Mutation: In mutation, gene values mutate according to the domain of each frame, which is defined by the range of distinct values available in the particular data attribute. Figure 8 illustrates the outcome of the mutation operation performed on the chromosome shown in figure 4 (e-Mutation performed). After mutation is performed, the fitness of the resulting chromosome is calculated; if it is greater than the previous fitness, the values are saved and the next iteration takes place. This continues until the termination criteria are met, which can be an end condition or some value that, once achieved, terminates the GA.
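The frame structure can be made concrete with a short Python sketch. It is one possible reading of the description above, not the authors' code: a chromosome is stored as a dictionary mapping each class label to its frame, each gene fills one missing cell of the attribute for that class, `None` marks a missing cell, and all function names are hypothetical. The sketch also assumes that every class has at least one observed value for the attribute.

```python
import random


def build_frame_domains(values, labels):
    """Map each class label to the distinct observed values of the attribute."""
    domains = {}
    for value, label in zip(values, labels):
        if value is not None:
            domains.setdefault(label, set()).add(value)
    return {label: sorted(vals) for label, vals in domains.items()}


def init_chromosome(values, labels, domains, rng):
    """One frame per class; each gene is drawn from that class's domain."""
    frames = {label: [] for label in domains}
    for value, label in zip(values, labels):
        if value is None:
            frames[label].append(rng.choice(domains[label]))
    return frames


def crossover_frames(a, b, rng):
    """One-point crossover applied to every frame independently."""
    child = {}
    for label in a:
        n = len(a[label])
        cut = rng.randint(1, n - 1) if n > 1 else 0
        child[label] = a[label][:cut] + b[label][cut:]
    return child


def mutate_frames(chromosome, domains, rate, rng):
    """Mutated genes are redrawn only from their own frame's domain."""
    return {label: [rng.choice(domains[label]) if rng.random() < rate else g
                    for g in genes]
            for label, genes in chromosome.items()}


def decode(values, labels, chromosome):
    """Merge the frames back into a complete attribute column."""
    cursors = {label: iter(genes) for label, genes in chromosome.items()}
    return [next(cursors[label]) if value is None else value
            for value, label in zip(values, labels)]


rng = random.Random(7)
attribute = [1, None, 2, None, 7, None, 9]
classes = ["a", "a", "a", "b", "b", "b", "b"]
domains = build_frame_domains(attribute, classes)
p1 = init_chromosome(attribute, classes, domains, rng)
p2 = init_chromosome(attribute, classes, domains, rng)
child = mutate_frames(crossover_frames(p1, p2, rng), domains, 0.1, rng)
print(decode(attribute, classes, child))
```

Restricting each frame to its own class domain means that crossover and mutation can never assign a value to a missing cell that was not observed for that class.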

C. Fitness function
In the proposed GA for missing data imputation, we use IG as the fitness function, which is based on the entropy of each attribute with respect to its class label in the given dataset. Note that when the fitness of an attribute is calculated, the whole attribute, after imputing the missing values, is used for the calculation.
The following is a brief description of the fitness function used in the proposed work.
IG is a correlation-based measure. It is based on the information-theoretic concept of entropy, i.e., a measure of the uncertainty of a random variable. The entropy of X is given by eq. (1):
$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \qquad (1)$$
The entropy of X when the value of another random variable Y is known is the conditional entropy, given by eq. (2):
$$H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) \qquad (2)$$
In the above equations, $P(x_i)$ is the prior probability for all the values of X, and $P(x_i \mid y_j)$ is the posterior probability of X after the values of Y are known. The amount by which the entropy of X decreases reflects the decrease in uncertainty, which is achieved through the additional information about X provided by Y. This measure is called IG. The formula for information gain is given by eq. (3):
$$IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (3)$$
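A minimal Python sketch of this fitness computation is given below, using the standard definitions: entropy H(X), conditional entropy H(X|Y), and information gain IG(X|Y) = H(X) − H(X|Y), all with base-2 logarithms. The function names are illustrative, and the example data are made up. Since information gain is symmetric in X and Y, it does not matter which of the two arguments is the class label and which is the imputed attribute.

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(X) = -sum_i P(x_i) log2 P(x_i), estimated from value frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(xs, ys):
    """H(X|Y) = sum_j P(y_j) * H(X | Y = y_j)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


# Fitness of one fully imputed attribute with respect to the class labels.
class_labels = ["yes", "yes", "no", "no"]
imputed_attribute = ["high", "high", "low", "low"]
print(information_gain(class_labels, imputed_attribute))
```

In a GA run, a candidate chromosome would first be merged into the attribute column, and the information gain of the completed column with respect to the class labels would then serve as its fitness.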

A. Experimentation Framework
The population size is set to 500, the number of generations to 100, and the tournament size to 6. These parameter settings were chosen after performing several experiments. All the other parameters used are defined in table 1. Experimentation was performed with different combinations of these parameters, and the best values, shown in table 1, were kept the same for all the experiments.
In the experimentation, the worth of missing data imputation through GA is evaluated on five key measures: predictive accuracy, precision, recall, F-measure, and area under the ROC curve.
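For reference, the first four measures can be computed from the counts of a binary confusion matrix as in the sketch below. The function name and the example labels are made up for illustration; the area under the ROC curve is left out because it requires classifier scores rather than hard predictions.

```python
def binary_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("no predictions to evaluate")
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(actual, predicted))
```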
Table 2 describes the datasets used for experimentation, along with the total number of features, the number of instances, and the percentage of missing values in each dataset. All the datasets are publicly available and taken from the UCI repository [31]. We used the standard implementation of MI, which is available as NORM [12], and classifiers such as NB-Tree, PART, JRip, J48, Naive Bayes, and K-nearest neighbor. The implementation of these algorithms is provided by the data mining software Weka [30]. All the algorithms are used with their default values, and no tweaking is done over the methods. Since these algorithms are implemented by their authors, it is assumed that suitable parameter settings are already incorporated.
The techniques used for comparison with the proposed method are multiple imputation, mean fill, min fill, max fill, and zero fill. Table 3 shows a comparison of classification accuracies and their standard deviations after imputation by the various techniques, including the proposed GA fill.

B. Comparison with other techniques
For comparison, four single imputation techniques have been adopted: filling missing values using the mean, the min, the max, and zero (replacing all missing data by 0). Multiple imputation (MI) of missing data is also used for comparison. The results are compared for the NB-Tree, JRip, PART, Naive Bayes, IBk (lazy), and J48 (C4.5) classifiers. For the experimentation with these classifiers, we used the Weka machine learning tool [30]. We used the supervised discretization filter of Weka-3.4 [30] to discretize continuous attributes as a preprocessing step. The GA has seven user-defined parameters, whose values are given in table 1. The predictive accuracies of the compared algorithms are shown in tables 3 and 4. Ten-fold cross-validation is used to obtain the results. Bold values represent the highest accuracy achieved.
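The ten-fold protocol that produces a mean accuracy and its standard deviation can be sketched as below. This is a plain, unstratified split written for illustration and is not a description of Weka's internals; the function names and the `evaluate(train, test)` callback, which is expected to train a classifier on the training indices and return its accuracy on the test indices, are assumptions of the sketch.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(n, evaluate, k=10, seed=0):
    """Return the mean and sample standard deviation of the k fold scores."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n, k, seed)]
    avg = sum(scores) / k
    std = (sum((s - avg) ** 2 for s in scores) / (k - 1)) ** 0.5
    return avg, std


# Dummy evaluation: the score is simply the share of even indices in the fold.
avg, std = cross_validate(
    50, lambda train, test: sum(i % 2 == 0 for i in test) / len(test))
print(avg, std)
```

Each imputation technique would be applied to the dataset before this loop, so that all techniques are compared on the same folds.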
From tables 3 and 4, it is observed that missing data imputation using GA clearly outperforms the other estimation and predictive model techniques on 70% of the datasets. These predictive accuracies show the worth of the proposed approach. A genetic algorithm is an evolutionary algorithm and introduces much diversity when imputing missing values.
Our method performs better on three datasets for the Naive Bayes algorithm; similarly, it achieves better accuracies on three datasets for the K-nearest neighbor classifier and on four datasets for the J48 classifier.
Table 5 presents the precision results of the proposed algorithm and of the single and MI techniques. Different classifiers are used for classification with 10-fold cross-validation. Bold values represent the highest value achieved.
In table 5, our method performs comparably or better on 70% of the datasets. It can be observed that the proposed method achieves better or comparable classification precision compared with the single and MI techniques in most of the cases.
Table 6 shows the recall measures of the proposed algorithm and of the single and MI techniques. Our method performs comparably or better on most of the datasets.
Table 7 presents the F-measure of the proposed algorithm and of the single and MI techniques. The proposed approach also has better F-measure values on most of the datasets.
Table 8 presents the area under ROC of the proposed algorithm and of the single and MI techniques. The area under ROC of the proposed approach is high on most of the datasets.
These experimental results on different datasets are evaluated with different benchmark evaluation methods. They show the worth of the proposed approach when compared with well-known missing data imputation algorithms and indicate that GA is a suitable method for the imputation of missing values.

VI. CONCLUSION AND FUTURE WORK
Data mining is an active area of research, and in this area data is the most vital and valuable asset. Without applying automatic data mining techniques and preprocessing methods, it is difficult to effectively analyze large amounts of data. Researchers are interested in finding efficient and accurate techniques that clean dirty and noisy data so that the learned models achieve a higher accuracy rate, are comprehensible, and can be learned in reasonable time, even for large databases.
In this paper, we addressed the problem of missing data imputation. First, we elaborated on the importance of clean (complete) data in KDD. We then proposed an evolutionary technique for filling in missing data on the basis of good estimation using GAs. Our main objective was to embed a population-based search mechanism that explores more of the search space along with exploitation. The datasets used are standard datasets that contain missing values by default. We also demonstrated that the proposed technique works well for datasets with a greater percentage of missing values and for datasets whose attributes have a large range of distinct values, as the GA comes into real play where there is room for more and more combinations of different values. In the future, we would like to extend our algorithm to the domain of noise reduction/removal.

Figure 4 (c-Crossover Performed) shows the crossover points on the chromosome. As a result of crossover, an offspring is created, which is shown in figure 7. The chromosome in figure 4 (d-Result of crossover) shows the result of the crossover operation.

TABLE I. PARAMETERS USED IN GA

TABLE II. DATASETS USED FOR EXPERIMENTATION

TABLE III. COMPARISON OF GA WITH DIFFERENT TECHNIQUES BASED ON CLASSIFIERS ACCURACIES ALONG WITH STANDARD DEVIATIONS

TABLE IV. COMPARISON OF GA WITH DIFFERENT TECHNIQUES BASED ON CLASSIFIERS ACCURACIES ALONG WITH STANDARD DEVIATIONS

TABLE V. COMPARISON OF GA WITH DIFFERENT TECHNIQUES BASED ON CLASSIFIERS PRECISION RESULTS

TABLE VI. COMPARISON OF GA WITH DIFFERENT TECHNIQUES BASED ON CLASSIFIERS RECALL MEASURES

TABLE VII. COMPARISON OF GA WITH DIFFERENT TECHNIQUES BASED ON CLASSIFIERS F-MEASURES

TABLE VIII. COMPARISON OF GA WITH DIFFERENT TECHNIQUES BASED ON AREA UNDER ROC