Missing Data Prediction using Correlation Genetic Algorithm and SVM Approach

Data exists in large volumes in the modern world and becomes very useful when decoded correctly to inform decisions on real-world issues. However, when the data is conflicting, obtaining information from it becomes a daunting task. Handling missing data has therefore become a very important task in big data analysis. This paper considers the handling of missing data using the Support Vector Machine (SVM), based on a technique called Correlation-Genetic Algorithm-SVM (CGA-SVM). The data is subjected to SVM classification after the correlation between attributes has been identified and a genetic algorithm has been applied. Applying correlation gives a clear view of the attributes that are highly correlated within a particular dataset. The results indicate that, compared with SVM alone, the proposed hybrid algorithm produces better outcomes in terms of identification rate and accuracy. The proposed approach is also compared with a neural network approach in terms of Mean Identification Rate; the results indicate consistent accuracy, which makes the proposed approach the better of the two.

Keywords: Missing data; Support Vector Machine (SVM); genetic algorithm; hybrid algorithm; correlation


I. INTRODUCTION
Missing data is one of the most common issues in real-world applications because it hinders timely decisions based on the acquired data. This research addresses missing data as a data preprocessing problem that can have a significant impact on the generalization performance and accuracy of classification. Many datasets suffer from the unavoidable problem of missing values, for reasons such as insufficient data in reported results, values missing from industrial experiments, or failures of automatic machines while collecting data [1]. Medical datasets contain missing data because a patient's record holds only the critical values needed, not the results of every possible test [2]. Data may be missing at two stages: at training time, in the training data, or at prediction time, while testing the data. Machine learning algorithms are mainly concerned with identifying missing values at training time, with less focus on missing values at prediction time. There are various techniques for treating missing data, including imputation techniques, ignoring techniques, and model-based techniques. The ignoring techniques include complete case analysis, in which any case that has missing data in any of the variables is omitted from the analysis. Another ignoring technique is pair-wise deletion, in which each feature is considered separately and a case is used wherever that feature is present, regardless of values missing in other fields. Treating missing data requires a thorough analysis process in which missing values are estimated without losing the statistical perspective of the dataset. These two criteria pull against each other: the information in the partially complete data must be used while, at the same time, the statistical perspective of the dataset is maintained when imputing the missing values [3]. Several techniques for handling missing data have been discussed [3], such as removing the cells that contain missing data or imputing appropriate values.
The main difference between the two approaches is that removing missing data is more suitable when only a small number of instances are affected, so that little information is lost, while imputation methods remain practical with big data and a large number of missing values. Consequently, imputation methods are an accepted approach to dealing with missing values.
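The contrast between removal and imputation can be illustrated with a minimal Python sketch. The toy records, the field names, and the choice of mean imputation are assumptions made for illustration only; they are not taken from the experiments in this paper.

```python
from statistics import mean

# Toy dataset: None marks a missing value (hypothetical data, not from the paper).
rows = [
    {"age": 25, "income": 40.0},
    {"age": None, "income": 52.0},
    {"age": 31, "income": None},
    {"age": 40, "income": 61.0},
]

def drop_incomplete(records):
    """Complete case analysis: keep only rows with no missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute_mean(records):
    """Mean imputation: fill each missing field with that column's observed mean."""
    columns = list(records[0].keys())
    means = {c: mean(r[c] for r in records if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in records
    ]

complete = drop_incomplete(rows)   # loses every row that has a gap
imputed = impute_mean(rows)        # keeps all rows, fills the gaps
print(len(complete), len(imputed))
```

Removal halves this small dataset, whereas imputation keeps every instance at the cost of introducing estimated values.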
Correlation is a technique that identifies the relationship between variables. The correlation factor helps to identify suitable relations between the variables of a particular dataset. The support vector machine (SVM) is a supervised learning technique that was initially applied to two-class classification problems. Kernel functions may also be applied, and their parameters optimized. Given a set of labeled training data, SVM produces an optimal hyperplane. In two dimensions, a hyperplane is a line that divides the plane into two parts, one for each class. SVM works well when there is a clear-cut division of the two classes in the X-Y plane. When no line clearly separates the two classes, a third axis Z is needed, and this is where kernels come in. By using tuning factors in the support vector machine and changing them according to the problem, non-linear classification can be achieved. This type of classification helps to achieve higher accuracy in a limited amount of time. The kernel plays a very important role in learning the hyperplane in SVM, as it transforms the problem using linear algebra. SVM plays a major role in text categorization, where it reduces the need for labeled training data. Image classification is also possible through SVM, which provides a higher accuracy rate than existing systems, and SVM is also used in image segmentation. SVM helps in the classification of proteins in biological science and also enables recognition of handwritten text. SVM also has a few disadvantages. The SVM algorithm does not provide probability estimates directly. The input data needs to be fully labeled. SVM applies most naturally to two-class problems, and multi-class structure needs further consideration.
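The idea of adding a third axis Z can be sketched in Python as follows. This is an illustration under stated assumptions: the toy points, the explicit feature map z = x^2 + y^2, and the hand-picked threshold are all hypothetical. A real SVM learns the separating hyperplane from data, and a kernel evaluates inner products in such a lifted space without constructing it explicitly.

```python
# Points labelled by whether they fall inside a circle: no straight line in the
# (x, y) plane separates the two classes. Hypothetical toy data.
inner = [(0.0, 0.5), (0.5, 0.0), (-0.4, -0.3), (0.2, -0.6)]   # class -1
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.4, 1.6)]    # class +1

def lift(point):
    """Explicit feature map (x, y) -> (x, y, z) with z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point, threshold=1.0):
    """In the lifted space, the plane z = threshold separates the classes.
    The threshold is chosen by hand here, not learned."""
    return 1 if lift(point)[2] > threshold else -1

predictions_inner = [classify(p) for p in inner]
predictions_outer = [classify(p) for p in outer]
print(predictions_inner, predictions_outer)
```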
The genetic algorithm consists of the basic steps of population selection, crossover, and mutation. The fitness function determines the quality of an individual, and individuals that pass the fitness test are carried over to the next generation. The genetic algorithm starts with a set of solutions and varies them over successive generations. To increase the performance of the algorithm, a random search is performed on the old data for new search items. The genetic algorithm thus allows a global search and moves toward the global optimum through the available solutions. The genetic algorithm first identifies the attributes with missing values [4]. Once such an attribute has been identified, the possible domain values for its missing entries are found, and the missing values are replaced with these domain values. The same procedure applies to all attributes with missing values. From the overall set of values obtained, a set of instances is chosen, and crossover is performed on the selected instances. The fitness function is evaluated and the instances are validated against it, which determines the classification accuracy on the decision tree [5] [6]. If a selected instance is classified correctly, the substituted values are kept; otherwise they are deleted. The process is repeated until a full set of values is obtained. This paper addresses missing data using the concept of correlation, which identifies the relations between attributes. The genetic algorithm and SVM are then applied to handle the missing data and classify the data efficiently.
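The selection, crossover, and mutation cycle described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this paper: the two attribute domains, the `targets` list, and the distance-based fitness are hypothetical stand-ins, whereas the paper scores candidates by classification accuracy.

```python
import random

# Hypothetical setup: two missing cells, each with its attribute's observed domain.
domains = [[1, 2, 3, 4, 5], [10, 20, 30, 40]]
targets = [3, 20]  # stand-in for whatever makes a fill value "good"

def fitness(chromosome):
    """Stand-in fitness (higher is better); the paper uses classification accuracy."""
    return -sum(abs(g - t) for g, t in zip(chromosome, targets))

def random_individual(rng):
    """One candidate fill value per missing cell, drawn from its domain."""
    return [rng.choice(d) for d in domains]

def crossover(a, b, rng):
    """Uniform crossover: each gene comes from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def mutate(chromosome, rng, rate=0.2):
    """With probability rate, replace a gene by a value from the same domain."""
    return [rng.choice(d) if rng.random() < rate else g
            for g, d in zip(chromosome, domains)]

def run_ga(generations=30, size=8, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: size // 2]      # selection: keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children        # parents survive unchanged
    return max(population, key=fitness), history

best, history = run_ga()
print(best, fitness(best))
```

Because the fitter half of each generation survives unchanged, the best fitness recorded in `history` never decreases from one generation to the next.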
The paper is organized as follows: Section II reviews related work on handling missing data, Section III presents the proposed work, Section IV describes the implementation, Section V presents the results and discussion, and Section VI concludes the paper.

II. RELATED WORK
Handling missing data is very important for making use of the data. Many techniques are used to optimize how data is processed and used, and optimization and machine learning algorithms are used to enhance data processing. The Genetic Algorithm (GA) [7] was used to optimize the initial weight and threshold values of a support vector machine. The proposed GA-SVM was used to forecast the CO2 emissions of Beijing [8]. The contributing factors were identified as residential growth and economic factors, and the CO2 emissions were found to be more than 0.5. Cancer data is classified using a support vector machine and genetic algorithm [9] to achieve better classification accuracy. Radial basis and polynomial kernel functions [10] are used in this technique, which is also compared with existing techniques on the basis of runtime.
In model selection using the support vector machine [11], a genetic algorithm is used. The fitness function is calculated and various kernel parameters [12] [13] are determined. The proposed model selection technique is applied to four datasets to observe whether it satisfies the criteria. The proposed estimator outperforms the alternatives, giving the best fitness criterion and yielding more models. The authors of [14] proposed a genetic algorithm for optimizing the parameters of the support vector machine. This involves image classification based on object-oriented classification. The proposed system is compared with the grid algorithm and found to be superior in terms of time and accuracy.
For identifying damage on a bridge, a support vector machine is used along with a genetic algorithm customized to obtain the best kernel parameters. The proposed GA-SVM [15] is compared with back-propagation techniques to arrive at the best technique. From the error rates of the other techniques, it is concluded that the proposed technique has a higher accuracy rate in finding the damage. The least squares SVM (LS-SVM) [16] technique is proposed, which helps to turn a complex problem into a linear regression problem. By applying a genetic algorithm to this LS-SVM [17], optimal parameters [18] [19] are obtained. The proposed system is compared with existing systems such as artificial neural networks, and the LS-SVM based system is found to perform far better.
The work in [23] presents how a classifier works when there are missing values in the data. Initially, a non-parametric technique is used for data processing, but it reduces to a simpler SVM if no missing data is present. Further, an analysis of least squares SVM [4] [24] is carried out to understand the classifier better.
The work in [1] aims to identify the missing rate in a selective manner. The proposed technique helps to achieve a good Mean Identification Rate (MIR) with less imputation. The proposed method is evaluated on this parameter to check whether the system is working properly. The work in [2] is based on a functional dependency technique combined with machine learning. The K-nearest neighbour algorithm is used to find the functional dependencies in the given data. The use of a data dictionary also yields effective results. The missing rate [4] is the parameter taken into account for evaluation.
An additive least squares technique combined with a support vector machine, which classifies data with missing values, is presented in [3]. A ten-fold cross-validation strategy is used to classify the data correctly. The strategy is verified by measuring accuracy through the mean and standard deviation values [6] for the given data. The research in [20] provides an embedding-based calculation of the missing data through a non-linear technique that binds the vector label. The proposed system is evaluated on its performance and on the time taken to train on the dataset.
In [21], missing data is found and grouped through sampling. This method helps in omitting the missing data by calculating the error, and the proposed system is evaluated on the basis of accuracy and error. An SVM-based model that does not require selection of planes is investigated in [16]. The system is evaluated for accuracy and endurance using parameters such as root mean square error and absolute error [6], which helps in the effective assessment of the proposed technique. A higher accuracy rate is provided by KNN and optimal endurance by MLP.

III. PROPOSED WORK

A. Panel Data
Panel data is data in a multidimensional format involving measurements over time. The multidimensional format represents the various attributes that together constitute a complete dataset. Time series data also falls under panel data. A dataset in which the primary data element occurs n times in a particular time series is worth investigating. A balanced panel is one in which every panel member is observed in every time interval, as represented in Equation 1.
Using the panel data, the model can be constructed as shown in Equation 3 and Equation 4, where $G_{it}$ is the component that varies with time and $\gamma_i$ is the component specific to a particular member and fixed over the time interval.
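The balanced-panel condition described above can be checked mechanically. The following Python sketch assumes a hypothetical representation in which each observation is keyed by a (member, period) pair; the names and values are illustrative only.

```python
# Hypothetical panel: (member, period) -> observed value.
panel = {
    ("A", 2019): 1.2, ("A", 2020): 1.4, ("A", 2021): 1.5,
    ("B", 2019): 0.7, ("B", 2020): 0.9, ("B", 2021): 1.0,
}

def is_balanced(observations):
    """True when every member is observed in every period that appears."""
    members = {m for m, _ in observations}
    periods = {t for _, t in observations}
    return all((m, t) in observations for m in members for t in periods)

balanced = is_balanced(panel)

# Dropping a single observation makes the panel unbalanced.
unbalanced_panel = {k: v for k, v in panel.items() if k != ("B", 2020)}
unbalanced = is_balanced(unbalanced_panel)
print(balanced, unbalanced)
```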

B. Pre-Processing
Data in reality is noisy and incomplete [25]. In preprocessing, various techniques of data cleaning, integration, and reduction are applied to make the data more consistent [26]. For the proposed CGA-SVM technique, the data cleaning process involves identifying and addressing missing values. The data values are then sorted and arranged into their respective buckets, a process called binning. Values that do not belong to any cluster are identified through outlier detection techniques. Redundant values are removed in the data aggregation process, after which the values are normalized. During data reduction, the attribute dimension (in this case, the size) is reduced and the data compressed, making it easier to handle. Missing values can be keyed in manually, but only for small datasets with few tuples. Replacing missing data with a global constant fixed for the relevant dataset is another option in such a situation. Data becomes noisy when measuring instruments are faultily calibrated. Binning is the next technique, as it helps to classify data into several buckets. Smoothing is also a data cleaning strategy, in which bin values are replaced either by the bin mean or by the closest boundary value. Values that do not fit any of the groups are termed outliers and are handled within the dataset. The pre-processing therefore starts with cleaning and proceeds through the intermediate steps to reduction.
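Binning and the two smoothing variants mentioned above can be sketched in Python as follows. The equal-frequency binning scheme, the bin size, and the sample values are assumptions chosen for illustration; the paper does not state which binning scheme was used.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's minimum and maximum.
    Ties go to the lower boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Hypothetical sample values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```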

C. Correlation
Correlation is a measure of the association between two attributes and of the nature of their relationship [27]. The correlation coefficient lies between -1 and +1. Correlation is a mathematical value that describes the relationship between one or more independent variables and a dependent variable. For example, a correlation can be a connection between two numeric variables: if the value of one variable increases, the value of the other also increases (or decreases). The potential predictive power of correlations makes them valuable, since the value of one variable can be used to predict or modify the value of the other. However, correlation does not imply causation, and a correlation does not reveal the root cause of a relationship. The correlation method is a systematic practice with roots going far back in human history. It is also used to analyse extremely large datasets correctly and efficiently, which plays a critical role in the science of the future.
There are various correlation measures, namely Pearson, Kendall, and Spearman, which are used to measure the relationship between two attributes or variables. Pearson correlation measures the degree of relationship between linearly associated variables. The variables are required to be normally distributed. It is given as the ratio of the covariance of the variables to the product of their standard deviations, as shown in Equation 5:

$$\rho_{a,b} = \frac{\operatorname{cov}(a, b)}{\sigma_a \, \sigma_b} \qquad (5)$$

where $\rho$ is the Pearson coefficient, $a$ and $b$ are the two variables, and $\sigma_a$ and $\sigma_b$ are their standard deviations.
Kendall correlation is a non-parametric measure of the dependency between two variables and is denoted tau. It analyses the relationship between two variables on the basis of the numbers of concordant and discordant pairs. Spearman correlation is rank-based: it operates on the rank values of the variables. It is denoted ρs.
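The three coefficients can be compared on a small example. The Python sketch below uses textbook definitions and is only an illustration: the sample vectors are hypothetical, and the tau-a variant of the Kendall coefficient is an assumption, since the paper does not state which variant was used.

```python
from math import sqrt

def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(a, b):
    """Pearson correlation computed on the ranks of a and b."""
    return pearson(ranks(a), ranks(b))

def kendall_tau(a, b):
    """Tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (a[i] - a[j]) * (b[i] - b[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

a = [1, 2, 3, 4, 5]
b = [1, 4, 9, 16, 25]   # increases with a, but not linearly
print(pearson(a, b), spearman(a, b), kendall_tau(a, b))
```

For this monotone but non-linear pair, the two rank-based coefficients equal 1 while the Pearson coefficient falls slightly below 1, which shows how the measures differ in what they capture.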
When the three correlation techniques are applied to both panel datasets, the outcome shows that the Pearson coefficient provides better correlation values. Once the correlations between the attributes are obtained, the data is subjected to SVM classification with the genetic algorithm applied.

D. CGA-SVM Technique
The data produced by the correlation analysis is examined with a genetic algorithm and then further analysed by the SVM technique. This allows missing values to be treated, minimising their effects.

CGA-SVM Algorithm:
• Input: Dataset with missing values and correlation details.
• Output: Classified data with addressed missing data.
1) Initialize each individual and then produce which is in accordance with X,i=1,2. . . .ln.
2) Arrive based on the signs of and
3) Calculate the fitness values for the individuals using the fitness function as follows: F(z)=j-f'( where f'( is the derivative of the objective function for optimization based on the correlation result and j is the constant for scaling the fitness function.
4) If the condition is achieved, stop. Otherwise move to step 5.
5) If the best fitness value is less than the threshold, go to the next iteration.
6) Perform selection based on F(z). Then perform crossover between the chromosomes with the same attribute values.
7) Apply leave-one-out cross-validation as a condition for SVM with GA.
8) Perform a new iteration of variable generation using steps 3-7 and exit.
In the CGA-SVM algorithm, the genetic algorithm determines the fitness value of the variables used. The fitness function is defined by evaluating the variables with the objective function. The process iterates until the best fitness value, compared against the threshold, is achieved. Mutation and crossover are also performed on the matching chromosomes. Validation is done using the leave-one-out technique, which checks that the fitness value meets the threshold and that CGA-SVM outperforms the existing techniques. The flowchart in Fig. 2 represents the correlation-based combination of genetic algorithm and SVM classification, which involves setting the SVM parameters and then finding the fitness value. Once the fitness function is evaluated and the required condition is satisfied, the SVM model is evaluated and the missing values are addressed. If the condition is not met, the genetic algorithm is applied further, with selection, crossover, and mutation, to arrive at the desired objective function.
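The leave-one-out validation used in step 7 can be sketched in Python as follows. This is an illustration under stated assumptions: a 1-nearest-neighbour rule stands in for the SVM so that the example needs no external library, and the toy samples are hypothetical.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier (1-nearest neighbour); the paper uses an SVM here."""
    distances = [sum((p - q) ** 2 for p, q in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]

def leave_one_out_accuracy(xs, ys, predict=nearest_neighbour_predict):
    """Hold out each sample once, fit on the rest, and score the prediction."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

# Hypothetical two-class toy data with well-separated clusters.
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.2), (1.2, 0.9)]
ys = [0, 0, 0, 1, 1, 1]
accuracy = leave_one_out_accuracy(xs, ys)
print(accuracy)
```

In a GA-driven search, a score of this kind could serve as the quantity compared against the threshold, with any classifier passed in through the `predict` argument.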

IV. IMPLEMENTATION
Four benchmark panel datasets [28], selected from the UCI machine learning repository, are used to support the findings of this study:

1) Ionosphere
2) Iris Plant
3) Parkinson
4) SPECTF
First, correlation is applied to identify the association among the attributes in a dataset. Based on the correlation values, GA is then applied, followed by classification with SVM to obtain meaningful data and test the results.
Table II

A. Correlation Methods
Using the Tidyverse package in R, the dataset is read and tidied so that every row and column is clear for observation. Rearranging the data into a variable is one way to visualize it, making the relationships between measure, class, and part a little clearer. Correlation values between the parts of the datasets and the corresponding measures depend on the class to which the attributes of each dataset belong. The Reshape2 package helps to reshape the benchmark dataset and to set up variables for correlating every measurement against every other. The benchmark dataset, grouped by class, is passed to the (cor list) function, which calculates the correlations using the Reshape2 package. This generates N rows of correlation coefficients, one for every measurement pair, as shown in Equation 6:

$$\text{RowsNumber} = m_{\text{classes}} \times (N_{\text{measurements}})^{2} \qquad (6)$$

The last step of the correlation analysis is the visualization of correlations between measurements, grouped by class. Fig. 3 summarizes, as a box plot, the measurements whose Pearson correlation coefficients are greater than or less than 0. Values of 0, which indicate no relation among attributes, are omitted. Fig. 4 indicates the strengths of the correlation coefficients within four different features of the datasets used.
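The row count in Equation 6 can be verified on a small example. The Python sketch below is not the R workflow used in this study: the nested-dictionary layout, the class names, and the measurement values are hypothetical, and self-pairs are included so that each class contributes the square of its number of measurements.

```python
from itertools import product
from math import sqrt

def pearson(a, b):
    """Pearson coefficient: covariance over the product of standard deviations."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

# Hypothetical data: class -> measurement name -> observed values.
data = {
    "class_a": {"m1": [1.0, 2.0, 3.0, 4.0],
                "m2": [2.0, 4.1, 5.9, 8.2],
                "m3": [4.0, 3.0, 2.5, 1.0]},
    "class_b": {"m1": [1.0, 3.0, 2.0, 5.0],
                "m2": [0.5, 1.0, 2.5, 2.0],
                "m3": [3.0, 1.0, 4.0, 2.0]},
}

def correlation_rows(grouped):
    """One row per (class, measurement, measurement) triple, self-pairs included."""
    rows = []
    for label, measurements in grouped.items():
        for first, second in product(measurements, repeat=2):
            value = pearson(measurements[first], measurements[second])
            rows.append((label, first, second, value))
    return rows

rows = correlation_rows(data)
print(len(rows))  # number of classes times the squared number of measurements
```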

B. Genetic Algorithm
The benefit of the proposed approach is its imputation approach based on the Genetic Algorithm. The input to the algorithm is a dataset with missing values that need imputation, together with the correlation values that are not equal to 0. First, the dataset is randomly imputed. This step is relevant because it produces a temporary complete dataset that later steps improve through crossover and mutation. The algorithm then repeatedly regresses each attribute with missing values on the other attributes. The genetic programming in the experiment was implemented with the Genetics package. The accuracy is calculated over 100 iterations, run independently for each benchmark dataset, and the results are summarized in Table III.

V. RESULTS AND DISCUSSION
The proposed system is SVM with correlation and GA applied. It is compared with plain SVM in terms of accuracy and error. The prediction accuracy (%) (with errors less than 10%) using SVM with correlation and GA is Accuracy1, and without correlation and GA it is Accuracy2. Fig. 6 depicts the Mean Identification Rate of the neural network approach and the proposed CGA-SVM approach. Table IV and Fig. 6 show the best accuracy achieved in the experiments after training with CGA-SVM. In terms of Mean Identification Rate, the neural network approach achieves 91% and the proposed CGA-SVM approach achieves 93%; compared with existing systems for handling missing values, the results indicate consistent accuracy, which makes the proposed approach better.
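One possible reading of "prediction accuracy (%) with errors less than 10%" is the share of predictions whose relative error stays below 10%. The Python sketch below follows that reading, which is an assumption and not a definition given in this paper; the function name and the sample values are hypothetical.

```python
def accuracy_within_tolerance(actual, predicted, tolerance=0.10):
    """Percentage of predictions whose relative error is below the tolerance.
    Assumed interpretation; the paper does not define the measure formally."""
    hits = sum(abs(p - a) < tolerance * abs(a) for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

# Hypothetical actual and predicted values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [10.5, 23.0, 29.0, 40.0]
score = accuracy_within_tolerance(actual, predicted)
print(score)
```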

VI. CONCLUSIONS
Addressing missing data in big datasets is very important. The proposed system handles missing data through a correlation technique followed by a genetic algorithm applied to a support vector machine. This variant of SVM performs well as it effectively handles the missing data. The proposed