Variable Reduction-based Prediction through Modified Genetic Algorithm

Abstract: Owing to the widespread use of prediction models across different sectors of society, many researchers have employed hybrid algorithms to increase prediction accuracy. The literature suggests that Genetic Algorithms (GAs) can substantially improve the performance of other prediction models, which motivates this study. This paper introduces a new approach to prediction that integrates a GA, using the novel Inversed Bi-segmented Average Crossover (IBAX) operator paired with a rank-based selection function, with the KNN algorithm. Seventy percent (70%) of the data from 597 records of student-respondents in the evaluation of faculty instructional performance from the four State Universities and Colleges (SUC) in Caraga Region, Philippines were used as the training set, while the remaining 30% were used for testing. The simulation results showed that the proposed prediction model with the modified GA outperformed the KNN prediction model in which a GA with average crossover and roulette wheel selection was used. KNN with a k value of three (3) was identified as the optimal model, with a prediction accuracy of 95.53%, compared with KNN with k values of 1, 5, and 7.


I. INTRODUCTION
Data Mining (DM) is the process of extracting implicit information or knowledge from databases [1]-[3]. It draws on the field of statistics [4] and uses mathematical and machine learning techniques and algorithms [5]. Knowledge Discovery in Databases (KDD), a term closely associated with data mining [6], denotes the overall process of knowledge discovery, in which knowledge is the result of data-driven discovery, while data mining is the step in that process that performs efficient automated discovery using diverse approaches to DM analysis [7].
Data mining has become standard practice in various disciplines such as business, finance, and marketing, and has consequently influenced the social sciences and humanities in general [8]. Its application has also reached other sectors such as education [9], [10] and healthcare [11], [12]. DM is promising for research in engineering, biomedical sciences, medical systems, the web, sports, and the share market because of the accessibility of vast datasets [13], [14].
Several major data mining functions are widely accepted in the literature, such as association, classification, clustering, estimation, and prediction [15]. Prediction, as one of the optimal data mining approaches, was defined by [16] as "a powerful tool in the process of planning that can provide the decision maker with a prediction about the future events according to using experiences and applying statistical, mathematical, or computational methods." It is commonly used in educational data mining (EDM) [17]-[19], crime mining [20]-[22], business and finance [23], [24], health [25], [26], and more.
Data preprocessing is one of the essential methods in data mining. It enhances the quality of data and improves the precision and accuracy of a prediction model [22], [27]. Data reduction, an important data preprocessing technique in DM, is performed by selecting and removing the unneeded attributes in the dataset [28]. Reducing the training set or variables while retaining the most representative data is advisable. The goal is to obtain nearly the same outcome or data-driven output [29]-[31]. Reducing the size of the dataset helps maximize accuracy [28], [32]. One of the widely used data reduction methods is the Genetic Algorithm [33], which was introduced by J.H. Holland. The average crossover, one of the crossover operators of GA, is modified in this study.
Owing to the strong influence of prediction methods in diverse fields such as weather and natural calamities, stock markets, telecommunication, transport organization, energy, the economy, and other sectors [16], researchers have employed models that integrate algorithms for prediction, hybridizing algorithms and combining different techniques to raise the accuracy of a prediction model. For instance, the study of [34] employed a hybrid feature selection method integrating Weight by Relief and GA to select the best features in the dataset for myocardial infarction prediction using J48. An accuracy of 82.67% was obtained after applying the model to the imbalanced dataset. Also, the study of [35] used the K-Means segmentation technique and the C4.5 algorithm to build a prediction model for customer loyalty in a multimedia service provider. Integrating K-Means with C4.5 raised prediction accuracy to 79.33%, from the 69.23% obtained with the C4.5 algorithm alone.
Lastly, the prediction model of [36] used the K-Nearest Neighbor (KNN) algorithm to predict the standard levels of OTOP (One Tambon One Product) wood handicraft products. Results showed that the model achieved its best prediction at an accuracy of 87.73%. The KNN algorithm, however, is susceptible to noise and sensitive to irrelevant features [37]. Although the prediction rate using KNN is already acceptable, combining it with a genetic algorithm for variable reduction to address this weakness of KNN is expected to increase accuracy further.
When two or more models are combined, an increase in prediction accuracy is evident [38], as in [34], which obtained 82.67%, and [35], which obtained 79.33%, after hybridization, compared with prediction using one algorithm alone.
Therefore, this study aims not only to modify the genetic algorithm and introduce a new crossover mating scheme but also to increase the accuracy of the prediction model of [36], which used the KNN algorithm, by integrating the modified genetic algorithm for feature selection and variable minimization before prediction. The rest of the paper is arranged as follows: Section II reviews the literature on the genetic algorithm and other prediction models. Section III presents the design and methodology used in the study. Section IV presents the results and discussion, while Section V highlights the conclusion and recommendation.

II. LITERATURE REVIEW
A. Genetic Algorithm
The genetic algorithm is one of many evolutionary algorithms anchored on biological adaptation in the quest for global optimization. GA is one of the best-known techniques for searching for the optimal solution to problems with a large search space. GA produces and controls a population of individuals by applying suitable operators in its three fundamental operations, namely selection, crossover, and mutation. In this study, a modified genetic algorithm that integrates the novel Inversed Bi-segmented Average Crossover (IBAX) is used. This novel crossover is a modified version of the traditional average crossover of GA.

B. Genetic Algorithm-based Prediction Models
The literature suggests that the genetic algorithm can efficiently increase the performance of other prediction models [39], [40]. The most significant benefit of the genetic algorithm is its ability to avoid being trapped in local optima, and the use of GA or a hybrid GA makes it possible to select the most appropriate objective functions freely [41].
A recent study used a genetic algorithm to improve the accuracy of the self-organizing map (SOM), a type of unsupervised ANN, in predicting robotic manipulation failures for force-sensitive tasks. The proposed hybrid GA-SOM model exhibited higher prediction accuracy and improved on the predictive capability of the SOM algorithm used alone [42]. Moreover, an evolutionary technique, the genetic algorithm, was used to enhance ANN, SVM-Linear (L), SVM-Polynomial (P), SVM-Radial Basis Function (RBF), and CART in predicting the shape of carbon black reinforced rubbers. With the genetic algorithm, the prediction accuracy of each model increased, and the most accurate was the GA-ANN hybrid model, compared with the GA-CART, GA-SVM-L, GA-SVM-P, and GA-SVM-RBF models [43]. Another study used the hybrid GA-BP neural network to predict the long-term skid resistance of epoxy asphalt mixture. The GA-BP model produced highly accurate results when tested using the training set, validation set, and test set [44]. Meanwhile, the genetic algorithm, the Levenberg-Marquardt (LM) algorithm, and the backpropagation neural network were applied to fault prediction of drying furnace equipment. The hybrid GA-LM-BP model showed higher prediction accuracy than both the BP neural network and GA-BP neural network models [45].
Further, the hybrid genetic algorithm-based least squares-support vector machine (GA-LS-SVM), genetic algorithm-based back propagation neural network (GA-BPNN), and genetic algorithm-based random forest (GA-RF) were employed in identifying the topographical origin of extra-virgin olive oils. The simulation results showed that GA-LS-SVM obtained the highest prediction accuracy among the feature selection methods, compared with the GA-BPNN and GA-RF models [46]. To further demonstrate the strength of GA as a variable minimization algorithm, the genetic algorithm was used to perform feature selection, with the extracted features taken as input to a random forest (RF) classifier for a cardiovascular diagnostic problem. The outcome shows that the GA-RF model obtained the highest prediction accuracy rate when compared with other feature selection algorithms [47].
Lastly, an artificial neural network (ANN) and a genetic algorithm-based ANN (GA-ANN) were proposed and evaluated to predict air overpressure from blasting operations in a granite quarry site in Penang, Malaysia. Simulation results showed that the GA-ANN model predicted air overpressure better than the ANN algorithm alone [39]. The indexed GA-based prediction models are shown in Table I.

III. DESIGN AND METHODOLOGY
A. Modified Genetic Algorithm for Variable Reduction
To achieve the purpose of the study, the average crossover, one of the crossover operators in the genetic algorithm, as shown in Fig. 1, is modified. The modified crossover is called the Inversed Bi-segmented Average Crossover (IBAX), as depicted in Fig. 2. A rank-based selection function was used in the simulation process. To realize the IBAX operator, the following steps are executed:
Step 1: Take the parents from the selection pool.
Step 2: Count the number of genes in the chromosomes and identify whether the count is odd or even.
Step 3: Segment the chromosomes (X and Y) by dividing the total number of genes into two, making sure that the first and second segments contain an equal number of genes.
Step 4: On the first segment, create offspring Z for each gene by inversely pairing the first gene of chromosome X with the last gene of chromosome Y. Repeat until the last gene of chromosome X and the first gene of chromosome Y have been inversely mated and have produced an offspring, computed as the average of the two paired genes.
Step 5: Execute the same process on the second segment until genes from all segments have produced offspring. If the number of genes is odd, the last genes of the chromosomes are not combined in the second segment and are instead mated with each other to produce offspring.
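The IBAX steps can be sketched in a few lines of Python. This is only one possible reading of Steps 1 to 5, not the authors' implementation: it assumes that the inverse pairing is done within each segment, that each offspring gene is the arithmetic mean of the paired parent genes, and that with an odd gene count the final gene of X is averaged with the final gene of Y. The function name `ibax_crossover` is hypothetical.

```python
from typing import List, Sequence


def ibax_crossover(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Illustrative inversed bi-segmented average crossover (assumed reading)."""
    if len(x) != len(y):
        raise ValueError("parents must have the same number of genes")
    n = len(x)
    half = n // 2  # genes per segment; an odd leftover gene is handled last
    offspring: List[float] = []
    for start in (0, half):  # first segment, then second segment
        end = start + half
        for offset in range(half):
            gene_x = x[start + offset]    # walk X forwards through the segment
            gene_y = y[end - 1 - offset]  # walk Y backwards through the segment
            offspring.append((gene_x + gene_y) / 2)  # assumed averaging rule
    if n % 2 == 1:
        # Odd gene count: the last genes are mated with each other directly.
        offspring.append((x[-1] + y[-1]) / 2)
    return offspring


if __name__ == "__main__":
    print(ibax_crossover([1, 2, 3, 4], [10, 20, 30, 40]))
```

With four genes per parent, the first gene of X is paired with the second gene of Y and vice versa in the first segment, and the same inversion is repeated for the third and fourth genes in the second segment.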

B. K-Nearest Neighbor Algorithm
Another recognized data mining algorithm for classification and prediction, introduced by Fix and Hodges, is the k-Nearest Neighbor (KNN) algorithm. This method adopts instance-based learning for prediction. The classifier is known as a nonparametric algorithm since it makes no assumptions about the distribution of the input data; it is therefore widely used in various applications [48], [49]. The KNN algorithm is simple and can be implemented through the following steps:
Step 1: Assign the value of k, the number of nearest neighbors of an instance.
Step 2: Calculate the Euclidean distance from the instance to each training instance.
Step 3: Choose the k neighboring instances that have the lowest Euclidean distance.
A prediction model using the KNN algorithm was utilized in the study of [36], along with many other studies found in the literature.
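The three KNN steps can be illustrated with a short, self-contained Python sketch. It is not the WEKA implementation used in the study; the function name `knn_predict`, the toy data, and the final majority vote over the k nearest labels (the usual way a class is assigned in KNN, though not spelled out in the steps) are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 3) -> str:
    """Predict a label for `query` from its k nearest training instances."""
    # Step 2: Euclidean distance from the query to every training instance.
    distances = [(math.dist(features, query), label) for features, label in train]
    # Step 3: keep the k instances with the lowest distance.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Assumed final step: majority vote among the k nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-feature toy data, not taken from the study's dataset.
    toy_train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
                 ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high")]
    print(knn_predict(toy_train, (1.1, 1.0), k=3))
```

An odd k such as 1, 3, 5, or 7 avoids tied votes when there are two classes, which may be one reason those values are commonly tried.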

C. Enhanced KNN Prediction Model
The study evaluated the accuracy of the prediction model of [36] when integrated with a GA having the average crossover (AX) operator and with the modified GA having the IBAX operator, using k values of 1, 3, 5, and 7. The Waikato Environment for Knowledge Analysis (WEKA) version 3.8.2 was used to simulate the KNN prediction model. The simulation results of the existing and enhanced prediction models were compared to check the improvement in the accuracy of the prediction model. The conceptual framework of the study is presented in Fig. 3.

D. Datasets
The dataset used in this study consists of the 597 records of student-respondents in the evaluation of faculty instructional performance from the four State Universities and Colleges (SUC) in Caraga Region, Philippines. The thirty (30) variables that represent faculty instructional performance (IP), divided into six (6) parts, viz., methodology, classroom management, student discipline, assessment of learning, student-teacher relationship, and peer relationship, are reduced before prediction to help maximize accuracy. Seventy percent (70%) of the data were used as the training set, while the remaining 30% were used for testing.

E. Prediction Evaluation
An optimal model is selected once the model with the highest prediction rate is identified, provided that the model also has the lowest root mean squared error and mean absolute error values. Many forecasting and prediction models in the literature are evaluated using various forecast error statistics. The following tools are used along with Precision, Recall, and F-Measure: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
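For reference, the two error statistics named here can be computed as in the sketch below, using their standard textbook definitions: MAE is the mean of the absolute differences between actual and predicted values, and RMSE is the square root of the mean of the squared differences. The function names and the sample values are illustrative assumptions, not output from the study.

```python
import math
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE: average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE: square root of the average squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formulas.
    actual = [1.0, 0.0, 1.0, 1.0]
    predicted = [1.0, 1.0, 1.0, 0.0]
    print(mean_absolute_error(actual, predicted))
    print(root_mean_squared_error(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE, so the two values together give a fuller picture than either alone.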

IV. RESULTS AND DISCUSSION
A. Variable Minimization using GA with AX and IBAX Operators
The genetic algorithm was simulated for ten generations using the existing traditional average crossover and roulette wheel selection function. To generate new offspring from the two chromosomes (IP and Y), the average crossover was used, in which the average of the two chromosomes/parents is calculated. The new fitness values are then calculated from the new offspring produced by the crossover function. Variables with the lowest fitness value were removed from the dataset. A sample simulation of the genetic algorithm with the original AX operator and roulette wheel selection function is presented in Table II.
First Generation: Variable C2 is removed from the chromosome since it obtained the lowest fitness value of 171396 as evident in Table II.
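The baseline procedure described above (average crossover, roulette wheel selection, and removal of the lowest-fitness variable) can be sketched as follows. The fitness function used in the study is not given in the text, so the fitness values here are hypothetical placeholders, and the helper names are assumptions. Roulette wheel selection is shown in its standard fitness-proportionate form.

```python
from typing import Dict, List, Sequence


def average_crossover(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Traditional average crossover: each offspring gene is the parents' mean."""
    if len(x) != len(y):
        raise ValueError("parents must have the same number of genes")
    return [(a + b) / 2 for a, b in zip(x, y)]


def roulette_wheel_probabilities(fitness: Dict[str, float]) -> Dict[str, float]:
    """Fitness-proportionate selection probability for each variable."""
    total = sum(fitness.values())
    if total <= 0:
        raise ValueError("total fitness must be positive")
    return {name: value / total for name, value in fitness.items()}


def lowest_fitness_variable(fitness: Dict[str, float]) -> str:
    """Name of the variable to drop in this generation."""
    return min(fitness, key=fitness.get)


if __name__ == "__main__":
    # Hypothetical fitness values, not those reported in Table II.
    toy_fitness = {"C1": 30.0, "C2": 10.0, "C3": 60.0}
    print(average_crossover([2, 4], [4, 8]))
    print(roulette_wheel_probabilities(toy_fitness))
    print(lowest_fitness_variable(toy_fitness))
```

Note that under roulette wheel selection even the weakest variable keeps a small, non-zero chance of being selected, which contrasts with the rank-based behaviour described for the modified GA.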
On the other hand, the simulation on the genetic algorithm with the novel Inversed Bi-segmented Average Crossover (IBAX) operator and rank-based selection function was done on the same dataset and number of generations.
First Generation: Variable C2 was removed from the list of variables after applying the rank-based selection. Variable C2 obtained the lowest fitness value in the rank-based selection; hence, it has no chance of being selected. Moreover, after applying the inversed bi-segmented average crossover (IBAX) operator and obtaining the fitness values of the offspring, variable C3 was removed from the chromosomes since it obtained the lowest fitness value of 224676, which does not qualify it for the next generation. Thus, in the first generation, two variables were removed from the list, as shown in Table III.
Prior to prediction, the variables were minimized using GA with the AX operator and roulette wheel selection (RWS) function, and GA with the IBAX operator and rank-based selection function, each performed for ten generations. Variable minimization using the genetic algorithm with the AX operator and RWS function reduced the number of variables after the ten generations: from 30 variables to 17, a total reduction of 43.33%. Meanwhile, variable minimization using the genetic algorithm with the proposed mating scheme, the inversed bi-segmented average crossover operator, and the rank-based selection function produced a more noticeable decrease after the ten generations: from 30 variables to 10. A total of 66.66% of the variables were removed, as depicted in Table IV.
The simulation results showed that the modified genetic algorithm with the new crossover mating scheme outperformed the genetic algorithm with average crossover in reducing variables prior to prediction. Since dropping one or more variables helps reduce dimensionality, predictions using the datasets with 17 and 10 variables were conducted using the KNN algorithm.
Meanwhile, in terms of fitness values, the proposed IBAX operator outperformed the genetic algorithm with the existing AX operator. The variables that obtained the lowest fitness value in each of the ten generations were removed. It is evident in Fig. 4 and Table V that the fitness values of the variables removed using the new crossover operator are higher than those of the variables removed using the existing average crossover. This indicates that the modified genetic algorithm increased the fitness values of the variables compared with the genetic algorithm with the traditional AX operator.
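The statement that the lowest-ranked variable has no chance of being selected is consistent with a linear ranking scheme in which the worst individual receives rank 0. The sketch below shows that scheme as an assumption; the text does not specify the exact ranking formula used, and the function name and fitness values are hypothetical.

```python
from typing import Dict


def rank_selection_probabilities(fitness: Dict[str, float]) -> Dict[str, float]:
    """Linear rank-based selection: worst gets rank 0, best gets rank n - 1.

    Selection probability is rank / sum(ranks), so the lowest-ranked
    variable has probability 0 (an assumed scheme, see lead-in).
    """
    ordered = sorted(fitness, key=fitness.get)  # worst fitness first
    total_rank = sum(range(len(ordered)))       # 0 + 1 + ... + (n - 1)
    if total_rank == 0:
        raise ValueError("need at least two variables to rank")
    return {name: rank / total_rank for rank, name in enumerate(ordered)}


if __name__ == "__main__":
    # Hypothetical fitness values, not those reported in Table III.
    toy_fitness = {"C1": 30.0, "C2": 10.0, "C3": 60.0}
    print(rank_selection_probabilities(toy_fitness))
```

Unlike fitness-proportionate selection, the probabilities here depend only on the ordering of the fitness values, not on their magnitudes, so a single very large fitness value cannot dominate the selection pool.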

B. Prediction Model Accuracy Evaluation
To evaluate the accuracy of KNN as a prediction model, thirty percent (30%) of the data were used for testing while seventy percent (70%) were used as the training set. Table VI compares the results when GA with the AX operator and roulette wheel selection function, and GA with the IBAX operator and rank-based selection function, are integrated prior to prediction using KNN. The predictive capability of the KNN algorithm was also tested without the variable reduction stage and obtained a 90.50% prediction accuracy rate with a k value of 1.
The results showed that the prediction model gained accuracy when integrated with the genetic algorithm, especially with the modified GA. The optimal model for predicting the instructional performance of the faculty in the four SUCs in the Caraga Region, Philippines is KNN with a k value of 3. The model obtained a high prediction accuracy of 95.53%. Meanwhile, the second-best model, with 94.97% accuracy, is KNN with k=5, whose MAE and RMSE values are 0.08 and 0.20, respectively.
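As a quick arithmetic check, the reported percentages are consistent with a test set of 179 instances, which is 30% of the 597 records after rounding. The counts of correctly classified instances used below (171, 170, and 162) are not reported in the text; they are inferred here as the whole numbers that reproduce the stated accuracies, and should be read as an assumption.

```python
RECORDS = 597

# 30% of the records for testing, the rest for training (rounded split, assumed).
test_size = round(RECORDS * 0.30)
train_size = RECORDS - test_size


def accuracy_percent(correct: int, total: int) -> float:
    """Prediction accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


if __name__ == "__main__":
    print(train_size, test_size)
    # Inferred correct-prediction counts (not reported in the paper).
    for label, correct in [("k=3, modified GA", 171),
                           ("k=5, modified GA", 170),
                           ("k=1, no reduction", 162)]:
        print(label, accuracy_percent(correct, test_size))
```

Under this reading, the gap between the best and second-best models corresponds to a single additional correctly classified test instance, which is worth keeping in mind when comparing the k=3 and k=5 results.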

V. CONCLUSION AND RECOMMENDATION
With the integration of the genetic algorithm, the prediction model using the KNN algorithm increased its prediction accuracy. The modified genetic algorithm with the new crossover mating scheme, called Inversed Bi-segmented Average Crossover (IBAX), showed considerably higher prediction accuracy than the genetic algorithm with average crossover and roulette wheel selection. Along with the GA-based prediction models found in the literature, the enhancement of KNN as a prediction model integrated with the modified genetic algorithm was successful and adds to the body of knowledge. Future researchers may consider using the modified GA-based KNN as a prediction model on different datasets.

TABLE I.
Model | Author(s) | Objective | Result
Hybrid GA-SOM | [42] | To enhance the performance of SOM using GA in predicting robotic failures. | The hybrid GA-SOM yielded 91.95% prediction accuracy, compared with the 84.96% of the standalone SOM algorithm.
Hybrid GA-ANN, GA-CART, GA-SVM-L, GA-SVM-P, GA-SVM-RBF prediction model | Martinez et al. (2018) | To use GA, ANN, SVM, and CART to characterize rubber blends. | The GA-ANN model exhibited the best classification accuracy of 75.75%, improving on the 74.80% accuracy attained without GA.
Genetic Algorithm-based Back Propagation (GA-BP) neural network prediction model | Zheng, Qian, Liu, & Liu (2018) | To model the skid resistance of epoxy asphalt mixture using a hybrid GA-BP neural network. | The optimized GA-BP neural network hybrid model gave an effective and accurate forecast of long-term skid resistance with 99% accuracy.
Genetic Algorithm-based Artificial Neural Network (GA-ANN) prediction model | Armaghani et al. (2016) | To enhance the prediction rate of ANN in predicting AoP from blasting operation in a granite quarry site. | The GA-ANN model obtained a coefficient of determination of 0.965, a variance account for (VAF) value of 96.380, and an RMSE of 0.049, compared with 0.857, 84.257, and 0.117, respectively, for the ANN.

TABLE II.
GENERATION 1 USING AN AX OPERATOR WITH RWS FUNCTION

TABLE IV.
VARIABLE MINIMIZATION SIMULATION RESULT FOR GENETIC ALGORITHMS WITH AX AND IBAX OPERATORS

Total Percentage of Variables Removed: 43.33% (GA with AX), 66.66% (GA with IBAX)

Fig. 4. Comparison of the Fitness Function of the Removed Variables in Every Generation.

TABLE V.
VALUE OF THE FITNESS FUNCTIONS REMOVED IN EVERY GENERATION

TABLE VI.
INDEXED KNN AND GA-BASED KNN PREDICTION MODELS