Hybrid Feature Selection and Ensemble Learning Methods for Gene Selection and Cancer Classification

Cancer classification based on gene expression data is a promising research field in bioinformatics and data mining. Not all genes contribute to efficient sample classification, and redundancy in gene expression data leads to low classification performance. A robust feature selection method is therefore needed to identify the genes that help distinguish samples efficiently. This paper presents combinations of gene selection and classification methods based on ranking and wrapper methods. In the ranking step, information gain was used to reduce the dimensionality to the top 1% and 5% of the genes. In the wrapper step, K-nearest neighbors and Naïve Bayes were then used with Best First, Greedy Stepwise, and Rank Search. Several combinations were investigated because no single model is known to give the best results on all datasets under all circumstances. Combining multiple feature selection methods and applying different classification models could therefore provide a better decision on the final predicted cancer types. Compared with existing classifiers, the proposed combined gene selection methods obtained comparable performance.
Keywords: Microarray; gene selection; ensemble classification; cancer classification; gene expression


I. INTRODUCTION
Gene expression is the process by which a Deoxyribonucleic Acid (DNA) sequence is transcribed into Ribonucleic Acid (RNA). The expression level of a gene reflects the approximate number of RNA copies of that gene produced in a cell and is correlated with the amount of the corresponding protein [1].
A microarray is a technique for simultaneously measuring the expression levels of tens of thousands of genes on a single chip. Microarrays therefore provide an effective way to collect data that can be used to establish the expression patterns of thousands of genes. In most classification problems, the high dimensionality of gene expression data is a major challenge. Moreover, not all genes are related to cancer, and a large number of genes have no clinical significance. Using all genes in microarray gene expression classification can therefore lead to incorrect diagnosis. There are two key reasons for low classification accuracy: the large number of features (genes) relative to the limited sample size, and redundancy in the expression data [2]. Consequently, dimensionality reduction is necessary. Standard machine learning methods have not been effective here, since they are better suited to settings with more samples than features.
To address these problems, feature (gene) selection algorithms have been used for dimensionality reduction. Gene selection methods are usually divided into three groups: filter, wrapper, and embedded methods. The filter approach generally evaluates each feature individually using its statistical characteristics. The wrapper approach uses a learning algorithm to choose the best subset of features, and its effectiveness is measured by the accuracy of that particular classifier. Evolutionary or bio-inspired algorithms are also used in the wrapper approach to guide the search process. The embedded approach searches for the best feature subset as part of building the classifier. This general taxonomy of feature selection has recently been complemented with hybrid and ensemble approaches. Hybrid approaches are designed to take advantage of both the filter and the wrapper approaches. Extensive work has investigated this issue and proposed several methods [3][4][5][6][7][8][9][10][11][12][13][14][15][16].
Several feature selection methods have been applied. For instance, the authors in [17][18][19] proposed hybrid methods that combine filter and wrapper algorithms to overcome the disadvantages of each individual approach. Conventional optimization algorithms do not work efficiently for feature selection in large-scale problems [20]. Alternatively, different meta-heuristic algorithms have been adapted for feature selection. Examples of these algorithms are the Genetic Algorithm (GA) [21], Ant Colony Optimization [22], Simulated Annealing [23], and Particle Swarm Optimization (PSO) [24,25]. In addition, a modified support vector machine (SVM) was suggested to select the minimum possible number of genes [26]. A multi-objective version of the bat algorithm for binary feature selection [27] and the Genetic Bee Colony (GBC) algorithm [28] were successfully applied to high-dimensional datasets. Moreover, a hybrid feature selection algorithm was proposed that combines mutual information maximization (MIM) and the adaptive genetic algorithm (AGA) [19]. The reduced gene expression dataset yielded higher classification accuracy than conventional feature selection algorithms.
In addition, a binary version of the Black Hole Algorithm, called BBHA, was proposed for solving the feature selection problem in biological data. However, the tested classifiers all belonged to the decision tree family, and other kinds of classifiers were not assessed [29]. Along this line, different classifiers such as the artificial neural network (ANN) [30] and the fuzzy decision tree algorithm [31] have been assessed on microarray data. In addition, the two evolutionary algorithms PSO and GA are usually used in wrapper form [17,20]. Compared with other algorithms, PSO is known to be memory-enabled and requires few parameters to be adjusted, so it is simple and efficient [18,32]. Kar et al. [33] proposed a PSO-adaptive K-nearest neighbors (KNN) based gene selection method that uses a heuristic to select the optimal value of K, with classification accuracy tested using the SVM algorithm. Furthermore, Jain et al. reported a two-phase hybrid model for cancer classification that integrates Correlation-based Feature Selection (CFS) with improved-Binary Particle Swarm Optimization (iBPSO), using Naïve Bayes as the only classifier [34].
Moreover, Almutiri and Saeed [35] proposed a new combination for gene selection that uses Chi Square and SVM Recursive Feature Elimination. The proposed method, called ChiSVMRFE, is considered a ranking method. The top 10% of the genes were selected based on their high weights, and SVM-RFE was then used to remove the genes with lower weights. Only 10 features were selected and fed to several machine learning methods, such as random forest, decision tree, K-nearest neighbors, Naïve Bayes, and neural networks, to enhance the cancer classification process.
The objectives of this paper are to propose hybrid feature selection methods that combine filter and wrapper methods, and to apply them with different machine learning and ensemble learning methods to improve the performance of cancer classification.
The rest of the paper is structured as follows: Materials and Methods are provided in Section II. The experimental design is presented in Section III. Section IV shows the results and discussion. The conclusion and future work are presented in Section V.

II. MATERIALS AND METHODS
A. Datasets
The proposed methods were applied to four high-dimensional microarray gene expression datasets covering different types of cancer. In addition to the Breast Cancer and Brain Cancer datasets, the Lung Cancer, Leukemia Cancer, and Central Nervous System (CNS) Cancer datasets were used, as shown in Table I. Previous studies used other datasets such as SRBCT, Prostate, Ovarian, MLL, Lymphoma, Leukemia, and Colon, but the dimensionality of these datasets is not very high, and the feature selection and machine learning methods applied to them obtained satisfactory performance.

B. Hybrid Feature Selection Methods
In this study, several combinations of filter-based and wrapper-based feature selection methods were evaluated in order to identify the best hybrid method. In the filter-based step, information gain was used to reduce the dimensionality to the top 1% and 5% of the genes. Several wrapper-based methods were then applied to investigate the performance of gene selection, namely Best First, Greedy Stepwise, and Rank Search. Two classification methods were used in each wrapper method: K-nearest neighbors and Naïve Bayes. Fig. 1 shows the overall methods used in this study.
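The two-stage idea described above can be illustrated with a minimal sketch. It is not the WEKA implementation used in this study: it assumes each gene is binarized at its median before information gain is computed, uses leave-one-out 1-nearest-neighbor accuracy as the wrapper criterion, and implements only a greedy stepwise forward search. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Information gain of one gene, binarized at its median (an assumption)."""
    cut = sorted(values)[len(values) // 2]
    low = [lab for v, lab in zip(values, labels) if v < cut]
    high = [lab for v, lab in zip(values, labels) if v >= cut]
    cond = sum(len(p) / len(labels) * entropy(p) for p in (low, high) if p)
    return entropy(labels) - cond

def filter_top_fraction(X, y, fraction):
    """Filter step: rank genes by information gain, keep the top fraction."""
    n_genes = len(X[0])
    scores = [information_gain([row[j] for row in X], y) for j in range(n_genes)]
    keep = max(1, int(n_genes * fraction))
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:keep]

def loo_1nn_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `genes`."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((xi[g] - X[k][g]) ** 2 for g in genes),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)

def greedy_stepwise(X, y, candidates):
    """Wrapper step: add the gene that most improves accuracy; stop when none does."""
    selected, best = [], 0.0
    while True:
        trials = [(loo_1nn_accuracy(X, y, selected + [g]), g)
                  for g in candidates if g not in selected]
        if not trials:
            break
        score, gene = max(trials, key=lambda t: t[0])
        if score <= best:
            break
        selected.append(gene)
        best = score
    return selected, best

# Toy data (hypothetical): 6 samples x 4 genes; only gene 0 separates the classes.
X = [
    [0.1, 5.0, 0.5, 0.20],
    [0.2, 5.0, 0.9, 0.40],
    [0.3, 5.0, 0.1, 0.30],
    [0.9, 5.0, 0.6, 0.70],
    [1.0, 5.0, 0.2, 0.90],
    [1.1, 5.0, 0.8, 0.35],
]
y = [0, 0, 0, 1, 1, 1]

# The study keeps the top 1% or 5%; 50% is used here only because the toy has 4 genes.
top_genes = filter_top_fraction(X, y, fraction=0.5)
selected, accuracy = greedy_stepwise(X, y, top_genes)
print(top_genes, selected, accuracy)
```

The filter step is cheap because each gene is scored independently, whereas the wrapper step re-evaluates a classifier for every candidate subset. Running the filter first keeps the wrapper search tractable on datasets with tens of thousands of genes.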

C. Machine Learning Methods
Several machine learning methods were applied to each combination produced in the feature selection step. These include individual and ensemble classification methods, namely K-nearest neighbors, Naïve Bayes, Support Vector Machine, Random Forests, and Stacking Ensemble methods. The performance of these methods was evaluated before and after applying the different feature selection combinations, and the best performing methods were reported, as shown in Fig. 1.
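To make the stacking idea concrete, the following is a minimal sketch of stacked generalization and not the WEKA Stacking configuration used in this study. The base learners (1-nearest-neighbor and nearest-centroid), the lookup-table meta-learner, the three-fold split, and the toy data are all assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict

def nearest_neighbor(train_X, train_y, x):
    """Base learner 1: label of the closest training sample."""
    k = min(range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[k]

def nearest_centroid(train_X, train_y, x):
    """Base learner 2: class whose mean expression profile is closest to x."""
    groups = defaultdict(list)
    for row, label in zip(train_X, train_y):
        groups[label].append(row)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

BASE_LEARNERS = [nearest_neighbor, nearest_centroid]

def fit_stacking(X, y, n_folds=3):
    """Train the meta-learner on out-of-fold predictions of the base learners.

    The meta-learner is a lookup table (an assumption): for each tuple of
    base predictions it stores the most frequent true label.
    """
    votes = defaultdict(Counter)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        tr_X, tr_y = [X[i] for i in train], [y[i] for i in train]
        for i in held_out:
            key = tuple(f(tr_X, tr_y, X[i]) for f in BASE_LEARNERS)
            votes[key][y[i]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def predict_stacking(meta, X, y, x):
    """Combine base predictions through the meta-learner."""
    key = tuple(f(X, y, x) for f in BASE_LEARNERS)
    # Unseen combination of base predictions: fall back to the first learner.
    return meta.get(key, key[0])

# Toy data (hypothetical): two well-separated classes, two selected genes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.9, 1.0], [1.1, 0.8], [1.0, 1.2]]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

meta = fit_stacking(X, y)
print(predict_stacking(meta, X, y, [0.05, 0.05]))
print(predict_stacking(meta, X, y, [1.0, 1.0]))
```

The key design point is that the meta-learner only sees predictions made on samples the base learners were not trained on, which prevents it from simply trusting whichever base learner memorizes the training data best.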

III. EXPERIMENTAL DESIGN
The experiments were conducted using the WEKA tool, version 3.8. The output of each feature selection method was fed to all machine learning methods (KNN, NB, SVM, RF, and Stacking) in order to evaluate the performance of the gene selection and cancer classification methods. 10-fold cross-validation was used for training and testing on each dataset for all resulting combinations. Performance was evaluated using the Accuracy and Recall measures, which are defined in the following equations (1) and (2).
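$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.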
In addition, the performance of each method was compared before and after applying the feature selection methods in order to discuss the improvements obtained.
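As an illustration of this evaluation protocol, the sketch below splits sample indices into 10 folds and computes Accuracy and Recall from the true/false positive and negative counts. It is a simplified stand-in for the WEKA evaluation: the split is a plain, unstratified assignment by index, the metrics are shown for the binary case with a designated positive class, and all names and labels are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; sample i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test

def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN with respect to the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Fraction of actual positives that were recovered."""
    return tp / (tp + fn)

# Hypothetical labels: 1 = tumor (positive class), 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive=1)
print(tp, tn, fp, fn)            # 3 4 2 1
print(accuracy(tp, tn, fp, fn))  # 0.7
print(recall(tp, fn))            # 0.75

# Every sample appears in exactly one test fold.
folds = list(k_fold_indices(len(y_true), k=10))
```

In a full experiment, the classifier would be trained on each `train` list and its predictions on the matching `test` list pooled before the counts are taken.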

IV. RESULTS AND DISCUSSION
The performance of the different combinations of feature selection and machine learning methods is shown in the tables below. The best performing method for each combination is bolded and the best performing method among all combinations for each dataset is shaded.
For the Breast Cancer dataset, the performance of the methods using the top 1% and 5% of genes from the ranking method (information gain) is presented in Tables II and III. As shown in Table II, the random forest method obtained the best accuracy and recall values in the high-dimensional cases (all 24481 features and the top 1%, i.e. 244 features). However, after applying the different combinations of ranking and wrapper methods, Information Gain & Wrapper (NB & Best First) and Information Gain & Wrapper (NB & Greedy Stepwise) obtained the best performance among all methods and combinations, both before and after feature selection. Similarly, when the top 5% of genes were selected in the ranking method, Table III shows that random forest obtained the best results on the high-dimensional dataset, but once wrapper methods were applied, the combination of Information Gain and Wrapper (NB & Best First) obtained the best results.
For the Brain Cancer dataset, the results of the methods using the top 1% and 5% of features are shown in Tables IV and V. ... Table VII).
Comparing the performance of all combined feature selection methods across the different individual and ensemble machine learning methods shows that using these combinations on high-dimensional data improved cancer classification on all datasets. The results in Tables II to IX showed that the best performing methods

V. CONCLUSION AND FUTURE WORK
This paper investigated the high dimensionality issue in microarray datasets. Several combinations of ranking methods (information gain with thresholds of 1% and 5%) and wrapper methods (KNN and NB with Best First, Greedy Stepwise, and Rank Search) were used to select the most important genes in microarray datasets. These datasets included the Breast Cancer, Brain Cancer, Lung Cancer, and CNS datasets. The experimental results showed that all feature selection methods consistently performed well compared with using all features (no feature selection). Among these methods, KNN with Information Gain & Wrapper (KNN & Best First) and NB with Information Gain & Wrapper (NB & Best First) obtained the best performance and outperformed all other methods. This study therefore recommends using one of these methods on high-dimensional microarray datasets to obtain better cancer classification accuracy. Future work will investigate other hybrid and intelligent feature selection methods for cancer classification using microarray datasets.