A Feature Selection Algorithm based on Mutual Information using Local Non-uniformity Correction Estimator

Feature subset selection is an effective approach for selecting a compact subset of features from the original set, and is used to remove irrelevant and redundant features from datasets. In this paper, a novel algorithm is proposed to select the best subset of features based on mutual information and the local non-uniformity correction estimator. The proposed algorithm consists of three phases: in the first phase, a ranking function measures the dependency and relevance among features; in the second phase, candidates with higher dependency and minimum redundancy are selected to participate in the optimal subset; in the last phase, the produced subset is refined using forward and backward wrapper filters to ensure its effectiveness. Datasets from the UCI Machine Learning Repository are used for validation and testing. The performance of the proposed algorithm is significant in terms of classification accuracy and time complexity.

Keywords—Feature subset selection; irrelevant features; mutual information; local non-uniformity correction


I. INTRODUCTION
In many applications of machine learning, the number of samples and dimensions of most datasets have grown rapidly [1]. Since computational power, processing time and classification accuracy depend on the size of the data, reducing the dataset represents a challenge for researchers. The primary motivation for reducing the dimensions of data and minimizing the set of features is to decrease the training time and to enhance the classification accuracy of the algorithms [2], [3], [4]. Feature subset selection provides an approach for dimensionality reduction and data minimization by replacing the original set of features with a compact subset that acts similarly to the original one. This approach has been used in several applications in engineering, economics and the medical sciences [5], [6], [7], [8].
Feature subset selection is categorized into two main approaches in terms of evaluation strategy [1]. First, the wrapper approach depends on searching the whole search space to find the optimal subset [9]. This approach evaluates every combination of subsets, determining the accuracy of each through the classifier's prediction function. Thus, the quality of a subset is calculated without any modification of the learning algorithm. Since the produced subset is optimized for a particular classification algorithm, the main advantage of the wrapper approach is its high accuracy. On the other hand, searching every combination consumes considerable computational power. The wrapper approach may also suffer from over-fitting to the learning algorithm; this drawback may also occur when any parameter of the learning model changes [10].
Second, the filter approach depends on ranking each feature according to a specific evaluation function using distance, information or statistical measures. Many techniques have been proposed to calculate feature relevance, including Fisher's Discriminant Ratio [11], the Single Variable Classifier [12], Mutual Information [13], the Relief Algorithm [14], Rough Set Theory [15] and Data Envelopment Analysis [16]. The main advantages of filter approaches are computational efficiency and scalability with respect to data dimensionality. Even though the filter approach is faster than the wrapper approach, it suffers from a lack of information about the interaction between the features and the classifier. This approach may also select irrelevant or redundant features because of the limitations of the evaluation function [17].
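As a minimal illustration of the filter strategy, the sketch below scores each feature by its empirical mutual information with the class labels and ranks the features by that score. The helper names (`mutual_information`, `filter_rank`) and the toy dataset are illustrative assumptions, not part of the paper's method:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def filter_rank(features, labels):
    """Rank feature columns by their mutual information with the class labels."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy dataset: f1 determines the class, f2 is weakly related, f3 is constant.
labels = [0, 0, 1, 1, 0, 1]
features = {
    "f1": [0, 0, 1, 1, 0, 1],   # identical to the class -> maximal MI
    "f2": [0, 1, 0, 1, 1, 0],   # weakly related
    "f3": [7, 7, 7, 7, 7, 7],   # constant -> zero MI
}
print(filter_rank(features, labels))  # → ['f1', 'f2', 'f3']
```

Such a ranking is fast but, as noted above, it scores each feature in isolation, which is exactly why filters can retain redundant features.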
Information theory [18] has been applied in many filter approaches to determine the relevance and redundancy of features. In the feature subset selection process, mutual information is used to measure the relevance and redundancy of features effectively, and it has been applied by many researchers to characterize the information content of features [13], [19], [20]. The primary contribution of this research is to generate a compact feature subset with high accuracy while keeping the time complexity as low as possible. This paper is organized as follows: Section II introduces the related work and the limitations of previous work. In Section III, the preliminaries and essential background on information systems, mutual information, conditional entropy and feature significance are discussed. In Section IV, the proposed algorithm is illustrated in detail. In Section V, the experiment and final results are presented. Finally, the paper is concluded in Section VI.

II. RELATED WORK
Many approaches have been proposed for the enhancement of feature subset selection using several methods. Mutual information was first applied to the selection process by Battiti [13], who provided a novel algorithm called Mutual Information Feature Selection (MIFS). MIFS uses the mutual information among features, and between each feature and the decision class, to determine the best k features from the original set, selecting the candidate set with a traditional greedy algorithm. MIFS introduced the concepts of relevance and redundancy using mutual information. Battiti proved that mutual information can be very useful for feature selection problems and illustrated that MIFS is suitable for a wide range of classification problems; however, the method is not suitable for non-linear ones. Kwak and Choi [21] analyzed the work presented by Battiti and proposed an enhancement of the MIFS method, called MIFS-U, that improves the estimation of the information between input features and decision classes obtained by MIFS. However, they neglected the joint behavior of the selected features and focused on individual features.
Peng and Long [19] proposed a different method for solving the feature selection problem based on minimum redundancy and maximum relevance (mRMR). This method consists of two steps: in the first step, the best candidate elements are selected using the mRMR first-order incremental criterion; in the second, a wrapper filter searches the obtained candidate set using backward and forward selection algorithms. However, this method searches the complete search space to find the compact subset of features, which incurs a high computational cost. Therefore, it is necessary to reduce the search space with a reduction method or to refine the candidate feature set.
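The first-order incremental step of mRMR can be sketched as a greedy loop: at each iteration, pick the feature maximizing relevance to the decision minus mean redundancy with the already-selected features. The sketch below is a simplified illustration with a toy discrete MI helper; the function and variable names are mine, not Peng and Long's:

```python
from collections import Counter
import math

def mi(xs, ys):
    """Empirical mutual information (bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_select(features, labels, k):
    """Greedy mRMR: at each step pick the feature maximizing
    relevance I(f; D) minus mean redundancy with the selected set."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(f):
            relevance = mi(features[f], labels)
            if not selected:
                return relevance
            redundancy = sum(mi(features[f], features[s]) for s in selected)
            return relevance - redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

labels = [0, 0, 0, 1, 1, 1]
features = {
    "f1":     [0, 0, 1, 1, 1, 1],  # relevant
    "f1_neg": [1, 1, 0, 0, 0, 0],  # equally relevant but fully redundant with f1
    "f2":     [0, 1, 0, 1, 0, 1],  # weakly relevant, independent of f1
}
print(mrmr_select(features, labels, k=2))  # → ['f1', 'f2']
```

Note how the redundant copy `f1_neg` is demoted in the second step even though its relevance equals that of `f1`; this is the min-redundancy term at work.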
The presented methods are all incremental methods that search for one feature at a time according to specific criteria. This strategy neglects the relationships among feature groups, and could instead select one element to represent a group if it is better than the other candidates.

III. PRELIMINARIES
In this section, a brief introduction to information theory and the basic rough-set concepts is presented. An Information System IS is defined as a quadruple IS = (U, A, V, f), where U denotes a non-empty set containing the whole set of objects, A denotes the finite non-empty set of features, V represents the union of all feature domains V = ∪_{a∈A} V_a, where V_a is the domain of a specific feature a ∈ A, and f represents the mapping function f : U × A → V that assigns a unique value of each feature to each object of the universe. Let P ⊆ A. Then P induces an indiscernibility relation IND(P) = {(u, v) ∈ U × U | f(u, a) = f(v, a), ∀a ∈ P}, and [u]_P denotes the equivalence class of u with respect to the subset P. For any given P ⊆ A, there is also a binary tolerance relation SIM(P) such that S_P(u) = {v ∈ U | (u, v) ∈ SIM(P)} is the maximal set of instances that are possibly indistinguishable from u in the universe U by the set P. A member S_P(u) of U/SIM(P) is called an information granule [22].
The entropy of a random variable is defined as the amount of information required to describe that variable [18], [23]. The entropy of a discrete random variable X = (x_1, x_2, ..., x_n) is denoted H(X) and defined as:

H(X) = −Σ_i P(x_i) log2 P(x_i),

where x_i represents the possible values of X and P(x_i) is the probability of x_i. The base of the logarithm is two because entropy is measured in bits. For any two discrete random variables X and Y with joint probability distribution P(x, y), the conditional entropy is defined as:

H(X|Y) = −Σ_{x,y} P(x, y) log2 P(x|y).

The mutual information is defined as the amount of information that variable X contains about variable Y and is represented as:

I(X; Y) = Σ_{x,y} P(x, y) log2 [P(x, y) / (P(x) P(y))].

The mutual information indicates the level of shared information between two random variables. It can also be computed more cheaply through its relation to the entropy and the conditional entropy:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).

A high value of mutual information means that the two random variables are closely related; if the mutual information equals zero, the two variables are statistically independent. Replacing Y with F_n or D defines the feature-to-feature and feature-to-class terms, respectively. Although mutual information is a stable measure of uncertainty, it is not a monotonic function. Therefore, Dai et al. [24] presented a monotonic mutual information measure for incomplete decision tables and proved that it can be used for measuring uncertainty effectively. Mutual information is also used to determine the significance of a specific feature b_i ∈ B, where B ⊆ C, with respect to D:

sig(b_i, B, D) = I(B; D) − I(B − {b_i}; D).

The value of sig(b_i, B, D) represents the change in mutual information when the feature b_i is removed from the subset B.
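These quantities can be checked numerically. The minimal pure-Python sketch below (helper names of my own choosing) estimates H(X) and H(X|Y) from empirical counts and computes mutual information through the identity I(X;Y) = H(X) − H(X|Y):

```python
from collections import Counter
import math

def entropy(xs):
    """H(X) = -sum_i P(x_i) log2 P(x_i), estimated from counts."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y) = sum_y P(y) * H(X | Y = y)."""
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        slice_x = [x for x, yy in zip(xs, ys) if yy == y]
        h += (cy / n) * entropy(slice_x)
    return h

def mutual_information(xs, ys):
    """I(X; Y) via the identity I(X;Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # → 1.0 (a variable fully determines itself)
print(mutual_information(x, [0, 1, 0, 1]))  # → 0.0 (independent variables)
```

The two printed cases illustrate the extremes discussed above: maximal dependence (MI equals the entropy of X) and independence (MI equals zero).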
The higher the value of the mutual information, the more significant the feature. If sig(b_i, B, D) = 0, then the feature b_i is dispensable.
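Treating a feature subset B as a single joint discrete variable, the significance measure can be sketched as follows. All helper names are illustrative (not from the paper), and the entropy-based MI uses the identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
from collections import Counter
import math

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def joint(features, names, n):
    """View a feature subset as one variable whose values are tuples."""
    if not names:
        return [()] * n          # the empty subset behaves as a constant
    return list(zip(*(features[f] for f in names)))

def significance(b, B, features, labels):
    """sig(b, B, D) = I(B; D) - I(B - {b}; D): the MI lost by removing b."""
    n = len(labels)
    with_b = mutual_information(joint(features, B, n), labels)
    without_b = mutual_information(joint(features, [f for f in B if f != b], n), labels)
    return with_b - without_b

labels = [0, 0, 1, 1]
features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}
print(significance("f1", ["f1", "f2"], features, labels))  # → 1.0
print(significance("f2", ["f1", "f2"], features, labels))  # → 0.0 (dispensable)
```

Here `f2` carries no information about the decision that `f1` does not already provide, so its significance is zero, matching the dispensability criterion above.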

IV. A MUTUAL INFORMATION BASED UNCERTAINTY MEASURE
In this section, the ranking function used to obtain the uncertainty of knowledge is introduced, and its main properties are presented to illustrate its validity. In order to obtain the uncertainty for a target decision, both the feature dependency and the redundancy among features must be measured. Let IS = (U, C ∪ D) be a given information system. The uncertainty of knowledge h(C) is formulated in terms of I(C, D_i), the mutual information between the decision and a specific feature, and I(C, C_j), the mutual information between that feature and the other features; the proposed function h(C) thus represents a relation between feature redundancy and decision dependency.

Property 1 (Monotonicity). Let IS = (U, F ∪ D) be an information system such that U represents the whole space of objects, F is the set of condition features and D is the decision set. For all A, B with A ⊆ B ⊆ F, h(A) ≤ h(B).

This equation can be reformulated, and, from the monotonicity of f(x, y) = −x log2(x / (x + y)) [24], we obtain H(D|B) ≤ H(D|A) for A ⊆ B. Substituting into the definition of h then yields h(A) ≤ h(B).
Property 2 (Maximum Value). Let IS = (U, C ∪ D) be a given information system. The maximum value of h(f) is one, and it occurs when P(f, f_i) = P(f)P(f_i), ∀ 0 < i < n − 1, where n is the number of features.
Property 3 (Minimum Value). Let IS = (U, C ∪ D) be a given information system. The minimum value of h(f) is zero, and it occurs when P(f, D_i) = P(f)P(D_i), ∀ 0 < i < m − 1, where m is the number of decision classes.

A. Feature selection algorithm
In a feature selection process based on mutual information, the tolerance classes must be computed for the complete decision system. This computation is highly time-consuming and affects the total time performance. In order to design an effective feature selection algorithm based on mutual information for a decision system, a fast algorithm for assembling granules from a given decision system is introduced first. This algorithm is mainly based on decomposition and mutual information estimation. Computing the mutual information is a very complex task, especially for large dimensions or sample sizes; determining it with the traditional method is time-consuming, with a time complexity of O(n²). Therefore, a non-parametric mutual information estimator based on Local Non-uniformity Correction (LNC) is used [25]. The main idea of the LNC-based algorithm is to calculate an average correction term LNC over all points x_i ∈ X. The correction term is used to adapt the value of the Kraskov (KSG) estimate of mutual information [26], and is computed from the volume of the max-norm rectangle V(i) produced by PCA analysis of the k nearest neighbors of each point x_i.
Algorithm 1 Mutual Information Estimation using LNC
1: Input: X = {x_1, x_2, ..., x_m}, where X is the sample of points, d is the dimension size, k is the number of nearest neighbors and α is a threshold.
2: Output: Î_LNC(X)
3: Calculate Î_KSG(X) using the KSG estimator with k nearest neighbors.
4: for all x_i ∈ X do
5:   Find the k nearest neighbors of x_i as {knn_i^1, knn_i^2, ..., knn_i^k}
6:   Apply PCA on the k nearest neighbors
7:   Calculate the corrected rectangle volume V̄(i) and the max-norm rectangle volume V(i)

B. Proposed Method
In this section, the proposed feature subset selection algorithm is presented in detail.
26:   Calculate U/SIM(R ∪ {a_i}) and U/SIM(R − {a_i}).
27:   Calculate sig(a_i, R, D)
      end if
31: end for
32: return Red

The proposed algorithm consists of three main blocks. In the first block, from step (1) to step (12), the proposed measure is computed for each feature according to Eq. (7). Since the computation of mutual information is an expensive process, an effective estimator based on Local Non-uniformity Correction (LNC) and the KSG estimator is used to calculate this formula. Afterwards, radix sort is used to sort the features in descending order of h(c_i); radix sort is chosen to minimize the total computational cost, as it is a linear-time sort. Then, the granules of information are obtained using Pawlak's definitions with respect to the complete set of features [27], and the significance of each feature is computed to construct an initial subset of features. If the significance of a feature equals zero, it is considered an irrelevant feature; otherwise, the relevant feature is added to the optimal subset of features, called R. In the second block, from step (13) to step (22), the obtained subset R is refined using a forward wrapper filter. The granule of information is calculated to determine the significance of the non-participating features: each feature is joined to the generated subset R to study its significance, and once a feature is found significant with respect to R, it is merged into R. The last block, from step (23) to step (32), is a backward wrapper, similar to the second block but with the reverse effect.
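The three blocks can be summarized structurally as follows. This is a sketch only: `score` and `significance` are pluggable callables standing in for Eq. (7) and the granule-based significance computation, which are not reproduced here:

```python
def select_features(all_features, score, significance):
    """Three-phase selection skeleton mirroring the algorithm's blocks."""
    # Block 1 (steps 1-12): rank by the proposed measure (descending),
    # then keep only the features that are significant w.r.t. the full set.
    ranked = sorted(all_features, key=score, reverse=True)
    R = [f for f in ranked if significance(f, ranked) > 0]
    # Block 2 (steps 13-22): forward wrapper - re-admit left-out features
    # that become significant with respect to the current subset R.
    for f in ranked:
        if f not in R and significance(f, R + [f]) > 0:
            R.append(f)
    # Block 3 (steps 23-32): backward wrapper - drop dispensable features.
    for f in list(R):
        if significance(f, R) == 0:
            R.remove(f)
    return R

# Toy run with hand-crafted callables: only "a" carries information.
scores = {"a": 3.0, "b": 2.0, "c": 1.0}
sig = lambda f, subset: 1 if f == "a" else 0
print(select_features(["a", "b", "c"], scores.get, sig))  # → ['a']
```

The skeleton makes the control flow explicit: an initial filter pass followed by a forward pass that can recover features and a backward pass that prunes them.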
The complexity analysis of the proposed algorithm is determined as follows: the time complexity of step (1) is O(n|U|). For step (2), radix sort is applied with a complexity of O(n).
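The linear-time sort used in step (2) can be sketched as a least-significant-digit radix sort. The paper sorts real-valued scores h(c_i); a common trick, assumed here, is to scale them to non-negative integers first:

```python
def radix_sort(keys, base=10):
    """LSD radix sort for non-negative integers: O(d * (n + base)) time,
    where d is the digit count - linear in n for bounded keys."""
    keys = list(keys)
    if not keys:
        return keys
    place = 1
    while place <= max(keys):
        buckets = [[] for _ in range(base)]
        for k in keys:
            buckets[(k // place) % base].append(k)   # stable per digit
        keys = [k for bucket in buckets for k in bucket]
        place *= base
    return keys

# e.g. scores scaled by 1000 to integers before sorting:
print(radix_sort([459, 82, 918, 0, 459]))  # → [0, 82, 459, 459, 918]
```

A descending order, as the algorithm requires, is obtained by reversing the sorted list.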

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the dataset descriptions, numeric results and comparative studies are presented.

A. Dataset Description
Five datasets are used to benchmark and evaluate the proposed approach. These public datasets are commonly used for benchmarking and validating selection algorithms; a brief description of each is listed in Table I. The Iris dataset is a very popular benchmarking dataset that contains information about iris plants. It contains three classes of 50 instances each, where each class represents an iris category (Setosa, Versicolour and Virginica), and the feature set includes sepal length, sepal width, petal length and petal width. The second dataset contains experimental information about breast cancer; Dr. William H. Wolberg collected this dataset at the University of Wisconsin Hospitals, Madison. It has nine features (of the original ten attributes, excluding the sample identifier), and each instance is classified into a binary decision (benign or malignant). The missing values are replaced with the average value of the corresponding feature in order to prevent exceptions. The Liver Disorders dataset contains blood-test measures; it comprises six features and 345 instances, and each instance is classified into a binary decision. The Glass dataset includes nine features and 214 instances, with seven decision classes. The fifth dataset is the Ionosphere dataset of radar returns, with 34 features, 351 instances and a binary decision class. The numeric experiment is implemented using the Python Scikit-learn package [28]. All comparative studies with the other methodologies are implemented in the WEKA software [29]. The experiment is executed on an Intel(R) Core(TM) i7-2400 CPU 3.10GHz platform running MS Windows 10.
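The missing-value handling mentioned above can be sketched as follows. This is a minimal stand-in for the preprocessing step; the row-list layout and the use of `None` as the missing marker are assumptions:

```python
def impute_column_means(rows, missing=None):
    """Replace each missing entry with the mean of the observed
    values in the same column (feature)."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        observed = [v for v in col if v is not missing]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is missing else v for j, v in enumerate(row)]
            for row in rows]

data = [[1.0, None],
        [3.0, 4.0],
        [None, 8.0]]
print(impute_column_means(data))  # → [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

Mean imputation keeps the per-feature average unchanged, which is why it is a common default for datasets such as the breast cancer one described here.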

B. Numeric Results And Comparative Studies
In this section, the numeric results of the experiment are presented. The proposed method is applied to the five datasets described in Table I and achieves higher accuracy than the other methods. The Naive Bayes classifier is used to determine the accuracy of the proposed algorithm. The comparison covers standard methods such as Information Gain, Gain Ratio, Chi-Square, the Best First approach and Symmetrical Uncertainty; in addition, mRMR, MIFS, MIFS-ND and MIFS-U are used for the ionosphere dataset. For the breast cancer dataset, the proposed method achieved the highest accuracy among the compared methods, as shown in Fig. 1. The minimum accuracy was achieved by the symmetrical uncertainty method, followed by the best first method, while chi-square, gain ratio and information gain all achieved the same accuracy. The minimum scored accuracy is 91.04% and the maximum accuracy is 92.09%. For the glass dataset, the proposed method achieved the highest accuracy, as shown in Fig. 2. The minimum accuracy was achieved by chi-square, gain ratio, information gain and symmetrical uncertainty; the best first method scored 49.53% accuracy and the proposed method scored 54.67%. For the ionosphere dataset, the proposed method achieved the best accuracy among the compared methods, as shown in Fig. 3. The MIFS-U achieved the best accuracy for only three features, while MIFS-ND scored the best accuracy among the remaining methods. The proposed method scored the best accuracy for both the ten- and fifteen-feature cases: the minimum accuracy achieved for 15 features is 92%, by MIFS, and the maximum accuracy is 94.3%, by the proposed method. For the liver disorder dataset, the proposed method scored the best accuracy, as shown in Fig. 4. The minimum accuracy was achieved by most of the selection methods (chi-square, gain ratio, information gain and symmetrical uncertainty); the best first method then scored an accuracy of 58.55%, and the best accuracy of 58.84% was achieved by the proposed method.
(Continuation of the proof of Property 1.) The classification produced by the decision D is U/IND(D) = {D_1, D_2, ..., D_m}. Since A ⊆ B, the classification produced by the subset B is finer than the classification produced by the subset A, i.e., T_B(X) ⊆ T_A(X) and therefore |T_B(X)| ≤ |T_A(X)|.
13: end for
14: Calculate the average correction LNC = (Σ_{i=1}^{m} LNC_i) / m
15: Î_LNC(X) = Î_KSG(X) − LNC
16: return Î_LNC(X)

The LNC estimator works for any dimension d. For example, to compute the mutual information I(X; Y) using the LNC algorithm, let the dimension parameter d = 2, the input array X = [[x_1, x_2, ...], [y_1, y_2, ...]], the number of nearest neighbors k = 3 and the threshold α = 0.25. The time complexity of Algorithm 1 is determined as follows: step (3) is O(Nk + d), which can be approximated to O(N) since both k and d are much smaller than N. Step (5) is O(k), step (6) is O(k³), and the complexity of steps (7-12) is O(1). The overall complexity is therefore O(N) · (O(k) + O(k³) + O(1)), which yields O(Nk³) ≈ O(N), since k is a small constant.
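Steps 14-15 of Algorithm 1 amount to subtracting the average correction term from the KSG estimate. As a tiny sketch (the KSG estimate and the per-point LNC_i values are assumed precomputed, e.g. by the loop of steps 4-13):

```python
def lnc_corrected_mi(i_ksg, corrections):
    """I_LNC = I_KSG - mean(LNC_i): apply the average local
    non-uniformity correction to a KSG mutual-information estimate."""
    lnc_bar = sum(corrections) / len(corrections)
    return i_ksg - lnc_bar

# e.g. a KSG estimate of 1.00 bits with per-point correction terms:
print(lnc_corrected_mi(1.00, [0.10, 0.30, 0.20]))
```

The correction lowers the raw KSG estimate where the local neighborhoods are strongly non-uniform, which is the effect the LNC term is designed to capture.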

Algorithm 2 Hybrid Mutual Information based Feature Selection Algorithm
Input: IS = (U, C ∪ D).
Output: Red, the reduced subset.
1: Calculate h(c_i), ∀c_i ∈ C, using Algorithm 1.
2: Sort the features in descending order using radix sort and denote the result by S = {a_1, a_2, ..., a_n}
3: Calculate U/SIM(C).
4: for all a_i ∈ S do
5:   Calculate U/SIM(C − {a_i}).

18:   Calculate sig(a_i, R, D)
19:   if sig(a_i, R, D) ≠ 0 then
20:     R = R ∪ {a_i}
21:   end if
22: end for
23: Let Red = R
24: Construct an input sequence subset Q = {a_1, a_2, ..., a_l} such that l = |R| and l ≤ n.
25: for all a_i ∈ R do

For step (3), the time complexity is O(n|U|). From step (4) to step (12), the complexity is O(n) × (O(|U|) + O(1)), which is approximated to O(n|U|). From step (13) to step (22), a forward wrapper filter is applied to determine the effect of any irrelevant feature on the granules obtained by the classification of subset R; the complexity of the forward wrapper is O(k|U|), where k ≤ n. Then, a backward wrapper is used to refine the generated subset, with a complexity of O(l|U|), where l ≤ n. Hence, the total complexity of the proposed algorithm is O(n|U|) + O(n) + O(n|U|) + O(k|U|) + O(l|U|), which is approximately O(n|U|).

Fig. 1. A comparison between different methods of feature selection for the breast cancer dataset.

Fig. 2. A comparison between different methods of feature selection for the glass dataset.

Fig. 3. A comparison between different methods of feature selection for the ionosphere dataset.

Fig. 4. A comparison between different methods of feature selection for the liver disorder dataset.

TABLE I. DESCRIPTION OF THE BENCHMARKING DATASETS