A Novel Band Selection Approach for Hyperspectral Image Classification using the Kolmogorov Variational Distance

In this paper, we introduce a novel band selection approach based on the Kolmogorov Variational Distance (KoVD) for hyperspectral image classification. Our main interest in KoVD stems from its unique relation to the classification error. This study builds on our previous work on band selection using Mutual Information (MI), the Divergence Distance (DD), and the Bhattacharyya Distance (BD); we are therefore particularly interested in how KoVD performs against these distances in terms of the number of bands retained and the classification accuracy. All the distances in this study are modeled with the Gaussian Mixture Model (GMM) using the Bayes Information Criterion (BIC) / Robust Expectation-Maximization (REM). The experiments are carried out on four benchmark hyperspectral images: Kennedy Space Center, Salinas, Botswana, and Indian Pines (92AV3C). The results show that band selection based on the Kolmogorov Variational Distance performs better than BD and DD, while against MI the results are very close.

Keywords—Band selection; Bayes Information Criterion (BIC); Bhattacharyya Distance; Divergence Distance; hyperspectral imaging; Kolmogorov Variational Distance; Gaussian Mixture Model (GMM); Robust Expectation-Maximization (REM); remote sensing


I. INTRODUCTION
In hyperspectral imaging, sensors record data from hundreds of contiguous bands of the electromagnetic spectrum. However, the Hughes phenomenon [1] [2] [3] and computational complexity [4] are two problems that arise during the classification process. Due to the small sample size problem [5] and the large number of bands acquired by the sensors, the classifier cannot be properly trained [3]. Therefore, dimensionality reduction is needed.
Two approaches for dimensionality reduction can be found in the literature: band selection [6] [3] [7] [8] and band extraction [9] [10] [11] [12]. The aim of band extraction is to create a new, reduced dataset from the existing one using a linear/non-linear transformation [6]. Principal Component Analysis, Projection Pursuit, Independent Component Analysis, Orthogonal Subspace Projection, Segmented Principal Component Analysis, and others [13] [14] have been used to reduce the data volume. However, due to the linear/non-linear transformation, the original data are replaced by a new set of variables with no actual physical meaning [6], which can be a disadvantage in some applications. Band selection, on the other hand, tries to find an optimal subset from the original pool by selecting only the relevant bands with valuable information for the classifier, through maximizing a class separability criterion [6].
Between band extraction and band selection, the latter is preferred in this study since, with band selection, the data remain unchanged and the physical meaning is preserved [15].
Band selection techniques can be broadly classified into two categories: wrapper and filter techniques. The wrapper approach [6] takes advantage of the classifier itself and uses it as the criterion for band selection [16]; the result is a subset with a high classification score, but the drawback of this technique is its bias toward the classifier used. Unlike the wrapper approach, the filter approach [6] [16] deploys metrics and distances to evaluate the bands without involving the classifier. In theory, the best criterion to measure the pertinence of a band is the Bayes error. However, the calculation of the Bayes error is, in general, a very complex problem [17]. Therefore, some approaches seek an upper bound on the error probability, such as the Chernoff and Bhattacharyya bounds.
A new band selection approach based on the Kolmogorov Variational Distance for hyperspectral image classification is introduced in this paper. This work is a sequel to our previous research on band selection with Mutual Information [18], the Bhattacharyya Distance [8], and the Divergence Distance [19]. The primary interest in KoVD is the fact that it is uniquely related to the classification error [20] [6], which is often difficult to estimate [17]. KoVD has been used in other fields such as signal selection, communication, and radar systems [20] [21], but not in the hyperspectral imaging context.
To model the Kolmogorov Variational Distance, the Gaussian Mixture Model is used with the Expectation-Maximization (EM) algorithm [9]. However, the EM algorithm raises two issues: the first is the choice of the number of components K, which can affect the estimation of the covariance matrix [8]; the second is the sensitivity to the choice of initial values [22]. With a bad choice of K, we can easily end up with the Curse of Dimensionality. As a solution, two approaches are proposed: a GMM based on the Bayes Information Criterion (BIC) and a Robust Expectation-Maximization (REM) algorithm [22].
Our main contribution in this study is a novel band selection approach based on the Kolmogorov Variational Distance modeled with GMM-REM and GMM-BIC. To assess the performance of KoVD, two criteria are used: the number of retained bands and the classification accuracy. The experiments are performed on four hyperspectral benchmark datasets: the Indian Pines (92AV3C), Botswana, Kennedy Space Center, and Salinas scenes. This paper is structured as follows: Sections II and III describe the fundamentals and the proposed band selection algorithm. Section IV discusses the experimental results. Finally, Section V concludes the paper.

II. FUNDAMENTALS

A. Kolmogorov Variational Distance
The Kolmogorov Variational Distance (KoVD) is the integral of the absolute difference between two posterior probabilities; it expresses the distance between the densities [6]. The main advantage of KoVD is its direct relation to the classification error [20] [6]. KoVD is expressed as follows [6]:

J_KoVD(ω1, ω2) = ∫ |P(ω1|x) − P(ω2|x)| p(x) dx = ∫ |P(ω1)P(x|ω1) − P(ω2)P(x|ω2)| dx    (1)

KoVD provides an indication of the amount of probability mass by which the two distributions differ. If the classes ω1 and ω2 are identical, P(ω1|x) = P(ω2|x) and J_KoVD equals zero; if the classes ω1 and ω2 are disjoint, P(ω1|x)P(ω2|x) = 0 everywhere and J_KoVD attains its maximum value [6].
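For intuition, the two extremes described above can be checked numerically for one-dimensional Gaussian class densities. The following is a minimal sketch (the helper names `gauss` and `kovd` are ours, not from the paper), integrating the absolute difference on a uniform grid:

```python
import numpy as np

def gauss(mu, sigma):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kovd(prior1, prior2, pdf1, pdf2, x):
    """J_KoVD: integral of |P(w1)p(x|w1) - P(w2)p(x|w2)| over a uniform grid x."""
    dx = x[1] - x[0]
    return np.sum(np.abs(prior1 * pdf1(x) - prior2 * pdf2(x))) * dx

x = np.linspace(-12.0, 12.0, 100001)
j_same = kovd(0.5, 0.5, gauss(0, 1), gauss(0, 1), x)      # identical classes
j_far = kovd(0.5, 0.5, gauss(-4, 0.5), gauss(4, 0.5), x)  # nearly disjoint classes
# j_same is ~0; j_far approaches the maximum of 1 for equal priors
```

With equal priors the distance is bounded by 1, which the well-separated pair nearly attains.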
In the case of a multi-class problem, KoVD is computed as the average cost function over each pairwise class (ωi, ωj):

J_KoVD = Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} P(ωi) P(ωj) J_KoVD(ωi, ωj)    (2)

B. Mutual Information
Given X and Y, two discrete random variables, the Mutual Information (MI) is defined as [18]:

I(X; Y) = H(X) − H(X|Y)    (3)

I(X; Y) expresses the information we gain about the random variable X, i.e., the decrease of its uncertainty, after knowing Y, where H(X) is the entropy of X and H(X|Y) is the conditional entropy of X given Y [18] [23].
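As an illustration, MI can be computed from a small joint probability table using the equivalent identity I(X;Y) = H(X) + H(Y) − H(X,Y). This sketch (helper names are ours) checks the two boundary cases:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint probability table."""
    px = joint.sum(axis=1)  # marginal of X (rows)
    py = joint.sum(axis=0)  # marginal of Y (columns)
    return entropy(px) + entropy(py) - entropy(joint.ravel())

dependent = np.array([[0.5, 0.0], [0.0, 0.5]])    # Y determines X: I = H(X) = 1 bit
independent = np.outer([0.5, 0.5], [0.25, 0.75])  # product of marginals: I = 0
```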

C. Divergence Distance
The Divergence Distance (DD) [19] is a probabilistic distance, often used in information theory, that measures the similarity between two classes ω1 and ω2. DD is the sum of the two Kullback-Leibler divergences. Given P(x|ω1) and P(x|ω2), DD is defined as [6]:

DD(ω1, ω2) = ∫ [P(x|ω1) − P(x|ω2)] log( P(x|ω1) / P(x|ω2) ) dx    (4)

The DD distance is interpreted as the amount of information necessary to change the prior probability distribution into the posterior probability distribution [24]. In the case of a multi-class problem, DD is computed between each pairwise class (ωi, ωj) as the average cost function according to equation (2).
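For one-dimensional Gaussian densities the symmetric divergence has a closed form; for equal variances it reduces to (µ1 − µ2)²/σ². A small sketch (function name is ours):

```python
import numpy as np

def divergence_gauss(mu1, s1, mu2, s2):
    """Symmetric divergence DD = KL(p1||p2) + KL(p2||p1) for 1-D Gaussians
    N(mu1, s1^2) and N(mu2, s2^2), using the closed-form KL divergence."""
    d2 = (mu1 - mu2) ** 2
    kl12 = np.log(s2 / s1) + (s1 ** 2 + d2) / (2 * s2 ** 2) - 0.5
    kl21 = np.log(s1 / s2) + (s2 ** 2 + d2) / (2 * s1 ** 2) - 0.5
    return kl12 + kl21
```

For example, two unit-variance Gaussians two standard deviations apart give DD = 4.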

D. Bhattacharyya Distance
The Bhattacharyya Distance (BD) [8] is a similarity measurement of the scatter degree of two classes ω1 and ω2. The Bhattacharyya distance is expressed as [6]:

BD(ω1, ω2) = −ln ∫ sqrt( P(x|ω1) P(x|ω2) ) dx    (5)

In the case of a multi-class problem, BD is computed between each pairwise class (ωi, ωj) as the average cost function according to equation (2).
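Like the divergence, BD has a closed form for one-dimensional Gaussians, shown in this sketch (function name is ours); for equal variances it reduces to (µ1 − µ2)²/(8σ²):

```python
import numpy as np

def bhattacharyya_gauss(mu1, s1, mu2, s2):
    """Bhattacharyya distance between two 1-D Gaussians (closed form)."""
    v1, v2 = s1 ** 2, s2 ** 2
    return (0.25 * (mu1 - mu2) ** 2 / (v1 + v2)
            + 0.5 * np.log((v1 + v2) / (2 * s1 * s2)))
```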

E. Gaussian Mixture Model
The Gaussian Mixture Model (GMM) models the density as the sum of one or more weighted Gaussian components [25] [8]. For a GMM, the probability density function is the sum of K Gaussian components:

p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)    (6)

where K is the number of mixture components and π_k is the mixing weight (0 ≤ π_k ≤ 1 and Σ_{k=1}^{K} π_k = 1), with µ_k the mean and Σ_k the covariance matrix of the k-th component:

N(x | µ_k, Σ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) )    (7)

The parameters {π_k, µ_k, Σ_k} are usually estimated by the EM algorithm [9].
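A one-dimensional mixture density of this form can be evaluated directly; the following sketch (names are ours) verifies that the weighted sum of components still integrates to one:

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2) for a 1-D Gaussian mixture."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    assert np.isclose(weights.sum(), 1.0), "mixing weights must sum to one"
    comps = (np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2)
             / (stds * np.sqrt(2 * np.pi)))
    return comps @ weights  # weighted sum over the K components

x = np.linspace(-15.0, 15.0, 100001)
p = gmm_pdf(x, [0.3, 0.7], [-3.0, 2.0], [1.0, 0.5])
# p is a valid density: it integrates to ~1 over the grid
```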

III. BAND SELECTION BY KOLMOGOROV VARIATIONAL DISTANCE
Given a set of bands F = {b_i}_{i=1}^{d}, the goal is to find an optimal subset S = {b_i}_{i=1}^{d'}, S ⊂ F, d' ≤ d, that only keeps the relevant bands that contribute to the classification task while discarding any redundancy. An exhaustive search for the optimal subset S can be impractical from a computational viewpoint, and Sequential Forward Selection (SFS) is one of the simplest search strategies [26] [18]. Starting with an empty set of bands S, we sequentially add the band that maximizes the KoVD cost function until the desired number of bands is reached, or until no band further maximizes the cost function. The SFS algorithm has a relatively low computational burden [27].
The algorithm (Fig. 1) is the same as in [18] [28] [29], except that the Mutual Information is no longer computed as a cost function between multiple variables; instead, KoVD is used as the criterion between multiple bands to select the salient ones for hyperspectral image classification.
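The greedy selection loop described above can be sketched as follows; this is a minimal illustration with a generic cost function standing in for J_KoVD (function and variable names are ours):

```python
def sfs(bands, cost, d_max):
    """Sequential Forward Selection: greedily add the band that maximizes the
    cost of the growing subset, stopping at d_max bands or when no candidate
    improves the cost function any further."""
    selected, best_cost = [], float("-inf")
    remaining = list(bands)
    while remaining and len(selected) < d_max:
        # evaluate the cost of adding each remaining band and keep the best
        cand, cand_cost = max(
            ((b, cost(selected + [b])) for b in remaining), key=lambda t: t[1]
        )
        if cand_cost <= best_cost:
            break  # adding any remaining band no longer increases the cost
        selected.append(cand)
        remaining.remove(cand)
        best_cost = cand_cost
    return selected

# toy cost: each band has a fixed score, so SFS picks the top-scoring bands
scores = {10: 5.0, 20: 3.0, 30: 1.0}
picked = sfs([10, 20, 30], lambda s: sum(scores[b] for b in s), 2)
```

In the actual algorithm the cost of a candidate subset would be the KoVD criterion computed on the corresponding bands.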

A. Bayes Error
In theory, the best criterion to measure the pertinence of a band is the Bayes error: the lower the error, the better. However, the calculation of the Bayes error is, in general, a very complex problem [17], and it is often difficult to evaluate its probability. In the m-class case, the Bayes Error Probability e is given as [30]:

e = ∫ [ 1 − max_i P(ωi|x) ] p(x) dx    (8)

where p(x) is the mixture density function and P(ωi) is the prior probability entering the posterior P(ωi|x) = P(ωi)P(x|ωi)/p(x). A direct calculation of equation (8) is in general impossible or impractical [17]. In the two-class case, the Error Probability can be expressed as:

e = ∫ min[ P(ω1)P(x|ω1), P(ω2)P(x|ω2) ] dx    (9)

Using the identity min[a, b] = (a + b)/2 − |a − b|/2, equation (9) becomes:

e = 1/2 − 1/2 ∫ |P(ω1)P(x|ω1) − P(ω2)P(x|ω2)| dx    (10)

The Kolmogorov Variational Distance is the integral in equation (10). From equations (1) and (10), the Error Probability e can be expressed as:

e = (1 − J_KoVD(ω1, ω2)) / 2    (11)

From equation (11) we notice that KoVD can be expressed in terms of the classification error: it has a direct relation to the Bayes Error Probability, which is its main advantage, unlike other probabilistic distances that only provide a bound on the error. However, KoVD requires an estimate of a probability density function and its numerical integration, which can restrict its usefulness in many practical situations [6].
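The relation between the two-class Bayes error and KoVD can be verified numerically. The sketch below (our own grid discretization, not the paper's code) computes both sides for two equal-prior Gaussian classes:

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]

def gauss(mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p1 = p2 = 0.5                               # equal priors
g1, g2 = gauss(-1.0, 1.0), gauss(1.0, 1.0)  # class-conditional densities

bayes_error = np.sum(np.minimum(p1 * g1, p2 * g2)) * dx  # two-class Bayes error
j_kovd = np.sum(np.abs(p1 * g1 - p2 * g2)) * dx          # KoVD integral
# bayes_error equals (1 - j_kovd) / 2 up to discretization error
```

For unit-variance classes centered at ±1 with equal priors, the Bayes error is Φ(−1) ≈ 0.1587, which both expressions reproduce.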

B. KoVD Based on Gaussian Mixture Model
The Kolmogorov Variational Distance based on the Gaussian Mixture Model is obtained by modeling each class-conditional density as a mixture:

P(x|ωi) = Σ_{k=1}^{K_i} π_k^{(i)} N(x | µ_k^{(i)}, Σ_k^{(i)})    (12)

With equations (1) and (12), the KoVD can be expressed as:

J_KoVD(ω1, ω2) = ∫ | P(ω1) Σ_{k=1}^{K_1} π_k^{(1)} N(x | µ_k^{(1)}, Σ_k^{(1)}) − P(ω2) Σ_{k=1}^{K_2} π_k^{(2)} N(x | µ_k^{(2)}, Σ_k^{(2)}) | dx    (13)

To compute our cost function J_KoVD(ω1, ω2) from equation (13), we need to estimate the following parameters: the number of clusters K, the covariance matrices Σ, the means µ, and the mixing coefficients π.
With GMM the challenge is the estimation of the parameters π, µ, Σ, and K. The first three parameters can be estimated with the Expectation-Maximization (EM) algorithm [16], while K, the number of components, is user-defined and has to be given a priori. Choosing the right value for K is crucial since it has a direct effect on the estimation of the covariance matrix: with a bad choice of K, ill-conditioned covariance matrices can be formed and the Curse of Dimensionality can no longer be avoided [5] [2].
To overcome this challenge, we pursue two approaches to define the optimal value of the parameter K. The first one is based on the Bayes Information Criterion (BIC) [31]. BIC is a popular measure for comparing maximum likelihood models, and the model with the smallest value is the preferred one [32] [33]. BIC was introduced by [34] and is defined as:

BIC = −2 ln(L) + k ln(N)    (14)

with L the maximized likelihood of the model, k the number of estimated parameters, and N the number of observations.
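Selecting K by minimizing BIC can be sketched with scikit-learn's `GaussianMixture` (a stand-in for the paper's Matlab implementation; the synthetic data are ours). Fitting a model per candidate K and keeping the smallest BIC recovers the true number of components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from two well-separated Gaussian components
data = np.concatenate([rng.normal(-5, 1, 500),
                       rng.normal(5, 1, 500)]).reshape(-1, 1)

# fit a GMM for each candidate K and keep the model with the smallest BIC
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
       for k in range(1, 6)}
best_k = min(bic, key=bic.get)  # expected to be 2 for this data
```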
The second approach is the Robust Expectation-Maximization (REM) algorithm [22]. The main advantage of this algorithm is its ability to find an optimal number of clusters K automatically, so the number of components no longer has to be defined a priori. REM also solves the initialization issue of the standard EM algorithm, namely the problem of choosing cluster centers. At first, the Robust Expectation-Maximization algorithm uses all data points as centers; from there it automatically reaches an optimal number of clusters by discarding the clusters that do not meet the required criteria (see Fig. 2). For more details about the algorithm, see [22].

C. Regularization Problem
For the estimation of the covariance matrix in hyperspectral imaging, the "Hughes phenomenon" and the singularity problems [25] are usually caused by small sample size datasets, and by partitioning an already small dataset we can easily end up with an ill-conditioned mixture model [35]. For each component, the sample size must not be less than the dimensionality of the data [25], since the covariance matrix should be invertible in order to compute equation (7). For the Gaussian Mixture Model, the curse of dimensionality is primarily related to the estimation of the covariance matrix [36], and regularization techniques are one way around this problem:

1) Leave One Out Covariance (LOOC):
To avoid the singularity problem, the LOOC estimator can be used to regularize the covariance matrix [37] [25] [3] [36]. Let S and diag(S) be respectively the covariance matrix and its diagonal version:

Σ_i(α_i) = (1 − α_i) diag(S_i) + α_i S_i    (15)

The LOOC estimator evaluates several values of α_i, and the value that maximizes the average log likelihood of the Gaussian density is the optimal choice [37]. In our case, since we are using an iterative approach to select bands, using this regularization technique as described by equation (15) adds to the complexity of the algorithm and to the computation time.
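The mixing of a covariance matrix with its diagonal version can be sketched as a simple shrinkage step; this is a simplified illustration in the spirit of equation (15), not the full LOOC estimator of [37] (names are ours):

```python
import numpy as np

def shrink_covariance(S, alpha):
    """Mix the sample covariance with its diagonal version:
    (1 - alpha) * diag(S) + alpha * S.
    A simplified diagonal-shrinkage regularizer, not the full LOOC estimator."""
    return (1 - alpha) * np.diag(np.diag(S)) + alpha * S

# a rank-deficient sample covariance becomes invertible after shrinkage
S = np.array([[1.0, 1.0],
              [1.0, 1.0]])          # singular: det(S) = 0
reg = shrink_covariance(S, 0.5)     # det(reg) = 0.75 > 0
```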

2) Maximum Entropy Covariance Selection (MECS):
The MECS method deals directly with singular and unstable covariance matrices; rather than optimizing the group likelihood or the classification accuracy, MECS maximizes the information under an incomplete and consequently uncertain context [38]. We are particularly interested in this method since, according to [38], no optimization procedure is required whenever covariance matrices are ill-posed or poorly estimated, and it has a much lower computational cost while performing as well as any other method.

IV. EXPERIMENTAL RESULTS

B. Experimental Setup
The band selection approach using the Kolmogorov Variational Distance was tested with the following hardware setup: a 64-bit PC (i7, 2.20 GHz) with 6 GB RAM running Matlab (R2014a). The experiments were run on four benchmark hyperspectral images: the Indian Pines (92AV3C), Salinas, Kennedy Space Center, and Botswana datasets. For classification purposes, each dataset is split into two halves for training/testing. The selected bands are fed to a classifier in order to show their classification performance. The classifier used is an SVM, through the LIBSVM library, with an RBF kernel function and the grid search technique to find the C and γ parameters [40].
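The classification stage can be sketched with scikit-learn in place of LIBSVM/Matlab (the two share the same underlying library); the synthetic feature matrix below merely stands in for the pixels of the selected bands, and the parameter grids are illustrative, not the paper's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# synthetic feature matrix standing in for pixels of the selected bands
X, y = make_classification(n_samples=400, n_features=5, n_informative=4,
                           n_redundant=0, random_state=0)
# half/half training/testing split, as in the experimental setup
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# grid search over C and gamma for the RBF-kernel SVM
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=3)
grid.fit(X_tr, y_tr)
accuracy = grid.score(X_te, y_te)
```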

C. Results and Discussions
To evaluate our proposed approach, tests were first run on the benchmark dataset Indian Pines, as this scene has often been used in various studies such as [25] [39] [37] [2]. In the first experiment, the pertinence of each band is measured with the KoVD criterion. In previous studies on the Indian Pines dataset [39] [41] [42], the bands [104−108, 150−163, 220] were reported to lie in the water absorption region, carrying no useful information, just noise; as can be seen in Fig. 7 and Fig. 8, these bands get the lowest values. Hence, band selection with KoVD can successfully measure the pertinence of a band and discard those with no valuable information from the selection process.
For the second experiment, the goal is to measure the performance of the KoVD band selection algorithm with just the first two selected bands, and to answer the question of whether KoVD modeled with GMM can separate classes successfully or not. For easier visual inspection, the experiment was carried out on a portion of the benchmark dataset Indian Pines, working with 4 classes instead of 16, similar to [39] [41]. The data, as seen in Fig. 9, are highly correlated; nonetheless, we were able to separate one class from the rest with just 2 bands out of 220, with an SVM classification score of 81.92%. The other classes remain correlated, hence the need to add more bands to achieve the desired result. For this sub-scene, a classification accuracy of 93.81% with SVM is achieved with only five bands, and a classification score of 97.04% at dimension thirty-six. Thus, the KoVD criterion modeled with GMM can be used as a class separability measurement for band selection.
In the next step, we compare the performance of KoVD against its peers - the Mutual Information, the Divergence, and the Bhattacharyya distances - in terms of classification score and number of retained bands. Due to the complexity of the datasets, all the probabilistic distances were computed through the Gaussian Mixture Model. The probability estimation is computed with the GMM-BIC and GMM-REM approaches, while SVM is used as the classifier.
In Fig. 10, 11, 12 and 13 we notice that, for the Indian Pines dataset, the SVM classification score for the bands selected with KoVD is better than for those selected with the Bhattacharyya and Divergence distances. Compared to the Mutual Information, KoVD is slightly better in terms of classification accuracy; in fact, the curves almost overlap each other. Depending on the number of selected bands, on how well the GMM was estimated, on how well the classifier parameters were chosen, and on the dataset itself (how correlated it is and how its post-treatment handled outliers), KoVD performs best at times and MI at others. According to Fig. 10, 11, 12 and 13, the results are close to each other, and the margin between the classification curves of the bands selected with the two distances is not wide enough to conclude on the superiority of one over the other. Therefore, it is hard to decide which of the two distances is the best. Thus, we can conclude that, in our setup, KoVD performs as well as MI, and both of them perform better than the Divergence and Bhattacharyya Distances.

V. CONCLUSION
In this paper, a novel band selection approach based on the Kolmogorov Variational Distance for hyperspectral image classification was introduced. The first experiment, performed on the Indian Pines dataset, demonstrated the efficiency and reliability of the KoVD criterion as a similarity measure: KoVD can measure the pertinence of a band, so given a hyperspectral image dataset we can select the optimal bands while discarding those with no relevant information. This study was inspired by our previous work on MI, BD, and DD; we were therefore particularly interested in how KoVD performs against these distances in terms of the number of bands retained and the classification accuracy. The experimental study showed that KoVD performs better than BD and DD, while against MI the results were very close; in the current setup, it is hard to decide which one is the best, and we thus conclude that KoVD performs as well as MI.