Support Kernel Classification: A New Kernel-Based Approach

In this paper, we introduce a new classification approach that learns class-dependent Gaussian kernels and the belongingness likelihood of the data points with respect to each class. The proposed Support Kernel Classification (SKC) is designed to characterize and discriminate between the data instances from the different classes. It relies on the minimization of the intra-class distances and the maximization of the inter-class distances to learn the optimal Gaussian parameters. In fact, a novel objective function is proposed to model each class using one Gaussian function. The experiments conducted using synthetic datasets demonstrated the effectiveness of the proposed algorithm. Moreover, the results obtained using real datasets show that the proposed classifier outperforms the relevant state-of-the-art approaches.
Keywords: supervised learning; classification; kernel-based learning


I. INTRODUCTION
Classification finds applications in many real-world problems related to different fields. Such applications include, for example, analyzing customer data in the areas of commerce [1][2], detecting fraud to benefit industry and government [3], improving the learning process in education [4], predicting the climate for crop production [5], and assisting doctors in detecting anomalies in healthcare [6]. In order to solve these classification problems, many approaches have been reported in the literature [7]. However, most approaches assume that the different categories can be separated using linear boundaries, and thus, they are effective only when the data has simple geometric characteristics with well-separated categories.
Kernel-based approaches [8] have been proposed as an alternative solution. They map the data into a new feature space in such a way that categorizing classes with complex boundaries can be reduced to a simple categorization problem in the new feature space. Nevertheless, the choice of an optimal kernel that allows the different categories of the data to be separated linearly is a challenging problem [9]. The most commonly used kernel is the Gaussian kernel due to its statistical and geometrical properties [10]. However, it is sensitive to the choice of the Gaussian parameters. An exhaustive search of these parameters requires training the classifier many times to consider different possible values. In addition to the problems related to the selection of the set of possible values, and to the time complexity, the exhaustive search may also lead to an over-fitting problem. In fact, the Gaussian parameters are selected based on the value of a criterion function that is computed on the training data [11][12][13]. Moreover, a global parameter over the entire data may be inappropriate when the characteristics of the different categories vary widely.
In this paper, we propose a novel classification algorithm named the Support Kernel Classifier (SKC). It categorizes the data by learning a Gaussian kernel for each category. SKC is designed to learn the Gaussian parameters from the intrinsic geometric characteristics of the data. More specifically, the kernel parameters are learned by minimizing the intra-class distances and maximizing the inter-class distances simultaneously. Moreover, the proposed classifier learns the probability that each data point belongs to each class. In fact, it does not use a crisp assignment, where an instance either belongs to a class or does not, but rather learns the likelihood that the instance belongs to it. This is intended to better describe the data. It also helps to avoid the over-fitting problem.
The rest of the paper is organized as follows: In Section II, we present the related works. Section III describes the proposed approach. The experimental results and analysis are outlined in Section IV. Finally, we conclude the paper and highlight the future works in Section V.

II. RELATED WORKS
In this review, we focus on the main classification approaches based on the Gaussian kernel. More specifically, we focus on the approaches that learn the Gaussian parameters such as the support vector machine [15], the Gaussian mixture [16], and the radial basis function neural network [17].

A. Parameters Selection for the Support Vector Machine
The typical SVM [15] algorithm was extended by mapping the input data vectors into a high-dimensional feature space [18]. This mapping can be obtained by using a kernel function [19]. Thus, the SVM discriminant function becomes:

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \qquad (1)$$

where $x_i$ and $y_i \in \{-1, +1\}$ are the training instances and their labels, $\alpha_i$ are the learned multipliers, and $b$ is the bias. Although any kernel function can be used, the Gaussian kernel is widely used. In its common form, it is defined as:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \qquad (2)$$

The choice of the Gaussian parameter, $\sigma$, affects the SVM performance. As shown in Fig. 1, when $\sigma$ is too small, the discriminant function surrounds each data point, which may lead to an over-fitting problem. Yet, if $\sigma$ is too large, the discriminant function surrounds all points, which amounts to mapping all points into a single one [20]. Since the value of the Gaussian parameter has a large impact on the SVM classification results, several approaches have been proposed in the literature to determine this parameter.
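The effect of the Gaussian parameter described above can be checked numerically. The following is a minimal sketch, assuming the common form $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$; the function name `gaussian_kernel_matrix` and the toy points are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Return K with K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Three illustrative 2-D points (hypothetical data).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Very small sigma: every point is similar only to itself (K is close to the identity).
K_small = gaussian_kernel_matrix(X_demo, sigma=0.05)

# Very large sigma: all points look alike (K is close to a matrix of ones).
K_large = gaussian_kernel_matrix(X_demo, sigma=100.0)
```

With a near-identity kernel matrix the classifier can only memorize individual training points, whereas a near-constant kernel matrix carries no discriminative information, which is the trade-off the parameter selection methods in this section try to resolve.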
The authors in [21] proposed a Gaussian parameter selection approach for the outlier detection problem. They used the dual function as a criterion to find the optimal kernel parameters; this criterion is referred to as (3). Similarly, the authors in [20] used a dual function as an extension of the criterion in (3) to define their objective function.
Inspired by the "Fisher linear discrimination" (FLD) [11], the authors in [22] proposed to find the Gaussian parameters using an objective function that minimizes the intra-class distances and maximizes the inter-class distances. The proposed criterion is defined in terms of the means of the two classes and the coincidences of the two classes.
The authors in [23] proposed an approach to learn the Gaussian parameters based on maximizing the "kernel target alignment" (KTA) objective function [12]. KTA maximizes the intra-class similarities and minimizes the inter-class similarities using the following expression, given here in its standard form [12]:

$$A(K, yy^{T}) = \frac{\langle K, yy^{T} \rangle_F}{\sqrt{\langle K, K \rangle_F \, \langle yy^{T}, yy^{T} \rangle_F}}$$

where $K$ is the kernel matrix and $y \in \{-1, +1\}^{N}$ is the vector of class labels. Note that $\langle K, yy^{T} \rangle_F$ is the difference between the intra-class similarities and the inter-class similarities as defined below:

$$\langle K, yy^{T} \rangle_F = \sum_{y_i = y_j} k(x_i, x_j) - \sum_{y_i \neq y_j} k(x_i, x_j)$$

where $k$ is the Gaussian kernel function. The optimal $\sigma$ is obtained by maximizing the objective function in (9). The optimization problem formulated in (9) is solved by computing the partial derivative with respect to $\sigma$ and setting it to zero.
For the SVM-based approaches, various criteria have been proposed to select the optimal Gaussian parameters [20][21][22]. These approaches are similar to the exhaustive search. In contrast, the approach in [23] learns the Gaussian parameter by maximizing the intra-class similarities and minimizing the inter-class similarities. However, this approach is suitable for two-class problems only.
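To make the kernel target alignment criterion concrete, the sketch below computes the standard alignment between a Gaussian kernel matrix and the ideal kernel $yy^{T}$ for two-class labels in $\{-1, +1\}$, and picks the best value from a small candidate list. This is only an illustration: the grid search stands in for the derivative-based solution of [23], and the function names, the candidate values, and the $2\sigma^2$ scaling of the kernel are assumptions.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix, assuming the exp(-d^2 / (2 sigma^2)) form."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_target_alignment(K, y):
    """Alignment between K and the ideal kernel y y^T (labels in {-1, +1})."""
    y = np.asarray(y, dtype=float)
    target = np.outer(y, y)
    # Frobenius inner product: intra-class similarities minus inter-class ones.
    numerator = np.sum(K * target)
    denominator = np.sqrt(np.sum(K * K) * np.sum(target * target))
    return numerator / denominator

def select_sigma_by_alignment(X, y, candidates):
    """Return the candidate sigma with the highest alignment (grid search)."""
    scores = [kernel_target_alignment(gaussian_kernel(X, s), y) for s in candidates]
    return candidates[int(np.argmax(scores))], scores
```

On well-separated two-class data, a very small value scores poorly because the kernel matrix approaches the identity, and a very large value scores poorly because intra-class and inter-class similarities cancel out.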

B. Parameters Selection for the Gaussian Mixture Models
A typical Gaussian Mixture Model (GMM) classifier [16] relies on the Bayesian framework, Gaussian probabilistic modelling, and the expectation maximization (EM) algorithm [13]. In particular, it assumes that the data can be modelled as a mixture of a finite number of Gaussian functions. GMMs compute the probability density functions (PDFs), $P(X \mid C_i)$ [24], for the data instance $X$ given the class $C_i$. Then, they classify the test instances with Bayes' rule [25] using these PDFs as:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \qquad (11)$$

where $P(C_i)$ is the class prior probability and $P(X)$ serves as a normalization term. Note that the GMM assumes that the probability density functions, $P(X \mid C_i)$, are a weighted sum of multiple Gaussians as:

$$P(X \mid C_i) = \sum_{k=1}^{M} w_k \, g(X; \mu_k, \Sigma_k) \qquad (12)$$

In (12), $M$ is the number of Gaussians and $w_k$ is the weight associated with the $k$th Gaussian, constrained to:

$$\sum_{k=1}^{M} w_k = 1, \quad w_k \geq 0 \qquad (13)$$

The $k$th Gaussian is formulated as:

$$g(X; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (X - \mu_k)^{T} \Sigma_k^{-1} (X - \mu_k)\right) \qquad (14)$$

where $\mu_k$ and $\Sigma_k$ are the mean and the covariance, respectively, and $d$ is the dimensionality of $X$. The GMMs can be defined through three parameters. Namely, the set of means, $\{\mu_k\}$, the set of covariances, $\{\Sigma_k\}$, and the weights, $\{w_k\}$, represent the model parameters. EM [13] is the iterative optimization approach typically used to estimate these parameters.
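The classification rule described above (a per-class Gaussian mixture density combined with class priors through Bayes' rule) can be sketched as follows. This is a minimal illustration with hand-set mixture parameters; in practice they would be estimated with EM. All function names and the tuple layout `(weights, means, covs)` are hypothetical choices.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at a single point x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def mixture_pdf(x, weights, means, covs):
    """Weighted sum of Gaussian densities: the class-conditional P(X | C_i)."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

def class_posteriors(x, class_models, priors):
    """Bayes' rule: P(C_i | X) is proportional to P(X | C_i) * P(C_i)."""
    likelihoods = np.array([mixture_pdf(x, *model) for model in class_models])
    joint = likelihoods * np.asarray(priors, dtype=float)
    return joint / joint.sum()  # dividing by P(X) normalizes the posteriors
```

A test instance is then assigned to the class with the largest posterior, for example `int(np.argmax(class_posteriors(x, models, priors)))`.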
In order to estimate the three parameters (the means, the covariances, and the weights), the Maximum Likelihood Estimation (MLE) algorithm [26] can be used.
Let $X = \{x_1, \dots, x_N\}$ be a set of data points and let $P(x_j \mid \theta_c)$ be the conditional probability of $x_j$ belonging to cluster $c$ defined by $\theta_c = (\mu_c, \Sigma_c)$, where $\mu_c$ is the centroid of the cluster and $\Sigma_c$ its covariance matrix. One should note that the set of means is initialized using the K-means clustering algorithm [14].
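GMM [24] defines the total probability distribution of $x_j$ as:

$$P(x_j) = \sum_{c=1}^{C} w_c \, P(x_j \mid \theta_c) \qquad (15)$$

where $C$ is the number of clusters and $w_c$, which represents the ratio of the number of data instances in the cluster $c$, is computed as:

$$w_c = \frac{N_c}{N} \qquad (16)$$

where $N_c$ is the number of instances assigned to the cluster $c$. GMM [27] optimizes the log-likelihood of the total probability distribution of $X$:

$$\log L = \sum_{j=1}^{N} \log \sum_{c=1}^{C} w_c \, P(x_j \mid \theta_c)$$

The probability that the instance $x_j$ belongs to cluster $c$ is defined as:

$$\gamma_{jc} = \frac{w_c \, P(x_j \mid \theta_c)}{\sum_{c'=1}^{C} w_{c'} \, P(x_j \mid \theta_{c'})}$$

In fact, $\gamma_{jc}$ is considered as the membership of the data instance $x_j$ to the cluster $c$ such that $\sum_{c=1}^{C} \gamma_{jc} = 1$. The number of data points assigned to cluster $c$ can be expressed as $N_c = \sum_{j=1}^{N} \gamma_{jc}$. The covariance matrix, $\Sigma_c$, is then defined as:

$$\Sigma_c = \frac{1}{N_c} \sum_{j=1}^{N} \gamma_{jc} \, (x_j - \mu_c)(x_j - \mu_c)^{T}$$

Similarly to the MLE [26] approach, the maximum a posteriori (MAP) approach [28] computes the GMM parameters. However, it estimates $\theta$ by maximizing the posterior probability function rather than the likelihood function. More specifically, MAP finds the $\theta$ that maximizes $P(\theta \mid X)$. MAP assumes that $\theta$ is a random variable with a prior distribution $P(\theta)$. In fact, it relies on Bayes' rule:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}$$

Note that $P(x_j \mid \theta_c)$ represents the conditional probability that an instance belongs to a cluster defined by $\theta_c = (\mu_c, \Sigma_c)$, where $\mu_c$ is the centroid of the cluster and $\Sigma_c$ is its covariance matrix. For each mixture in the prior model, MAP [28] defines the posterior probability $P(\theta \mid X)$ through this relation. The optimal $\theta$ is defined by:

$$\theta_{MAP} = \arg\max_{\theta} P(\theta \mid X)$$

Since $P(X)$ does not depend on $\theta$, one can write:

$$\theta_{MAP} = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)$$

If $\theta$ follows a normal distribution in which $\mu$ is random and $\sigma$ is fixed, the posterior over $\mu$ can be written in closed form, with its two parameters given in (32). Consequently, it follows a normal distribution with these two quantities as parameters.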
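The maximum likelihood updates described above (memberships, effective cluster sizes, weights, and covariances) correspond to one iteration of the standard EM procedure for a Gaussian mixture. The sketch below is a minimal version under that assumption; it also re-estimates the means, which the standard algorithm does even though the text above does not spell that step out. The function name `em_step` and its return layout are illustrative.

```python
import numpy as np

def em_step(X, weights, means, covs):
    """One EM iteration for a Gaussian mixture; also returns the log-likelihood
    of the parameters that were passed in."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    N, d = X.shape
    C = len(weights)

    # E-step: density of every point under every cluster.
    dens = np.empty((N, C))
    for c in range(C):
        diff = X - means[c]
        inv = np.linalg.inv(covs[c])
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(covs[c]))
        dens[:, c] = np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)) / norm
    weighted = dens * weights
    total = weighted.sum(axis=1, keepdims=True)
    resp = weighted / total              # membership of point j in cluster c
    log_lik = float(np.sum(np.log(total)))

    # M-step: effective counts, weights, means, and covariances.
    Nc = resp.sum(axis=0)
    new_weights = Nc / N
    new_means = (resp.T @ X) / Nc[:, None]
    new_covs = np.empty((C, d, d))
    for c in range(C):
        diff = X - new_means[c]
        new_covs[c] = (resp[:, c, None] * diff).T @ diff / Nc[c]
    return new_weights, new_means, new_covs, resp, log_lik
```

Calling `em_step` repeatedly, feeding each output back in as input, gives a log-likelihood that never decreases. This sketch has no covariance regularization, so it can fail if a cluster collapses onto a single point.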
One can claim that MLE [26] and MAP [28] are efficient approaches that provide interpretable results. However, MLE-based solutions [26] are prone to over-fitting [33]. On the other hand, MAP [28] addresses the over-fitting problem through the assumption that the parameters of the Gaussian distribution that fits the data are known.

C. Parameters Selection for Radial Basis Function Network (RBFN)
The Radial Basis Function Network (RBFN) is a particular neural network where Gaussian functions are used as activation functions [29]. Besides, the network output is a combination of Gaussian functions of the inputs:

$$y(x) = \sum_{k=1}^{M} w_k \, \phi_k(x) \qquad (34)$$

where $y(x)$ is the output corresponding to the input $x$, $w_k$ are the weights, and $\phi_k$ is the Gaussian function characterized by the parameters $(\mu_k, \sigma_k)$. Figure 2 displays the architecture of an RBFN.
Training the RBFN involves learning the optimal weights $w_k$. These weights are learned using gradient descent. Therefore, the iterative learning process requires deriving the training error, which is defined as:

$$E = \frac{1}{2} \sum_{j=1}^{N} (t_j - y_j)^2 \qquad (35)$$

where $t_j$ is the target label and $y_j$ is the output label.
The authors in [31] used gradient descent to iteratively learn the Gaussian parameters. Let $V$ be the vector including the means $\mu$, the standard deviations $\sigma$, and the set of weights $w$, respectively. The update equation of $V$ is defined as follows:

$$V^{(t+1)} = V^{(t)} - \eta \, \frac{\partial E}{\partial V} \qquad (36)$$

where $\eta$ is the learning rate.
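As an illustration of the gradient-descent training described above, the sketch below computes the RBFN output as a weighted sum of Gaussian activations and updates only the output weights, keeping the centers and widths fixed. It assumes a sum-of-squares training error and the $\exp(-d^2 / (2\sigma^2))$ activation; the approach in [31] additionally updates the means and standard deviations, which is omitted here for brevity. All names are hypothetical.

```python
import numpy as np

def rbf_features(X, centers, sigmas):
    """Gaussian activations phi[j, k] = exp(-||x_j - mu_k||^2 / (2 sigma_k^2))."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))

def rbfn_predict(X, centers, sigmas, w):
    """Network output: weighted combination of the Gaussian activations."""
    return rbf_features(X, centers, sigmas) @ w

def train_rbfn_weights(X, t, centers, sigmas, lr=0.1, n_iter=500):
    """Gradient descent on E = 0.5 * sum_j (t_j - y_j)^2 with respect to w only."""
    Phi = rbf_features(X, centers, sigmas)
    t = np.asarray(t, dtype=float)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = Phi @ w
        grad = -Phi.T @ (t - y)   # dE/dw
        w = w - lr * grad         # w <- w - eta * dE/dw
    return w
```

The fixed learning rate is only suitable for small, well-scaled problems; a step that is too large makes the iteration diverge.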
In [32], the researchers proposed learning the Gaussian parameters based on the intra-class and inter-class structures in the training data. Specifically, the mean of each class is computed. Then, a distance is defined for each class as the distance between the class mean and the furthest sample belonging to that class.
The second step consists in computing the distance between each mean and the closest mean to it. Given a confidence parameter , the width of class k, is: The overlap between class k and class l is: (40) where is a factor that controls the overlap between the classes. The Gaussian parameter, σ, with respect to class k, is defined in [32] as the largest value between and : As the choice of is not straightforward, the authors in [32] suggested the following approximation: The value of the Gaussian parameter, σ, is then updated using gradient descent by deriving the training error defined in (35).
These approaches may be prone to local minima. The approach in [32] tries to avoid the problem by suggesting a way to initialize the parameter based on the intra-class and inter-class similarities. However, the suggested approach requires the estimation of other parameters.

III. THE PROPOSED SUPPORT KERNEL CLASSIFICATION
Kernel classification approaches are intended to categorize the data by mapping it into a new feature space. This mapping reduces the complex classification task to a simpler problem in the new feature space. The Gaussian kernel function is commonly used due to its analytical characteristics. However, the performance of the Gaussian kernel based classifiers depends on the setting of the Gaussian parameters. In this work, we propose a new kernel-based classification approach where each class is modeled using a Gaussian function. The optimal Gaussian parameters are learned by optimizing novel objective functions.
Let a Gaussian function be defined in terms of a scaling parameter, $\sigma$, and the distance, $d(x_j, x_k)$, between the data points $x_j$ and $x_k$. In this work, we use the squared Euclidean distance defined as:

$$d(x_j, x_k) = \|x_j - x_k\|^2$$

The optimal set of Gaussian parameters is obtained by minimizing the intra-class distances and maximizing the inter-class distances. More specifically, the proposed approach formulates the intra-class and inter-class objective functions in (45) and (46), respectively. In (45) and (46), $N$ represents the number of observations in the training set, $m$ is a constant (the fuzzifier) that determines the degree of overlapping between classes, a regularization term is included, and a membership value expresses the likelihood that the observation $x_j$ belongs to the class $i$. Note that, as stated in (47), these likelihoods lie in $[0, 1]$ and, for each observation $x_j$, $j = 1, \dots, N$, they sum to one over the $C$ classes, where $C$ is the number of classes. Notice that the regularization term is integrated in (45) in order to avoid the trivial solution of a large scaling parameter, $\sigma_i$, which would map all data instances into one single point. On the other hand, the regularization term in (46) is intended to avoid the trivial solution of a scaling parameter equal to zero. Besides, a dedicated parameter ensures the trade-off between the minimization of (45) and the maximization of (46). Note that this parameter is also learned through the optimization of (45) and (46).
The distance between the data points $x_j$ and $x_k$ in the new feature space is defined in (48). The Gaussian parameters are defined with respect to each class in order to better handle the distribution and the geometric characteristics of each class. The proposed approach learns the scaling parameter $\sigma_i$ for each class $i$ and the likelihood for each observation $x_j$ to belong to class $i$ using the given training set. Moreover, the objective functions in (45) and (46) are based on the relational distances between pairs of data instances rather than the distance between the data instances and the classes. This relaxes the assumption that each class fits a spherical shape [34]. In fact, the objective functions in (45) and (46) do not use the class means/centroids. In the proposed approach, the mean is used only in the testing phase. It is computed after learning $\sigma_i$ of each class $i$.
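To illustrate how a class-specific scaling parameter reshapes pairwise distances, the sketch below computes the squared Euclidean distances between all pairs of points and then the distance induced by a Gaussian kernel in its feature space, using the standard identity $\|\phi(x_j) - \phi(x_k)\|^2 = 2 - 2K(x_j, x_k)$. The kernel form $K = \exp(-d / \sigma)$ is an assumption made for this illustration, and the result is not claimed to match the paper's definition in (48).

```python
import numpy as np

def pairwise_sq_euclidean(X):
    """Matrix of squared Euclidean distances between all pairs of rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def kernel_induced_sq_distance(D, sigma):
    """Feature-space distance 2 - 2 K, assuming K = exp(-D / sigma)."""
    return 2.0 - 2.0 * np.exp(-np.asarray(D, dtype=float) / sigma)
```

Under this assumed form, a very large scaling parameter drives all induced distances towards zero (all points collapse onto one), while a value close to zero drives every off-diagonal distance towards its maximum of 2. This mirrors the two trivial solutions that the regularization terms are meant to exclude.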

A. Optimization with Respect to the Membership Likelihoods
In order to optimize (45) and (46) with respect to the membership likelihoods, we use the relational dual described in [35]. It defines, in (49), the relation between the relational distance and the distance between the point $x_j$ and the class $i$, using the membership probabilities. Using the relational dual, we rewrite (45) and (46) as a system of equations. In order to optimize them with respect to the membership likelihoods, subject to the constraint in (47), we use Lagrange multipliers to obtain (51), where the point-to-class distance is as defined in (49) and $\lambda$ is the Lagrange multiplier. Setting the derivatives of the set of equations in (51) with respect to the membership likelihoods to zero yields (52), which results in the membership update in (56).
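The idea behind a relational dual can be shown numerically: for squared Euclidean distances, the distance from a point to a class center can be computed from the pairwise distance matrix and the memberships alone, without forming the center. The sketch below uses the form known from relational fuzzy c-means, with memberships raised to a power `m` and normalized. Whether (49) has exactly this form is an assumption, and the function name is hypothetical.

```python
import numpy as np

def relational_point_to_class_distance(R, u, m=2.0):
    """Distance from every point to one class, from pairwise distances only.

    R : (N, N) matrix of squared Euclidean distances between data points.
    u : (N,) memberships of the points in the class.
    """
    R = np.asarray(R, dtype=float)
    v = np.asarray(u, dtype=float) ** m
    v = v / v.sum()                 # normalized membership weights
    Rv = R @ v
    # Equals ||x_j - c||^2 for the implicit center c = sum_k v_k x_k.
    return Rv - 0.5 * (v @ Rv)
```

The result coincides with the squared distance to the membership-weighted mean, which is why objective functions written on pairwise distances can avoid explicit class centroids during training.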

B. Optimization with Respect to the Scaling Parameters
In order to optimize (45) and (46) with respect to $\sigma_i$, we derive them with respect to $\sigma_i$ and set the derivatives to zero. First, we set the derivative of (45) to zero, which yields (58). Substituting (58) in (46) gives (59). Setting the derivative of (59) with respect to $\sigma_i$ to zero results in the update in (61). Based on an iterative optimization approach and the assumption that the scaling parameters and the membership likelihoods do not change significantly from one iteration to another, they can be updated alternately using (61) and (56), respectively. Once the scaling parameters and the membership likelihoods are optimized with respect to each class $i$, we define the prototype for each category. Then, it can be used during the testing phase in order to predict the class value for any unlabeled data point. Along with the standard deviation $\sigma_i$, we propose to use the mean of each class as its prototype, which is computed using the quantity defined in (49). The proposed training algorithm alternates the two updates for up to a maximum number of iterations and then computes the class prototypes. The resulting set of Gaussian parameters defines the model with respect to each class $i$.
Using the learned models from the training set, we classify the unlabeled data point using the distance in (66). Equation (66) represents the distance between the test data point and the class center, as defined in (48). Note that the parameters were learned during the training phase. The proposed SKC testing algorithm applies this rule to each unlabeled data point.

IV. EXPERIMENTS
In order to assess the performance of the proposed approach, we conducted several experiments using both synthetic and real datasets. The synthetic datasets are 2-D datasets generated to represent different geometric characteristics. They were used to illustrate visually how the proposed classifier categorizes them. Moreover, they were intended to analyze and interpret the learned Gaussian parameters. Besides, the proposed approach was evaluated using real benchmark datasets. Specifically, 10 datasets from the UCI repository [46] were used to analyze the performance of the proposed approach. Namely, these datasets are: Handwritten Digits [36], Mammographic Mass [37], E.coli [38], Haberman's Survival [39], Frogs MFCCs [40], Blood Transfusion Service Center [41], HCC Survival [42], Adolescent Autistic Spectrum Disorder Screening [43], Libras Movement [44], and Seeds of wheat [45]. Table I summarizes the considered real datasets.

A. Experiments using Synthetic Datasets
In order to show that SKC succeeds in learning the optimal Gaussian parameters for each class while accurately classifying the data instances, we set the fuzzifier m to 2 and the maximum number of iterations to 100. Then, we ran SKC on the synthetic 2-D dataset in Fig. 3. As can be seen, the dataset includes 3 classes with the same intrinsic characteristics. However, class 1 and class 2 exhibit low inter-class distances, while they show large inter-class distances with class 3. SKC classifies dataset 1 correctly, as shown in Fig. 3(b). It learns two similar Gaussian parameters for class 1 and class 2 ($\sigma_1$ = 0.0002 and $\sigma_2$ = 0.0001), and a larger Gaussian parameter ($\sigma_3$ = 0.0050) for class 3. Indeed, $\sigma_1$ is not so large that points from class 2 get assigned to class 1. Similarly, $\sigma_2$ is not so large that points from class 1 are labeled as class 2. On the other hand, $\sigma_3$ is relatively larger because the intra-class distances of class 3 are larger. The Gaussian Mixture Models (GMM) based classification [16] is the most similar approach to SKC because it learns a Gaussian mixture for each class. Therefore, we compare the classification results obtained using SKC to those achieved using GMM on various datasets.
As shown in Fig. 4(a), the dataset has three classes, where class 1 and class 2 have similar size and density while class 3 is larger and less dense. SKC succeeds in learning the optimal Gaussian parameters for each class ($\sigma_1 = 3.35 \times 10^{-4}$, $\sigma_2 = 8.07 \times 10^{-4}$, and $\sigma_3 = 9.31 \times 10^{-1}$), and accurately classifies dataset 2 as shown in Fig. 4(b). In fact, the classification problem gets easier if the classes have similar volume and density. Such performance is attained through the Gaussian parameters learned by SKC, where a larger $\sigma_3$ allows shrinking class 3 so that it is less sparse and has a volume comparable to class 1 and class 2. Since class 1 and class 2 have comparable intra-class and inter-class distances, similar Gaussian parameters are learned by SKC. On the other hand, as reported in Fig. 4(c), GMM is not able to classify dataset 2 accurately.
Another synthetic dataset is shown in Fig. 5(a). As one can notice, despite the good classification results obtained using SKC, some border points are misclassified, as reported in Fig. 5(b). On the other hand, as shown in Fig. 5(c), GMM yields poor classification results because it learns a larger Gaussian parameter for class 1 compared to class 2, which results in similar density and volume for both classes. Fig. 6 reports the classification results obtained by SKC and GMM using a different synthetic dataset. Although the inter-class distances are too small for the border points, SKC classifies this dataset correctly, as displayed in Fig. 6(b). In fact, SKC learns double the value of $\sigma$ for class 1 to shrink it more than class 2. This yields a considerable separation between both classes. On the other hand, as reported in Fig. 6(c), GMM misclassifies the border points, which degrades the overall classification performance.
Similarly, for the synthetic dataset in Fig. 7, both classifiers misclassify some border points from class 2 whose inter-class distances to class 1 are lower than their intra-class distances. In particular, SKC learns similar Gaussian parameters for class 1 and class 2 and therefore cannot discriminate accurately between the border points. In contrast, some of the points from class 2 that are misclassified by GMM have relatively large inter-class distances to class 1. Based on the comparison of the classification results obtained by SKC and GMM using the different synthetic datasets, one can claim that learning the optimal Gaussian parameters using the intra-class and the inter-class characteristics of each dataset makes SKC outperform GMM. Besides, in the case of large volume variations between the different classes, GMM misclassifies a considerable proportion of the dataset. In fact, GMM does not include the inter-class distance in the learning process of the Gaussian parameters of the large classes. This yields the misclassification of some points from the other nearby classes. Moreover, GMM fails to classify the border points when the two classes are too close. This can be attributed to the fact that it learns the Gaussian parameters based on the intra-class distances only.

B. Experiments using Real Datasets
In this section, we report the classification results of the benchmark datasets from the UCI repository [46]. The classification task was conducted using the proposed SKC, the Gaussian Mixture Model classifier (GMM) [16], the K-nearest Neighbour classifier (KNN) [47], the kernel Support Vector Machine (SVM) with Gaussian kernel [18], and the Naïve Bayes (NB) approach [19]. Note that for KNN we set three different values of K (K=1, K=3, and K=5). For the Gaussian kernel SVM, we varied the Gaussian parameter over 6 different values ($10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$, $10^{-1}$, and 1). For SKC, we set the fuzzifier m to 2, and the maximum number of iterations to 100. We adopted a 10-fold cross-validation approach, along with the accuracy, the sensitivity, and the specificity as performance measures to report the classification performance. Thus, Tables II and III show the average scores over the 10 folds. Moreover, a t-test was conducted to evaluate the statistical significance of the obtained results. In Tables II and III, the best results are shown in red. On the other hand, the green color represents the results that are not significantly different according to the t-test. As can be seen, SKC overtakes all classifiers on the Handwritten Digits [36] and Mammographic Mass [37] datasets. Moreover, it yields the best performances on the remaining 7 datasets.
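The evaluation protocol described above (10-fold cross-validation with accuracy, sensitivity, and specificity averaged over the folds) can be sketched as follows. This is a minimal illustration for the binary case, with a 1-nearest-neighbour rule standing in for the classifier under test; the function names, the random fold assignment, and the handling of folds without positive or negative samples are assumptions and not details taken from the paper.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def binary_scores(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan
    return accuracy, sensitivity, specificity

def one_nn_fit_predict(X_train, y_train, X_test):
    """1-nearest-neighbour prediction (stand-in for the classifier under test)."""
    d2 = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return y_train[np.argmin(d2, axis=1)]

def cross_validate(X, y, fit_predict, k=10, seed=0):
    """Average the three scores over k folds, ignoring undefined (NaN) entries."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(binary_scores(y[test_idx], y_pred))
    return np.nanmean(np.array(scores, dtype=float), axis=0)
```

For multi-class datasets, sensitivity and specificity would have to be computed per class (one-versus-rest) and then averaged, and a paired t-test on the per-fold scores of two classifiers would correspond to the significance test mentioned above.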
Similarly, SKC outperforms KNN [47] on the Handwritten Digits [36] and Mammographic Mass [37] datasets. However, it yields comparable performance on the other datasets. Even though KNN does not use the Gaussian kernel, it uses the local characteristics of the data by labelling the unknown instances based on their neighbouring points in the training set. Moreover, it requires a prior setting of the number of neighbours (K).
… [42] and Seeds of wheat [45], while it beats SVM [18] on the remaining datasets. In fact, although kernel SVM [18] relies on the inter-class distances through learning the optimal hyperplanes that guarantee the best inter-class margin, it uses one global sigma for all the data in the original feature space. In other words, it assumes that all classes follow the same distribution.
Also, one can see that SKC attains results similar to NB [19] on Haberman's Survival [39], Blood Transfusion Service Center [41], HCC Survival [42], Libras Movement [44], and Seeds of wheat [45]. On the other hand, it outperforms NB [19] on the remaining 5 datasets. These results can be attributed to the fact that NB [19] learns a sigma for each feature with respect to each class. Therefore, if the features are not independent, it fails to discover the correct structure of the data. Moreover, NB does not take into consideration the inter-class distances.
SKC yields similar performance to GMM [16] on the E.coli [38], Adolescent Autistic Spectrum Disorder Screening [43], and Seeds of wheat [45] datasets. It overtakes GMM [16] on the 7 other datasets. Even though GMM learns a Gaussian parameter with respect to each feature and the corresponding covariance matrix, it does not take into consideration the inter-class dissimilarities. Therefore, the Gaussian parameters are learned based on the intra-class distances only.

V. CONCLUSIONS
Despite the researchers' efforts to address the supervised learning challenges, most of the classification algorithms exhibit some limitations. The classification task is even more challenging when the data classes show different distribution characteristics. Kernel-based classifiers were introduced to overcome this problem through the mapping of the data into a new feature space using a specific kernel function. This mapping is intended to obtain better separation between the data classes and simplify the classification task. Even though the Gaussian function has been shown to yield reasonable classification accuracy, its performance depends on the choice of its parameter values. Moreover, if the data include classes that vary widely in terms of size, density, and shape, mapping the data into a new feature space using one global Gaussian is not effective. Typically, the tuning of the Gaussian parameters is done through some search strategy that is intended to optimize a predefined criterion function. In this paper, we proposed a new classification algorithm that learns a Gaussian function for each data class. The proposed Support Kernel Classification (SKC) is designed to characterize and separate the data instances from the different classes. It relies on the minimization of the intra-class distances and the maximization of the inter-class distances to learn the optimal Gaussian parameters. In fact, a novel objective function is optimized to model each class using one Gaussian function. The experiments conducted using synthetic datasets demonstrated the effectiveness of the proposed algorithm. Moreover, the results obtained using real datasets show that the proposed classifier outperforms the relevant state-of-the-art approaches.