Empirical Evaluation of SVM for Facial Expression Recognition

Abstract—Support Vector Machines (SVMs) have shown strong generalization and classification capabilities in many computer vision applications. An SVM classifies data with a hyperplane that separates two classes while maximizing the margin between the support vectors of the respective classes. This paper reports an empirical analysis of SVMs on the facial expression recognition task, which is characterized by high intra-class and low inter-class variation, through an extensive set of experiments on the large-scale Fer 2013 dataset. Three SVM kernel functions are used: linear, quadratic, and cubic, with Histogram of Oriented Gradients (HoG) as the feature descriptor. The cubic kernel achieves the highest accuracy on the Fer 2013 dataset using HoG features.


I. INTRODUCTION
Facial expression is one of the most important non-verbal means of communication, enabling human beings to exchange social information. Facial expression recognition (FER) has a vast range of applications, such as human and man-machine interaction [1], smart healthcare [2], biometrics, medical diagnosis [3], surveillance [4] and mental state identification [5].
However, high intra-class and low inter-class variations make human FER a challenging task in the fields of machine learning and computer vision. Based on emotions and social interactions, facial expressions are usually categorized into six basic emotions, namely happiness, sadness, anger, fear, surprise, and disgust [6]. The neutral expression is also added as a further category and is widely accepted by researchers [7]. Figure 1 shows some sample images of these facial expressions.
In recent years, several interesting solutions have been proposed for FER, relying on a diverse set of techniques and strategies to better represent and classify facial expressions in various application domains [5], [8]. Most of the existing FER frameworks focus on algorithms to extract better features, such as Gabor filter and RBF network [7], CNN [9], hybrid CNN-SIFT aggregator [10], and SVM [11].
In this paper, an empirical study of SVMs with different kernels on FER is reported. In the experiments, the false positives for each expression are also analyzed. The rest of the paper is organized as follows: Section II discusses the related work. Section III describes the methodology adopted for this empirical study. Section IV reports the experiments and the dataset used for the evaluations, along with a detailed analysis of the obtained results. Finally, Section V provides some concluding remarks.

II. RELATED WORK
In recent years, the research community has shown great interest in FER, and several interesting solutions have been proposed as a result. For instance, Usman et al. in [12] proposed a three-step solution to FER. In the first step, the Viola-Jones face detection method is employed to detect faces in the images. Subsequently, HoG features are extracted from the identified regions of interest, followed by autoencoder and PCA based dimensionality reduction. Finally, SVMs are used for classification. A similar technique has been adopted in [13], where faces are first detected using a Viola-Jones face detector, followed by a feature extraction phase in which Local Binary Pattern (LBP) features are used for representation.
On the other hand, Anurag De et al. in [14] modeled an FER system using an eigenface based approach in which hue-saturation values are used for face detection. The expressions are then classified by calculating the Euclidean distance between the test image and the mean of the training dataset. For FER, Lekdioui et al. in [15] rely on a novel facial decomposition technique. First, the regions of a face are extracted from facial landmarks using the IntraFace algorithm. For feature extraction, different techniques, such as LBP, CSLBP, LTP and Dynamic LTP, are used. SVMs are then used for classification. Suzan Anwar et al. in [16] proposed an Active Shape Model (ASM) tracker, which takes input from a webcam and tracks 116 facial landmarks. These tracked landmark points are used to extract expression features from the face, and a Support Vector Machine (SVM) is used for classification.

III. PROPOSED METHODOLOGY
Figure 2 provides a block diagram of the methodology adopted in this study. The method is mainly composed of two components, namely (i) feature extraction and (ii) classification. For feature extraction, the HoG feature descriptor is used, whereas three different SVM kernels are used for classification.
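As a rough illustration of the feature extraction stage, the length of a HoG descriptor is the number of blocks times the number of cells per block times the number of orientation bins. The sketch below computes that length for a square image. Fer 2013 images are 48 × 48 pixels; the 8 × 8 pixel cells, 2 × 2 cell blocks, and one-cell block stride are assumptions chosen for illustration and are not stated in this paper. The function name is likewise hypothetical.

```python
def hog_feature_length(image_size, cell_size, block_size, block_stride, bins):
    """Length of a HoG descriptor for a square image.

    image_size and cell_size are in pixels; block_size and block_stride
    are in cells. All parameter values used below are assumptions.
    """
    cells_per_side = image_size // cell_size
    blocks_per_side = (cells_per_side - block_size) // block_stride + 1
    n_blocks = blocks_per_side ** 2        # N: number of blocks
    cells_per_block = block_size ** 2      # C: cells in each block
    return n_blocks * cells_per_block * bins  # N x C x P


# Assumed layout: 48x48 image, 8x8 cells, 2x2-cell blocks, stride of one cell.
length = hog_feature_length(image_size=48, cell_size=8,
                            block_size=2, block_stride=1, bins=9)
print(length)
```

Under these assumed settings there are 25 blocks of 4 cells each, so with 9 orientation bins the descriptor has 900 values.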
In Equation 1, f(x, y) denotes the pixel value at location (x, y) in the given face patch I, and m denotes the magnitude.
4) The length of the HoG feature vector for a given I is N × C × P, where N denotes the number of blocks, C denotes the number of cells in each block, and P denotes the number of orientation bins. The orientations are quantized into 9 bins, which makes the HoG feature vector of length 25 × 4 × 9 = 900 [17], [18]. An example of the HoG feature extraction process is shown in Figure 3.
The SVM has shown outstanding performance in different application domains [19], [20]. The core goal of an SVM is to find the hyperplane that separates two classes with the maximum margin, using the minimum number of support vectors. Consider a set of l training examples (r_i, t_i), i = 1, ..., l, where each example is n-dimensional, r_i ∈ R^n, with a class label t_i ∈ {+1, −1}. A function φ is learned that maps an unseen instance r_j to its label t_j, that is, t_j = φ(r_j).
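The linear, quadratic and cubic kernels can be read as polynomial kernels of degree 1, 2 and 3. The following minimal sketch evaluates such kernels on two feature vectors. The offset values (0 for linear, 1 for quadratic and cubic) and all function and variable names are assumptions made for illustration; the paper does not state the exact kernel parameters.

```python
def polynomial_kernel(u, v, degree, coef0):
    """Polynomial kernel k(u, v) = (u . v + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + coef0) ** degree


# Assumed (degree, offset) settings for the three kernels.
KERNELS = {
    "linear": (1, 0.0),     # plain dot product
    "quadratic": (2, 1.0),  # degree-2 polynomial
    "cubic": (3, 1.0),      # degree-3 polynomial
}


def kernel_value(name, u, v):
    degree, coef0 = KERNELS[name]
    return polynomial_kernel(u, v, degree, coef0)


# Two toy 2-dimensional vectors standing in for 900-dimensional HoG features.
u = [1.0, 2.0]
v = [0.5, -1.0]
for name in KERNELS:
    print(name, kernel_value(name, u, v))
```

A higher degree lets the decision boundary bend more in the original feature space, which is consistent with the cubic kernel fitting the training set most closely.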
Inherently, SVM is a binary classifier. To extend SVM to multi-class classification, either the one-vs-one (OVO) or the one-vs-all (OVA) approach is used. In the case of OVA, for q different classes where q > 2, q different classifiers are trained; the classifier for class i treats i as the positive class and all other classes as negative. OVA often leads to an imbalanced training dataset. In the Fer 2013 dataset, it creates a large imbalance for a few classes, such as the Disgust expression. If a binary classifier for the Disgust expression is trained using OVA, there are only 436 positive instances against 28273 negative instances; the positive instances make up only 1.5% of the training instances.
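The imbalance figures quoted for the Disgust classifier can be checked with a few lines of arithmetic. The sketch below uses the instance counts given in this paper; the function name is hypothetical.

```python
def ova_split(class_count, total_count):
    """Positive count, negative count and positive share for one OVA classifier."""
    positives = class_count
    negatives = total_count - class_count
    return positives, negatives, positives / total_count


# 436 Disgust instances out of 28709 training instances (counts from the paper).
positives, negatives, share = ova_split(436, 28709)
print(positives, negatives, f"{100 * share:.2f}%")
```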
For many applications, OVO has shown better accuracy than OVA. The OVO approach trains q(q − 1)/2 binary classifiers. For FER with seven expressions, this gives 21 binary classifiers. A voting approach is used to decide the label of a given instance r_j: the unknown instance r_j is passed to all 21 binary classifiers, and the label predicted most often is taken as the expression. OVO is used in this paper.
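The OVO voting scheme described above can be sketched as follows. The pairwise classifiers here are hypothetical stubs rather than trained SVMs, and ties are resolved in favor of the label that first reached the top count, which is an assumption; the paper does not specify a tie-breaking rule.

```python
from collections import Counter
from itertools import combinations

EXPRESSIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]


def ovo_pairs(labels):
    """All unordered class pairs: q(q - 1)/2 of them."""
    return list(combinations(labels, 2))


def ovo_predict(instance, pairwise_classifiers):
    """Majority vote over the outputs of all pairwise classifiers."""
    votes = Counter(clf(instance) for clf in pairwise_classifiers.values())
    return votes.most_common(1)[0][0]


def make_stub(pair, favourite="Happy"):
    """Hypothetical stand-in for a trained binary SVM on one class pair."""
    first, _second = pair
    return lambda _instance: favourite if favourite in pair else first


classifiers = {pair: make_stub(pair) for pair in ovo_pairs(EXPRESSIONS)}
print(len(classifiers))                 # number of binary classifiers
print(ovo_predict(None, classifiers))   # label with the most votes
```

With the stubs above, Happy wins all six of its pairwise contests and no other label collects more than five votes, so it is returned as the prediction.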

A. Dataset
The SVMs are evaluated on a large-scale benchmark dataset, namely the Fer 2013 dataset [21].
The quadratic SVM gives 77.51% accuracy on the training set but only 54.33% on the test set, which suggests that the feature space is quite challenging and that the HoG feature representation is limited for the FER application. Even cubic SVM, which performs better on a nonlinear feature space, reaches only 57.17% accuracy on the test set.
Interestingly, cubic SVM over-fits on the training set and achieves 97.11% accuracy.
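The degree of over-fitting can be made explicit as the gap between training and test accuracy. The sketch below uses only the accuracies quoted in the text for the quadratic and cubic kernels; the linear kernel is omitted because its overall accuracies are not given here. The function and variable names are hypothetical.

```python
# Accuracies (%) quoted in the text for HoG features.
ACCURACY = {
    "quadratic": {"train": 77.51, "test": 54.33},
    "cubic": {"train": 97.11, "test": 57.17},
}


def generalization_gap(scores):
    """Training accuracy minus test accuracy, in percentage points."""
    return round(scores["train"] - scores["test"], 2)


gaps = {name: generalization_gap(scores) for name, scores in ACCURACY.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap} points")
```

The cubic kernel gains about 2.8 points on the test set over the quadratic kernel, while its train-test gap grows from roughly 23 to roughly 40 points.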
Moreover, to provide a more detailed analysis of the results, Figure 4, Figure 5 and Figure 6 show the confusion matrices of the linear, quadratic and cubic kernels, respectively, on HoG features. It is interesting to observe that linear SVM fails to learn a hyperplane for the Disgust expression in both the training set and the test set, which indicates that the Disgust expression is not linearly separable. It is also evident from Figure 4 that the Happy expression is recognized better than all other expressions. The Happy expression has similar recognition rates on the test sets of the quadratic and cubic SVMs. When the expressions are ranked by recognition accuracy, the top three positions are the same for all SVM kernels, whereas a few expressions change positions. The Disgust expression is in the last position (7th) for the linear and quadratic SVMs, whereas it rises to 4th for the cubic SVM, which indicates that cubic SVM also gives better results under imbalanced training data. Almost all expressions improve their accuracies from linear to cubic SVM, except the Neutral expression, for which quadratic SVM gives the highest accuracy.
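The per-expression accuracies and the ranking discussed above can be derived from a confusion matrix as sketched below. The three-class matrix is made-up toy data for illustration only and does not reproduce the values in Figures 4 to 6; rows are assumed to hold true labels and columns predicted labels. All names are hypothetical.

```python
def per_class_accuracy(confusion, labels):
    """Fraction of each true class that is predicted correctly (row-wise)."""
    accuracies = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        accuracies[label] = confusion[i][i] / row_total if row_total else 0.0
    return accuracies


def rank_classes(accuracies):
    """Labels ordered from highest to lowest accuracy."""
    return sorted(accuracies, key=accuracies.get, reverse=True)


def false_positives(confusion, class_index):
    """Instances of other classes that were predicted as the given class."""
    return sum(row[class_index]
               for i, row in enumerate(confusion) if i != class_index)


# Toy data (NOT from the paper): rows = true label, columns = predicted label.
labels = ["Happy", "Disgust", "Neutral"]
toy_confusion = [
    [80, 0, 20],  # true Happy
    [30, 0, 20],  # true Disgust, never predicted correctly
    [25, 0, 75],  # true Neutral
]

accuracies = per_class_accuracy(toy_confusion, labels)
ranking = rank_classes(accuracies)
print(accuracies)
print(ranking)
print(false_positives(toy_confusion, labels.index("Disgust")))
```

In this toy matrix the Disgust column is entirely zero, which mirrors the situation reported for linear SVM: the class is never predicted, so it has 0% accuracy and contributes no false positives.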

V. CONCLUSION
In this paper, an empirical study is conducted on SVM to investigate the impact of different kernel functions on the performance of FER. Seven facial expressions are considered: Angry, Disgust, Fear, Happy, Sad, Surprise and Neutral. Cubic SVM achieves the highest accuracy on the test set, although it over-fits on the training set. Cubic SVM also learns better hyperplanes under imbalanced training data. For the Disgust expression, the training instances make up only 1.52% of the training set. Linear SVM fails to learn a hyperplane for the Disgust expression: the accuracy for Disgust is 0%, and Disgust does not appear as a false positive for any other expression either. Cubic SVM has the best accuracy for all expressions except Neutral, for which quadratic SVM gives the highest accuracy. Linear SVM performs poorly for all expressions except Happy, where it gives a competitive accuracy of 73.7%, whereas the best is 77.6% using cubic SVM.

Fig. 1. Sample images from the Fer 2013 dataset. Images in the same column show identical expressions. The labels of expressions from left to right are: Anger, Disgust, Fear, Happy, Sad, Surprise and Neutral.

Fig. 2. The flow chart of the proposed model.

The imbalanced nature of the dataset makes it one of the most challenging datasets for FER. In total, the dataset covers 7 different facial expressions, namely Angry, Disgust, Fear, Happy, Sad, Surprise and Neutral. The dataset is divided into training and test sets. The test set is further divided into private and public testing. The training set contains 28709 instances across all expressions, whereas the test set contains 7180 instances. The number of instances of each facial expression in the test and training sets is given in Table I. It can be observed that some expressions have very few training instances. For instance, Disgust has only 436 training instances, which makes the dataset imbalanced for the Disgust expression. Since classification models are data-greedy, recognition depends heavily on the number of training instances per class.

Fig. 4. FER on linear SVM using HoG features: (a) results on the training set, and (b) results on the test set.

Fig. 5. FER on quadratic SVM using HoG features: (a) results on the training set, and (b) results on the test set.

Fig. 6. FER on cubic SVM using HoG features: (a) results on the training set, and (b) results on the test set.

TABLE I. FER 2013 DATASET DESCRIPTION

B. Experimental Results
Table II summarizes the recognition accuracies. It can be seen that linear SVM has limited performance compared to the other kernels.

TABLE II. COMPARISON BETWEEN THE RESULTS OF LINEAR SVM, QUADRATIC SVM AND CUBIC SVM ON HOG FEATURES