An Optimized Neural Network Model for Facial Expression Recognition over Traditional Deep Neural Networks

Emotions play a key role in feedback analysis for providing good customer service. The seven main emotions are Anger, Disgust, Fear, Happy, Neutral, Sad and Surprise. An efficient facial emotion recognition model has several advantages; for example, it can help with self-discipline and control of drivers while they are driving a vehicle. Low-resolution and low-reliable images are the main problems in this field. We propose a new model that performs efficiently on low-resolution and low-reliable images. We created a low resolution facial expression (LRFE) dataset by collecting low-resolution images from different sources. We also propose a new hybrid filtering method, which is a combination of the Gaussian, bilateral and non-local means filtering techniques. DenseNet-121 achieves 0.60 and 0.68 accuracy on FER2013 and LRFE, respectively. When the hybrid filtering method is combined with DenseNet-121, it achieves 0.95 accuracy. Similarly, the ResNet-50, MobileNet and Xception models perform effectively when combined with the hybrid filtering method. The proposed convolutional neural network (CNN) model achieves 0.65 accuracy on the FER2013 dataset, while the existing models ResNet-50, MobileNet, DenseNet-121 and Xception obtain accuracies of 0.60, 0.57, 0.60 and 0.52 on FER2013, respectively. The proposed model, when combined with the hybrid filtering method, achieves 0.85 accuracy. The proposed model clearly outperforms the traditional methods, and combining the hybrid filtering method with the CNN models gives a significant increase in accuracy.
Keywords: Facial expression recognition; deep learning; filtering techniques; convolutional neural network; emotion


I. INTRODUCTION
The raw data contains noise, such as random variation of brightness or color information, and removing this noise from the images drastically improves the performance of facial emotion recognition models. Many denoising techniques exist for eliminating noise from images, such as Gaussian blur, the bilateral filter and non-local means filtering. Gaussian blur blurs the edges and reduces the contrast, but it also reduces the details [1]. The bilateral filter reduces noise while preserving edges by replacing the intensity of each pixel with a weighted average of the intensities of the surrounding pixels [2]. The Gaussian filter, the bilateral filter and other traditional filtering techniques can remove image noise, but they do not retain enough of the image structure information. Non-local means filtering averages pixels that have similar neighbourhoods, which gives much greater clarity and less loss of detail after filtering. The limitation of this technique is that its efficiency is slightly lower than that of the traditional techniques. Its computational complexity is quadratic in the number of pixels in the image, so it is expensive to apply. Many techniques have been designed to speed up the execution. One such technique uses the fast Fourier transform to determine the similarity between two pixels, which speeds up the algorithm by a factor of 50 while maintaining the quality of the result [3] [4].
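The difference between Gaussian and bilateral weighting can be illustrated on a one-dimensional signal with a sharp step. The following is a minimal sketch and not the implementation used in this paper; the function names and the parameter values (sigma_s, sigma_r, radius) are illustrative assumptions.

```python
import math

def gaussian_smooth(signal, sigma_s=1.0, radius=2):
    """Weighted average where weights depend only on spatial distance."""
    n = len(signal)
    out = []
    for i in range(n):
        num = den = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
            num += w * signal[j]
            den += w
        out.append(num / den)
    return out

def bilateral_smooth(signal, sigma_s=1.0, sigma_r=10.0, radius=2):
    """Weighted average where weights depend on spatial distance AND on
    intensity difference, so samples across an edge contribute almost nothing."""
    n = len(signal)
    out = []
    for i in range(n):
        num = den = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w_space = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
            w_range = math.exp(-((signal[i] - signal[j]) ** 2) / (2 * sigma_r ** 2))
            num += w_space * w_range * signal[j]
            den += w_space * w_range
        out.append(num / den)
    return out

# A noisy step edge: dark region followed by a bright region.
step = [10, 12, 9, 11, 10, 200, 198, 201, 199, 200]
blurred = gaussian_smooth(step)
preserved = bilateral_smooth(step)
```

At the last dark sample (index 4), the Gaussian average is pulled strongly towards the bright side, whereas the bilateral average stays close to the dark level, which is the edge-preserving behaviour described above.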
Compared with RGB images, grayscale images achieve higher accuracy in the object recognition field. The other benefit of using grayscale images is that the cost of computation decreases [5] [6]. Due to continuous gradient updating, overfitting is one of the basic issues in neural networks, and it results in poor performance of the neural network model [7]. For a deep learning model to perform well, it needs a large number of samples. Gathering more samples or a large dataset might be expensive, so one possible approach is to generate new samples automatically; this process is called data augmentation. Data augmentation is used to improve neural network model performance by decreasing data bias and improving model generalization [8]. Batch normalization is used to increase the stability of a neural network and allows us to use higher learning rates. Just as dropout can decrease overfitting in a model, a batch-normalized neural network can remove or reduce overfitting [9]. The traditional way of training a CNN is via stochastic gradient descent [10]. Instead of decreasing the learning rate, the batch size can be increased during training; this method shows identical performance with fewer parameter updates [11][12] [13].
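As a concrete illustration of data augmentation, the sketch below doubles a labelled image set by adding horizontally mirrored copies. It is a minimal example under the assumption that a mirrored face keeps the same emotion label; the paper does not state which augmentation operations were used, and the function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror a 2D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def augment(dataset):
    """Return the original (image, label) pairs plus one flipped copy of each.

    Assumption: flipping does not change the expression label.
    """
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
    return augmented

# Tiny 2x3 "image" used only to show the effect.
sample = [[1, 2, 3],
          [4, 5, 6]]
bigger = augment([(sample, "Happy")])
```

Here `bigger` holds two samples with the same label: the original image and its mirror image.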
The Haar feature-based cascade classifier is a machine learning approach in which the cascade function is trained on positive and negative images. This approach is useful for object detection in images [14] [15]. The best way to distinguish a neutral facial emotion from other emotions is to check whether the person's mouth is open or closed. If the mouth appears to be open, then the image does not belong to the neutral facial emotion. A lot of research is being done on mobile applications for emotion recognition tasks. Even though present mobile devices have more memory and processing power than previous-generation smartphones, we cannot directly transfer facial emotion recognition solutions from computers to smartphones [16] [17]. Social robots are very much needed by society: they can act as companions for elderly people and help doctors during operations. In facial emotion recognition, the mouth, eyes and eyebrows play a key role, and the Gabor filter helps to obtain these features from an image [18] [19].
Our major contributions in this research paper can be outlined as follows: (1) we designed a novel convolutional neural network (Fig. 4 represents the proposed model architecture); (2) we presented a hybrid denoising method; (3) we created a low resolution facial expression (LRFE) dataset for facial expression recognition; (4) we compared the proposed model with traditional methods. We applied various filtering techniques, namely average filtering, median filtering, Gaussian filtering, non-local means filtering, bilateral filtering and the hybrid denoising method, to both the FER2013 and LRFE datasets and compared the results. The hybrid denoising method combines the Gaussian, bilateral and non-local means filtering techniques. The proposed model is compared with traditional methods; the batch size used in this research is 32 and the models are trained for 100 epochs. Techniques such as dropout and L2 regularization are used to avoid overfitting. We built a novel convolutional neural network because the existing methods do not work well on the test sets, are very large in size (a larger number of layers) and take more time to train. Our proposed convolutional neural network overcomes these problems. In Section III (proposed work), we explain the datasets used in this paper and the algorithm of the proposed model. Next, in the experiment and result section, we present all the experimental results along with graphs for all the techniques used. Table I outlines the description of the FER2013 and LRFE datasets.

The authors of [20] proposed a method based on an ensemble of three face detectors, followed by a classification module with an ensemble of various convolutional neural networks. Each convolutional neural network model is pretrained on the Facial Expression Recognition Challenge 2013 dataset and fine-tuned on SFEW 2.0. This method achieved 61.29% on the test set of SFEW 2.0.
Zhihao Zhang et al. [21] designed a convolutional neural network to extract features from video clips, together with a feature matrix processing method for identifying the apex frame in a long video. By combining the feature extraction and feature matrix processing methods, the model achieved a smaller Mean Absolute Error (MAE). Samira Ebrahimi Kahou et al. [22] proposed a hybrid convolutional neural network (CNN)-recurrent neural network (RNN) method for facial expression recognition. Recurrent neural networks have produced state-of-the-art performance on a diverse set of sequence analysis tasks. The results show that higher recognition accuracy can be achieved by combining feature-level and decision-level fusion networks.
Bing Feiwu et al. [23] proposed a model that can solve the problem of customizing the general model without the label information of the testing samples. The model improved accuracy by 3.01%, 0.49% and 5.33% when tested on the extended Cohn-Kanade (CK+), Radboud Faces Database (RaFD) and Amsterdam Dynamic Facial Expression Set (ADFES) datasets, respectively. Shamim Hossain et al. [24] designed a model for mobile applications that can detect facial emotions with less computation; since a mobile device has limited processing power, a computationally inexpensive model is needed. The proposed model takes only 1.4 seconds to recognize one instance of emotion and obtained accuracies of 99.8% and 99.7% on the JAFFE database and the CK database, respectively. Jia Deng et al. [25] proposed a conditional generative adversarial network approach to reduce intraclass variations. The proposed approach consists of a generator G and discriminators (Di, Da and Dexp). Three loss functions were designed for learning the generative and discriminative representations. One limitation of this approach is that the model is trained individually for each dataset, so a model trained on a particular dataset may give poor accuracy on another dataset.
Hongli Zhang et al. [26] designed a method based on a convolutional neural network and edge detection for facial emotion recognition. To test it, they created a simulation experiment by combining the FER-2013 database with the LFW dataset. The average recognition rate obtained by this method is 88.56%, and training on the training dataset is 1.5 times faster than with the traditional method. Yingying Wang et al. [27] proposed a hybrid transfer learning model based on a Convolution Restricted Boltzmann Machine (CRBM) model and a Convolutional Neural Network (CNN) model, because content differences between the datasets in traditional transfer learning affect the classification performance of the model. In this model, the CRBM replaces the fully connected layer of the CNN model. The added CRBM layer learns the unique statistical characteristics of the target set, which helps in eliminating the content differences between the datasets. Ronak Kosti et al. [28] presented the "Emotions in Context Database" (EMOTIC), which contains images of people in context in non-controlled environments, with 26 emotional categories. They trained a convolutional neural network model on the EMOTIC dataset that can analyze the person and the whole scene to classify the emotional states. Their model is able to make notable guesses about the emotional states even when the face of the person is not visible. Jianzhu Guo et al. [29] created the ICV-MEFED dataset. It includes 50 classes of compound emotions (e.g., happy-disgusted and sadly-fearful) and labels that are evaluated by psychologists, since labels obtained automatically by machine learning based algorithms could lead to inaccuracies. They organized a challenge on the ICV-MEFED dataset at the FG workshop 2017. After analyzing the top three methods, the experimental results indicate that pairs of compound emotions (e.g., happily-surprised vs surprisingly-happy) are more difficult to recognize.

A. Filter Description
The main aim of our research is to compare the facial emotion recognition accuracy obtained with the Gaussian, bilateral, non-local means, average, median and hybrid denoising techniques. The hybrid denoising method is proposed by combining the Gaussian, bilateral and non-local means denoising techniques. The Gaussian filter is a 2D convolution filter that blurs the image, which helps in removing the noise. The limitation of this technique is that the loss of image details is high when compared to other techniques. The bilateral filter is a non-linear filtering technique used to remove noise from the image while preserving the edges. The limitation of this technique is that it introduces false edges in the image. Unlike local filters, which take the mean value of a group of pixels around the target pixel, the non-local means filter takes a mean of all pixels in the image, weighted by their similarity to the target pixel. Unlike other techniques, which blur the image, non-local means can restore the texture of the image. The median filter is a non-linear digital filtering technique that removes noise from images while preserving the edges. It is most effective for removing salt and pepper noise. Average filtering removes noise from images by replacing each value with the average of the neighbouring pixels, which decreases the intensity variation among neighbouring pixels.
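The contrast between median and average filtering on salt and pepper noise can be shown with a small example. The sketch below is illustrative only: it uses a fixed 3×3 window with replicated borders, which are assumptions and not parameters reported in this paper. In practice, library routines such as OpenCV's `cv2.medianBlur` and `cv2.blur` would normally be used instead.

```python
from statistics import median

def _window(image, r, c):
    """Collect the 3x3 neighbourhood of (r, c), clamping indices at the border."""
    h, w = len(image), len(image[0])
    return [image[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def median_filter(image):
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    return [[median(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]

def average_filter(image):
    """Replace each pixel with the mean of its 3x3 neighbourhood."""
    return [[sum(_window(image, r, c)) / 9 for c in range(len(image[0]))]
            for r in range(len(image))]

# Flat grey 5x5 image with a single "salt" pixel in the centre.
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 255

med = median_filter(noisy)
avg = average_filter(noisy)
```

The median filter discards the outlier completely, so `med` is flat again, while the average filter spreads the outlier over its neighbours, leaving a faint bright patch in `avg`.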

B. Dataset Description
1) FER2013 dataset:
This dataset contains 35887 images of the facial expressions of various people. It is a publicly available dataset, which enables research in the field of facial expression recognition. It contains both male and female

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 7, 2021, www.ijacsa.thesai.org

C. Algorithm
Step 1: Input the image dataset containing seven different facial emotions.
Step 2: Convert all the images into JPG format.
Step 3: Convert all the RGB images into grayscale format.
Step 4: Resize all the corresponding grayscale images to 48×48 pixels.
Step 5: Assign labels to all the images after resizing.
Step 6: Split the dataset in the ratio of 80:20 for training and testing the image classification model.
Step 7: Train the model on the training set and evaluate the model on the testing set.
Step 8: Output the classification of each image based on the emotion expressed in the image.
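Steps 3 to 6 above can be sketched in plain Python as follows. This is a simplified illustration, not the code used in this work: the helper names are hypothetical, the grayscale conversion assumes the common ITU-R BT.601 luminance weights, resizing assumes nearest-neighbour sampling, and the split assumes a seeded random shuffle. File format conversion (Step 2) and model training (Steps 7 and 8) are omitted.

```python
import random

EMOTIONS = ["Anger", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

def to_grayscale(rgb_image):
    """Step 3: convert rows of (r, g, b) tuples to rows of luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def resize_nearest(image, size=48):
    """Step 4: resize a 2D image to size x size by nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    return [[image[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def label_of(emotion_name):
    """Step 5: map an emotion name to an integer class label."""
    return EMOTIONS.index(emotion_name)

def train_test_split(samples, train_fraction=0.8, seed=0):
    """Step 6: shuffle a copy of the samples and split it 80:20."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# A 2x2 RGB image standing in for a real photograph.
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(0, 0, 0), (255, 255, 255)]]
gray48 = resize_nearest(to_grayscale(rgb))
train, test = train_test_split([(gray48, label_of("Happy"))] * 10)
```

With ten samples, the split yields eight training samples and two testing samples.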

A. Performance on FER2013 Dataset
The FER2013 dataset (Table II) is divided in the ratio of 80:20 for training and testing, the batch size is 32 and all the models are trained for 100 epochs. During model implementation, 80 percent of the FER2013 dataset is used for training the model and the remaining 20 percent is divided between validation and testing. In Fig. 5(a), 5(b) and 5(d), we can see that the train loss and test loss converge quickly: as the number of epochs increases, both decrease quickly. In Fig. 5(c), we can see that the Xception model takes many more epochs to decrease the train loss and test loss than the proposed FerExpNet model. In Fig. 5(e), we can see that the DenseNet121 model also takes many more epochs to decrease the train loss and test loss than the proposed FerExpNet model. Fig. 5(f) shows the accuracy vs. loss comparison graph of the proposed FerExpNet model. After analyzing all the results, the proposed FerExpNet model performs better than the existing state-of-the-art models on the FER2013 dataset for facial expression recognition in terms of accuracy and loss. Table IV shows the results of FerExpNet on FER2013 when different filtering techniques are applied. We designed a novel hybrid filtering method (HDM), which is a combination of the Gaussian, bilateral and non-local means filtering techniques. When the proposed FerExpNet is combined with the average filtering technique, the model achieves 0.70 accuracy on the FER2013 dataset. The proposed model without any filtering technique achieves 0.65 accuracy on FER2013, so there is a significant increase in accuracy after applying the average filter. When the proposed FerExpNet is combined with the Gaussian filtering technique, the model achieves 0.65 and 0.55 accuracy on the train set and test set of FER2013, respectively.
The accuracy of this approach is only 0.55, which is lower than that of FerExpNet without filtering techniques, because a lot of detail in the images is lost when the Gaussian filter is applied. When the designed hybrid filtering method (HDM) is combined with FerExpNet, the model achieves 0.87 and 0.85 accuracy on the train and test sets of FER2013, respectively. There is a significant improvement in both accuracy and loss when compared to FerExpNet without filtering techniques. The results in Table IV show that FerExpNet with the hybrid filtering method performs better than with the other filtering techniques on the FER2013 dataset. Fig. 6(a), 6(b) and 6(c) show the comparison of accuracy and loss for the various denoising techniques applied with the proposed model on the FER2013 dataset.

B. Performance on LRFE Dataset
The LRFE dataset is divided in the ratio of 80:20 for training and testing, the batch size is 32 and all the models are trained for 100 epochs. During model implementation, 80 percent of the LRFE dataset is used for training the model and the remaining 20 percent is divided between validation and testing. Table V shows the accuracy and loss comparison of different deep learning models and the proposed FerExpNet model on the Low Resolution Facial Expression (LRFE) dataset. The proposed FerExpNet achieves accuracies of 0.95 and 0.71 on the training and testing sets of the LRFE dataset, respectively. The results clearly indicate that the proposed FerExpNet performs better than the state-of-the-art models on the LRFE dataset. The state-of-the-art VGG variants VGG16 and VGG19 obtained 0.69 and 0.66 accuracy on the LRFE dataset. The MobileNet model achieved only 0.65 accuracy on the LRFE dataset, which makes it less efficient in facial expression recognition than Xception and FerExpNet on this dataset. The Xception model achieved 0.69 accuracy on the LRFE dataset, which is the second best after FerExpNet. The latest EfficientNet-B7 obtained 0.65 accuracy on the LRFE dataset. In Fig. 7(a), 7(b) and 7(d), we can see that the train loss and test loss converge quickly: as the number of epochs increases, both decrease quickly. In Fig. 7(c) and 7(e), we can see that the Xception and DenseNet121 models take many more epochs to decrease the train loss and test loss than the proposed FerExpNet model. Fig. 7(f) shows the accuracy vs. loss comparison graph of the proposed FerExpNet model. After analyzing all the results, the proposed FerExpNet model performs better than the existing state-of-the-art models on the LRFE dataset for facial expression recognition in terms of accuracy and loss.
There is a significant improvement in both accuracy and loss when compared to FerExpNet without filtering techniques. The results in Table VI show that FerExpNet with the hybrid filtering method performs better than with the other filtering techniques on the LRFE dataset. Fig. 8(a), 8(b) and 8(c) show the accuracy and loss comparison for the various denoising techniques applied with the proposed model on the LRFE dataset.

C. Performance on Mixed Dataset
The mixed dataset is formed by randomly selecting images from each emotion category of the FER2013 and LRFE datasets and mixing them together. The dataset is then divided in the ratio of 80:20 for training and testing, the batch size is 32 and all the models are trained for 100 epochs. During implementation, 80 percent of the dataset is used for training and 20 percent is used for validation and testing.
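One way to form such a mixed dataset is sketched below. This is an illustrative assumption rather than the exact procedure used here: the paper does not state how many images are drawn per category, so the `per_class` parameter, the dictionary layout and the function name are all hypothetical.

```python
import random

def build_mixed_dataset(fer_by_emotion, lrfe_by_emotion, per_class, seed=0):
    """Draw up to `per_class` random images per emotion from each source
    and return a shuffled list of (image, emotion) pairs.

    Both inputs map an emotion name to a list of images (hypothetical layout).
    """
    rng = random.Random(seed)
    mixed = []
    # Only emotions present in both sources are mixed.
    for emotion in sorted(set(fer_by_emotion) & set(lrfe_by_emotion)):
        for source in (fer_by_emotion, lrfe_by_emotion):
            pool = source[emotion]
            chosen = rng.sample(pool, min(per_class, len(pool)))
            mixed.extend((image, emotion) for image in chosen)
    rng.shuffle(mixed)
    return mixed

# Image file names stand in for real image arrays.
fer = {"Happy": ["f1", "f2", "f3"], "Sad": ["f4", "f5"]}
lrfe = {"Happy": ["l1", "l2"], "Sad": ["l3"]}
mixed = build_mixed_dataset(fer, lrfe, per_class=2)
```

The resulting list can then be split 80:20 in the same way as the individual datasets.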

V. CONCLUSION
A novel optimized neural network based on the convolutional neural network is proposed in this paper. We also designed a new hybrid filtering method, which is a combination of the Gaussian, bilateral and non-local means filtering techniques. This hybrid filtering method is used for removing noise present in the images. All the datasets are divided in the ratio of 80:20 for training and testing. The proposed FerExpNet achieved 0.65 accuracy on the FER2013 dataset, performing better than the state-of-the-art models on FER2013. When the hybrid filtering method is combined with the proposed FerExpNet, the model achieves 0.85 accuracy on the FER2013 dataset, a significant increase in accuracy. The proposed FerExpNet obtained 0.74 accuracy on the LRFE dataset, outperforming the existing models; similarly, when the hybrid filtering method is combined with FerExpNet, the accuracy increased to 0.95 on the LRFE dataset. The results show that the proposed FerExpNet performs better than the existing models. The average time taken per epoch by FerExpNet is 1 s on the LRFE dataset and 5 s on FER2013. The future work of this paper is to build a more sophisticated convolutional neural network model that can be integrated into a mobile device for wide use in real-world applications.
ACKNOWLEDGMENT
I sincerely thank P.V.V.S Srinivas for guiding me in this research work and helping through difficult times.