An Improvised Facial Emotion Recognition System using the Optimized Convolutional Neural Network Model with Dropout

Facial expressions have long been regarded as a channel of communication that accompanies both verbal and nonverbal exchange. The muscular expression on a person's face reflects their physical and mental state. Classifying facial contours into emotion categories by computer is considerably more valuable than doing so manually. Convolutional Neural Networks (CNNs), an Artificial Intelligence approach, have recently gained wide acceptance for this task. Overfitting during the learning step, however, can lower model performance on unseen data. Dropout is a method that reduces the testing error caused by overfitting. In this work, dropout is applied at the convolutional layers and the dense layers to classify facial emotions into the distinct categories Happy, Angry, Sad, Surprise, Neutral, Disgust, and Fear, and the result is presented as an improved convolutional neural network model. The experimental setup used the JAFFE, CK48, FER2013, RVDSR, and CREMAD datasets and a self-prepared dataset of 36,153 facial images to observe train and test accuracy in the presence and absence of dropout. In the presence of dropout, test accuracies of 92.33%, 96.50%, 97.78%, 99.44%, and 98.68% are obtained on the FER2013, RVDSR, CREMAD, CK48, and JAFFE datasets, respectively. Because the number of features used in the computation is large, the experiments were run with NVIDIA GPU support, with a capacity of 16 GB GPU, 13 GB CPU, and 73.1 GB memory.
Keywords: Convolutional neural network (CNN); facial emotion recognition (FER); dropout; FER 2013; CREMAD; RVDSR; CK48; JAFFE


I. INTRODUCTION
Facial expressions are connected to emotions and are essential indicators of human sentiment. Most of the time, a person's facial expression is a nonverbal means of expressing emotion, and it may be used as tangible evidence of a person's state. Human-computer interaction, psychiatric observation, intoxicated-driver recognition, and lie detection are all viable uses of facial emotion recognition.
Convolutional layers act as a backbone for classification with artificial intelligence in several applications, among them cancer detection [1], brain tumor segmentation, object detection [2], and crowd counting [3]. A CNN takes facial image data as input, transforms it using kernels, and then transmits the outputs to the next convolutional layer. The final CNN layer's output is flattened and sent into a feed-forward neural network for categorization into an emotion class. The learning stage entails training the model, while the evaluation stage entails testing it and determining its accuracy. When the model overfits, it fits the training data too closely, which reduces the test accuracy. By avoiding overfitting [4], the model described below points to a research direction for reaching higher accuracy in facial emotion recognition.
The remainder of the paper is organized as follows: Section 2, Related Studies and Motivations; Section 3, Research Method; Section 4, Result and Discussion; and Section 5, Conclusion.

II. RELATED STUDIES AND MOTIVATIONS
An image is a set of pixels represented by a function $f(x, y)$, such that the scalar value of the image at the spatial coordinates $x$ and $y$ corresponds to the amount of energy emitted from the location where the image was captured. Assume that $f(s, t)$ denotes a continuous image that is transformed into a digital image of the form $f(x, y)$ with $x \in \{0, 1, \dots, M-1\}$ and $y \in \{0, 1, \dots, N-1\}$. Here $M$ and $N$ are the length and breadth of the digital image [5].
A CNN is a deep learning method that takes an image as input, assigns importance to distinct aspects of the image through learnable weights and biases, and distinguishes one aspect from another. Compared with other classification methods, a CNN requires significantly less pre-processing. While basic techniques need hand-crafted filters, a CNN can learn these filters or characteristics with enough training [6] and successfully capture the spatial and temporal dependencies in a picture. Because fewer parameters are involved and weights are reused, the architecture can be trained to recognise images better. The goal of the CNN is to compress the images into a structure that is easier to process while preserving the elements that are important for a successful prediction. As data flows from layer to layer, the CNN reduces the dimension while retaining picture features; different kernels and pooling layers accomplish this task. The output left after a few repetitions of these stages is fed into the dense layer for categorization according to the need and model specification.
Current techniques largely focus on studying the face while keeping the surroundings intact, resulting in a variety of redundant and erroneous features that hamper CNN training. Happy, Angry, Sad, Surprise, Neutral, Disgust, and Fear are the seven basic facial emotion classes that the model learns during the learning stage. Recently [7] [8], researchers have achieved remarkable progress in facial expression identification with a higher number of classes [9], and advancements in neurology, applied mathematics, and related fields are further boosting research in the field of facial expression. Moreover, advances in computer vision and machine learning have made emotion recognition tools more accurate.
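The way kernels and pooling layers shrink the spatial dimension can be illustrated with a small calculation. The sketch below assumes unpadded 3x3 convolutions followed by 2x2 pooling; these sizes and the function names are hypothetical choices for illustration, not values reported in the paper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size along one axis after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window, stride=None):
    """Spatial size along one axis after pooling (stride defaults to window)."""
    stride = window if stride is None else stride
    return (size - window) // stride + 1


def trace_sizes(size, blocks):
    """Return the spatial size after each (kernel, pool_window) block."""
    sizes = [size]
    for kernel, window in blocks:
        size = conv_output_size(size, kernel)
        size = pool_output_size(size, window)
        sizes.append(size)
    return sizes


# Hypothetical stack: three blocks of 3x3 convolution and 2x2 pooling
# applied to a 48x48 gray scale input.
sizes = trace_sizes(48, [(3, 2), (3, 2), (3, 2)])
```

Under these assumptions a 48x48 image is reduced step by step before it reaches the dense layer; other kernel sizes, strides, or padding would give different intermediate sizes.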
Dropout changed the idea of learning all the weights in the network in each training cycle to learning only a proportion of the weights in each cycle. It addresses the overfitting problem [10] in networks with many neurons and, hence, many associated weights. Regularization was an important study topic prior to dropout, and regularization approaches such as L1 and L2 weight penalties were introduced for neural networks. These regularizers, however, did not entirely alleviate the overfitting problem [11].
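For reference, the L1 and L2 weight penalties mentioned here are commonly written as an extra term added to the training loss. The symbols below ($J$ for the unpenalized loss, $w_i$ for the weights, $\lambda$ for the penalty strength) are assumed notation rather than the paper's own.

```latex
\begin{align*}
J_{L1}(w) &= J(w) + \lambda \sum_{i} \lvert w_i \rvert, \\
J_{L2}(w) &= J(w) + \lambda \sum_{i} w_i^{2}.
\end{align*}
```

Both penalties act on every weight in every training cycle, whereas dropout removes a random subset of units in each cycle.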
Co-adaptation is a key challenge when training networks with many neurons. If all the weights are learnt at the same time in such a network, it is likely that certain connections will be more predictive than others. As the network is trained repeatedly, the stronger connections are learned more, while the weaker ones are ignored. Traditional regularization, such as L1 and L2, could not avoid this, because it also regularizes according to the connections' predictive abilities and therefore becomes nearly deterministic in selecting and rejecting weights. As a result, the strong become stronger and the weak become weaker.
Nitish Srivastava et al. [10] extensively studied the impact of dropout and concluded that increasing the size of the neural network would not help, so the size and accuracy of neural networks had been limited. Dropout is a strategy against co-adaptation that can help to avoid these problems.
Major contributions to this field in the literature are summarised below.
With the CK48 [12] and FER2013 [13] datasets, Mollahossei [14] proposed a CNN for FER that obtained 93.2% and 61.1% accuracy, respectively. Using the CK48 dataset, [15] investigated the impact of data pre-processing prior to training the network on emotion categorization and reported that it improved accuracy to 96.76%. Using the CK48 [12] dataset, [16] used the notion of action units and achieved 97.01% accuracy. [17] suggested a novel architecture based on sparse batch normalization, reporting an accuracy of 95.24% on JAFFE [18] and 96.87% on the CK48 [12] dataset. Using the FER2013 [13] database, Agrawal and Mittal [19] investigated the influence of adjusting CNN parameters on classification results and found a 65.23% average accuracy without dropout and 65.77% with dropout. The author in [20] used the JAFFE [18] and CK48 [12] datasets to train the model, achieving 95.23% and 93.24% accuracy, respectively.

III. RESEARCH METHOD
A. Proposed CNN Model
The convolutional neural network proposed in this section is an optimized framework that contains the following layers: an input layer, interleaved Z grouped convolutional layers, batch normalization, pooling, fully connected, dropout, flatten, and dense layers, which forward to an output layer at the end. A fully connected layer is obtained by combining the convolutional, batch normalization, and max pooling layers, which are connected through a cross-layer connection as shown in Fig. 1.
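The input image to the proposed framework is a three-dimensional array with dimensions $H \times W \times C$, where $H$ and $W$ are the spatial dimensions and $C$ is the number of channels, whose value is 1 because gray scale images are considered as input. The symbol $*$ denotes the convolution operation and $f$ plays the role of the activation function. In standard form, the output of the convolutional layer is represented as follows.

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l}\Big) \qquad (1)$$

Here $w_{ij}^{l}$ represents the weight matrix between the $i$-th feature space of hidden layer $l-1$ and the $j$-th feature space of hidden layer $l$, $M_j$ is the set of feature planes of layer $l-1$ connected to feature space $j$, $x_j^{l}$ represents the feature space of hidden layer $l$, and $x_i^{l-1}$ represents the feature space of hidden layer $l-1$.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 7, 2021, www.ijacsa.thesai.org

The output of the convolutional layer obtained from Equation 1 is given to the batch normalization layer, where the batch of inputs is represented as $B = \{x_1, \dots, x_m\}$ and the mean of the batch is given as

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad (2)$$

The variance of the batch is

$$\sigma_B^{2} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^{2} \qquad (3)$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}} \qquad (4)$$

where $\hat{x}_i$ represents the normalized values and $\epsilon$ is a small constant for numerical stability. After calculating the mean, variance, and normalized values, the output of the batch normalization layer is given as

$$y_i = \gamma \hat{x}_i + \beta \qquad (5)$$

where $y_i$ is the batch-normalized feature space of the layer for sample $i$, and $\gamma$ and $\beta$ are the learnable scale and shift parameters.

The output of the batch normalization layer is fed as input to the max pooling layer, in which a window with a fixed step size is moved over the feature space. The output of the pooling layer is calculated by the following equation.

$$p^{l}(u, v) = \max_{0 \le a,\, b < k} \; y^{l}(s u + a,\; s v + b) \qquad (6)$$

where $k$ is the window size and $s$ is the fixed step size.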
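The batch normalization and max pooling steps described above can be sketched in a few lines of plain Python. This is a minimal illustration on scalar batches and a small 2-D grid; the function names, the default scale and shift (1 and 0), the stability constant, and the 2x2 window are assumptions made for the example.

```python
import math
from statistics import fmean, pvariance


def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    mean = fmean(batch)                  # batch mean
    var = pvariance(batch, mu=mean)      # batch (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


def max_pool_2d(feature_map, window=2, stride=2):
    """Max pooling over a 2-D list with a square window and a fixed step."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled.append([
            max(feature_map[r + a][c + b]
                for a in range(window) for b in range(window))
            for c in range(0, cols - window + 1, stride)
        ])
    return pooled


normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
pooled = max_pool_2d([[1, 3, 2, 0],
                      [4, 2, 1, 5],
                      [0, 1, 7, 2],
                      [3, 6, 4, 8]])
```

After normalization the batch has mean close to zero and variance close to one, and pooling keeps only the largest value in each window, which halves each spatial dimension in this example.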
In the proposed model, a convolutional layer, a batch normalization layer, and a max pooling layer form one fully connected layer. The output of the fully connected layer is given to the dropout layer, where some of the neurons are removed before the next fully connected layer. Let $a^{[l]}$ be the output of fully connected layer $l$, which is given as input to the dropout layer. The output of the dropout layer is calculated as given below.

$$d^{[l]} = W^{[l]} \big( r^{[l]} \odot a^{[l]} \big) \qquad (7)$$

Here $W^{[l]}$ is the weights associated with layer $l$, $r^{[l]}$ is the amount of dropout applied on layer $l$, $a^{[l]}$ is the output of the fully connected layer, and $\odot$ is the Hadamard (element-wise) product.
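The element-wise masking performed by a dropout layer can be sketched as follows. The sketch assumes the "inverted dropout" convention, in which kept units are rescaled by the keep probability so that the expected activation is unchanged; the text does not state which convention the model uses, and the function name and seed are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Apply a random Bernoulli mask to a list of activations.

    rate is the fraction of units dropped, so each unit is kept with
    probability 1 - rate. Kept units are divided by the keep probability
    (inverted dropout, an assumption not taken from the paper).
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    # Element-wise (Hadamard) product of the mask and the activations.
    return [m * a / keep_prob for m, a in zip(mask, activations)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dropped = dropout([1.0] * 100_000, rate=0.25, rng=rng)
```

With a rate of 0.25, roughly a quarter of the entries become zero and the rest are scaled up, so the average over many units stays close to the original value. At test time the mask is not applied.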
The above process is repeated for all the fully connected layers of the proposed convolutional neural network model, and the fully connected form of the proposed model, which is also called flattening, is given by Equation 8.
In Equation 8, a binary string acts as a crossover indicator that marks the cross-layer connections. For example, a string of all ones says that every group of convolution, batch normalization, pooling, and dropout layers forms a fully connected layer and is connected to the final fully connected layer, which is also called the flatten layer. A string with a one only in the first position says that only the first group is a fully connected layer and the remaining ones are normal convolutional layers, and a string of all zeros says that there are no fully connected layers and all the layers in the network are just convolutional layers. For the proposed model in this paper, the string has the pattern one, zero, one, which says that a fully connected layer is followed by a convolutional layer, which is followed by a fully connected layer again. The output of the flattening operation in Equation 8 is fed as input to the dense layer, in which the ReLU activation function is used, followed by a dropout layer, which is again followed by a dense layer where the output is obtained. The SoftMax function is used in the final dense layer to classify the output.
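$$h = \mathrm{ReLU}\big(W^{[h]} F\big), \quad h \in \mathbb{R}^{n} \qquad (9)$$

Here $h$ denotes the output of the first hidden layer, $n$ represents the number of hidden neurons used in the dense layer, ReLU is the activation function used for activating the neurons, $W^{[h]}$ is the weights of the dense layer, and $F$ is the output of the flatten layer, or final fully connected convolutional layer.

$$d = f\big(W^{[d]} (r^{[d]} \odot h)\big) \qquad (10)$$

Here $d$ is the output of the dropout layer after the dense layer, $f$ is the activation function, and $W^{[d]}$ and $r^{[d]}$ are the respective weights assigned and the amount of dropout applied. Finally, the output of the last dense layer, where the classified output is obtained, is given below.

$$\hat{y} = \mathrm{SoftMax}(d), \quad \mathrm{SoftMax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad k = 1, \dots, K \qquad (11)$$

Here $\hat{y}$ is the final output of the proposed convolutional neural network, $K$ represents the total number of classes to classify, and SoftMax is the activation function.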
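The dense stage, a ReLU hidden layer followed by a SoftMax output over the emotion classes, can be sketched with toy numbers. The weights, biases, and input vector below are invented purely for illustration, the helper names are hypothetical, and the dropout layer between the two dense layers is left out because it is inactive at prediction time.

```python
import math

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]


def dense(inputs, weights, biases):
    """Affine map: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy flattened feature vector and hand-picked weights (illustrative only).
flat = [0.5, -1.0, 2.0]
hidden = relu(dense(flat, [[1.0, 0.0, 0.5], [-1.0, 1.0, 0.0]], [0.0, 0.1]))

# Output layer: only the row for "Happy" responds to the first hidden unit.
out_weights = [[2.0, 0.0] if name == "Happy" else [0.0, 0.0] for name in EMOTIONS]
probs = softmax(dense(hidden, out_weights, [0.0] * len(EMOTIONS)))
predicted = EMOTIONS[probs.index(max(probs))]
```

The SoftMax output is a probability distribution over the seven classes, and the predicted emotion is the class with the largest probability.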
For the sample l, the proposed optimization model uses the formulas given below to calculate the activations of all convolutional, batch normalization, max pooling, fully connected, dropout, and dense layers.

B. The Effect of Dropout on FER Classification
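Dropout is the process of dropping a percentage of connections, each with probability $(1-p)$. The expectation with dropout is represented as

$$\mathbb{E}[r \odot a] = p\,a$$

where $r$ is the dropout variable, $r \sim \mathrm{Bernoulli}(p)$, and $a$ is the activation to which it is applied.
Bernoulli($p$) satisfies the properties listed below.
If $X$ is a random variable with a Bernoulli distribution, then

$$\Pr(X = 1) = p, \qquad \Pr(X = 0) = 1 - p.$$

The probability mass function of this distribution, over possible outcomes $k \in \{0, 1\}$, is

$$f(k; p) = p^{k} (1 - p)^{1 - k}.$$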
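As a worked example of these properties, the mean and variance of a Bernoulli variable can be evaluated for a keep probability of 0.75. That value is an assumption corresponding to a dropout rate of 0.25, and the symbols $r$ (mask variable) and $a$ (activation) are assumed notation.

```latex
\begin{align*}
\mathbb{E}[r] &= 1 \cdot p + 0 \cdot (1 - p) = p = 0.75, \\
\operatorname{Var}[r] &= \mathbb{E}[r^{2}] - \mathbb{E}[r]^{2} = p - p^{2} = p(1 - p) = 0.1875, \\
\mathbb{E}\!\left[\tfrac{1}{p}\, r\, a\right] &= \tfrac{1}{p}\, a\, \mathbb{E}[r] = a .
\end{align*}
```

The last line shows why implementations often divide the kept activations by $p$ during training: the expected activation then matches the activation without dropout.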
Lemma 1: Dropping network connections increases generalization over non-dropout.
Let a weight be associated with the connection between two consecutive hidden layers p and q of a neural network with absolute mean square error E. The projection of the respective connection is represented by Equation 15, which contains a parameter that controls the connection. A connection [27] with high projection improves the generalization of the neural network. The generalization potential of the network with normalization is then expressed in terms of the normalized view of the projection.
After a few of the connections are dropped, the generalization of the neural network is measured again. As can be seen from the resulting expression, dropping connections can result in a high degree of generalization.
Theorem 1: With every additional hidden layer, the frequency of neuron activation falls.
For simplicity, the number of hidden layers is assumed to be even. An adaptive probability density function [28] is defined for any hidden layer, and its derivative is taken. It is clear from these two equations that the minimum and maximum lie in the range between 0 and 1. Hence proved.

IV. RESULT AND DISCUSSION
A. About Datasets
For experimentation, six different datasets are considered, namely FER2013 [13], RVDSR [22], CREMAD [21], CK48 [12], JAFFE [18], and the authors' own dataset, each of which consists of gray scale images of dimensions 48 x 48. For the own dataset, images and videos were collected from different web resources. All the images were converted to gray scale and resized to 48 x 48; the videos were converted to frames and preprocessed manually by keeping only those frames that are of good quality. FER2013 [13] and the own dataset consist of images belonging to seven different classes, namely Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise, whereas the CREMAD [21] and RVDSR [22] datasets consist of images belonging to six different classes, namely Angry, Disgust, Fear, Happy, Neutral, and Sad. FER2013 [13] consists of 35,557 images, RVDSR [22] of 61,673, CREMAD [21] of 61,309, CK48 [12] of 3,540, JAFFE [18] of 3,406, and the own dataset of 36,153. The RVDSR [22] and CREMAD [21] datasets consist of videos of different expressions, and all the videos were converted to frames. While converting the videos to frames, only one frame per second was kept, after which the images were resized to 48 x 48 and taken for experimentation. The details of the images in the datasets are given in Table I.
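The preprocessing steps described above (one frame per second, gray scale conversion, resizing to 48 x 48) can be sketched without any image library. The function names are hypothetical, the gray scale weights are the common ITU-R BT.601 luma coefficients, and nearest-neighbour resizing is an assumed choice; the paper does not state which conversion or interpolation was used.

```python
def frames_to_keep(total_frames, fps):
    """Indices of one frame per second of video."""
    step = max(1, round(fps))
    return list(range(0, total_frames, step))


def to_gray(pixel):
    """Luma of an (R, G, B) pixel using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def resize_nearest(image, out_h=48, out_w=48):
    """Nearest-neighbour resize of a 2-D list of gray values."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


# A 3-second clip at 30 frames per second keeps three frames.
kept = frames_to_keep(90, 30)

# Synthetic 96x96 gray image reduced to the 48x48 input size.
image = [[(x + y) % 256 for x in range(96)] for y in range(96)]
small = resize_nearest(image)
```

In practice an image library would normally handle decoding and interpolation; the sketch only makes the order of the steps explicit.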

B. About Experimentation Setup and Resources
Because the datasets consist of a huge number of images, implementation on systems with a general configuration would take considerable time. Therefore, the Kaggle cloud platform was used for the implementation, with NVIDIA GPU support of 16 GB, CPU support of 13 GB, and memory support of 73.1 GB.

C. Experimentation
The implementation is done on four of the datasets mentioned above, and three cases were considered. In Case 1, a convolutional neural network model was considered in which dropout layers were included in between the fully connected layers and after the flatten layer, where the dropout layer was added in between the two dense layers. The dropout rate applied in between the fully connected layers is 0.25, whereas in between the dense layers it is 0.5. Along with dropout, an L2 kernel regularizer and an L2 bias regularizer of 0.01 each were added to the convolutional and dense layers. Each fully connected layer consists of a convolutional layer, a batch normalization layer, and a max pooling layer. Three fully connected layers, three dropout layers, one flatten layer, and two dense layers are used to construct the model.
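The layer ordering of the experimental cases can be written down as a plain list, which makes the placement of the dropout layers easy to compare. The sketch below is only a bookkeeping aid under the description given in this section; the layer names and the helper functions are hypothetical, and no filter counts or dense sizes are assumed. Cases 2 and 3 follow the variants of this model with dropout only between the dense layers and with no dropout at all.

```python
def build_layer_plan(case):
    """Return the ordered layers for Case 1, 2, or 3 as (name, setting) pairs."""
    if case not in (1, 2, 3):
        raise ValueError("case must be 1, 2, or 3")
    l2 = {"kernel_l2": 0.01, "bias_l2": 0.01}
    plan = []
    for block in range(3):
        # One "fully connected layer" in the paper's terminology.
        plan += [("conv", l2), ("batch_norm", None), ("max_pool", None)]
        # Case 1 only: dropout in between consecutive blocks.
        if case == 1 and block < 2:
            plan.append(("dropout", 0.25))
    plan += [("flatten", None), ("dense_relu", l2)]
    # Cases 1 and 2: dropout in between the two dense layers.
    if case in (1, 2):
        plan.append(("dropout", 0.5))
    plan.append(("dense_softmax", l2))
    return plan


def count(plan, name):
    """Number of layers in the plan with the given name."""
    return sum(1 for layer, _ in plan if layer == name)


case1 = build_layer_plan(1)
```

For Case 1 this yields three convolutional blocks, three dropout layers (two at 0.25 and one at 0.5), one flatten layer, and two dense layers, matching the counts stated above.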
In Case 2, the same convolutional neural network is used with slight modifications: there is a dropout layer only between the dense layers, and no dropout layers are used in between the fully connected layers. The dropout rate applied in between the dense layers is 0.5, and the L2 kernel regularizer and L2 bias regularizer are used in the same way as in Case 1. In Case 3, no dropout layers are used, and the remaining considerations are the same as in Case 1 and Case 2. The proposed convolutional network has 32,115,718 parameters in total, of which 32,115,078 are trainable and 640 are non-trainable.
The results obtained for all three cases on the respective datasets are given in the tables and figures. The results show that the overall performance of the model increased after using dropout in between the fully connected layers as well as the dense layers, and that the overfitting and underfitting problems that a CNN model normally has were overcome by using dropout together with L2 kernel and bias regularization.
Fig. 2(a) to 2(f), 3(a) to 3(f), 4(a) to 4(f), and 5(a) to 5(f) show the model accuracy and loss on the FER 2013, RVDSR, CREMAD, and own datasets in all three cases explained above. Fig. 2(a) and 2(b), Fig. 3(a) and 3(b), Fig. 4(a) and 4(b), and Fig. 5(a) and 5(b) give the model accuracy and loss in the case where dropout is used in between the convolutional layers and in between the dense layers. Fig. 2(c) and 2(d), Fig. 3(c) and 3(d), Fig. 4(c) and 4(d), and Fig. 5(c) and 5(d) give the model accuracy and loss in the case where dropout is used only after the flatten layer, i.e., between the dense layers, whereas Fig. 2(e) and 2(f), Fig. 3(e) and 3(f), Fig. 4(e) and 4(f), and Fig. 5(e) and 5(f) give the model accuracy and loss in the case where dropout is not used.
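The parameter totals reported for the proposed network can be sanity-checked with a few lines of arithmetic. The interpretation of the non-trainable count is an assumption: in Keras-style batch normalization each channel stores a moving mean and a moving variance that are not trained, so 640 non-trainable values would correspond to 320 normalized channels in total. The paper does not confirm this breakdown.

```python
TOTAL_PARAMS = 32_115_718
TRAINABLE = 32_115_078
NON_TRAINABLE = 640

# The reported split should add up to the reported total.
split_is_consistent = TRAINABLE + NON_TRAINABLE == TOTAL_PARAMS

# Assumption: every non-trainable value is a batch normalization moving
# statistic, with two such values (mean and variance) per channel.
bn_channels = NON_TRAINABLE // 2

# Fraction of the parameters that are updated during training.
trainable_share = TRAINABLE / TOTAL_PARAMS
```

Almost all parameters are trainable, which is typical when the dense layer after flattening dominates the parameter count.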
It is observed that the case where dropout is used produced higher accuracy and lower loss than the other cases. The detailed train and test accuracies, train and test losses, macro-averaged precision, recall, and F1-score, and weighted-average precision, recall, and F1-score are given in Table II to Table V.

V. CONCLUSION AND FUTURE WORK
The proposed method demonstrates that the use of dropout handled overfitting. On the FER2013 dataset, it resulted in a 3.09 percent gain in test accuracy for the CNN model that uses dropout in the fully connected convolutional layers and a 0.37 percent gain in test accuracy for the CNN model that uses dropout in the dense layers. A significant improvement in test accuracy was also observed for the other datasets, namely JAFFE, CK48, RVDSR, CREMAD, and the self-prepared dataset. Some recent FER developments in the literature were compared with the proposed model, which was found to perform better owing to the use of dropout, as shown in Table VI.
Future work will investigate the trade-off between overfitting and underfitting for FER with CNN models and develop mathematical models for managing the degree of overfitting and/or underfitting to achieve higher accuracy.