Human Emotion Recognition by Integrating Facial and Speech Features: An Implementation of Multimodal Framework using CNN

Emotion recognition plays a prominent role in today's intelligent system applications. Human-computer interfaces, health care, law, and entertainment are a few of the applications where emotion recognition is used. Humans convey their emotions in the form of text, voice, and facial expressions, so developing a multimodal emotion recognition system plays a crucial role in human-computer and intelligent system communication. The majority of established emotion recognition algorithms identify emotions from a single type of data, such as text, audio, or images. A multimodal system draws information from a variety of sources and combines it using fusion techniques to improve recognition accuracy. In this paper, a multimodal system to recognise emotions is presented that fuses features obtained from heterogeneous modalities, namely audio and video. For audio feature extraction, energy, zero crossing rate, and Mel-Frequency Cepstral Coefficient (MFCC) techniques are considered; of these, MFCC produced promising results. For video feature extraction, the videos are first converted to frames and stored in a linear scale space using a spatio-temporal Gaussian kernel. The image features are then extracted by applying a Gaussian weighted function to the second moment matrix of the linear scale space data. The Marginal Fisher Analysis (MFA) method is used to fuse the audio and video features, and the resulting features are given to the FERCNN model for evaluation. For experimentation, the RAVDESS and CREMAD datasets, which contain audio and video data, are used. Accuracies of 95.56, 96.28, and 95.07 on the RAVDESS dataset and 80.50, 97.88, and 69.66 on the CREMAD dataset are achieved in the audio, video, and multimodal modalities, respectively, outperforming existing multimodal systems.

Keywords—Emotion recognition; multimodal; fusion; MFCC; MFA; FERCNN; CREMAD; RAVDESS


I. INTRODUCTION
Emotion recognition is the process of determining a person's emotional state. Affective computing and human-computer interaction (HCI) applications rely heavily on it [1]. In recent studies, emotion identification has attracted increased attention from academia and the commercial sector [2]. It is used in a variety of applications, including Twitter analysis, tutoring systems, video games, prediction of consumer satisfaction, and military healthcare [3][4][5].
Speech or audio emotion recognition has been employed in medical studies to examine changes in the emotions of depressed patients and of children with communication difficulties. It can also be used to warn drivers when they are fatigued, helping to avoid accidents. A speech or audio emotion recognition system extracts low-level information from the speech or audio signal to determine the emotional state. Compiling emotion databases, extracting emotional features from the speech or audio signal, reducing the feature set with dimensionality reduction techniques, and classifying emotions into their respective classes are all part of this classification problem based on speech or audio signal sequences. K-nearest neighbours, Gaussian mixture models, support vector machines, and artificial neural networks are some of the traditional techniques used for speech or audio emotion recognition, but they are not very effective because human emotions have high complexity and uncertainty [6].
About 93% of human communication is conveyed through nonverbal means such as voice tone, facial expressions, and body language [7]. Identifying emotions through facial expressions has been extensively studied [8][9], and higher accuracies have been obtained by improving the pre-processing stage. Adding dropout to a CNN model plays a prominent role in reducing overfitting during training [10]. Extracting faces from a chain of video frames and then extracting features from the resulting images are the steps generally followed to detect the emotions of faces in video sequences [11]. The robust face detection algorithm [12], the AdaBoost learning algorithm [13], and the spatial template tracker [14] are some of the techniques used to detect faces in video. Fisher vectors, the Active Shape model, the Active Appearance model, local binary patterns, principal component analysis [15], and the Gaussian mixture model [16] are some of the methods used for feature extraction from facial images. Occlusions and lighting changes can mislead the identification process. If the emotion is to be identified through speech, ambient noise and differences between speakers are major factors that can affect the final recognition result. According to both physiological and psychological research, humans need both audio and visual signals to correctly understand emotions, which is why multimodal systems that fuse audio and video signals can be used.
Thanks to recent research interest in multimodal systems, the limitations of monomodal systems [17][18] have been overcome. Multimodal systems fuse the information obtained from different modalities at different levels. The fusion levels fall into two categories: matching prior to fusion and matching after fusion. Feature-level and sensor-level fusion techniques [19] belong to the first category; decision-, rank-, and score-level fusion techniques belong to the second. To combine the audio and video features of the multimodal system, a fusion method that takes advantage of both decision-level and feature-level fusion was developed. Latent-space fusion methods preserve the analytical or numerical correlation between the different modalities and store them in a common latent space.

II. RELATED STUDIES AND MOTIVATIONS
Many attempts have been made by researchers to enhance emotion identification using a combination of audio and visual information [20]. According to [21], audio-visual emotion detection may be categorized into kernel-based, feature-level, model-level, decision-level, score-level, and hybrid fusion techniques. In this paper, we focus on latent-space fusion methods and multimodal recognition to detect emotions. Multimodal emotion recognition systems consistently outperform unimodal systems [22], [23], [24]. Although there are certain benefits to using multimodal affective systems, they also face some important challenges [24]. Selecting the modalities that yield the best combinations has been the focus of recent studies [25]. CREMA-D [26], RAVDESS [27], and SAVEE [28] are some of the existing multimodal datasets considered in recent research. A multimodal method by Cid et al. [29] used tempo, pitch, and energy feature extraction techniques to extract the audio features and a Bayesian classifier to categorize the emotions, while edge-based characteristics were extracted from the visual images of the SAVEE database for classification.
Gharavian et al. [30] evaluated the performance of a neural network called FAMNN. MFCC, zero crossing rate, and pitch are some of the audio feature extraction techniques they used, while visual information was obtained from marker positions on the face; the resulting features were passed to a feature selection algorithm (FCBF). For audio features, Huang et al. [31] used prosodic and frequency-domain features, while for facial expression description they used geometry- and appearance-based features. Using a backpropagation neural network, each feature vector was used to train a single-modal classifier. They proposed a genetic learning-based collaborative decision-making model, which was compared with concatenated equal-weighted decision fusion, BPN learning-based weighted decision fusion, and feature fusion methods. Audio spectrum features obtained from BERT and a CNN have also been combined in parallel to form a multimodal system [32].
An HGFM method was proposed by Xu [33], which fuses hand-crafted features with features extracted from a gated recurrent unit. Noroozi [34] summarized key video frames with a method that uses a CNN model and late (stacked) fusion to detect emotions. Xu et al. [35] proposed a multi-hop memory network that models the single-modality and cross-modality interactions among three different feature domains for aspect-level sentiment analysis in a multimodal system. Zadeh et al. [36] introduced a tensor fusion network that uses the product of audio, visual, and textual elements to represent multimodal fusion information.
RMFN, a multistage recurrent fusion network described by Liang et al. [37], divides multimodal fusion into several stages that use LSTMs to capture multimodal interactions in both synchronous and asynchronous modes. Liu et al. [38] lowered the computational complexity of the parameters with a low-rank multimodal fusion approach that employs a low-rank tensor to relieve the increased cost of considering all three modalities. Poria et al. [39] used LSTMs to extract audio, video, and text features separately before combining them in a multi-level architecture. Ghosal et al. [40] developed a multi-attention recurrent network architecture for multimodal representation that learns features through attention. Tsai et al. [41] suggested learning interactions between modalities by employing multimodal transformers to construct an attention-based cross-modal architecture.
Using the RAVDESS dataset, Fu Z et al. [47], R. Chatterjee et al. [48], Chang X et al. [49], and Wang W et al. [50] achieved test accuracies of 75.76, 90.48, 91.4, and 89.8 with their respective multimodal systems. Ghaleb E et al. [52] and He G et al. [53] proposed multimodal systems that achieved test accuracies of 66.5 and 64 on the CREMAD dataset. Rory Beard et al. [51] proposed a multimodal system evaluated on the CREMAD and RAVDESS datasets, obtaining test accuracies of 65.0 and 58.3, respectively.

A. Dataset Description
The CREMAD and RAVDESS datasets are used for experimentation and evaluation. Both datasets contain emotional recordings of actors in both audio and video form. Angry, disgust, fear, happy, neutral, and sad are the emotions common to both datasets in both modes, whereas the RAVDESS audio data contains two additional emotions, calm and surprise. CREMAD consists of 22,326 audio and 60,359 video emotion samples, and RAVDESS consists of 4,321 audio and 45,225 video emotion samples. A detailed overview of the datasets is given in Table I below.

B. Image Feature Extraction
From the given set of video sequences in the multimodal dataset, the videos are first converted into images, and facial features are then extracted from those images. A detailed description of the features extracted from the videos is given below.
From the given set of facial emotion videos of a multimodal dataset, each frame sequence $f_{vid}$ is represented in a linear scale space obtained by convolving it with a three-dimensional Gaussian kernel:

$$L_{ss}(\cdot;\, \sigma_{lss}^2, \tau_{lss}^2) = Gau_k(\cdot;\, \sigma_{lss}^2, \tau_{lss}^2) * f_{vid}$$

where $x$ and $y$ represent the spatial axes of the frames obtained from the facial input video sequence $f_{vid}$, $t_d$ denotes the time axis in the temporal domain, and $\sigma_{lss}^2$ and $\tau_{lss}^2$ are the spatial and temporal variances of the kernel. A method proposed by Förstner and Harris [42][43] uses a Gaussian window to identify distinct points of the image, i.e., the locations in $f_{vid}$ where the image intensity changes significantly in the space and time domains when the Gaussian window is slid in various directions. The distinct points are detected by convolving the spatio-temporal second moment matrix with the Gaussian weighting function $Gau_k(\cdot;\, \sigma_i^2, \tau_i^2)$.

The spatio-temporal second moment matrix is a $3 \times 3$ matrix given by

$$\mu_{ch} = Gau_k(\cdot;\, \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_{ssx}^2 & L_{ssx}L_{ssy} & L_{ssx}L_{sst} \\ L_{ssx}L_{ssy} & L_{ssy}^2 & L_{ssy}L_{sst} \\ L_{ssx}L_{sst} & L_{ssy}L_{sst} & L_{sst}^2 \end{pmatrix}$$

where $L_{ssx}$, $L_{ssy}$, and $L_{sst}$ are the first-order derivatives defined as

$$L_{ssx}(\cdot;\, \sigma_{lss}^2, \tau_{lss}^2) = \partial_x (Gau_k * f_{vid}), \quad L_{ssy}(\cdot;\, \sigma_{lss}^2, \tau_{lss}^2) = \partial_y (Gau_k * f_{vid}), \quad L_{sst}(\cdot;\, \sigma_{lss}^2, \tau_{lss}^2) = \partial_t (Gau_k * f_{vid})$$

with $\sigma_i^2 = S_{ssk}\,\sigma_{lss}^2$, $\tau_i^2 = S_{ssk}\,\tau_{lss}^2$, where $S_{ssk}$ is a constant. The existence of distinct points in $f_{vid}$ is indicated by large eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of $\mu_{ch}$. In the spatio-temporal domain, the variations in image intensity are captured by combining the trace and determinant of $\mu_{ch}$:

$$H = \det(\mu_{ch}) - K \cdot \mathrm{trace}^3(\mu_{ch})$$

where $K$ is a constant and the function is normalized so that the effect of illumination changes in the images is removed.
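As an illustration, the following is a minimal NumPy/SciPy sketch of the spatio-temporal interest point criterion described above (the Förstner/Harris corner measure extended to space-time). The kernel scales, the integration factor, and the constant $K$ are illustrative choices, not values reported in this paper.

```python
# Minimal sketch of the spatio-temporal interest measure H = det(mu) - k * trace(mu)^3.
# Assumed parameter values (sigma_lss, tau_lss, s_ssk, k) are illustrative only.
import numpy as np
from scipy.ndimage import gaussian_filter

def spatio_temporal_interest(f_vid, sigma_lss=2.0, tau_lss=2.0, s_ssk=2.0, k=0.005):
    """f_vid: grayscale video volume of shape (t, y, x), float values."""
    # Linear scale space: convolution with a 3-D Gaussian (temporal, spatial, spatial).
    L = gaussian_filter(f_vid, sigma=(tau_lss, sigma_lss, sigma_lss))

    # First-order derivatives along t, y and x.
    L_t, L_y, L_x = np.gradient(L)

    # Entries of the 3x3 second moment matrix, smoothed with a wider Gaussian
    # (sigma_i^2 = s_ssk * sigma_lss^2, tau_i^2 = s_ssk * tau_lss^2).
    w = (np.sqrt(s_ssk) * tau_lss, np.sqrt(s_ssk) * sigma_lss, np.sqrt(s_ssk) * sigma_lss)
    sm = lambda a: gaussian_filter(a, sigma=w)
    Mxx, Myy, Mtt = sm(L_x * L_x), sm(L_y * L_y), sm(L_t * L_t)
    Mxy, Mxt, Myt = sm(L_x * L_y), sm(L_x * L_t), sm(L_y * L_t)

    # det(mu) - k * trace(mu)^3: large positive values mark distinct points.
    det = (Mxx * (Myy * Mtt - Myt ** 2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace ** 3
```

Local maxima of the returned volume give the candidate distinct points in space and time.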

C. Audio Feature Extraction
Zero crossing rate (ZCR), Mel Frequency Spectrum Coefficient (MFCC), pitch and energy are some of the feature extraction techniques used to extract the features of the emotions from the given audio signal.

1) Zero crossing rate:
The number of times the audio signal crosses the zero line (the x-axis) is referred to as the zero-crossing rate, and it is expressed as

$$ZCR_t = \frac{1}{2T} \sum_{n=1}^{T} \left| \mathrm{sgn}\big(x_{Aud_t}(n)\big) - \mathrm{sgn}\big(x_{Aud_t}(n-1)\big) \right|$$

where $x_{Aud_t}(n)$ is the time sequence of the $t$-th segment of the audio signal, obtained by dividing the signal with a sliding window of length $T$, and $n \in [0, N]$.
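The following short NumPy sketch computes the frame-wise zero-crossing rate defined above. The frame length and hop size are illustrative assumptions; the window settings used in the paper are not specified here.

```python
# Frame-wise zero-crossing rate (sketch); frame_len and hop are illustrative values.
import numpy as np

def zero_crossing_rate(x_aud, frame_len=400, hop=160):
    """x_aud: 1-D audio signal; returns one ZCR value per frame."""
    rates = []
    for start in range(0, len(x_aud) - frame_len + 1, hop):
        frame = x_aud[start:start + frame_len]
        # Count sign changes between consecutive samples, normalised by frame length.
        crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        rates.append(crossings / frame_len)
    return np.array(rates)
```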

2) MFCC (Mel frequency cepstral coefficient):
The coefficients of the corresponding spectral form of the audio stream are represented using a nonlinear Mel scale. The Mel frequency is used to compute the cepstral coefficients, and the steps below are followed.

The audio signal is first pre-emphasised and divided into frames; each frame is windowed, its FFT magnitude spectrum is computed and passed through a Mel-scale filter bank, the logarithm of the filter-bank energies is taken, and the discrete cosine transform is applied to obtain the cepstral coefficients. MFCC exhibits acoustic-tube characteristics such as pitch and energy and carries a great amount of emotional information, which plays a key role in emotion recognition.

3) Pitch: It depicts the signal's fundamental frequency [44]. From an emotional standpoint, the valence of an audio stream is connected to its rhythm and average pitch. For example, a higher pitch may be associated with discomfort and a lower standard deviation with sadness; happiness and discomfort usually come with higher talk and pitch rates, whereas sadness is represented by lower talk and pitch rates [45]. Autocorrelation is used to calculate the pitch of the audio signal as follows.
Let $x_{Aud}[n]$ be a sinusoidal stochastic process, $x_{Aud}[n] = \cos(w_0 n + \phi)$. The autocorrelation of $x_{Aud}[n]$ is given as

$$R_{x_{Aud}}[k] = E\big[x_{Aud}[n]\, x_{Aud}[n+k]\big] = \tfrac{1}{2}\cos(w_0 k)$$

The maximum of the autocorrelation value is used to calculate the pitch, and $S_{Aud}$ samples are used to compute the estimate of the autocorrelation.
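A minimal autocorrelation-based pitch estimator following this description is sketched below: the lag of the autocorrelation peak (excluding lag zero) gives the fundamental period. The sample rate and the 50-400 Hz search range are illustrative assumptions.

```python
# Sketch of pitch estimation from the autocorrelation peak; parameters are assumed.
import numpy as np

def pitch_autocorrelation(frame, sample_rate=16000, f_min=50, f_max=400):
    frame = frame - np.mean(frame)
    # Full autocorrelation, keeping non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    # The lag that maximises the autocorrelation corresponds to the pitch period.
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / best_lag
```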

4) Energy:
It represents the signal's intensity or total energy. From an emotional standpoint, an audio signal conveying exciting emotions (e.g., pain or happiness) has more energy than an audio signal conveying sadness or fatigue [46]. The energy of the audio signal $x_{Aud}(n)$ is given as

$$E_{Aud} = \sum_{n=1}^{N} \big| x_{Aud}(n) \big|^2$$
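For completeness, the short sketch below extracts the remaining audio descriptors, short-time energy and MFCCs, using librosa. The frame length, hop size, and choice of 13 coefficients are common defaults assumed here, not values taken from this paper.

```python
# Sketch of short-time energy and MFCC extraction with librosa; parameters are assumed.
import numpy as np
import librosa

def audio_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    # Short-time energy: sum of squared samples per frame.
    frames = librosa.util.frame(y, frame_length=400, hop_length=160)
    energy = np.sum(frames ** 2, axis=0)
    # Mel-frequency cepstral coefficients (one column per frame).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return energy, mfcc
```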

D. Feature Level Fusion
Of the features obtained from the audio and video signals, only a portion relates to emotion. Personality, age, gender, and many other characteristics are also captured in the audio and video signals, and these may affect the quality of emotion recognition when the features are used for training. Feature-level latent-space methods are one of the existing categories of methods used to find the common emotion-related features and map them into the required latent space; they do so by maximizing the cross-correlation between the respective features, minimizing the feature distance, or normalizing the features. Marginal Fisher Analysis (MFA) is a supervised method used here for audio-video feature-level fusion by extracting the required features from the respective modalities. The process of feature-level fusion is given below.
Information related to the class labels is used to generate the latent space. Let $X_{AV} = \{x_1, x_2, \ldots, x_N\}$ be the set of fused audio-video feature vectors (frames), where $N$ is the total number of samples and $N_{k_1}^{+}(i)$ denotes the $k_1$ nearest neighbours of sample $i$ belonging to the same class. The intra-class compactness is given as

$$\tilde{S}_c = \sum_{i}\; \sum_{j \in N_{k_1}^{+}(i)} \left\| w_{AV}^T x_i - w_{AV}^T x_j \right\|^2 = 2\, w_{AV}^T X_{AV} L_{AV} X_{AV}^T w_{AV}$$

and the inter-class separability is given by

$$\tilde{S}_p = \sum_{i}\; \sum_{(i,j) \in P_{k_2}(c_i)} \left\| w_{AV}^T x_i - w_{AV}^T x_j \right\|^2 = 2\, w_{AV}^T X_{AV} L_{AV}^P X_{AV}^T w_{AV}$$

where $c_i$ is the emotion class of sample $i$, $P_{k_2}(c_i)$ is the set of $k_2$ nearest pairs whose members belong to different classes, and $S_{AV}$ and $S_{AV}^P$ are the adjacency matrices of the corresponding intrinsic and penalty graphs. The objective function is

$$w_{AV}^{*} = \arg\min_{w_{AV}} \frac{w_{AV}^T X_{AV} L_{AV} X_{AV}^T w_{AV}}{w_{AV}^T X_{AV} L_{AV}^P X_{AV}^T w_{AV}}$$

and the optimal solution is obtained from the corresponding generalized eigenvalue problem, where $L_{AV} = D_{AV} - S_{AV}$ and $L_{AV}^P = D_{AV}^P - S_{AV}^P$ are the Laplacian matrices of $S_{AV}$ and $S_{AV}^P$.

E. Proposed CNN Architecture
The proposed CNN architecture consists of four convolutional blocks (referred to below as fully connected blocks), one flattening layer, and two dense layers. The blocks are connected in sequence, so the output features of each block are given as input to the next. The inputs to the first block are the audio, video, and multimodal features obtained during preprocessing by applying the audio feature extraction, image feature extraction, and feature-level fusion techniques described in the sections above. The first block consists of a convolution layer and a max pooling layer and is represented as

$$Out_{conv1} = Act\big(W_{i,j}^{n} * L_{AV}\big), \qquad Out_{convf1} = MaxPool\big(Out_{conv1}\big)$$

where $Out_{conv1}$ is the output of the convolutional layer, $Act$ is the activation function, $L_{AV}$ is the latent space (the latent features obtained after feature-level fusion), $W_{i,j}^{n}$ is the set of weights associated with the convolutional layer, and $Out_{convf1}$ is the output of the max pooling layer applied to $Out_{conv1}$. The output of the first block is given as input to the second block, which consists of convolution, max pooling, and dropout layers:

$$Out_{conv2f} = Drop_{0.2}\Big(MaxPool\big(Act\big(W^{[2n+1]} * Out_{convf1}\big)\big)\Big)$$

where $Out_{conv2f}$ is the output of the second block, $Drop_{0.2}$ means that 20% of the features output by the max pooling layer are dropped, and $W^{[2n+1]}$ are the associated weights.

$Out_{conv2f}$, the output of the second block, is given as input to the third block, which has the same layers as the second block, and its output is

$$Out_{conv3f} = Drop_{0.2}\Big(MaxPool\big(Act\big(W_{(3)} * Out_{conv2f}\big)\big)\Big)$$

$Out_{conv3f}$ is given as input to the fourth block, which consists of a convolution layer and a max pooling layer:

$$Out_{conv4f} = MaxPool\big(Act\big(W_{(4)} * Out_{conv3f}\big)\big)$$

where $W_{(3)}$ and $W_{(4)}$ are the weights of the third and fourth blocks. The output of the fourth block is passed to a flatten layer,

$$Out_{flat} = Flatten\big(Out_{conv4f}\big)$$

and the flattened output is given to a dense layer, after which a dropout of 20% is applied. The resulting features are given as input to the final dense layer, where the output is classified. The ReLU activation function is used in the intermediate dense layer, and a softmax activation function is used in the final dense output layer:

$$Out_{dense1} = Drop_{0.2}\big(ReLU\big(W_{d1}\, Out_{flat}\big)\big), \qquad Out_{final} = Softmax\big(W_{d2}\, Out_{dense1}\big)$$
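The Keras sketch below lays out this architecture: four convolution/max-pooling blocks (blocks two and three followed by 20% dropout), a flatten layer, a ReLU dense layer with dropout, and a softmax output. The filter counts, kernel sizes, and input length are illustrative assumptions rather than the paper's reported hyperparameters.

```python
# Sketch of the FERCNN layout described above; filter counts, kernel sizes and
# input_len are assumed values, not the paper's exact hyperparameters.
from tensorflow.keras import layers, models

def build_fercnn(input_len=128, n_classes=8):
    inputs = layers.Input(shape=(input_len, 1))           # fused latent feature vector
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(2)(x)                          # block 1: conv + max pooling
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.2)(x)                             # block 2: conv + pooling + dropout
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Dropout(0.2)(x)                             # block 3: same layout as block 2
    x = layers.Conv1D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)                          # block 4: conv + max pooling
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)            # intermediate dense layer (ReLU)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```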

F. Data Preprocessing
For experimentation, the RAVDESS and CREMAD datasets are used in this paper. The datasets contain audio and video emotion data of various actors, as described in the dataset description section of this module. The features of the video and audio data are obtained using the image feature extraction and audio feature extraction methods explained above. The numbers of features obtained from the audio and video data differ: there are more features in the resulting video image set than in the audio files. A dimensionality reduction technique is therefore applied to the image features so that the audio and video sets contain the same number of features. Finally, a multimodal dataset is obtained by combining the resulting audio and video features using the feature-level fusion technique explained in Section D, "Feature Level Fusion," of this module. The features obtained after feature-level fusion are given to the proposed CNN model for evaluation, which is described in Section E, "Proposed CNN Architecture." Fig. 2 gives the workflow of the proposed work.

Fig. 3(a) and (b), 3(c) and (d), and 3(e) and (f) show the training and testing accuracy and loss comparisons in the audio, video, and multimodal modes on the RAVDESS dataset; test accuracies of 95.96, 96.28, and 95.07 were observed. Fig. 4(a) and (b), 4(c) and (d), and 4(e) and (f) show the corresponding training and testing accuracy and loss comparisons on the CREMAD dataset; test accuracies of 80.70, 97.88, and 69.66 were observed. A detailed description of the results is given in Table II, with further details in Table III.

Fig. 6(a) and 6(b) show the confusion matrices and classification reports for the testing phase on the video data of the RAVDESS and CREMAD datasets. Average precision, recall, f1-score, and support values of 0.98, 0.985, 0.985, and 997 on RAVDESS and 0.973, 0.975, 0.975, and 1027 on CREMAD were observed, respectively. The detailed performance measures for the various emotions in the RAVDESS and CREMAD video data are given in Table IV.

A multimodal dataset has been obtained by combining the audio and video features using the feature-level fusion technique described in the feature-level fusion section of the proposed method on the RAVDESS and CREMAD datasets. A detailed description of the macro-average and weighted-average precision, recall, f1-score, and support for the RAVDESS and CREMAD datasets in all three modes (audio, video, and multimodal) is given in Table VI. The performance of the current work has been compared with earlier work; the proposed method performs better, and a detailed description of the comparisons is given in Table VII.

V. CONCLUSION AND FUTURE WORK

A multimodal system for emotion recognition was proposed in the current work. Audio and video information are used. Audio features are obtained with the Mel-Frequency Cepstral Coefficients extraction technique, and all the videos are converted into images and stored in a spatial-temporal space.
The image features are extracted using a Gaussian weighted function. The MFA fusion technique is used to fuse the audio and video features, and the resulting features are given to the FERCNN model for training and evaluation. For experimentation, the RAVDESS and CREMAD datasets, which consist of audio and video data, are used. Test accuracies of 95.07 and 69.66 were obtained on the RAVDESS and CREMAD datasets in multimodal mode. Although many multimodal emotional datasets exist, only two are considered here. An efficient multimodal system that generalizes to all types of multimodal emotional databases can be designed, and the maximum multimodal accuracy on the CREMAD dataset, 69.66%, can be further improved.