Emotions Classification from Speech with Deep Learning

Emotions are an essential part of social interaction, conveying meaning to the interlocutors. Hence, recognising emotions is paramount in building an affective system that can interact naturally with human interlocutors. However, recognising emotions from social interactions requires temporal information in order to classify the emotions correctly. This research proposes an architecture that extracts temporal information from the speech modality using a Temporal Convolutional Neural Network (CNN) combined with a Long Short Term Memory (LSTM) architecture. Several combinations and settings of the architectures were explored and are presented in the paper. The results show that the best classifier was achieved by the model trained with four layers of CNN combined with one layer of Bidirectional LSTM. Furthermore, this model was trained with an augmented training dataset seven times the size of the original training dataset. The best model resulted in 94.25%, 57.07%, 0.2577 and 1.1678 for training accuracy, validation accuracy, training loss and validation loss, respectively. Moreover, Neutral (Calm) and Happy are the easiest classes to recognise, while Angry is the hardest to classify.


I. INTRODUCTION
Emotions are one of the essential communication factors during social interactions. They provide additional meaning to verbal communication. Most of the meaning of a conversation is conveyed through non-verbal channels (e.g. speech prosody, body gestures and facial expressions) [1], [2], [3]. Hence, capturing emotions during social interactions between interlocutors is essential to building a system that can interact with humans effectively, efficiently, and naturally. Several efforts have been made to build models that can automatically classify emotions from non-verbal cues in a conversation. Some researchers model the emotion classifier from the image or video modality (e.g. Facial Expression Recognition and Hand and Body Gesture). Others use the speech and text modalities to recognise emotions from the conversation. Generally, emotions are classified into six basic emotions plus neutral [4]. Recognising emotions from a conversation is a cumbersome task for a socially ignorant computer [3]. Several problems exist in building a good emotion classifier from social conversation. The first is the dataset: most datasets available for modelling emotion recognition are not balanced across the emotion classes, because not all emotions are expressed equally often. The second problem is that not all emotion recognition models perform well at recognising emotions from conversation. The results depend on the machine or deep learning algorithm implemented, the dataset used, the pre-processing applied, and the modality used (video, image, text or speech). This research proposes and explores several deep learning architectures based on Temporal Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) to extract features and classify emotions from speech. Most emotion recognition tasks require temporal information to improve model performance.
Hence, this research proposes a combination of a Temporal CNN, which extracts features from the speech signals, with an LSTM, which extracts further features and classifies the emotions. The results show that MODEL-5 is the best model, with a training accuracy of 99.92%, a validation accuracy of 78.22%, a training loss of 0.0144 and a validation loss of 0.8432. The rest of this paper is organised as follows. The next section reviews the related work and the state of the art in emotion recognition from speech. The following section, Emotions Recognition from Speech, presents the proposed framework for modelling emotion recognition from speech signals, together with the details of the experimental settings. The results are presented and discussed in the section Results and Discussion. Finally, the last section gives the conclusion and future research directions.

II. RELATED WORK
A. Emotion Detection / Recognition
Emotions are one of the essential parts of social interactions. Emotions convey more than 80% of the meaning during social interactions between interlocutors [2], [1]. Hence, detecting or recognising emotions is a paramount task in building a good and natural affective system. Emotion Detection / Recognition is a classification method that extracts an important feature, namely the emotion contained in an input, for various uses [5]. The input can take various forms, such as speech [6], text [7] and visual [8], [9] cues. Most emotion detection/recognition tasks use machine or deep learning (e.g. convolutional-based, attention-based, recurrent-based and transformer-based models) to build the detector or recogniser. Analysing emotions can help in various fields, one of which is human-computer interaction, where it can help computers make better decisions for their users. Some research regarding emotion detection has many variations,

B. Speech Emotion Recognition
Speech Emotion Recognition (SER) is a method for mapping the features of a speech signal to the emotions contained in the speech. SER is not a new field of study [13]. However, as technology has developed, several methodological advances have become applicable to SER, making research in the field more varied and complex in pursuit of better results. SER usually uses a classification algorithm to map a speech input to an output in the form of an emotion class. In general, the SER pipeline consists of data pre-processing, feature extraction, and model training and evaluation. The data pre-processing generally involves data augmentation as well as data framing and windowing. Feature extraction techniques are applied to the data after it has been pre-processed. The features can be extracted in the form of spectral features, prosodic features, or a combination of both. Finally, the features are used to train and evaluate machine (or deep) learning algorithms.
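The framing and windowing step of the pipeline above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the 25 ms frame length, the 10 ms hop, the Hamming window and the synthetic 440 Hz tone are all assumptions chosen only for illustration.

```python
import math
from typing import List


def hamming(n: int) -> List[float]:
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a signal into overlapping frames and apply a Hamming window to each.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16_000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

# Assumed settings: 25 ms frames (400 samples) with a 10 ms hop (160 samples).
frames = frame_signal(signal, frame_len=400, hop=160)
print(len(frames), len(frames[0]))
```

Each windowed frame would then be passed to a feature extractor (spectral, prosodic, or both) before model training.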

C. Convolutional Neural Network in Speech Emotion Recognition
Although the CNN was designed for image recognition, it can be extended to Natural Language Processing and Speech Processing [14], [15], [16]. Research on CNNs for Speech Emotion Recognition was conducted in 2016 [17] and 2018 [18] using the RECOLA dataset [19]. It combined a Convolutional Neural Network with Long Short-Term Memory, which resulted in a model that outperformed traditional approaches based on signal processing techniques. In 2017, [6] studied Speech Emotion Recognition using a Deep Convolutional Neural Network (DCNN) and Discriminant Temporal Pyramid Matching (DTPM) to classify the speaker's emotion, resulting in a good model for automatic feature learning in speech emotion recognition tasks. That research concludes that the DCNN is effective not only for image recognition but also for Speech Emotion Recognition. In 2020, [20] proposed a CNN-based framework for enhancing audio signal processing for Speech Emotion Recognition. The framework uses a discriminative CNN on spectrograms which, according to the author, contain many features that text or phonemes cannot represent.

III. EMOTIONS RECOGNITION FROM SPEECH
This research proposes the five best architectures that combine a Temporal Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) to extract features and classify emotions from speech signals. The dataset used in this research is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [21]. The dataset contains more than 7000 audio files of emotional speech acted by twenty-four professional actors. This research only uses the emotional song dataset, which contains 920 audio files and four basic emotions (i.e. Angry, Fear, Happy and Sad) plus Neutral. The RAVDESS dataset encodes the audio information (e.g. label and gender) in the filename. Hence, text sub-string methods were applied to extract the label of each file. In this research, gender information is not used. Moreover, several pre-processing techniques were applied to enhance the quality of the dataset. First, the sample rate of the dataset was set to 16 kHz and, to normalise the speech duration, the audio signal was padded to a maximum of 3 seconds. The dataset was split into training and test sets with a ratio of 80%:20% (736:184). The training data were then augmented to improve the quality of the data. This research proposed two data augmentation settings: seven times the training data (5,152) and three times the training data (2,208). Table I illustrates the emotion class distribution for each augmentation setting. The column Train Aug 1 denotes the augmentation with three times the training data, while the column Train Aug 2 refers to the augmentation with seven times the training data. The dataset is imbalanced: the Sad class is the majority class, and the Angry class is the minority class. Fig. 1 illustrates an example of the speech signal. The X-axis indicates the time in seconds (s), while the Y-axis indicates the amplitude of the signal in Decibel (dB).
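The label extraction, padding and split arithmetic described above can be sketched as follows. This is a hedged illustration, not the authors' code. The filename layout and emotion codes follow the published RAVDESS naming convention. The example filename, the helper names, the truncation of clips longer than 3 seconds and the Gaussian noise level are assumptions.

```python
import os
import random

# RAVDESS filename fields:
# modality-vocal_channel-emotion-intensity-statement-repetition-actor
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

SAMPLE_RATE = 16_000          # 16 kHz, as stated in the text
MAX_SECONDS = 3               # clips are normalised to 3 seconds
TARGET_LEN = SAMPLE_RATE * MAX_SECONDS


def parse_label(filename):
    """Return the emotion name encoded in the third filename field."""
    stem = os.path.basename(filename).split(".")[0]
    return EMOTIONS[stem.split("-")[2]]


def pad_or_truncate(samples, length=TARGET_LEN):
    """Zero-pad short clips; cutting long clips is an assumption."""
    samples = list(samples[:length])
    return samples + [0.0] * (length - len(samples))


def add_noise(samples, noise_std=0.005, seed=0):
    """Additive Gaussian noise; noise_std is a hypothetical setting."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in samples]


# Split and augmentation sizes reported in the text.
total_files = 920
n_train = total_files * 80 // 100      # 736
n_test = total_files - n_train         # 184
aug_3x, aug_7x = n_train * 3, n_train * 7

print(parse_label("03-02-05-01-01-01-12.wav"), n_train, n_test, aug_3x, aug_7x)
```

How the augmented copies were actually generated (noise only, or other transformations as well) is not specified in this section, so `add_noise` should be read as one plausible option.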
The left side of the image illustrates the original speech signal, and the right side demonstrates the augmented speech signal with noise. Feature extraction using the Short-time Fourier Transform (STFT) and Mel frequencies was applied to generate the Mel-Spectrogram representation of each audio file. The features were extracted with a hop length of 512 and a window of 256. To normalise the features, all vectors were then padded with zeros up to 2,048 to match the Fast Fourier Transform input. The next step was to generate the Mel-Spectrogram from the Mel frequencies, using 128 Mel bins and a maximum frequency of 4.0 kHz. Finally, the features were framed with a window step of 128 and a window size of 64.
[Fig. 1] [Fig. 2] [Table II]
The Flatten layer aims to flatten all the extracted layers with temporal features. Moreover, the LSTM block consists of LSTM layers with 128 units. In the proposed architecture, the LSTM block has one to two LSTM layers plus one bi-directional layer (see Table II). Finally, the
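The Mel-Spectrogram settings given in this section (128 Mel bins, a maximum frequency of 4.0 kHz, a hop length of 512) can be made more concrete with a small sketch. The Hz-to-Mel formula used here is the common HTK-style definition, which is an assumption since the paper does not say which variant was used. The frame count assumes centre-padded frames, as many audio libraries do by default.

```python
import math

SAMPLE_RATE = 16_000   # Hz, from the text
CLIP_SECONDS = 3       # clip length after padding, from the text
HOP_LENGTH = 512       # from the text
N_MELS = 128           # from the text
F_MAX = 4000.0         # Hz, from the text


def hz_to_mel(f_hz):
    """HTK-style Hz-to-Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centres(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels filters equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def n_frames(n_samples, hop):
    """Number of STFT frames when the signal is centre-padded (assumption)."""
    return 1 + n_samples // hop


centres = mel_centres(N_MELS, 0.0, F_MAX)
frames = n_frames(SAMPLE_RATE * CLIP_SECONDS, HOP_LENGTH)

# Under these assumptions each clip yields an N_MELS x frames matrix.
print(N_MELS, frames, round(centres[0], 1), round(centres[-1], 1))
```

Because the filters are equally spaced on the Mel scale, they are denser at low frequencies, which is the motivation for using Mel features for speech.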

IV. RESULTS AND DISCUSSION
Five architectures with two data augmentation settings were explored in this research. The architectures combine a Temporal CNN and an LSTM (or Bidirectional LSTM) to extract features and classify the emotions. Fig. 3 illustrates the baseline of the proposed architectures and Table II lists the proposed architectures. The results show that the models trained with seven times training data augmentation perform better than the models trained with three times training data augmentation.
Overall, there are no significant differences in training accuracy between the models trained with three times training data augmentation and the models trained with seven times training data augmentation. However, the models trained with seven times training data augmentation provide higher validation accuracy and lower validation loss. Moreover, the models trained with three times training data augmentation suffer from over-fitting even though batch normalisation and dropout were applied to the CNN and LSTM architectures. Fig. 4 illustrates the confusion matrix for each class in the best model (i.e. MODEL-5).
The results show that Neutral (Calm) and Happy are the easiest emotions to classify from the given speech dataset, while Angry is the hardest emotion to classify compared to the other classes. Angry samples are also frequently misclassified, appearing as false positives in the other classes. This is most likely due to the number of Angry samples in both the training and testing datasets. Finally, the Adam and SGD optimisers do not differ significantly in training accuracy, validation accuracy, training loss or validation loss.
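The per-class analysis behind a confusion matrix such as Fig. 4 can be sketched as follows. The labels below are hypothetical toy data, not the paper's predictions, and the helper names are assumptions. The sketch only shows how per-class recall and precision are read from the matrix.

```python
from collections import Counter

CLASSES = ["Angry", "Fear", "Happy", "Neutral", "Sad"]


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


def per_class_precision(matrix):
    """Fraction of each predicted class that was actually correct."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [matrix[i][i] / col_sums[i] if col_sums[i] else 0.0
            for i in range(len(matrix))]


# Hypothetical toy labels, for illustration only.
y_true = ["Angry", "Angry", "Happy", "Happy", "Neutral", "Sad", "Sad", "Fear"]
y_pred = ["Sad", "Angry", "Happy", "Happy", "Neutral", "Sad", "Angry", "Fear"]

matrix = confusion_matrix(y_true, y_pred, CLASSES)
recall = dict(zip(CLASSES, per_class_recall(matrix)))
precision = dict(zip(CLASSES, per_class_precision(matrix)))
print(recall)
print(precision)
```

A class with low recall is hard to recognise. A class with low precision attracts false positives from the other classes.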

V. CONCLUSION AND FUTURE WORK
Five architecture settings, varying the combination of CNN and LSTM (or Bidirectional LSTM), the number of dropout layers and the data augmentation setting, were explored in this research. The architectures were used to train emotion recognition models on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. The dataset was pre-processed and augmented with two data augmentation settings (i.e. three times and seven times the original data). The results show that the best model was achieved by MODEL-2, which provides 94.25%, 57.07%, 0.2577 and 1.1678 for training accuracy, validation accuracy, training loss and validation loss, respectively. Moreover, Neutral (Calm) and Happy are the easiest emotions to classify from the given speech dataset, while Angry is the hardest emotion to classify compared to the other classes. This is due to the number of samples in the Angry class in both the training and testing datasets.
For future research, more combinations of architectures, such as attention-based and Transformer-based architectures, will be explored to improve the recogniser's performance. Moreover, multi-modal features can be explored to increase the accuracy and tackle the overfitting problem. In particular, features from video (e.g. facial expressions and body gestures), speech and text can be combined to build a better model for emotion recognition. Finally, the trained emotion recogniser can be deployed in more complex affective systems such as virtual humans, where recognising emotions can be one of the tools for extracting non-verbal meaning from human interlocutors.