Introducing the Urdu-Sindhi Speech Emotion Corpus: A Novel Dataset of Speech Recordings for Emotion Recognition for Two Low-Resource Languages

Speech emotion recognition is one of the most active areas of research in the field of affective computing and social signal processing. However, most research is directed towards a select group of languages such as English, German, and French, mainly because few datasets are available in other languages. Languages for which publicly available datasets are scarce are called low-resource languages. In the recent past, there has been a concerted effort within the research community to create and introduce emotion recognition datasets for low-resource languages. To this end, we introduce in this paper the Urdu-Sindhi Speech Emotion Corpus, a novel dataset consisting of 1,435 speech recordings for two widely spoken languages of South Asia, namely Urdu and Sindhi. Furthermore, we trained machine learning models to establish a baseline for classification performance, measured in terms of unweighted average recall (UAR). We report that the best performing model for Urdu achieves a UAR of 65.49% on the validation partition and a UAR of 56.96% on the test partition. Meanwhile, the model for Sindhi achieves UARs of 66.54% and 55.29% on the validation and test partitions, respectively. This classification performance is considerably better than the chance level UAR of 14.28% for seven emotion classes. The dataset can be accessed via https://zenodo.org/record/3685274

Keywords: Speech emotion recognition; affective computing; social signal processing


I. INTRODUCTION
According to the Oxford dictionary (https://www.oxfordlearnersdictionaries.com/definition/english/emotion), the word emotion is defined as a strong feeling such as love, fear, or anger; the part of a person's character that consists of feelings. However, in research literature from the field of psychology, one finds that there is no consensus on a definition of emotion. According to [1], an emotion is any mental experience with high intensity and high hedonic content (pleasure/displeasure). Meanwhile, [2] defines emotion as a complex psychological event that involves a mixture of reactions: 1) a physiological response, 2) an expressive reaction (distinctive facial expression, body posture, or vocalization), and 3) some kind of subjective experience (internal thoughts and feelings).
Expression of feelings, and by extension emotions, is a fundamental part of human behavior. Emotions play an important role in how one thinks and behaves, which means that analysis of emotions exhibited by individuals can be used to gain insights into their thought process.
In the age of artificial intelligence, there has been a growing desire amongst the research community to enable interaction between machines (say, robots) and human beings on a more natural level. This is possible when machines can understand, interpret, and recognize human emotion. To achieve this, researchers from the field of affective computing and social signal processing have explored the development of computational methods for emotion recognition from various modalities such as speech [3], [4], facial expressions [5], [6], text [7], [8], and physiological signals [9], [10].
Amongst these modalities, speech is particularly interesting since it is the most natural way for human beings to exhibit emotions [3]. In addition to providing social intelligence to machines, speech emotion recognition can be used to assist emergency services and healthcare professionals. For example, an emotion recognition system linked with emergency services call centers can be useful to gauge the intensity of distress of the caller and subsequently assign their call to a higher priority.
While a great deal of research literature is available on emotion recognition, an overwhelming majority of it caters to western European languages such as English, German, and French; this is mainly because most available datasets are in these languages. Based on our literature survey, we find that there is a particular scarcity of datasets from the South Asian family of languages, even though the region is home to more than 1.891 billion people 2.
We note that recently there have been efforts by several researchers to design and create datasets for speech emotion recognition for South Asian languages. Koolagudi et al. [11] published a large speech emotion recognition dataset for Telugu, a language predominantly spoken in Southern India. The dataset consists of 12,000 utterances in total for eight types of emotions: anger, disgust, fear, happiness, neutral, sadness, sarcasm, and surprise. In [12], Syed et al. introduced the Emotion-Pak Corpus, which covers four emotions (sadness, comfort, anger, and happiness) in five languages spoken in Pakistan: Urdu, Sindhi, Balochi, Punjabi, and Pashto. The dataset was recorded using ten native speakers for the five languages. While this dataset is the most relevant to our work, we did not receive a reply from Syed et al. after requesting access to the Emotion-Pak Corpus. Finally, Latif et al. [13] introduced an emotion corpus for Urdu. The dataset consists of 400 audio recordings for four emotions that were collected from television programs. The dataset is available for academic research on speech emotion recognition 3.
In this paper, we introduce a novel speech emotion dataset consisting of 1,435 audio recordings which can be used to train machine learning models for speech-based emotion recognition in two South Asian languages, namely Urdu and Sindhi. Urdu 4 is the national language as well as the lingua franca of Pakistan and is also widely spoken in India. There are upwards of 68.62 million native speakers of Urdu and more than 101.58 million individuals speak Urdu as a secondary language. Meanwhile, Sindhi 5 has more than 25 million native speakers in South Asia, mostly centered in the Sindh province of Pakistan. It is one of the three official languages of the Sindh province in addition to being one of the recognized languages of India.
The rest of the paper is organized as follows: in Section II we introduce the methodology for collection of the Urdu-Sindhi Speech Emotion Corpus, whereas in Section III we detail the methodology for establishing the baseline classification performance for the dataset. Experimental results and discussion are provided in Section IV, and the conclusion is provided in Section V.

II. DATASET COLLECTION
In this section we introduce the data collection methodology for the Urdu-Sindhi Speech Emotion Corpus with the aid of Fig. 1, which illustrates the data collection framework. We prepared 10 sentence scripts each for seven types of emotional utterances in Urdu and Sindhi. These emotions are anger, disgust, happiness, neutral, sarcasm, sadness, and surprise. The scripts were validated by the authors of this paper as well as two post-graduate students before being passed on to volunteer participants.
Participants for this study were recruited from amongst undergraduate students currently studying in the Department of Telecommunication Engineering at Mehran University, Pakistan. These participants were instructed to record themselves uttering the scripts with the predefined emotions and to send the audio recordings to the authors via WhatsApp 6. We specifically chose WhatsApp-based data collection instead of a bespoke recording studio/room since the former enables us to recruit a larger number of participants, including those who may not be able to come to a recording studio.
The audio recordings sent by participants were collected via Twilio 7, an API that provides connectivity between WhatsApp and a desktop computer. Through this process, we were able to collect 734 speech recordings for Urdu and 701 recordings for Sindhi. A summary of the number of recordings for each emotion is provided in Table I. Each of these recordings was manually checked to ensure that its content was as desired for this study. Readers who are interested in the dataset can access it via https://zenodo.org/record/3685274.

III. METHODOLOGY FOR BASELINE CLASSIFICATION PERFORMANCE
It is common practice in the field of affective computing and social signal processing to provide a baseline classification performance for every novel dataset when it is introduced for academic research. This helps the larger research community become familiar with the dataset. Therefore, we provide a baseline classification performance for the Urdu-Sindhi Speech Emotion Corpus as well. Our motivation is to use open-source and freely available tools (at least for non-commercial research) so that the baseline classification performance can be reproduced with relative ease.
A generic process flow diagram for speech emotion classification is illustrated in Fig. 2. The first step is to compute audio features that represent the acoustic characteristics of speech relevant to the task at hand. For this purpose, we use five feature sets from the OpenSmile toolkit [14], [15]: the Prosody feature set, the IS09-Emotion feature set, the IS10-Paralinguistics feature set, the ComParE feature set, and the eGeMAPS feature set. As the reader shall see, these feature sets have proven useful for quantifying paralinguistic characteristics of speech such as prosody, voice quality, and speech spectra. In the subsequent paragraphs, we briefly describe these feature sets.
Prosody feature set: The Prosody feature set produces a 35-dimensional vector based on functionals of four types of acoustic low-level descriptors. These comprise two prosody features, namely pitch and loudness, and two voice quality features, namely the harmonic-to-noise ratio (HNR) and the probability that a speech segment contains voiced speech (voicing probability). We refer the reader to [15], [14] for further details about the Prosody feature set.
IS09-Emotion feature set: The OpenSmile IS09-Emotion feature set produces a 384-dimensional vector based on functionals of four types of features, with one each to describe the prosodic, voice quality, spectral, and temporal characteristics of speech. Similar to the Prosody feature set discussed earlier, the IS09-Emotion feature set uses pitch and voicing probability as prosody and voice quality features, respectively. In addition to these, Mel Frequency Cepstral Coefficients (MFCC) features are used to describe the spectral characteristics of voice, whereas the zero crossing rate of the voice signal is used to describe its temporal characteristics. The IS09-Emotion feature set was introduced for the year 2009 edition of the Interspeech Computational Paralinguistics Challenge [16] and the feature set was shown to be useful for the task of emotion recognition from speech. We refer the reader to [15], [16], [14] for further details about this feature set.
IS10-Paralinguistics feature set: The IS10-Paralinguistics feature set produces a 1,582-dimensional vector based on functionals for eight types of features which describe the prosodic, voice quality, and spectral characteristics of speech. Prosody is characterized using pitch and loudness features, whereas voice quality is characterized using voicing probability, jitter, and shimmer features. Spectral characteristics of voice are described using MFCCs, spectral bands filtered by log-Mel filters, and the line spectral pairs of frequencies features which represent linear prediction coefficients. The IS10-Paralinguistics feature set was introduced for the year 2010 edition of the Interspeech Computational Paralinguistics Challenge [17] and these features were shown to be useful for a variety of classification tasks related to speech paralinguistics.
ComParE feature set: The Computational Paralinguistics Challenge (ComParE) feature set is a 6,373-dimensional feature set which was introduced for the year 2016 edition of the Interspeech Computational Paralinguistics Challenges [18]. The ComParE feature set is often referred to as a brute-force feature set since it includes features which describe a wide range of acoustic characteristics. It has been shown to work well for a variety of tasks related to speech paralinguistics and has been used to establish strong baselines for classification and regression tasks for Interspeech Computational Paralinguistics Challenges [18], [19], [20], [21]. We refer the reader to [18], [14] for further details about this feature set.
eGeMAPS feature set: The Extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) feature set was designed by some of the leading researchers in the field of social signal processing in order to facilitate a common framework for research into speech paralinguistics. It was also intended to serve as a more efficient and lower dimensional feature set than the ComParE feature set. The eGeMAPS feature set produces an 88-dimensional vector based on functionals for various types of prosody, voice quality, and spectral features. Similar to the IS10-Paralinguistics feature set, prosody is characterized through pitch and loudness features, and voice quality is characterized by voicing probability, jitter, and shimmer features. In addition to these, eGeMAPS also uses harmonic difference features to describe voice quality. These include H1-H2 and H1-H3, which quantify differences in the amplitude of second and third harmonics with respect to the amplitude of the first harmonic. The eGeMAPS feature set uses eight types of features to describe the spectral characteristics of speech. Spectral features used in eGeMAPS include alpha ratio, the Hammarberg index, spectral slopes, spectral flux, formant frequencies, relative energies for each formant frequency with respect to the first formant, and the bandwidth for the first formant frequency. We refer the reader to [22], [14] for further details about the eGeMAPS feature set.
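All five feature sets share one idea: frame-level low-level descriptors are summarized by functionals (statistics over time), so that recordings of any duration map to a vector of fixed length. The following is a minimal sketch of that idea only and does not use the OpenSmile toolkit or reproduce any of its feature sets. The two descriptors (zero crossing rate and RMS energy as a crude loudness proxy), the four functionals, the frame sizes, and all function names are assumptions chosen for illustration.

```python
import math
import statistics


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def functionals(contour):
    """Summarize a frame-level contour with four statistics."""
    return [statistics.fmean(contour), statistics.pstdev(contour),
            min(contour), max(contour)]


def utterance_vector(samples, frame_len=400, hop=160):
    """Map a recording of any length to a fixed 8-dimensional vector."""
    frames = frame_signal(samples, frame_len, hop)
    zcr = [zero_crossing_rate(f) for f in frames]
    rms = [rms_energy(f) for f in frames]
    return functionals(zcr) + functionals(rms)


# Two synthetic "recordings" of different duration (sine tones at 16 kHz).
short = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(8000)]
long_ = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(24000)]

vec_short = utterance_vector(short)
vec_long = utterance_vector(long_)
print(len(vec_short), len(vec_long))  # both vectors have the same length
```

Because both recordings yield vectors of equal length, they can be fed to the same classifier regardless of duration, which is the property the functional-based feature sets rely on.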
Once audio features have been computed for all audio recordings in the dataset, a classifier can be trained for emotion recognition. We choose the logistic regression classifier for this purpose although any other classification algorithm could have also been used. We make use of cross-validation in order to assess the predictive performance of these machine learning models. Cross-validation makes it possible to infer the performance of machine learning models outside of the samples which were used to train those models.

IV. EXPERIMENTATION, RESULTS AND DISCUSSION
We use the implementation of the logistic regression classifier which is available in the scikit-learn toolkit 8. The complexity value of the logistic regression algorithm is optimized over a logarithmically spaced grid from $10^{-7}$ to $10^{7}$. The classifier is trained with an l2-penalty for up to 10,000 iterations.
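A hedged sketch of such a hyperparameter search is given below. It is not the code used for the reported experiments: the data are synthetic, the grid size of 15 points is an assumption (the text only states the end points), and mapping the "complexity value" to scikit-learn's `C` parameter (the inverse regularization strength) is likewise an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for acoustic feature vectors with seven emotion classes.
X, y = make_classification(n_samples=350, n_features=20, n_informative=10,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Logarithmically spaced grid from 1e-7 to 1e7 (15 points is an assumption).
c_grid = np.logspace(-7, 7, num=15)

val_uar = {}
for c in c_grid:
    # scikit-learn's default penalty for LogisticRegression is L2.
    clf = LogisticRegression(C=c, max_iter=10000)
    clf.fit(X_train, y_train)
    # Macro-averaged recall is the unweighted average recall (UAR).
    val_uar[c] = recall_score(y_val, clf.predict(X_val), average="macro")

best_c = max(val_uar, key=val_uar.get)
print(f"best C = {best_c:g}, validation UAR = {val_uar[best_c]:.3f}")
```

Selecting `C` on a held-out validation partition, rather than on the training data, keeps the search from simply favoring the least regularized model.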
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020
Audio features are computed as per the discussion in the previous section. The dataset is divided into three partitions, namely training, validation, and test, with a 60:20:20 ratio. The classifier is trained using the training partition, its hyperparameter is optimized using the validation partition, and the final classification results are evaluated on the test partition. For the sake of completeness, we report the results for both validation and test partitions.
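One way to realize a 60:20:20 partition is sketched below. The text does not state whether the split was stratified by emotion or by speaker, so the per-class stratification, the random seed, the function name `stratified_split`, and the balanced toy label list are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists, splitting each class by `ratios`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(ratios[0] * len(idxs))
        n_val = round(ratios[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test partition
    return train, val, test


EMOTIONS = ["anger", "disgust", "happiness", "neutral",
            "sarcasm", "sadness", "surprise"]
# Hypothetical label list: 100 clips per emotion (not the corpus counts).
labels = [EMOTIONS[i % 7] for i in range(700)]

train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
print(Counter(labels[i] for i in val))
```

Stratifying by class keeps the emotion distribution similar across the three partitions, which matters when the number of recordings per emotion is not identical.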

A. Classification Performance for Urdu Language
In Table II, the classification performance of the five audio feature sets is summarized for Urdu. Here, one can note that for the validation partition, the ComParE feature set provides the highest UAR, i.e. 65.49%, which is a strong performance given that the chance level UAR for seven classes is only 14.28%. Amongst the other feature sets, the IS10-Paralinguistics feature set provides the second-best performance, achieving a UAR of 59.46%. Interestingly, the IS09-Emotion and eGeMAPS feature sets, which were explicitly designed for tasks related to emotion recognition, do not yield classification results as good as those of the ComParE or IS10-Paralinguistics feature sets. On the test partition, the ComParE feature set achieves a UAR = 56.96% whereas the IS10-Paralinguistics feature set achieves a UAR = 59.40%.
Fig. 3 shows the confusion matrix of the best performing model (based on ComParE features) for speech emotion recognition in Urdu. Here, one can note that the class whose labels are predicted most accurately is Surprise, followed by Sadness and Neutral. Meanwhile, the classifier had the most difficulty in classifying the Disgust emotion, often mistaking it for Happiness and Sadness.
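To make the metric and its chance level concrete, the sketch below computes UAR as the mean of per-class recalls and shows that a degenerate classifier which always predicts a single class scores 1/7 on seven classes. The function name and the toy labels are assumptions for illustration; the labels are not taken from the corpus.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


EMOTIONS = ["anger", "disgust", "happiness", "neutral",
            "sarcasm", "sadness", "surprise"]

# Toy ground truth: ten samples per emotion.
y_true = [emotion for emotion in EMOTIONS for _ in range(10)]

# A classifier that ignores its input and always answers "neutral".
always_neutral = ["neutral"] * len(y_true)

chance = unweighted_average_recall(y_true, always_neutral)
perfect = unweighted_average_recall(y_true, y_true)
print(f"chance-level UAR = {chance:.4f}, perfect UAR = {perfect:.1f}")
```

Because each class contributes equally regardless of how many recordings it has, UAR is not inflated by a classifier that favors the most frequent emotion.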

B. Classification Performance for Sindhi Language
In Table III, the classification performance of speech emotion recognition for Sindhi is summarized. Here, one can note that the ComParE feature set again provides the best classification performance on the validation partition. It achieves a UAR = 66.54%, which is comparable to the UAR achieved by the same feature set for Urdu. Similarly, we find that the IS10-Paralinguistics feature set achieves the second-best performance with a UAR = 62.17%. On the test partition, these feature sets achieve a UAR = 55.29% and a UAR = 46.82%, respectively.
The confusion matrix for the best performing model (based on ComParE features) for Sindhi is shown in Fig. 4. Here, one can note that the classifier performs best for Happiness. It performs worst for the Neutral class, often mistaking it for Anger, Sadness, and Sarcasm.
Overall, we report that the ComParE feature set is suitable for emotion recognition in the two South Asian languages considered, that is Urdu and Sindhi. We hypothesize that this

C. Cross-language Classification Performance
Finally, we seek to quantify how well machine learning models perform when they are optimized for speech emotion recognition in one language, say Urdu, and are tested for the other language, say Sindhi, and vice versa. One would assume that given the two languages are widely spoken in the same region, emotional intonation between the two languages may be similar and as a result, some degree of transferability between models may exist.
To this end, we summarize in Table IV the cross-language classification performance of the top two feature sets, namely the IS10-Paralinguistics and ComParE feature sets. Contrary to our expectation, one finds that there is little transferability of information between the two languages. When the logistic regression model is trained on Urdu, the highest UAR it achieves on the test partition of Sindhi is 19.15%, which is only marginally above the chance level of 14.28%. Similarly, a model trained on Sindhi only achieves a maximum UAR of 17.69% on the test partition of Urdu.
We believe that the results in Table IV are particularly interesting because they show that the transferability of machine learning models for emotion recognition does not always hold, even when the two languages belong to the same language group and are spoken in the same region. However, one can argue that more powerful machine learning models, such as those based on deep learning [23], are likely to perform better than logistic regression.
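The cross-language protocol can be sketched as follows: a classifier is fitted on the features of one language and scored, in terms of UAR, on the features of the other. The example is illustrative only and does not reproduce Table IV. The data are synthetic, with independent random class centroids per "language" so that class structure deliberately does not carry over, and the function name `cross_language_uar`, the fixed `C=1.0`, and all sizes are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score


def cross_language_uar(X_src, y_src, X_tgt, y_tgt, C=1.0):
    """Fit on the source language, report UAR on the target language."""
    clf = LogisticRegression(C=C, max_iter=10000)
    clf.fit(X_src, y_src)
    return recall_score(y_tgt, clf.predict(X_tgt), average="macro")


rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 7, 40, 12
y = np.repeat(np.arange(n_classes), n_per_class)

# Hypothetical features: each "language" has its own random class centroids.
centroids_a = 3 * rng.normal(size=(n_classes, n_features))
centroids_b = 3 * rng.normal(size=(n_classes, n_features))
X_a = centroids_a[y] + rng.normal(size=(y.size, n_features))
X_b = centroids_b[y] + rng.normal(size=(y.size, n_features))

# "within" is scored on the training data itself, so it is optimistic.
within = cross_language_uar(X_a, y, X_a, y)
across = cross_language_uar(X_a, y, X_b, y)
print(f"within-language UAR = {within:.3f}, cross-language UAR = {across:.3f}")
```

In a real experiment the target-language score would be computed on that language's held-out test partition, and any feature normalization would need to be fitted on the source language only.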

V. CONCLUSION
In this paper, we introduced a novel dataset, called the Urdu-Sindhi Speech Emotion Corpus, which can be used to train machine learning models for speech emotion recognition for two low-resource languages. We have made the dataset available for academic research on the Zenodo platform. Furthermore, we conducted experiments to establish baseline classification performance in terms of UAR using feature sets from the OpenSmile toolkit, which is used by researchers in the field to set empirical baselines for classification performance. Based on our experiments, we reported that logistic regression models trained on the ComParE feature set achieve the best classification performance on the validation partition for speech emotion recognition in both Urdu and Sindhi.