International Journal of Advanced Computer Science and Applications (IJACSA), Volume 15 Issue 3, 2024.
Abstract: Speech Emotion Recognition (SER) is a fast-developing area of study whose primary goal is to automatically identify and analyze the emotional states expressed in speech. Emotions are crucial in human communication, as they shape the effectiveness and meaning of linguistic expressions. SER aims to create computational approaches and models that detect and interpret emotions from speech signals. One of its primary applications is Human-Computer Interaction (HCI), where it can be used to build interactive systems that adapt to a user's emotional state based on their voice. This paper investigates the use of speech data for speech emotion recognition. Additionally, we apply a transformation process that converts the speech data into 2D images and compare the outcomes of this representation against the original speech data, using datasets containing labeled speech samples in both Arabic and English. Our experiments compare three methods: a transformer-based model, a Vision Transformer (ViT)-based model, and a wav2vec-based model. The transformer model is trained from scratch on two significant audio datasets, the Arabic Natural Audio Dataset (ANAD) and the Toronto Emotional Speech Set (TESS), while the ViT and wav2vec models are evaluated in a transfer-learning setting. The transformer model achieved accuracies of 94% and 99% on the ANAD and TESS datasets, respectively, and ViT also demonstrated strong capabilities, achieving 88% and 98% on the same datasets. To assess the transfer-learning potential further, we also fine-tune the wav2vec model; however, the findings suggest limited success, with only a 56% accuracy rate on the ANAD dataset.
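The speech-to-2D-image transformation mentioned in the abstract can be illustrated with a short-time Fourier transform that turns a waveform into a log-magnitude spectrogram, a common image-like input for a ViT. This is only a minimal NumPy sketch under assumed parameters (frame length, hop size, bin count); the paper's exact transformation is not specified here.

```python
import numpy as np

def speech_to_image(signal, frame_len=512, hop=256, n_bins=128):
    """Convert a 1-D speech waveform into a 2-D log-magnitude
    spectrogram "image" via a short-time Fourier transform.
    Parameters here are illustrative assumptions, not the paper's."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))[:n_bins]  # keep low-frequency bins
        frames.append(np.log1p(spectrum))               # compress dynamic range
    # Stack frames column-wise: shape (n_bins, n_frames)
    return np.stack(frames, axis=1)

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
img = speech_to_image(wave)
print(img.shape)  # → (128, 61), usable as a grayscale image input
```

In practice such a 2D array would be resized and normalized to match the input resolution expected by a pretrained Vision Transformer.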
Esraa A. Mohamed, Abdelrahim Koura and Mohammed Kayed, “Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets,” International Journal of Advanced Computer Science and Applications (IJACSA), 15(3), 2024. http://dx.doi.org/10.14569/IJACSA.2024.0150359
@article{Mohamed2024,
title = {Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2024.0150359},
url = {http://dx.doi.org/10.14569/IJACSA.2024.0150359},
year = {2024},
publisher = {The Science and Information Organization},
volume = {15},
number = {3},
author = {Esraa A. Mohamed and Abdelrahim Koura and Mohammed Kayed}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.