The Science and Information (SAI) Organization
  • Home
  • About Us
  • Journals
  • Conferences
  • Contact Us

Publication Links

  • IJACSA
  • Author Guidelines
  • Publication Policies
  • Digital Archiving Policy
  • Promote your Publication
  • Metadata Harvesting (OAI2)

IJACSA

  • About the Journal
  • Call for Papers
  • Editorial Board
  • Author Guidelines
  • Submit your Paper
  • Current Issue
  • Archives
  • Indexing
  • Fees/ APC
  • Reviewers
  • Apply as a Reviewer

IJARAI

  • About the Journal
  • Archives
  • Indexing & Archiving

Special Issues

  • Home
  • Archives
  • Proposals
  • Guest Editors
  • SUSAI-EE 2025
  • ICONS-BA 2025
  • IoT-BLOCK 2025

Future of Information and Communication Conference (FICC)

  • Home
  • Call for Papers
  • Submit your Paper/Poster
  • Register
  • Venue
  • Contact

Computing Conference

  • Home
  • Call for Papers
  • Submit your Paper/Poster
  • Register
  • Venue
  • Contact

Intelligent Systems Conference (IntelliSys)

  • Home
  • Call for Papers
  • Submit your Paper/Poster
  • Register
  • Venue
  • Contact

Future Technologies Conference (FTC)

  • Home
  • Call for Papers
  • Submit your Paper/Poster
  • Register
  • Venue
  • Contact
  • Home
  • Call for Papers
  • Editorial Board
  • Guidelines
  • Submit
  • Current Issue
  • Archives
  • Indexing
  • Fees
  • Reviewers
  • Subscribe

DOI: 10.14569/IJACSA.2024.0150359
PDF

Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets

Author 1: Esraa A. Mohamed
Author 2: Abdelrahim Koura
Author 3: Mohammed Kayed

International Journal of Advanced Computer Science and Applications(IJACSA), Volume 15 Issue 3, 2024.

  • Abstract and Keywords
  • How to Cite this Article
  • {} BibTeX Source

Abstract: Speech Emotion Recognition (SER) is a fast-developing area of study with a primary goal of automatically identifying and analyzing the emotional states expressed in speech. Emotions are crucial in human communication as they impact the effectiveness and meaning of linguistic expressions. SER aims to create computational approaches and models to detect and interpret emotions from speech signals. One of the primary applications of SER is evident in the field of Human-Computer Interaction (HCI), where it can be used to develop interactive systems that adapt to the user's emotional state based on their voice. This paper investigates the use of speech data for speech emotion recognition. Additionally, we applied a transformation process to convert the speech data into 2D images. Subsequently, we compared the outcomes of this transformation with the original speech data, aligning the comparison with a dataset containing labeled speech samples in both Arabic and English. Our experiments compare three methods: a transformer-based model, a Vision Transformer (ViT) based model, and a wave2vec-based model. The transformer model is trained from scratch on two significant audio datasets: the Arabic Natural Audio Dataset (ANAD) and the Toronto Emotional Speech Set (TESS), while the vision transformer is evaluated alongside wave2vec as part of transfer learning. The results are impressive. The transformer model achieved remarkable accuracies of 94% and 99% on ANAD and TESS datasets, respectively. Additionally, ViT demonstrates strong capabilities, achieving accuracies of 88% and 98% on the ANAD and TESS datasets, respectively. To assess the transfer learning potential, we also explore the Wave2Vector model with fine-tuning. However, the findings suggest limited success, achieving only a 56% accuracy rate on the ANAD dataset.

Keywords: Speech emotion recognition; transformer encoder; fine-tuning; wav2vec; multimodal emotion recognition

Esraa A. Mohamed, Abdelrahim Koura and Mohammed Kayed, “Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets” International Journal of Advanced Computer Science and Applications(IJACSA), 15(3), 2024. http://dx.doi.org/10.14569/IJACSA.2024.0150359

@article{Mohamed2024,
title = {Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2024.0150359},
url = {http://dx.doi.org/10.14569/IJACSA.2024.0150359},
year = {2024},
publisher = {The Science and Information Organization},
volume = {15},
number = {3},
author = {Esraa A. Mohamed and Abdelrahim Koura and Mohammed Kayed}
}



Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.

IJACSA

Upcoming Conferences

Computer Vision Conference (CVC) 2026

16-17 April 2026

  • Berlin, Germany

Healthcare Conference 2026

21-22 May 2026

  • Amsterdam, The Netherlands

Computing Conference 2025

19-20 June 2025

  • London, United Kingdom

IntelliSys 2025

28-29 August 2025

  • Amsterdam, The Netherlands

Future Technologies Conference (FTC) 2025

6-7 November 2025

  • Munich, Germany
The Science and Information (SAI) Organization
BACK TO TOP

Computer Science Journal

  • About the Journal
  • Call for Papers
  • Submit Paper
  • Indexing

Our Conferences

  • Computing Conference
  • Intelligent Systems Conference
  • Future Technologies Conference
  • Communication Conference

Help & Support

  • Contact Us
  • About Us
  • Terms and Conditions
  • Privacy Policy

© The Science and Information (SAI) Organization Limited. All rights reserved. Registered in England and Wales. Company Number 8933205. thesai.org