International Journal of Advanced Computer Science and Applications (IJACSA), Volume 17 Issue 5, 2026.
Abstract: Advances in Artificial Intelligence (AI) have made it possible to recognize human emotions automatically. With the development of deep learning and multimodal processing methods, emotion analysis can now draw on multiple data sources simultaneously, such as facial expressions and speech signals. However, existing emotion recognition systems still face limitations in accuracy. This study aims to develop and evaluate a more accurate emotion recognition system by implementing a Convolutional Neural Network (CNN)-based prediction model that integrates facial and audio data. The study uses the CREMA-D dataset, which consists of visual data in the form of facial images and audio data containing varied emotional expressions. The research process comprises data preprocessing, feature extraction, and multimodal integration using an optimized CNN architecture. Model performance was measured using accuracy, precision, recall, and F1-score. The F1-score results indicate that combining facial and audio data enables the model to recognize emotions effectively. The angry (ANG) class achieved the best performance with an F1-score of 82%, while the fear (FEA) class had the lowest with an F1-score of 58%. The multimodal model also achieved higher accuracy than the unimodal models, significantly improving generalization on diverse test data. This study demonstrates an overall emotion recognition accuracy improvement of 69% through the combination of facial and audio features. Overall, the analysis of combined facial and speech features shows that the proposed model performs well, and that integrating image and audio modalities improves the correctness of facial expression predictions.
Future research is expected to further improve accuracy by incorporating additional modalities beyond facial and audio data.
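The abstract reports per-class F1-scores alongside accuracy, precision, and recall. As a reminder of how these metrics relate, the following is a minimal sketch in plain Python. The function names `per_class_metrics` and `accuracy` and the toy label lists are illustrative assumptions and are not taken from the paper; the values they produce have no connection to the reported results.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return precision, recall, and F1 for every label (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this class
        else:
            fp[pred] += 1           # predicted class gains a false positive
            fn[truth] += 1          # true class gains a false negative
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only (assumed, not data from the study).
y_true = ["ANG", "ANG", "ANG", "FEA", "FEA", "FEA", "HAP", "HAP"]
y_pred = ["ANG", "ANG", "FEA", "FEA", "ANG", "HAP", "HAP", "HAP"]

for label, scores in per_class_metrics(y_true, y_pred).items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
print("accuracy", accuracy(y_true, y_pred))
```

Because F1 is computed per class, a model can score well on one emotion and poorly on another while overall accuracy stays moderate, which is why the abstract reports the best and worst classes separately.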
Karnadi, Ermatita and Abdiansah. “Improving Emotion Recognition Accuracy Using a Multimodal Model (Face and Voice Video) Based on a Convolutional Neural Network (CNN)”. International Journal of Advanced Computer Science and Applications (IJACSA) 17.5 (2026). http://dx.doi.org/10.14569/IJACSA.2026.0170518
@article{2026,
title = {Improving Emotion Recognition Accuracy Using a Multimodal Model (Face and Voice Video) Based on a Convolutional Neural Network (CNN)},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2026.0170518},
url = {http://dx.doi.org/10.14569/IJACSA.2026.0170518},
year = {2026},
publisher = {The Science and Information Organization},
volume = {17},
number = {5},
author = {Karnadi and Ermatita and Abdiansah}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.