International Journal of Advanced Computer Science and Applications (IJACSA), Volume 16, Issue 11, 2025.
Abstract: Speech Emotion Recognition (SER) has become a pivotal topic within affective computing and human–computer interaction, where the core challenge lies in jointly capturing both the time–frequency structure and the semantic context of speech. To overcome the shortcomings of current approaches—including single-view feature representation, the lack of emotional discriminability in self-supervised models, and suboptimal complementarity among fusion strategies—this study proposes a parallel dual-branch fusion architecture for SER. The framework consists of a wav2vec 2.0 branch and a CNN–Transformer spectrogram branch, which respectively extract contextual semantic representations from raw waveforms and explicit time–frequency features from spectrograms. A logistic regression fusion mechanism is further introduced at the decision level to achieve adaptive weighting in the probability space, thereby fully leveraging the complementary strengths of the two feature types. Experiments carried out on the RAVDESS audio subset showed that the proposed model surpassed several mainstream baselines (e.g., CNN-n-GRU and RELUEM), achieving 92.7% accuracy and 92.2% Macro-F1, with an average improvement of about 3.2 percentage points. The layer unfreezing studies confirmed the effectiveness of partial fine-tuning for transferring pretrained features, while the comparative experiments on fusion strategies validated the superiority of probability-space fusion in both performance and stability. Overall, the proposed framework achieves simultaneous gains in accuracy and robustness through feature complementarity, branch decoupling, and lightweight fusion. Future work will explore cross-lingual generalization, multimodal extensions, lightweight deployment, and dynamic emotion modeling, contributing to more efficient affective computing and intelligent interaction systems.
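The decision-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each branch emits a class-probability vector, that the two vectors are concatenated, and that a multinomial logistic regression trained by plain gradient descent maps them to fused probabilities. The branch outputs are simulated with random numbers. The helper names (`fake_branch_probs`, `train_fusion`, `fuse`), the four-class setup, and all hyperparameters are hypothetical choices made for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fake_branch_probs(label, n_classes, sharpness, rng):
    """Hypothetical stand-in for one branch's class-probability output."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(n_classes)]
    logits[label] += sharpness  # a larger value simulates a more reliable branch
    return softmax(logits)

def scores(W, b, x):
    """Linear class scores W @ x + b, written out with plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def train_fusion(features, labels, n_classes, epochs=100, lr=0.5):
    """Fit multinomial logistic regression with full-batch gradient descent."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    n = len(features)
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(features, labels):
            p = softmax(scores(W, b, x))
            for k in range(n_classes):
                # Cross-entropy gradient: predicted probability minus one-hot target.
                err = p[k] - (1.0 if k == y else 0.0)
                grad_b[k] += err
                for j in range(dim):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b

def fuse(W, b, p_a, p_b):
    """Fuse two branch probability vectors into one probability vector."""
    return softmax(scores(W, b, p_a + p_b))

# Toy data: simulated outputs of two branches of differing reliability.
rng = random.Random(0)
n_classes = 4  # assumption for the toy example, not the paper's label set
labels = [rng.randrange(n_classes) for _ in range(120)]
branch_a = [fake_branch_probs(y, n_classes, 2.0, rng) for y in labels]
branch_b = [fake_branch_probs(y, n_classes, 1.5, rng) for y in labels]
features = [pa + pb for pa, pb in zip(branch_a, branch_b)]

W, b = train_fusion(features, labels, n_classes)
fused = [fuse(W, b, pa, pb) for pa, pb in zip(branch_a, branch_b)]
accuracy = sum(
    max(range(n_classes), key=p.__getitem__) == y for p, y in zip(fused, labels)
) / len(labels)
print(f"training accuracy of fused predictions: {accuracy:.2f}")
```

Because the fusion layer sees only probabilities, the two branches can be trained and tuned separately, and the learned weights indicate how much each branch is trusted per class. In practice the fusion weights would be fitted on held-out predictions rather than on the data used to train the branches.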
Zhongliang Wei, Chang Ge, Lijun Zhu and Jinmin Ye. “Speech Emotion Recognition via Parallel Dual-Branch Fusion Model”. International Journal of Advanced Computer Science and Applications (IJACSA) 16.11 (2025). http://dx.doi.org/10.14569/IJACSA.2025.0161115
@article{Wei2025,
title = {Speech Emotion Recognition via Parallel Dual-Branch Fusion Model},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2025.0161115},
url = {http://dx.doi.org/10.14569/IJACSA.2025.0161115},
year = {2025},
publisher = {The Science and Information Organization},
volume = {16},
number = {11},
author = {Zhongliang Wei and Chang Ge and Lijun Zhu and Jinmin Ye}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.