Computer Vision Conference (CVC) 2026
21-22 May 2026
International Journal of Advanced Computer Science and Applications (IJACSA), Volume 16 Issue 8, 2025.
Abstract: Multimodal Sentiment Analysis (MSA) has emerged as a critical task in Natural Language Processing (NLP), driven by the growth of user-generated content containing textual, visual, and auditory cues. While transformer-based approaches achieve strong predictive performance, their lack of interpretability and limited adaptability restrict their use in sensitive applications such as healthcare, education, and human–computer interaction. To address these challenges, this study proposes an explainable and adaptive MSA framework based on a hierarchical attention-based transformer architecture. The model leverages RoBERTa for text, Wav2Vec2.0 for speech, and Vision Transformer (ViT) for visual cues, with features fused using a three-tier attention mechanism encompassing token/frame-level, modality-level, and semantic-level attention. This design enables fine-grained representation learning, dynamic cross-modal alignment, and intrinsic explainability through attention heatmaps. Additionally, contrastive alignment loss is incorporated to align heterogeneous modality embeddings, while label smoothing mitigates overconfidence, improving generalizability. Experimental evaluation on the CMU-MOSEI benchmark demonstrates state-of-the-art performance, achieving 93.2% accuracy, 93.5% precision, 92.8% recall, and 94.1% F1-score, surpassing prior multimodal transformer-based methods. Unlike earlier models that rely on shallow fusion or post-hoc interpretability, the proposed approach integrates explainability into its architecture, balancing accuracy and transparency. These results confirm the efficacy of the adaptive hierarchical attention-based framework in delivering a robust, interpretable, and scalable solution for English-language multimodal sentiment analysis.
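The abstract's three-tier fusion (token/frame-level attention within each modality, modality-level attention over the pooled vectors, and a semantic level above that) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scoring vectors, embedding dimension, and encoder outputs below are hypothetical stand-ins for RoBERTa, Wav2Vec2.0, and ViT features, and the semantic tier is only noted in a comment.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(seq, w):
    """Tier-1 pooling: seq is (T, d); w is a (d,) learned scoring vector."""
    scores = softmax(seq @ w)           # (T,) attention over tokens/frames
    return scores @ seq, scores         # pooled (d,) vector + weights

rng = np.random.default_rng(0)
d = 8
# Hypothetical frozen encoder outputs (stand-ins for RoBERTa / Wav2Vec2.0 / ViT)
text  = rng.standard_normal((12, d))    # 12 text tokens
audio = rng.standard_normal((50, d))    # 50 speech frames
video = rng.standard_normal((16, d))    # 16 visual patches

w_tok = rng.standard_normal(d)          # tier-1 scorer (token/frame level)
w_mod = rng.standard_normal(d)          # tier-2 scorer (modality level)

# Tier 1: attention pooling within each modality
pooled = np.stack([attention_pool(m, w_tok)[0] for m in (text, audio, video)])

# Tier 2: modality-level attention weights the three pooled vectors;
# the weights are directly readable as modality importance (explainability)
mod_weights = softmax(pooled @ w_mod)   # (3,), sums to 1
fused = mod_weights @ pooled            # (d,) joint representation

# Tier 3 (semantic level) would attend over fused representations of
# several utterances/segments; omitted here for brevity.

# Label smoothing on a 3-class sentiment target, as mentioned in the abstract
def smooth_labels(y, n_classes, eps=0.1):
    out = np.full(n_classes, eps / n_classes)
    out[y] += 1.0 - eps
    return out
```

The modality-level weights are what an attention heatmap would visualize: a single softmax over the three pooled vectors, so the contribution of text, speech, and vision to each prediction sums to one.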
Anna Shalini, B. Manikyala Rao, Ranjitha. P. K, Guru Basava Aradhya S, S. Farhad, Elangovan Muniyandy and Yousef A. Baker El-Ebiary. “Explainable Multimodal Sentiment Analysis Using Hierarchical Attention-Based Adaptive Transformer Models”. International Journal of Advanced Computer Science and Applications (IJACSA) 16.8 (2025). http://dx.doi.org/10.14569/IJACSA.2025.0160869
@article{Shalini2025,
title = {Explainable Multimodal Sentiment Analysis Using Hierarchical Attention-Based Adaptive Transformer Models},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2025.0160869},
url = {http://dx.doi.org/10.14569/IJACSA.2025.0160869},
year = {2025},
publisher = {The Science and Information Organization},
volume = {16},
number = {8},
author = {Anna Shalini and B. Manikyala Rao and Ranjitha. P. K and Guru Basava Aradhya S and S. Farhad and Elangovan Muniyandy and Yousef A. Baker El-Ebiary}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.