The Science and Information (SAI) Organization

DOI: 10.14569/IJACSA.2024.0150290

Cross-Modal Sentiment Analysis Based on CLIP Image-Text Attention Interaction

Author 1: Xintao Lu
Author 2: Yonglong Ni
Author 3: Zuohua Ding

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 15, Issue 2, 2024.


Abstract: Multimodal sentiment analysis extends traditional text-based sentiment analysis to inputs that combine several modalities. However, the field still faces challenges such as inconsistent cross-modal feature information, weak interaction between modalities, and insufficient feature fusion. To address these issues, this paper proposes a cross-modal sentiment model based on CLIP image-text attention interaction. The model uses pre-trained ResNet50 and RoBERTa encoders to extract primary image and text features. After contrastive learning with the CLIP model, it applies a multi-head attention mechanism for cross-modal feature interaction to enhance information exchange between the modalities. A cross-modal gating module then fuses the resulting feature networks, combining features at different levels while controlling their weights. The final output is fed into a fully connected layer for sentiment recognition. Comparative experiments are conducted on the publicly available MVSA-Single and MVSA-Multiple datasets. The results show that the model achieves accuracy rates of 75.38% and 73.95%, and F1-scores of 75.21% and 73.83%, on the two datasets respectively, indicating stronger generalization and robustness than existing sentiment analysis models.
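The two central steps the abstract describes — cross-modal multi-head attention between image and text features, followed by gated fusion — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the array sizes, head count, and gate formula (a sigmoid over the summed features standing in for a learned gate) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(q_feats, kv_feats, n_heads=4):
    """One modality's features (queries) attend to the other's (keys/values)."""
    d = q_feats.shape[-1]
    dh = d // n_heads                       # per-head dimension
    out = np.empty_like(q_feats)
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        q, k, v = q_feats[:, s], kv_feats[:, s], kv_feats[:, s]
        attn = softmax(q @ k.T / np.sqrt(dh), axis=-1)   # scaled dot-product
        out[:, s] = attn @ v
    return out

def gated_fusion(a, b):
    """A sigmoid gate controls the per-dimension mix of two feature streams."""
    g = 1.0 / (1.0 + np.exp(-(a + b)))      # stand-in for a learned gate
    return g * a + (1.0 - g) * b

txt = rng.standard_normal((6, 32))   # e.g. 6 text tokens, feature dim 32
img = rng.standard_normal((9, 32))   # e.g. 9 image patches, feature dim 32

txt2img = multi_head_cross_attention(txt, img)   # text queries, image keys/values
img2txt = multi_head_cross_attention(img, txt)   # image queries, text keys/values

# Pool the image-side stream, then gate-fuse with the text-side stream.
fused = gated_fusion(txt2img, img2txt.mean(axis=0, keepdims=True))
print(fused.shape)   # → (6, 32)
```

In the paper's model the fused representation would then pass through a fully connected layer for sentiment classification; here the gate is a fixed function, whereas the paper's gating module learns its weights.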

Keywords: Multi-modal; image-text interaction; multi-head attention mechanism; sentiment analysis; cross-modal fusion

Xintao Lu, Yonglong Ni and Zuohua Ding, “Cross-Modal Sentiment Analysis Based on CLIP Image-Text Attention Interaction,” International Journal of Advanced Computer Science and Applications (IJACSA), 15(2), 2024. http://dx.doi.org/10.14569/IJACSA.2024.0150290

@article{Lu2024,
title = {Cross-Modal Sentiment Analysis Based on CLIP Image-Text Attention Interaction},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2024.0150290},
url = {http://dx.doi.org/10.14569/IJACSA.2024.0150290},
year = {2024},
publisher = {The Science and Information Organization},
volume = {15},
number = {2},
author = {Xintao Lu and Yonglong Ni and Zuohua Ding}
}



Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.
