Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment

Zhanbin Che; Huaili Guo

doi:10.14569/IJACSA.2024.0150232

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 15 Issue 2, 2024.

Abstract: Cross-modal video retrieval remains a major challenge in natural language processing because of the inherent semantic gap between video and text. Most approaches use a single encoder to extract video and text features separately and train on video-text pairs by means of contrastive learning, but this global alignment of video and text tends to neglect the finer-grained features of both modalities. In addition, some studies focus only on analyzing the video description text and ignore its correlation with the video. This paper therefore proposes a video retrieval method based on video-text alignment that performs both global and fine-grained alignment between video and text. For global alignment, the video and text are each encoded by a single encoder and aligned after linear projection; for fine-grained alignment, the video encoder is trained to align the video and text by masking some of the semantic information in the text. In experimental comparisons with multiple existing methods, the model achieves R@1 (recall at 1) of 51.5% on the MSR-VTT dataset and 52.4% on the MSVD dataset, which indicates that the proposed model can improve the efficiency of cross-modal video retrieval.

Keywords: Video-text alignment; cross-modal; contrastive learning; similarity measure; feature fusion
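
The global alignment and the R@1 metric mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a standard symmetric contrastive (InfoNCE-style) objective over cosine similarities, and the function names, the temperature of 0.05, and the toy embeddings are hypothetical choices made only for illustration. Matching text-video pairs are assumed to share the same index.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(text_embs, video_embs):
    """sim[i][j] = cosine similarity between text i and video j."""
    texts = [l2_normalize(t) for t in text_embs]
    videos = [l2_normalize(v) for v in video_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]


def _diagonal_cross_entropy(sim, temperature):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(sim, temperature=0.05):
    """Average of text-to-video and video-to-text losses (temperature is assumed)."""
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (
        _diagonal_cross_entropy(sim, temperature)
        + _diagonal_cross_entropy(sim_transposed, temperature)
    )


def recall_at_k(sim, k=1):
    """Fraction of text queries whose matching video ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, made up for this example.
    text_embs = [[1.0, 0.0], [0.0, 1.0]]
    video_embs = [[0.9, 0.1], [0.2, 0.8]]
    sim = similarity_matrix(text_embs, video_embs)
    print("loss:", symmetric_contrastive_loss(sim))
    print("R@1:", recall_at_k(sim, k=1))
```

Under these assumptions, training would lower the loss by pulling matching pairs together along the diagonal of the similarity matrix, and R@1 reports how often the top-ranked video for a text query is the correct one.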

Zhanbin Che and Huaili Guo, “Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment,” International Journal of Advanced Computer Science and Applications (IJACSA), 15(2), 2024. http://dx.doi.org/10.14569/IJACSA.2024.0150232

@article{Che2024,
title = {Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2024.0150232},
url = {http://dx.doi.org/10.14569/IJACSA.2024.0150232},
year = {2024},
publisher = {The Science and Information Organization},
volume = {15},
number = {2},
author = {Zhanbin Che and Huaili Guo}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.
