Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening

Bibliographic Details
Main Authors: Yingjia Xu, Mengxia Wu, Zixin Guo, Min Cao, Mang Ye, Jorma Laaksonen
Format: Article
Language: English
Published: Springer 2025-03-01
Series: Visual Intelligence
Online Access: https://doi.org/10.1007/s44267-025-00073-2
Description
Summary: Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued or hash-based embeddings to represent videos and texts, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before applying any TVR algorithm, thereby efficiently reducing the search space of videos. We predict discrete semantic tags for videos and texts with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out relevant videos for text queries due to inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.
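
As a rough illustration of the pre-screening idea summarized above, the sketch below builds an inverted index from per-video tag sets and keeps only the videos that share at least one tag with a text query; everything else is filtered out before any embedding-based ranking. This is a minimal Python sketch of the general mechanism, not the authors' implementation: the helper names (build_inverted_index, prescreen) and the hand-written tag sets are hypothetical stand-ins for the outputs of the multi-modal multi-tagger module.

```python
from collections import defaultdict


def build_inverted_index(video_tags):
    """Map each tag to the set of video ids carrying it.

    video_tags: dict mapping video_id -> iterable of predicted tags.
    """
    index = defaultdict(set)
    for vid, tags in video_tags.items():
        for tag in tags:
            index[tag].add(vid)
    return index


def prescreen(query_tags, index):
    """Return the videos sharing at least one tag with the query.

    Videos with no overlapping tag never reach the (more expensive)
    embedding-based TVR ranking stage.
    """
    candidates = set()
    for tag in query_tags:
        candidates |= index.get(tag, set())
    return candidates


# Toy example with hand-written tags; in the paper the tags would come
# from the shared multi-tag head applied to video and text embeddings.
video_tags = {
    "v1": {"dog", "beach"},
    "v2": {"cooking", "kitchen"},
    "v3": {"dog", "park"},
}
index = build_inverted_index(video_tags)

query_tags = {"dog"}                  # tags predicted for the text query
print(prescreen(query_tags, index))   # -> {'v1', 'v3'} (set order may vary)
```

Only the surviving candidates would then be scored by whichever TVR model the framework is plugged into, which is where the reported acceleration comes from.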
ISSN: 2097-3330
2731-9008