Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening

Abstract Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued and hash-based embeddings to represent the video and text, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video, and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before applying any TVR algorithms, thereby efficiently reducing the search space of videos. We predict discrete semantic tags for video and text with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out relevant videos for text queries due to inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.

Bibliographic Details
Main Authors: Yingjia Xu, Mengxia Wu, Zixin Guo, Min Cao, Mang Ye, Jorma Laaksonen
Format: Article
Language:English
Published: Springer 2025-03-01
Series:Visual Intelligence
Subjects:
Online Access:https://doi.org/10.1007/s44267-025-00073-2
collection DOAJ
description Abstract Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued and hash-based embeddings to represent the video and text, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video, and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before applying any TVR algorithms, thereby efficiently reducing the search space of videos. We predict discrete semantic tags for video and text with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out relevant videos for text queries due to inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.
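The abstract describes pre-screening with an inverted index over predicted semantic tags: each tag maps to the videos carrying it, and a query's tags select a small candidate set before any heavyweight TVR scoring. A minimal sketch of that filtering step, assuming tags have already been predicted; `build_inverted_index` and `prescreen` are hypothetical helper names for illustration, not the authors' code.

```python
from collections import defaultdict

def build_inverted_index(video_tags):
    """Map each tag to the set of video ids that carry it."""
    index = defaultdict(set)
    for vid, tags in video_tags.items():
        for tag in tags:
            index[tag].add(vid)
    return index

def prescreen(index, query_tags):
    """Return videos sharing at least one tag with the query;
    everything else is filtered out before similarity scoring."""
    candidates = set()
    for tag in query_tags:
        candidates |= index.get(tag, set())
    return candidates

# Toy example: only tag-matching videos survive pre-screening.
video_tags = {
    "v1": {"dog", "park"},
    "v2": {"cooking", "kitchen"},
    "v3": {"dog", "beach"},
}
index = build_inverted_index(video_tags)
print(sorted(prescreen(index, {"dog"})))  # → ['v1', 'v3']
```

Because the index stores only tag-to-id sets, lookup cost scales with the number of query tags rather than the size of the video collection, which is where the claimed speed-up comes from.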
format Article
id doaj-art-7df0e7f4e6924701b623e17654f88029
institution DOAJ
issn 2097-3330
2731-9008
language English
publishDate 2025-03-01
publisher Springer
record_format Article
series Visual Intelligence
affiliations Yingjia Xu: School of Computer Science & Technology, Soochow University
Mengxia Wu: School of Computer Science & Technology, Soochow University
Zixin Guo: Department of Computer Science, Aalto University
Min Cao: School of Computer Science & Technology, Soochow University
Mang Ye: School of Computer Science, Wuhan University
Jorma Laaksonen: Department of Computer Science, Aalto University
topic Text-to-video retrieval (TVR)
Inverted index
Pre-screening
Contrastive learning (CL)
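The "Contrastive learning (CL)" subject above refers to aligning video and text embeddings so that matched pairs receive consistent tags from the shared multi-tag head. A minimal NumPy sketch of a symmetric InfoNCE-style objective commonly used for such alignment; this is an illustrative formulation under standard assumptions, not the paper's exact loss.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over L2-normalized embeddings.

    video_emb, text_emb: (N, d) arrays where row i of each matrix
    is a matching video-text pair (the positives); all other rows
    in the batch act as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N) cosine-similarity matrix

    def nll_diag(m):
        # Row-wise log-softmax; diagonal entries are the positives.
        m = m - m.max(axis=1, keepdims=True)
        log_p = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (nll_diag(logits) + nll_diag(logits.T))
```

Minimizing this loss pulls each video embedding toward its paired text and pushes it away from the other texts in the batch, so a misaligned batch (shuffled pairs) yields a strictly higher loss than an aligned one.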
url https://doi.org/10.1007/s44267-025-00073-2