Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening
Abstract Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued and hash-based embeddings to represent the video and text, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video, and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before applying any TVR algorithms, thereby efficiently reducing the search space of videos. We predict discrete semantic tags for video and text with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out relevant videos for text queries due to inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.
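The pre-screening step described in the abstract — predicting discrete tags and then matching them through an inverted index to shrink the candidate set — can be illustrated with a minimal sketch. The tag sets, video identifiers, and the `pre_screen` function below are hypothetical stand-ins for the paper's multi-tagger outputs, shown only to make the inverted-index idea concrete:

```python
from collections import defaultdict

# Hypothetical tag sets; in the paper these would be predicted
# per video by the multi-modal multi-tagger module.
video_tags = {
    "vid_a": {"dog", "park", "running"},
    "vid_b": {"cooking", "kitchen"},
    "vid_c": {"dog", "beach"},
}

# Inverted index: tag -> set of video ids carrying that tag.
# Lookup cost depends on the query's tags, not on corpus size.
index = defaultdict(set)
for vid, tags in video_tags.items():
    for tag in tags:
        index[tag].add(vid)

def pre_screen(query_tags):
    """Return candidate videos sharing at least one tag with the query.

    Videos with no overlapping tag are filtered out before any
    (more expensive) embedding-similarity TVR ranking is applied.
    """
    candidates = set()
    for tag in query_tags:
        candidates |= index.get(tag, set())
    return candidates

print(sorted(pre_screen({"dog", "running"})))  # -> ['vid_a', 'vid_c']
```

Only the surviving candidates would then be scored by the downstream TVR model, which is what yields the speed-up the abstract claims.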
| Main Authors: | Yingjia Xu, Mengxia Wu, Zixin Guo, Min Cao, Mang Ye, Jorma Laaksonen |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-03-01 |
| Series: | Visual Intelligence |
| Subjects: | Text-to-video retrieval (TVR); Inverted index; Pre-screening; Contrastive learning (CL) |
| Online Access: | https://doi.org/10.1007/s44267-025-00073-2 |
| _version_ | 1849762206622679040 |
|---|---|
| author | Yingjia Xu Mengxia Wu Zixin Guo Min Cao Mang Ye Jorma Laaksonen |
| author_facet | Yingjia Xu Mengxia Wu Zixin Guo Min Cao Mang Ye Jorma Laaksonen |
| author_sort | Yingjia Xu |
| collection | DOAJ |
| description | Abstract Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued and hash-based embeddings to represent the video and text, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video, and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before applying any TVR algorithms, thereby efficiently reducing the search space of videos. We predict discrete semantic tags for video and text with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out relevant videos for text queries due to inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets. |
| format | Article |
| id | doaj-art-7df0e7f4e6924701b623e17654f88029 |
| institution | DOAJ |
| issn | 2097-3330 2731-9008 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | Springer |
| record_format | Article |
| series | Visual Intelligence |
| spelling | doaj-art-7df0e7f4e6924701b623e17654f880292025-08-20T03:05:49ZengSpringerVisual Intelligence2097-33302731-90082025-03-013111310.1007/s44267-025-00073-2Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screeningYingjia Xu0Mengxia Wu1Zixin Guo2Min Cao3Mang Ye4Jorma Laaksonen5School of Computer Science & Technology, Soochow UniversitySchool of Computer Science & Technology, Soochow UniversityDepartment of Computer Science, Aalto UniversitySchool of Computer Science & Technology, Soochow UniversitySchool of Computer Science, Wuhan UniversityDepartment of Computer Science, Aalto UniversityAbstract Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued and hash-based embeddings to represent the video and text, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video, and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before applying any TVR algorithms, thereby efficiently reducing the search space of videos. We predict discrete semantic tags for video and text with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out relevant videos for text queries due to inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.https://doi.org/10.1007/s44267-025-00073-2Text-to-video retrieval (TVR)Inverted indexPre-screeningContrastive learning (CL) |
| spellingShingle | Yingjia Xu Mengxia Wu Zixin Guo Min Cao Mang Ye Jorma Laaksonen Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening Visual Intelligence Text-to-video retrieval (TVR) Inverted index Pre-screening Contrastive learning (CL) |
| title | Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening |
| title_full | Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening |
| title_fullStr | Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening |
| title_full_unstemmed | Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening |
| title_short | Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening |
| title_sort | efficient text to video retrieval via multi modal multi tagger derived pre screening |
| topic | Text-to-video retrieval (TVR) Inverted index Pre-screening Contrastive learning (CL) |
| url | https://doi.org/10.1007/s44267-025-00073-2 |
| work_keys_str_mv | AT yingjiaxu efficienttexttovideoretrievalviamultimodalmultitaggerderivedprescreening AT mengxiawu efficienttexttovideoretrievalviamultimodalmultitaggerderivedprescreening AT zixinguo efficienttexttovideoretrievalviamultimodalmultitaggerderivedprescreening AT mincao efficienttexttovideoretrievalviamultimodalmultitaggerderivedprescreening AT mangye efficienttexttovideoretrievalviamultimodalmultitaggerderivedprescreening AT jormalaaksonen efficienttexttovideoretrievalviamultimodalmultitaggerderivedprescreening |
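The abstract also notes that contrastive learning is used to align video and text embeddings before a shared multi-tag head, so that matching pairs receive consistent tags. A common formulation of such alignment is a symmetric InfoNCE-style loss; the NumPy sketch below is an assumed, generic version of that objective, not the paper's exact training recipe (the temperature value and function names are illustrative):

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch.

    Matching video/text pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatched
    pairs apart, in both retrieval directions.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise log-softmax with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the video->text and text->video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned embeddings the loss approaches zero, while a permuted (misaligned) batch is penalized heavily, which is the property that keeps tags consistent across the two modalities.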