Needle in a haystack: Coarse-to-fine alignment network for moment retrieval from large-scale video collections.
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0320661 |
| Summary: | Moment retrieval from large-scale video collections aims to search for and localize the temporal boundary of a video moment within a collection of numerous videos, given a natural language query. Existing methods for moment retrieval in a single video are too time-consuming to scale directly to this task because of their sophisticated network architectures. In this paper, we decompose the original problem into two mutually boosting subtasks, video retrieval from video collections and moment retrieval in a single video, and propose the coarse-to-fine alignment network (CFAN), which includes a video alignment module, a cross-modal interaction module, and a flow of multi-level coarse-to-fine alignment information. Through the interaction of the multi-level information from the two subtasks, our method makes full use of the global contextual information in videos and the fine-grained alignment information between videos and queries. We conduct extensive experiments on three public datasets (ActivityNet Captions, Charades-STA, and DiDeMo), and the evaluation results demonstrate the effectiveness of the proposed CFAN method. |
| ISSN: | 1932-6203 |