Multimodal Latent Representation Learning for Video Moment Retrieval

Bibliographic Details
Main Authors: Jinkwon Hwang, Mingyu Jeon, Junyeong Kim
Format: Article
Language: English
Published: MDPI AG, 2025-07-01
Series: Sensors
Subjects: video moment retrieval; visual language reasoning; multimodal representation learning
Online Access: https://www.mdpi.com/1424-8220/25/14/4528
author Jinkwon Hwang
Mingyu Jeon
Junyeong Kim
collection DOAJ
description The rise of artificial intelligence (AI) has revolutionized the processing and analysis of video sensor data, driving advancements in areas such as surveillance, autonomous driving, and personalized content recommendations. However, leveraging video data presents unique challenges, particularly in the time-intensive feature extraction process required for model training. This challenge is intensified in research environments lacking advanced hardware resources like GPUs. We propose a new method called the multimodal latent representation learning framework (MLRL) to address these limitations. MLRL enhances the performance of downstream tasks by conducting additional representation learning on pre-extracted features. By integrating and augmenting multimodal data, our method effectively predicts latent representations, leveraging pre-extracted features to reduce model training time and improve task performance. We validate the efficacy of MLRL on the video moment retrieval task using the QVHighlight dataset, benchmarking against the QD-DETR model. Our results demonstrate significant improvements, highlighting the potential of MLRL to streamline video data processing by leveraging pre-extracted features to bypass the time-consuming extraction process of raw sensor data and enhance model accuracy in various sensor-based applications.
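The record does not reproduce the MLRL architecture itself, so as a loose illustration of the core idea the abstract describes (performing representation learning on pre-extracted multimodal features rather than on raw video), the toy function below fuses hypothetical per-clip video and text feature vectors and projects them into a shared latent space. All names and dimensions here are illustrative assumptions, and the projection is a fixed random map rather than the learned one the paper would train.

```python
import numpy as np

def fuse_and_project(video_feats, text_feats, latent_dim, seed=0):
    """Toy sketch (not the paper's method): concatenate pre-extracted
    per-clip video and text features, then linearly project the fused
    vectors into a shared latent space. MLRL would learn this mapping;
    here the projection matrix is random for illustration only."""
    rng = np.random.default_rng(seed)
    fused = np.concatenate([video_feats, text_feats], axis=-1)
    # Scaled random projection: (fused_dim, latent_dim).
    w = rng.standard_normal((fused.shape[-1], latent_dim)) / np.sqrt(fused.shape[-1])
    return fused @ w

# Hypothetical pre-extracted features for 8 clips: 512-d visual, 256-d textual.
video = np.random.default_rng(1).standard_normal((8, 512))
text = np.random.default_rng(2).standard_normal((8, 256))
latent = fuse_and_project(video, text, latent_dim=128)
print(latent.shape)  # (8, 128)
```

The point of the sketch is only that the expensive step (extracting features from raw video) happens once, offline; everything downstream operates on the cached feature arrays.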
format Article
id doaj-art-495c9628705b4639a109e106f8b7028e
institution DOAJ
issn 1424-8220
language English
publishDate 2025-07-01
publisher MDPI AG
record_format Article
series Sensors
doi 10.3390/s25144528
citation Sensors, vol. 25, iss. 14, art. 4528 (2025-07-01)
affiliation Department of AI, Chung-Ang University, Seoul 06974, Republic of Korea (Jinkwon Hwang, Mingyu Jeon, Junyeong Kim)
last_indexed 2025-08-20T03:07:57Z
title Multimodal Latent Representation Learning for Video Moment Retrieval
topic video moment retrieval
visual language reasoning
multimodal representation learning
url https://www.mdpi.com/1424-8220/25/14/4528