Multimodal Latent Representation Learning for Video Moment Retrieval

Bibliographic Details
Main Authors: Jinkwon Hwang, Mingyu Jeon, Junyeong Kim
Format: Article
Language: English
Published: MDPI AG, 2025-07-01
Series: Sensors
Subjects: video moment retrieval; visual language reasoning; multimodal representation learning
Online Access: https://www.mdpi.com/1424-8220/25/14/4528
author Jinkwon Hwang
Mingyu Jeon
Junyeong Kim
collection DOAJ
description The rise of artificial intelligence (AI) has revolutionized the processing and analysis of video sensor data, driving advancements in areas such as surveillance, autonomous driving, and personalized content recommendations. However, leveraging video data presents unique challenges, particularly in the time-intensive feature extraction process required for model training. This challenge is intensified in research environments lacking advanced hardware resources like GPUs. We propose a new method called the multimodal latent representation learning framework (MLRL) to address these limitations. MLRL enhances the performance of downstream tasks by conducting additional representation learning on pre-extracted features. By integrating and augmenting multimodal data, our method effectively predicts latent representations, leveraging pre-extracted features to reduce model training time and improve task performance. We validate the efficacy of MLRL on the video moment retrieval task using the QVHighlight dataset, benchmarking against the QD-DETR model. Our results demonstrate significant improvements, highlighting the potential of MLRL to streamline video data processing by leveraging pre-extracted features to bypass the time-consuming extraction process of raw sensor data and enhance model accuracy in various sensor-based applications.
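The record does not reproduce the MLRL architecture itself, so as a loose illustration of the core idea the abstract describes (performing representation learning on pre-extracted multimodal features rather than on raw video), the toy function below fuses hypothetical per-clip video and text feature vectors and projects them into a shared latent space. All names and dimensions here are illustrative assumptions, and the projection is a fixed random map rather than the learned one the paper would train.

```python
import numpy as np

def fuse_and_project(video_feats, text_feats, latent_dim, seed=0):
    """Toy sketch (not the paper's method): concatenate pre-extracted
    per-clip video and text features, then linearly project the fused
    vectors into a shared latent space. MLRL would learn this mapping;
    here the projection matrix is random for illustration only."""
    rng = np.random.default_rng(seed)
    fused = np.concatenate([video_feats, text_feats], axis=-1)
    # Scaled random projection: (fused_dim, latent_dim).
    w = rng.standard_normal((fused.shape[-1], latent_dim)) / np.sqrt(fused.shape[-1])
    return fused @ w

# Hypothetical pre-extracted features for 8 clips: 512-d visual, 256-d textual.
video = np.random.default_rng(1).standard_normal((8, 512))
text = np.random.default_rng(2).standard_normal((8, 256))
latent = fuse_and_project(video, text, latent_dim=128)
print(latent.shape)  # (8, 128)
```

The point of the sketch is only that the expensive step (extracting features from raw video) happens once, offline; everything downstream operates on the cached feature arrays.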
format Article
id doaj-art-495c9628705b4639a109e106f8b7028e
institution DOAJ
issn 1424-8220
language English
publishDate 2025-07-01
publisher MDPI AG
record_format Article
series Sensors
doi 10.3390/s25144528
citation Sensors, vol. 25, iss. 14, art. 4528 (2025-07-01)
affiliation Department of AI, Chung-Ang University, Seoul 06974, Republic of Korea (Jinkwon Hwang, Mingyu Jeon, Junyeong Kim)
last_indexed 2025-08-20T03:07:57Z
title Multimodal Latent Representation Learning for Video Moment Retrieval
topic video moment retrieval
visual language reasoning
multimodal representation learning
url https://www.mdpi.com/1424-8220/25/14/4528