Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging


Bibliographic Details
Main Authors: Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi
Format: Article
Language:English
Published: Springer 2024-11-01
Series:Complex & Intelligent Systems
Subjects:
Online Access:https://doi.org/10.1007/s40747-024-01654-2
_version_ 1832571160309530624
author Pufen Zhang
Jiaxiang Wang
Meng Wan
Song Zhang
Jie Jing
Lianhong Ding
Peng Shi
author_facet Pufen Zhang
Jiaxiang Wang
Meng Wan
Song Zhang
Jie Jing
Lianhong Ding
Peng Shi
author_sort Pufen Zhang
collection DOAJ
description Abstract The audio-visual event localization (AVEL) task aims to detect and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, most works comprehend the audio-visual scene from a single, entangled temporal-aware perspective, neglecting to learn temporal dependency and cross-modal correlation in both forward and backward temporal-aware views. Recently, transferring pre-trained knowledge from the Contrastive Language-Image Pre-training (CLIP) model has shown remarkable results across various tasks. Nevertheless, since a heterogeneous gap exists between the audio-visual knowledge of the AVEL task and the image-text alignment knowledge of CLIP, how to transfer CLIP's image-text alignment knowledge to the AVEL field has barely been investigated. To address these challenges, a novel Dual Temporal-aware scene understanding and image-text Knowledge Bridging (DTKB) model is proposed in this paper. DTKB consists of forward and backward temporal-aware scene understanding streams, in which temporal dependencies and cross-modal correlations are explicitly captured from dual temporal-aware perspectives. Consequently, DTKB achieves fine-grained scene understanding for event localization. Additionally, a knowledge bridging (KB) module is proposed to simultaneously transfer the image-text representation and alignment knowledge of CLIP to the AVEL task. This module regulates the ratio between the audio-visual fusion features and CLIP's visual features, thereby bridging CLIP's image-text alignment knowledge and the new audio-visual knowledge for event-category prediction. Moreover, the KB module is compatible with previous models. Extensive experimental results demonstrate that DTKB significantly outperforms state-of-the-art models.
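The KB module's ratio regulation between audio-visual fusion features and CLIP's visual features can be illustrated as a simple gated blend. This is a minimal sketch under assumptions not stated in the record: the paper's actual gate is presumably learned, while here a fixed scalar logit and plain Python lists stand in for trainable parameters and feature tensors. The names `kb_bridge` and `gate_logit` are hypothetical.

```python
import math

def kb_bridge(av_fusion, clip_visual, gate_logit=0.0):
    """Blend audio-visual fusion features with CLIP visual features.

    A scalar gate (a fixed logit here, illustrating what would be a
    learned parameter) regulates the ratio between the two feature
    sources, so event-category prediction can draw on both CLIP's
    image-text alignment knowledge and the new audio-visual knowledge.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid gate in (0, 1)
    return [g * a + (1.0 - g) * c for a, c in zip(av_fusion, clip_visual)]

# With gate_logit = 0 the two sources are mixed equally.
blended = kb_bridge([1.0, 2.0], [3.0, 4.0])
```

Because the bridge only reweights two feature streams, a module of this shape can in principle be attached to prior AVEL models, consistent with the compatibility claim in the abstract.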
format Article
id doaj-art-42edcdd0ac674b15bddd72bde5ed2c98
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2024-11-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-42edcdd0ac674b15bddd72bde5ed2c98 2025-02-02T12:48:43Z eng Springer Complex & Intelligent Systems 2199-4536 2198-6053 2024-11-01 111120 10.1007/s40747-024-01654-2
Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Pufen Zhang: National Center for Materials Service Safety, University of Science and Technology Beijing
Jiaxiang Wang: National Center for Materials Service Safety, University of Science and Technology Beijing
Meng Wan: Computer Network Information Center, Chinese Academy of Sciences
Song Zhang: National Center for Materials Service Safety, University of Science and Technology Beijing
Jie Jing: National Center for Materials Service Safety, University of Science and Technology Beijing
Lianhong Ding: Beijing Wuzi University
Peng Shi: National Center for Materials Service Safety, University of Science and Technology Beijing
https://doi.org/10.1007/s40747-024-01654-2
Audio-visual event localization; Multi-modal learning; Video scene understanding; Knowledge transfer
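The forward and backward temporal-aware streams described in the record can be sketched as two directional passes over a feature sequence, with each time step receiving context from both directions. This is a hedged illustration, not the paper's implementation: DTKB presumably uses learned recurrent or attention layers, whereas this sketch uses a fixed exponential-decay context, and the names `temporal_stream` and `dual_temporal` are hypothetical.

```python
def temporal_stream(feats, decay=0.5):
    """One directional pass: each step mixes the current feature with
    the running context accumulated from earlier steps."""
    context, out = 0.0, []
    for f in feats:
        context = decay * context + (1.0 - decay) * f
        out.append(context)
    return out

def dual_temporal(feats, decay=0.5):
    """Dual temporal-aware view: run the stream forward and backward,
    then pair the two contexts at each time step."""
    fwd = temporal_stream(feats, decay)
    bwd = list(reversed(temporal_stream(list(reversed(feats)), decay)))
    return list(zip(fwd, bwd))

# Each time step now carries both past-aware and future-aware context.
dual = dual_temporal([1.0, 1.0])
```

Pairing the two directional contexts is what distinguishes this dual view from the entangled single-perspective modeling the abstract criticizes: a segment near the end of a video still receives forward (past) context, while a segment near the start receives backward (future) context.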
spellingShingle Pufen Zhang
Jiaxiang Wang
Meng Wan
Song Zhang
Jie Jing
Lianhong Ding
Peng Shi
Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Complex & Intelligent Systems
Audio-visual event localization
Multi-modal learning
Video scene understanding
Knowledge transfer
title Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_full Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_fullStr Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_full_unstemmed Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_short Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_sort audio visual event localization with dual temporal aware scene understanding and image text knowledge bridging
topic Audio-visual event localization
Multi-modal learning
Video scene understanding
Knowledge transfer
url https://doi.org/10.1007/s40747-024-01654-2
work_keys_str_mv AT pufenzhang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT jiaxiangwang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT mengwan audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT songzhang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT jiejing audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT lianhongding audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT pengshi audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging