Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging


Bibliographic Details
Main Authors: Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi
Format: Article
Language: English
Published: Springer 2024-11-01
Series: Complex & Intelligent Systems
Online Access: https://doi.org/10.1007/s40747-024-01654-2
Description
Summary: Abstract The audio-visual event localization (AVEL) task aims to identify and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, most works comprehend the audio-visual scene from an entangled temporal-aware perspective, neglecting to learn temporal dependencies and cross-modal correlations in both forward and backward temporal-aware views. Recently, transferring pre-trained knowledge from the Contrastive Language-Image Pre-training (CLIP) model has shown remarkable results across various tasks. Nevertheless, because a heterogeneous gap exists between the audio-visual knowledge of the AVEL task and the image-text alignment knowledge of CLIP, how to transfer CLIP's image-text alignment knowledge to the AVEL field has barely been investigated. To address these challenges, a novel Dual Temporal-aware scene understanding and image-text Knowledge Bridging (DTKB) model is proposed in this paper. DTKB consists of forward and backward temporal-aware scene understanding streams, in which temporal dependencies and cross-modal correlations are explicitly captured from dual temporal-aware perspectives. Consequently, DTKB achieves fine-grained scene understanding for event localization. Additionally, a knowledge bridging (KB) module is proposed to simultaneously transfer the image-text representation and alignment knowledge of CLIP to the AVEL task. This module regulates the ratio between audio-visual fusion features and CLIP's visual features, thereby bridging CLIP's image-text alignment knowledge with the new audio-visual knowledge for event category prediction. Moreover, the KB module is compatible with previous models. Extensive experimental results demonstrate that DTKB significantly outperforms state-of-the-art models.
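The knowledge bridging idea summarized above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the blending ratio `alpha`, the feature shapes, and all variable names are assumptions; it only shows the general pattern of mixing audio-visual fusion features with CLIP visual features and scoring the result against CLIP-style text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 10, 512, 28          # time segments, feature dim, event categories (assumed)

av_fusion = rng.standard_normal((T, D))    # audio-visual fusion features (placeholder)
clip_visual = rng.standard_normal((T, D))  # CLIP visual features per segment (placeholder)
text_embed = rng.standard_normal((C, D))   # CLIP text embeddings, one per category (placeholder)

# Bridging ratio between the two feature sources; learned in the model,
# fixed here purely for illustration.
alpha = 0.3
bridged = alpha * clip_visual + (1.0 - alpha) * av_fusion

def l2norm(x):
    # Normalize rows to unit length for cosine-similarity scoring.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# CLIP-style category prediction: cosine similarity against text embeddings.
scores = l2norm(bridged) @ l2norm(text_embed).T   # shape (T, C)
pred = scores.argmax(axis=1)                      # per-segment event category index
```

The convex combination lets the module trade off CLIP's image-text alignment knowledge against the task-specific audio-visual features, which is why the same scoring head can be attached to other AVEL backbones.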
ISSN: 2199-4536
2198-6053