Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging


Bibliographic Details
Main Authors: Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi
Format: Article
Language:English
Published: Springer 2024-11-01
Series:Complex & Intelligent Systems
Subjects:
Online Access:https://doi.org/10.1007/s40747-024-01654-2
_version_ 1832571160309530624
author Pufen Zhang
Jiaxiang Wang
Meng Wan
Song Zhang
Jie Jing
Lianhong Ding
Peng Shi
author_facet Pufen Zhang
Jiaxiang Wang
Meng Wan
Song Zhang
Jie Jing
Lianhong Ding
Peng Shi
author_sort Pufen Zhang
collection DOAJ
description Abstract The audio-visual event localization (AVEL) task aims to detect and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, most works comprehend the audio-visual scene from a single, entangled temporal-aware perspective, neglecting to learn temporal dependency and cross-modal correlation in both forward and backward temporal-aware views. Recently, transferring pre-trained knowledge from the Contrastive Language-Image Pre-training (CLIP) model has shown remarkable results across various tasks. Nevertheless, since a heterogeneous gap exists between the audio-visual knowledge of the AVEL task and the image-text alignment knowledge of CLIP, how to transfer CLIP's image-text alignment knowledge to the AVEL field has barely been investigated. To address these challenges, a novel Dual Temporal-aware scene understanding and image-text Knowledge Bridging (DTKB) model is proposed in this paper. DTKB consists of forward and backward temporal-aware scene understanding streams, in which temporal dependencies and cross-modal correlations are explicitly captured from dual temporal-aware perspectives. Consequently, DTKB achieves fine-grained scene understanding for event localization. Additionally, a knowledge bridging (KB) module is proposed to simultaneously transfer the image-text representation and alignment knowledge of CLIP to the AVEL task. This module regulates the ratio between the audio-visual fusion features and CLIP's visual features, thereby bridging CLIP's image-text alignment knowledge and the new audio-visual knowledge for event-category prediction. Moreover, the KB module is compatible with previous models. Extensive experimental results demonstrate that DTKB significantly outperforms state-of-the-art models.
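The KB module's ratio regulation between audio-visual fusion features and CLIP's visual features can be illustrated as a simple gated blend. This is a minimal sketch under assumptions not stated in the record: the paper's actual gate is presumably learned, while here a fixed scalar logit and plain Python lists stand in for trainable parameters and feature tensors. The names `kb_bridge` and `gate_logit` are hypothetical.

```python
import math

def kb_bridge(av_fusion, clip_visual, gate_logit=0.0):
    """Blend audio-visual fusion features with CLIP visual features.

    A scalar gate (a fixed logit here, illustrating what would be a
    learned parameter) regulates the ratio between the two feature
    sources, so event-category prediction can draw on both CLIP's
    image-text alignment knowledge and the new audio-visual knowledge.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid gate in (0, 1)
    return [g * a + (1.0 - g) * c for a, c in zip(av_fusion, clip_visual)]

# With gate_logit = 0 the two sources are mixed equally.
blended = kb_bridge([1.0, 2.0], [3.0, 4.0])
```

Because the bridge only reweights two feature streams, a module of this shape can in principle be attached to prior AVEL models, consistent with the compatibility claim in the abstract.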
format Article
id doaj-art-42edcdd0ac674b15bddd72bde5ed2c98
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2024-11-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-42edcdd0ac674b15bddd72bde5ed2c98 2025-02-02T12:48:43Z eng Springer Complex & Intelligent Systems 2199-4536 2198-6053 2024-11-01 111120 10.1007/s40747-024-01654-2
Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Pufen Zhang: National Center for Materials Service Safety, University of Science and Technology Beijing
Jiaxiang Wang: National Center for Materials Service Safety, University of Science and Technology Beijing
Meng Wan: Computer Network Information Center, Chinese Academy of Sciences
Song Zhang: National Center for Materials Service Safety, University of Science and Technology Beijing
Jie Jing: National Center for Materials Service Safety, University of Science and Technology Beijing
Lianhong Ding: Beijing Wuzi University
Peng Shi: National Center for Materials Service Safety, University of Science and Technology Beijing
https://doi.org/10.1007/s40747-024-01654-2
Audio-visual event localization; Multi-modal learning; Video scene understanding; Knowledge transfer
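The forward and backward temporal-aware streams described in the record can be sketched as two directional passes over a feature sequence, with each time step receiving context from both directions. This is a hedged illustration, not the paper's implementation: DTKB presumably uses learned recurrent or attention layers, whereas this sketch uses a fixed exponential-decay context, and the names `temporal_stream` and `dual_temporal` are hypothetical.

```python
def temporal_stream(feats, decay=0.5):
    """One directional pass: each step mixes the current feature with
    the running context accumulated from earlier steps."""
    context, out = 0.0, []
    for f in feats:
        context = decay * context + (1.0 - decay) * f
        out.append(context)
    return out

def dual_temporal(feats, decay=0.5):
    """Dual temporal-aware view: run the stream forward and backward,
    then pair the two contexts at each time step."""
    fwd = temporal_stream(feats, decay)
    bwd = list(reversed(temporal_stream(list(reversed(feats)), decay)))
    return list(zip(fwd, bwd))

# Each time step now carries both past-aware and future-aware context.
dual = dual_temporal([1.0, 1.0])
```

Pairing the two directional contexts is what distinguishes this dual view from the entangled single-perspective modeling the abstract criticizes: a segment near the end of a video still receives forward (past) context, while a segment near the start receives backward (future) context.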
spellingShingle Pufen Zhang
Jiaxiang Wang
Meng Wan
Song Zhang
Jie Jing
Lianhong Ding
Peng Shi
Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Complex & Intelligent Systems
Audio-visual event localization
Multi-modal learning
Video scene understanding
Knowledge transfer
title Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_full Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_fullStr Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_full_unstemmed Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_short Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
title_sort audio visual event localization with dual temporal aware scene understanding and image text knowledge bridging
topic Audio-visual event localization
Multi-modal learning
Video scene understanding
Knowledge transfer
url https://doi.org/10.1007/s40747-024-01654-2
work_keys_str_mv AT pufenzhang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT jiaxiangwang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT mengwan audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT songzhang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT jiejing audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT lianhongding audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging
AT pengshi audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging