Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Abstract The audio-visual event localization (AVEL) task aims to detect and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, mos...
Main Authors: | Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi |
---|---|
Format: | Article |
Language: | English |
Published: | Springer, 2024-11-01 |
Series: | Complex & Intelligent Systems |
Subjects: | Audio-visual event localization; Multi-modal learning; Video scene understanding; Knowledge transfer |
Online Access: | https://doi.org/10.1007/s40747-024-01654-2 |
_version_ | 1832571160309530624 |
---|---|
author | Pufen Zhang Jiaxiang Wang Meng Wan Song Zhang Jie Jing Lianhong Ding Peng Shi |
author_facet | Pufen Zhang Jiaxiang Wang Meng Wan Song Zhang Jie Jing Lianhong Ding Peng Shi |
author_sort | Pufen Zhang |
collection | DOAJ |
description | Abstract The audio-visual event localization (AVEL) task aims to detect and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, most works comprehend the audio-visual scene from a single, entangled temporal-aware perspective, neglecting to learn temporal dependencies and cross-modal correlations in both forward and backward temporal-aware views. Recently, transferring pre-trained knowledge from the Contrastive Language-Image Pre-training (CLIP) model has shown remarkable results across various tasks. Nevertheless, because a heterogeneous gap exists between the audio-visual knowledge of the AVEL task and the image-text alignment knowledge of CLIP, how to transfer CLIP's image-text alignment knowledge to the AVEL field has barely been investigated. To address these challenges, a novel Dual Temporal-aware scene understanding and image-text Knowledge Bridging (DTKB) model is proposed in this paper. DTKB consists of forward and backward temporal-aware scene understanding streams, in which temporal dependencies and cross-modal correlations are explicitly captured from dual temporal-aware perspectives. Consequently, DTKB achieves fine-grained scene understanding for event localization. Additionally, a knowledge bridging (KB) module is proposed to simultaneously transfer the image-text representation and alignment knowledge of CLIP to the AVEL task. This module regulates the ratio between audio-visual fusion features and CLIP's visual features, thereby bridging CLIP's image-text alignment knowledge with the new audio-visual knowledge for event category prediction. Moreover, the KB module is compatible with previous models. Extensive experimental results demonstrate that DTKB significantly outperforms state-of-the-art models. |
format | Article |
id | doaj-art-42edcdd0ac674b15bddd72bde5ed2c98 |
institution | Kabale University |
issn | 2199-4536 2198-6053 |
language | English |
publishDate | 2024-11-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | doaj-art-42edcdd0ac674b15bddd72bde5ed2c98 (2025-02-02T12:48:43Z), eng, Springer, Complex & Intelligent Systems, ISSN 2199-4536 / 2198-6053, 2024-11-01, vol. 11, no. 1, pp. 1–20, doi: 10.1007/s40747-024-01654-2. Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging. Pufen Zhang (National Center for Materials Service Safety, University of Science and Technology Beijing); Jiaxiang Wang (National Center for Materials Service Safety, University of Science and Technology Beijing); Meng Wan (Computer Network Information Center, Chinese Academy of Sciences); Song Zhang (National Center for Materials Service Safety, University of Science and Technology Beijing); Jie Jing (National Center for Materials Service Safety, University of Science and Technology Beijing); Lianhong Ding (Beijing Wuzi University); Peng Shi (National Center for Materials Service Safety, University of Science and Technology Beijing). https://doi.org/10.1007/s40747-024-01654-2. Keywords: Audio-visual event localization; Multi-modal learning; Video scene understanding; Knowledge transfer |
spellingShingle | Pufen Zhang Jiaxiang Wang Meng Wan Song Zhang Jie Jing Lianhong Ding Peng Shi Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging Complex & Intelligent Systems Audio-visual event localization Multi-modal learning Video scene understanding Knowledge transfer |
title | Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging |
title_full | Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging |
title_fullStr | Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging |
title_full_unstemmed | Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging |
title_short | Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging |
title_sort | audio visual event localization with dual temporal aware scene understanding and image text knowledge bridging |
topic | Audio-visual event localization Multi-modal learning Video scene understanding Knowledge transfer |
url | https://doi.org/10.1007/s40747-024-01654-2 |
work_keys_str_mv | AT pufenzhang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging AT jiaxiangwang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging AT mengwan audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging AT songzhang audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging AT jiejing audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging AT lianhongding audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging AT pengshi audiovisualeventlocalizationwithdualtemporalawaresceneunderstandingandimagetextknowledgebridging |
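The abstract above describes a knowledge bridging (KB) module that "regulates the ratio between audio-visual fusion features and CLIP's visual features." The paper's actual implementation is not given in this record, but the general idea can be illustrated as a gated convex combination of the two feature streams. The sketch below is a minimal, hypothetical illustration: the function names, shapes, and the sigmoid-gated blend are all assumptions, not the authors' method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bridge_features(av_fusion, clip_visual, gate_w, gate_b):
    """Blend per-segment audio-visual features with CLIP visual features.

    av_fusion:   (T, D) audio-visual fusion features, one row per segment
    clip_visual: (T, D) CLIP image-encoder features for the same segments
    gate_w, gate_b: parameters of a scalar gate conditioned on both streams
    (All names and shapes are illustrative assumptions.)
    """
    # Compute one scalar gate per segment from both feature streams.
    gate_in = np.concatenate([av_fusion, clip_visual], axis=-1)  # (T, 2D)
    alpha = sigmoid(gate_in @ gate_w + gate_b)                   # (T, 1)
    # Convex combination: alpha sets the ratio of audio-visual features
    # to CLIP visual features in the bridged representation.
    return alpha * av_fusion + (1.0 - alpha) * clip_visual

# Toy usage with random features for T=10 segments of dimension D=4.
T, D = 10, 4
rng = np.random.default_rng(0)
av = rng.standard_normal((T, D))
cv = rng.standard_normal((T, D))
w = rng.standard_normal((2 * D, 1)) * 0.1
b = np.zeros(1)
bridged = bridge_features(av, cv, w, b)
print(bridged.shape)  # (10, 4)
```

Because the gate output lies in (0, 1), each bridged feature stays between the corresponding audio-visual and CLIP values, so the module can only interpolate between the two knowledge sources rather than overwrite either one.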