Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
Abstract The audio-visual event localization (AVEL) task aims to judge and classify an audible and visible event. Existing methods pursue this goal by transferring pre-trained knowledge and by understanding the temporal dependencies and cross-modal correlations of the audio-visual scene. However, mos...
Main Authors: | Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi |
---|---|
Format: | Article |
Language: | English |
Published: | Springer, 2024-11-01 |
Series: | Complex & Intelligent Systems |
Online Access: | https://doi.org/10.1007/s40747-024-01654-2 |
Similar Items
- Audio-Language Datasets of Scenes and Events: A Survey
  by: Gijs Wijngaard, et al.
  Published: (2025-01-01)
- Deep Learning for Traffic Scene Understanding: A Review
  by: Parya Dolatyabi, et al.
  Published: (2025-01-01)
- Video or audio listening tests for English language teaching context: which is more effective for classroom use?
  by: Clara Herlina Karjo, et al.
  Published: (2022-02-01)
- A Dual-Channel and Frequency-Aware Approach for Lightweight Video Instance Segmentation
  by: Mingzhu Liu, et al.
  Published: (2025-01-01)
- Spatial frequency preferences of representations of indoor and natural scene categories in scene-selective regions under different conditions of contrast
  by: Yuanyuan Zhang, et al.
  Published: (2025-02-01)