A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network
As a multimodal fusion task, audio-visual segmentation (AVS) aims to locate sounding objects at the pixel level within a given image. This capability holds significant importance and practical value in applications such as intelligent surveillance, multimedia content analysis, and human–robot intera...
| Main Authors: | Yunpeng Zuo, Yunwei Zhang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Applied Sciences |
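| Subjects: | lightweight; multimodal fusion; audio-visual segmentation (AVS); video object segmentation (VOS); space–time memory (STM) network |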
| Online Access: | https://www.mdpi.com/2076-3417/15/12/6585 |
| _version_ | 1850156665968525312 |
|---|---|
| author | Yunpeng Zuo; Yunwei Zhang |
| collection | DOAJ |
| description | As a multimodal fusion task, audio-visual segmentation (AVS) aims to locate sounding objects at the pixel level within a given image. This capability has practical value in applications such as intelligent surveillance, multimedia content analysis, and human–robot interaction. However, existing AVS models typically have complex architectures and large parameter counts, which makes them difficult to deploy on embedded platforms. Furthermore, these models often lack integration with object tracking mechanisms and do not address the mis-segmentation of unvoiced (non-sounding) objects caused by environmental noise in real-world scenarios. To address these challenges, this research proposes a lightweight audio-visual segmentation framework incorporating an audio-guided space–time memory network (AG-STMNet). First, a mask generator with a scoring mechanism is developed to identify sounding objects among the generated masks. This component integrates FastSAM, a lightweight, pre-trained, object-aware segmentation model, with Wav2CLIP, a parameter-efficient audio-visual alignment model. Next, AG-STMNet, an audio-guided video object segmentation network, is introduced to track sounding objects using video object segmentation techniques while mitigating environmental noise. Finally, the mask generator and AG-STMNet are combined to form the complete framework. Experimental results show that the framework achieves a mean Intersection over Union (mIoU) score of 41.5, indicating its potential as a viable lightweight solution for practical applications. |
| format | Article |
| id | doaj-art-16044f1f92cf4b9e92f136abffd58d44 |
| institution | OA Journals |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-16044f1f92cf4b9e92f136abffd58d44; 2025-08-20T02:24:26Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2025-06-01; vol. 15, no. 12, art. 6585; DOI 10.3390/app15126585; A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network; Yunpeng Zuo and Yunwei Zhang (both: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China); https://www.mdpi.com/2076-3417/15/12/6585; keywords: lightweight; multimodal fusion; audio-visual segmentation (AVS); video object segmentation (VOS); space–time memory (STM) network |
| title | A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network |
| topic | lightweight; multimodal fusion; audio-visual segmentation (AVS); video object segmentation (VOS); space–time memory (STM) network |
| url | https://www.mdpi.com/2076-3417/15/12/6585 |
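
The abstract mentions a mask generator that scores candidate masks against the audio and reports results as mean Intersection over Union (mIoU). This record does not give the actual scoring rule, so the following is only a minimal sketch under stated assumptions: each candidate mask and the audio clip are assumed to already have embeddings in a shared space (here plain NumPy arrays standing in for model outputs), and cosine similarity is assumed as the score. The function names `score_masks`, `select_sounding_mask`, and `mean_iou` and the `threshold` parameter are hypothetical, not taken from the paper.

```python
import numpy as np


def score_masks(audio_emb, mask_embs):
    """Cosine similarity between one audio embedding and each mask embedding.

    Assumption: embeddings come from some audio-visual alignment model;
    here they are just arrays of shape (d,) and (n_masks, d).
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    m = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    return m @ a


def select_sounding_mask(masks, scores, threshold=0.0):
    """Return the best-scoring mask, or an empty mask if no score reaches
    the (hypothetical) threshold, e.g. when nothing in view is sounding."""
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return np.zeros_like(masks[0], dtype=bool)
    return masks[best].astype(bool)


def mean_iou(preds, targets):
    """Mean Intersection over Union over paired binary masks."""
    ious = []
    for p, t in zip(preds, targets):
        p, t = p.astype(bool), t.astype(bool)
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        # Convention assumed here: two empty masks count as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))


# Toy example: two candidate masks on a 4x4 frame.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # top-left object
masks[1, 2:, 2:] = True   # bottom-right object

mask_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # placeholder embeddings
audio_emb = np.array([0.1, 0.9])                # closer to mask 1

scores = score_masks(audio_emb, mask_embs)
pred = select_sounding_mask(masks, scores)

target = np.zeros((4, 4), dtype=bool)
target[2:, 1:] = True     # ground truth covers 6 pixels

print(scores, mean_iou([pred], [target]))
```

In this toy case the second mask is selected, and its IoU with the 6-pixel ground truth is 4/6. A score such as the 41.5 quoted in the abstract would correspond to this ratio averaged over a test set and, presumably, expressed as a percentage.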