A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network

As a multimodal fusion task, audio-visual segmentation (AVS) aims to locate sounding objects at the pixel level within a given image. This capability holds significant importance and practical value in applications such as intelligent surveillance, multimedia content analysis, and human–robot intera...

Bibliographic Details
Main Authors: Yunpeng Zuo, Yunwei Zhang
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/12/6585
collection DOAJ
description As a multimodal fusion task, audio-visual segmentation (AVS) aims to locate sounding objects at the pixel level within a given image. This capability holds significant importance and practical value in applications such as intelligent surveillance, multimedia content analysis, and human–robot interaction. However, existing AVS models typically have complex architectures, require a large number of parameters, and are difficult to deploy on embedded platforms. Furthermore, these models often lack integration with object tracking mechanisms and fail to address the mis-segmentation of silent objects caused by environmental noise in real-world scenarios. To address these challenges, this research proposes a lightweight audio-visual segmentation framework incorporating an audio-guided space–time memory network (AG-STMNet). First, a mask generator with a scoring mechanism is developed to identify sounding objects among the generated masks. This component integrates FastSAM, a lightweight, pre-trained, object-aware segmentation model, with Wav2CLIP, a parameter-efficient audio-visual alignment model. Subsequently, AG-STMNet, an audio-guided video object segmentation network, is introduced to track sounding objects using video object segmentation techniques while mitigating environmental noise. Finally, the mask generator and AG-STMNet are combined to form the complete framework. The experimental results show that the framework achieves a mean Intersection over Union (mIoU) score of 41.5, indicating its potential as a viable lightweight solution for practical applications.
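The description above mentions two computable ideas: scoring candidate masks against the audio, and reporting mean Intersection over Union. The following is only a minimal sketch of both, not the authors' implementation. It assumes (hypothetically) that each candidate mask and the audio clip have already been embedded into a shared vector space, that the score is a cosine similarity with a cutoff, that an empty prediction against an empty target counts as IoU 1.0, and that mIoU is reported on a 0 to 100 scale. All function names and the `threshold` parameter are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_sounding_mask(audio_emb, mask_embs, threshold=0.0):
    """Score every candidate mask embedding against the audio embedding.

    Returns (index, scores). The index is None when no candidate scores above
    the (assumed) threshold, e.g. when nothing in the frame is sounding.
    """
    scores = [cosine(audio_emb, m) for m in mask_embs]
    if not scores:
        return None, scores
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best if scores[best] > threshold else None), scores


def iou(pred, target):
    """IoU of two binary masks given as nested lists of 0/1 with equal shape."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average per-frame IoU, scaled to 0..100 (assumed reporting scale)."""
    values = [iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(values) / len(values)
```

In this sketch, `pick_sounding_mask` would run once per frame on the generator's candidates, and `mean_iou` would be computed over the predicted and ground-truth masks of an evaluation set.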
id doaj-art-16044f1f92cf4b9e92f136abffd58d44
institution OA Journals
issn 2076-3417
spelling doaj-art-16044f1f92cf4b9e92f136abffd58d44; timestamp 2025-08-20T02:24:26Z; language eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2025-06-01; vol. 15, no. 12, article 6585; DOI 10.3390/app15126585
Yunpeng Zuo and Yunwei Zhang, both at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
topic lightweight
multimodal fusion
audio-visual segmentation (AVS)
video object segmentation (VOS)
space–time memory (STM) network