A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network

As a multimodal fusion task, audio-visual segmentation (AVS) aims to locate sounding objects at the pixel level within a given image. This capability holds significant importance and practical value in applications such as intelligent surveillance, multimedia content analysis, and human–robot intera...

Bibliographic Details
Main Authors:	Yunpeng Zuo, Yunwei Zhang
Format:	Article
Language:	English
Published:	MDPI AG 2025-06-01
Series:	Applied Sciences
Subjects:	lightweight; multimodal fusion; audio-visual segmentation (AVS); video object segmentation (VOS); space–time memory (STM) network
Online Access:	https://www.mdpi.com/2076-3417/15/12/6585

author	Yunpeng Zuo; Yunwei Zhang
author_facet	Yunpeng Zuo; Yunwei Zhang
author_sort	Yunpeng Zuo
collection	DOAJ
description	As a multimodal fusion task, audio-visual segmentation (AVS) aims to locate sounding objects at the pixel level within a given image. This capability is valuable in applications such as intelligent surveillance, multimedia content analysis, and human–robot interaction. However, existing AVS models typically have complex architectures, require many parameters, and are difficult to deploy on embedded platforms. Furthermore, these models often lack integration with object tracking mechanisms and do not address the mis-segmentation of unvoiced objects caused by environmental noise in real-world scenarios. To address these challenges, this research proposes a lightweight audio-visual segmentation framework incorporating an audio-guided space–time memory network (AG-STMNet). First, a mask generator with a scoring mechanism was developed to identify sounding objects from generated masks. This component integrates Fastsam, a lightweight, pre-trained, object-aware segmentation model, with WAV2CLIP, a parameter-efficient audio-visual alignment model. Subsequently, AG-STMNet, an audio-guided video object segmentation network, was introduced to track sounding objects while mitigating environmental noise. Finally, the mask generator and AG-STMNet were combined to form the complete framework. The experimental results show that the framework achieves a mean Intersection over Union (mIoU) score of 41.5, indicating its potential as a viable lightweight solution for practical applications.
format	Article
id	doaj-art-16044f1f92cf4b9e92f136abffd58d44
institution	OA Journals
issn	2076-3417
language	English
publishDate	2025-06-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj-art-16044f1f92cf4b9e92f136abffd58d44; 2025-08-20T02:24:26Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-06-01; 15; 12; 6585; 10.3390/app15126585; A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network; Yunpeng Zuo (0); Yunwei Zhang (1); (0) Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; (1) Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; https://www.mdpi.com/2076-3417/15/12/6585; lightweight; multimodal fusion; audio-visual segmentation (AVS); video object segmentation (VOS); space–time memory (STM) network
title	A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network
topic	lightweight; multimodal fusion; audio-visual segmentation (AVS); video object segmentation (VOS); space–time memory (STM) network
url	https://www.mdpi.com/2076-3417/15/12/6585
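
Note on the reported metric: the description gives the result as a mean Intersection over Union (mIoU). The following is a minimal sketch of how such a score is commonly computed for binary masks; it is not the authors' evaluation code. The names `iou` and `mean_iou`, the nested-list mask layout, and the choice to score two empty masks as 1.0 are assumptions made for illustration. The record does not state the scale of the 41.5 figure; if it is a percentage, it would correspond to 0.415 on the 0-to-1 scale used here.

```python
from typing import List, Sequence

# Hypothetical mask layout: rows of 0/1 pixel labels (1 = sounding object).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two binary masks of identical shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds: List[Mask], targets: List[Mask]) -> float:
    """Average the per-frame IoU over all frames."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    # Two 2x2 frames: the first prediction has one extra pixel, the second is exact.
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(f"mIoU = {100 * mean_iou(preds, targets):.1f}")
```

In this toy input the per-frame IoU values are 0.5 and 1.0, so the mean is 0.75.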
