Audio-visual source separation with localization and individual control.



Bibliographic Details
Main Authors: Mohanaprasad Kothandaraman, Balakrishnan Ramalingam, Kai Sheng, Aman Verma, Utkarsh Dhagat, Pranav Parab, Siddhartha Mallavolu, Sankar Ganesh
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0321856
collection DOAJ
description The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework isolates and enhances the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: a deep learning-based feature extractor for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. For human voice separation, DPRNN-TasNet, a robust deep neural network framework, is employed. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants' speech, achieving a test accuracy of 71.88% on both the AVE and Music 21 datasets.
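The audio-guided visual attention step described above can be sketched as a bilinear attention over spatial locations of a visual feature map, scored against an audio embedding. This is a generic illustration of the mechanism, not the paper's exact formulation; the weight matrix `W` here is a random stand-in for learned parameters, and the feature dimensions are assumed for the example.

```python
import numpy as np

def audio_guided_visual_attention(audio_feat, visual_feats, rng=None):
    """Score each spatial location of the visual feature map against the
    audio embedding, then return softmax attention weights and the
    attention-pooled visual context vector.

    audio_feat   : (d,)    audio embedding
    visual_feats : (n, d)  flattened spatial grid of visual features
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = audio_feat.shape[0]
    # Hypothetical learned bilinear weight; random here for illustration.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    scores = visual_feats @ (W @ audio_feat)           # (n,) location scores
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over locations
    context = weights @ visual_feats                   # (d,) attended context
    return weights, context

# Example: a 7x7 grid of 128-d visual features and one 128-d audio embedding.
rng = np.random.default_rng(42)
a = rng.standard_normal(128)
V = rng.standard_normal((49, 128))
w, ctx = audio_guided_visual_attention(a, V, rng)
```

The attention weights localize which spatial cells correlate with the audio, which is what lets a pipeline like this tie a separated voice back to a specific person in the frame.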
id doaj-art-bedd5fdfdfe145a893f4efb1c9d14ba0
institution Kabale University
issn 1932-6203
spelling PLoS ONE 20(5): e0321856 (2025-01-01). doi:10.1371/journal.pone.0321856