Audio-visual source separation with localization and individual control.



Bibliographic Details
Main Authors: Mohanaprasad Kothandaraman, Balakrishnan Ramalingam, Kai Sheng, Aman Verma, Utkarsh Dhagat, Pranav Parab, Siddhartha Mallavolu, Sankar Ganesh
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0321856
collection DOAJ
description The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework isolates and enhances the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: a deep learning-based feature extractor for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. For human voice separation, DPRNN-TasNet, a robust deep neural network framework, is employed. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants' speech, achieving a test accuracy of 71.88% on both the AVE and Music 21 datasets.
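The audio-guided visual attention step described above can be sketched as a bilinear attention over spatial locations of a visual feature map, scored against an audio embedding. This is a generic illustration of the mechanism, not the paper's exact formulation; the weight matrix `W` here is a random stand-in for learned parameters, and the feature dimensions are assumed for the example.

```python
import numpy as np

def audio_guided_visual_attention(audio_feat, visual_feats, rng=None):
    """Score each spatial location of the visual feature map against the
    audio embedding, then return softmax attention weights and the
    attention-pooled visual context vector.

    audio_feat   : (d,)    audio embedding
    visual_feats : (n, d)  flattened spatial grid of visual features
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = audio_feat.shape[0]
    # Hypothetical learned bilinear weight; random here for illustration.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    scores = visual_feats @ (W @ audio_feat)           # (n,) location scores
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over locations
    context = weights @ visual_feats                   # (d,) attended context
    return weights, context

# Example: a 7x7 grid of 128-d visual features and one 128-d audio embedding.
rng = np.random.default_rng(42)
a = rng.standard_normal(128)
V = rng.standard_normal((49, 128))
w, ctx = audio_guided_visual_attention(a, V, rng)
```

The attention weights localize which spatial cells correlate with the audio, which is what lets a pipeline like this tie a separated voice back to a specific person in the frame.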
id doaj-art-bedd5fdfdfe145a893f4efb1c9d14ba0
institution Kabale University
issn 1932-6203
spelling PLoS ONE 20(5): e0321856 (2025-01-01). doi:10.1371/journal.pone.0321856