Audio-visual source separation with localization and individual control.
The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework isolates and enhances the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: deep learning-based feature extractors for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. For human voice separation, DPRNN-TasNet, a robust deep neural network framework, is employed. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants' speech, achieving a test accuracy of 71.88% on both the AVE and MUSIC-21 datasets.
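The abstract names DPRNN-TasNet for voice separation. As a rough, self-contained sketch of the mask-based (TasNet-style) separation idea it builds on, the toy example below encodes a mixture into a latent representation, applies one multiplicative mask per speaker, and decodes each masked latent back to a waveform. All weights here are random stand-ins for the learned encoder, separator (DPRNN blocks in the real model), and decoder; every name and dimension is illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
samples, win, n_filters, n_src = 1600, 16, 64, 2
mixture = rng.standard_normal(samples)

# Encoder: non-overlapping frames projected onto basis filters
# (hop == window for simplicity; TasNet uses overlapping frames).
frames = mixture.reshape(-1, win)                 # (frames, win)
basis = rng.standard_normal((win, n_filters))
latent = np.maximum(frames @ basis, 0.0)          # ReLU, (frames, n_filters)

# Separator: per-source masks in [0, 1] that sum to 1 across sources.
# Random softmax here; the real model's DPRNN blocks estimate these.
logits = rng.standard_normal((n_src, *latent.shape))
masks = np.exp(logits) / np.exp(logits).sum(axis=0)

# Decoder: each masked latent is mapped back to a waveform.
decoder = rng.standard_normal((n_filters, win))
estimates = np.stack([((latent * m) @ decoder).reshape(-1) for m in masks])

print(estimates.shape)  # one estimated waveform per source: (2, 1600)
```

Per-source volume control, as described in the abstract, then reduces to scaling each estimated waveform independently before remixing.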
| Main Authors: | Mohanaprasad Kothandaraman, Balakrishnan Ramalingam, Kai Sheng, Aman Verma, Utkarsh Dhagat, Pranav Parab, Siddhartha Mallavolu, Sankar Ganesh |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0321856 |
| _version_ | 1849469246433656832 |
|---|---|
| author | Mohanaprasad Kothandaraman; Balakrishnan Ramalingam; Kai Sheng; Aman Verma; Utkarsh Dhagat; Pranav Parab; Siddhartha Mallavolu; Sankar Ganesh |
| author_facet | Mohanaprasad Kothandaraman; Balakrishnan Ramalingam; Kai Sheng; Aman Verma; Utkarsh Dhagat; Pranav Parab; Siddhartha Mallavolu; Sankar Ganesh |
| author_sort | Mohanaprasad Kothandaraman |
| collection | DOAJ |
| description | The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework isolates and enhances the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: deep learning-based feature extractors for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. For human voice separation, DPRNN-TasNet, a robust deep neural network framework, is employed. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants' speech, achieving a test accuracy of 71.88% on both the AVE and MUSIC-21 datasets. |
| format | Article |
| id | doaj-art-bedd5fdfdfe145a893f4efb1c9d14ba0 |
| institution | Kabale University |
| issn | 1932-6203 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spellingShingle | Mohanaprasad Kothandaraman Balakrishnan Ramalingam Kai Sheng Aman Verma Utkarsh Dhagat Pranav Parab Siddhartha Mallavolu Sankar Ganesh Audio-visual source separation with localization and individual control. PLoS ONE |
| title | Audio-visual source separation with localization and individual control. |
| title_full | Audio-visual source separation with localization and individual control. |
| title_fullStr | Audio-visual source separation with localization and individual control. |
| title_full_unstemmed | Audio-visual source separation with localization and individual control. |
| title_short | Audio-visual source separation with localization and individual control. |
| title_sort | audio visual source separation with localization and individual control |
| url | https://doi.org/10.1371/journal.pone.0321856 |
| work_keys_str_mv | AT mohanaprasadkothandaraman audiovisualsourceseparationwithlocalizationandindividualcontrol AT balakrishnanramalingam audiovisualsourceseparationwithlocalizationandindividualcontrol AT kaisheng audiovisualsourceseparationwithlocalizationandindividualcontrol AT amanverma audiovisualsourceseparationwithlocalizationandindividualcontrol AT utkarshdhagat audiovisualsourceseparationwithlocalizationandindividualcontrol AT pranavparab audiovisualsourceseparationwithlocalizationandindividualcontrol AT siddharthamallavolu audiovisualsourceseparationwithlocalizationandindividualcontrol AT sankarganesh audiovisualsourceseparationwithlocalizationandindividualcontrol |