Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
Novel deep learning-based time-domain single-channel speech source separation methods have shown remarkable progress. Recent studies achieve either global or local context modeling for monaural speaker separation: existing CNN-based methods perform local context modeling, while RNN-based or attention-based methods model the global context of the speech signal. In this paper, we propose two models, one built on a parallel CNN-RNN-based separation module and the other on a parallel CNN-attention-based separation module, that perform local and global context modeling in parallel. At each time step, our models keep the maximum of the local and global context values, which helps them separate the speaker signals more accurately. We conducted experiments on the Libri2mix and Libri3mix datasets, and the results demonstrate that our proposed models outperform state-of-the-art methods, with marked improvements in SDR and SI-SDR on both datasets. On Libri2mix, the parallel CNN-RNN-based and CNN-attention-based separation models achieve average SDR improvements of 2.10 dB and 2.21 dB, respectively, and average SI-SDR improvements of 2.74 dB and 2.78 dB, respectively. On Libri3mix, the two models achieve average SDR improvements of 0.57 dB and 0.87 dB and average SI-SDR improvements of 0.88 dB and 1.4 dB, respectively. Our work indirectly contributes to SDG Goal 10 (Reduced Inequalities) by improving communication tools for diverse linguistic communities. Furthermore, this technology aids SDG Goal 9 (Industry, Innovation, and Infrastructure) by advancing AI-powered assistive technologies, fostering innovation, and building resilient communication systems.
Saved in:
| Main Authors: | Swati Soni, Lalita Gupta, Rishav Dubey |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Deep-learning; model parallelism; speaker separation; time domain |
| Online Access: | https://ieeexplore.ieee.org/document/10969763/ |
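The abstract describes two separation models that run a convolutional local-context branch and a recurrent or attention-based global-context branch in parallel, keeping the maximum context value at each time step. The record does not include the paper's architecture, so the following PyTorch sketch only illustrates that parallel max-fusion idea under stated assumptions: the channel count, the use of a bidirectional LSTM for the global branch, and all layer shapes are hypothetical, not taken from the paper.

```python
# Illustrative sketch only: the paper's exact architecture is not given in
# this record. The LSTM global branch and all dimensions are assumptions.
import torch
import torch.nn as nn


class ParallelContextBlock(nn.Module):
    """Runs a convolutional (local) branch and a recurrent (global) branch
    in parallel over the same feature sequence, then keeps the maximum
    context value per time step, as the abstract describes."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Local context: 1-D convolution over time with same-length padding.
        self.local = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2)
        # Global context: bidirectional LSTM, projected back to the
        # original channel count (the LSTM choice is an assumption).
        self.rnn = nn.LSTM(channels, channels, batch_first=True,
                           bidirectional=True)
        self.proj = nn.Linear(2 * channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        local = self.local(x)                    # (B, C, T)
        glob, _ = self.rnn(x.transpose(1, 2))    # (B, T, 2C)
        glob = self.proj(glob).transpose(1, 2)   # (B, C, T)
        # Elementwise max keeps the stronger of the local / global
        # context values at each time step.
        return torch.maximum(local, glob)


if __name__ == "__main__":
    # Quick shape check on random data.
    block = ParallelContextBlock(channels=64)
    y = block(torch.randn(2, 64, 100))
    print(y.shape)  # torch.Size([2, 64, 100])
```

A CNN-attention variant would swap the LSTM branch for a self-attention layer over the time axis; the max fusion at the end would be unchanged.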
| _version_ | 1850143662808236032 |
|---|---|
| author | Swati Soni; Lalita Gupta; Rishav Dubey |
| author_facet | Swati Soni; Lalita Gupta; Rishav Dubey |
| author_sort | Swati Soni |
| collection | DOAJ |
| description | Novel deep learning-based time-domain single-channel speech source separation methods have shown remarkable progress. Recent studies achieve either global or local context modeling for monaural speaker separation: existing CNN-based methods perform local context modeling, while RNN-based or attention-based methods model the global context of the speech signal. In this paper, we propose two models, one built on a parallel CNN-RNN-based separation module and the other on a parallel CNN-attention-based separation module, that perform local and global context modeling in parallel. At each time step, our models keep the maximum of the local and global context values, which helps them separate the speaker signals more accurately. We conducted experiments on the Libri2mix and Libri3mix datasets, and the results demonstrate that our proposed models outperform state-of-the-art methods, with marked improvements in SDR and SI-SDR on both datasets. On Libri2mix, the parallel CNN-RNN-based and CNN-attention-based separation models achieve average SDR improvements of 2.10 dB and 2.21 dB, respectively, and average SI-SDR improvements of 2.74 dB and 2.78 dB, respectively. On Libri3mix, the two models achieve average SDR improvements of 0.57 dB and 0.87 dB and average SI-SDR improvements of 0.88 dB and 1.4 dB, respectively. Our work indirectly contributes to SDG Goal 10 (Reduced Inequalities) by improving communication tools for diverse linguistic communities. Furthermore, this technology aids SDG Goal 9 (Industry, Innovation, and Infrastructure) by advancing AI-powered assistive technologies, fostering innovation, and building resilient communication systems. |
| format | Article |
| id | doaj-art-d044f43b1fe844b5b741676016bab99a |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-d044f43b1fe844b5b741676016bab99a (indexed 2025-08-20T02:28:37Z); eng; IEEE; IEEE Access; ISSN 2169-3536; published 2025-01-01; vol. 13, pp. 68607-68621; DOI 10.1109/ACCESS.2025.3562343; IEEE document 10969763; Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques; Swati Soni (https://orcid.org/0000-0003-1147-3877), Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, India; Lalita Gupta (https://orcid.org/0000-0002-6317-0563), Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, India; Rishav Dubey (https://orcid.org/0000-0001-8324-3152), Manipal University at Jaipur, Jaipur, Rajasthan, India; abstract as in the description field above; https://ieeexplore.ieee.org/document/10969763/; Deep-learning; model parallelism; speaker separation; time domain |
| spellingShingle | Swati Soni; Lalita Gupta; Rishav Dubey; Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques; IEEE Access; Deep-learning; model parallelism; speaker separation; time domain |
| title | Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques |
| title_full | Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques |
| title_fullStr | Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques |
| title_full_unstemmed | Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques |
| title_short | Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques |
| title_sort | parallel local and global context modeling of deep learning based monaural speech source separation techniques |
| topic | Deep-learning; model parallelism; speaker separation; time domain |
| url | https://ieeexplore.ieee.org/document/10969763/ |
| work_keys_str_mv | AT swatisoni parallellocalandglobalcontextmodelingofdeeplearningbasedmonauralspeechsourceseparationtechniques AT lalitagupta parallellocalandglobalcontextmodelingofdeeplearningbasedmonauralspeechsourceseparationtechniques AT rishavdubey parallellocalandglobalcontextmodelingofdeeplearningbasedmonauralspeechsourceseparationtechniques |
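The record reports gains in SDR and SI-SDR. SI-SDR (scale-invariant signal-to-distortion ratio) has a standard published definition; the minimal NumPy sketch below implements that conventional formula and is not code from the paper.

```python
# Standard SI-SDR definition (scale-invariant SDR); this is the
# conventional formula from the literature, not the paper's code.
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """SI-SDR in dB between a separated estimate and the clean reference.
    Both signals are zero-meaned; the reference is optimally rescaled so
    the metric is invariant to the estimate's overall gain."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))


if __name__ == "__main__":
    # A slightly noisy copy of a sine wave scores a finite SI-SDR;
    # a perfect (even rescaled) copy would score +inf in exact arithmetic.
    t = np.linspace(0.0, 1.0, 8000)
    ref = np.sin(2.0 * np.pi * 440.0 * t)
    est = ref + 0.05 * np.random.randn(ref.size)
    print(f"SI-SDR: {si_sdr(est, ref):.2f} dB")
```

The "improvement" figures in the abstract are differences between such scores for the proposed models and the baselines they compare against.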