Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques

Novel deep learning-based time-domain single-channel speech source separation methods have shown remarkable progress. Recent studies achieve successful modeling of either global or local context for monaural speaker separation. Existing CNN-based methods perform local context modeling, and RNN-base...


Bibliographic Details
Main Authors: Swati Soni, Lalita Gupta, Rishav Dubey
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Deep-learning, model parallelism, speaker separation, time domain
Online Access:https://ieeexplore.ieee.org/document/10969763/
_version_ 1850143662808236032
author Swati Soni
Lalita Gupta
Rishav Dubey
author_facet Swati Soni
Lalita Gupta
Rishav Dubey
author_sort Swati Soni
collection DOAJ
description Novel deep learning-based time-domain single-channel speech source separation methods have shown remarkable progress. Recent studies achieve successful modeling of either global or local context for monaural speaker separation: existing CNN-based methods perform local context modeling, while RNN-based or attention-based methods model the global context of the speech signal. In this paper, we propose two models that combine CNN-RNN-based and CNN-attention-based separation modules in parallel, performing local and global context modeling simultaneously. At each time step, our models retain the maximum of the local and global context values, which helps them separate the speaker signals more accurately. We conducted experiments on the Libri2mix and Libri3mix datasets; the results demonstrate that our proposed models outperform state-of-the-art methods, with marked SDR and SI-SDR improvements on both datasets. On the Libri2mix dataset, the proposed parallel CNN-RNN-based and CNN-attention-based separation models achieve average SDR improvements of 2.10 dB and 2.21 dB, respectively, and SI-SDR improvements of 2.74 dB and 2.78 dB, respectively. On the Libri3mix dataset, the two models achieve average SDR improvements of 0.57 dB and 0.87 dB and average SI-SDR improvements of 0.88 dB and 1.4 dB, respectively. Our work indirectly contributes to SDG Goal 10 (Reduced Inequalities) by improving communication tools for diverse linguistic communities. Furthermore, this technology aids SDG Goal 9 (Industry, Innovation, and Infrastructure) by advancing AI-powered assistive technologies, fostering innovation, and building resilient communication systems.
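The abstract describes keeping "the maximum global or local context value at a particular time step." The paper's actual architecture is not reproduced in this record, so the following is only a toy sketch of that fusion idea: a moving average stands in for a CNN's local-context branch, a broadcast global mean stands in for an RNN/attention global-context branch, and the two are fused by an elementwise max per time step. All function names here are illustrative, not the authors'.

```python
def local_context(x, k=3):
    # Toy local branch: centered moving average over a k-sample window,
    # standing in for the receptive field of a CNN layer.
    half = k // 2
    out = []
    for t in range(len(x)):
        window = x[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out

def global_context(x):
    # Toy global branch: a single summary statistic (the global mean),
    # standing in for an RNN/attention summary, broadcast to every step.
    mean = sum(x) / len(x)
    return [mean] * len(x)

def fuse_max(x, k=3):
    # At each time step keep the larger of the local and global context
    # values, mirroring the max-fusion the abstract describes.
    return [max(l, g) for l, g in zip(local_context(x, k), global_context(x))]

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0]
fused = fuse_max(signal)
```

By construction, the fused sequence is never below either branch at any step, so whichever context (local or global) is more salient at that instant dominates.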
format Article
id doaj-art-d044f43b1fe844b5b741676016bab99a
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-d044f43b1fe844b5b741676016bab99a2025-08-20T02:28:37ZengIEEEIEEE Access2169-35362025-01-0113686076862110.1109/ACCESS.2025.356234310969763Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation TechniquesSwati Soni0https://orcid.org/0000-0003-1147-3877Lalita Gupta1https://orcid.org/0000-0002-6317-0563Rishav Dubey2https://orcid.org/0000-0001-8324-3152Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, IndiaMaulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, IndiaManipal University at Jaipur, Jaipur, Rajasthan, IndiaThe novel deep learning-based time domain single channel speech source separation methods have shown remarkable progress. Recent studies achieve either successful global or local context modeling for monaural speaker separation. Existing CNN-based methods perform local context modeling, and RNN-based or attention-based methods work on the global context of the speech signal. In this paper, we proposed two models which parallelly combine CNN-RNN-based and CNN-attention-based separation modules and perform parallel local and global context modeling. Our models keep maximum global or local context value at a particular time step. These values help our models to separate the speaker signals more accurately. We have conducted the experiments on Libri2mix and Libri3mix datasets. The experimental data demonstrates that our proposed models have outperformed the state-of-the-art methods. Our proposed models remarkably improve SDR and SI-SDR values on Libri2mix and Libri3mix datasets. The proposed parallel CNN-RNN-based and CNN-attention-based separation models achieve average SDR improvement of 2.10 dB and 2.21 dB, respectively, and SI-SDR improvement of 2.74 dB and 2.78 dB, respectively, on the Libri2mix dataset. 
However, on the Libri3mix dataset, the proposed models achieve 0.57 dB and 0.87 dB average SDR improvement for parallel CNN-RNN-based separation module, and 0.88 dB and 1.4 dB average SI-SDR improvement for CNN-attention-based separation models. Our work indirectly contributes to SDG Goal 10 (Reduced Inequalities) by improving communication tools for diverse linguistic communities. Furthermore, this technology aids SDG Goal 9 (Industry, Innovation, and Infrastructure) by advancing AI-powered assistive technologies, fostering innovation, and building resilient communication systems.https://ieeexplore.ieee.org/document/10969763/Deep-learningmodel parallelismspeaker separationtime domain
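The results above are reported in dB of SI-SDR (scale-invariant signal-to-distortion ratio), the standard metric for time-domain speaker separation. The record does not define it, so for reference, a minimal sketch of the standard SI-SDR formula (project the estimate onto the reference, then compare target energy to residual energy):

```python
import math

def si_sdr(reference, estimate):
    # Scale-invariant SDR in dB: rescale the reference by the projection
    # coefficient, then take 10*log10(target energy / residual energy).
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(d * d for d in residual)
    return 10.0 * math.log10(target_energy / residual_energy)

ref = [1.0, -2.0, 3.0, -4.0]
est = [1.1, -1.9, 2.8, -4.2]
score = si_sdr(ref, est)
```

Because of the projection step, rescaling the estimate by any nonzero constant leaves the score unchanged, which is what makes the metric scale-invariant.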
spellingShingle Swati Soni
Lalita Gupta
Rishav Dubey
Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
IEEE Access
Deep-learning
model parallelism
speaker separation
time domain
title Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
title_full Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
title_fullStr Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
title_full_unstemmed Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
title_short Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
title_sort parallel local and global context modeling of deep learning based monaural speech source separation techniques
topic Deep-learning
model parallelism
speaker separation
time domain
url https://ieeexplore.ieee.org/document/10969763/
work_keys_str_mv AT swatisoni parallellocalandglobalcontextmodelingofdeeplearningbasedmonauralspeechsourceseparationtechniques
AT lalitagupta parallellocalandglobalcontextmodelingofdeeplearningbasedmonauralspeechsourceseparationtechniques
AT rishavdubey parallellocalandglobalcontextmodelingofdeeplearningbasedmonauralspeechsourceseparationtechniques