Multi-Channel Speech Enhancement Using Labelled Random Finite Sets and a Neural Beamformer in Cocktail Party Scenario

Bibliographic Details
Main Authors: Jayanta Datta, Ali Dehghan Firoozabadi, David Zabala-Blanco, Francisco R. Castillo-Soria
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Applied Sciences
Subjects: SRP-PHAT, deep learning, microphone array, MS-GLMB filtering, beamforming
Online Access: https://www.mdpi.com/2076-3417/15/6/2944
_version_ 1849340722736529408
author Jayanta Datta
Ali Dehghan Firoozabadi
David Zabala-Blanco
Francisco R. Castillo-Soria
author_sort Jayanta Datta
collection DOAJ
description This research proposes a multi-channel target speech enhancement scheme based on a deep learning (DL) architecture and assisted by multi-source tracking within a labeled random finite set (RFS) framework. The beamformer is a neural network modeled on the minimum variance distortionless response (MVDR) beamformer: a residual dense convolutional graph-U-Net is applied in a generative adversarial network (GAN) setting to model the beamformer for target speech enhancement under reverberant conditions involving multiple moving speech sources. The input dataset for this neural architecture is constructed by applying multi-source tracking with multi-sensor generalized labeled multi-Bernoulli (MS-GLMB) filtering, which belongs to the labeled RFS framework, to estimate the sources’ positions and the label associated with each source at each time frame with high accuracy despite reverberation and background noise. The tracked positions and labels make it possible to discriminate the target source from the interferers across all time frames and to generate time–frequency (T-F) masks for the target source from the output of a time-varying MVDR beamformer. These T-F masks constitute the target label set used to train the proposed deep neural architecture to perform target speech enhancement. Together, MS-GLMB filtering and the time-varying MVDR beamformer supply spatial information about the sources, in addition to spectral information, to the neural speech enhancement framework during the training phase. Moreover, the GAN framework uses adversarial optimization as an alternative to maximum likelihood (ML)-based frameworks, which further boosts target speech enhancement performance under reverberant conditions.
Computer simulations show that the proposed approach achieves better target speech enhancement than existing state-of-the-art DL-based methods that do not incorporate the labeled RFS-based approach: at a reverberation time (RT60) of 550 ms, it achieves an ESTOI of 75% and a PESQ of 2.70, compared with an ESTOI of 46.74% and a PESQ of 1.84 for Mask-MVDR with a self-attention mechanism.
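The time-varying MVDR beamformer mentioned in the description can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the free-field steering model, the diagonal loading, the 4-microphone geometry, and the random noise covariance are all assumptions made for illustration. It only shows the standard MVDR weight formula w = R^-1 d / (d^H R^-1 d) for one frequency bin, steered at a source position such as a tracker might supply for one time frame.

```python
import numpy as np


def steering_vector(mic_positions, source_position, freq_hz, speed_of_sound=343.0):
    """Free-field steering vector built from per-microphone propagation delays.

    Assumption: direct-path propagation only, referenced to microphone 0.
    """
    distances = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = (distances - distances[0]) / speed_of_sound
    return np.exp(-2j * np.pi * freq_hz * delays)


def mvdr_weights(noise_cov, steering, diag_load=1e-6):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for a single frequency bin."""
    n_mics = noise_cov.shape[0]
    # Diagonal loading (an added assumption) keeps the covariance well conditioned.
    loading = diag_load * np.trace(noise_cov).real / n_mics
    loaded_cov = noise_cov + loading * np.eye(n_mics)
    r_inv_d = np.linalg.solve(loaded_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)


# Hypothetical linear array (metres) and one tracked source position for one frame.
mics = np.array([[0.00, 0.0, 0.0],
                 [0.05, 0.0, 0.0],
                 [0.10, 0.0, 0.0],
                 [0.15, 0.0, 0.0]])
tracked_position = np.array([1.0, 2.0, 0.0])

# Synthetic noise covariance estimated from random complex snapshots.
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200))
noise_cov = snapshots @ snapshots.conj().T / snapshots.shape[1]

d = steering_vector(mics, tracked_position, freq_hz=1000.0)
w = mvdr_weights(noise_cov, d)

# The distortionless constraint w^H d = 1 should hold for the target direction.
distortionless_gain = w.conj() @ d
```

In a time-varying setting, the steering vector and weights would be recomputed for every frame as the tracked position changes; how the paper derives T-F masks from the beamformer output is not specified in this record and is not modeled here.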
format Article
id doaj-art-06d5a92425a443c380105e23abcab58c
institution Kabale University
issn 2076-3417
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-06d5a92425a443c380105e23abcab58c
2025-08-20T03:43:51Z
eng
MDPI AG
Applied Sciences
2076-3417
2025-03-01
15
6
2944
10.3390/app15062944
Multi-Channel Speech Enhancement Using Labelled Random Finite Sets and a Neural Beamformer in Cocktail Party Scenario
Jayanta Datta [0]
Ali Dehghan Firoozabadi [1]
David Zabala-Blanco [2]
Francisco R. Castillo-Soria [3]
[0] Department of Electrical Engineering, Universidad de Chile, Santiago 8370451, Chile
[1] Department of Electricity, Universidad Tecnológica Metropolitana, Av. José Pedro Alessandri 1242, Santiago 7800002, Chile
[2] Department of Computing and Industries, Universidad Católica del Maule, Talca 3466706, Chile
[3] Faculty of Science, Universidad Autónoma de San Luis Potosí, San Luis Potosí 78295, Mexico
https://www.mdpi.com/2076-3417/15/6/2944
SRP-PHAT
deep learning
microphone array
MS-GLMB filtering
beamforming
title Multi-Channel Speech Enhancement Using Labelled Random Finite Sets and a Neural Beamformer in Cocktail Party Scenario
topic SRP-PHAT
deep learning
microphone array
MS-GLMB filtering
beamforming
url https://www.mdpi.com/2076-3417/15/6/2944