Reverb and Noise as Real-World Effects in Speech Recognition Models: A Study and a Proposal of a Feature Set

Reverberation and background noise are common and unavoidable real-world phenomena that hinder automatic speaker recognition systems, particularly because these systems are typically trained on noise-free data. Most models rely on fixed audio feature sets. To evaluate the dependency of features on r...

Bibliographic Details
Main Authors: Valerio Cesarini, Giovanni Costantini
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Applied Sciences
Subjects: speaker recognition; data augmentation; noise; reverb; MFCC; RASTA
Online Access: https://www.mdpi.com/2076-3417/14/23/11446
collection DOAJ
description Reverberation and background noise are common and unavoidable real-world phenomena that hinder automatic speaker recognition systems, particularly because these systems are typically trained on noise-free data. Most models rely on fixed audio feature sets. To evaluate the dependency of features on reverberation and noise, this study proposes augmenting the commonly used mel-frequency cepstral coefficients (MFCCs) with relative spectral (RASTA) features. The performance of these features was assessed using noisy data generated by applying reverberation and pink noise to the DEMoS dataset, which includes 56 speakers. Verification models were trained on clean data using MFCCs, RASTA features, or their combination as inputs. The models were then validated on augmented data with progressively increasing noise and reverberation levels. The results indicate that MFCCs struggle to identify the main speaker, while the RASTA method has difficulty with the opposite class. The hybrid feature set, derived from their combination, demonstrates the best overall performance as a compromise between the two. Although the MFCC method is the standard and performs well on clean training data, it shows a significant tendency to misclassify the main speaker in real-world scenarios, which is a critical limitation for modern user-centric verification applications. The hybrid feature set, therefore, proves effective as a balanced solution, optimizing both sensitivity and specificity.
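The evaluation idea in the description (add noise at progressively harsher levels, then score a verifier by sensitivity and specificity) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `mix_at_snr` and `sensitivity_specificity`, the SNR values, the sine-wave "speech", and the use of white Gaussian noise as a stand-in for pink noise are all assumptions made for illustration, and reverberation is not modelled.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add it."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def sensitivity_specificity(y_true, y_pred):
    """Label 1 = main (target) speaker, label 0 = any other speaker."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Sensitivity: share of target-speaker trials accepted.
    # Specificity: share of other-speaker trials rejected.
    return tp / (tp + fn), tn / (tn + fp)


random.seed(0)
# Hypothetical one-second "utterance": a 220 Hz tone sampled at 16 kHz.
speech = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]
# White Gaussian noise used here only as a placeholder for pink noise.
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

# Progressively noisier copies of the same signal (SNR values are illustrative).
noisy_versions = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
```

In these terms, a feature set that "struggles to identify the main speaker" would show low sensitivity, while one that "has difficulty with the opposite class" would show low specificity.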
id doaj-art-3b2ed6a7391d402cba9e0437ee4d60b0
institution OA Journals
issn 2076-3417
spelling Valerio Cesarini and Giovanni Costantini, Department of Electronic Engineering, University of Rome Tor Vergata, 00133 Rome, Italy. Applied Sciences, vol. 14, no. 23, article 11446. DOI: 10.3390/app142311446
topic speaker recognition
data augmentation
noise
reverb
MFCC
RASTA