Enhancing Far-Field Speech Recognition with Mixer: A Novel Data Augmentation Approach

Bibliographic Details
Main Authors: Tong Niu, Yaqi Chen, Dan Qu, Hengbo Hu
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/7/4073
Description
Summary: Recent advancements in end-to-end (E2E) modeling have notably improved automatic speech recognition (ASR) systems; however, far-field speech recognition (FSR) remains challenging due to signal degradation from factors such as low signal-to-noise ratio, reverberation, and interfering sounds. This requires richer training data and multi-channel speech enhancement. To address this gap, we introduce Mixer, a novel data augmentation technique designed to further enhance the performance of large-scale pre-trained models for FSR. Mixer interpolates and mixes feature representations of speech samples and their corresponding losses, extending the MixSpeech framework to intermediate layers of Whisper. Additionally, we propose Mixer-C, which further leverages multi-channel information by combining speech from different microphone channels using a channel selector. Experimental results demonstrate that Mixer significantly outperforms existing methods, including SpecAugment, achieving a relative word error rate (WER) reduction of 3.6% compared to the baseline. Furthermore, Mixer-C offers an additional WER improvement of 2.2%, showcasing its efficacy in improving FSR accuracy.
ISSN: 2076-3417
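
Illustration: The summary describes Mixer as interpolating the feature representations of two speech samples and mixing their corresponding losses. The sketch below shows that general mixup-style idea on plain Python lists. It is not the authors' implementation: the function names, the frame-by-dimension list layout, the requirement that both sequences have equal length, and drawing the mixing weight from a symmetric Beta distribution are all assumptions made for illustration, and the record does not say which Whisper layers are mixed or how the weight is chosen.

```python
import random


def sample_lambda(alpha=0.5, rng=random):
    """Draw a mixing weight in [0, 1].

    Assumption: a symmetric Beta(alpha, alpha) draw, as in common mixup
    recipes. The catalog record does not state how the weight is chosen.
    """
    return rng.betavariate(alpha, alpha)


def mix_features(feats_a, feats_b, lam):
    """Interpolate two feature sequences frame by frame.

    feats_a, feats_b: lists of frames, each frame a list of floats.
    Assumption: both sequences were already padded or cropped to the
    same number of frames and the same feature dimension.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("sequences must have the same number of frames")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(feats_a, feats_b)
    ]


def mix_losses(loss_a, loss_b, lam):
    """Combine the two per-target losses with the same weight.

    loss_a is the loss of the mixed input against sample A's transcript,
    loss_b the loss of the same mixed input against sample B's transcript.
    """
    return lam * loss_a + (1.0 - lam) * loss_b


if __name__ == "__main__":
    # Two toy one-frame "hidden states" with two dimensions each.
    hidden_a = [[1.0, 2.0]]
    hidden_b = [[3.0, 4.0]]
    lam = 0.25
    print(mix_features(hidden_a, hidden_b, lam))  # [[2.5, 3.5]]
    print(mix_losses(2.0, 4.0, lam))              # 3.5
```

In a real training loop the mixed features would be fed through the remaining model layers, and the two loss terms would come from scoring that single mixed output against each sample's transcript.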