Enhancing Far-Field Speech Recognition with Mixer: A Novel Data Augmentation Approach
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/7/4073 |
| Summary: | Recent advancements in end-to-end (E2E) modeling have notably improved automatic speech recognition (ASR) systems; however, far-field speech recognition (FSR) remains challenging due to signal degradation from factors such as low signal-to-noise ratio, reverberation, and interfering sounds. This requires richer training data and multi-channel speech enhancement. To address this gap, we introduce Mixer, a novel data augmentation technique designed to further enhance the performance of large-scale pre-trained models for FSR. Mixer interpolates and mixes feature representations of speech samples and their corresponding losses, extending the MixSpeech framework to intermediate layers of Whisper. Additionally, we propose Mixer-C, which further leverages multi-channel information by combining speech from different microphone channels using a channel selector. Experimental results demonstrate that Mixer significantly outperforms existing methods, including SpecAugment, achieving a relative word error rate (WER) reduction of 3.6% compared to the baseline. Furthermore, Mixer-C offers an additional WER improvement of 2.2%, showcasing its efficacy in improving FSR accuracy. |
| ISSN: | 2076-3417 |
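
The interpolation described in the summary can be illustrated with a small sketch. This is not the authors' implementation: the function names `mix_features` and `mix_losses`, the Beta(0.5, 0.5) sampling of the mixing weight, the array shapes, and the placeholder loss values are all assumptions made for illustration. It shows only the general mixup-style idea of blending two feature representations with a weight and blending their two losses with the same weight.

```python
import numpy as np


def mix_features(h_a, h_b, lam):
    """Linearly interpolate two same-shaped feature arrays with weight lam."""
    if h_a.shape != h_b.shape:
        raise ValueError("feature arrays must have the same shape")
    return lam * h_a + (1.0 - lam) * h_b


def mix_losses(loss_a, loss_b, lam):
    """Combine the losses against each sample's transcript with the same weight."""
    return lam * loss_a + (1.0 - lam) * loss_b


rng = np.random.default_rng(0)

# Assumption: the mixing weight is drawn from a Beta distribution, as is
# common in mixup-style augmentation; the summary does not state this.
lam = float(rng.beta(0.5, 0.5))

# Hypothetical intermediate-layer features for two utterances: (frames, dims).
h_a = rng.standard_normal((4, 8))
h_b = rng.standard_normal((4, 8))
h_mix = mix_features(h_a, h_b, lam)

# Placeholder loss values standing in for the model's loss on the mixed
# features, scored against each utterance's own transcript.
loss_a, loss_b = 2.0, 4.0
total_loss = mix_losses(loss_a, loss_b, lam)
```

In a real training loop the mixed features would be passed through the remaining layers of the model, and the two losses would come from scoring that single output against both transcripts.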