End-to-end neural automatic speech recognition system for low resource languages


Bibliographic Details
Main Authors: Sami Dhahbi, Nasir Saleem, Sami Bourouis, Mouhebeddine Berrima, Elena Verdú
Format: Article
Language: English
Published: Elsevier 2025-03-01
Series: Egyptian Informatics Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S1110866525000088
Description
Summary: The rising popularity of end-to-end (E2E) automatic speech recognition (ASR) systems can be attributed to their ability to learn complex speech patterns directly from raw data, eliminating the need for intricate feature extraction pipelines and handcrafted language models. E2E-ASR systems have consistently outperformed traditional ASRs. However, training E2E-ASR systems for low-resource languages remains challenging due to the dependence on data from well-resourced languages. ASR is vital for promoting under-resourced languages, especially in developing human-to-human and human-to-machine communication systems. Using synthetic speech and data augmentation techniques can enhance E2E-ASR performance for low-resource languages, reducing word error rates (WERs) and character error rates (CERs). This study leverages a non-autoregressive neural text-to-speech (TTS) engine to generate high-quality speech, converting a series of phonemes into speech waveforms (mel-spectrograms). An on-the-fly data augmentation method is applied to these mel-spectrograms, treating them as images from which features are extracted to train a convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM)-based ASR. The E2E architecture of this system achieves optimal WER and CER performance. The proposed deep learning-based E2E-ASR, trained with synthetic speech and data augmentation, shows significant performance improvements, with a 20.75% reduction in WERs and a 10.34% reduction in CERs.
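The on-the-fly augmentation described in the abstract, which treats mel-spectrograms as images, can be illustrated with a minimal SpecAugment-style masking sketch. The function name, parameters, and mask widths below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def augment_mel(mel, num_freq_masks=2, num_time_masks=2,
                max_freq_width=8, max_time_width=20, rng=None):
    """Apply SpecAugment-style masking to a mel-spectrogram of shape
    (n_mels, n_frames): zero out a few random frequency bands and time
    spans, mimicking on-the-fly augmentation of spectrogram 'images'."""
    rng = rng or np.random.default_rng()
    out = mel.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        # Zero a random band of consecutive mel bins.
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - w)))
        out[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        # Zero a random span of consecutive frames.
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Because the masks are drawn freshly each time the function is called, every training epoch sees a different corrupted view of the same utterance, which is what makes the augmentation "on-the-fly" rather than a fixed pre-computed dataset.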
ISSN: 1110-8665