Pitch-Speed Feature Space Data Augmentation for Automatic Speech Recognition Improvement in Low-Resource Scenario
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/11003213/ |
| Summary: | Rapid advancements in Artificial Intelligence (AI) and Human-Computer Interaction (HCI) have introduced new sensory interaction paradigms that engage diverse age groups and skill levels. However, these advancements are limited by the quality of the Automatic Speech Recognition (ASR) tools available for regional languages. Research on ASR for Low-Resource Languages (LRL) addresses these limitations by developing techniques to improve ASR for local languages, thereby extending technical accessibility to native communities. This paper presents a data augmentation framework to address data sparsity in ASR for LRL. The proposed framework establishes a Pitch-Speed Feature Space (PSFS) through augmentation that encompasses all variations of domain-specific speech samples spread evenly across the pitch and speed dimensions. Within this PSFS, the base audio data are augmented multifold to fill possible gaps in the feature space while keeping the data size manageable for training the ASR system. The results show a positive trend in ASR accuracy, even when the augmented data come from a single speaker. Depending on the individual characteristics of the speaker, a maximum ASR accuracy improvement of 10.3% was achieved. The results of this study contribute significantly to improving ASR accuracy for LRL. The framework also complements existing efforts to enhance ASR accuracy in general. In addition to the gains in ASR accuracy, the model trained using the proposed framework can be utilised in a semi-supervised scheme to label unseen data, thereby expanding the dataset to enhance ASR system development for LRL. |
| ISSN: | 2169-3536 |
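
The summary describes augmented samples spread evenly across the pitch and speed dimensions. As a rough illustration only, the sketch below enumerates an evenly spaced grid of (pitch shift, speed factor) settings that could drive such an augmentation. The record does not give the ranges, step counts, units, or tooling the authors used, so the function names `linspace` and `pitch_speed_grid`, the semitone unit, and every default value here are hypothetical placeholders, not the paper's method.

```python
from itertools import product


def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop, inclusive."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [float(start)]
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def pitch_speed_grid(pitch_range=(-2.0, 2.0), speed_range=(0.8, 1.2),
                     pitch_steps=5, speed_steps=5):
    """Enumerate (pitch shift, speed factor) pairs on an even grid.

    Assumptions (not taken from the paper): pitch is expressed in
    semitones, speed as a playback-rate multiplier, and the default
    ranges and step counts are arbitrary illustrative values.
    """
    pitches = linspace(pitch_range[0], pitch_range[1], pitch_steps)
    speeds = linspace(speed_range[0], speed_range[1], speed_steps)
    # Cartesian product: every pitch setting paired with every speed setting.
    return list(product(pitches, speeds))


# Each base utterance would be rendered once per grid point by an audio
# tool of choice; that rendering step is deliberately left out here.
grid = pitch_speed_grid()
```

With these placeholder defaults, one base recording yields 25 parameter settings, which shows how the dataset grows multifold while the step counts keep the total size under control.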