Pitch-Speed Feature Space Data Augmentation for Automatic Speech Recognition Improvement in Low-Resource Scenario

Bibliographic Details
Main Authors: Syed Muhammad Zahid, Saad Ahmed Qazi
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
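Subjects: Audio perturbation; data augmentation; ensemble approach; low-resource ASR; pitch shifting; pitch-speed feature space
Online Access: https://ieeexplore.ieee.org/document/11003213/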
author Syed Muhammad Zahid
Saad Ahmed Qazi
collection DOAJ
description Rapid advancements in Artificial Intelligence (AI) and Human-Computer Interaction (HCI) have introduced new sensory interaction paradigms that engage diverse age groups and skill levels. However, these advancements are limited by the quality of the Automatic Speech Recognition (ASR) tools available for regional languages. Research on ASR for Low-Resource Languages (LRL) addresses these limitations by developing techniques to improve ASR for local languages, thereby extending technical accessibility to native communities. This paper presents a data augmentation framework to address data sparsity in ASR for LRL. Through augmentation, the proposed framework establishes a Pitch-Speed Feature Space (PSFS) that encompasses the variations of domain-specific speech samples, spread evenly across the pitch and speed dimensions. Within this PSFS, the base audio data are augmented multifold to fill possible gaps in the feature space while keeping the data size manageable for training the ASR system. The results show a positive trend in ASR accuracy, even when the augmented data come from a single speaker. Depending on the individual characteristics of the speaker, a maximum ASR accuracy improvement of 10.3% was achieved. These results contribute to improving ASR accuracy for LRL, and the framework also complements existing efforts to enhance ASR accuracy in general. Beyond the gains in ASR accuracy, the model trained using the proposed framework can be used in a semi-supervised scheme to label unseen data, thereby expanding the dataset available for developing ASR systems for LRL.
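The PSFS idea in the description, spreading augmented copies of each base recording evenly over a pitch axis and a speed axis, can be illustrated with a minimal Python sketch. This record does not state the paper's pitch or speed ranges, step counts, or tooling, so everything below is an assumption for illustration rather than the authors' method: the ranges (±300 cents of pitch shift, tempo factor 0.8 to 1.2), the 5×5 grid, the hypothetical helper names `linspace`, `psfs_grid` and `sox_commands`, and the choice of SoX's `pitch` effect (shift in cents, duration preserved) and `tempo` effect (speed factor, pitch preserved) as the two independent perturbations. The sketch only builds the grid and the command lines; it does not run SoX.

```python
from __future__ import annotations

from itertools import product


def linspace(start: float, stop: float, num: int) -> list[float]:
    """Return `num` evenly spaced values from start to stop, inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    # Rounding removes floating-point noise such as 0.8999999999999999.
    return [round(start + i * step, 6) for i in range(num)]


def psfs_grid(
    pitch_range_cents: tuple[float, float] = (-300.0, 300.0),  # hypothetical range
    speed_range: tuple[float, float] = (0.8, 1.2),  # hypothetical range
    pitch_steps: int = 5,
    speed_steps: int = 5,
) -> list[tuple[float, float]]:
    """Evenly spaced (pitch shift in cents, tempo factor) points on a 2-D grid."""
    pitches = linspace(*pitch_range_cents, pitch_steps)
    speeds = linspace(*speed_range, speed_steps)
    return list(product(pitches, speeds))


def sox_commands(base_files: list[str], grid: list[tuple[float, float]]) -> list[list[str]]:
    """Build one SoX command per (base file, grid point) pair; nothing is executed.

    `pitch` shifts pitch by the given number of cents without changing duration;
    `tempo` changes speed by the given factor without changing pitch.
    """
    commands = []
    for path, (cents, tempo) in product(base_files, grid):
        stem = path.rsplit(".", 1)[0]
        out = f"{stem}_p{cents:+g}_t{tempo:g}.wav"
        commands.append(["sox", path, out, "pitch", f"{cents:g}", "tempo", f"{tempo:g}"])
    return commands


if __name__ == "__main__":
    grid = psfs_grid()
    cmds = sox_commands(["utt001.wav"], grid)
    print(len(grid))          # 25 grid points per base recording
    print(" ".join(cmds[0]))  # sox utt001.wav utt001_p-300_t0.8.wav pitch -300 tempo 0.8
```

Each base file yields one variant per grid point, so this hypothetical 5×5 grid multiplies the data 25-fold; the centre point (0 cents, tempo 1.0) essentially reproduces the original and could be dropped. Coarser or finer grids trade coverage of the pitch-speed space against training-set size, which is the balance the description refers to as a manageable data size.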
id doaj-art-b2a14079411041b6977f0e2da4c5975f
institution OA Journals
issn 2169-3536
volume 13
pages 86998-87014
doi 10.1109/ACCESS.2025.3569579
affiliation Syed Muhammad Zahid (ORCID: https://orcid.org/0009-0004-2898-6681): Department of Electrical Engineering, NED University of Engineering and Technology, Karachi, Pakistan
Saad Ahmed Qazi (ORCID: https://orcid.org/0000-0002-1522-0677): National Center for Artificial Intelligence (NCAI), NED University of Engineering and Technology, Karachi, Pakistan
topic Audio perturbation
data augmentation
ensemble approach
low-resource ASR
pitch shifting
pitch-speed feature space