Advancing automatic speech recognition for low-resource Ghanaian languages: Audio datasets for Akan, Ewe, Dagbani, Dagaare, and Ikposo (Science Data Bank)

Audio datasets are fundamental to the development of automatic speech-recognition (ASR) systems. However, the availability of a large corpus of audio datasets in low-resource languages (LRLs) is limited. This study addresses this gap by introducing audio speech datasets for five low-resource languag...

Bibliographic Details
Main Authors:	Isaac Wiafe, Jamal-Deen Abdulai, Akon Obu Ekpezu, Raynard Dodzi Helegah, Elikem Doe Atsakpo, Charles Nutrokpor, Fiifi Baffoe Payin Winful, Kafui Kwashie Solaga
Format:	Article
Language:	English
Published:	Elsevier 2025-08-01
Series:	Data in Brief
Subjects:	Speech-to-text; Speech synthesis; Low-resource languages; Natural language processing; Text-to-speech; Speech datasets
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340925006043

author	Isaac Wiafe; Jamal-Deen Abdulai; Akon Obu Ekpezu; Raynard Dodzi Helegah; Elikem Doe Atsakpo; Charles Nutrokpor; Fiifi Baffoe Payin Winful; Kafui Kwashie Solaga
author_sort	Isaac Wiafe
collection	DOAJ
description	Audio datasets are fundamental to the development of automatic speech recognition (ASR) systems. However, large audio corpora for low-resource languages (LRLs) remain scarce. This study addresses this gap by introducing speech datasets for five low-resource languages spoken in Ghana and parts of Togo. Specifically, it presents a 5000-hour speech corpus in Akan, Ewe, Dagbani, Dagaare, and Ikposo. Each language corpus includes 1000 h of validated speech recorded by indigenous speakers. These audio recordings are spoken descriptions of 1000 culturally relevant images collected using a custom Android mobile application. To enhance the dataset’s utility in ASR and linguistic research, 10% of the audio recordings for each language were randomly selected and transcribed, resulting in approximately 100 h of transcribed speech per language. This dataset represents a critical resource for preserving and documenting Ghanaian languages, and it has the potential to advance speech and language technologies in these languages. Creating this audio dataset is a first step towards bridging the technological gap between high- and low-resource languages. Ethical guidelines were strictly followed throughout the data collection process, and participants were given incentives for lending their voices to this study.
format	Article
id	doaj-art-07719d37eaea485dba01b925dcec26c4
institution	Kabale University
issn	2352-3409
language	English
publishDate	2025-08-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj-art-07719d37eaea485dba01b925dcec26c4; 2025-08-20T03:57:36Z; eng; Elsevier; Data in Brief; ISSN 2352-3409; 2025-08-01; volume 61; article 111880; DOI 10.1016/j.dib.2025.111880; Isaac Wiafe (corresponding author), Jamal-Deen Abdulai, Akon Obu Ekpezu, Raynard Dodzi Helegah, Elikem Doe Atsakpo, Charles Nutrokpor, Fiifi Baffoe Payin Winful, Kafui Kwashie Solaga; affiliation of all authors: Department of Computer Science, University of Ghana, Legon-Accra, Ghana
title	Advancing automatic speech recognition for low-resource Ghanaian languages: Audio datasets for Akan, Ewe, Dagbani, Dagaare, and Ikposo (Science Data Bank)
topic	Speech-to-text; Speech synthesis; Low-resource languages; Natural language processing; Text-to-speech; Speech datasets
url	http://www.sciencedirect.com/science/article/pii/S2352340925006043

Similar Items