A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis (Mendeley Data)

Bibliographic Details
Main Authors: Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba-Nabende, John Quinn, Sudi Murindanyi
Format: Article
Language: English
Published: Elsevier 2025-10-01
Series: Data in Brief
Subjects: Speech dataset; Text-to-speech; Low-resource languages; Luganda; Kiswahili
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340925006390
author Andrew Katumba
Sulaiman Kagumire
Joyce Nakatumba-Nabende
John Quinn
Sudi Murindanyi
collection DOAJ
description This data article describes a curated, crowdsourced speech dataset in Luganda and Kiswahili, created to support text-to-speech (TTS) development in low-resource settings. The dataset is derived from Mozilla’s Common Voice corpus and includes only validated utterances from female speakers. A multi-step curation process was used to enhance the consistency and quality of the data. Speakers were first manually selected based on similarities in intonation, pitch, and rhythm, then validated using acoustic clustering with pitch features and mel-frequency cepstral coefficients (MFCCs). Audio files were preprocessed to remove leading and trailing silences using WebRTC voice activity detection, denoised with a causal waveform-based DEMUCS model, and filtered using WV-MOS, an automatic speech quality scoring tool. Only clips with a predicted MOS score of 3.5 or higher were retained. The final dataset contains over 19 h of Luganda and 15 h of Kiswahili recordings from six female speakers per language, each paired with a text transcription. This dataset is designed to support speech generation research in Luganda and Kiswahili and enable reproducible experimentation in end-to-end TTS systems.
format Article
id doaj-art-d9f9648f8a3948fdbe64ae6210b4b92c
institution Kabale University
issn 2352-3409
language English
publishDate 2025-10-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-d9f9648f8a3948fdbe64ae6210b4b92c; 2025-08-20T03:33:31Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-10-01; 62; 111915; 10.1016/j.dib.2025.111915
Andrew Katumba: Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda (corresponding author)
Sulaiman Kagumire: Department of Computer Science, Makerere University, Kampala, Uganda
Joyce Nakatumba-Nabende: Department of Computer Science, Makerere University, Kampala, Uganda
John Quinn: Department of Computer Science, Makerere University, Kampala, Uganda
Sudi Murindanyi: Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda
title A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis (Mendeley Data)
topic Speech dataset
Text-to-speech
Low-resource languages
Luganda
Kiswahili
url http://www.sciencedirect.com/science/article/pii/S2352340925006390
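
The description states that only clips with a predicted MOS of 3.5 or higher were retained. The following is a minimal sketch of that final filtering step, not the authors' code: the `Clip` record, the file names, and the example scores are hypothetical, and the scores are assumed to have been computed beforehand by an automatic quality predictor such as WV-MOS.

```python
from dataclasses import dataclass

# Retention cutoff taken from the record's description.
MOS_THRESHOLD = 3.5


@dataclass
class Clip:
    """Hypothetical per-clip record; field names are illustrative only."""
    path: str             # audio file name
    duration_s: float     # clip length in seconds
    predicted_mos: float  # score assumed to come from a MOS predictor


def filter_by_mos(clips, threshold=MOS_THRESHOLD):
    """Return the clips whose predicted MOS is at or above the threshold."""
    return [c for c in clips if c.predicted_mos >= threshold]


def total_hours(clips):
    """Sum clip durations and convert seconds to hours."""
    return sum(c.duration_s for c in clips) / 3600.0


# Made-up example values, not data from the published dataset.
clips = [
    Clip("clip_a.wav", 4.0, 3.9),
    Clip("clip_b.wav", 6.0, 3.5),
    Clip("clip_c.wav", 5.0, 3.2),
]

kept = filter_by_mos(clips)
print([c.path for c in kept])
print(f"{total_hours(kept):.6f} h retained")
```

The comparison is inclusive (`>=`) because the description says "3.5 or higher". Silence trimming and denoising would run before this step and are not shown here.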