A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis
This data article describes a curated, crowdsourced speech dataset in Luganda and Kiswahili, created to support text-to-speech (TTS) development in low-resource settings. The dataset is derived from Mozilla’s Common Voice corpus and includes only validated utterances from female speakers. A multi-st...
| Main Authors: | Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba-Nabende, John Quinn, Sudi Murindanyi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-10-01 |
| Series: | Data in Brief |
| Subjects: | Speech dataset; Text-to-speech; Low-resource languages; Luganda; Kiswahili |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340925006390 |
| _version_ | 1849415454803623936 |
|---|---|
| author | Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba-Nabende, John Quinn, Sudi Murindanyi |
| author_facet | Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba-Nabende, John Quinn, Sudi Murindanyi |
| author_sort | Andrew Katumba |
| collection | DOAJ |
| description | This data article describes a curated, crowdsourced speech dataset in Luganda and Kiswahili, created to support text-to-speech (TTS) development in low-resource settings. The dataset is derived from Mozilla’s Common Voice corpus and includes only validated utterances from female speakers. A multi-step curation process was used to enhance the consistency and quality of the data. Speakers were first manually selected based on similarities in intonation, pitch, and rhythm, then validated using acoustic clustering with pitch features and mel-frequency cepstral coefficients (MFCCs). Audio files were preprocessed to remove leading and trailing silences using WebRTC voice activity detection, denoised with a causal waveform-based DEMUCS model, and filtered using WV-MOS, an automatic speech quality scoring tool. Only clips with a predicted mean opinion score (MOS) of 3.5 or higher were retained. The final dataset contains over 19 h of Luganda and 15 h of Kiswahili recordings from six female speakers per language, each paired with a text transcription. This dataset is designed to support speech generation research in Luganda and Kiswahili and enable reproducible experimentation in end-to-end TTS systems. |
| format | Article |
| id | doaj-art-d9f9648f8a3948fdbe64ae6210b4b92c |
| institution | Kabale University |
| issn | 2352-3409 |
| language | English |
| publishDate | 2025-10-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Data in Brief |
| spelling | doaj-art-d9f9648f8a3948fdbe64ae6210b4b92c; 2025-08-20T03:33:31Z; eng; Elsevier; Data in Brief; ISSN 2352-3409; 2025-10-01; vol. 62, article 111915; doi:10.1016/j.dib.2025.111915; A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis; Andrew Katumba (Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda; corresponding author); Sulaiman Kagumire (Department of Computer Science, Makerere University, Kampala, Uganda); Joyce Nakatumba-Nabende (Department of Computer Science, Makerere University, Kampala, Uganda); John Quinn (Department of Computer Science, Makerere University, Kampala, Uganda); Sudi Murindanyi (Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda); http://www.sciencedirect.com/science/article/pii/S2352340925006390; Speech dataset; Text-to-speech; Low-resource languages; Luganda; Kiswahili |
| title | A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis |
| title_full | A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis |
| title_fullStr | A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis |
| title_full_unstemmed | A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis |
| title_short | A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis |
| title_sort | curated crowdsourced dataset of luganda and swahili speech for text to speech synthesis |
| topic | Speech dataset; Text-to-speech; Low-resource languages; Luganda; Kiswahili |
| url | http://www.sciencedirect.com/science/article/pii/S2352340925006390 |
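
The description in this record states that only clips with a predicted MOS of 3.5 or higher were retained. As a rough illustration of that filtering step only, the sketch below applies the threshold to a handful of clip records and totals the retained audio per language. It is not the authors' code: the `Clip` record, its field names, the file names, and the example scores are all hypothetical, and the scores are assumed to have been produced beforehand by a quality predictor such as WV-MOS.

```python
from dataclasses import dataclass

# Threshold taken from the record's description: keep clips with predicted MOS >= 3.5.
MOS_THRESHOLD = 3.5


@dataclass(frozen=True)
class Clip:
    """Hypothetical per-clip record; field names are assumptions for illustration."""
    path: str             # made-up file name
    language: str         # "luganda" or "kiswahili"
    duration_s: float     # clip length in seconds
    predicted_mos: float  # score assumed to be precomputed by a MOS predictor


def filter_by_mos(clips, threshold=MOS_THRESHOLD):
    """Return only the clips whose predicted MOS meets or exceeds the threshold."""
    return [c for c in clips if c.predicted_mos >= threshold]


def hours_per_language(clips):
    """Sum the duration of the given clips per language, in hours."""
    totals = {}
    for c in clips:
        totals[c.language] = totals.get(c.language, 0.0) + c.duration_s / 3600.0
    return totals


# Made-up example values, not taken from the dataset.
clips = [
    Clip("lg_0001.wav", "luganda", 7.2, 4.1),
    Clip("lg_0002.wav", "luganda", 5.0, 3.2),    # below threshold, dropped
    Clip("sw_0001.wav", "kiswahili", 3.6, 3.5),  # exactly at threshold, kept
]

kept = filter_by_mos(clips)
totals = hours_per_language(kept)
print([c.path for c in kept])
print(totals)
```

The comparison is inclusive (`>=`) to match the wording "3.5 or higher". Silence trimming and denoising, which the description places before this step, are deliberately left out of the sketch.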