Building Text‐to‐Speech Models for Low‐Resourced Languages From Crowdsourced Data

Bibliographic Details
Main Authors: Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba‐Nabende, John Quinn, Sudi Murindanyi
Format: Article
Language: English
Published: Wiley 2025-04-01
Series: Applied AI Letters
Online Access: https://doi.org/10.1002/ail2.117
Abstract: Text‐to‐speech (TTS) models have expanded digital inclusivity: they underpin assistive communication technologies for visually impaired people, facilitate language learning, and let digital text be consumed as audio across many sectors. Despite these benefits, the potential of TTS is often not realized for most low‐resourced African languages, because TTS models have traditionally required large amounts of high‐quality single‐speaker recordings, which are costly and time‐consuming to obtain. In this paper, we demonstrate that crowdsourced recordings can help overcome the lack of single‐speaker data by compensating with data from other speakers with similar intonation (how the voice rises and falls in speech). We fine‐tuned an English VITS (variational inference with adversarial learning for end‐to‐end text‐to‐speech) model on over 10 h of speech from six female speakers in the Common Voice (CV) dataset for Luganda and Kiswahili. A human mean opinion score evaluation on 100 test sentences shows that, for both languages, the model trained on six speakers sounds more natural than benchmark models trained on two speakers and on a single speaker. Together with careful data curation, this approach shows promise for advancing speech synthesis for low‐resourced African languages. Our final models for Luganda and Kiswahili are available at https://huggingface.co/marconilab/VITS-commonvoice-females.
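The mean opinion score (MOS) comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the listener ratings were aggregated, so this assumes the common convention of averaging 1 to 5 ratings per system and reporting a normal-approximation 95% interval. The function name `mean_opinion_score`, the system labels, and all ratings are hypothetical and invented for illustration; none of them come from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings, invented for illustration only (not from the paper).
ratings_by_system = {
    "six_speakers": [4, 5, 4, 4, 3, 5, 4, 4],
    "single_speaker": [3, 4, 3, 2, 3, 4, 3, 3],
}

for system, ratings in ratings_by_system.items():
    mos, ci = mean_opinion_score(ratings)
    print(f"{system}: MOS = {mos:.2f} +/- {ci:.2f}")
```

With real data, each system would typically collect several ratings per test sentence, and overlapping intervals would suggest that a difference in naturalness is not clearly established.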
Record ID: doaj-art-3e5a42c2cad848daaa23adbf7f6f2158
Institution: DOAJ
ISSN: 2689-5595
Author Affiliations:
Andrew Katumba: Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda
Sulaiman Kagumire: Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda
Joyce Nakatumba‐Nabende: Department of Computer Science, Makerere University, Kampala, Uganda
John Quinn: Department of Computer Science, Makerere University, Kampala, Uganda
Sudi Murindanyi: Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda
Subjects: common voice, crowdsourced, Kiswahili, low‐resourced, Luganda, text‐to‐speech