Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement

In recent years, Text-to-Speech (TTS) technology has advanced, with research focusing on multi-speaker TTS capable of voice cloning. In 2023, Wang et al. introduced Vall-E, a Transformer-based neural codec language model, achieving state-of-the-art results in voice cloning. However, limited research...

Bibliographic Details
Main Authors:	Hizkia Raditya Pratama Roosadi, Rizki Rivai Ginanjar, Dessi Puji Lestari
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Neural codec language model; speech enhancement; transformer; text-to-speech; Vall-E; voice cloning
Online Access:	https://ieeexplore.ieee.org/document/10806715/

_version_	1850242842648117248
author	Hizkia Raditya Pratama Roosadi; Rizki Rivai Ginanjar; Dessi Puji Lestari
author_facet	Hizkia Raditya Pratama Roosadi; Rizki Rivai Ginanjar; Dessi Puji Lestari
author_sort	Hizkia Raditya Pratama Roosadi
collection	DOAJ
description	In recent years, Text-to-Speech (TTS) technology has advanced, with research focusing on multi-speaker TTS capable of voice cloning. In 2023, Wang et al. introduced Vall-E, a Transformer-based neural codec language model, achieving state-of-the-art results in voice cloning. However, limited research has applied such models to the Indonesian language, leaving room for improvement in speech synthesis. This paper explores the development of a TTS system using Vall-E and enhancements to the synthesized speech. The dataset, comprising audio-transcript pairs, was sourced from previous Indonesian speech processing research. Data preparation involved converting audio into codec tokens and transcripts into phoneme tokens. Following Wang et al., a neural codec language model was built and trained using open-source tools. Additionally, this paper explores the integration of the VoiceFixer tool for speech enhancement. The inclusion of VoiceFixer improved the naturalness MOS score from 3.34 to 3.95, demonstrating its effectiveness in enhancing speech quality. Overall, the TTS system achieved a naturalness MOS score of 3.489 and a similarity MOS score of 3.521, with a WER of 19.71%; speaker similarity was also examined through a visualization of speaker embedding vectors. These results indicate that the Vall-E model can produce Indonesian speech with high speaker similarity. The development also emphasizes the importance of factors such as the number of speakers, data selection, processing components, modeling, and speech duration during training for synthesis quality.
format	Article
id	doaj-art-b24cc2d4fb7e4393b3615ca4c246f79a
institution	OA Journals
issn	2169-3536
language	English
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-b24cc2d4fb7e4393b3615ca4c246f79a; 2025-08-20T02:00:10Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 193131-193140; DOI 10.1109/ACCESS.2024.3519870; document 10806715; Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement; Hizkia Raditya Pratama Roosadi (0, https://orcid.org/0009-0003-0010-3034), Rizki Rivai Ginanjar (1), Dessi Puji Lestari (2); (0) School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia; (1) Speech TTS and Paralinguistics Division, Prosa.ai, Bandung, Jawa Barat, Indonesia; (2) School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia; https://ieeexplore.ieee.org/document/10806715/; Neural codec language model; speech enhancement; transformer; text-to-speech; Vall-E; voice cloning
spellingShingle	Hizkia Raditya Pratama Roosadi; Rizki Rivai Ginanjar; Dessi Puji Lestari; Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement; IEEE Access; Neural codec language model; speech enhancement; transformer; text-to-speech; Vall-E; voice cloning
title	Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement
title_full	Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement
title_fullStr	Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement
title_full_unstemmed	Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement
title_short	Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement
title_sort	indonesian voice cloning text to speech system with vall e based model and speech enhancement
topic	Neural codec language model; speech enhancement; transformer; text-to-speech; Vall-E; voice cloning
url	https://ieeexplore.ieee.org/document/10806715/
work_keys_str_mv	AT hizkiaradityapratamaroosadi indonesianvoicecloningtexttospeechsystemwithvallebasedmodelandspeechenhancement AT rizkirivaiginanjar indonesianvoicecloningtexttospeechsystemwithvallebasedmodelandspeechenhancement AT dessipujilestari indonesianvoicecloningtexttospeechsystemwithvallebasedmodelandspeechenhancement
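
Note on the reported WER: the record gives a word error rate of 19.71% but does not say how it was computed. The following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), assuming plain whitespace tokenization and no text normalization. The function name and the example sentences are illustrative only and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcript pair: one reference word ("ke") is missing.
    print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))
```

In a TTS evaluation of this kind, the hypothesis would typically be the output of a speech recognizer run on the synthesized audio, and the reference would be the input text; whether the paper followed that setup is an assumption here.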

Similar Items