End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder

Multi-speaker text-to-speech (TTS) systems play a crucial role in different applications, such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of different speakers. In this paper, we strive to enhance the naturalness and speaker similarity of the FastSpeech2 model in multi-speaker text-to-speech synthesis across closed and open-set speaker scenarios while preserving its high inference speed and lightweight architecture. Specifically, we introduce a hierarchical decoder structure and a speaker similarity loss function to enhance speaker fidelity in synthesized speech. Additionally, we investigate various methods for integrating speaker embeddings within the model and propose an end-to-end training strategy to mitigate error propagation, an inherent limitation of cascaded models. Experimental results demonstrate that our modified FastSpeech2 model significantly outperforms the baseline in closed and open-set scenarios. The proposed model achieves an absolute improvement of 0.89 in Mean Opinion Score (MOS) and 0.44 in Speaker Similarity MOS (SMOS) while maintaining the high inference speed of FastSpeech2.


Bibliographic Details
Main Authors: Majid Adibian, Hossein Zeinali
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Neural text-to-speech; multi-speaker speech synthesis; end-to-end deep learning models; speaker adaptation in TTS; non-autoregressive speech generation
Online Access:https://ieeexplore.ieee.org/document/11080147/
author Majid Adibian
Hossein Zeinali
collection DOAJ
description Multi-speaker text-to-speech (TTS) systems play a crucial role in different applications, such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of different speakers. In this paper, we strive to enhance the naturalness and speaker similarity of the FastSpeech2 model in multi-speaker text-to-speech synthesis across closed and open-set speaker scenarios while preserving its high inference speed and lightweight architecture. Specifically, we introduce a hierarchical decoder structure and a speaker similarity loss function to enhance speaker fidelity in synthesized speech. Additionally, we investigate various methods for integrating speaker embeddings within the model and propose an end-to-end training strategy to mitigate error propagation, an inherent limitation of cascaded models. Experimental results demonstrate that our modified FastSpeech2 model significantly outperforms the baseline in closed and open-set scenarios. The proposed model achieves an absolute improvement of 0.89 in Mean Opinion Score (MOS) and 0.44 in Speaker Similarity MOS (SMOS) while maintaining the high inference speed of FastSpeech2.
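The description above mentions a "speaker similarity loss function" used to enhance speaker fidelity. The paper's exact formulation is not given in this record; a common choice in the literature, shown here purely as an assumption, is one minus the cosine similarity between speaker embeddings extracted from reference and synthesized speech by a fixed speaker encoder. All names below (`speaker_similarity_loss`, `cosine_similarity`) are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def speaker_similarity_loss(ref_emb, syn_emb):
    """Hypothetical sketch: 0 when embeddings align, up to 2 when opposite.

    In a real TTS pipeline, ref_emb and syn_emb would come from a pretrained
    speaker encoder applied to ground-truth and synthesized audio.
    """
    return 1.0 - cosine_similarity(ref_emb, syn_emb)
```

In practice such a term would be computed with a differentiable speaker encoder and added to the spectrogram reconstruction losses during end-to-end training; this sketch only illustrates the shape of the objective.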
format Article
id doaj-art-040fb77b3b3c46348e32b08ee4b4776f
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-040fb77b3b3c46348e32b08ee4b4776f (indexed 2025-08-20T03:13:58Z), English. IEEE Access, vol. 13, pp. 127805-127814, 2025-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2025.3589120 (IEEE document 11080147). End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder. Majid Adibian and Hossein Zeinali (https://orcid.org/0000-0002-3789-8091), Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran. Online access: https://ieeexplore.ieee.org/document/11080147/
title End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder
topic Neural text-to-speech
multi-speaker speech synthesis
end-to-end deep learning models
speaker adaptation in TTS
non-autoregressive speech generation
url https://ieeexplore.ieee.org/document/11080147/