End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder
Multi-speaker text-to-speech (TTS) systems play a crucial role in different applications, such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of different speakers.
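The abstract refers to a speaker similarity loss and to conditioning the model on speaker embeddings, but the record does not include implementation details. The snippet below is a minimal PyTorch sketch of how such components are commonly realized; the module and function names (`SpeakerConditioning`, `speaker_similarity_loss`) and the dimensions are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: conditioning encoder outputs on a speaker embedding and
# penalizing speaker mismatch with a cosine-similarity loss. Names and sizes
# are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerConditioning(nn.Module):
    """Adds a projected speaker embedding to every frame of the encoder output."""
    def __init__(self, speaker_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, encoder_out: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        return encoder_out + self.proj(speaker_emb).unsqueeze(1)

def speaker_similarity_loss(pred_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between speaker embeddings of synthesized and
    reference speech, both of shape (batch, speaker_dim)."""
    return (1.0 - F.cosine_similarity(pred_emb, ref_emb, dim=-1)).mean()

if __name__ == "__main__":
    cond = SpeakerConditioning(speaker_dim=256, hidden_dim=384)
    enc = torch.randn(2, 120, 384)   # dummy encoder output
    spk = torch.randn(2, 256)        # dummy speaker embedding
    conditioned = cond(enc, spk)
    loss = speaker_similarity_loss(torch.randn(2, 256), spk)
    print(conditioned.shape, loss.item())
```

In a full system, `pred_emb` would typically be extracted from the synthesized speech by a pretrained speaker encoder, and this loss would be summed with the standard FastSpeech2 reconstruction and variance losses.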
| Main Authors: | Majid Adibian, Hossein Zeinali |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Neural text-to-speech; multi-speaker speech synthesis; end-to-end deep learning models; speaker adaptation in TTS; non-autoregressive speech generation |
| Online Access: | https://ieeexplore.ieee.org/document/11080147/ |
| _version_ | 1849713411094478848 |
|---|---|
| author | Majid Adibian; Hossein Zeinali |
| author_facet | Majid Adibian; Hossein Zeinali |
| author_sort | Majid Adibian |
| collection | DOAJ |
| description | Multi-speaker text-to-speech (TTS) systems play a crucial role in different applications, such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of different speakers. In this paper, we strive to enhance the naturalness and speaker similarity of the FastSpeech2 model in multi-speaker text-to-speech synthesis across closed and open-set speaker scenarios while preserving its high inference speed and lightweight architecture. Specifically, we introduce a hierarchical decoder structure and a speaker similarity loss function to enhance speaker fidelity in synthesized speech. Additionally, we investigate various methods for integrating speaker embeddings within the model and propose an end-to-end training strategy to mitigate error propagation, an inherent limitation of cascaded models. Experimental results demonstrate that our modified FastSpeech2 model significantly outperforms the baseline in closed and open-set scenarios. The proposed model achieves an absolute improvement of 0.89 in Mean Opinion Score (MOS) and 0.44 in Speaker Similarity MOS (SMOS) while maintaining the high inference speed of FastSpeech2. |
| format | Article |
| id | doaj-art-040fb77b3b3c46348e32b08ee4b4776f |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-040fb77b3b3c46348e32b08ee4b4776f; indexed 2025-08-20T03:13:58Z; eng; IEEE; IEEE Access; ISSN 2169-3536; published 2025-01-01; vol. 13, pp. 127805-127814; DOI 10.1109/ACCESS.2025.3589120; document 11080147; End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder; Majid Adibian, Hossein Zeinali (https://orcid.org/0000-0002-3789-8091), Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran; https://ieeexplore.ieee.org/document/11080147/ |
| spellingShingle | Majid Adibian Hossein Zeinali End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder IEEE Access Neural text-to-speech multi-speaker speech synthesis end-to-end deep learning models speaker adaptation in TTS non-autoregressive speech generation |
| title | End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder |
| title_full | End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder |
| title_fullStr | End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder |
| title_full_unstemmed | End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder |
| title_short | End-to-End Multi-Speaker FastSpeech2 With Hierarchical Decoder |
| title_sort | end to end multi speaker fastspeech2 with hierarchical decoder |
| topic | Neural text-to-speech multi-speaker speech synthesis end-to-end deep learning models speaker adaptation in TTS non-autoregressive speech generation |
| url | https://ieeexplore.ieee.org/document/11080147/ |
| work_keys_str_mv | AT majidadibian endtoendmultispeakerfastspeech2withhierarchicaldecoder AT hosseinzeinali endtoendmultispeakerfastspeech2withhierarchicaldecoder |