ECE-TTS: A Zero-Shot Emotion Text-to-Speech Model with Simplified and Precise Control
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/9/5108 |
| Summary: | Significant advances have been made in emotional speech synthesis technology; however, existing models still face challenges in achieving fine-grained emotion style control and simple yet precise emotion intensity regulation. To address these issues, we propose Easy-Control Emotion Text-to-Speech (ECE-TTS), a zero-shot TTS model built upon the F5-TTS architecture, simplifying emotion modeling while maintaining accurate control. ECE-TTS leverages pretrained emotion recognizers to extract Valence, Arousal, and Dominance (VAD) values, transforming them into Emotion-Adaptive Spherical Vectors (EASV) for precise emotion style representation. Emotion intensity modulation is efficiently realized via simple arithmetic operations on emotion vectors without introducing additional complex modules or training extra regression networks. Emotion style control experiments demonstrate that ECE-TTS achieves a Word Error Rate (WER) of 13.91%, an Aro-Val-Domin SIM of 0.679, and an Emo SIM of 0.594, surpassing GenerSpeech (WER = 16.34%, Aro-Val-Domin SIM = 0.627, Emo SIM = 0.563) and EmoSphere++ (WER = 15.08%, Aro-Val-Domin SIM = 0.656, Emo SIM = 0.578). Subjective Mean Opinion Score (MOS) evaluations (1–5 scale) further confirm improvements in speaker similarity (3.93), naturalness (3.98), and emotional expressiveness (3.94). Additionally, emotion intensity control experiments demonstrate smooth and precise modulation across varying emotional strengths. These results validate ECE-TTS as a highly effective and practical solution for high-quality, emotion-controllable speech synthesis. |
| ISSN: | 2076-3417 |
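
The summary describes mapping VAD values to spherical emotion vectors and adjusting intensity with simple arithmetic, but this record does not give the EASV formula. The sketch below is only an illustration of that general idea under stated assumptions: it applies a plain Cartesian-to-spherical transform around a hypothetical neutral center of (0.5, 0.5, 0.5) and treats intensity as a scaling of the radius. The function names, the center value, and the radius-scaling rule are all assumptions made for this example, not the authors' implementation.

```python
import math

# Hypothetical neutral point in VAD space; the record does not specify one.
NEUTRAL = (0.5, 0.5, 0.5)

def vad_to_spherical(vad, center=NEUTRAL):
    """Map (valence, arousal, dominance) to (r, theta, phi) around `center`."""
    x, y, z = (v - c for v, c in zip(vad, center))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # the neutral point has no direction
    # Clamp to guard against rounding pushing the ratio outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle
    phi = math.atan2(y, x)                         # azimuth
    return r, theta, phi

def spherical_to_vad(sph, center=NEUTRAL):
    """Inverse of vad_to_spherical."""
    r, theta, phi = sph
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x + center[0], y + center[1], z + center[2])

def scale_intensity(sph, factor):
    """Assumed intensity control: scale the radius, keep the direction."""
    r, theta, phi = sph
    return (r * factor, theta, phi)

# Usage: halve the intensity of an example VAD reading.
strong = vad_to_spherical((0.9, 0.8, 0.6))
weak = scale_intensity(strong, 0.5)
weak_vad = spherical_to_vad(weak)
```

In this sketch the angles carry the emotion style and the radius carries its strength, so scaling the radius moves the point along a straight line toward or away from the neutral center without changing direction.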