ECE-TTS: A Zero-Shot Emotion Text-to-Speech Model with Simplified and Precise Control

Bibliographic Details
Main Authors:	Shixiong Liang, Ruohua Zhou, Qingsheng Yuan
Format:	Article
Language:	English
Published:	MDPI AG 2025-05-01
Series:	Applied Sciences
Subjects:	ECE-TTS; emotional speech synthesis; zero-shot text-to-speech
Online Access:	https://www.mdpi.com/2076-3417/15/9/5108

Description
Summary:	Significant advances have been made in emotional speech synthesis technology; however, existing models still face challenges in achieving fine-grained emotion style control and simple yet precise emotion intensity regulation. To address these issues, we propose Easy-Control Emotion Text-to-Speech (ECE-TTS), a zero-shot TTS model built upon the F5-TTS architecture, simplifying emotion modeling while maintaining accurate control. ECE-TTS leverages pretrained emotion recognizers to extract Valence, Arousal, and Dominance (VAD) values, transforming them into Emotion-Adaptive Spherical Vectors (EASV) for precise emotion style representation. Emotion intensity modulation is efficiently realized via simple arithmetic operations on emotion vectors without introducing additional complex modules or training extra regression networks. Emotion style control experiments demonstrate that ECE-TTS achieves a Word Error Rate (WER) of 13.91%, an Aro-Val-Domin SIM of 0.679, and an Emo SIM of 0.594, surpassing GenerSpeech (WER = 16.34%, Aro-Val-Domin SIM = 0.627, Emo SIM = 0.563) and EmoSphere++ (WER = 15.08%, Aro-Val-Domin SIM = 0.656, Emo SIM = 0.578). Subjective Mean Opinion Score (MOS) evaluations (1–5 scale) further confirm improvements in speaker similarity (3.93), naturalness (3.98), and emotional expressiveness (3.94). Additionally, emotion intensity control experiments demonstrate smooth and precise modulation across varying emotional strengths. These results validate ECE-TTS as a highly effective and practical solution for high-quality, emotion-controllable speech synthesis.
ISSN:	2076-3417
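
Illustrative sketch (not from the article): the summary states that VAD values are transformed into spherical emotion vectors and that intensity is adjusted by simple arithmetic on those vectors, but this record does not give the formulas. The Python below shows one generic way such a scheme could look, assuming VAD values in [0, 1], a hypothetical neutral center at (0.5, 0.5, 0.5), and intensity control by scaling the radius while keeping the direction fixed. All function names and the neutral center are assumptions made for illustration and may differ from the actual EASV definition.

```python
import math

# Hypothetical neutral point in VAD space; the record does not specify one.
NEUTRAL = (0.5, 0.5, 0.5)


def vad_to_spherical(vad, neutral=NEUTRAL):
    """Convert a (valence, arousal, dominance) point to (r, theta, phi)
    measured from the assumed neutral center."""
    x, y, z = (v - n for v, n in zip(vad, neutral))
    r = math.sqrt(x * x + y * y + z * z)
    # Polar angle; clamp the ratio to guard against rounding error.
    theta = math.acos(max(-1.0, min(1.0, z / r))) if r > 0 else 0.0
    phi = math.atan2(y, x)  # azimuth in the valence-arousal plane
    return r, theta, phi


def spherical_to_vad(r, theta, phi, neutral=NEUTRAL):
    """Inverse of vad_to_spherical."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x + neutral[0], y + neutral[1], z + neutral[2])


def scale_intensity(vad, factor, neutral=NEUTRAL):
    """Scale emotion intensity by multiplying the radius; the angles,
    which stand in for emotion style here, are left unchanged."""
    r, theta, phi = vad_to_spherical(vad, neutral)
    return spherical_to_vad(r * factor, theta, phi, neutral)


if __name__ == "__main__":
    vad = (0.9, 0.8, 0.6)  # made-up example values
    print(vad_to_spherical(vad))
    print(scale_intensity(vad, 0.5))
```

In this sketch, a factor below 1 moves the point toward the neutral center (weaker emotion), and a factor above 1 moves it away along the same direction (stronger emotion).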
