Enhancing Expressiveness in Vocal Conversational Agents through Large Language Model-Generated Speech Synthesis Markup Language
Advancements in speech synthesis have enabled more natural and engaging conversational agents, including neural text-to-speech models that can adjust speech inflections to produce distinct vocal styles. For example, Azure Neural Voices can adjust speech using Speech Synthesis Markup Language (SSML)...
Saved in:
| Main Author: | |
|---|---|
| Format: | Article |
| Language: | English |
| Published: |
LibraryPress@UF
2025-05-01
|
| Series: | Proceedings of the International Florida Artificial Intelligence Research Society Conference |
| Subjects: | |
| Online Access: | https://journals.flvc.org/FLAIRS/article/view/138814 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | Advancements in speech synthesis have enabled more natural and engaging conversational agents, including neural text-to-speech models that can adjust speech inflections to produce distinct vocal styles. For example, Azure Neural Voices can adjust speech using Speech Synthesis Markup Language (SSML) style tags, such as “affectionate,” “cheerful,” and “hopeful.” However, determining when to apply these tags in real-time interactions can be challenging and time-consuming. In this paper, we present a prompt-based approach that enables large language models (LLMs) to dynamically stylize their responses with appropriate SSML tags, enhancing synthesized speech expressiveness across 34 different styles. Using targeted probes designed to elicit specific speech styles, we demonstrate that LLM-generated responses are syntactically well-formed and correctly apply style tags to enhance expressiveness. This simple, customizable approach facilitates the rapid development of expressive vocal conversational agents.
|
|---|---|
| ISSN: | 2334-0754 2334-0762 |