Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
Abstract: Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models rely heavily on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, both of which are crucial for expressive speech synthesis. To address this challenge, we propose a two-stage prompt selection strategy specifically designed for expressive speech synthesis. In the static stage (before synthesis), we first evaluate prompt candidates using pitch-based prosodic features, perceptual audio quality, and text-emotion coherence scores assigned by an LLM. We then assess the candidates under a specific TTS model by measuring character error rate, speaker similarity, and emotional similarity between the synthesized speech and the prompt speech. In the dynamic stage (during synthesis), we use a textual similarity model to select the prompt that is most closely aligned with the current input text. Experimental results demonstrate that our strategy effectively selects prompts that yield speech with both high-intensity emotional expression and robust speaker identity, leading to more expressive and stable zero-shot TTS performance. Audio samples and code will be available at this https URL.

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2409.18512 [cs.SD] (or arXiv:2409.18512v2 [cs.SD] for this version)
DOI: https://doi.org/10.48550/arXiv.2409.18512 (arXiv-issued DOI via DataCite)

Submission history
From: Haoyu Wang
[v1] Fri, 27 Sep 2024 07:46:52 UTC (426 KB)
[v2] Fri, 3 Apr 2026 15:54:23 UTC (445 KB)
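
The two-stage selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `PromptCandidate` fields, the CER threshold, the unweighted sum used for ranking, and the bag-of-words cosine (a toy stand-in for the textual similarity model) are all assumptions made for illustration. The scores are taken as precomputed inputs rather than measured from audio.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class PromptCandidate:
    """One prompt utterance with hypothetical, precomputed scores."""
    text: str                 # transcript of the prompt speech
    prosody: float            # pitch-based prosodic score (assumed scale, higher is better)
    quality: float            # perceptual audio quality (assumed scale, higher is better)
    emotion_coherence: float  # LLM-judged text-emotion coherence
    cer: float                # character error rate of speech synthesized from this prompt
    speaker_sim: float        # speaker similarity, synthesized vs. prompt speech
    emotion_sim: float        # emotional similarity, synthesized vs. prompt speech


def static_stage(candidates, max_cer=0.1, top_k=2):
    """Before synthesis: drop unintelligible candidates, then keep the top_k.

    The threshold and the unweighted sum are assumptions, not values
    taken from the paper.
    """
    kept = [c for c in candidates if c.cer <= max_cer]

    def score(c):
        return (c.prosody + c.quality + c.emotion_coherence
                + c.speaker_sim + c.emotion_sim)

    return sorted(kept, key=score, reverse=True)[:top_k]


def bow_cosine(a, b):
    """Toy textual similarity: cosine over lowercase word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def dynamic_stage(pool, input_text):
    """During synthesis: pick the pooled prompt closest to the input text."""
    return max(pool, key=lambda c: bow_cosine(c.text, input_text))
```

In this sketch the static stage runs once per speaker and emotion to build a small pool, while the dynamic stage runs for every sentence to be synthesized. A real system would replace `bow_cosine` with a learned sentence-similarity model.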