The rise of Large Language Models (LLMs) is reshaping multimodel models, with
speech synthesis being a prominent application. However, existing approaches
often underutilize the linguistic intelligence of these models, typically
failing to leverage their powerful instruction-following capabilities. This
limitation hinders the model's ability to follow text instructions for
controllable Text-to-Speech~(TTS). To address this, we propose a new paradigm
inspired by operationalism'' that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a conductor'', understanding user instructions and generating a textual
plan'' -- explicit vocal features (e.g., pitch, energy). A separate TTS model, the orchestra'', then generates the speech from these features. To
realize this component, we develop BatonTTS, a TTS model trained specifically
for this task. Our experiments demonstrate that BatonVoice achieves strong
performance in controllable and emotional speech synthesis, outperforming
strong open- and closed-source baselines. Notably, our approach enables
remarkable zero-shot cross-lingual generalization, accurately applying feature
control abilities to languages unseen during post-training. This demonstrates
that objectifying speech into textual vocal features can more effectively
unlock the linguistic intelligence of LLMs.
BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
BatonVoice framework decouples instruction understanding from speech generation, using an LLM to create vocal feature plans and a specialized TTS model to produce speech, achieving strong performance in controllable and emotional speech synthesis with zero-shot cross-lingual generalization.
- Year
- 2025
- Venue
- arXiv 2025
- Authors
- 15
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2509.26514ARXIV-DEFAULT
- TL;DR
- Semantic Scholar