0

Bridging What the Model Thinks and How It Speaks: Expressive Speech Generation via Self-Aware Intent-Realization Alignment

Speech Language Models (SLMs) exhibit strong semantic understanding, yet often fail to translate this capacity into expressive acoustic realization, producing speech with flattened prosody and misaligned emotion.

Preview
Year
2026
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2604.11424ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Speech Language Models (SLMs) exhibit strong semantic understanding, yet often fail to translate this capacity into expressive acoustic realization, producing speech with flattened prosody and misaligned emotion. We identify this mismatch as the semantic understanding-acoustic realization gap. Existing approaches typically rely on externally specified proxies, such as emotion labels or style prompts, which require annotations and struggle to capture dynamically evolving expressive intent throughout dialogue. To overcome these limitations, we propose SASLM (Self-Aware Speech Language Model), a proxy-free framework that bridges what the model thinks and how it speaks through self-aware intent-realization alignment: (1) Intent-Aware Bridging self-distills expressive intent from the model's own evolving semantic generation states via a Variational Information Bottleneck (VIB), thereby guiding expressive speech realization without external expressive supervision; while (2) Realization-Aware Alignment reflectively aligns generated acoustics with intended expression through self-reward optimization, progressively improving intent-realization consistency during speech generation. Despite using only 3B parameters and 800 hours of expressive speech data, SASLM achieves state-of-the-art performance on EchoMind among open-source systems, surpassing models over 10 times larger and approaching commercial systems.