We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate a unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder, then fine-tune the decoder to adapt to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves results comparable or superior to previous baselines on personalized TTS and any-to-any VC tasks. Our model also shows broad adaptive performance on real-world data and on other tasks that use a unit sequence as input.
UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
UnitSpeech is a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based TTS model using minimal untranscribed data and self-supervised unit representation for personalized tasks like TTS and voice conversion.
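To make the idea of a unit sequence as a pseudo transcript concrete, here is a minimal sketch, not the authors' implementation. It assumes frame-level features come from some self-supervised speech model and that centroids were learned beforehand (for example with k-means). The function names, the toy 2-D features, and the run-length collapsing step are all illustrative assumptions that the abstract does not confirm.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(frame: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_index = 0
    best_dist = float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def frames_to_units(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Quantize each feature frame to a discrete unit ID (one ID per frame)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[Tuple[int, int]]:
    """Run-length encode a unit sequence as (unit_id, duration_in_frames) pairs."""
    collapsed: List[Tuple[int, int]] = []
    for unit in units:
        if collapsed and collapsed[-1][0] == unit:
            collapsed[-1] = (unit, collapsed[-1][1] + 1)
        else:
            collapsed.append((unit, 1))
    return collapsed


# Toy stand-ins (assumed): real features are high-dimensional model outputs.
CENTROIDS = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
FRAMES = [(0.1, 0.0), (0.2, -0.1), (0.9, 1.2), (3.8, 4.1), (4.2, 3.9)]

if __name__ == "__main__":
    units = frames_to_units(FRAMES, CENTROIDS)
    print(units)                    # [0, 0, 1, 2, 2]
    print(collapse_repeats(units))  # [(0, 2), (1, 1), (2, 2)]
```

In this sketch the unit IDs stand in for a transcript: they are derived from the audio alone, so a <unit, speech> pair can be built for a reference speaker without any text.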
- Year
- 2023
- Venue
- arXiv 2023
- Authors
- 4
- Hosting
- Abstract only (ARXIV-DEFAULT)
Attribution
- Abstract & full text
- arxiv.org/abs/2306.16083 (ARXIV-DEFAULT)
- TL;DR
- Semantic Scholar