In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Self-distillation and online clustering enhance self-supervised speech representation learning by combining masked language modeling, contextualized embeddings, and discrete tokens, achieving superior performance in various downstream tasks.
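The three steps named in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the exponential-moving-average (EMA) updates for the teacher and the codebook, the Euclidean nearest-neighbor assignment, the decay values, and all function names (`ema_update`, `assign_codes`, `update_codebook`, `student_loss`) are hypothetical choices made here for clarity.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Assumed self-distillation step: teacher weights track the student via EMA."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def assign_codes(embeddings, codebook):
    """Map each frame embedding (T, D) to the index of its nearest codeword (V, D)."""
    # Broadcast to (T, V, D), then reduce to squared distances of shape (T, V).
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def update_codebook(sums, counts, embeddings, codes, decay=0.9):
    """Assumed online clustering step: EMA of per-cluster embedding sums and counts."""
    num_codes = sums.shape[0]
    one_hot = np.eye(num_codes)[codes]                      # (T, V)
    new_sums = decay * sums + (1.0 - decay) * (one_hot.T @ embeddings)
    new_counts = decay * counts + (1.0 - decay) * one_hot.sum(axis=0)
    codebook = new_sums / np.maximum(new_counts, 1e-8)[:, None]
    return new_sums, new_counts, codebook

def student_loss(logits, codes):
    """Cross-entropy between student logits (T, V) and teacher-derived code indices (T,)."""
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(codes)), codes].mean()
```

In a training loop built on this sketch, the teacher would embed the unmasked audio, `assign_codes` would discretize those embeddings, `update_codebook` would refresh the codewords, and the student would be trained on masked input with `student_loss` before `ema_update` refreshes the teacher. Initializing `sums` to the starting codebook and `counts` to ones keeps unused codewords unchanged, since both decay at the same rate.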
- Year: 2023
- Venue: dinosr-self-distillation-and-online
- Authors: 5
- Hosting: Abstract only (ARXIV-DEFAULT)
Attribution
- Abstract & full text: arxiv.org/abs/2305.10005v2 (ARXIV-DEFAULT)
- TL;DR: Semantic Scholar