Room embeddings derived from reverberant speech are often unreliable: speech content and recording degradation can alter the representation even when speaker, room, and source-receiver geometry remain unchanged, degrading downstream task performance. We propose a framework that learns room embeddings robust to speech-content variation and a representation-level uncertainty score from reverberant speech without downstream-task supervision. The embedding is anchored to a structured room impulse response (RIR) latent space and trained using a multi-view data structure with Kullback-Leibler (KL)-based alignment; a multi-positive contrastive term further refines robustness. A lightweight uncertainty head is calibrated using the dispersion of corruption-induced embeddings and optimized with a rank-based objective. Across waveform- and spectrogram-level corruptions, the score is consistent with representation dispersion and enables effective selective prediction while requiring only a single utterance at inference.
Quantifying the Uncertainty of Blindly Estimated Room Embeddings Using a Dispersion-Calibrated Score
Room embeddings derived from reverberant speech are often unreliable: speech content and recording degradation can alter the representation even when speaker, room, and source-receiver geometry remain unchanged, degrading downstream task performance.
- Preview

- Year
- 2026
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2607.01527ARXIV-DEFAULT
- TL;DR
- Semantic Scholar