What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Self-supervised Wav2Vec2 models encode Dutch phonetic and lexical features more accurately when pre-trained exclusively on Dutch data than when pre-trained on similar amounts of English or larger amounts of multilingual data. Trained clustering and classification probes detect this advantage, and it aligns with better downstream Automatic Speech Recognition performance.

Year
2025
Venue
arXiv 2025
Authors
6

Abstract & full text
arxiv.org/abs/2506.00981

Abstract

How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it's less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
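
To make the idea of a trained classification probe concrete, here is a minimal sketch under strong assumptions. It does not reproduce the paper's probes or data: synthetic Gaussian vectors stand in for frame-level Wav2Vec2 hidden states, the three phone labels are illustrative, and a nearest-centroid classifier is swapped in for whatever probe the authors trained. All function names are hypothetical. The point is only that held-out probe accuracy rises when the representation separates the labelled classes more cleanly.

```python
import random


def make_synthetic_frames(n_per_class, dim, separation, seed):
    """Hypothetical stand-in for frame-level hidden states from one model layer.

    Each class (an illustrative phone label) is a Gaussian cloud around its
    own mean; `separation` controls how far apart the class means lie.
    """
    rng = random.Random(seed)
    labels = ["a", "i", "u"]  # illustrative labels, not taken from the paper
    data = []
    for k, label in enumerate(labels):
        mean = [separation if d % len(labels) == k else 0.0 for d in range(dim)]
        for _ in range(n_per_class):
            data.append(([rng.gauss(m, 1.0) for m in mean], label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Train the probe: compute one mean vector per label."""
    sums, counts = {}, {}
    for vec, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def probe_accuracy(data, train_fraction=0.8):
    """Fit the probe on a training split and score it on held-out frames."""
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)


# Two hypothetical representations: one separates the classes well, one barely.
acc_separable = probe_accuracy(make_synthetic_frames(200, 12, separation=3.0, seed=0))
acc_overlapping = probe_accuracy(make_synthetic_frames(200, 12, separation=0.2, seed=0))
print(f"probe accuracy (well-separated features): {acc_separable:.2f}")
print(f"probe accuracy (overlapping features):    {acc_overlapping:.2f}")
```

In a real comparison, the synthetic vectors would be replaced by hidden states extracted from each pre-trained model on the same labelled Dutch speech, and the probe accuracies compared across models and layers.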
