Southern Bantu languages are spoken by over 80 million people, yet current foundation ASR models still produce zero-shot WER above 100%, which limits practical use in education and public services. We addressed this gap with a tone-conditioned curriculum framework for six Southern Bantu languages that combined hybrid difficulty scoring, gated adapters driven by tonal statistics, and staged curriculum training. We trained on a community corpus and tested transfer to NCHLT to measure robustness beyond matched evaluation. Results revealed clear interactions between architecture and language: W2V-BERT outperformed Whisper on Nguni languages by 3 to 4 WER points, whilst Whisper performed better on Sotho-Tswana languages. W2V-BERT with tone conditioning reached 28.41% average WER across datasets and 23.79% on Xitsonga transfer. No single model suited all six languages, so deployment should pair per-language model selection with validation across corpora.
Tone-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition
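A WER above 100% can look like a mistake, but it follows from the conventional definition: substitutions, deletions, and insertions are summed and divided by the number of reference words, so a hypothesis with many inserted words can score above 1.0. The sketch below is a minimal illustration of that arithmetic only. The function name `word_error_rate`, the whitespace tokenization, and the placeholder sentences are assumptions for illustration, not the scoring pipeline used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes plain whitespace tokenization with no text normalization,
    which a real evaluation setup may handle differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real transcripts: a two-word reference against a
# four-word hypothesis with no words in common needs 2 substitutions and
# 2 insertions, so the rate is 4 / 2.
perfect = word_error_rate("alpha beta gamma", "alpha beta gamma")
overlong = word_error_rate("alpha beta", "one two three four")
print(f"perfect match: {perfect:.0%}, over-long hypothesis: {overlong:.0%}")
```

Because the denominator is the reference length rather than the hypothesis length, a model that emits long, unrelated output in zero-shot mode can exceed 100% in this way.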
- Year
- 2026
- Hosting
- Full text hosted, CC-BY-4.0
Attribution
- Abstract & full text
- arxiv.org/abs/2606.31642, CC-BY-4.0
- TL;DR
- Semantic Scholar