This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
An overview of a program that develops freely available speech resources for under-represented languages and has so far released 38 datasets for text-to-speech and automatic speech recognition applications.
- Year: 2020
- Venue: arXiv 2020
- Authors: 21
Attribution
- Abstract & full text: arxiv.org/abs/2010.06778
- TL;DR: Semantic Scholar
Authors
Richard Sproat, Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chenfang Li, Tatiana Merkulova, Yin May Oo, Knot Pipatsrisawat, Clara Rivera, Supheakmungkol Sarin, Pasindu De Silva, Keshan Sodimana, Theeraphol Wattanavekin, Jaka Aris Eko Wibawa