We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a wide variety of speech, including noisy environments, accents, and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- Year: 2015
- Venue: arXiv 2015
- Authors: 34
Attribution
- Abstract & full text
- arxiv.org/abs/1512.02595
- TL;DR
- Semantic Scholar
Authors
Bryan Catanzaro, Dario Amodei, Yi Wang, Jared Casper, Sanjeev Satheesh, Sharan Narang, Dani Yogatama, Jun Zhan, Linxi Fan, Jingdong Chen, Rishita Anubhai, Eric Battenberg, Carl Case, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, David Seetapun, Shubho Sengupta, Zhiqian Wang, Chong Wang, Bo Xiao, Zhenyao Zhu