We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
A large-scale IndicNLP corpus and word embeddings for 10 Indian languages are presented, showing superior performance in news article category classification tasks.
- Year
- 2020
- Venue
- arXiv 2020
- Authors
- 7
Attribution
- Abstract & full text
- arxiv.org/abs/2005.00085
- TL;DR
- Semantic Scholar