MegaWika
Active
MegaWika is a multi- and crosslingual text dataset containing 30 million Wikipedia passages with their scraped and cleaned web citations. The passages span 50 Wikipedias in 50 languages, and the articles in which
- Publisher
- Johns Hopkins University
- License
- cc-by-sa-4.0
- Published
- May 2026
FAQ
- What is MegaWika?
- MegaWika is a multi- and crosslingual text dataset containing 30 million Wikipedia passages with their scraped and cleaned web citations. The passages span 50 Wikipedias in 50 languages, and the articles in which
- What license is MegaWika under?
- MegaWika is available under cc-by-sa-4.0.