MegaWika

Active

MegaWika is a multi- and crosslingual text dataset containing 30 million Wikipedia passages with their scraped and cleaned web citations. The passages span 50 Wikipedias in 50 languages, and the articles in which

License: cc-by-sa-4.0
Published: May 2026

