Guilherme Penedo
3 papers
Authored papers
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
arXiv 2025
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
arXiv 2025
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
arXiv 2023
Affiliations
No known affiliations.
Frequent co-authors
10 from 3 papers