Catherine Arnett
7 papers
Authored papers
Goldfish: Monolingual Language Models for 350 Languages
arXiv 2024
Evaluating Morphological Alignment of Tokenizers in 70 Languages
arXiv 2025
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
arXiv 2025
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
arXiv 2024
Toxicity of the Commons: Curating Open-Source Pre-Training Data
arXiv 2024
A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
arXiv 2024
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
arXiv 2023
Affiliations
No known affiliations.
Frequent co-authors
10 from 7 papers