Shayne Longpre
MIT researcher leading the Data Provenance Initiative; co-author of major dataset audits and the Flan Collection.
- Role: researcher
- Currently at: MIT CSAIL
- Twitter: twitter.com/ShayneRedford
- Scholar: scholar.google.com/citations
- Papers: 13
Authored papers (13)
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions
arXiv 2026
The Leaderboard Illusion
preprint
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
arXiv 2025
FlexOlmo: Open Language Models for Flexible Data Use
arXiv 2025
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
ACL
A Survey on Data Selection for Language Models
arXiv 2024
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
arXiv 2023
OctoPack: Instruction Tuning Code Large Language Models
arXiv 2023
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
arXiv 2023
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
arXiv 2023
The Foundation Model Transparency Index
arXiv 2023
MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering
arXiv 2020
Open-Domain Question Answering Goes Conversational via Question Rewriting
NAACL 2021
Frequent co-authors
10 from 13 papers
Niklas Muennighoff
grad-student
Sara Hooker
researcher / VP Research
Ahmet Üstün
researcher
Alon Albalak
Colin Raffel
Enrico Shippole
Luca Soldaini
Sayash Kapoor
researcher
Shivalika Singh
engineer
A. Feder Cooper