Shayne Longpre

MIT researcher leading the Data Provenance Initiative; co-author of major dataset audits and the Flan Collection.

Role: researcher
Currently at: MIT CSAIL
Twitter: twitter.com/ShayneRedford
Scholar: scholar.google.com/citations
Papers: 13

Attribution

Affiliations & profile: scholar.google.com/citations

13 papers

Authored papers

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

arXiv 2026

2026

The Leaderboard Illusion

preprint

2025

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

arXiv 2025

2025

FlexOlmo: Open Language Models for Flexible Data Use

arXiv 2025

2025

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model

ACL

2024

A Survey on Data Selection for Language Models

arXiv 2024

2024

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

arXiv 2023

2023

OctoPack: Instruction Tuning Code Large Language Models

arXiv 2023

2023

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

arXiv 2023

2023

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

arXiv 2023

2023

The Foundation Model Transparency Index

arXiv 2023

2023

MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering

arXiv 2020

2020

Open-Domain Question Answering Goes Conversational via Question Rewriting

NAACL 2021

2020

Affiliations

Currently at

MIT CSAIL

researcher · university lab

Frequent co-authors

from 13 papers

Niklas Muennighoff

grad-student

5 shared papers

Sara Hooker

researcher / VP Research

3 shared papers

Ahmet Üstün

researcher

Alon Albalak

Colin Raffel

Enrico Shippole

Luca Soldaini

Sayash Kapoor

researcher

2 shared papers

Shivalika Singh

engineer

2 shared papers

A. Feder Cooper

1 shared paper