We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
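The core idea of matching output segments verbatim against training documents can be illustrated with a toy sketch. This is not the OLMoTrace implementation: the actual system relies on an extended infini-gram index to search trillions of tokens, whereas the code below does a brute-force scan over a tiny in-memory corpus. The function name `find_verbatim_spans`, the whitespace tokenization, the greedy left-to-right strategy, and the `min_len` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple


def find_verbatim_spans(
    output: str, corpus: List[str], min_len: int = 3
) -> List[Tuple[int, int, int]]:
    """Return (start, end, doc_index) token spans of `output` found verbatim in `corpus`.

    Hypothetical helper: scans greedily from left to right and, at each
    position, keeps the longest run of tokens that also appears
    contiguously in some corpus document. Spans shorter than `min_len`
    tokens are ignored. `end` is exclusive.
    """
    out_toks = output.split()  # assumption: whitespace tokenization
    docs = [doc.split() for doc in corpus]
    spans: List[Tuple[int, int, int]] = []
    start = 0
    while start < len(out_toks):
        best_end, best_doc = start, -1
        for doc_idx, doc in enumerate(docs):
            for pos in range(len(doc)):
                # Extend the match token by token from this document position.
                k = 0
                while (
                    start + k < len(out_toks)
                    and pos + k < len(doc)
                    and out_toks[start + k] == doc[pos + k]
                ):
                    k += 1
                if start + k > best_end:
                    best_end, best_doc = start + k, doc_idx
        if best_end - start >= min_len:
            spans.append((start, best_end, best_doc))
            start = best_end  # continue after the matched span
        else:
            start += 1  # no long-enough match starting here
    return spans


if __name__ == "__main__":
    corpus = ["a dog sat on the mat", "the cat sat quietly"]
    print(find_verbatim_spans("the cat sat on the mat today", corpus))
```

The brute-force scan is far too slow for a real corpus; it only shows what a "verbatim match between an output segment and a training document" means at the token level.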
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
- Year: 2025
- Venue: arXiv 2025
- Authors: 31
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2504.07096
- TL;DR: Semantic Scholar
Authors
Ali Farhadi, Hannaneh Hajishirzi, Yejin Choi, Karen Farley, Luca Soldaini, Dirk Groeneveld, Taira Anderson, Pang Wei Koh, Jiacheng Liu, Michael Schmitz, Noah A. Smith, Jon Borchardt, Yanai Elazar, Jesse Dodge, YenSung Chen, Aaron Sarnat, Byron Bischoff, Sophie Lebrecht, Carissa Schoenick, Sewon Min, Bailey Kuehl, Taylor Blanton, Arnavi Chheda-Kothary, Huy Tran, Eric Marsh, Cassidy Trier, Jenna James, Evie Cheng, Sruthi Sreeram, David Albright, Rock Yuren Pang