Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach. This paper presents a grammar (eight typed primitives connected by a directed acyclic graph with four hard constraints) that makes the most damaging leakage types structurally unrepresentable within the grammar's scope. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary documented in the peer-reviewed ML methodology literature (to my knowledge, as of May 2026), backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.
A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time
Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach.
- Preview

- Year
- 2026
- Hosting
- Full text hostedCC-BY-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2603.10742CC-BY-4.0
- TL;DR
- Semantic Scholar