0

Structural Enforcement of Statistical Rigor in AI-Driven Discovery: A Functional Architecture

AI-Scientist systems risk manufacturing spurious discoveries through uncontrolled multiple testing. We present a functional architecture that enforces statistical rigor at two levels: a Haskell embedded domain-specific language (the Research monad) that makes it impossible to…

Preview
Year
2025
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2511.06701CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

AI-Scientist systems risk manufacturing spurious discoveries through uncontrolled multiple testing. We present a functional architecture that enforces statistical rigor at two levels: a Haskell embedded domain-specific language (the Research monad) that makes it impossible to test a hypothesis without updating the error budget, and a declarative scaffold, backed by an OS-level sandbox, that makes validation data physically absent from the environment in which LLM-generated code runs. We ground the design in a machine-checked Lean 4 formalization of LORD++ online false-discovery-rate (FDR) control: we derive its error budget and prove both marginal and full FDR control, then close the gap to the implementation by verifying the budget's wealth invariant over IEEE 754 arithmetic in SPARK/Ada. To our knowledge this is the first verified chain from theorem to floating-point implementation for an online FDR procedure. In simulation, the architecture holds the false discovery rate near 1% against a 5% target, where a naive approach reaches 41%. In end-to-end case studies, a valid test avoids the false discoveries a flawed one produces, yet still finds real effects when the data allow. An adversarial evaluation confirms that generated code cannot read the held-out data even when given its exact path.