Grounded autonomous scrutiny at scale: emergent critique from reproduction of published computational physics papers

Year: 2026
arxiv.org/abs/2604.12198 (CC-BY-4.0)

Abstract

Autonomous LLM agents now produce complete research artifacts in machine-learning sandboxes, but real computational physics is harder: experiments are first-principles calculations against re-runnable physical ground truth, and meaningful new work almost always builds on a key existing paper. We ask whether such an agent can perform grounded scrutiny of published computational physics: reading a paper, reproducing it from scratch, and surfacing methodological concerns from execution. We deploy a single Claude Opus 4.6 configuration at two complementary scopes. At scale, across 111 open-access Quantum ESPRESSO papers, an autonomous agent runs the read-plan-compute-compare loop and, although never asked to critique, raises substantive methodological concerns on 42% of papers; 85 of these 88 critiques (96.6%) surface only after the agent has actually run a calculation, with a reading-only ceiling of 1.8%. Critique emerges from reproduction, not from reading. In depth, on one Nature Communications paper on multiscale device simulation of a 2D-material MOSFET, a fresh agent inheriting a verified reproduction pipeline autonomously produces a 14-concern physics inventory and a complete, submission-form six-page Comment that revises the paper's L_G = 5 nm headline. Two of its attacks on that headline, a source-degeneration contact-resistance bound and an Sb-doping degradation ratio, are absent from the published 21-reviewer peer review.
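
The headline percentages in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is not from the paper. It recomputes the 85-of-88 share and, assuming the 42% figure is a whole-number rounding over the 111 papers, finds which paper counts are consistent with it. The helper name `pct` and the variable names are illustrative only.

```python
def pct(part, whole, digits=1):
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)

# Share of critiques that surfaced only after a calculation was run.
post_execution_share = pct(85, 88)

# Assumption: "42% of papers" is rounded to the nearest whole percent
# over the 111-paper corpus. List every integer count that fits.
papers_total = 111
candidates = [
    n for n in range(papers_total + 1)
    if round(100 * n / papers_total) == 42
]

print(post_execution_share)  # 96.6
print(candidates)            # [47]
```

Under that rounding assumption, only 47 papers out of 111 are consistent with the reported 42%. The abstract itself gives only the percentage, not the count.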