Autonomous LLM agents now produce complete research artifacts in machine-learning sandboxes, but real computational physics is harder: experiments are first-principles calculations against re-runnable physical ground truth, and meaningful new work almost always builds on a key existing paper. We ask whether such an agent can perform grounded scrutiny of published computational physics - reading a paper, reproducing it from scratch, and surfacing methodological concerns from execution. We deploy a single Claude Opus 4.6 configuration at two complementary scopes. At scale, across 111 open-access Quantum ESPRESSO papers, an autonomous agent runs the read-plan-compute-compare loop and, although never asked to critique, raises substantive methodological concerns on 42% of papers; 85 of 88 of these critiques (96.6%) surface only after the agent has actually run a calculation, with a reading-only ceiling of 1.8%. Critique emerges from reproduction, not from reading. In depth, on one Nature Communications paper on multiscale device simulation of a 2D-material MOSFET, a fresh agent inheriting a verified reproduction pipeline autonomously produces a 14-concern physics inventory and a complete, submission-form six-page Comment that revises the paper's L_G = 5 nm headline. Two of its L_G = 5 nm headline-challenging attacks - a source-degeneration contact-resistance bound and a Sb-doping degradation ratio - are absent from the published 21-reviewer peer review.
Grounded autonomous scrutiny at scale: emergent critique from reproduction of published computational physics papers
Autonomous LLM agents now produce complete research artifacts in machine-learning sandboxes, but real computational physics is harder: experiments are first-principles calculations against re-runnable physical ground truth, and meaningful new work almost always builds on a key…
- Preview

- Year
- 2026
- Hosting
- Full text hostedCC-BY-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2604.12198CC-BY-4.0
- TL;DR
- Semantic Scholar