FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

Year: 2026
Hosting: full text hosted, CC-BY-4.0

Abstract and full text: arxiv.org/abs/2604.03893 (CC-BY-4.0)
Abstract

Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction: models recognize symbols and values and then perform textual inference. They do not assess whether models can reason over the global structural properties of formal diagrams, such as topology, conservation constraints, and the consistent mapping between visual patterns and algebraic expressions. We introduce FeynmanBench, a benchmark of over 2,000 tasks centered on Feynman diagrams spanning the electromagnetic, weak, and strong interactions of the Standard Model. Each instance couples a diagram image with minimal textual conventions and requires models to recover the full physical content: vertex inventory, propagator types, topological connectivity, momentum routing, and the complete scattering amplitude. An automated generation and verification pipeline produces the diagrams, annotations, and reference answers under standardized rules. Evaluating 19 state-of-the-art multimodal LLMs, we find a consistent failure pattern: models achieve 70-95% on local recognition (vertex and propagator identification) but collapse to 13-17% on topological reconstruction (CP3) and to near zero on full algebraic derivation (CP5). FeynmanBench offers a controlled testbed for multimodal reasoning over formal scientific diagrams and highlights fundamental limitations of current architectures in topology-sensitive scientific reasoning.
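
To make the distinction between local recognition and structural constraints concrete, the following is a minimal sketch of how a diagram's vertex inventory might be encoded and checked against one conservation constraint (electric charge at each vertex). It is not the benchmark's pipeline or data format; the `Leg` class, the `CHARGE` table, and the `diagram` layout are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction

# Electric charges in units of e (standard values); the table itself is
# an illustrative assumption, not part of FeynmanBench.
CHARGE = {
    "e-": Fraction(-1), "e+": Fraction(1),
    "mu-": Fraction(-1), "mu+": Fraction(1),
    "photon": Fraction(0),
}


@dataclass(frozen=True)
class Leg:
    """One particle line attached to a vertex (hypothetical structure)."""
    particle: str
    incoming: bool  # True if the particle flows into the vertex


def vertex_conserves_charge(legs):
    """Return True if total incoming charge equals total outgoing charge."""
    q_in = sum((CHARGE[leg.particle] for leg in legs if leg.incoming), Fraction(0))
    q_out = sum((CHARGE[leg.particle] for leg in legs if not leg.incoming), Fraction(0))
    return q_in == q_out


# Tree-level s-channel e- e+ -> mu- mu+ via a single photon propagator.
diagram = {
    "v1": [Leg("e-", True), Leg("e+", True), Leg("photon", False)],
    "v2": [Leg("photon", True), Leg("mu-", False), Leg("mu+", False)],
}

all_ok = all(vertex_conserves_charge(legs) for legs in diagram.values())

# A deliberately invalid vertex: two electrons annihilating into a photon.
bad_vertex = [Leg("e-", True), Leg("e-", True), Leg("photon", False)]
bad_ok = vertex_conserves_charge(bad_vertex)
```

In this sketch, identifying each leg corresponds to local recognition, while the per-vertex check depends on how the legs are connected, which is the kind of global consistency the abstract says models struggle with.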