scientific reasoning
- Slug: scientific-reasoning
- Evals: 9
- Tools: 32
- Models: 463
- Papers: 7
Evals testing this capability: 9
Tools lifting evals here: 32
Top models on this capability: 463 (by average parsed score across evals here)
Papers in this area: 7
- introduces: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
- introduces: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
- introduces: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- introduces: Humanity's Last Exam
- introduces: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
- introduces: Measuring Massive Multitask Language Understanding
