math
- Slug: math
- Evals: 12
- Tools: 58
- Models: 495
- Papers: 8
Evals testing this capability: 12
Tools lifting evals here: 58
Top models on this capability: 495 (by average parsed score across evals here)
Papers in this area: 8
- AIME as an LLM Evaluation Benchmark (introduces)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (introduces)
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI (introduces)
- Training Verifiers to Solve Math Word Problems (introduces)
- Humanity's Last Exam (introduces)
- Measuring Mathematical Problem Solving With the MATH Dataset (introduces)
- Let's Verify Step by Step (introduces)


