planning

Slug: planning
Evals: 29
Tools: 91
Models: 466
Papers: 24

Evals testing this capability

29

Tools lifting evals here

91

Top models on this capability

466

Ranked by average parsed score across the evals for this capability.

[Bar chart: planning, 21 models shown. Highest value: JT-35B-Flash at 99.1.]

Papers in this area

24
introduces: AIME as an LLM Evaluation Benchmark
introduces: ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
introduces: The Arcade Learning Environment: An Evaluation Platform for General Agents
introduces: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
introduces: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
introduces: DeepMind Control Suite
introduces: GAIA: A Benchmark for General AI Assistants
introduces: GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
introduces: Training Verifiers to Solve Math Word Problems
introduces: Measuring AI Ability to Complete Long Tasks
introduces: Measuring Mathematical Problem Solving With the MATH Dataset
introduces: OSWorld-Verified: A Cleaner, More Reliable Computer-Use Benchmark
introduces: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
introduces: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
introduces: SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers
introduces: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
introduces: τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
introduces: Terminal-Bench: A Benchmark for Real-World Terminal-Based Agents
Introducing Terminal-Bench
introduces: TextArena: Multi-Agent Text-Based Games for LLM Evaluation
introduces: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
introduces: WebArena: A Realistic Web Environment for Building Autonomous Agents
introduces: WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?