planning

Category: reasoning
Slug: planning
Evals: 29
Tools: 91
Models: 490
Papers: 24

Evals testing this capability

AIME 2024: Problems from the American Invitational Mathematics Examination

Mathematical Association of America

Official 15-problem high-school math olympiad-track exam used by labs as a fresh, contamination-resistant math reasoning benchmark.

Active · Math · Planning

ALFRED

University of California, Berkeley

3D-simulated household tasks driven by language instructions and egocentric video - the visual sibling of ALFWorld.

Active · Embodied · Image Understanding · Planning · Robotics

ALFWorld

Microsoft

Embodied household-task benchmark that aligns TextWorld text commands with ALFRED 3D scenes, testing whether agents can transfer from abstract text policies to grounded execution.

Active · Embodied · Planning · Tool Calling · Agentic

AssistantBench

Allen Institute for AI (Ai2)

214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.

Active · Browser Use · Planning · Retrieval · Agentic

Atari 57

Google DeepMind

57 Atari 2600 games played from raw pixels - the foundational reinforcement-learning benchmark from DeepMind's DQN era.

Active · Planning · Image Understanding · Agentic

BIG-Bench Hard (BBH)

Google Research

23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.

Saturated · Planning · Scientific Reasoning · Math

BIG-Bench

Google DeepMind

204 diverse tasks contributed by 450 researchers at 132 institutions - the original "test everything" LLM benchmark.

Saturated · Factual Recall · Planning · Math

BrowseComp

OpenAI

1,266 hard fact-finding questions on the open web requiring persistent browsing and reasoning over scattered, obscure sources.

Active · Browser Use · Retrieval · Planning · Agentic

DABstep

Hugging Face

450+ multi-step data-analysis tasks combining structured records with unstructured docs, requiring an agent to plan, query, and reason in stages.

Active · Tool Calling · Planning · Code Generation · Agentic

DeepMind Control Suite

Google DeepMind

A set of MuJoCo-based continuous-control RL tasks (cartpole, cheetah, walker, humanoid) - the standard benchmark for continuous-action policy learning.

Active · Embodied · Planning · Robotics

GAIA (General AI Assistants)

Meta FAIR (Fundamental AI Research)

466 real-world questions requiring tool use, multi-step reasoning, and web browsing - easy for humans (~92%) but hard for AI assistants.

Active · Tool Calling · Browser Use · Planning · Agentic

GDPval

OpenAI

OpenAI's economic-impact eval - 220 expert-curated tasks weighted by US-GDP contribution across 44 occupations, evaluating whether models can do real white-collar work.

Active · Factual Recall · Instruction Following · Planning

GSM8K

OpenAI

8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.

Saturated · Math · Planning

HCAST

METR (Model Evaluation and Threat Research)

METR's Human-Calibrated Autonomy Software Tasks - 189 multi-step software tasks calibrated against human time-to-complete, used to measure how long a task an agent can complete autonomously.

Active · Code Generation · Planning · Tool Calling · Agentic

MATH-500

OpenAI

500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.

Saturated · Math · Planning

MuSR

University of California, Berkeley

756 multi-step soft-reasoning problems - murder mysteries, object placement, team allocation - generated to require chained commonsense inference.

Active · Planning · Factual Recall

OSWorld-Verified

XLANG Lab

Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.

Active · Computer Use · Planning · Tool Calling · Agentic

OSWorld

XLANG Lab

369 computer-use tasks across Ubuntu, Windows, and macOS environments testing whether agents can operate a real desktop via screenshots and mouse/keyboard.

Active · Computer Use · Planning · Tool Calling · Agentic

SWE-bench Verified

OpenAI

500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding benchmark.

Active · Code Editing · Debugging · Tool Calling · Code

SWE-bench

Princeton NLP Group

2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.

Active · Code Editing · Debugging · Tool Calling · Code

SWE-Gym

University of California, Berkeley

First publicly available training environment for software-engineering agents - 2,438 real Python GitHub issues with executable Docker test environments and golden patches.

Active · Code Editing · Debugging · Tool Calling · Code

SWE-Lancer

OpenAI

1,488 real freelance software-engineering tasks from Upwork worth $1M total in payouts, evaluating models on end-to-end paid developer work.

Active · Code Editing · Code Generation · Planning · Code

τ-bench (tau-bench)

Sierra

Multi-turn customer-service simulation testing whether tool-using agents follow domain policies while interacting with an LLM-simulated user.

Active · Tool Calling · Multi Turn Dialog · Instruction Following · Agentic

τ²-bench (Tau²-bench)

Sierra

Sierra's dual-control extension of τ-bench - the simulated user can now also call tools, so agent and user both act on the same shared environment.

Active · Tool Calling · Multi Turn Dialog · Planning · Agentic

Terminal-Bench

Laude Institute

Suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.

Active · Tool Calling · Planning · Code Editing · Agentic

TextArena

Laude Institute

70+ competitive text games (negotiation, social deduction, language puzzles) where models play head-to-head and a TrueSkill rating is fit.

Active · Multi Turn Dialog · Planning · Instruction Following · Agentic

VisualWebArena

Carnegie Mellon University

910 visually grounded web tasks across three self-hosted sites (Classifieds, Shopping, Reddit) requiring image understanding to complete.

Active · Browser Use · Image Understanding · Planning · Agentic

WebArena

Carnegie Mellon University

812 long-horizon web tasks across self-hosted sites, including an e-commerce store, a Reddit-style forum (Postmill), GitLab, and a content-management system.

Active · Browser Use · Planning · Tool Calling · Agentic

WorkArena

ServiceNow Research

Browser-based enterprise web tasks on a live ServiceNow instance - list filtering, form filling, knowledge search - covering daily knowledge-worker workflows.

Active · Browser Use · Tool Calling · Planning · Agentic

Tools lifting evals here

BrowserGym

ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

RL Env · Browser Use · Tool Calling · Planning · Agentic

lifts 4 evals here

NuminaMath

Numina

An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.

SFT Dataset · Math · Scientific Reasoning

lifts 3 evals here

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

RL Env

lifts 3 evals here

Agent Bench RL Env (Prime Community)

Prime Community

Benchmarks model performance on SWE-bench using the mini-swe-agent harness.

RL Env · Tool Use · Agent

lifts 2 evals here

ALFWorld

MIT CSAIL

Aligned text-and-3D embodied environment - agents learn household tasks (pick & place, heat, cool, clean) as both TextWorld games and visually-rendered ALFRED scenes.

RL Env · Embodied · Planning · Instruction Following · Agentic

lifts 2 evals here

Bigbench BBH RL Env (Prime Community)

Prime Community

Implementation of BIG-Bench and BIG-Bench Hard (BBH).

RL Env · Bigbench · Bbh · NLP

lifts 2 evals here

mini-swe-agent-plus

Prime Intellect

Verifiers env that runs the mini-swe-agent harness inside Prime Sandboxes against real GitHub issues; reward is test-suite pass.

RL Env · Code Editing · Debugging · Tool Calling · Code

lifts 2 evals here

Openenv Browsergym RL Env (Hugging Face)

Hugging Face

OpenEnv port of ServiceNow's BrowserGym - a Playwright-backed browser environment exposing WebArena, MiniWoB, WorkArena, etc. through the standard OpenEnv API.

RL Env · Browser Use · Tool Calling · Planning · Agentic

lifts 2 evals here

OpenThoughts

Open Thoughts

A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.

SFT Dataset · Math · Code Generation · Scientific Reasoning

lifts 2 evals here

s1K

Stanford Center for Research on Foundation Models (CRFM)

Stanford's hand-curated 1,000-problem reasoning dataset that, paired with budget forcing at inference, produced o1-competitive results for ~$50 of compute.

SFT Dataset · Math · Scientific Reasoning

lifts 2 evals here

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents - 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

RL Env · Code Editing · Debugging · Tool Calling · Code

lifts 2 evals here

TAU 3 Bench RL Env (Prime Intellect)

Prime Intellect

τ²-bench evaluation environment. Focus on tau-knowledge.

RL Env · Tool Agent User · Tool Use · User Sim

lifts 2 evals here

Verifiers Math (math-python)

Prime Intellect

Multi-turn math problem-solving environment where the model proposes Python code in a sandbox to compute and verify numerical answers.

RL Env · Math · Tool Calling · Code Generation

lifts 2 evals here

Agent PLUS RL Env (Prime Intellect)

Prime Intellect

Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.

Tools lifting evals here

AIME 2024 RL Env (Prime Intellect)

BB Browsecomp RL Env (Prime Intellect)

BBH RL Env (Community)

Bench 2 RL Env (Prime Intellect)

Bench ENV RL Env (Prime Community)

Browsecomp Openai RL Env (Community)

Browsecomp Openai RL Env (Prime Intellect)

Browsecomp RL Env (Prime Intellect)

Certainty Collapse RL Env (Community)

Compositional Hacks RL Env (Community)

COT Theater RL Env (Community)

Dabstep RL Env (Community)

Dabstep RL Env (Prime Community)

Dabstep RL Env (Prime Intellect)

DDBC RL Env (Prime Intellect)

DDBC RLM RL Env (Prime Intellect)

Deepconf RL Env (Community)

DeepDive (Serper-powered web QA)

Deepscaler MATH RL Env (Prime Intellect)

Deepscaler RL Env (Prime Intellect)

Deepswe RL Env (Prime Intellect)

Defend Concede RL Env (Community)

Discover Gsm8k RL Env (Community)

DM Control RL Env (Hugging Face)

Emergence Prediction RL Env (Community)

Emoji HACK RL Env (Community)

FH Aviary RL Env (Prime Community)

FH Aviary RL Env (Prime Intellect)

Formatting Emergence RL Env (Community)

GAIA RL Env (Browserbase)

Gdpval RL Env (Community)

Goodsirmath8k RL Env (Kunumi)

Gsm8k Multireward RL Env (Community)

Gsm8k Olmes RL Env (Community)

Gsm8k RL Env (Community)