tool calling

Category: agents
Slug: tool-calling
Evals: 18
Tools: 32
Models: 357
Papers: 13

Evals testing this capability

ALFWorld

Microsoft

Embodied household-task benchmark that aligns TextWorld text commands with ALFRED 3D scenes, testing whether agents can transfer from abstract text policies to grounded execution.

Active, Embodied, Planning, Tool Calling, Agentic

AssistantBench

Allen Institute for AI (Ai2)

214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.

Active, Browser Use, Planning, Retrieval, Agentic

DABstep

Hugging Face

450+ multi-step data-analysis tasks combining structured records with unstructured docs, requiring an agent to plan, query, and reason in stages.

Active, Tool Calling, Planning, Code Generation, Agentic

GAIA (General AI Assistants)

Meta FAIR (Fundamental AI Research)

466 real-world questions requiring tool use, multi-step reasoning, and web browsing - easy for humans (~92%) but hard for AI assistants.

Active, Tool Calling, Browser Use, Planning, Agentic

HCAST

METR (Model Evaluation and Threat Research)

METR's Human-Calibrated Autonomy Software Tasks - 189 multi-step software tasks calibrated against human time-to-complete, used to measure how long a task an agent can complete.

Active, Code Generation, Planning, Tool Calling, Agentic

MiniWoB++

University of California, Berkeley

100+ small synthetic web-page tasks (click button, fill form, drag slider) - the original web-agent benchmark, still used as a unit test.

Saturated, Browser Use, Tool Calling, Agentic

OSWorld-Verified

XLANG Lab

Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.

Active, Computer Use, Planning, Tool Calling, Agentic

OSWorld

XLANG Lab

369 computer-use tasks across Ubuntu, Windows, and macOS environments testing whether agents can operate a real desktop via screenshots and mouse/keyboard.

Active, Computer Use, Planning, Tool Calling, Agentic

SWE-bench Lite

Princeton University

300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate - used for fast iteration before full SWE-bench runs.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-bench Verified

OpenAI

500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding benchmark.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-bench

Princeton NLP Group

2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-Gym

University of California, Berkeley

First publicly available training environment for software-engineering agents - 2,438 real Python GitHub issues with executable Docker test environments and golden patches.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-Lancer

OpenAI

1,488 real freelance software-engineering tasks from Upwork worth $1M total in payouts, evaluating models on end-to-end paid developer work.

Active, Code Editing, Code Generation, Planning, Code

τ-bench (tau-bench)

Sierra

Multi-turn customer-service simulation testing whether agents follow domain policies while calling domain API tools and conversing with an LLM-simulated user.

Active, Tool Calling, Multi Turn Dialog, Instruction Following, Agentic

τ²-bench (Tau²-bench)

Sierra

Sierra's dual-control extension of τ-bench - the simulated user can now also call tools, so agent and user both act on the same shared environment.

Active, Tool Calling, Multi Turn Dialog, Planning, Agentic

Terminal-Bench

Laude Institute

Suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.

Active, Tool Calling, Planning, Code Editing, Agentic

WebArena

Carnegie Mellon University

812 long-horizon web tasks across self-hosted sites: an e-commerce store, a Reddit-style forum (Postmill), GitLab, a content-management system, and a map.

Active, Browser Use, Planning, Tool Calling, Agentic

WorkArena

ServiceNow Research

Browser-based enterprise web tasks on a live ServiceNow instance - list filtering, form filling, knowledge search - covering daily knowledge-worker workflows.

Active, Browser Use, Tool Calling, Planning, Agentic

Tools lifting evals here

BrowserGym

ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

RL Env, Browser Use, Tool Calling, Planning, Agentic

lifts 4 evals here
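
As a rough illustration of what a "Gym-style" interface means for a web agent, the sketch below runs a reset/step episode loop against a toy environment. `ToyClickEnv`, its action strings, and `run_episode` are hypothetical stand-ins written for this example; they are not BrowserGym's actual classes or action format, and only the reset/step loop shape is the point.

```python
from dataclasses import dataclass


@dataclass
class ToyClickEnv:
    """Hypothetical stand-in for a tiny web task: click the right button."""
    target: str = "submit"
    max_steps: int = 5
    steps: int = 0

    def _obs(self):
        # The observation carries the goal text and the visible buttons.
        return {"goal": f"click {self.target}", "buttons": ["cancel", "submit"]}

    def reset(self):
        self.steps = 0
        return self._obs(), {}

    def step(self, action: str):
        self.steps += 1
        success = action == f"click({self.target})"
        reward = 1.0 if success else 0.0
        terminated = success  # task solved
        truncated = (not success) and self.steps >= self.max_steps  # step budget spent
        return self._obs(), reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return the summed reward."""
    obs, _info = env.reset()
    total = 0.0
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total += reward
        if terminated or truncated:
            return total


if __name__ == "__main__":
    # A trivial policy that reads the target out of the goal string.
    print(run_episode(ToyClickEnv(), lambda obs: f"click({obs['goal'].split()[-1]})"))
```

Because every benchmark sits behind the same loop, the same agent policy can in principle be pointed at a different task set without changing the episode code.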

Agent Bench RL Env (Prime Community)

Prime Community

Benchmarking model performance on SWE Bench in the Mini SWE Agent harness.

RL Env, Tool Use, Agent

lifts 3 evals here

Openenv Browsergym RL Env (Hugging Face)

Hugging Face

OpenEnv port of ServiceNow's BrowserGym - a Playwright-backed browser environment exposing WebArena, MiniWoB, WorkArena, etc. through the standard OpenEnv API.

RL Env, Browser Use, Tool Calling, Planning, Agentic

lifts 3 evals here

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents - 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

RL Env, Code Editing, Debugging, Tool Calling, Code

lifts 3 evals here

Agent PLUS RL Env (Prime Intellect)

Prime Intellect

Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.

RL Env, SWE, Code

lifts 2 evals here

Deepswe RL Env (Prime Intellect)

Prime Intellect

DeepSWE environment for solving SWE issues inside Prime Sandboxes.

RL Env, SWE, Code

lifts 2 evals here

mini-swe-agent-plus

Prime Intellect

Verifiers env that runs the mini-swe-agent harness inside Prime Sandboxes against real GitHub issues; reward is test-suite pass.

RL Env, Code Editing, Debugging, Tool Calling, Code

lifts 2 evals here

TAU 3 Bench RL Env (Prime Intellect)

Prime Intellect

τ²-bench evaluation environment. Focus on tau-knowledge.

RL Env, Tool Agent User, Tool Use, User Sim

lifts 2 evals here

ALFWorld

MIT CSAIL

Aligned text-and-3D embodied environment - agents learn household tasks (pick & place, heat, cool, clean) as both TextWorld games and visually-rendered ALFRED scenes.

Tools lifting evals here

BrowserGym

Agent Bench RL Env (Prime Community)

Openenv Browsergym RL Env (Hugging Face)

SWE-Gym

Agent PLUS RL Env (Prime Intellect)

Deepswe RL Env (Prime Intellect)

mini-swe-agent-plus

TAU 3 Bench RL Env (Prime Intellect)

ALFWorld

Androidworld RL Env (Prime Community)

Androidworld RL Env (Prime Intellect)

Bench 2 RL Env (Prime Intellect)

Bench ENV RL Env (Prime Community)

Browser Miniwob RL Env (Community)

Browser Miniwob RL Env (Community)

Dabstep RL Env (Community)

Dabstep RL Env (Prime Community)

Dabstep RL Env (Prime Intellect)

GAIA RL Env (Browserbase)

Harbor RL Env (Prime Intellect)

Opencode SWE RL Env (Prime Intellect)

OpenEnv Jupyter Agent (E2B-backed)

OpenEnv Terminus (HF Sandbox terminal)

SWE RL Env (Prime Intellect)

Swebench PRO RL Env (Prime Intellect)

TAU 2 Bench RL Env (Community)

TAU 2 Bench RL Env (Community)

TAU 2 Bench RL Env (Prime Intellect)

TAU 2 Synth RL Env (Prime)

Terminal Bench RL Env (Community)

Terminal-Bench (Verifiers wrapper)

Terminalbench RL Env (Community)
