dojo

Overview

Environment ID: chakra-labs/dojo
Short description: Multi-turn agent evaluation using Dojo infrastructure for task execution
Tags: multi-turn, tool-use, benchmark, agent-evaluation, multimodal, dojo, web

Datasets

Primary dataset(s): dojo-mini-bench - a collection of multi-turn tasks including LinkedIn, Linear, and Gmail
Source links: Dojo Documentation

Task

Type: multi-turn, tool use
Parser: OpenAI-compatible tool calling format
Rubric overview: Task-specific verification logic

Quickstart

Get your Dojo API key and supply it as DOJO_API_KEY in the commands below.

Run an evaluation with default settings:

DOJO_API_KEY="your_key" uv run vf-eval dojo

Configure model and sampling:

DOJO_API_KEY="your_key" uv run vf-eval dojo -m gpt-4.1-mini -n 10 -r 1

To run with Browserbase as the engine:

DOJO_API_KEY="your_key" BROWSERBASE_PROJECT_ID="project_id" BROWSERBASE_API_KEY="your_browserbase_key" DOJO_ENGINE=browserbase BROWSERBASE_CONCURRENT_LIMIT=1 uv run vf-eval dojo -m gpt-4.1-mini -n 10 -r 1
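
Because these commands depend on several environment variables, a small preflight check can catch a missing one before an evaluation starts. The sketch below is illustrative only: the function name `check_dojo_env` is hypothetical, and it assumes that DOJO_API_KEY is needed for every run, that the two Browserbase credentials are needed only when DOJO_ENGINE is set to browserbase, and that BROWSERBASE_CONCURRENT_LIMIT is optional. None of these requirements is confirmed by the Dojo documentation; they are inferred from the commands above.

```shell
# Preflight check (hypothetical helper, not part of Dojo or vf-eval).
# Prints the names of missing variables to stderr and returns non-zero
# if any variable assumed to be required is unset or empty.
check_dojo_env() {
  missing=""
  # Assumption: every run needs the Dojo API key.
  [ -n "${DOJO_API_KEY:-}" ] || missing="$missing DOJO_API_KEY"
  # Assumption: Browserbase credentials matter only for that engine.
  if [ "${DOJO_ENGINE:-}" = "browserbase" ]; then
    [ -n "${BROWSERBASE_PROJECT_ID:-}" ] || missing="$missing BROWSERBASE_PROJECT_ID"
    [ -n "${BROWSERBASE_API_KEY:-}" ] || missing="$missing BROWSERBASE_API_KEY"
  fi
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  return 0
}
```

With the variables exported in the current shell, the check can gate a run: `check_dojo_env && uv run vf-eval dojo -m gpt-4.1-mini -n 10 -r 1`.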

Notes:

Use -a / --env-args to pass environment-specific configuration as a JSON object.
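
Quoting JSON on a command line is easy to get wrong, so it can help to confirm that the string parses as a JSON object before handing it to -a. The following is a minimal sketch under stated assumptions: `is_json_object` is a hypothetical helper that relies on python3 being available, and `example_option` is a placeholder key used purely for illustration, not a documented Dojo setting. The actual vf-eval call is left commented out for that reason.

```shell
# Returns 0 only if the argument parses as a JSON object
# (arrays, scalars, and malformed input are rejected).
is_json_object() {
  python3 -c 'import json, sys
try:
    value = json.loads(sys.argv[1])
except ValueError:
    sys.exit(1)
sys.exit(0 if isinstance(value, dict) else 1)' "$1"
}

# Placeholder key for illustration; replace with real environment arguments.
ENV_ARGS='{"example_option": "value"}'

if is_json_object "$ENV_ARGS"; then
  echo "env args OK: $ENV_ARGS"
  # DOJO_API_KEY="your_key" uv run vf-eval dojo -a "$ENV_ARGS"
else
  echo "env args must be a JSON object" >&2
fi
```

Single quotes around the JSON keep the shell from interpreting the double quotes inside it.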

Metrics

Each task runs its own verification logic, which returns a fraction between 0.0 and 1.0: 0.0 means failure, a value strictly between 0.0 and 1.0 means partial success, and 1.0 means full success.

Metric	Meaning
`reward`	Score between 0.0 and 1.0
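
To make the score ranges concrete, here is a small sketch that maps a reward value to one of the three outcomes described in this section. The function name `classify_reward` and the labels it prints are illustrative choices, not part of Dojo or vf-eval, and the sketch assumes the input is a plain decimal number.

```shell
# Maps a reward in [0.0, 1.0] to a coarse outcome label.
classify_reward() {
  awk -v r="$1" 'BEGIN {
    r = r + 0                       # force a numeric comparison
    if (r <= 0)      print "failure"
    else if (r >= 1) print "success"
    else             print "partial"
  }'
}

classify_reward 0.5
```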

For more information, see the Dojo Verifiers Integration Documentation.