ecom-bench

Hard cart-building tasks for browser agents on real Shopify e-commerce sites, via BrowserEnv in DOM mode (Stagehand). The agent reads a shopping task, drives a live storefront with navigate / observe / act / extract, and is scored on the resulting cart by a per-site verifier.

Two models are in the loop: the rollout model under test emits the tool calls, and a separate Stagehand grounder (stagehand_model, default anthropic/claude-haiku-4-5) does the DOM grounding behind those tools.

Overview

Environment ID: ecom-bench
Type: multi-turn, tool use, browser
Tags: browser-agent, stagehand, shopify, eval, train

Dataset

40 tasks across 4 Shopify storefronts, 10 per site:

Site	Tasks	Storefront
gymshark	10	gymshark.com
allbirds	10	allbirds.com
chillys	10	chillys.com
kylie	10	kyliecosmetics.com

Tasks are adversarial cart-building prompts (exact totals, matching sets, size/color constraints, multi-item bundles) mined to challenge frontier browser agents. Each row carries info = {site, task_id, start_url, location, verifier_path}; the rollout pre-navigates to start_url before turn 1.
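
As a rough illustration of the row shape, and of how the site and task_ids arguments narrow the dataset, here is a minimal sketch. Only the key names and gymshark task id 183 come from this README; the URLs, location values, verifier paths, the allbirds task id, and the helper filter_rows are hypothetical placeholders.

```python
# Hypothetical rows: key names follow the README, values are placeholders.
ROWS = [
    {"info": {"site": "gymshark", "task_id": "183",
              "start_url": "https://gymshark.com/",
              "location": "US",
              "verifier_path": "verifiers/gymshark_183.py"}},
    {"info": {"site": "allbirds", "task_id": "201",  # made-up id
              "start_url": "https://allbirds.com/",
              "location": "US",
              "verifier_path": "verifiers/allbirds_201.py"}},
]


def filter_rows(rows, site=None, task_ids=None):
    """Keep rows matching the optional site and task-id filters."""
    wanted = set(task_ids) if task_ids is not None else None
    return [
        row for row in rows
        if (site is None or row["info"]["site"] == site)
        and (wanted is None or row["info"]["task_id"] in wanted)
    ]


gymshark_only = filter_rows(ROWS, site="gymshark")
print([row["info"]["task_id"] for row in gymshark_only])
```

Passing neither filter keeps every row, which mirrors the None defaults of the two arguments.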

Tools

navigate(url) — load a URL.
observe(instruction) — find DOM elements matching a description.
act(instruction) — perform a natural-language action on the page.
extract(instruction, schema_json) — pull structured data matching a JSON schema.
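
Because extract takes its schema as a JSON string, one way to produce schema_json is to serialize a schema object with the standard library. The sketch below assumes the tool accepts ordinary JSON Schema; the field names (title, price, sizes) and the instruction text are illustrative only.

```python
import json

# Illustrative JSON Schema for a product page; not required by the environment.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "sizes": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "price"],
}

# schema_json is passed as a string, so serialize the dict first.
schema_json = json.dumps(product_schema)

# Hypothetical arguments for a single extract tool call.
extract_args = {
    "instruction": "Extract the product title, price, and available sizes",
    "schema_json": schema_json,
}
print(extract_args["schema_json"])
```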

Reward

cart_verifier_reward (weight 1.0): binary. A priority-10 @vf.cleanup hook opens a second Playwright CDP connection to the live Browserbase session, runs the bundled per-site cart snapshot, and scores it with the task's verifier, which returns a (result, checks, debug) tuple. result is the reward.

Metric (weight 0): cart_total_qty — total items in the captured cart, for partial-progress visibility even when the binary reward is 0.
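
To make the (result, checks, debug) contract concrete, here is a toy verifier and a quantity metric. This is a sketch under assumptions: the cart snapshot shape (an items list with quantity and color fields) and the pass criteria are invented for illustration, and the bundled per-site verifiers may look quite different.

```python
def cart_total_qty(cart: dict) -> int:
    """Sum item quantities in a (hypothetical) cart snapshot."""
    return sum(item["quantity"] for item in cart.get("items", []))


def example_verifier(cart: dict):
    """Toy verifier: pass when the cart holds exactly two units of one color."""
    items = cart.get("items", [])
    checks = {
        "has_two_units": cart_total_qty(cart) == 2,
        "single_color": len({item["color"] for item in items}) == 1,
    }
    # Binary result: 1.0 only if every check passes.
    result = 1.0 if all(checks.values()) else 0.0
    debug = {"total_qty": cart_total_qty(cart)}
    return result, checks, debug


cart = {"items": [{"quantity": 1, "color": "black"},
                  {"quantity": 1, "color": "black"}]}
result, checks, debug = example_verifier(cart)
print(result, checks, debug)
```

Even when result is 0, a quantity figure like debug["total_qty"] shows whether the agent got anything into the cart at all, which is the role cart_total_qty plays as a zero-weight metric.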

Stop conditions

A rollout ends when the model emits an assistant turn with no tool call (signaling that it considers the task done) or when max_turns (default 30) is reached. Scoring runs at cleanup regardless of how the rollout ended.
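
The two stop conditions can be sketched as a single predicate. This assumes an OpenAI-style assistant message dict with an optional tool_calls list; the function name rollout_is_done is hypothetical and is not the environment's actual hook.

```python
def rollout_is_done(last_assistant_message: dict, turn: int,
                    max_turns: int = 30) -> bool:
    """Return True when either stop condition from the README holds."""
    # Condition 1: the assistant turn carries no tool call.
    no_tool_call = not last_assistant_message.get("tool_calls")
    # Condition 2: the turn budget is exhausted.
    return no_tool_call or turn >= max_turns


# A turn that still calls a tool keeps the rollout going.
print(rollout_is_done({"tool_calls": [{"name": "act"}]}, turn=5))
# A plain-text answer ends it.
print(rollout_is_done({"content": "Cart is ready."}, turn=5))
```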

Required environment variables

BROWSERBASE_API_KEY — Browserbase session creation + side-channel CDP attach.
BROWSERBASE_PROJECT_ID — Browserbase project for the sessions.
ANTHROPIC_API_KEY — key for the default Stagehand grounder (anthropic/claude-haiku-4-5). If you change stagehand_model to another provider, set that provider's key instead (OPENAI_API_KEY / GOOGLE_API_KEY).

The rollout model under test uses whatever provider/key the eval is configured with (e.g. --provider prime), independent of the grounder key above.
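
A small preflight check can catch missing variables before a run starts. The sketch below assumes the grounder key variable follows from the provider prefix of stagehand_model, using the three key names mentioned above; the environment's real derivation may differ, and both helper names are hypothetical.

```python
import os

# Assumed provider-to-variable mapping, based on the key names in this README.
PROVIDER_KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def grounder_key_var(stagehand_model: str) -> str:
    """Map 'provider/model' to the env var holding that provider's key."""
    provider = stagehand_model.split("/", 1)[0]
    return PROVIDER_KEY_VARS[provider]


def missing_env_vars(stagehand_model: str = "anthropic/claude-haiku-4-5",
                     environ=None) -> list:
    """List required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    required = [
        "BROWSERBASE_API_KEY",
        "BROWSERBASE_PROJECT_ID",
        grounder_key_var(stagehand_model),
    ]
    return [name for name in required if not environ.get(name)]


print(missing_env_vars())
```

An empty list means the three variables needed for the default grounder are present; the key for the rollout model is configured separately and is not checked here.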

Environment Arguments

Arg	Type	Default	Description
`site`	str	`None`	Keep only tasks from one site (e.g. `"gymshark"`).
`task_ids`	list[str]	`None`	Keep only these task ids.
`stagehand_model`	str	`"anthropic/claude-haiku-4-5"`	Stagehand's internal DOM-grounding LLM.
`model_api_key_var`	str	derived	Env var Stagehand reads for the grounder key; derived from `stagehand_model`'s provider by default.
`proxies`	bool	`True`	Route the Browserbase session through a residential proxy (Shopify CDNs bot-block without one).
`max_turns`	int	`30`	Agent step cap per rollout.
`proxy_model_to_stagehand`	bool	`False`	Route Stagehand's grounder calls through the verifiers client. Off by default because Stagehand's server hardcodes `api.openai.com` for non-OpenAI providers.
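
Environment arguments are passed to prime eval run as a JSON object via -a. When several arguments are combined, building the payload programmatically avoids quoting mistakes. This is a minimal sketch: the argument values are examples, and it only prints the command rather than running it.

```python
import json
import shlex

# Example overrides; keys match the argument table above.
env_args = {"site": "gymshark", "max_turns": 30, "proxies": True}

# json.dumps produces valid JSON (Python True becomes JSON true).
args_json = json.dumps(env_args)

command = [
    "prime", "eval", "run", "ecom-bench",
    "-m", "claude-haiku-4-5",
    "-n", "10", "-r", "1",
    "-a", args_json,
]

# shlex.join quotes the JSON payload so it survives the shell intact.
print(shlex.join(command))
```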

Quickstart

prime env install ecom-bench

# Single-task smoke (gymshark task 183), one rollout
prime eval run ecom-bench -m claude-haiku-4-5 -n 1 -r 1 \
  -a '{"task_ids": ["183"]}'

# One site, default haiku grounder
prime eval run ecom-bench -m claude-haiku-4-5 -n 10 -r 1 -a '{"site": "gymshark"}'

# Full 40-task run (push first so results auto-upload)
prime env push --path environments/ecom_bench --visibility PRIVATE
prime eval run vibrantlabsai/ecom-bench -m claude-haiku-4-5 -n 40 -r 1

-n caps how many of the 40 tasks are sampled — pass -n 40 for full coverage (the bundled [tool.verifiers.eval] default samples fewer).