ecom-bench
Hard cart-building tasks for browser agents on real Shopify e-commerce sites,
via BrowserEnv in DOM mode (Stagehand). The agent reads a shopping task,
drives a live storefront with navigate / observe / act / extract, and
is scored on the resulting cart by a per-site verifier.
Two models are in the loop: the rollout model under test emits the tool
calls, and a separate Stagehand grounder (stagehand_model, default
anthropic/claude-haiku-4-5) does the DOM grounding behind those tools.
Overview
- Environment ID: ecom-bench
- Type: multi-turn, tool use, browser
- Tags: browser-agent, stagehand, shopify, eval, train
Dataset
40 tasks across 4 Shopify storefronts, 10 per site:
| Site | Tasks | Storefront |
|---|---|---|
| gymshark | 10 | gymshark.com |
| allbirds | 10 | allbirds.com |
| chillys | 10 | chillys.com |
| kylie | 10 | kyliecosmetics.com |
Tasks are adversarial cart-building prompts (exact totals, matching sets,
size/color constraints, multi-item bundles) mined to challenge frontier
browser agents. Each row carries info = {site, task_id, start_url, location, verifier_path}; the rollout pre-navigates to start_url before turn 1.
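For orientation, one row's info payload might look like the sketch below. The keys come from the description above; every value is an illustrative placeholder (the URL, location format, and verifier path are assumptions, not taken from the dataset), and missing_info_keys is a hypothetical helper.

```python
# Hypothetical example of one row's `info` dict.
# Keys follow the README; all values are made-up placeholders.
example_info = {
    "site": "gymshark",
    "task_id": "183",
    "start_url": "https://www.gymshark.com/",      # placeholder URL
    "location": "US",                               # placeholder format
    "verifier_path": "verifiers/gymshark_183.py",   # placeholder path
}

REQUIRED_INFO_KEYS = {"site", "task_id", "start_url", "location", "verifier_path"}


def missing_info_keys(info: dict) -> set:
    """Return the required keys that are absent from an info dict."""
    return REQUIRED_INFO_KEYS - set(info)
```

A check like this can be handy when adding tasks by hand, since a row without start_url cannot be pre-navigated.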
Tools
- navigate(url): load a URL.
- observe(instruction): find DOM elements matching a description.
- act(instruction): perform a natural-language action on the page.
- extract(instruction, schema_json): pull structured data matching a JSON schema.
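As a rough illustration of the extract tool's schema_json argument, the sketch below builds a schema and serializes it to a string. It assumes the argument is a JSON-encoded JSON Schema object; the field names, the instruction text, and the tool-call envelope are all hypothetical.

```python
import json

# Assumption: schema_json is a JSON Schema object serialized to a string.
# The fields below are invented for illustration.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "sizes_in_stock": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "price"],
}
schema_json = json.dumps(product_schema)

# Hypothetical shape of the tool call a rollout model might emit.
extract_call = {
    "name": "extract",
    "arguments": {
        "instruction": "Get the product title, price, and in-stock sizes",
        "schema_json": schema_json,
    },
}
```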
Reward
cart_verifier_reward (weight 1.0): binary. A priority-10 @vf.cleanup hook opens a second Playwright CDP connection to the live Browserbase session, runs the bundled per-site cart snapshot, and scores it with the task's verifier, which returns (result, checks, debug). result is the reward.
Metric (weight 0): cart_total_qty — total items in the captured cart, for
partial-progress visibility even when the binary reward is 0.
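The verifier contract and the quantity metric can be sketched as follows. This is not one of the bundled verifiers: the snapshot shape is assumed to resemble Shopify's /cart.js response (an items list with quantity fields and a total_price in cents), and the task rule (exactly two items, total at most $100) is invented for illustration.

```python
from typing import Any, Dict, Tuple


def cart_total_qty(cart: Dict[str, Any]) -> int:
    """Sum item quantities in a cart snapshot (assumed /cart.js-like shape)."""
    return sum(int(item.get("quantity", 0)) for item in cart.get("items", []))


def example_verifier(cart: Dict[str, Any]) -> Tuple[float, Dict[str, bool], Dict[str, Any]]:
    """Hypothetical task rule: exactly 2 items, total at most $100.00."""
    qty = cart_total_qty(cart)
    checks = {
        "has_two_items": qty == 2,
        "total_within_budget": cart.get("total_price", 0) <= 100_00,  # cents
    }
    # Binary reward: 1.0 only when every check passes.
    result = 1.0 if all(checks.values()) else 0.0
    debug = {"total_qty": qty, "total_price": cart.get("total_price")}
    return result, checks, debug


snapshot = {"items": [{"quantity": 1}, {"quantity": 1}], "total_price": 8400}
result, checks, debug = example_verifier(snapshot)
```

A cart holding only one of the two items would score 0.0 here while still reporting a total quantity of 1, which is the kind of partial progress the weight-0 metric is meant to surface.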
Stop conditions
A rollout ends when the model emits an assistant turn with no tool call
(it considers the task done) or max_turns (default 30) is reached. Scoring
runs at cleanup regardless of how the rollout ended.
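The two stop conditions amount to a check like the one below. This is a minimal sketch, assuming OpenAI-style chat messages with role and tool_calls keys; should_stop is a hypothetical name and not the environment's actual hook.

```python
from typing import Any, Dict, List


def should_stop(messages: List[Dict[str, Any]], turns_taken: int, max_turns: int = 30) -> bool:
    """Return True when the rollout should end."""
    # Condition 1: the step cap has been reached.
    if turns_taken >= max_turns:
        return True
    if not messages:
        return False
    # Condition 2: the latest assistant turn carries no tool call.
    last = messages[-1]
    return last.get("role") == "assistant" and not last.get("tool_calls")
```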
Required environment variables
- BROWSERBASE_API_KEY: Browserbase session creation + side-channel CDP attach.
- BROWSERBASE_PROJECT_ID: Browserbase project for the sessions.
- ANTHROPIC_API_KEY: key for the default Stagehand grounder (anthropic/claude-haiku-4-5). If you change stagehand_model to another provider, set that provider's key instead (OPENAI_API_KEY / GOOGLE_API_KEY).
The rollout model under test uses whatever provider/key the eval is configured
with (e.g. --provider prime), independent of the grounder key above.
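A preflight check along these lines can catch a missing key before a Browserbase session is created. It is a sketch only: the provider-to-variable mapping covers just the three key names mentioned above, it assumes stagehand_model has the form provider/model, and both function names are hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Only the key names mentioned in this README; other providers are not covered.
PROVIDER_KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def grounder_key_var(stagehand_model: str) -> str:
    """Map a 'provider/model' string to the env var holding that provider's key."""
    provider = stagehand_model.split("/", 1)[0]
    return PROVIDER_KEY_VARS[provider]


def missing_env_vars(
    stagehand_model: str = "anthropic/claude-haiku-4-5",
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """List required variables that are unset or empty."""
    env = os.environ if env is None else env
    required = [
        "BROWSERBASE_API_KEY",
        "BROWSERBASE_PROJECT_ID",
        grounder_key_var(stagehand_model),
    ]
    return [name for name in required if not env.get(name)]
```

Calling missing_env_vars() with no arguments checks the current process environment against the default grounder.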
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| site | str | None | Keep only tasks from one site (e.g. "gymshark"). |
| task_ids | list[str] | None | Keep only these task ids. |
| stagehand_model | str | "anthropic/claude-haiku-4-5" | Stagehand's internal DOM-grounding LLM. |
| model_api_key_var | str | derived | Env var Stagehand reads for the grounder key; derived from stagehand_model's provider by default. |
| proxies | bool | True | Route the Browserbase session through a residential proxy (Shopify CDNs bot-block without one). |
| max_turns | int | 30 | Agent step cap per rollout. |
| proxy_model_to_stagehand | bool | False | Route Stagehand's grounder calls through the verifiers client. Off because Stagehand's server hardcodes api.openai.com for non-OpenAI providers. |
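The site and task_ids filters can be pictured as the sketch below. It assumes that rows expose the info dict described under Dataset and that the two filters combine with AND when both are set; the README does not state either point, and filter_tasks is a hypothetical name.

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_tasks(
    rows: Iterable[Dict[str, Any]],
    site: Optional[str] = None,
    task_ids: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep rows matching the optional site and task-id filters."""
    wanted = set(task_ids) if task_ids is not None else None
    kept = []
    for row in rows:
        info = row["info"]
        if site is not None and info["site"] != site:
            continue
        if wanted is not None and info["task_id"] not in wanted:
            continue
        kept.append(row)
    return kept


# Made-up rows for illustration; only task 183 on gymshark appears in this README.
rows = [
    {"info": {"site": "gymshark", "task_id": "183"}},
    {"info": {"site": "allbirds", "task_id": "201"}},
]
only_gymshark = filter_tasks(rows, site="gymshark")
```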
Quickstart
prime env install ecom-bench
# Single-task smoke (gymshark task 183), one rollout
prime eval run ecom-bench -m claude-haiku-4-5 -n 1 -r 1 \
-a '{"task_ids": ["183"]}'
# One site, default haiku grounder
prime eval run ecom-bench -m claude-haiku-4-5 -n 10 -r 1 -a '{"site": "gymshark"}'
# Full 40-task run (push first so results auto-upload)
prime env push --path environments/ecom_bench --visibility PRIVATE
prime eval run vibrantlabsai/ecom-bench -m claude-haiku-4-5 -n 40 -r 1
-n caps how many of the 40 tasks are sampled — pass -n 40 for full coverage
(the bundled [tool.verifiers.eval] default samples fewer).