0

ECOM Bench RL Env (Vibrantlabsai)

Fresh

Shopify e-commerce browser agent benchmark - Prime Intellect verifiers environment.

Type
RL Env
Publisher
Vibrantlabsai
Runtime
multi-turn
License
unknown
Size
v0.1.4
Published
May 2026

Cite

Notes

Only stored in your browser.

ecom-bench

Hard cart-building tasks for browser agents on real Shopify e-commerce sites, via BrowserEnv in DOM mode (Stagehand). The agent reads a shopping task, drives a live storefront with navigate / observe / act / extract, and is scored on the resulting cart by a per-site verifier.

Two models are in the loop: the rollout model under test emits the tool calls, and a separate Stagehand grounder (stagehand_model, default anthropic/claude-haiku-4-5) does the DOM grounding behind those tools.

Overview

  • Environment ID: ecom-bench
  • Type: multi-turn, tool use, browser
  • Tags: browser-agent, stagehand, shopify, eval, train

Dataset

40 tasks across 4 Shopify storefronts, 10 per site:

SiteTasksStorefront
gymshark10gymshark.com
allbirds10allbirds.com
chillys10chillys.com
kylie10kyliecosmetics.com

Tasks are adversarial cart-building prompts (exact totals, matching sets, size/color constraints, multi-item bundles) mined to challenge frontier browser agents. Each row carries info = {site, task_id, start_url, location, verifier_path}; the rollout pre-navigates to start_url before turn 1.

Tools

  • navigate(url) — load a URL.
  • observe(instruction) — find DOM elements matching a description.
  • act(instruction) — perform a natural-language action on the page.
  • extract(instruction, schema_json) — pull structured data matching a JSON schema.

Reward

  • cart_verifier_reward (weight 1.0) — binary. A priority-10 @vf.cleanup hook opens a second Playwright CDP connection to the live BrowserBase session, runs the bundled per-site cart snapshot, and scores it with the task's verifier ((result, checks, debug)). result is the reward.

Metric (weight 0): cart_total_qty — total items in the captured cart, for partial-progress visibility even when the binary reward is 0.

Stop conditions

A rollout ends when the model emits an assistant turn with no tool call (it considers the task done) or max_turns (default 30) is reached. Scoring runs at cleanup regardless of how the rollout ended.

Required environment variables

  • BROWSERBASE_API_KEY — Browserbase session creation + side-channel CDP attach.
  • BROWSERBASE_PROJECT_ID — Browserbase project for the sessions.
  • ANTHROPIC_API_KEY — key for the default Stagehand grounder (anthropic/claude-haiku-4-5). If you change stagehand_model to another provider, set that provider's key instead (OPENAI_API_KEY / GOOGLE_API_KEY).

The rollout model under test uses whatever provider/key the eval is configured with (e.g. --provider prime), independent of the grounder key above.

Environment Arguments

ArgTypeDefaultDescription
sitestrNoneKeep only tasks from one site (e.g. "gymshark").
task_idslist[str]NoneKeep only these task ids.
stagehand_modelstr"anthropic/claude-haiku-4-5"Stagehand's internal DOM-grounding LLM.
model_api_key_varstrderivedEnv var Stagehand reads for the grounder key; derived from stagehand_model's provider by default.
proxiesboolTrueRoute the Browserbase session through a residential proxy (Shopify CDNs bot-block without one).
max_turnsint30Agent step cap per rollout.
proxy_model_to_stagehandboolFalseRoute Stagehand's grounder calls through the verifiers client. Off because Stagehand's server hardcodes api.openai.com for non-OpenAI providers.

Quickstart

prime env install ecom-bench

# Single-task smoke (gymshark task 183), one rollout
prime eval run ecom-bench -m claude-haiku-4-5 -n 1 -r 1 \
  -a '{"task_ids": ["183"]}'

# One site, default haiku grounder
prime eval run ecom-bench -m claude-haiku-4-5 -n 10 -r 1 -a '{"site": "gymshark"}'

# Full 40-task run (push first so results auto-upload)
prime env push --path environments/ecom_bench --visibility PRIVATE
prime eval run vibrantlabsai/ecom-bench -m claude-haiku-4-5 -n 40 -r 1

-n caps how many of the 40 tasks are sampled — pass -n 40 for full coverage (the bundled [tool.verifiers.eval] default samples fewer).