pinchbench
Overview
- Environment ID: `pinchbench`
- Short description: Run PinchBench tasks through OpenClaw in a sandbox, then score them with the original task checks and judge prompt.
- Tags: `agent`, `multi-turn`, `sandbox`
Provenance
- Task markdown files in `pinchbench/tasks/` are copied verbatim from `pinchbench/skill`.
- Referenced assets in `pinchbench/assets/` are copied from the same repository.
- Task loading is adapted from `scripts/lib_tasks.py`.
- Automated grading, transcript summarization, judge prompt construction, and judge-response parsing are adapted from `scripts/lib_grading.py`.
- The sandbox runner in `pinchbench/run_task.py` mirrors the upstream `execute_openclaw_task(...)` loop from `scripts/lib_agent.py`, with `--local` added so OpenClaw runs inside the sandbox.
- Sandbox setup clears a dedicated `/tmp/pinchbench/...` agent workspace, loads task fixtures, removes the same bootstrap files that upstream removes (`BOOTSTRAP.md`, `SOUL.md`, `USER.md`, `IDENTITY.md`), and then copies installed OpenClaw skills into the task workspace.
Task
- Type: multi-turn CLI-agent benchmark
- Runtime: OpenClaw is installed inside a Prime Sandbox, pointed at the verifier interception endpoint via a temporary custom provider config, and run against a dedicated `/tmp/pinchbench/...` agent workspace that mirrors the upstream benchmark layout.
- Prompt source: upstream PinchBench task markdown, preserved verbatim.
- Scoring: `automated` tasks execute the original embedded Python `grade(...)` snippets against the downloaded sandbox workspace and transcript. `llm_judge` tasks use the original PinchBench judge prompt shape and default judge model choice. `hybrid` tasks combine both using the upstream weights.
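As a rough illustration of the `hybrid` case, the sketch below combines an automated score and a judge score as a weighted mean. The function name `combine_hybrid`, the weighted-mean form, and the 0.5/0.5 defaults are assumptions for illustration only; the actual weights and combination rule come from the upstream grading code and are not reproduced here.

```python
def combine_hybrid(
    automated: float,
    judge: float,
    automated_weight: float = 0.5,  # placeholder, not the upstream value
    judge_weight: float = 0.5,      # placeholder, not the upstream value
) -> float:
    """Return a weighted mean of two scores (hypothetical helper)."""
    total = automated_weight + judge_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the total weight so the result stays in the input range.
    return (automated * automated_weight + judge * judge_weight) / total


if __name__ == "__main__":
    print(combine_hybrid(1.0, 0.0))
```

With equal weights the result is the plain average of the two scores; unequal weights shift it toward the more heavily weighted grader.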
Quickstart
# install (local development)
uv pip install -e ./environments/pinchbench
# one debug rollout
uv run vf-eval pinchbench -n1 -r1 -d -v
# automated-only suite
uv run vf-eval pinchbench -n5 -r1 -a '{"suite":"automated-only"}'
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| `suite` | str | "all" | all, automated-only, or comma-separated task ids |
| `openclaw_version` | str | "2026.3.13" | npm package version installed inside the sandbox |
| `docker_image` | str | "node:24-bookworm" | Sandbox image |
| `timeout_multiplier` | float | 1.0 | Multiplies task timeouts before the runner uses them |
| `timeout_seconds` | float | 1800.0 | Overall verifier rollout timeout |
| `max_turns` | int | 200 | Max intercepted model turns |
| `setup_parallelism` | int | 4 | Max concurrent PinchBench sandbox bootstraps per process |
| `judge_model` | str | "openrouter/anthropic/claude-opus-4.5" | Upstream PinchBench default judge model |
| `judge_base_url` | str | "https://api.pinference.ai/api/v1" | Base URL for the judge client |
| `judge_api_key_var` | str | "PRIME_API_KEY" | Env var used for the judge API key when Prime CLI auth is not available |
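To make the three accepted forms of `suite` concrete, here is a minimal sketch of how such a value could be resolved to a list of task ids, plus one way to build the JSON string passed to `-a`. The helper `select_tasks`, the task table, and the ids `task_01_example` and `task_02_example` are hypothetical; the environment's real selection logic may differ.

```python
import json


def select_tasks(suite: str, tasks: dict[str, str]) -> list[str]:
    """Resolve a suite string to task ids (hypothetical helper).

    `tasks` maps task id -> grading type: "automated", "llm_judge", or "hybrid".
    """
    if suite == "all":
        return list(tasks)
    if suite == "automated-only":
        return [tid for tid, kind in tasks.items() if kind == "automated"]
    # Otherwise treat the value as comma-separated task ids.
    wanted = [part.strip() for part in suite.split(",") if part.strip()]
    unknown = [tid for tid in wanted if tid not in tasks]
    if unknown:
        raise ValueError(f"unknown task ids: {unknown}")
    return wanted


# Illustrative task table; only task_00_sanity is a name taken from this README,
# and its grading type here is an assumption.
TASKS = {
    "task_00_sanity": "automated",
    "task_01_example": "llm_judge",
    "task_02_example": "hybrid",
}

# JSON string suitable for the -a flag shown in the Quickstart.
ENV_ARGS = json.dumps({"suite": "task_00_sanity,task_02_example", "timeout_multiplier": 2.0})

if __name__ == "__main__":
    print(select_tasks("automated-only", TASKS))
    print(ENV_ARGS)
```

Building the argument string with `json.dumps` avoids quoting mistakes when several arguments are combined on one command line.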
Notes
- This port keeps the upstream task prompts and grading logic intact, but it does not recreate the original host-side PinchBench harness byte-for-byte.
- The upstream judge model string is preserved, but the default judge client now points at Pinference, strips the leading `openrouter/` prefix before sending the request, and resolves Prime team auth the same way other environments in this repository do.
- The sandbox bootstrap now relies on the base image for standard tooling, installs only the missing PDF/pip utilities PinchBench tasks actually use, and otherwise keeps setup focused on OpenClaw itself.
- Search-heavy and image-generation tasks work best when relevant tool credentials are available in the evaluation environment; the sandbox forwards a small allowlist of common search/image env vars when present.
- The default `vf-eval` smoke test starts on `task_00_sanity`, so it does not require judge credentials.
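The judge model handling described in the notes above can be sketched as follows. The function name `normalize_judge_model` is hypothetical and this is not the environment's actual code; it only shows the effect of stripping a leading `openrouter/` prefix from the configured model string.

```python
def normalize_judge_model(model: str, prefix: str = "openrouter/") -> str:
    """Drop a leading provider prefix if present (hypothetical helper).

    Uses str.removeprefix, which requires Python 3.9 or newer.
    """
    return model.removeprefix(prefix)


if __name__ == "__main__":
    print(normalize_judge_model("openrouter/anthropic/claude-opus-4.5"))
```

Model strings without the prefix pass through unchanged, so an override such as a bare model name would not be altered by this step.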
Changelog
v0.1.1
- Harden per-sandbox bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.