pinchbench

Overview

Environment ID: pinchbench
Short description: Run PinchBench tasks through OpenClaw in a sandbox, then score them with the original task checks and judge prompt.
Tags: agent, multi-turn, sandbox

Provenance

Task markdown files in pinchbench/tasks/ are copied verbatim from pinchbench/skill.
Referenced assets in pinchbench/assets/ are copied from the same repository.
Task loading is adapted from scripts/lib_tasks.py.
Automated grading, transcript summarization, judge prompt construction, and judge-response parsing are adapted from scripts/lib_grading.py.
The sandbox runner in pinchbench/run_task.py mirrors the upstream execute_openclaw_task(...) loop from scripts/lib_agent.py, with --local added so OpenClaw runs inside the sandbox.
Sandbox setup clears a dedicated /tmp/pinchbench/... agent workspace, loads task fixtures, removes the same bootstrap files that upstream removes (BOOTSTRAP.md, SOUL.md, USER.md, IDENTITY.md), and then copies installed OpenClaw skills into the task workspace.

Task

Type: multi-turn CLI-agent benchmark
Runtime: OpenClaw is installed inside a Prime Sandbox, pointed at the verifier interception endpoint via a temporary custom provider config, and run against a dedicated /tmp/pinchbench/... agent workspace that mirrors the upstream benchmark layout.
Prompt source: upstream PinchBench task markdown, preserved verbatim.
Scoring:
- automated tasks execute the original embedded Python grade(...) snippets against the downloaded sandbox workspace and transcript.
- llm_judge tasks use the original PinchBench judge prompt shape and default judge model choice.
- hybrid tasks combine both using the upstream weights.

Quickstart

# install (local development)
uv pip install -e ./environments/pinchbench

# one debug rollout
uv run vf-eval pinchbench -n1 -r1 -d -v

# automated-only suite
uv run vf-eval pinchbench -n5 -r1 -a '{"suite":"automated-only"}'

Environment Arguments

Arg	Type	Default	Description
`suite`	str	`"all"`	`all`, `automated-only`, or comma-separated task ids
`openclaw_version`	str	`"2026.3.13"`	npm package version installed inside the sandbox
`docker_image`	str	`"node:24-bookworm"`	Sandbox image
`timeout_multiplier`	float	`1.0`	Multiplies task timeouts before the runner uses them
`timeout_seconds`	float	`1800.0`	Overall verifier rollout timeout
`max_turns`	int	`200`	Max intercepted model turns
`setup_parallelism`	int	`4`	Max concurrent PinchBench sandbox bootstraps per process
`judge_model`	str	`"openrouter/anthropic/claude-opus-4.5"`	Upstream PinchBench default judge model
`judge_base_url`	str	`"https://api.pinference.ai/api/v1"`	Base URL for the judge client
`judge_api_key_var`	str	`"PRIME_API_KEY"`	Env var used for the judge API key when Prime CLI auth is not available

Notes

This port keeps the upstream task prompts and grading logic intact, but it does not recreate the original host-side PinchBench harness byte-for-byte.
The upstream judge model string is preserved, but the default judge client now points at Pinference, strips the leading openrouter/ prefix before sending the request, and resolves Prime team auth the same way other environments in this repository do.
The sandbox bootstrap now relies on the base image for standard tooling, installs only the missing PDF/pip utilities PinchBench tasks actually use, and otherwise keeps setup focused on OpenClaw itself.
Search-heavy and image-generation tasks work best when relevant tool credentials are available in the evaluation environment; the sandbox forwards a small allowlist of common search/image env vars when present.
The default vf-eval smoke test starts on task_00_sanity, so it does not require judge credentials.

Changelog

v0.1.1

Harden per-sandbox bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.