Pinchbench RL Env (Community)

PinchBench benchmark port running OpenClaw inside a verifier sandbox.

Type: RL Env
License: apache-2.0
Published: Apr 2026

pinchbench

Overview

  • Environment ID: pinchbench
  • Short description: Run PinchBench tasks through OpenClaw in a sandbox, then score them with the original task checks and judge prompt.
  • Tags: agent, multi-turn, sandbox

Provenance

  • Task markdown files in pinchbench/tasks/ are copied verbatim from pinchbench/skill.
  • Referenced assets in pinchbench/assets/ are copied from the same repository.
  • Task loading is adapted from scripts/lib_tasks.py.
  • Automated grading, transcript summarization, judge prompt construction, and judge-response parsing are adapted from scripts/lib_grading.py.
  • The sandbox runner in pinchbench/run_task.py mirrors the upstream execute_openclaw_task(...) loop from scripts/lib_agent.py, with --local added so OpenClaw runs inside the sandbox.
  • Sandbox setup clears a dedicated /tmp/pinchbench/... agent workspace, loads task fixtures, removes the same bootstrap files that upstream removes (BOOTSTRAP.md, SOUL.md, USER.md, IDENTITY.md), and then copies installed OpenClaw skills into the task workspace.
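
The workspace preparation order described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in pinchbench/run_task.py: the function name prepare_workspace, its parameters, and the "skills" subdirectory name are assumptions made for the example. Only the four bootstrap file names come from this README.

```python
import shutil
from pathlib import Path

# Bootstrap files that this README says are removed, matching upstream.
BOOTSTRAP_FILES = ("BOOTSTRAP.md", "SOUL.md", "USER.md", "IDENTITY.md")


def prepare_workspace(workspace: Path, fixtures: Path, skills: Path) -> None:
    """Hypothetical helper: reset a task workspace in the order described above."""
    # 1. Clear any previous contents of the dedicated agent workspace.
    if workspace.exists():
        shutil.rmtree(workspace)
    workspace.mkdir(parents=True)

    # 2. Load the task fixtures into the fresh workspace.
    if fixtures.is_dir():
        shutil.copytree(fixtures, workspace, dirs_exist_ok=True)

    # 3. Remove the bootstrap files if any are present.
    for name in BOOTSTRAP_FILES:
        (workspace / name).unlink(missing_ok=True)

    # 4. Copy installed skills last (the "skills" target name is an assumption).
    if skills.is_dir():
        shutil.copytree(skills, workspace / "skills", dirs_exist_ok=True)
```

Copying skills after the bootstrap files are removed keeps the sketch aligned with the order stated above; the real runner may handle paths and errors differently.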

Task

  • Type: multi-turn CLI-agent benchmark
  • Runtime: OpenClaw is installed inside a Prime Sandbox, pointed at the verifier interception endpoint via a temporary custom provider config, and run against a dedicated /tmp/pinchbench/... agent workspace that mirrors the upstream benchmark layout.
  • Prompt source: upstream PinchBench task markdown, preserved verbatim.
  • Scoring:
    • automated tasks execute the original embedded Python grade(...) snippets against the downloaded sandbox workspace and transcript.
    • llm_judge tasks use the original PinchBench judge prompt shape and default judge model choice.
    • hybrid tasks combine both using the upstream weights.
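
As a rough illustration of how the three scoring modes relate, the sketch below dispatches on the grading type and computes a weighted average for hybrid tasks. The function combine_scores and its parameters are hypothetical, and the 0.5/0.5 default weights are placeholders: the actual weights come from upstream and are not stated in this README.

```python
from typing import Optional


def combine_scores(
    grading_type: str,
    automated: Optional[float] = None,
    judge: Optional[float] = None,
    automated_weight: float = 0.5,  # placeholder, not the upstream value
    judge_weight: float = 0.5,  # placeholder, not the upstream value
) -> float:
    """Hypothetical helper: pick or blend scores according to the grading type."""
    if grading_type == "automated":
        if automated is None:
            raise ValueError("automated score is required")
        return automated
    if grading_type == "llm_judge":
        if judge is None:
            raise ValueError("judge score is required")
        return judge
    if grading_type == "hybrid":
        if automated is None or judge is None:
            raise ValueError("hybrid grading needs both scores")
        total = automated_weight + judge_weight
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        # Normalized weighted average of the two component scores.
        return (automated_weight * automated + judge_weight * judge) / total
    raise ValueError(f"unknown grading type: {grading_type!r}")
```

Normalizing by the weight sum keeps the result in the same range as the component scores even if the weights do not add up to 1.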

Quickstart

# install (local development)
uv pip install -e ./environments/pinchbench

# one debug rollout
uv run vf-eval pinchbench -n1 -r1 -d -v

# automated-only suite
uv run vf-eval pinchbench -n5 -r1 -a '{"suite":"automated-only"}'

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| suite | str | "all" | all, automated-only, or comma-separated task ids |
| openclaw_version | str | "2026.3.13" | npm package version installed inside the sandbox |
| docker_image | str | "node:24-bookworm" | Sandbox image |
| timeout_multiplier | float | 1.0 | Multiplies task timeouts before the runner uses them |
| timeout_seconds | float | 1800.0 | Overall verifier rollout timeout |
| max_turns | int | 200 | Max intercepted model turns |
| setup_parallelism | int | 4 | Max concurrent PinchBench sandbox bootstraps per process |
| judge_model | str | "openrouter/anthropic/claude-opus-4.5" | Upstream PinchBench default judge model |
| judge_base_url | str | "https://api.pinference.ai/api/v1" | Base URL for the judge client |
| judge_api_key_var | str | "PRIME_API_KEY" | Env var used for the judge API key when Prime CLI auth is not available |
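
Environment arguments are passed to vf-eval as a JSON object through -a, as in the Quickstart. When scripting several runs, it can help to build that string programmatically so the quoting stays correct. The helper below is a hypothetical convenience for illustration, not part of this environment; it only assembles the command shown in the Quickstart.

```python
import json
import shlex


def build_eval_command(num_examples: int, rollouts: int, env_args: dict) -> str:
    """Hypothetical helper: assemble a vf-eval command line for pinchbench."""
    parts = ["uv", "run", "vf-eval", "pinchbench", f"-n{num_examples}", f"-r{rollouts}"]
    if env_args:
        # Compact JSON, shell-quoted by shlex.join so it survives copy-paste.
        parts += ["-a", json.dumps(env_args, separators=(",", ":"))]
    return shlex.join(parts)


print(build_eval_command(5, 1, {"suite": "automated-only"}))
# uv run vf-eval pinchbench -n5 -r1 -a '{"suite":"automated-only"}'
```

Other arguments from the table, such as timeout_multiplier or max_turns, can be added to the same dictionary.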

Notes

  • This port keeps the upstream task prompts and grading logic intact, but it does not recreate the original host-side PinchBench harness byte-for-byte.
  • The upstream judge model string is preserved, but the default judge client now points at Pinference, strips the leading openrouter/ prefix before sending the request, and resolves Prime team auth the same way other environments in this repository do.
  • The sandbox bootstrap now relies on the base image for standard tooling, installs only the missing PDF/pip utilities PinchBench tasks actually use, and otherwise keeps setup focused on OpenClaw itself.
  • Search-heavy and image-generation tasks work best when relevant tool credentials are available in the evaluation environment; the sandbox forwards a small allowlist of common search/image env vars when present.
  • The default vf-eval smoke test starts on task_00_sanity, so it does not require judge credentials.
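
Two of the behaviors noted above, stripping the openrouter/ prefix from the judge model string and forwarding an allowlist of environment variables, can be sketched as below. The function names are hypothetical, and the allowlist entries are placeholder names because this README does not list the variables that are actually forwarded.

```python
import os
from typing import Mapping

# Placeholder names only: the real allowlist is not given in this README.
FORWARDED_ENV_VARS = ("EXAMPLE_SEARCH_API_KEY", "EXAMPLE_IMAGE_API_KEY")


def normalize_judge_model(judge_model: str) -> str:
    """Drop a leading "openrouter/" prefix, leaving other model strings unchanged."""
    return judge_model.removeprefix("openrouter/")


def forwarded_env(
    environ: Mapping[str, str],
    allowlist: tuple = FORWARDED_ENV_VARS,
) -> dict:
    """Return only allowlisted variables that are set to a non-empty value."""
    return {name: environ[name] for name in allowlist if environ.get(name)}


print(normalize_judge_model("openrouter/anthropic/claude-opus-4.5"))
# anthropic/claude-opus-4.5
print(sorted(forwarded_env(os.environ)))
```

Filtering by an explicit allowlist, rather than copying the whole host environment, keeps unrelated secrets out of the sandbox.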

Changelog

v0.1.1

  • Harden per-sandbox bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.