SkillsBench
SkillsBench in the Prime Intellect Environments Hub. Wraps the
SkillsBench benchmark as a
verifiers.v1 environment so it can be evaluated and trained against through
prime eval run and the hosted training stack.
This is the v1.1 release of SkillsBench — native BenchFlow task.md
packages (GitHub release v1.1,
commit 27738384). The task list is faithful to the v1.1 release — no tasks
moved, dropped, or renamed.
Overview
- Environment ID: skillsbench
- Short description: Benchmark for evaluating how well AI agents use skills (modular folders of instructions, scripts, and resources). Bundles 101 task dirs: an 87-task scored grid plus 14 opt-in extras.
- Tags: skills, agents, tool-use, harbor, eval, train
Datasets
- Scored grid: 87 SkillsBench tasks under tasks/, loaded by default (the full v1.1 default grid).
- Extras: 14 opt-in tasks under tasks-extra/, loaded only with extras=true (credential-/external-dep or integration-incompatible tasks that upstream parks in tasks-extra/).
- Each task is a native task.md package (BenchFlow v1.1 layout):

      tasks/<task-id>/
        task.md              # YAML frontmatter (schema_version 1.3:
                             #   metadata / verifier / agent / environment) + instruction body
        environment/
          Dockerfile         # baked recipe (ignored by this port)
          skills/            # uploaded to /root/.claude/skills/
          <data files...>    # uploaded to /root/
        oracle/
          solve.sh           # reference solution (held back, scoring only)
        verifier/
          test.sh            # scorer: pytest verifier/test_outputs.py -> /logs/verifier/reward.txt
          test_outputs.py

- Source: https://github.com/benchflow-ai/skillsbench, with the tasks/ and tasks-extra/ trees vendored verbatim from the v1.1 release (27738384), matching the release's exact split.
- Split sizes: 87 scored examples, single split.
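The package layout described above can be checked mechanically before a run. The following is a minimal sketch, assuming a local checkout of the task trees; the helper names `missing_task_files` and `scan_tasks`, and the choice of which files count as required, are illustrative assumptions and not part of this environment's API.

```python
from pathlib import Path

# Files this sketch treats as required, taken from the layout described above.
# Which files the real loader actually requires is an assumption.
REQUIRED = ("task.md", "verifier/test.sh", "verifier/test_outputs.py")


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the required relative paths that are absent from task_dir."""
    return [rel for rel in REQUIRED if not (task_dir / rel).is_file()]


def scan_tasks(root: Path) -> dict[str, list[str]]:
    """Map each task directory name under root to its list of missing files."""
    return {
        d.name: missing_task_files(d)
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```

Calling `scan_tasks(Path("tasks"))` on a checkout would return an empty list for every well-formed task directory, which makes incomplete packages easy to spot.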
Task
- Type: multi-turn, tool-use coding agent.
- Output expectations: each task's task.md instruction names the file path(s) the agent must produce. Examples: /root/mass_report.json, /root/answer.txt. The task's verifier/test_outputs.py scores them.
- Rubric: a single binary skillsbench_taskmd_reward: 1.0 if the task's verifier writes a non-zero value to /logs/verifier/reward.txt, else 0.0.
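The binary mapping in the rubric can be sketched as a small function over the contents of reward.txt. This is an illustration only: the name `binary_reward` is hypothetical, and treating an empty or unparseable file as a failure is an assumption that the text above does not confirm.

```python
def binary_reward(reward_text: str) -> float:
    """Map the contents of reward.txt to the binary reward."""
    try:
        value = float(reward_text.strip())
    except ValueError:
        # Assumption: an empty or unparseable file counts as a failure.
        return 0.0
    # Any non-zero value (including partial scores such as 0.5) maps to 1.0.
    return 1.0 if value != 0.0 else 0.0
```

Under this sketch a verifier that writes a fractional score still yields 1.0, which is what "non-zero" implies for the single binary reward.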
How it differs from upstream SkillsBench
| Concern | Upstream SkillsBench (bench eval) | This port (prime eval run) |
|---|---|---|
| Sandbox | Daytona, per-task built Docker image | One shared image (ubuntu:24.04) + per-task setup script |
| Workdir | /root (Dockerfile WORKDIR /root) | /root (set on sandbox config) |
| Skills mount | /root/.claude/skills/<skill>/ | Same: /root/.claude/skills/<skill>/ |
| Agent | claude-agent-acp, etc. | verifiers.v1.OpenCode by default; Pi, MiniSWEAgent, base harness all wireable |
| Scoring | runs verifier/test.sh, reads /logs/verifier/reward.txt | skillsbench_taskmd_reward does exactly the same |
Because stock verifiers (0.1.14) HarborTaskset only parses the old
task.toml layout, this env ships a native task.md loader: it parses the
YAML frontmatter (sandbox sizing from environment.memory_mb / storage_mb /
cpus, timeouts from verifier/agent), uses the markdown body as the
instruction, and scores via the v1.1 verifier/ protocol (mounts oracle/ →
/oracle, verifier/ → /verifier, runs bash /verifier/test.sh).
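To illustrate the first step of such a loader, here is a minimal sketch that separates the YAML frontmatter from the markdown instruction body. It is not the env's actual loader: the function name `split_task_md` is hypothetical, the sample values are made up, and the sketch assumes the frontmatter is delimited by lines containing only `---`.

```python
def split_task_md(text: str) -> tuple[str, str]:
    """Split a task.md document into (frontmatter, body).

    Assumes the frontmatter is delimited by lines containing only '---',
    with the opening delimiter on the first line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter: the whole file is the instruction
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("unterminated YAML frontmatter")


# Illustrative document; the field values are invented for this example.
sample = """---
schema_version: "1.3"
environment:
  memory_mb: 4096
---
Write the answer to /root/answer.txt.
"""
frontmatter, instruction = split_task_md(sample)
```

The `frontmatter` string could then be handed to a YAML parser such as PyYAML's `yaml.safe_load` to obtain the sizing and timeout fields, while `instruction` is used as the prompt.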
The port intentionally trades per-task Dockerfiles for a shared base image plus
a setup hook. For a task that needs a different image (e.g. CUDA, a locked
compiler version), set docker_image under environment: in its task.md
frontmatter — the loader honors it.
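The override rule can be pictured as a simple fallback. The sketch below assumes the frontmatter has already been parsed into a dict; `resolve_image` is a hypothetical helper name, and the real loader may resolve the image differently.

```python
DEFAULT_IMAGE = "ubuntu:24.04"  # taskset-level default named in this README


def resolve_image(frontmatter: dict, default: str = DEFAULT_IMAGE) -> str:
    """Pick the per-task image if the frontmatter sets one, else the default."""
    environment = frontmatter.get("environment") or {}
    return environment.get("docker_image") or default
```

A task whose frontmatter has no `environment.docker_image` key resolves to the shared base image, so only tasks with special requirements need to carry the extra field.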
Quickstart
# Install
prime env install benchflow/skillsbench # or `prime env install .` after cloning
# List the tasks the env can run
python -c "from skillsbench import SkillsBenchTaskset; \
print(len(SkillsBenchTaskset().load_rows()))"
# Run one task end-to-end with the default OpenCode harness
# (Verified working: Prime CLI + anthropic/claude-haiku-4.5 via -p prime)
prime eval run skillsbench \
-m anthropic/claude-haiku-4.5 -p prime \
-n 1 -r 1 --max-concurrent 1 --timeout 1800 \
-a '{"config":{"taskset":{"task_names":["dialogue-parser"]}}}'
# Run a curated subset
prime eval run skillsbench \
-m anthropic/claude-sonnet-4.6 -p prime \
-n 5 -r 1 --max-concurrent 4 \
-a '{"config":{"taskset":{"task_names":["3d-scan-calc","edit-pdf","dialogue-parser","exam-block-sequencing","find-topk-similiar-chemicals"]}}}'
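Hand-writing the nested JSON for `-a` is easy to get wrong once the task list grows. As a convenience, the argument can be generated; this is a minimal sketch in which the helper name `taskset_args` is hypothetical and only the `config.taskset` shape shown in the commands above is assumed.

```python
import json
import shlex


def taskset_args(task_names: list[str], **taskset_options) -> str:
    """Return the JSON string for -a that selects the given tasks."""
    taskset = {"task_names": list(task_names), **taskset_options}
    # Compact separators reproduce the whitespace-free form used above.
    return json.dumps({"config": {"taskset": taskset}}, separators=(",", ":"))


payload = taskset_args(["dialogue-parser"])
print(payload)
# shlex.quote makes the payload safe to paste into a POSIX shell command.
print("-a " + shlex.quote(payload))
```

Extra keyword arguments are merged into the taskset section, so `taskset_args(["edit-pdf"], with_skills=False)` would produce the config for the without-skills half of a paired sweep.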
Environment Arguments
load_environment takes the standard vf.EnvConfig envelope. The taskset and
harness child configs accept:
| Section | Arg | Type | Default | Description |
|---|---|---|---|---|
| taskset | task_names | list[str] \| None | None (all 87) | Subset filter |
| taskset | extras | bool | false | Merge the 14 tasks-extra/ tasks into the grid |
| taskset | docker_image | str | ubuntu:24.04 | Image used when a task's task.md does not override |
| taskset | workdir | str | /root | Sandbox workdir; also where instruction.md / data is uploaded |
| taskset | skills_remote_dir | str | /root/.claude/skills | Where the per-task environment/skills/ tree is mounted |
| taskset | with_skills | bool | true | Set false for the without-skills half of a paired sweep |
| taskset | apt_packages | list[str] | python3, pip, ripgrep, jq, … | Installed once per rollout via apt-get install |
| taskset | timeout_minutes | int | 480 | Sandbox lifetime (8h) |
| taskset | agent_timeout_seconds | float | 7200 | Per-command timeout for the agent. Foreground HTTP caps at 900s, so anything above 600s is auto-routed through start_background_job (polled, ~24h cap). |
| taskset | verifier_timeout_seconds | float | 1800 | Verifier script timeout (also background-routed) |
| harness | max_turns | int | 60 | OpenCode turn budget |
| harness | system_prompt | str | SkillsBench prompt | Override the system prompt |
| harness | disabled_tools | list[str] | narrow list | OpenCode tool gating |
| harness | agent_workdir | str | /root | Where OpenCode cds before running |
| harness | install_ripgrep | bool | false | Skipped because our apt setup installs ripgrep |
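The routing rule described for agent_timeout_seconds amounts to a threshold check. The sketch below only restates that rule; `execution_mode` and the constant name are hypothetical, and the env's real dispatch code is not shown in this README.

```python
FOREGROUND_LIMIT_SECONDS = 600  # threshold stated in the table above


def execution_mode(timeout_seconds: float) -> str:
    """Return 'background' for timeouts above the threshold, else 'foreground'."""
    if timeout_seconds > FOREGROUND_LIMIT_SECONDS:
        return "background"
    return "foreground"
```

With the defaults in the table, both the agent command (7200 s) and the verifier (1800 s) fall on the background side of this check.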
Metrics
| Metric | Meaning |
|---|---|
| reward | 1.0 iff the task's verifier/test.sh wrote a non-zero value to /logs/verifier/reward.txt, else 0.0 |
| skillsbench_tests (state) | dict with returncode, stdout, stderr from the verifier; useful for debugging failed rollouts |
| skillsbench_error (state) | exception string if verifier upload/exec itself failed |
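When triaging failed rollouts, the two state entries can be folded into a one-line summary. This is a minimal sketch that assumes the rollout state is available as a plain dict with the keys listed above; how that dict is obtained is not covered here, and `summarize_rollout` is a hypothetical helper.

```python
def summarize_rollout(state: dict, tail_chars: int = 500) -> str:
    """Build a short debugging summary from a rollout's state dict."""
    error = state.get("skillsbench_error")
    if error:
        # The verifier never ran to completion: upload or exec failed.
        return f"verifier infrastructure error: {error}"
    tests = state.get("skillsbench_tests") or {}
    returncode = tests.get("returncode")
    # Keep only the end of stderr, where pytest failures usually appear.
    stderr_tail = (tests.get("stderr") or "")[-tail_chars:]
    return f"verifier exit code {returncode}; stderr tail: {stderr_tail!r}"
```

Checking `skillsbench_error` first separates infrastructure failures from genuine task failures, which otherwise both show up as a reward of 0.0.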
Caveats
- Single base image. Tasks that hard-depend on a custom base image (e.g. a specific CUDA/compiler image, or compose-based tasks) will not match upstream behavior unless their task.md frontmatter sets environment.docker_image.
- Network on by default. SkillsBench setup steps run apt-get / pip install, so the port enables network_access = True per rollout.
- verifier/test.sh runs inside the same sandbox. At score time skillsbench_taskmd_reward uploads oracle/ → /oracle and verifier/ → /verifier from the host into the live sandbox. Neither directory is present while the agent is working, so the agent cannot peek at the oracle.
- No bench-specific judge agents. SkillsBench's bench eval wrapper provides trajectory inspection and skill-coverage metrics; this port stays inside the verifiers.v1 contract and emits the single binary reward.