SkillsBench

SkillsBench in the Prime Intellect Environments Hub. This package wraps the SkillsBench benchmark as a verifiers.v1 environment, so models can be evaluated with prime eval run and trained on the hosted training stack.

This port tracks the v1.1 release of SkillsBench, which ships native BenchFlow task.md packages (GitHub release v1.1, commit 27738384). The task list matches that release exactly: no tasks were moved, dropped, or renamed.

Overview

Environment ID: skillsbench
Short description: Benchmark for evaluating how well AI agents use skills (modular folders of instructions, scripts, and resources). Bundles 101 task dirs: an 87-task scored grid plus 14 opt-in extras.
Tags: skills, agents, tool-use, harbor, eval, train

Datasets

Scored grid: 87 SkillsBench tasks under tasks/, loaded by default — the full v1.1 default grid.
Extras: 14 opt-in tasks under tasks-extra/, loaded only with extras=true. These are tasks that need credentials or external dependencies, or are otherwise integration-incompatible, which upstream parks in tasks-extra/.

Each task is a native task.md package (BenchFlow v1.1 layout):

tasks/<task-id>/
  task.md                  # YAML frontmatter (schema_version 1.3:
                           #   metadata / verifier / agent / environment) + instruction body
  environment/
    Dockerfile             # baked recipe (ignored by this port)
    skills/                # uploaded to /root/.claude/skills/
    <data files...>        # uploaded to /root/
  oracle/
    solve.sh               # reference solution (held back, scoring only)
  verifier/
    test.sh                # scorer: pytest verifier/test_outputs.py -> /logs/verifier/reward.txt
    test_outputs.py

Source: https://github.com/benchflow-ai/skillsbench — tasks/ and tasks-extra/ trees vendored verbatim from the v1.1 release (27738384), matching the release's exact split.
Split sizes: 87 scored examples in a single split (101 when extras=true adds the 14 opt-in tasks).

Task

Type: multi-turn, tool-use coding agent.
Output expectations: each task's task.md instruction names the file path(s) the agent must produce. Examples: /root/mass_report.json, /root/answer.txt. The task's verifier/test_outputs.py scores them.
Rubric: a single binary skillsbench_taskmd_reward — 1.0 if the task's verifier writes a non-zero value to /logs/verifier/reward.txt, else 0.0.
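The rubric rule above can be sketched as a small pure function. This is an illustration of the stated rule rather than the port's actual skillsbench_taskmd_reward code; the function name binary_reward is hypothetical, and treating empty, unparsable, or NaN output as a failure is an assumption the source does not spell out.

```python
import math


def binary_reward(reward_text: str) -> float:
    """Map the contents of /logs/verifier/reward.txt to a binary reward.

    Hypothetical helper: 1.0 for any non-zero numeric value, else 0.0.
    """
    try:
        value = float(reward_text.strip())
    except ValueError:
        # Assumption: empty or unparsable verifier output counts as a failure.
        return 0.0
    if math.isnan(value):
        # Assumption: NaN is not a meaningful score, so treat it as a failure.
        return 0.0
    return 1.0 if value != 0.0 else 0.0
```

Under this sketch a partial score such as 0.5 still maps to 1.0, because the rule only distinguishes zero from non-zero.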

How it differs from upstream SkillsBench

| Concern | Upstream SkillsBench (`bench eval`) | This port (`prime eval run`) |
| --- | --- | --- |
| Sandbox | Daytona, per-task built Docker image | One shared image (`ubuntu:24.04`) + per-task setup script |
| Workdir | `/root` (Dockerfile `WORKDIR /root`) | `/root` (set on sandbox config) |
| Skills mount | `/root/.claude/skills/<skill>/` | Same: `/root/.claude/skills/<skill>/` |
| Agent | `claude-agent-acp`, etc. | `verifiers.v1.OpenCode` by default; Pi, MiniSWEAgent, base harness all wireable |
| Scoring | runs `verifier/test.sh`, reads `/logs/verifier/reward.txt` | `skillsbench_taskmd_reward` does exactly the same |

Because the HarborTaskset in stock verifiers (0.1.14) only parses the old task.toml layout, this env ships a native task.md loader. The loader parses the YAML frontmatter (sandbox sizing from environment.memory_mb / storage_mb / cpus, timeouts from the verifier and agent sections), uses the markdown body as the instruction, and scores via the v1.1 verifier/ protocol: it mounts oracle/ → /oracle and verifier/ → /verifier, then runs bash /verifier/test.sh.
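The first step of such a loader, separating the frontmatter from the instruction body, might look like the sketch below. It is not the env's real loader: split_task_md is a hypothetical name, the sample document is invented for illustration, and the frontmatter is returned as raw YAML text that a YAML parser would still need to decode.

```python
def split_task_md(text: str) -> tuple[str, str]:
    """Return (frontmatter_yaml, instruction_body) from a task.md string.

    Hypothetical helper; assumes the frontmatter is delimited by '---' lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("task.md must start with a '---' frontmatter fence")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter is never closed by a second '---' line")


# Invented sample; real task.md files carry more frontmatter keys.
sample = """---
environment:
  memory_mb: 4096
---
Write the answer to /root/answer.txt.
"""
frontmatter, body = split_task_md(sample)
```

Keeping the split separate from YAML decoding lets the instruction body pass through untouched, which matters because the body is used verbatim as the agent's instruction.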

The port intentionally trades per-task Dockerfiles for a shared base image plus a setup hook. For a task that needs a different image (e.g. CUDA, a locked compiler version), set docker_image under environment: in its task.md frontmatter — the loader honors it.
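The fallback from a per-task image to the shared base image can be pictured as follows. This is a sketch under assumptions, not the loader's code: resolve_sandbox is a hypothetical name, the input is an already-parsed frontmatter dict, and missing sizing keys are returned as None because the README does not state their defaults.

```python
from typing import Any

DEFAULT_IMAGE = "ubuntu:24.04"  # shared base image named in this README


def resolve_sandbox(
    frontmatter: dict[str, Any], default_image: str = DEFAULT_IMAGE
) -> dict[str, Any]:
    """Pick sandbox settings from the `environment` frontmatter section.

    Hypothetical helper: a task-level docker_image wins over the default.
    """
    env = frontmatter.get("environment") or {}
    return {
        "docker_image": env.get("docker_image") or default_image,
        # Sizing keys named in the README; None means "not set by the task".
        "memory_mb": env.get("memory_mb"),
        "storage_mb": env.get("storage_mb"),
        "cpus": env.get("cpus"),
    }
```

For example, a frontmatter dict of {"environment": {"docker_image": "example/cuda-image:latest"}} (an invented image name) would resolve to that image, while an empty dict falls back to ubuntu:24.04.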

Quickstart

# Install
prime env install benchflow/skillsbench      # or `prime env install .` after cloning

# List the tasks the env can run
python -c "from skillsbench import SkillsBenchTaskset; \
print(len(SkillsBenchTaskset().load_rows()))"

# Run one task end-to-end with the default OpenCode harness
# (Verified working: Prime CLI + anthropic/claude-haiku-4.5 via -p prime)
prime eval run skillsbench \
  -m anthropic/claude-haiku-4.5 -p prime \
  -n 1 -r 1 --max-concurrent 1 --timeout 1800 \
  -a '{"config":{"taskset":{"task_names":["dialogue-parser"]}}}'

# Run a curated subset
prime eval run skillsbench \
  -m anthropic/claude-sonnet-4.6 -p prime \
  -n 5 -r 1 --max-concurrent 4 \
  -a '{"config":{"taskset":{"task_names":["3d-scan-calc","edit-pdf","dialogue-parser","exam-block-sequencing","find-topk-similiar-chemicals"]}}}'
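Hand-writing the JSON passed to -a gets error-prone as the task list grows. The sketch below builds the same command string from a Python list; eval_command is a hypothetical helper, and setting -n to the number of task names simply mirrors the two invocations above rather than documented CLI semantics.

```python
import json
import shlex


def eval_command(model: str, task_names: list[str], max_concurrent: int = 1) -> str:
    """Build a shell-safe `prime eval run skillsbench` command line.

    Hypothetical helper; only uses flags that appear in this README.
    """
    env_args = {"config": {"taskset": {"task_names": task_names}}}
    argv = [
        "prime", "eval", "run", "skillsbench",
        "-m", model, "-p", "prime",
        # Assumption: one example per selected task, as in the examples above.
        "-n", str(len(task_names)), "-r", "1",
        "--max-concurrent", str(max_concurrent),
        "-a", json.dumps(env_args, separators=(",", ":")),
    ]
    # shlex.join quotes the JSON payload so the shell passes it as one argument.
    return shlex.join(argv)


print(eval_command("anthropic/claude-haiku-4.5", ["dialogue-parser"]))
```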

Environment Arguments

load_environment takes the standard vf.EnvConfig envelope. The taskset and harness child configs accept:

| Section | Arg | Type | Default | Description |
| --- | --- | --- | --- | --- |
| taskset | `task_names` | `list[str] \| None` | `None` (all 87) | Subset filter |
| taskset | `extras` | `bool` | `false` | Merge the 14 `tasks-extra/` tasks into the grid |
| taskset | `docker_image` | `str` | `ubuntu:24.04` | Image used when a task's `task.md` does not override |
| taskset | `workdir` | `str` | `/root` | Sandbox workdir; also where `instruction.md` / data is uploaded |
| taskset | `skills_remote_dir` | `str` | `/root/.claude/skills` | Where the per-task `environment/skills/` tree is mounted |
| taskset | `with_skills` | `bool` | `true` | Set `false` for the without-skills half of a paired sweep |
| taskset | `apt_packages` | `list[str]` | python3, pip, ripgrep, jq, … | Installed once per rollout via `apt-get install` |
| taskset | `timeout_minutes` | `int` | `480` | Sandbox lifetime (8h) |
| taskset | `agent_timeout_seconds` | `float` | `7200` | Per-command timeout for the agent. Foreground HTTP caps at 900s, so anything above 600s is auto-routed through `start_background_job` (polled, ~24h cap). |
| taskset | `verifier_timeout_seconds` | `float` | `1800` | Verifier script timeout (also background-routed) |
| harness | `max_turns` | `int` | `60` | OpenCode turn budget |
| harness | `system_prompt` | `str` | SkillsBench prompt | Override the system prompt |
| harness | `disabled_tools` | `list[str]` | narrow list | OpenCode tool gating |
| harness | `agent_workdir` | `str` | `/root` | Where OpenCode `cd`s before running |
| harness | `install_ripgrep` | `bool` | `false` | Skipped because our apt setup installs ripgrep |
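The timeout routing described for agent_timeout_seconds and verifier_timeout_seconds reduces to a threshold check. The sketch below only illustrates that rule; execution_route and the constant name are hypothetical, and the real port may decide this differently internally.

```python
FOREGROUND_LIMIT_SECONDS = 600.0  # routing threshold quoted in the table above


def execution_route(timeout_seconds: float) -> str:
    """Return which execution path a command with this timeout would take.

    Hypothetical helper: anything above 600s goes to the background job path.
    """
    if timeout_seconds > FOREGROUND_LIMIT_SECONDS:
        return "background"  # polled via start_background_job
    return "foreground"  # plain HTTP call, which caps at 900s
```

With the defaults above, both the agent timeout (7200) and the verifier timeout (1800) land on the background path.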

Metrics

| Metric | Meaning |
| --- | --- |
| `reward` | `1.0` iff the task's `verifier/test.sh` wrote a non-zero value to `/logs/verifier/reward.txt`, else `0.0` |
| `skillsbench_tests` (state) | dict with `returncode`, `stdout`, `stderr` from the verifier, useful for debugging failed rollouts |
| `skillsbench_error` (state) | exception string if verifier upload/exec itself failed |
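Because reward is binary, its mean over a set of rollouts is the pass rate, and a paired sweep (with_skills true versus false) can be summarized as the difference between two pass rates. The helpers below are a hypothetical post-processing sketch that operates on plain lists of reward values; they are not part of the env, and how the rewards are collected from eval output is left open.

```python
from statistics import mean


def pass_rate(rewards: list[float]) -> float:
    """Fraction of rollouts that scored 1.0 (mean of binary rewards)."""
    # Assumption: an empty run is reported as 0.0 rather than raising.
    return mean(rewards) if rewards else 0.0


def skill_uplift(with_skills: list[float], without_skills: list[float]) -> float:
    """Pass-rate difference between the two halves of a paired sweep."""
    return pass_rate(with_skills) - pass_rate(without_skills)
```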

Caveats

Single base image. Tasks that hard-depend on a custom base image (e.g. a specific CUDA/compiler image, or compose-based tasks) will not match upstream behavior unless their task.md frontmatter sets environment.docker_image.
Network on by default. SkillsBench setup steps run apt-get / pip install, so the port enables network_access = True for every rollout.
verifier/test.sh runs inside the same sandbox. At score time skillsbench_taskmd_reward uploads oracle/ → /oracle and verifier/ → /verifier from the host into the live sandbox. Neither directory exists in the sandbox while the agent is working, so the agent cannot peek at the oracle.
No bench-specific judge agents. SkillsBench's bench eval wrapper provides trajectory inspection and skill-coverage metrics; this port stays inside the verifiers.v1 contract and emits the single binary reward.