SkillsBench v1.1 (native task.md) — evaluating how well AI agents use skills (87-task default grid + 14 opt-in extras).

Type: RL Env
Publisher: Benchflow
Runtime: multi-turn
License: unknown
Size: v1.1.0
Published: Jun 2026

SkillsBench

SkillsBench in the Prime Intellect Environments Hub. This package wraps the SkillsBench benchmark as a verifiers.v1 environment, so agents can be evaluated and trained against it through prime eval run and the hosted training stack.

This is the v1.1 release of SkillsBench — native BenchFlow task.md packages (GitHub release v1.1, commit 27738384). The task list is faithful to the v1.1 release — no tasks moved, dropped, or renamed.

Overview

  • Environment ID: skillsbench
  • Short description: Benchmark for evaluating how well AI agents use skills (modular folders of instructions, scripts, and resources). Bundles 101 task dirs: an 87-task scored grid plus 14 opt-in extras.
  • Tags: skills, agents, tool-use, harbor, eval, train

Datasets

  • Scored grid: 87 SkillsBench tasks under tasks/, loaded by default — the full v1.1 default grid.

  • Extras: 14 opt-in tasks under tasks-extra/, loaded only with extras=true (credential-/external-dep or integration-incompatible tasks that upstream parks in tasks-extra/).

  • Each task is a native task.md package (BenchFlow v1.1 layout):

    tasks/<task-id>/
      task.md                  # YAML frontmatter (schema_version 1.3:
                               #   metadata / verifier / agent / environment) + instruction body
      environment/
        Dockerfile             # baked recipe (ignored by this port)
        skills/                # uploaded to /root/.claude/skills/
        <data files...>        # uploaded to /root/
      oracle/
        solve.sh               # reference solution (held back, scoring only)
      verifier/
        test.sh                # scorer: pytest verifier/test_outputs.py -> /logs/verifier/reward.txt
        test_outputs.py
    
  • Source: https://github.com/benchflow-ai/skillsbench. The tasks/ and tasks-extra/ trees are vendored verbatim from the v1.1 release (27738384), matching the release's exact split.

  • Split sizes: 87 scored examples, single split.
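The package layout above can be sanity-checked before a run. The following is a minimal sketch, not part of this environment: `check_task_dirs` and `REQUIRED` are hypothetical names, and it assumes every task package carries the files shown in the layout listing.

```python
from pathlib import Path
from typing import Dict, List

# Files the v1.1 layout listing shows for each task package.
# Assumption: every task ships all four; adjust if upstream differs.
REQUIRED = (
    "task.md",
    "oracle/solve.sh",
    "verifier/test.sh",
    "verifier/test_outputs.py",
)


def check_task_dirs(root: Path) -> Dict[str, List[str]]:
    """Map each task id under `root` to the required files it is missing."""
    report: Dict[str, List[str]] = {}
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [rel for rel in REQUIRED if not (task_dir / rel).is_file()]
        report[task_dir.name] = missing
    return report
```

Calling `check_task_dirs(Path("tasks"))` in a checkout returns an empty list for every well-formed task, so any non-empty entry points at an incomplete package.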

Task

  • Type: multi-turn, tool-use coding agent.
  • Output expectations: each task's task.md instruction names the file path(s) the agent must produce. Examples: /root/mass_report.json, /root/answer.txt. The task's verifier/test_outputs.py scores them.
  • Rubric: a single binary skillsbench_taskmd_reward: 1.0 if the task's verifier writes a non-zero value to /logs/verifier/reward.txt, else 0.0.
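The rubric reduces to a small pure function over the contents of reward.txt. The sketch below is illustrative only: `binary_reward` is a hypothetical helper, and treating a missing, empty, non-numeric, or NaN file as 0.0 is an assumption, since the README only specifies the non-zero case.

```python
import math
from typing import Optional


def binary_reward(reward_text: Optional[str]) -> float:
    """Collapse the verifier's reward.txt content into 1.0 or 0.0."""
    if reward_text is None:  # file was never written (assumed to score 0.0)
        return 0.0
    try:
        value = float(reward_text.strip())
    except ValueError:  # empty or non-numeric content (assumed to score 0.0)
        return 0.0
    if math.isnan(value):  # NaN is not a meaningful score (assumed 0.0)
        return 0.0
    return 1.0 if value != 0.0 else 0.0
```

Under this reading, a verifier that writes a partial score such as 0.5 still yields a reward of 1.0, because any non-zero value passes.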

How it differs from upstream SkillsBench

| Concern | Upstream SkillsBench (bench eval) | This port (prime eval run) |
| --- | --- | --- |
| Sandbox | Daytona, per-task built Docker image | One shared image (ubuntu:24.04) + per-task setup script |
| Workdir | /root (Dockerfile WORKDIR /root) | /root (set on sandbox config) |
| Skills mount | /root/.claude/skills/<skill>/ | Same: /root/.claude/skills/<skill>/ |
| Agent | claude-agent-acp, etc. | verifiers.v1.OpenCode by default; Pi, MiniSWEAgent, base harness all wireable |
| Scoring | runs verifier/test.sh, reads /logs/verifier/reward.txt | skillsbench_taskmd_reward does exactly the same |

Because the HarborTaskset in stock verifiers (0.1.14) only parses the old task.toml layout, this env ships a native task.md loader. It parses the YAML frontmatter (sandbox sizing from environment.memory_mb / storage_mb / cpus, timeouts from verifier/agent), uses the markdown body as the instruction, and scores via the v1.1 verifier/ protocol: it mounts oracle/ at /oracle and verifier/ at /verifier, then runs bash /verifier/test.sh.
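The first step of such a loader, separating frontmatter from the instruction body, can be sketched with the standard library alone. This is not the env's actual loader: `split_task_md` is a hypothetical name, the sketch assumes the frontmatter is delimited by `---` lines (the usual YAML frontmatter convention), and the sample values are made up.

```python
from typing import Tuple


def split_task_md(text: str) -> Tuple[str, str]:
    """Return (frontmatter_yaml, instruction_body) from a task.md string."""
    lines = text.splitlines()
    # Assumption: the file opens with a '---' fence and closes it later.
    if not lines or lines[0].strip() != "---":
        raise ValueError("task.md must start with a '---' frontmatter fence")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter fence is never closed")


# Hypothetical task.md content, for illustration only.
sample = """---
schema_version: "1.3"
environment:
  memory_mb: 4096
---
Write the answer to /root/answer.txt.
"""
frontmatter, instruction = split_task_md(sample)
print(instruction)
```

The returned frontmatter string would then go to a YAML parser (for example PyYAML's `yaml.safe_load`) to read the environment, verifier, and agent sections.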

The port intentionally trades per-task Dockerfiles for a shared base image plus a setup hook. For a task that needs a different image (e.g. CUDA, a locked compiler version), set docker_image under environment: in its task.md frontmatter — the loader honors it.
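The image override amounts to a lookup with a fallback. The sketch below shows one way to express it, assuming the frontmatter has already been parsed into a dict; `resolve_image` is a hypothetical helper and the override tag in the usage line is a placeholder, not a real task's setting.

```python
DEFAULT_IMAGE = "ubuntu:24.04"  # shared base image named in this README


def resolve_image(frontmatter: dict, default: str = DEFAULT_IMAGE) -> str:
    """Use environment.docker_image when the task sets it, else the default."""
    # `or {}` also covers an `environment:` key that is present but empty.
    environment = frontmatter.get("environment") or {}
    return environment.get("docker_image") or default


# Placeholder override, for illustration only.
print(resolve_image({"environment": {"docker_image": "example/cuda-task:latest"}}))
print(resolve_image({}))
```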

Quickstart

# Install
prime env install benchflow/skillsbench      # or `prime env install .` after cloning

# List the tasks the env can run
python -c "from skillsbench import SkillsBenchTaskset; \
print(len(SkillsBenchTaskset().load_rows()))"

# Run one task end-to-end with the default OpenCode harness
# (Verified working: Prime CLI + anthropic/claude-haiku-4.5 via -p prime)
prime eval run skillsbench \
  -m anthropic/claude-haiku-4.5 -p prime \
  -n 1 -r 1 --max-concurrent 1 --timeout 1800 \
  -a '{"config":{"taskset":{"task_names":["dialogue-parser"]}}}'

# Run a curated subset
prime eval run skillsbench \
  -m anthropic/claude-sonnet-4.6 -p prime \
  -n 5 -r 1 --max-concurrent 4 \
  -a '{"config":{"taskset":{"task_names":["3d-scan-calc","edit-pdf","dialogue-parser","exam-block-sequencing","find-topk-similiar-chemicals"]}}}'
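Hand-writing the nested JSON for -a is error-prone once the task list grows. A small helper can build it instead; this is a sketch only, `taskset_args` is a hypothetical name, and it assumes extras and with_skills nest under config.taskset in the same way task_names does in the commands above.

```python
import json
import shlex


def taskset_args(task_names=None, extras=False, with_skills=True) -> str:
    """Build the JSON string passed to `prime eval run ... -a`."""
    taskset = {"extras": extras, "with_skills": with_skills}
    if task_names is not None:  # omit the key to run the full grid
        taskset["task_names"] = list(task_names)
    return json.dumps({"config": {"taskset": taskset}})


# Print a shell-safe command for the without-skills half of a paired sweep.
arg = taskset_args(["dialogue-parser"], with_skills=False)
print("prime eval run skillsbench -a " + shlex.quote(arg))
```

`shlex.quote` wraps the JSON in single quotes, so the printed command can be pasted into a POSIX shell as-is.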

Environment Arguments

load_environment takes the standard vf.EnvConfig envelope. The taskset and harness child configs accept:

| Section | Arg | Type | Default | Description |
| --- | --- | --- | --- | --- |
| taskset | task_names | list[str] \| None | None (all 87) | Subset filter |
| taskset | extras | bool | false | Merge the 14 tasks-extra/ tasks into the grid |
| taskset | docker_image | str | ubuntu:24.04 | Image used when a task's task.md does not override |
| taskset | workdir | str | /root | Sandbox workdir; also where instruction.md / data is uploaded |
| taskset | skills_remote_dir | str | /root/.claude/skills | Where the per-task environment/skills/ tree is mounted |
| taskset | with_skills | bool | true | Set false for the without-skills half of a paired sweep |
| taskset | apt_packages | list[str] | python3, pip, ripgrep, jq, … | Installed once per rollout via apt-get install |
| taskset | timeout_minutes | int | 480 | Sandbox lifetime (8h) |
| taskset | agent_timeout_seconds | float | 7200 | Per-command timeout for the agent. Foreground HTTP caps at 900s, so anything above 600s is auto-routed through start_background_job (polled, ~24h cap). |
| taskset | verifier_timeout_seconds | float | 1800 | Verifier script timeout (also background-routed) |
| harness | max_turns | int | 60 | OpenCode turn budget |
| harness | system_prompt | str | SkillsBench prompt | Override the system prompt |
| harness | disabled_tools | list[str] | narrow list | OpenCode tool gating |
| harness | agent_workdir | str | /root | Where OpenCode cds before running |
| harness | install_ripgrep | bool | false | Skipped because our apt setup installs ripgrep |
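The timeout routing rule in the table can be stated as a one-line predicate. The sketch below only restates the 600s threshold from the description; `execution_route` is a hypothetical name and the real dispatch logic inside the env may differ.

```python
# Threshold from the agent_timeout_seconds description: commands with a
# timeout above 600s are routed through a polled background job.
BACKGROUND_THRESHOLD_SECONDS = 600.0


def execution_route(timeout_seconds: float) -> str:
    """Return 'background' or 'foreground' for a given command timeout."""
    if timeout_seconds > BACKGROUND_THRESHOLD_SECONDS:
        return "background"
    return "foreground"


# Both documented defaults sit above the threshold.
for name, seconds in [("agent", 7200.0), ("verifier", 1800.0)]:
    print(name, execution_route(seconds))
```

Lowering a timeout to 600s or below would keep that command on the foreground path, which the table says is capped at 900s.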

Metrics

| Metric | Meaning |
| --- | --- |
| reward | 1.0 iff the task's verifier/test.sh wrote a non-zero value to /logs/verifier/reward.txt, else 0.0 |
| skillsbench_tests (state) | dict with returncode, stdout, stderr from the verifier, useful for debugging failed rollouts |
| skillsbench_error (state) | exception string if verifier upload/exec itself failed |
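These fields are enough to triage a batch of rollouts. The sketch below assumes each rollout has been exported as a dict with `task`, `reward`, and `state` keys; that record shape and the `summarize` helper are hypothetical, since this README does not specify how results are stored.

```python
from typing import Dict, List, Tuple


def summarize(rollouts: List[dict]) -> Tuple[float, List[Dict]]:
    """Return (pass_rate, failure_details) for a list of rollout records."""
    failures: List[Dict] = []
    passed = 0
    for record in rollouts:
        if record.get("reward") == 1.0:
            passed += 1
            continue
        state = record.get("state") or {}
        tests = state.get("skillsbench_tests") or {}
        failures.append({
            "task": record.get("task"),
            # Set when verifier upload/exec itself failed.
            "error": state.get("skillsbench_error"),
            "returncode": tests.get("returncode"),
            # Keep only the tail of stderr to stay readable.
            "stderr_tail": (tests.get("stderr") or "")[-500:],
        })
    pass_rate = passed / len(rollouts) if rollouts else 0.0
    return pass_rate, failures
```

Separating `error` from `returncode` distinguishes infrastructure failures (the verifier never ran) from genuine task failures (the verifier ran and the tests failed).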

Caveats

  • Single base image. Tasks that hard-depend on a custom base image (e.g. a specific CUDA/compiler image, or compose-based tasks) will not match upstream behavior unless their task.md frontmatter sets environment.docker_image.
  • Network on by default. SkillsBench setup steps run apt-get / pip install, so the port sets network_access = True for every rollout.
  • verifier/test.sh runs inside the same sandbox. skillsbench_taskmd_reward uploads oracle/ to /oracle and verifier/ to /verifier from the host into the live sandbox only at score time, so the agent cannot peek at the oracle or the verifier during its rollout.
  • No bench-specific judge agents. SkillsBench's bench eval wrapper provides trajectory inspection and skill-coverage metrics; this port stays inside the verifiers.v1 contract and emits the single binary reward.