dummy-harbor-env
Overview
- Environment ID: `dummy-harbor-env`
- Short description: Minimal Harbor environment for testing the CLI agent interception framework
- Tags: dummy, testing, cli-agent, harbor
Datasets
- Primary dataset: Harbor-format tasks in the `tasks/` directory
- Source: Bundled with the environment
- Tasks: 1 dummy task (`hello-world`)
Task
- Type: single-turn (via `HarborEnv`)
- Base class: `HarborEnv` (extends `CliAgentEnv`)
- Rubric overview:
  - Reward is computed by `tests/test.sh`, which runs pytest on `test_state.py`
  - Returns 1.0 if `/app/hello.txt` contains "Hello, world!", 0.0 otherwise
Quickstart
Run an evaluation with default settings:
prime eval run dummy-harbor-env
Configure model and sampling:
prime eval run dummy-harbor-env -m gpt-4.1-mini -n 1 -r 1
How It Works
This environment demonstrates the HarborEnv/CliAgentEnv data flow:
- Harbor Task Loading: The task is loaded from `tasks/hello-world/` with `task.toml`, `instruction.md`, and `tests/`
- Sandbox Creation: A Docker sandbox is created with the task instruction uploaded to `/task/`
- Agent Execution: A Python script reads the instruction and makes an OpenAI API call
- Interception: The API call is intercepted by CliAgentEnv's HTTP proxy server (via Cloudflare tunnel)
- LLM Response: The LLM returns a bash command to complete the task
- Execution: The agent executes the command in `/app`
- Testing: Harbor's `tests/test.sh` runs pytest to verify the result
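
To make the task-loading step more concrete, here is a minimal sketch of reading one Harbor-format task directory and checking that it has the layout listed above. It is not the `HarborEnv` implementation: `HarborTask` and `load_task` are hypothetical names, and `task.toml` is only located, not parsed, since its schema is not described here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class HarborTask:
    """Hypothetical container for one task directory (not a HarborEnv type)."""
    name: str
    instruction: str
    config_path: Path
    tests_dir: Path


def load_task(task_dir: Path) -> HarborTask:
    """Read one task directory and verify the expected files are present."""
    config_path = task_dir / "task.toml"
    instruction_path = task_dir / "instruction.md"
    tests_dir = task_dir / "tests"
    missing = [
        p.name for p in (config_path, instruction_path, tests_dir) if not p.exists()
    ]
    if missing:
        raise FileNotFoundError(f"{task_dir} is missing: {missing}")
    return HarborTask(
        name=task_dir.name,
        instruction=instruction_path.read_text(),
        config_path=config_path,
        tests_dir=tests_dir,
    )
```

For the bundled task, `load_task(Path("tasks/hello-world"))` would return a record whose `instruction` is the text later uploaded to `/task/`.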
Agent Script Details
The embedded agent script:
- Reads the task instruction from `/task/instruction.md`
- Asks the LLM for a bash command to complete the task
- Executes the returned command in `/app`
For the hello-world task, the LLM should respond with something like:
echo "Hello, world!" > hello.txt
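
The embedded script itself is not shown in this README. The sketch below illustrates the same three steps under some assumptions: the LLM call is passed in as a callable (`ask_llm`, a hypothetical parameter) so the example runs without network access, the prompt wording is invented, and the command is run through the default shell rather than an explicit `bash` invocation.

```python
import subprocess
from pathlib import Path
from typing import Callable


def run_agent(
    ask_llm: Callable[[str], str],
    instruction_path: Path = Path("/task/instruction.md"),
    workdir: Path = Path("/app"),
) -> int:
    """Read the instruction, get one shell command from the LLM, run it."""
    instruction = instruction_path.read_text()
    # Illustrative prompt; the real script's wording is not documented here.
    prompt = (
        "Reply with a single bash command that completes this task.\n\n"
        + instruction
    )
    command = ask_llm(prompt).strip()
    # shell=True runs the command via the system shell inside workdir.
    result = subprocess.run(command, shell=True, cwd=workdir)
    return result.returncode
```

In the real environment, `ask_llm` would wrap the OpenAI API call that CliAgentEnv intercepts. Executing model output directly is only reasonable because it happens inside the disposable sandbox.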
Environment Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `dataset_path` | `str \| Path` | `./tasks` | Path to Harbor-format tasks directory |
| `tasks` | `list[str] \| None` | `None` | Specific task names to load (`None` = all) |
| `agent_workdir` | `str` | `/app` | Working directory for agent in sandbox |
| `docker_image` | `str` | `python:3.11-slim` | Docker image for sandbox |
| `timeout_seconds` | `float` | `300.0` | Overall rollout timeout |
| `max_turns` | `int` | `-1` | Max turns (`-1` = unlimited) |
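
The sentinel values in the table (`None` for all tasks, `-1` for unlimited turns) could be interpreted along the following lines. This is only an illustration of the documented semantics; `select_task_dirs` and `turns_exhausted` are hypothetical helpers, not functions exposed by the environment.

```python
from pathlib import Path
from typing import Optional


def select_task_dirs(
    dataset_path: Path, tasks: Optional[list[str]] = None
) -> list[Path]:
    """List task directories under dataset_path, optionally filtered by name."""
    all_dirs = sorted(p for p in Path(dataset_path).iterdir() if p.is_dir())
    if tasks is None:  # None means "load every task"
        return all_dirs
    wanted = set(tasks)
    return [p for p in all_dirs if p.name in wanted]


def turns_exhausted(turns_taken: int, max_turns: int = -1) -> bool:
    """Treat -1 as "no limit"; otherwise stop once the limit is reached."""
    return max_turns != -1 and turns_taken >= max_turns
```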
Metrics
| Metric | Meaning |
|---|---|
| `reward` | 1.0 if pytest passes (`hello.txt` correct), 0.0 otherwise |