dummy-harbor-env

Environment ID: dummy-harbor-env
Short description: Minimal Harbor environment for testing the CLI agent interception framework
Tags: dummy, testing, cli-agent, harbor

Type: single-turn (via HarborEnv)
Base class: HarborEnv (extends CliAgentEnv)
Rubric overview:
- The reward is computed by tests/test.sh, which runs pytest on test_state.py.
- The reward is 1.0 if /app/hello.txt contains "Hello, world!" and 0.0 otherwise.
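
The check behind the rubric can be sketched in a few lines of Python. This is an illustration only, not the contents of test_state.py, which this README does not show. The function name `hello_reward` is hypothetical, and treating "contains" as a substring match is an assumption.

```python
from pathlib import Path

EXPECTED = "Hello, world!"  # string named in the rubric


def hello_reward(path: Path = Path("/app/hello.txt")) -> float:
    """Return 1.0 if the file exists and contains the expected text, else 0.0."""
    try:
        content = path.read_text()
    except OSError:
        # A missing or unreadable file counts as a failed task.
        return 0.0
    # Assumption: substring match; the real test may be stricter.
    return 1.0 if EXPECTED in content else 0.0
```

A real pytest test would typically assert on the file content and let the exit code of tests/test.sh determine the reward.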

Run an evaluation with default settings:

prime eval run dummy-harbor-env

Configure model and sampling:

prime eval run dummy-harbor-env -m gpt-4.1-mini -n 1 -r 1

This environment demonstrates the HarborEnv/CliAgentEnv data flow:

1. Harbor Task Loading: Task is loaded from tasks/hello-world/ with task.toml, instruction.md, and tests/
2. Sandbox Creation: A Docker sandbox is created with the task instruction uploaded to /task/
3. Agent Execution: A Python script reads the instruction and makes an OpenAI API call
4. Interception: The API call is intercepted by CliAgentEnv's HTTP proxy server (via Cloudflare tunnel)
5. LLM Response: The LLM returns a bash command to complete the task
6. Execution: The agent executes the command in /app
7. Testing: Harbor's tests/test.sh runs pytest to verify the result
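
The task layout named in the loading and testing steps can be sanity-checked before running an evaluation. The helper below is a hypothetical sketch, not part of HarborEnv. It assumes only the entries mentioned in this README (task.toml, instruction.md, tests/test.sh) are required.

```python
from pathlib import Path

# Entries named in the task-loading and testing steps of this README.
REQUIRED_ENTRIES = ("task.toml", "instruction.md", "tests/test.sh")


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the required entries that are absent from a task directory."""
    return [name for name in REQUIRED_ENTRIES if not (task_dir / name).exists()]
```

Calling `missing_task_entries(Path("tasks/hello-world"))` returns an empty list when all three entries are present.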

The embedded agent script:

For the hello-world task, the LLM should respond with something like:

echo "Hello, world!" > hello.txt
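
The agent loop described in the data flow (read the instruction, ask the model for a command, run it in the working directory) can be sketched as follows. This is a minimal illustration under stated assumptions, not the embedded agent script itself. `run_agent` is a hypothetical name, and the `complete` callable stands in for the OpenAI API call so the sketch runs without network access.

```python
import subprocess
from pathlib import Path
from typing import Callable


def run_agent(
    instruction_path: Path,
    workdir: Path,
    complete: Callable[[str], str],
) -> int:
    """Read the instruction, get one bash command from the model, run it in workdir."""
    instruction = instruction_path.read_text()
    # `complete` is a stand-in for the (intercepted) chat completion request.
    command = complete(instruction).strip()
    # Run the returned command with bash, using workdir as the current directory.
    result = subprocess.run(["bash", "-c", command], cwd=workdir)
    return result.returncode
```

In the real environment the model call would go through an OpenAI client whose requests are intercepted by the proxy, and `workdir` would be the `agent_workdir` argument (`/app` by default).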

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `dataset_path` | `str \| Path` | `./tasks` | Path to Harbor-format tasks directory |
| `tasks` | `list[str] \| None` | `None` | Specific task names to load (None = all) |
| `agent_workdir` | `str` | `/app` | Working directory for agent in sandbox |
| `docker_image` | `str` | `python:3.11-slim` | Docker image for sandbox |
| `timeout_seconds` | `float` | `300.0` | Overall rollout timeout |
| `max_turns` | `int` | `-1` | Max turns (-1 = unlimited) |

| Metric | Meaning |
| --- | --- |
| `reward` | 1.0 if pytest passes (hello.txt correct), 0.0 otherwise |