Openenv Tbench 2 RL Env (Community)

OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks.

Type: RL Env
License: bsd-3-clause
Published: Jan 2026

TB2 Environment (Terminal-Bench 2)

OpenEnv wrapper for Terminal-Bench 2 tasks. Supports two execution modes:

| Mode | Description | Use Case |
| --- | --- | --- |
| Local | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| Docker | Runs each task in its own container | Full TB2.0 fidelity with custom task images |

Quick Start

from tbench2_env import Tbench2Env, Tbench2Action

env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)

result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)

result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)

env.close()
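
The same reset, exec, evaluate flow can be wrapped in a small helper so the client is closed even when a step raises. This is a sketch rather than part of the package: `run_episode` is a hypothetical name, and the action class is passed in as `make_action` (in practice `Tbench2Action`) so the helper relies only on the calls shown above.

```python
from typing import Any, Callable, Iterable, Optional


def run_episode(
    env: Any,
    make_action: Callable[..., Any],
    task_id: str,
    commands: Iterable[str],
) -> Optional[float]:
    """Reset a task, run shell commands in order, then evaluate.

    Hypothetical helper: `env` is expected to behave like Tbench2Env and
    `make_action` like Tbench2Action, using only the Quick Start calls.
    """
    try:
        env.reset(task_id=task_id)
        for command in commands:
            result = env.step(make_action(action_type="exec", command=command))
            if result.done:
                # The episode ended early; return whatever reward came back.
                return result.reward
        result = env.step(make_action(action_type="evaluate"))
        return result.reward
    finally:
        # Always release the client, even if a step raised.
        env.close()
```

A call might look like `run_episode(Tbench2Env(base_url="http://localhost:8000"), Tbench2Action, "headless-terminal", ["ls -la"])`.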

Building the Docker Image

Before using the environment, build the Docker image:

# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .

Environment Details

Action

Tbench2Action: Controls interaction with the TB2 task session

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| action_type | str | "exec" | Action to perform (exec, write, view, wait, kill, write_file, evaluate, close) |
| command | str | "" | Shell command or input to send |
| session_id | str \| None | None | Session ID for streaming processes |
| block | bool | True | Whether to block until command completes |
| wait_seconds | float \| None | None | Time to wait (for wait action) |
| file_path | str | "" | File path (for write_file action) |
| content | str | "" | Content to write (for write_file action) |

Observation

Tbench2Observation: Contains the environment response

| Field | Type | Description |
| --- | --- | --- |
| instruction | str | Task instruction/prompt from the TB2 task |
| output | str | Command output (stdout/stderr) |
| success | bool | Whether the action succeeded |
| error | str | Error message if action failed |
| task_id | str | Current task identifier |
| task_path | str | Path to the task directory |
| session_id | str \| None | Session ID for streaming processes |
| action_type | str | The action type that produced this observation |
| info | dict | Additional metadata |

State

Tbench2State: Server-side state for the task session

| Field | Type | Description |
| --- | --- | --- |
| task_id | str | Current task identifier |
| task_path | str | Path to the task directory |
| session_id | str | Active session ID |
| terminal_ready | bool | Whether the terminal is ready for commands |
| last_action_type | str | Last action type executed |
| last_command | str | Last command executed |
| last_output | str | Output from last command |

Execution Modes

Local Mode (Default)

Commands execute directly in the server process. This mode suits Hugging Face Spaces, where Docker-in-Docker is unavailable.

# Default - local mode
python -m tbench2_env.server.app

# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app

Note: Local mode ignores Docker images specified in task.toml. Tasks requiring specific runtime environments may fail.

Docker Mode

Each task runs in its own Docker container, using the image specified in the task's task.toml:

# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app

Requirements:

  • Docker socket mounted at /var/run/docker.sock
  • Sufficient disk space for container images
  • Network access to pull images if not cached

Environment Variables for Docker Mode:

  • TB2_MODE=docker - Enable Docker-backed execution

The Docker socket must also be accessible to the server process, for example as a mounted volume.
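
Before starting the server with TB2_MODE=docker, it can help to confirm that the socket requirement is met. The sketch below uses only the Python standard library; `docker_socket_available` is a hypothetical helper name, and the check only confirms that a Unix socket exists at the path, not that the Docker daemon answers on it or that the current user may connect.

```python
import os
import stat


def docker_socket_available(path: str = "/var/run/docker.sock") -> bool:
    """Return True if `path` exists and is a Unix socket.

    Hypothetical preflight check; it does not contact the Docker daemon.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path or no permission to stat it.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    if not docker_socket_available():
        print("Docker socket not found; consider running in local mode.")
```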

Action Types

| Action | Description | Required Fields |
| --- | --- | --- |
| exec | Run a shell command | command, optionally block, session_id |
| write | Send input to a running session | session_id, command |
| view | Read pending output | session_id |
| wait | Wait for output | session_id, optionally wait_seconds |
| kill | Terminate a running session | session_id |
| write_file | Write content to a file | file_path, content |
| evaluate | Run pytest tests, return reward | (none) |
| close | Stop and cleanup | (none) |
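
The required-fields column above can be turned into a client-side check that runs before an action is sent. This is a sketch under assumptions: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, the mapping is transcribed from the table, and the server's own validation may differ. Empty strings are treated as unset because they are the documented defaults, which means a deliberately empty `content` for write_file would also be flagged.

```python
from typing import Any, Dict, List, Mapping

# Required fields per action type, transcribed from the table above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "exec": ["command"],
    "write": ["session_id", "command"],
    "view": ["session_id"],
    "wait": ["session_id"],
    "kill": ["session_id"],
    "write_file": ["file_path", "content"],
    "evaluate": [],
    "close": [],
}


def missing_fields(action: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are unset for this action.

    `action` is a plain mapping of Tbench2Action field names to values.
    A field counts as unset when it is absent, None, or an empty string.
    """
    action_type = action.get("action_type", "exec")  # "exec" is the default
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        name
        for name in REQUIRED_FIELDS[action_type]
        if action.get(name) in (None, "")
    ]
```

For example, `missing_fields({"action_type": "write", "command": "x"})` reports that `session_id` still needs to be set.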

Session IDs (Streaming Processes)

session_id is only required when you start a non-blocking process and want to interact with it (write, view, wait, kill). For plain exec commands, you can omit it.

Example (Python):

# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))

# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))

# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
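
Output from a non-blocking process may not be available on the first view, so a client will often poll. The sketch below repeats the view action until a marker string appears. `wait_for_output` is a hypothetical helper, `step` and `make_action` stand in for `env.step` and `Tbench2Action`, and it assumes each view returns only output that has not been read yet.

```python
import time
from typing import Any, Callable


def wait_for_output(
    step: Callable[[Any], Any],
    make_action: Callable[..., Any],
    session_id: str,
    marker: str,
    max_polls: int = 10,
    delay: float = 0.5,
) -> str:
    """Poll a session with `view` until `marker` shows up in its output.

    Returns everything collected so far, whether or not the marker was
    seen within `max_polls` attempts.
    """
    collected = ""
    for _ in range(max_polls):
        result = step(make_action(action_type="view", session_id=session_id))
        collected += result.observation.output
        if marker in collected:
            break
        time.sleep(delay)  # give the process time to produce more output
    return collected
```

With the session started above, `wait_for_output(env.step, Tbench2Action, "sess1", "4")` would keep reading until the interpreter prints the result of `print(2+2)`.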

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| TB2_MODE | local | Execution mode: local or docker |
| TB2_TASKS_DIR | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| TB2_OUTPUT_DIR | /tmp/tbench2_env_runs | Directory for session logs and cache |
| TB2_CACHE_DIR | $TB2_OUTPUT_DIR/repo_cache | Where to extract TB2 repo |
| TB2_REPO_URL | (GitHub main.zip) | Repo zip URL for auto-download |
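
The defaults in the table chain together: TB2_CACHE_DIR falls back to a path under TB2_OUTPUT_DIR. The sketch below shows one way those settings could be resolved from the environment. `resolve_tb2_settings` is a hypothetical function, not the server's actual parsing code, and a value of None stands for the auto-download defaults that the table does not spell out.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_tb2_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Resolve TB2_* settings using the defaults from the table above."""
    env = os.environ if environ is None else environ

    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache directory defaults to a subdirectory of the output directory.
    cache_dir = env.get("TB2_CACHE_DIR", f"{output_dir}/repo_cache")

    return {
        "mode": mode,
        "tasks_dir": env.get("TB2_TASKS_DIR"),  # None: auto-download
        "output_dir": output_dir,
        "cache_dir": cache_dir,
        "repo_url": env.get("TB2_REPO_URL"),  # None: default repo zip
    }
```

Passing a plain dict instead of reading `os.environ` makes the resolution easy to inspect, for example `resolve_tb2_settings({"TB2_OUTPUT_DIR": "/data/runs"})`.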

Reward

The evaluate action returns a binary reward:

  • 1.0 - All pytest tests pass (exit code 0)
  • 0.0 - Tests fail (non-zero exit code)

Intermediate steps return reward=None.
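
The reward rule is small enough to state as a function. This is an illustration of the mapping described above, not the environment's actual code; `reward_for_step` is a hypothetical name.

```python
from typing import Optional


def reward_for_step(
    action_type: str, pytest_exit_code: Optional[int] = None
) -> Optional[float]:
    """Map a step to its reward under the binary scheme described above.

    Only `evaluate` yields a reward: 1.0 when pytest exits with code 0,
    0.0 for any other exit code. All other actions yield None.
    """
    if action_type != "evaluate":
        return None
    return 1.0 if pytest_exit_code == 0 else 0.0
```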

Running the Server

# Install dependencies
uv sync --all-extras

# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000

# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000

# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app

Project Structure

tbench2_env/
├── __init__.py              # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md                # This file
├── client.py                # Tbench2Env client implementation
├── models.py                # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml             # OpenEnv configuration
├── pyproject.toml           # Package dependencies
└── server/
    ├── __init__.py          # Server exports
    ├── app.py               # FastAPI application
    ├── tbench2_env_environment.py  # Core environment logic
    └── Dockerfile           # Container image definition