hud-browser-2048
Overview
- Environment ID: hud-browser-2048
- Short description: Browser-based 2048 game for evaluating agents using visual observations and keyboard actions
- Tags: game, browser, multimodal, CUA, 2048
Datasets
- Primary dataset(s): Built-in tasks with varying difficulty (target tiles from 64 to 256)
- Source links: Generated programmatically in environment loader
- Split sizes: 3 tasks (expandable)
Task
- Type: multi-turn, tool use
- Parser: ToolXMLParser with action validation
- Rubric overview: Task completion (80%), format compliance (10%), tool execution (10%)
Quickstart
Run an evaluation with default settings:
uv run vf-eval hud-browser-2048
Configure model and sampling:
uv run vf-eval hud-browser-2048 \
-m gpt-4.1-mini \
-n 1 -r 3 \
-a '{"max_turns": 150}' # env-specific args as JSON
Notes:
- Use -a/--env-args to pass environment-specific configuration as a JSON object.
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| max_turns | int | 100 | Maximum moves allowed per game |
| system_prompt | str | (see config) | Override agent instructions |
| config_path | str | ./config.yaml | Path to alternative config file |
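
The arguments in the table are passed to vf-eval as a single JSON object, which is easy to misquote by hand. The sketch below assembles the Quickstart command from a Python dict. The helper name build_eval_command and its parameter names are illustrative and not part of the environment; the flags are only those shown in the Quickstart.

```python
import json
import shlex


def build_eval_command(env_id, model, n, r, env_args):
    """Return the vf-eval invocation as an argument list.

    n and r are passed through as -n / -r exactly as in the Quickstart;
    env_args is serialized to the JSON object expected by -a/--env-args.
    """
    return [
        "uv", "run", "vf-eval", env_id,
        "-m", model,
        "-n", str(n),
        "-r", str(r),
        "-a", json.dumps(env_args),
    ]


cmd = build_eval_command(
    "hud-browser-2048", "gpt-4.1-mini", 1, 3, {"max_turns": 150}
)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the argument list first and quoting it at the end keeps the JSON intact regardless of which keys from the table are set.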
Metrics
| Metric | Meaning |
|---|---|
| reward | Main scalar reward (weighted sum of task completion, format compliance, tool execution) |
| task_completion | Logarithmic scale based on highest tile vs target (min(1.0, log(highest)/log(target))) |
| tool_execution | Ratio of successful tool calls |
| format_compliance | Score for correct XML formatting and action syntax |
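
As a rough illustration of how these metrics combine, the sketch below computes task_completion from the formula in the table and forms the weighted sum using the 80/10/10 split from the rubric overview. The function names and the input validation are assumptions made for this example; the environment's actual implementation may differ.

```python
import math

# Weights from the rubric overview: 80% / 10% / 10%.
WEIGHTS = {
    "task_completion": 0.8,
    "format_compliance": 0.1,
    "tool_execution": 0.1,
}


def task_completion(highest_tile, target_tile):
    """min(1.0, log(highest) / log(target)), as given in the metrics table."""
    if highest_tile < 1 or target_tile <= 1:
        # Guard added for this sketch: avoids log(0) and division by log(1) == 0.
        raise ValueError("highest_tile must be >= 1 and target_tile must be > 1")
    return min(1.0, math.log(highest_tile) / math.log(target_tile))


def reward(highest_tile, target_tile, format_compliance, tool_execution):
    """Weighted sum of the three component scores."""
    scores = {
        "task_completion": task_completion(highest_tile, target_tile),
        "format_compliance": format_compliance,
        "tool_execution": tool_execution,
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Reaching 32 against a target of 64 earns 5/6 of the completion score,
# because the scale is logarithmic: log(32) / log(64) = 5 / 6.
print(task_completion(32, 64))
print(reward(32, 64, format_compliance=1.0, tool_execution=1.0))
```

Because the scale is logarithmic, each doubling of the highest tile adds the same increment, so an agent that stalls one merge short of the target still receives most of the completion score.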
How It Works
This environment uses hud-vf-gym to connect the browser-based 2048 game with the Verifiers RL framework:
- Browser Control:
  - Uses hudevals/hud-browser Docker image with Playwright
  - Takes screenshots to observe game state
  - Sends keyboard inputs (arrow keys) to play
- Action Mapping:
  - Agent calls screenshot() to see the game
  - Agent calls up(), down(), left(), right() for moves
  - Maps to environment's computer tool with appropriate key presses
- Task Flow:
  - Setup: Launches 2048 web app in browser
  - Play: Agent uses screenshots and arrow keys
  - Evaluate: Checks if target tile was reached
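
The action mapping described above can be sketched as a small lookup from agent-facing action names to calls on the computer tool. This is only an illustration: the payload shape (the "action" and "keys" fields) and the function name map_action are hypothetical, and the real schema is defined by hud-vf-gym and the config file. The key names follow Playwright's keyboard naming (ArrowUp, ArrowDown, ArrowLeft, ArrowRight).

```python
# Agent-facing move names mapped to Playwright-style key names.
ACTION_TO_KEY = {
    "up": "ArrowUp",
    "down": "ArrowDown",
    "left": "ArrowLeft",
    "right": "ArrowRight",
}


def map_action(name):
    """Translate an agent action into a call on the computer tool.

    The returned dict layout is an assumption for illustration only.
    """
    if name == "screenshot":
        return {"tool": "computer", "arguments": {"action": "screenshot"}}
    if name in ACTION_TO_KEY:
        return {
            "tool": "computer",
            "arguments": {"action": "press", "keys": [ACTION_TO_KEY[name]]},
        }
    # Anything else is rejected, mirroring the parser's action validation.
    raise ValueError(f"unknown action: {name!r}")


print(map_action("left"))
```

Keeping the agent's vocabulary to five named actions means the model never has to produce raw key codes, and invalid actions can be rejected before they reach the browser.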
The browser environment demonstrates HUD MCP's ability to wrap web applications as RL environments, enabling agents to learn from visual observations and keyboard interactions.