minigrid_adapted Adapter
Overview
- Environment ID: minigrid_adapted
- Short description: Multi-turn, multi-action verifier adapter for Farama MiniGrid with coordinate-based observations and game-specific milestone shaping rewards for GRPO training.
- Tags: rl, minigrid, gymnasium, verifiers, grpo, multi-action
This environment wraps Farama MiniGrid environments as verifiers-compatible tasks for Prime RL. Key features:
- Multi-action generation: Each LLM turn can specify up to 20 actions executed sequentially, enabling better temporal abstraction and credit assignment
- Coordinate-based observations: Global (x, y) coordinates with origin at bottom-left, X→right, Y→up
- Game-specific milestone rewards: Tailored shaping for LockedRoom, ObstructedMaze, and LavaGap environments
- Curriculum support: Progressive difficulty ramping across environments
Datasets
- Primary dataset: Synthetic prompts sampled from MiniGrid resets across supported environments
- Default environments: MiniGrid-ObstructedMaze-Full-v1, MiniGrid-LavaGapS7-v0, MiniGrid-LockedRoom-v0
- Source links: MiniGrid Docs
- Split sizes: Configurable via num_examples (default 12)
Task
- Type: Multi-turn control loop with multi-action generation
- Parser: vf.XMLParser(fields=["think", "actions"], answer_field="actions")
- Rubric overview:
  - minigrid_native_reward_func (weight 7.5): Native MiniGrid reward (0.1–1.0 for wins, efficiency-based)
  - minigrid_milestone_reward_func (weight 0.5, optional): Game-specific shaping rewards
  - Parser format reward (weight 0.2): Enforces <THINK>/<ACTIONS> XML structure
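To make the weighting concrete, here is a minimal sketch of how the three rubric terms could combine, assuming the rubric scores a rollout as a weighted sum of its reward functions. The weights are the defaults listed in this README; the function name `combined_reward` and its arguments are hypothetical and not part of the adapter.

```python
# Default weights as listed in this README.
WEIGHTS = {"native": 7.5, "milestone": 0.5, "format": 0.2}


def combined_reward(native, milestone=0.0, format_score=0.0,
                    use_milestone_rewards=False):
    """Weighted sum of the rubric terms (assumed combination rule)."""
    total = WEIGHTS["native"] * native + WEIGHTS["format"] * format_score
    if use_milestone_rewards:
        # The milestone term only contributes when shaping is enabled.
        total += WEIGHTS["milestone"] * milestone
    return total


# A fast LockedRoom win with all milestones and a well-formed reply.
print(combined_reward(0.8, milestone=1.6, format_score=1.0,
                      use_milestone_rewards=True))
```

With these weights the native reward dominates: a perfect milestone score on LockedRoom adds at most 0.5 × 1.6 = 0.8, while the native term can contribute up to 7.5.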
Multi-Action Format
The agent outputs multiple actions per turn in an <ACTIONS> block:
<THINK>Plan the sequence of actions</THINK>
<ACTIONS>
rotate(clockwise, 2)
forward
forward
pickup
toggle
</ACTIONS>
Available actions:
- rotate(clockwise, N) / rotate(anticlockwise, N): Rotate N times (90° each)
- forward: Move one tile in facing direction
- pickup: Pick up object in front tile
- drop: Drop held object in front tile
- toggle: Open/close doors or boxes
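The format above could be expanded into primitive MiniGrid actions roughly as follows. This is a sketch, not the adapter's actual parser: `parse_actions` is a hypothetical helper, and it assumes that each 90° rotation counts as one action toward the 20-action cap and that unknown lines are rejected. The integer codes follow MiniGrid's standard action enum (left=0, right=1, forward=2, pickup=3, drop=4, toggle=5).

```python
import re

MAX_ACTIONS_PER_TURN = 20  # per-turn cap stated in this README

# MiniGrid's discrete action indices.
LEFT, RIGHT, FORWARD, PICKUP, DROP, TOGGLE = 0, 1, 2, 3, 4, 5
SIMPLE = {"forward": FORWARD, "pickup": PICKUP, "drop": DROP, "toggle": TOGGLE}

BLOCK = re.compile(r"<ACTIONS>(.*?)</ACTIONS>", re.DOTALL)
ROTATE = re.compile(r"rotate\(\s*(clockwise|anticlockwise)\s*,\s*(\d+)\s*\)")


def parse_actions(completion: str) -> list[int]:
    """Expand an <ACTIONS> block into a list of primitive action codes."""
    match = BLOCK.search(completion)
    if match is None:
        return []
    actions: list[int] = []
    for line in match.group(1).splitlines():
        line = line.strip()
        if not line:
            continue
        rot = ROTATE.fullmatch(line)
        if rot:
            # A clockwise quarter turn is MiniGrid's "right" action.
            turn = RIGHT if rot.group(1) == "clockwise" else LEFT
            actions.extend([turn] * int(rot.group(2)))
        elif line in SIMPLE:
            actions.append(SIMPLE[line])
        else:
            raise ValueError(f"unknown action: {line!r}")
    # Assumption: anything beyond the cap is silently dropped.
    return actions[:MAX_ACTIONS_PER_TURN]
```

Applied to the example turn shown earlier, this yields two right turns, two forward moves, a pickup, and a toggle.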
Supported Environments
| Environment | Objective | Milestone Rewards (max) |
|---|---|---|
| MiniGrid-LockedRoom-v0 | Get specified key, unlock specified door, reach goal | 1.6 (mission-aware) |
| MiniGrid-ObstructedMaze-Full-v1 | Find and pick up blue ball in maze | 2.5 (box/key/door/ball) |
| MiniGrid-LavaGapS5-v0 | Navigate through lava gap to goal | Distance shaping + gap bonus |
| MiniGrid-LavaGapS6-v0 | Navigate through lava gap to goal | Distance shaping + gap bonus |
| MiniGrid-LavaGapS7-v0 | Navigate through lava gap to goal | Distance shaping + gap bonus |
Quickstart
Run with defaults (all three default environments):
uv run vf-eval minigrid_adapted
Run a single environment:
# LavaGap only (easiest)
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-LavaGapS7-v0"]}'
# LockedRoom only
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-LockedRoom-v0"]}'
# ObstructedMaze only (hardest)
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-ObstructedMaze-Full-v1"]}'
Mix and match environments:
# Just the two simpler environments
uv run vf-eval minigrid_adapted \
-a '{"env_ids": ["MiniGrid-LavaGapS7-v0", "MiniGrid-LockedRoom-v0"]}'
# All LavaGap sizes
uv run vf-eval minigrid_adapted \
-a '{"env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"]}'
# Custom combination with milestone rewards
uv run vf-eval minigrid_adapted \
-m gpt-4.1-mini \
-n 32 -r 3 \
-a '{
"env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LockedRoom-v0"],
"use_milestone_rewards": true
}'
Progressive curriculum with LavaGap (starts with the easiest size, then gradually adds the harder ones):
# Start with S5 (5x5), progressively add S6, then S7 (7x7)
uv run vf-eval minigrid_adapted \
-m gpt-4.1-mini \
-n 50 -r 5 -t 2048 -T 0.7 \
-a '{
"env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"],
"curriculum_mode": "progressive",
"curriculum_warmup_episodes": 100,
"use_milestone_rewards": true
}'
In progressive mode, env_ids order matters: the first environment is used exclusively at the start, then additional environments are gradually introduced over curriculum_warmup_episodes.
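One way to picture the progressive schedule is the sketch below. It assumes environments unlock linearly over the warmup window; the adapter's actual schedule may differ, and `available_envs` and `sample_env` are hypothetical names used only for illustration.

```python
import random


def available_envs(env_ids, episode, warmup_episodes):
    """Environments unlocked at a given episode (assumed linear ramp)."""
    if warmup_episodes <= 0:
        return list(env_ids)
    progress = min(1.0, episode / warmup_episodes)
    # Always keep the first env; unlock the rest in order as progress grows.
    count = 1 + int(progress * (len(env_ids) - 1))
    return list(env_ids[:count])


def sample_env(env_ids, episode, warmup_episodes, mode="progressive", rng=random):
    """Pick an environment ID for one episode."""
    if mode == "uniform":
        pool = list(env_ids)
    else:
        pool = available_envs(env_ids, episode, warmup_episodes)
    return rng.choice(pool)


ids = ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"]
for episode in (0, 50, 100):
    print(episode, available_envs(ids, episode, 100))
```

Under this assumed ramp, only S5 is sampled at episode 0, S6 joins halfway through the warmup, and S7 becomes available once the warmup completes.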
Notes:
- Use -a / --env-args for JSON kwargs forwarded to load_environment()
- Reports land in ./environments/minigrid_adapted/reports/
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| env_ids | list[str] | ["MiniGrid-ObstructedMaze-Full-v1", "MiniGrid-LavaGapS7-v0", "MiniGrid-LockedRoom-v0"] | MiniGrid environment IDs. Order matters for progressive curriculum (easy→hard). |
| max_turns | int | 20 | Maximum LLM generations before timeout. Total step budget = max_turns × 20. |
| num_examples | int | 12 | Number of prompt seeds in the dataset. |
| seed | int or null | null | RNG seed for reproducibility. |
| curriculum_mode | str | "uniform" | "uniform" (random) or "progressive" (start with first env, gradually add more). |
| curriculum_warmup_episodes | int | 100 | Episodes before all envs available (progressive mode only). |
| use_milestone_rewards | bool | false | Enable game-specific milestone shaping. Recommended for smaller models (≤7B). |
| milestone_reward_weight | float | 0.5 | Weight for milestone reward function. |
| native_reward_weight | float | 7.5 | Weight for native MiniGrid reward. |
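When the argument set grows, it can be easier to build the JSON for -a programmatically than to hand-quote it in the shell. The sketch below uses only the Python standard library; the argument names come from the table above, while the chosen values are arbitrary examples.

```python
import json
import shlex

MAX_ACTIONS_PER_TURN = 20  # per-turn action cap stated in this README

env_args = {
    "env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LockedRoom-v0"],
    "max_turns": 10,
    "seed": 0,
    "use_milestone_rewards": True,
}

# Total step budget = max_turns x 20, as described in the table.
step_budget = env_args["max_turns"] * MAX_ACTIONS_PER_TURN

# shlex.quote wraps the JSON so the shell passes it as a single argument.
command = "uv run vf-eval minigrid_adapted -a " + shlex.quote(json.dumps(env_args))

print(step_budget)
print(command)
```

Note that json.dumps writes Python's True as JSON true, which is the form the -a flag expects.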
Observation Format
Each observation includes:
<OBS id=N>
Grid: WxH (origin at bottom-left, X→right, Y→up)
You: (x,y) facing direction, holding [item] or hands empty
In front (x,y): [object description] or empty or BLOCKED
Mission: [mission text if applicable]
Visible objects:
- [color] [type] at (x,y)
- ...
</OBS>
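MiniGrid itself stores positions with the origin at the top-left and Y increasing downward, so the bottom-left convention used in these observations implies a vertical flip. A minimal sketch of that conversion follows, assuming 0-indexed coordinates; `to_bottom_left` and `front_tile` are hypothetical helpers, and the direction codes are MiniGrid's (0=right, 1=down, 2=left, 3=up on screen).

```python
def to_bottom_left(x, y_top, height):
    """Convert a MiniGrid (x, y) position to a bottom-left-origin position."""
    return x, height - 1 - y_top


# Unit step for each MiniGrid direction code, expressed in the Y-up frame.
DIR_TO_DELTA = {
    0: (1, 0),    # right
    1: (0, -1),   # down on screen, so Y decreases
    2: (-1, 0),   # left
    3: (0, 1),    # up on screen, so Y increases
}


def front_tile(x, y, direction):
    """Tile directly in front of the agent, in bottom-left coordinates."""
    dx, dy = DIR_TO_DELTA[direction]
    return x + dx, y + dy


# Agent at MiniGrid position (1, 1) in a 7x7 grid, facing down.
x, y = to_bottom_left(1, 1, 7)
print((x, y), front_tile(x, y, 1))
```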
Metrics
| Metric | Meaning |
|---|---|
| native_reward | MiniGrid's built-in reward (0.1–1.0 for wins, higher = faster completion) |
| milestone_reward_total | Cumulative game-specific shaping reward |
| result | Final outcome: success, timeout, failure, or invalid |
| steps | Total MiniGrid environment steps executed |
| turns | Total LLM generations used |
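The 0.1–1.0 range for native_reward follows from MiniGrid's built-in success reward, 1 − 0.9 × (step_count / max_steps). The sketch below reproduces that formula; the function name is illustrative, and how the adapter's own turn limit interacts with MiniGrid's max_steps is not specified here.

```python
def native_reward(step_count, max_steps, success):
    """MiniGrid-style success reward; zero when the episode is not won."""
    if not success:
        return 0.0
    return 1.0 - 0.9 * (step_count / max_steps)


# Faster wins score higher; a win on the very last step scores about 0.1.
for steps in (10, 50, 100):
    print(steps, native_reward(steps, 100, success=True))
```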
Milestone Reward Details
LockedRoom (max 1.6):
- Enter key room: +0.4
- Pick up correct key: +0.6
- Unlock correct door: +0.6
ObstructedMaze (max 2.5):
- Open first box: +0.5
- Pick up first key: +0.5
- Open first door: +0.5
- Pick up blue ball (goal): +1.0
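The stated maxima (1.6 and 2.5) equal the sums of the listed bonuses, which suggests each milestone pays out at most once per episode. A minimal sketch under that assumption follows; the `MilestoneTracker` class and the event names are hypothetical, not the adapter's own identifiers.

```python
# Bonus values from this README; the event names are illustrative.
LOCKED_ROOM = {
    "enter_key_room": 0.4,
    "pickup_correct_key": 0.6,
    "unlock_correct_door": 0.6,
}
OBSTRUCTED_MAZE = {
    "open_first_box": 0.5,
    "pickup_first_key": 0.5,
    "open_first_door": 0.5,
    "pickup_blue_ball": 1.0,
}


class MilestoneTracker:
    """Pays each milestone bonus once per episode (assumed behavior)."""

    def __init__(self, milestones):
        self.milestones = dict(milestones)
        self.achieved = set()

    def award(self, name):
        if name in self.achieved:
            return 0.0  # repeated events earn nothing
        self.achieved.add(name)
        return self.milestones[name]

    @property
    def total(self):
        return sum(self.milestones[name] for name in self.achieved)


tracker = MilestoneTracker(LOCKED_ROOM)
for event in ("enter_key_room", "enter_key_room", "pickup_correct_key"):
    tracker.award(event)
print(tracker.total)
```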
LavaGap (potential-based):
- Distance reduction to goal: +0.15 per unit of Manhattan distance gained
- Distance increase: -0.1 (mild backtrack penalty)
- Pass through lava gap: +0.4
- Exploration: +0.03 per new tile
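The LavaGap terms can be read as a per-step shaping function. The sketch below is one interpretation, not the adapter's code: it assumes a flat −0.1 penalty whenever the distance increases, that the caller detects the gap crossing and passes it in as a flag, and that exploration is tracked with a set of visited tiles. `lavagap_shaping` is a hypothetical name.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lavagap_shaping(prev_pos, pos, goal, visited, passed_gap_now=False):
    """Shaping reward for one step, using the values listed above."""
    reward = 0.0
    gained = manhattan(prev_pos, goal) - manhattan(pos, goal)
    if gained > 0:
        reward += 0.15 * gained   # moved closer to the goal
    elif gained < 0:
        reward -= 0.1             # mild backtrack penalty (assumed flat)
    if passed_gap_now:
        reward += 0.4             # one-time bonus, signalled by the caller
    if pos not in visited:
        visited.add(pos)
        reward += 0.03            # exploration bonus for a new tile
    return reward


visited = {(1, 1)}
step_forward = lavagap_shaping((1, 1), (2, 1), (5, 5), visited)
step_back = lavagap_shaping((2, 1), (1, 1), (5, 5), visited)
print(step_forward, step_back)
```

Because a single forward action moves the agent one tile, the distance changes by at most one unit per step, so a flat penalty and a per-unit penalty coincide in this setting.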
Evaluation Reports
No reports found. Run uv run vf-eval minigrid_adapted to generate one.