minigrid_adapted Adapter

Overview

Environment ID: minigrid_adapted
Short description: Multi-turn, multi-action verifier adapter for Farama MiniGrid with coordinate-based observations and game-specific milestone shaping rewards for GRPO training.
Tags: rl, minigrid, gymnasium, verifiers, grpo, multi-action

This environment wraps Farama MiniGrid environments as verifiers-compatible tasks for Prime RL. Key features:

- Multi-action generation: Each LLM turn can specify up to 20 actions, executed sequentially, enabling better temporal abstraction and credit assignment
- Coordinate-based observations: Global (x, y) coordinates with origin at bottom-left, X→right, Y→up
- Game-specific milestone rewards: Tailored shaping for LockedRoom, ObstructedMaze, and LavaGap environments
- Curriculum support: Progressive difficulty ramping across environments
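
MiniGrid itself indexes grid cells from the top-left corner with y increasing downward, so reporting bottom-left-origin coordinates implies a vertical flip. The following is a minimal sketch of that conversion, assuming a 0-indexed grid. The helper names `to_bottom_left` and `flip_direction` are hypothetical and are not taken from the adapter's source.

```python
def to_bottom_left(x: int, y_top: int, height: int) -> tuple[int, int]:
    """Map a top-left-origin cell (y grows downward) to a bottom-left-origin cell (y grows upward)."""
    return x, height - 1 - y_top


def flip_direction(dx: int, dy: int) -> tuple[int, int]:
    """Flip a facing vector into the bottom-left frame: x is unchanged, y changes sign."""
    return dx, -dy


# Example: on a 7-row grid, the cell one row below the top edge sits at y = 5.
print(to_bottom_left(1, 1, 7))
```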

Datasets

Primary dataset: Synthetic prompts sampled from MiniGrid resets across supported environments
Default environments: MiniGrid-ObstructedMaze-Full-v1, MiniGrid-LavaGapS7-v0, MiniGrid-LockedRoom-v0
Source links: MiniGrid Docs
Split sizes: Configurable via num_examples (default 12)

Task

Type: Multi-turn control loop with multi-action generation
Parser: vf.XMLParser(fields=["think", "actions"], answer_field="actions")
Rubric overview:
- minigrid_native_reward_func (weight 7.5): Native MiniGrid reward (0.1–1.0 for wins, efficiency-based)
- minigrid_milestone_reward_func (weight 0.5, optional): Game-specific shaping rewards
- Parser format reward (weight 0.2): Enforces <THINK>/<ACTIONS> XML structure
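
To make the weights concrete, here is a small sketch of how the three rubric terms might combine into one scalar, assuming the rubric takes a plain weighted sum and that the milestone term is dropped when milestone rewards are disabled. The function name `combined_reward` is hypothetical.

```python
# Weights as listed in the rubric overview above.
RUBRIC_WEIGHTS = {"native": 7.5, "milestone": 0.5, "format": 0.2}


def combined_reward(native: float, milestone: float, format_score: float,
                    use_milestone_rewards: bool = False) -> float:
    """Weighted sum of the rubric terms (assumed combination rule)."""
    total = RUBRIC_WEIGHTS["native"] * native + RUBRIC_WEIGHTS["format"] * format_score
    if use_milestone_rewards:
        total += RUBRIC_WEIGHTS["milestone"] * milestone
    return total


# A fast LockedRoom win with all milestones and a well-formed response.
print(combined_reward(native=1.0, milestone=1.6, format_score=1.0, use_milestone_rewards=True))
```

Under this assumption the native term dominates: even the largest listed milestone total (2.5) contributes at most 1.25 after weighting.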

Multi-Action Format

The agent outputs multiple actions per turn in an <ACTIONS> block:

<THINK>Plan the sequence of actions</THINK>
<ACTIONS>
rotate(clockwise, 2)
forward
forward
pickup
toggle
</ACTIONS>

Available actions:

- rotate(clockwise, N) / rotate(anticlockwise, N): Rotate N times (90° each)
- forward: Move one tile in the facing direction
- pickup: Pick up the object in the tile in front
- drop: Drop the held object into the tile in front
- toggle: Open/close doors or boxes
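
The action list above can be expanded into primitive MiniGrid steps roughly as follows. This is a sketch, not the adapter's parser: the integer codes follow MiniGrid's `Actions` enum (left=0, right=1, forward=2, pickup=3, drop=4, toggle=5), while the function name `parse_actions`, the mapping of clockwise to a right turn, and the choice to truncate at 20 primitive steps are assumptions.

```python
import re

# MiniGrid Actions enum values for the non-rotation actions.
SIMPLE_ACTIONS = {"forward": 2, "pickup": 3, "drop": 4, "toggle": 5}
# Assumption: clockwise is a right turn (1), anticlockwise a left turn (0).
ROTATIONS = {"anticlockwise": 0, "clockwise": 1}
ROTATE_RE = re.compile(r"rotate\(\s*(clockwise|anticlockwise)\s*,\s*(\d+)\s*\)")
MAX_ACTIONS = 20  # assumed to cap primitive steps, not lines


def parse_actions(block: str) -> list[int]:
    """Expand the body of an <ACTIONS> block into MiniGrid action codes."""
    steps: list[int] = []
    for line in block.strip().splitlines():
        token = line.strip().lower()
        if not token:
            continue  # skip blank lines
        match = ROTATE_RE.fullmatch(token)
        if match:
            # rotate(dir, N) becomes N consecutive 90-degree turns.
            steps.extend([ROTATIONS[match.group(1)]] * int(match.group(2)))
        elif token in SIMPLE_ACTIONS:
            steps.append(SIMPLE_ACTIONS[token])
        else:
            raise ValueError(f"unrecognized action: {line!r}")
    return steps[:MAX_ACTIONS]


example = "rotate(clockwise, 2)\nforward\nforward\npickup\ntoggle"
print(parse_actions(example))
```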

Supported Environments

| Environment | Objective | Milestone Rewards (max) |
| --- | --- | --- |
| `MiniGrid-LockedRoom-v0` | Get specified key, unlock specified door, reach goal | 1.6 (mission-aware) |
| `MiniGrid-ObstructedMaze-Full-v1` | Find and pick up blue ball in maze | 2.5 (box/key/door/ball) |
| `MiniGrid-LavaGapS5-v0` | Navigate through lava gap to goal | Distance shaping + gap bonus |
| `MiniGrid-LavaGapS6-v0` | Navigate through lava gap to goal | Distance shaping + gap bonus |
| `MiniGrid-LavaGapS7-v0` | Navigate through lava gap to goal | Distance shaping + gap bonus |

Quickstart

Run with defaults (all three default environments):

uv run vf-eval minigrid_adapted

Run a single environment:

# LavaGap only (easiest)
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-LavaGapS7-v0"]}'

# LockedRoom only
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-LockedRoom-v0"]}'

# ObstructedMaze only (hardest)
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-ObstructedMaze-Full-v1"]}'

Mix and match environments:

# Just the two simpler environments
uv run vf-eval minigrid_adapted \
  -a '{"env_ids": ["MiniGrid-LavaGapS7-v0", "MiniGrid-LockedRoom-v0"]}'

# All LavaGap sizes
uv run vf-eval minigrid_adapted \
  -a '{"env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"]}'

# Custom combination with milestone rewards
uv run vf-eval minigrid_adapted \
  -m gpt-4.1-mini \
  -n 32 -r 3 \
  -a '{
    "env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LockedRoom-v0"],
    "use_milestone_rewards": true
  }'

Progressive curriculum with LavaGap (starts on the easiest size, then gradually adds harder ones):

# Start with S5 (5x5), progressively add S6, then S7 (7x7)
uv run vf-eval minigrid_adapted \
  -m gpt-4.1-mini \
  -n 50 -r 5 -t 2048 -T 0.7 \
  -a '{
    "env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"],
    "curriculum_mode": "progressive",
    "curriculum_warmup_episodes": 100,
    "use_milestone_rewards": true
  }'

In progressive mode, env_ids order matters: the first environment is used exclusively at the start, then additional environments are gradually introduced over curriculum_warmup_episodes.
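
One way to picture the progressive schedule is a linear unlock: the pool starts with the first entry of env_ids and grows by one environment at evenly spaced points until the warmup ends. The sketch below assumes that linear rule; the adapter's actual schedule may differ, and `available_envs` and `sample_env` are hypothetical names.

```python
import random


def available_envs(env_ids: list[str], episode: int, warmup_episodes: int) -> list[str]:
    """Return the prefix of env_ids unlocked at this episode (assumed linear unlock)."""
    if warmup_episodes <= 0 or len(env_ids) == 1:
        return list(env_ids)
    progress = min(max(episode, 0) / warmup_episodes, 1.0)
    count = 1 + int(progress * (len(env_ids) - 1))
    return list(env_ids[:count])


def sample_env(env_ids: list[str], episode: int, warmup_episodes: int,
               rng: random.Random) -> str:
    """Sample uniformly among the environments unlocked so far."""
    return rng.choice(available_envs(env_ids, episode, warmup_episodes))


ids = ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"]
for episode in (0, 50, 100):
    print(episode, available_envs(ids, episode, warmup_episodes=100))
```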

Notes:

- Use -a/--env-args for JSON kwargs forwarded to load_environment()
- Reports land in ./environments/minigrid_adapted/reports/

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| `env_ids` | list[str] | `["MiniGrid-ObstructedMaze-Full-v1", "MiniGrid-LavaGapS7-v0", "MiniGrid-LockedRoom-v0"]` | MiniGrid environment IDs. Order matters for progressive curriculum (easy→hard). |
| `max_turns` | int | `20` | Maximum LLM generations before timeout. With up to 20 actions per turn, the total step budget is max_turns × 20 (400 at the default). |
| `num_examples` | int | `12` | Number of prompt seeds in the dataset. |
| `seed` | int \| null | `null` | RNG seed for reproducibility. |
| `curriculum_mode` | str | `"uniform"` | `"uniform"` (random) or `"progressive"` (start with first env, gradually add more). |
| `curriculum_warmup_episodes` | int | `100` | Episodes before all environments become available (progressive mode only). |
| `use_milestone_rewards` | bool | `false` | Enable game-specific milestone shaping. Recommended for smaller models (≤7B). |
| `milestone_reward_weight` | float | `0.5` | Weight for milestone reward function. |
| `native_reward_weight` | float | `7.5` | Weight for native MiniGrid reward. |

Observation Format

Each observation includes:

<OBS id=N>
Grid: WxH (origin at bottom-left, X→right, Y→up)
You: (x,y) facing direction, holding [item] or hands empty
In front (x,y): [object description] or empty or BLOCKED
Mission: [mission text if applicable]
Visible objects:
  - [color] [type] at (x,y)
  - ...
</OBS>
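
For readers who want to build or test against this layout, here is a sketch that renders the template from plain Python values. It is an illustration only: the function name `render_observation`, its argument names, and the exact punctuation between fields are assumptions, not the adapter's code.

```python
def render_observation(obs_id, grid_size, agent_xy, facing, holding,
                       front_xy, front, mission, visible):
    """Render an <OBS> block; `visible` is a list of (color, type, (x, y)) tuples."""
    width, height = grid_size
    held = f"holding {holding}" if holding else "hands empty"
    lines = [
        f"<OBS id={obs_id}>",
        f"Grid: {width}x{height} (origin at bottom-left, X→right, Y→up)",
        f"You: ({agent_xy[0]},{agent_xy[1]}) facing {facing}, {held}",
        f"In front ({front_xy[0]},{front_xy[1]}): {front or 'empty'}",
    ]
    if mission:  # the mission line is only present when there is mission text
        lines.append(f"Mission: {mission}")
    lines.append("Visible objects:")
    lines.extend(f"  - {color} {kind} at ({x},{y})" for color, kind, (x, y) in visible)
    lines.append("</OBS>")
    return "\n".join(lines)


print(render_observation(3, (7, 7), (1, 5), "east", None, (2, 5), None,
                         "", [("green", "goal", (5, 1))]))
```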

Metrics

| Metric | Meaning |
| --- | --- |
| `native_reward` | MiniGrid's built-in reward (0.1–1.0 for wins, higher = faster completion) |
| `milestone_reward_total` | Cumulative game-specific shaping reward |
| `result` | Final outcome: `success`, `timeout`, `failure`, or `invalid` |
| `steps` | Total MiniGrid environment steps executed |
| `turns` | Total LLM generations used |
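
The 0.1–1.0 range of `native_reward` follows from MiniGrid's success reward, which decays linearly with the number of steps taken. A short sketch of that formula, where `max_steps` is MiniGrid's own per-environment step limit (how it relates to this adapter's max_turns is not stated here) and the wrapper name `native_reward` is used only for illustration:

```python
def native_reward(step_count: int, max_steps: int, success: bool) -> float:
    """MiniGrid-style success reward: 1 - 0.9 * (step_count / max_steps), else 0."""
    if not success:
        return 0.0
    return 1.0 - 0.9 * (step_count / max_steps)


# Solving in a quarter of the step limit versus using the entire limit.
print(native_reward(25, 100, True))
print(native_reward(100, 100, True))
```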

Milestone Reward Details

LockedRoom (max 1.6):

- Enter key room: +0.4
- Pick up correct key: +0.6
- Unlock correct door: +0.6
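
Since the three bonuses add up to the stated maximum of 1.6, each one is presumably paid at most once per episode. A minimal sketch of that bookkeeping, under that assumption; the class `MilestoneTracker` and the milestone keys are hypothetical names:

```python
# Bonus values from the LockedRoom list above; key names are illustrative.
LOCKED_ROOM_MILESTONES = {
    "enter_key_room": 0.4,
    "pickup_correct_key": 0.6,
    "unlock_correct_door": 0.6,
}


class MilestoneTracker:
    """Pays each milestone bonus once per episode (assumed behaviour)."""

    def __init__(self, milestones: dict[str, float]):
        self.milestones = dict(milestones)
        self.achieved: set[str] = set()

    def award(self, name: str) -> float:
        """Return the bonus the first time a milestone is hit, 0.0 afterwards."""
        if name not in self.milestones or name in self.achieved:
            return 0.0
        self.achieved.add(name)
        return self.milestones[name]

    @property
    def total(self) -> float:
        return sum(self.milestones[name] for name in self.achieved)


tracker = MilestoneTracker(LOCKED_ROOM_MILESTONES)
print(tracker.award("enter_key_room"))   # first visit pays the bonus
print(tracker.award("enter_key_room"))   # re-entering the room pays nothing
```

The same pattern would apply to the ObstructedMaze milestones with a different bonus table.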

ObstructedMaze (max 2.5):

- Open first box: +0.5
- Pick up first key: +0.5
- Open first door: +0.5
- Pick up blue ball (goal): +1.0

LavaGap (distance-based shaping):

- Distance reduction to goal: +0.15 per unit decrease in Manhattan distance
- Distance increase: -0.1 (mild backtrack penalty)
- Pass through lava gap: +0.4
- Exploration: +0.03 per new tile
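
The LavaGap terms can be read as a per-step computation along these lines. This is a sketch under several assumptions: rewards are evaluated after every primitive step, the gap bonus is paid once when the agent first stands on the gap cell, and the exploration bonus applies to tiles not yet in a visited set. The function name `lava_gap_step_reward` and its arguments are hypothetical.

```python
DISTANCE_GAIN = 0.15       # per unit decrease in Manhattan distance
BACKTRACK_PENALTY = 0.1    # applied when the distance to the goal grows
GAP_BONUS = 0.4            # assumed one-time bonus for reaching the gap cell
EXPLORATION_BONUS = 0.03   # per newly visited tile


def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lava_gap_step_reward(prev_pos, new_pos, goal, gap_cell, visited, gap_passed):
    """Return (shaping reward for this step, updated gap_passed flag).

    `visited` is a set of tiles that is updated in place.
    """
    reward = 0.0
    delta = manhattan(prev_pos, goal) - manhattan(new_pos, goal)
    if delta > 0:
        reward += DISTANCE_GAIN * delta
    elif delta < 0:
        reward -= BACKTRACK_PENALTY
    if new_pos not in visited:
        visited.add(new_pos)
        reward += EXPLORATION_BONUS
    if not gap_passed and new_pos == gap_cell:
        gap_passed = True
        reward += GAP_BONUS
    return reward, gap_passed


visited = {(1, 1)}
reward, passed = lava_gap_step_reward((1, 1), (2, 1), goal=(5, 1), gap_cell=(3, 1),
                                      visited=visited, gap_passed=False)
print(round(reward, 2), passed)
```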

Evaluation Reports

No reports found. Run uv run vf-eval minigrid_adapted to generate one.