Minigrid Adapted RL Env (Antim)

MiniGrid adapter that exposes Farama MiniGrid levels to the verifiers / Prime RL ecosystem.

Type: RL Env
Publisher: Antim
Runtime: multi-turn
License: MIT
Size: v2.7.0
Published: Dec 2025

minigrid_adapted Adapter

Overview

  • Environment ID: minigrid_adapted
  • Short description: Multi-turn, multi-action verifier adapter for Farama MiniGrid with coordinate-based observations and game-specific milestone shaping rewards for GRPO training.
  • Tags: rl, minigrid, gymnasium, verifiers, grpo, multi-action

This environment wraps Farama MiniGrid environments as verifiers-compatible tasks for Prime RL. Key features:

  • Multi-action generation: Each LLM turn can specify up to 20 actions executed sequentially, enabling better temporal abstraction and credit assignment
  • Coordinate-based observations: Global (x, y) coordinates with origin at bottom-left, X→right, Y→up
  • Game-specific milestone rewards: Tailored shaping for LockedRoom, ObstructedMaze, and LavaGap environments
  • Curriculum support: Progressive difficulty ramping across environments
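
As a rough illustration of the coordinate convention above: MiniGrid itself indexes cells from the top-left with y growing downward, so an adapter that reports a bottom-left origin has to flip the y axis. The sketch below shows one way to do that. The helper names are hypothetical and not taken from the adapter's source.

```python
# MiniGrid heading index -> unit vector in MiniGrid's own frame
# (origin top-left, y grows downward): 0=right, 1=down, 2=left, 3=up.
DIR_TO_VEC_TOP_LEFT = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def to_bottom_left(x: int, y_top: int, height: int) -> tuple[int, int]:
    """Convert a top-left-origin cell into a bottom-left-origin cell (hypothetical helper)."""
    return x, height - 1 - y_top


def front_cell_bottom_left(x: int, y_top: int, direction: int, height: int) -> tuple[int, int]:
    """Cell directly in front of the agent, expressed with a bottom-left origin."""
    dx, dy = DIR_TO_VEC_TOP_LEFT[direction]
    return to_bottom_left(x + dx, y_top + dy, height)


# Agent at MiniGrid cell (1, 1) in a 7-tall grid, facing "down" on screen.
agent = to_bottom_left(1, 1, 7)
front = front_cell_bottom_left(1, 1, 1, 7)
```

Under this convention, facing "down" on screen lowers the reported y by one, which matches the Y→up axis described above.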

Datasets

  • Primary dataset: Synthetic prompts sampled from MiniGrid resets across supported environments
  • Default environments: MiniGrid-ObstructedMaze-Full-v1, MiniGrid-LavaGapS7-v0, MiniGrid-LockedRoom-v0
  • Source links: MiniGrid Docs
  • Split sizes: Configurable via num_examples (default 12)

Task

  • Type: Multi-turn control loop with multi-action generation
  • Parser: vf.XMLParser(fields=["think", "actions"], answer_field="actions")
  • Rubric overview:
    • minigrid_native_reward_func (weight 7.5): Native MiniGrid reward (0.1–1.0 on success, higher for faster completion)
    • minigrid_milestone_reward_func (weight 0.5, optional): Game-specific shaping rewards
    • Parser format reward (weight 0.2): Enforces <THINK>/<ACTIONS> XML structure
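
To get a feel for how the three rubric weights interact, here is a minimal sketch that combines them as a plain weighted sum. It assumes the rubric aggregates scores this way and that the format reward lies in [0, 1]. Neither assumption is confirmed by this README, and the function name is hypothetical.

```python
# Weights as listed in the rubric overview.
WEIGHTS = {"native": 7.5, "milestone": 0.5, "format": 0.2}


def combined_reward(native: float, milestone: float, format_score: float,
                    use_milestone_rewards: bool = False) -> float:
    """Weighted sum of the rubric components (assumed aggregation rule)."""
    total = WEIGHTS["native"] * native + WEIGHTS["format"] * format_score
    if use_milestone_rewards:
        # The milestone term is optional and off by default.
        total += WEIGHTS["milestone"] * milestone
    return total


# A fast LockedRoom win with every milestone collected and a well-formed reply.
best_case = combined_reward(1.0, 1.6, 1.0, use_milestone_rewards=True)
```

With these weights the native reward dominates: even the full 1.6 LockedRoom milestone total contributes only 0.8 to the sum, against up to 7.5 for a win.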

Multi-Action Format

The agent outputs multiple actions per turn in an <ACTIONS> block:

<THINK>Plan the sequence of actions</THINK>
<ACTIONS>
rotate(clockwise, 2)
forward
forward
pickup
toggle
</ACTIONS>

Available actions:

  • rotate(clockwise, N) / rotate(anticlockwise, N): Rotate N times (90° each)
  • forward: Move one tile in facing direction
  • pickup: Pick up object in front tile
  • drop: Drop held object in front tile
  • toggle: Open/close doors or boxes
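
The action list above can be turned into MiniGrid's discrete action indices with a small parser. This is a sketch only. It assumes that each 90° rotation counts as one action toward the 20-action cap and that overflow is truncated rather than rejected. The adapter may handle both differently.

```python
import re

# MiniGrid's discrete action indices.
ACTION_IDS = {"left": 0, "right": 1, "forward": 2, "pickup": 3, "drop": 4, "toggle": 5}
MAX_ACTIONS_PER_TURN = 20
ROTATE_RE = re.compile(r"^rotate\(\s*(clockwise|anticlockwise)\s*,\s*(\d+)\s*\)$")


def parse_actions(block: str) -> list[int]:
    """Expand the body of an <ACTIONS> block into a list of MiniGrid action ids."""
    ids: list[int] = []
    for raw in block.strip().splitlines():
        line = raw.strip().lower()
        if not line:
            continue
        match = ROTATE_RE.match(line)
        if match:
            # A clockwise turn corresponds to MiniGrid's "right" action.
            turn = "right" if match.group(1) == "clockwise" else "left"
            ids.extend([ACTION_IDS[turn]] * int(match.group(2)))
        elif line in ("forward", "pickup", "drop", "toggle"):
            ids.append(ACTION_IDS[line])
        else:
            raise ValueError(f"unknown action: {raw!r}")
    # Assumed policy: silently drop anything past the per-turn cap.
    return ids[:MAX_ACTIONS_PER_TURN]


example = parse_actions("""
rotate(clockwise, 2)
forward
forward
pickup
toggle
""")
```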

Supported Environments

| Environment | Objective | Milestone Rewards (max) |
| --- | --- | --- |
| MiniGrid-LockedRoom-v0 | Get specified key, unlock specified door, reach goal | 1.6 (mission-aware) |
| MiniGrid-ObstructedMaze-Full-v1 | Find and pick up blue ball in maze | 2.5 (box/key/door/ball) |
| MiniGrid-LavaGapS5-v0 | Navigate through lava gap to goal | Distance shaping + gap bonus |
| MiniGrid-LavaGapS6-v0 | Navigate through lava gap to goal | Distance shaping + gap bonus |
| MiniGrid-LavaGapS7-v0 | Navigate through lava gap to goal | Distance shaping + gap bonus |

Quickstart

Run with defaults (all three default environments):

uv run vf-eval minigrid_adapted

Run a single environment:

# LavaGap only (easiest)
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-LavaGapS7-v0"]}'

# LockedRoom only
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-LockedRoom-v0"]}'

# ObstructedMaze only (hardest)
uv run vf-eval minigrid_adapted -a '{"env_ids": ["MiniGrid-ObstructedMaze-Full-v1"]}'

Mix and match environments:

# Just the two simpler environments
uv run vf-eval minigrid_adapted \
  -a '{"env_ids": ["MiniGrid-LavaGapS7-v0", "MiniGrid-LockedRoom-v0"]}'

# All LavaGap sizes
uv run vf-eval minigrid_adapted \
  -a '{"env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"]}'

# Custom combination with milestone rewards
uv run vf-eval minigrid_adapted \
  -m gpt-4.1-mini \
  -n 32 -r 3 \
  -a '{
    "env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LockedRoom-v0"],
    "use_milestone_rewards": true
  }'

Progressive curriculum with LavaGap (trains on the easier sizes first, then gradually adds the harder ones):

# Start with S5 (5x5), progressively add S6, then S7 (7x7)
uv run vf-eval minigrid_adapted \
  -m gpt-4.1-mini \
  -n 50 -r 5 -t 2048 -T 0.7 \
  -a '{
    "env_ids": ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"],
    "curriculum_mode": "progressive",
    "curriculum_warmup_episodes": 100,
    "use_milestone_rewards": true
  }'

In progressive mode, env_ids order matters: the first environment is used exclusively at the start, then additional environments are gradually introduced over curriculum_warmup_episodes.
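
The README does not spell out the exact schedule, so the following is only one plausible reading of "gradually introduced": environments unlock linearly over the warmup window, and each episode samples uniformly from the unlocked prefix of env_ids. All function names here are hypothetical.

```python
import random


def available_envs(env_ids: list[str], episode: int, warmup_episodes: int) -> list[str]:
    """Prefix of env_ids unlocked at a given episode (assumed linear schedule)."""
    if warmup_episodes <= 0 or len(env_ids) <= 1:
        return list(env_ids)
    progress = min(max(episode, 0) / warmup_episodes, 1.0)
    # Always keep the first environment; unlock the rest in order.
    count = 1 + int(progress * (len(env_ids) - 1))
    return list(env_ids[:count])


def sample_env(env_ids: list[str], episode: int, warmup_episodes: int,
               rng: random.Random) -> str:
    """Pick the environment for one episode from the currently unlocked set."""
    return rng.choice(available_envs(env_ids, episode, warmup_episodes))


LAVAGAP = ["MiniGrid-LavaGapS5-v0", "MiniGrid-LavaGapS6-v0", "MiniGrid-LavaGapS7-v0"]
rng = random.Random(0)
first_pick = sample_env(LAVAGAP, episode=0, warmup_episodes=100, rng=rng)
```

Under this assumed schedule with a 100-episode warmup, S6 becomes available at episode 50 and S7 at episode 100.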

Notes:

  • Use -a/--env-args for JSON kwargs forwarded to load_environment()
  • Reports land in ./environments/minigrid_adapted/reports/

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| env_ids | list[str] | ["MiniGrid-ObstructedMaze-Full-v1", "MiniGrid-LavaGapS7-v0", "MiniGrid-LockedRoom-v0"] | MiniGrid environment IDs. Order matters for progressive curriculum (easy→hard). |
| max_turns | int | 20 | Maximum LLM generations before timeout. Total step budget = max_turns × 20. |
| num_examples | int | 12 | Number of prompt seeds in the dataset. |
| seed | int \| null | null | RNG seed for reproducibility. |
| curriculum_mode | str | "uniform" | "uniform" (random) or "progressive" (start with first env, gradually add more). |
| curriculum_warmup_episodes | int | 100 | Episodes before all envs available (progressive mode only). |
| use_milestone_rewards | bool | false | Enable game-specific milestone shaping. Recommended for smaller models (≤7B). |
| milestone_reward_weight | float | 0.5 | Weight for milestone reward function. |
| native_reward_weight | float | 7.5 | Weight for native MiniGrid reward. |

Observation Format

Each observation includes:

<OBS id=N>
Grid: WxH (origin at bottom-left, X→right, Y→up)
You: (x,y) facing direction, holding [item] or hands empty
In front (x,y): [object description] or empty or BLOCKED
Mission: [mission text if applicable]
Visible objects:
  - [color] [type] at (x,y)
  - ...
</OBS>
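
For downstream analysis it can help to pull structured fields back out of an observation. The parser below is a sketch written against the template above. The sample observation is invented for illustration, and the exact wording the adapter emits (direction names, object descriptions) is an assumption.

```python
import re

# Hypothetical observation that follows the documented template.
SAMPLE_OBS = """<OBS id=3>
Grid: 19x19 (origin at bottom-left, X→right, Y→up)
You: (4,9) facing east, hands empty
In front (5,9): empty
Mission: get the red key, unlock the red door and go to the goal
Visible objects:
  - red key at (2,14)
  - red door at (7,9)
</OBS>"""


def parse_obs(text: str) -> dict:
    """Extract grid size, agent pose, and visible objects from one <OBS> block."""
    grid = re.search(r"Grid: (\d+)x(\d+)", text)
    you = re.search(r"You: \((\d+),(\d+)\) facing (\w+)", text)
    if grid is None or you is None:
        raise ValueError("observation does not match the expected template")
    objects = [
        {"color": color, "type": kind, "pos": (int(x), int(y))}
        for color, kind, x, y in re.findall(
            r"^\s*- (\w+) (\w+) at \((\d+),(\d+)\)", text, flags=re.MULTILINE
        )
    ]
    return {
        "grid": (int(grid[1]), int(grid[2])),
        "agent": (int(you[1]), int(you[2])),
        "facing": you[3],
        "objects": objects,
    }


parsed = parse_obs(SAMPLE_OBS)
```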

Metrics

| Metric | Meaning |
| --- | --- |
| native_reward | MiniGrid's built-in reward (0.1–1.0 for wins, higher = faster completion) |
| milestone_reward_total | Cumulative game-specific shaping reward |
| result | Final outcome: success, timeout, failure, or invalid |
| steps | Total MiniGrid environment steps executed |
| turns | Total LLM generations used |
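
The 0.1–1.0 range of native_reward follows from MiniGrid's standard success reward, 1 - 0.9 * (step_count / max_steps). The sketch below reproduces that formula. Whether the adapter passes the value through unchanged is an assumption, and the step limit in the example is an illustrative value.

```python
def minigrid_success_reward(step_count: int, max_steps: int) -> float:
    """MiniGrid's reward for reaching the goal after step_count steps."""
    return 1.0 - 0.9 * (step_count / max_steps)


# Illustrative values: finishing a quarter of the way into a 400-step limit.
quick_win = minigrid_success_reward(step_count=100, max_steps=400)
# Finishing on the very last allowed step gives the 0.1 floor.
slowest_win = minigrid_success_reward(step_count=400, max_steps=400)
```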

Milestone Reward Details

LockedRoom (max 1.6):

  • Enter key room: +0.4
  • Pick up correct key: +0.6
  • Unlock correct door: +0.6

ObstructedMaze (max 2.5):

  • Open first box: +0.5
  • Pick up first key: +0.5
  • Open first door: +0.5
  • Pick up blue ball (goal): +1.0
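
The LockedRoom and ObstructedMaze values above can be read as one-time bonuses. A minimal sketch of such a tracker follows, assuming each milestone pays out at most once per episode. The milestone keys and the class name are hypothetical.

```python
# Milestone values as listed above; the key names are illustrative.
MILESTONES = {
    "locked_room": {
        "enter_key_room": 0.4,
        "pickup_correct_key": 0.6,
        "unlock_correct_door": 0.6,
    },
    "obstructed_maze": {
        "open_first_box": 0.5,
        "pickup_first_key": 0.5,
        "open_first_door": 0.5,
        "pickup_blue_ball": 1.0,
    },
}


class MilestoneTracker:
    """Accumulates one-time milestone bonuses for a single episode."""

    def __init__(self, game: str) -> None:
        self.values = MILESTONES[game]
        self.awarded: set[str] = set()
        self.total = 0.0

    def award(self, name: str) -> float:
        """Return the bonus for a milestone, or 0.0 if it was already paid."""
        if name in self.awarded:
            return 0.0
        bonus = self.values[name]
        self.awarded.add(name)
        self.total += bonus
        return bonus


tracker = MilestoneTracker("locked_room")
for event in ["enter_key_room", "pickup_correct_key", "pickup_correct_key", "unlock_correct_door"]:
    tracker.award(event)
```

The repeated pickup_correct_key event pays nothing the second time, so the episode total stays at the documented 1.6 maximum.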

LavaGap (potential-based):

  • Distance reduction to goal: +0.15 per unit of Manhattan distance closed
  • Distance increase: -0.1 (mild backtrack penalty)
  • Pass through lava gap: +0.4
  • Exploration: +0.03 per new tile
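
The LavaGap terms above can be combined into a per-step shaping function. This is a sketch under stated assumptions: the gap bonus is paid once, the backtrack penalty is a flat -0.1, and the caller tracks visited tiles. Since a single step changes Manhattan distance by at most one, flat and per-unit penalties coincide.

```python
def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lavagap_shaping(prev_dist: int, new_dist: int,
                    is_new_tile: bool, passed_gap_first_time: bool) -> float:
    """Shaping reward for one LavaGap step (assumed combination of the listed terms)."""
    reward = 0.0
    if new_dist < prev_dist:
        reward += 0.15 * (prev_dist - new_dist)  # progress toward the goal
    elif new_dist > prev_dist:
        reward -= 0.1                            # mild backtrack penalty
    if passed_gap_first_time:
        reward += 0.4                            # one-time gap bonus
    if is_new_tile:
        reward += 0.03                           # exploration bonus
    return reward


goal = (5, 1)
step = lavagap_shaping(manhattan((1, 5), goal), manhattan((2, 5), goal),
                       is_new_tile=True, passed_gap_first_time=False)
```

Because moving closer earns more than moving away costs, stepping away and back nets +0.05 under these numbers, so the shaping is only approximately potential-based.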

Evaluation Reports

No reports found. Run uv run vf-eval minigrid_adapted to generate one.