GBAGym

GBA Eval from Mechanize Inc. ported as an RL training environment.

Type: RL Env
Runtime: ORS
License: unknown
Published: Jun 2026

GBA-Emu-Gym

OpenReward Environment

OpenReward gym port of mechanize-work/gba-eval — agents iteratively submit GBA emulator wasm builds and earn reward equal to the high-water mark of the grader's score.

Upstream is a one-shot 24-hour eval: the agent gets one Linux container with a Rust toolchain, writes a GBA emulator from scratch, and at the deadline a held-out grader runs the wasm against ~27 testcases (homebrew gameplay replays + procedural CPU/memory/DMA test ROMs + audio diffs against Mesen2). One submission, one number.

This is a gym: same shape, but the agent can submit any number of times per episode. Each submit call earns

reward = max(0, new_overall - previous_best_overall)

so an episode's cumulative reward equals the high-water mark of overall ∈ [0, 1]. A worse submission earns zero, no penalty. The agent's incentive is simple: ship something that compiles, get a baseline, then improve.
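
The telescoping behind that claim can be checked with a minimal sketch. The class name, the starting best of 0.0, and the score sequence are illustrative assumptions, not code from env.py:

```python
class HighWaterReward:
    """Tracks the best `overall` seen so far in one episode (illustrative)."""

    def __init__(self) -> None:
        # Assumption: previous_best_overall starts at 0.0 for a fresh episode.
        self.best = 0.0

    def step(self, new_overall: float) -> float:
        # Reward is the improvement over the previous best, never negative.
        reward = max(0.0, new_overall - self.best)
        self.best = max(self.best, new_overall)
        return reward


tracker = HighWaterReward()
scores = [0.0085, 0.12, 0.07, 0.31]  # hypothetical `overall` values per submit
rewards = [tracker.step(s) for s in scores]
total = sum(rewards)  # matches max(scores) up to float rounding
```

The third submission regresses from 0.12 to 0.07 and earns exactly zero; the sum of all rewards lands on the best score of the episode.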

Status

End-to-end pipeline is verified live on OpenReward as GeneralReasoning/GBAGym. A no-op stub submission scores overall ≈ 0.0085 in roughly 3 minutes of grader wall-clock.

How the agent interacts with the env

Inside the agent's sandbox:

| Path | What it is |
|---|---|
| /task/ | Agent's working directory. An empty cargo workspace is expected; the agent creates Cargo.toml + src/lib.rs. |
| /task/spec/ABI.md | 10 C-ABI functions the agent's wasm must export (emu_init, emu_load_rom, emu_set_keys, emu_run_frame, emu_framebuffer, emu_audio_buffer, …). |
| /task/spec/gba_bios_stub.bin | 16 KiB ARM BIOS stub mapped at 0x00000000. |
| /task/spec/gbatek.htm | 5 MB GBATEK hardware reference. |
| /task/dev-roms/ | 5 visible GBA ROMs: 3 are also in the grader's ROM set, 2 are dev-only and never graded (see ROM split below). |

Available tools (per session):

| Tool | What it does |
|---|---|
| bash, read, write, edit, glob, grep, todo_write | Standard Claude-Code-style file/shell tools, scoped to the agent's sandbox. |
| oracle_run(rom_path, frames, replay_text?) | Run a ROM through the reference Mesen2 emulator in a separate sandbox; per-frame PPMs + WAV audio land back as a tarball at /task/.oracle-out/<run-id>.tar. Mesen2 itself is never visible to the agent. |
| submit | Grade whatever wasm is currently at /task/target/wasm32-unknown-unknown/release/gba_emu.wasm. Returns overall + section subscores. Per-testcase scores are deliberately not surfaced. |
| give_up | End the episode early. |

The agent must run cargo build itself: submit does not build anything, it only grades the artifact at the canonical path.

ROM split

To preserve anti-shimming pressure while still letting the agent develop against real ROMs:

| Bucket | ROMs |
|---|---|
| Visible (in dev-roms/, also graded) | spout, waimanu (gameplay replays); armwrestler (procedural test) |
| Pure dev (in dev-roms/, never graded) | trogdor, another-world |
| Held-out (grader only) | celeste-classic, varooom-3d, bulletgba, chip-advance, collie-defense, goodboy-advance, heartwrench-advance, piugba, mgba-suite, jsmolka/memory, destoer/dma-priority, several nba-hw/* ROMs, tonc/snd1-demo, audio test ROMs |

The agent never sees the held-out names and submit only returns section aggregates (replay, procedural, audio), so they cannot reverse-engineer the held-out set from feedback.

Architecture

Each session has two sandboxes, both owned by env code (the agent never touches the eval sandbox):

┌──────────────────────────────────────────────────────────────────────┐
│ OpenReward env server (FastAPI, deployed from this repo)             │
│                                                                      │
│  ┌────────────────────────┐         ┌────────────────────────────┐   │
│  │  agent_sandbox  4:16   │         │  eval_sandbox      4:16    │   │
│  │  ────────────────────  │         │  ──────────────────────    │   │
│  │  rust 1.87 + wasmtime  │         │  oracle binary             │   │
│  │  clang / cmake / py3   │         │  grader binary             │   │
│  │  spec/, dev-roms/      │         │  mesen.wasm                │   │
│  │                        │         │  full corpus (held-out)    │   │
│  │  agent edits Rust here │         │  ref-cache (~230 MB LFS)   │   │
│  │  ▲                     │         │  ▲                         │   │
│  └──┼─────────────────────┘         └──┼─────────────────────────┘   │
│     │ Claude-Code-style tools          │ shell-in via env code       │
│     │ + oracle_run + submit + give_up  │ for oracle / grader runs    │
│     └──────────── env code ────────────┘                             │
└──────────────────────────────────────────────────────────────────────┘

Both sandboxes are pinned by GHCR digest (images/{task,eval}.sha, written by CI). Network is blocked on both pods.

oracle_run flow

  1. Env downloads ROM bytes from agent_sandbox at rom_path.
  2. Stages ROM (and optional replay file) in eval_sandbox at /eval/scratch/<run-id>/.
  3. Runs oracle run rom.gba <frames> --replay … --dump-frames frames/ --dump-audio audio.wav in eval_sandbox.
  4. Tars frames/ and audio.wav together, downloads the tarball, and uploads it to the agent's /task/.oracle-out/<run-id>.tar.
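
On the agent side, the returned tarball can be inspected with the standard library. This is a minimal sketch: the helper name is hypothetical, and it assumes member names appear as frames/... and audio.wav with no leading ./ prefix; the frame file naming inside frames/ is not specified here.

```python
import io
import tarfile


def list_oracle_outputs(tar_bytes: bytes) -> tuple[list[str], bool]:
    """Return (sorted frame member names, whether audio.wav is present)."""
    # "r:*" lets tarfile auto-detect compression, if any was applied.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    # Assumption: members are stored as "frames/<file>" and "audio.wav".
    frames = sorted(n for n in names if n.startswith("frames/"))
    has_audio = "audio.wav" in names
    return frames, has_audio
```

Typical use would be `list_oracle_outputs(Path("/task/.oracle-out/<run-id>.tar").read_bytes())`, with the real run id substituted.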

submit flow

  1. Env downloads /task/target/wasm32-unknown-unknown/release/gba_emu.wasm from agent_sandbox. Errors clearly if absent (agent forgot to build).
  2. Uploads it to eval_sandbox at /eval/scratch/grade-<run-id>/candidate.wasm.
  3. Runs /eval/bin/grader --reference /opt/gba-eval/mesen.wasm candidate.wasm /eval/corpus <out> (15-min timeout).
  4. Reads <out>/summary.json → returns scalar overall + section scores. Cleans the scratch dir.
  5. Computes delta = max(0, overall - best_score), updates best_score, returns delta as reward.
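
Steps 4 and 5 can be sketched as one pure function. The summary.json layout used here (an "overall" key plus a "sections" object) and the function name are assumptions for illustration; the real grader output may be shaped differently.

```python
import json

SECTION_NAMES = ("replay", "procedural", "audio")


def score_submission(summary_text: str, best_score: float) -> dict:
    """Turn grader output into the agent-facing result plus the reward delta."""
    summary = json.loads(summary_text)
    overall = float(summary["overall"])  # assumed key name
    # Only section aggregates are surfaced; any per-testcase detail is dropped.
    sections = {name: float(summary["sections"][name]) for name in SECTION_NAMES}
    reward = max(0.0, overall - best_score)
    return {
        "overall": overall,
        "sections": sections,
        "reward": reward,
        "best_score": max(best_score, overall),
    }
```

Keeping this step free of sandbox I/O makes the reward rule easy to unit-test separately from the download, upload, and grader invocation.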

Why this isolation matters

This mirrors upstream's services/task split: mesen.wasm and the oracle binary live only in the eval sandbox and are never reachable from the agent's container. The isolation is stronger than upstream's HTTP-sidecar design: every interaction is a Pydantic-schema-validated tool call, no wire protocol is exposed, and the agent has no network path to the reference at all.

Wasmtime memory patch

The default wasmtime::Config reserves ~4 GiB of virtual address space per linear memory + a 2 GiB guard region. The grader instantiates two wasms (reference + candidate) so a single submit would want ~12 GiB of VAS — the OpenReward sandbox's kernel/cgroup vm limits reject the mmap even at 4:16 (the largest non-GPU machine size). docker/eval.Dockerfile applies a small in-place patch to upstream/harness/grader/src/wasm_candidate.rs that switches every memory to dynamic mode with a 64 KiB guard, eliminating the giant mmap. Without it, every submit errors at instantiation with mmap failed to reserve 0x200000000 bytes.
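
The address-space arithmetic can be checked directly. This sketch uses only the figures quoted above (4 GiB reservation, 2 GiB guard, two instances) and decodes the size in the error message; it does not model wasmtime itself.

```python
GIB = 1 << 30

reservation = 4 * GIB  # per linear memory, as quoted above
guard = 2 * GIB        # guard region, as quoted above
instances = 2          # reference wasm + candidate wasm

total_vas = instances * (reservation + guard)
total_gib = total_vas / GIB  # 12.0

# The size named in the mmap error, converted from hex bytes to GiB.
failed_reservation = 0x200000000
failed_gib = failed_reservation / GIB  # 8.0
```

So the failure is triggered by a single 8 GiB reservation, before the combined ~12 GiB is ever requested.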

Repo layout

server.py                       OpenReward env server entry point
env.py                          GBAEmuGym Environment — dual-sandbox flow + tools
TASK.md                         Agent-facing prompt (baked into the task image)
Dockerfile                      Env server image (FastAPI, deployed by OpenReward)
docker/task.Dockerfile          Agent sandbox image
docker/eval.Dockerfile          Eval sandbox image (oracle + grader + corpus)
images/{task,eval}.sha          Digest-pinned GHCR image refs (CI writes these)
upstream/                       Git submodule → mechanize-work/gba-eval @ pinned SHA
.github/workflows/
  build-images.yml              Builds task + eval images on every push,
                                pushes to GHCR, commits digests back to images/
requirements.txt                openreward, pydantic (env server runtime deps)
pyproject.toml                  Project metadata
runner/                         (gitignored) Local dev tooling — interactive
                                step_through script, snapshot extractor

Reward shape

| Property | Value |
|---|---|
| Range | overall ∈ [0, 1], cumulative episode reward ≤ 1.0 |
| Sign | Monotone non-decreasing: worse submissions earn 0, no penalty |
| Termination | No auto-finish from submit. Episode ends when the agent calls give_up or the harness enforces a step/wall-clock cap |
| Scoring | overall = 0.60 × replay + 0.20 × procedural + 0.20 × audio (configurable in corpus/grader.yaml, but agents can't see it) |
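
The scoring row amounts to a weighted sum of the three section scores. A minimal sketch with the default weights hard-coded (the real values come from corpus/grader.yaml, and the function name is illustrative):

```python
# Default weights from the table above; the grader reads them from its config.
WEIGHTS = {"replay": 0.60, "procedural": 0.20, "audio": 0.20}


def overall_score(sections: dict[str, float]) -> float:
    """Weighted sum of section scores, each assumed to lie in [0, 1]."""
    return sum(weight * sections[name] for name, weight in WEIGHTS.items())


# An emulator that nails the gameplay replays but emits no usable audio
# and fails every procedural test ROM still collects the replay weight.
replay_only = overall_score({"replay": 1.0, "procedural": 0.0, "audio": 0.0})
```

Because the weights sum to 1 and each section is bounded by 1, overall cannot exceed 1, which is what caps the cumulative episode reward.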

If you want larger reward magnitudes for training, scale at the trainer.

Local setup

git clone --recurse-submodules <this-repo>
cd GBA-Emu-Gym
git -C upstream lfs install && git -C upstream lfs pull   # ~230 MB ref cache
uv pip install -r requirements.txt

Build the sandbox images locally:

docker build -f docker/task.Dockerfile -t gba-emu-gym-task:dev .
docker build -f docker/eval.Dockerfile -t gba-emu-gym-eval:dev .

CI does this automatically on every push to main and writes the resulting GHCR digests back to images/{task,eval}.sha. The env code reads those pins via env.py:_read_image_pin(...) and falls back to :latest tags for local development.
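
The pin-then-fallback behaviour might look roughly like the sketch below. It is not the actual env.py code: the signature, the repo argument, and the assumption that each .sha file holds one complete digest-pinned image reference are all illustrative.

```python
from pathlib import Path


def read_image_pin(name: str, repo: str, pin_dir: Path = Path("images")) -> str:
    """Return the pinned image ref for `name`, or `<repo>:latest` if unpinned."""
    pin_file = pin_dir / f"{name}.sha"
    if pin_file.is_file():
        # Assumption: the file contains a single full image reference.
        pinned = pin_file.read_text().strip()
        if pinned:
            return pinned
    # No pin on disk (for example a fresh local checkout): use the mutable tag.
    return f"{repo}:latest"
```

Falling back to a mutable tag is convenient locally, but only the digest-pinned path gives reproducible sandboxes.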

License

Inherits upstream's per-file licensing — see upstream/LEGAL.md. Briefly:

  • Original work in this repo (env.py, server.py, Dockerfiles, README, TASK.md): MIT
  • Upstream harness/spec/corpus (non-ROM): MIT
  • corpus/roms/: per-ROM licenses (homebrew + open test ROMs)
  • Mesen2 wasm + build glue: GPL-3.0
  • spec/gba_bios_stub.bin: clean-room MIT (not a Nintendo dump)

Known limitations

  • oracle_run is slow per call — a 600-frame run shuttles a ~70 MB tarball through env code (agent ↔ env ↔ eval, base64-over-HTTP). Workable, but agents that want many high-frame queries will see latency. Mitigations on the table: cap frames lower, add a session-style tool that holds state across many small steps, or ship a one-shot tool that returns a similarity score directly (no frame bytes cross the boundary).
  • Each submit is 1-5 minutes of grader CPU. Episodes with hundreds of submits get expensive. Trainers should bound submit frequency.
  • Visible ROMs are also graded, so the agent gets some direct signal from the visible set. Designed this way intentionally — gives the agent a tractable iteration loop without leaking the bulk of the corpus.
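
The ~70 MB figure for a 600-frame oracle_run is consistent with a back-of-envelope estimate. This sketch assumes uncompressed binary PPM frames at the GBA's native 240×160 resolution with 3 bytes per pixel, and ignores PPM headers, tar overhead, and the WAV file:

```python
import math

WIDTH, HEIGHT = 240, 160  # GBA native resolution
BYTES_PER_PIXEL = 3       # assumption: 8-bit RGB binary PPM
FRAMES = 600

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL  # 115_200
raw_bytes = FRAMES * frame_bytes                # 69_120_000, about 69 MB

# base64 encodes every 3 input bytes as 4 output characters.
base64_bytes = 4 * math.ceil(raw_bytes / 3)     # 92_160_000, about 92 MB per hop
```

Under these assumptions each base64-over-HTTP hop carries roughly a third more than the tarball itself, which is why lowering the frame cap shrinks latency almost linearly.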

Citations

@misc{gbaeval2026,
  title  = {GBA Eval},
  author = {Mechanize Inc.},
  year   = {2026},
  url    = {https://gbaeval.com/},
  note   = {Upstream eval; this repo ports it to a multi-submit gym.},
}