GBA-Emu-Gym

OpenReward gym port of mechanize-work/gba-eval — agents iteratively submit GBA emulator wasm builds and earn reward equal to the high-water mark of the grader's score.

Upstream is a one-shot 24-hour eval: the agent gets one Linux container with a Rust toolchain, writes a GBA emulator from scratch, and at the deadline a held-out grader runs the wasm against ~27 testcases (homebrew gameplay replays + procedural CPU/memory/DMA test ROMs + audio diffs against Mesen2). One submission, one number.

This is a gym: same shape, but the agent can submit any number of times per episode. Each submit call earns

reward = max(0, new_overall - previous_best_overall)

so, with the best score starting at 0, the per-submit rewards telescope and an episode's cumulative reward equals the high-water mark of overall ∈ [0, 1]. A worse submission earns zero and incurs no penalty. The agent's incentive is simple: ship something that compiles, get a baseline, then improve.
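The claim that rewards sum to the high-water mark can be checked with a minimal sketch. It assumes the best score starts at 0.0 for each episode; `submit_rewards` is a hypothetical helper name for illustration, not the env's actual code.

```python
def submit_rewards(overalls):
    """Per-submit rewards for a sequence of grader `overall` scores.

    Assumes the episode's best score starts at 0.0 (an assumption made
    for this sketch, not confirmed against env.py).
    """
    best = 0.0
    rewards = []
    for overall in overalls:
        # reward = max(0, new_overall - previous_best_overall)
        rewards.append(max(0.0, overall - best))
        best = max(best, overall)
    return rewards


# Hypothetical episode: baseline, improvement, regression, improvement.
scores = [0.0085, 0.25, 0.125, 0.5]
rewards = submit_rewards(scores)
```

In this made-up episode the regression to 0.125 earns 0, and the rewards sum to the high-water mark 0.5 (up to floating-point rounding).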

Status

End-to-end pipeline is verified live on OpenReward as GeneralReasoning/GBAGym. A no-op stub submission scores overall ≈ 0.0085 in roughly 3 minutes of grader wall-clock.

How the agent interacts with the env

Inside the agent's sandbox:

Path	What it is
`/task/`	Agent's working directory. Empty cargo workspace expected — agent creates `Cargo.toml` + `src/lib.rs`.
`/task/spec/ABI.md`	10 C-ABI functions the agent's wasm must export (`emu_init`, `emu_load_rom`, `emu_set_keys`, `emu_run_frame`, `emu_framebuffer`, `emu_audio_buffer`, …).
`/task/spec/gba_bios_stub.bin`	16 KiB ARM BIOS stub mapped at `0x00000000`.
`/task/spec/gbatek.htm`	5 MB GBATEK hardware reference.
`/task/dev-roms/`	5 visible GBA ROMs: 3 are also in the grader's corpus and 2 are dev-only (see ROM split below).

Available tools (per session):

Tool	What it does
`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todo_write`	Standard Claude-Code-style file/shell tools, scoped to the agent's sandbox.
`oracle_run(rom_path, frames, replay_text?)`	Run a ROM through the reference Mesen2 emulator in a separate sandbox; per-frame PPMs + WAV audio land back as a tarball at `/task/.oracle-out/<run-id>.tar`. Mesen2 itself is never visible to the agent.
`submit`	Grade whatever wasm is currently at `/task/target/wasm32-unknown-unknown/release/gba_emu.wasm`. Returns `overall` + section subscores. Per-testcase scores are deliberately not surfaced.
`give_up`	End the episode early.

The agent must run `cargo build` itself (for example `cargo build --release --target wasm32-unknown-unknown`, which matches the canonical path). `submit` does not build; it only grades the artifact at that path.

ROM split

To preserve anti-shimming pressure while still letting the agent develop against real ROMs:

Bucket	ROMs
Visible (in `dev-roms/`, also graded)	`spout`, `waimanu` (gameplay replays); `armwrestler` (procedural test)
Pure dev (in `dev-roms/`, never graded)	`trogdor`, `another-world`
Held-out (grader only)	`celeste-classic`, `varooom-3d`, `bulletgba`, `chip-advance`, `collie-defense`, `goodboy-advance`, `heartwrench-advance`, `piugba`, `mgba-suite`, `jsmolka/memory`, `destoer/dma-priority`, several `nba-hw/*` ROMs, `tonc/snd1-demo`, audio test ROMs

The agent never sees the held-out names and submit only returns section aggregates (replay, procedural, audio), so they cannot reverse-engineer the held-out set from feedback.

Architecture

Each session has two sandboxes, both owned by env code (the agent never touches the eval sandbox):

┌──────────────────────────────────────────────────────────────────────┐
│ OpenReward env server (FastAPI, deployed from this repo)             │
│                                                                      │
│  ┌────────────────────────┐         ┌────────────────────────────┐   │
│  │  agent_sandbox  4:16   │         │  eval_sandbox      4:16    │   │
│  │  ────────────────────  │         │  ──────────────────────    │   │
│  │  rust 1.87 + wasmtime  │         │  oracle binary             │   │
│  │  clang / cmake / py3   │         │  grader binary             │   │
│  │  spec/, dev-roms/      │         │  mesen.wasm                │   │
│  │                        │         │  full corpus (held-out)    │   │
│  │  agent edits Rust here │         │  ref-cache (~230 MB LFS)   │   │
│  │  ▲                     │         │  ▲                         │   │
│  └──┼─────────────────────┘         └──┼─────────────────────────┘   │
│     │ Claude-Code-style tools          │ shell-in via env code       │
│     │ + oracle_run + submit + give_up  │ for oracle / grader runs    │
│     └──────────── env code ────────────┘                             │
└──────────────────────────────────────────────────────────────────────┘

Both sandboxes are pinned by GHCR digest (images/{task,eval}.sha, written by CI). Network is blocked on both pods.

`oracle_run` flow

1. Env downloads ROM bytes from agent_sandbox at `rom_path`.
2. Stages the ROM (and optional replay file) in eval_sandbox at `/eval/scratch/<run-id>/`.
3. Runs `oracle run rom.gba <frames> --replay … --dump-frames frames/ --dump-audio audio.wav` in eval_sandbox.
4. Tars `frames/` and `audio.wav` together, downloads the tarball, and uploads it to the agent's `/task/.oracle-out/<run-id>.tar`.
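On the agent side, the returned tarball has to be unpacked before frames can be compared against the candidate emulator's output. The sketch below shows one way to do that with the standard library. It assumes the frames are binary P6 PPMs with maxval 255 and no header comments, and that they sit under `frames/` in the archive; the member name `frames/0000.ppm` is hypothetical, and the demo builds its own tiny archive rather than using real oracle output.

```python
import io
import tarfile


def parse_ppm(data):
    """Parse a binary P6 PPM (maxval 255, no comments) into (width, height, rgb)."""
    tokens, pos = [], 0
    while len(tokens) < 4:  # magic, width, height, maxval
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            raise ValueError("truncated PPM header")
        tokens.append(data[start:pos])
    width, height, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    if tokens[0] != b"P6" or maxval != 255:
        raise ValueError("expected binary P6 PPM with maxval 255")
    # Exactly one whitespace byte separates the header from the pixels.
    pixels = data[pos + 1:pos + 1 + width * height * 3]
    if len(pixels) != width * height * 3:
        raise ValueError("truncated pixel data")
    return width, height, pixels


def read_frames(tar_bytes):
    """Map each .ppm member name in the tarball to its parsed frame."""
    frames = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".ppm"):
                frames[member.name] = parse_ppm(tar.extractfile(member).read())
    return frames


def _demo_tar():
    """Build a synthetic one-frame archive (2x1 pixels) for illustration."""
    ppm = b"P6\n2 1\n255\n" + bytes([255, 0, 0, 0, 0, 255])
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo("frames/0000.ppm")  # hypothetical member name
        info.size = len(ppm)
        tar.addfile(info, io.BytesIO(ppm))
    return buf.getvalue()


frames = read_frames(_demo_tar())
```

For a real run, the bytes would come from reading `/task/.oracle-out/<run-id>.tar` instead of `_demo_tar()`.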

`submit` flow

1. Env downloads `/task/target/wasm32-unknown-unknown/release/gba_emu.wasm` from agent_sandbox. Errors clearly if absent (agent forgot to build).
2. Uploads it to eval_sandbox at `/eval/scratch/grade-<run-id>/candidate.wasm`.
3. Runs `/eval/bin/grader --reference /opt/gba-eval/mesen.wasm candidate.wasm /eval/corpus <out>` (15-min timeout).
4. Reads `<out>/summary.json` → returns scalar `overall` + section scores. Cleans the scratch dir.
5. Computes `delta = max(0, overall - best_score)`, updates `best_score`, returns `delta` as reward.
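The grader step and the summary read can be sketched as below. This is an illustration under assumptions, not the code in env.py: the helper names are hypothetical, and so are the `summary.json` key names (`overall`, `replay`, `procedural`, `audio`, `testcases`), since the real schema is defined by the upstream grader. The output directory is left as a parameter because the flow above does not name it.

```python
import json
import tempfile
from pathlib import Path

GRADER_TIMEOUT_S = 15 * 60  # the 15-minute timeout from the grader step


def grader_argv(candidate, out_dir):
    """Argument vector for the grader invocation described above."""
    return [
        "/eval/bin/grader",
        "--reference", "/opt/gba-eval/mesen.wasm",
        candidate,
        "/eval/corpus",
        out_dir,
    ]


def read_summary(out_dir):
    """Return only the scalar overall and the section aggregates.

    Key names are hypothetical. Anything else in the file, such as
    per-testcase scores, is deliberately dropped before returning.
    """
    summary = json.loads((Path(out_dir) / "summary.json").read_text())
    keys = ("overall", "replay", "procedural", "audio")
    return {key: float(summary[key]) for key in keys}


# Demo against a fabricated summary file, not real grader output.
with tempfile.TemporaryDirectory() as out_dir:
    fake = {
        "overall": 0.35,
        "replay": 0.5,
        "procedural": 0.25,
        "audio": 0.0,
        "testcases": {"hypothetical-rom": 0.5},
    }
    (Path(out_dir) / "summary.json").write_text(json.dumps(fake))
    result = read_summary(out_dir)
```

Filtering at the env boundary, rather than trusting the caller to ignore extra fields, is one way to keep per-testcase scores from ever reaching the agent.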

Why this isolation matters

Mirrors upstream's services/task split: mesen.wasm and the oracle binary live only in the eval sandbox, never reachable from the agent's container. Stronger than upstream's HTTP-sidecar design — every interaction is a Pydantic-schema-validated tool call, no wire protocol exposed, and the agent has no network path to the reference at all.

Wasmtime memory patch

The default wasmtime::Config reserves ~4 GiB of virtual address space per linear memory + a 2 GiB guard region. The grader instantiates two wasms (reference + candidate) so a single submit would want ~12 GiB of VAS — the OpenReward sandbox's kernel/cgroup vm limits reject the mmap even at 4:16 (the largest non-GPU machine size). docker/eval.Dockerfile applies a small in-place patch to upstream/harness/grader/src/wasm_candidate.rs that switches every memory to dynamic mode with a 64 KiB guard, eliminating the giant mmap. Without it, every submit errors at instantiation with mmap failed to reserve 0x200000000 bytes.
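The address-space arithmetic above works out as follows. The sketch uses only the figures quoted in this section and also decodes the size from the error message. It counts reservations and guard regions only; in dynamic mode the memory each wasm actually uses is still mapped as it grows.

```python
GIB = 1 << 30
KIB = 1 << 10

# Figures quoted in this section for the default configuration.
static_reservation = 4 * GIB   # per linear memory
static_guard = 2 * GIB         # guard region per linear memory
instances = 2                  # reference wasm + candidate wasm

default_vas = instances * (static_reservation + static_guard)

# After the patch: dynamic memories with a 64 KiB guard each.
patched_guard_total = instances * 64 * KIB

# Size reported in the mmap failure message.
failed_mmap = 0x200000000

print(default_vas // GIB, "GiB reserved by default")
print(patched_guard_total // KIB, "KiB of guard after the patch")
print(failed_mmap // GIB, "GiB in the failed reservation")
```

The failed reservation decodes to 8 GiB, which is more than the 6 GiB per memory counted above. One plausible reading, offered here as an assumption, is that the runtime also places a guard region before the memory.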

Repo layout

server.py                       OpenReward env server entry point
env.py                          GBAEmuGym Environment — dual-sandbox flow + tools
TASK.md                         Agent-facing prompt (baked into the task image)
Dockerfile                      Env server image (FastAPI, deployed by OpenReward)
docker/task.Dockerfile          Agent sandbox image
docker/eval.Dockerfile          Eval sandbox image (oracle + grader + corpus)
images/{task,eval}.sha          Digest-pinned GHCR image refs (CI writes these)
upstream/                       Git submodule → mechanize-work/gba-eval @ pinned SHA
.github/workflows/
  build-images.yml              Builds task + eval images on every push,
                                pushes to GHCR, commits digests back to images/
requirements.txt                openreward, pydantic (env server runtime deps)
pyproject.toml                  Project metadata
runner/                         (gitignored) Local dev tooling — interactive
                                step_through script, snapshot extractor

Reward shape

Property	Value
Range	`overall ∈ [0, 1]`, cumulative episode reward ≤ 1.0
Sign	Monotone non-decreasing — worse submissions earn 0, no penalty
Termination	No auto-finish from `submit`. Episode ends when the agent calls `give_up` or the harness enforces a step/wall-clock cap
Scoring	`overall = 0.60 × replay + 0.20 × procedural + 0.20 × audio` (configurable in `corpus/grader.yaml`, but agents can't see it)
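As a worked example of the scoring row, the sketch below applies the default weights to a made-up set of section scores. The section values are illustrative only, and the sketch assumes each section score lies in [0, 1].

```python
# Default weights from the Scoring row above.
WEIGHTS = {"replay": 0.60, "procedural": 0.20, "audio": 0.20}


def overall_score(sections):
    """Weighted sum of section scores (each assumed to be in [0, 1])."""
    return sum(WEIGHTS[name] * sections[name] for name in WEIGHTS)


# Hypothetical section scores, not real grader output.
example = {"replay": 0.5, "procedural": 0.25, "audio": 0.0}
overall = overall_score(example)  # 0.60*0.5 + 0.20*0.25 + 0.20*0.0
```

Here `overall` comes to about 0.35. Because replay carries 60% of the weight, an emulator with no audio at all can still reach 0.80.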

If you want larger reward magnitudes for training, scale at the trainer.

Local setup

git clone --recurse-submodules <this-repo>
cd GBA-Emu-Gym
git -C upstream lfs install && git -C upstream lfs pull   # ~230 MB ref cache
uv pip install -r requirements.txt

Build the sandbox images locally:

docker build -f docker/task.Dockerfile -t gba-emu-gym-task:dev .
docker build -f docker/eval.Dockerfile -t gba-emu-gym-eval:dev .

CI does this automatically on every push to main and writes the resulting GHCR digests back to images/{task,eval}.sha. The env code reads those pins via env.py:_read_image_pin(...) and falls back to :latest tags for local development.
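The pin-then-fallback behaviour can be sketched as below. This is not the body of `_read_image_pin`; the function name, its signature, the assumption that a `.sha` file holds a single image ref, and the example image names are all hypothetical.

```python
import tempfile
from pathlib import Path


def read_image_pin_sketch(pin_file, fallback):
    """Return the pinned image ref in `pin_file`, else `fallback`.

    Assumes the pin file contains one image ref, possibly followed by a
    newline. A missing or empty file triggers the fallback.
    """
    try:
        pinned = Path(pin_file).read_text().strip()
    except FileNotFoundError:
        return fallback
    return pinned or fallback


# Demo with hypothetical image names.
FALLBACK = "ghcr.io/example/gba-emu-gym-task:latest"
with tempfile.TemporaryDirectory() as tmp:
    pin = Path(tmp) / "task.sha"
    missing = read_image_pin_sketch(pin, FALLBACK)
    pin.write_text("ghcr.io/example/gba-emu-gym-task@sha256:abc123\n")
    pinned = read_image_pin_sketch(pin, FALLBACK)
```

Falling back silently is convenient for local development. A deployed env might prefer to fail loudly when a pin is missing, so that an unpinned image is never used by accident.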

License

Inherits upstream's per-file licensing — see upstream/LEGAL.md. Briefly:

Original work in this repo (env.py, server.py, Dockerfiles, README, TASK.md): MIT
Upstream harness/spec/corpus (non-ROM): MIT
corpus/roms/: per-ROM licenses (homebrew + open test ROMs)
Mesen2 wasm + build glue: GPL-3.0
spec/gba_bios_stub.bin: clean-room MIT (not a Nintendo dump)

Known limitations

oracle_run is slow per call — a 600-frame run shuttles a ~70 MB tarball through env code (agent ↔ env ↔ eval, base64-over-HTTP). Workable, but agents that want many high-frame queries will see latency. Mitigations on the table: cap frames lower, add a session-style tool that holds state across many small steps, or ship a one-shot tool that returns a similarity score directly (no frame bytes cross the boundary).
Each submit is 1-5 minutes of grader CPU. Episodes with hundreds of submits get expensive. Trainers should bound submit frequency.
Three of the visible ROMs are also graded, so the agent gets some direct signal from the visible set. This is intentional: it gives the agent a tractable iteration loop without leaking the bulk of the corpus.
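The ~70 MB figure for a 600-frame `oracle_run` is consistent with uncompressed frame data. The back-of-envelope sketch below assumes 240×160 frames (the GBA's screen resolution) stored as 8-bit RGB in binary PPMs inside an uncompressed tar; it ignores PPM headers, tar overhead, and the WAV audio.

```python
WIDTH, HEIGHT = 240, 160   # GBA screen resolution
BYTES_PER_PIXEL = 3        # 8-bit RGB, as in a binary P6 PPM (assumed)
FRAMES = 600

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
raw_bytes = FRAMES * frame_bytes

# base64 encodes every 3 input bytes as 4 output characters.
base64_bytes = 4 * ((raw_bytes + 2) // 3)

print(frame_bytes)    # 115200
print(raw_bytes)      # 69120000
print(base64_bytes)   # 92160000
```

Under these assumptions the base64 hop adds roughly a third on top of the raw payload, so capping `frames` reduces transfer size linearly.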

Citations

@misc{gbaeval2026,
  title  = {GBA Eval},
  author = {Mechanize Inc.},
  year   = {2026},
  url    = {https://gbaeval.com/},
  note   = {Upstream eval; this repo ports it to a multi-submit gym.},
}