nethack
A Prime Intellect verifiers environment for training and evaluating language-model agents on NetHack.
This is layer 2 — a thin wrapper around the interface-agnostic nethack_core substrate. See ../../docs/design.md for the full architecture and feature roadmap.
Quickstart
# from the repo root: fetch + build the NetHack fork engine first
# (needs cmake/bison/flex/libbz2-dev). nle/minihack are no longer deps.
git submodule update --init --recursive
bash nethack_core/build_engine.sh # -> third_party/NetHack/src/build/libnethack.so
# install the workspace (--all-packages pulls numpy/gymnasium, formerly
# transitive via nle)
uv sync --all-packages
# smoke test against an OpenAI-compatible endpoint
uv run vf-eval nethack -m gpt-4.1-mini -n 3 -r 1 -a '{"tier": "mini_dungeon"}'
See ../../docs/engine-layer.md for the engine
API (snapshot/branch, level blobs, state modification, difficulty knobs).
Arguments
load_environment(...) accepts:
| arg | type | default | meaning |
|---|---|---|---|
tier | str or None | "corridor_explore" | Curriculum tier name; None = uniform across all |
n_examples | int | 256 | Dataset size |
seed | int | 0 | RNG seed for dataset construction |
max_turns | int | 200 | Per-rollout LM turn cap |
interface | str | "skill" | "skill" (one tool per skill) or "code" (sandboxed Python with nh namespace) |
sub_lm | SubLM or None | None | Backend for nh.summarize/plan/recall_lm. Default at rollout time: OfflineSubLM |
subgoal_proposer | Proposer or None | None | Backend for the dynamic_subgoal tier. Default: OfflineSubgoalProposer |
variant | str | "B1" | Observation/skill preset (see Observation variants). |
compact_obs | bool | True | Glyph-run encoding, blank-row strip, inventory diff. Token lever, not a capability lever. |
skill_set | str | "full" | "full", "dir8", "move", or a CSV whitelist of skills (NetPlay uses a curated CSV with no low-level move). |
trace_dir | str or None | None | If set, writes per-turn NDJSON (raw grid + rendered obs + assistant msg + tool calls + reward) for offline replay. |
continual | bool | False | Auto-reseed NLE on death and carry the journal/belief state across lives. |
continual_lives | int | 5 | Max lives when continual=True. |
CLI gotcha: -a vs -x
Override env args from the CLI with -a (env-args, baked at construction), NOT
-x (extra-env-kwargs, applied via env.set_kwargs() AFTER construction):
prime eval jonathanliu/nethack -m Qwen/Qwen3.5-9B -n 1 -r 1 \
-a '{"tier": "dynamic_subgoal", "interface": "code", "max_turns": 30}'
interface (skill vs code) bakes the tool list at construction time, so passing
it via -x is silently ignored. The hosted-eval writeup for Qwen3.5-9B v0.0.14
hit exactly this: -x '{"max_turns": 30}' had no effect and the rollout ran to
the default cap of 200 turns. Always pass env config through -a. See
docs/EVAL_RECIPES.md.
Observation variants
The variant kwarg selects a per-turn observation/skill preset. These let you
A/B the observation surface without touching env internals; each is a single
load_environment(variant=...) setting. They are wired up and swept by
experiments/exp16_obs_variants.py; see experiment_log.md for findings.
| code | source | what it changes |
|---|---|---|
B1 | current default | Standing baseline: ASCII grid + compaction + journal. |
B0 | calibration | All compaction off (raw rendering). Isolates whether compaction is load-bearing. |
G | Glyphbox (Wang, 2026) | ASCII + adjacency + hostile-list + code-mode tool surface. |
B | BALROG (Paglieri et al., ICLR 2025) | No ASCII grid; natural-language scene description only. |
N | NetPlay (Jeurissen, CoG 2024) | Skill-only action surface (no low-level move(direction=…)). |
R | CPP/GPP | Belief state every 25 turns + hard-drop history before the last checkpoint. |
P | Continual Harness (arXiv:2605.09998) | Periodic self-refinement directive (update journal objective / record a lesson). |
CH | Continual Harness (full) | Teacher "Refiner" model edits prompt + sub-agents + skill macros + memory. |
ND | this repo | NetPlay skill set + a persistent === DESCENT STATUS === salience block. |
FD | this repo | find_and_descend autopilot skill surface + descent salience block. |
E1 | this repo (Wave-3 C) | Surfaces find_frontiers output: === FRONTIERS === (top-5 nearest, with bearing + tile kind), === EXPLORATION === (coverage + per-turn scout delta), === SPATIAL BELIEF === (bearings + known stairs coords). Replaces the legacy descent-salience exhortation with pure spatial information. Skill-only + compacted obs (same as N). |
Findings so far (preliminary, Qwen3.5-9B, seeds 22–26, 200-turn budget):
the ASCII grid is load-bearing — B (no grid) collapses capability. Compaction
(B0 vs B1) is a token/cost lever, not a capability lever. The descent
bottleneck (reaching dungeon level 2) is the dominant failure mode: agents
explore but starve or die while looping on the first level. Skill-only surfaces
(N) and the v0.0.65 deadlock-breaker are the levers under active study;
see experiment_log.md for the live numbers.
Tiers
All tiers now run on the NetHack fork engine. The former MiniHack synthetic
tiers have been retired in the engine migration; a nle_task containing
"MiniHack" raises at construction. Synthetic levels are now produced via the
engine's level-blob load path instead (save_level/load_level,
../../docs/engine-layer.md).
Native NetHack tiers
| tier | nle_task | max_steps | success milestone | description |
|---|---|---|---|---|
corridor_explore | NetHackScore-v0 | 2,000 | reach_dlvl(2) | Default. Real NetHack; reach dungeon level 2. |
mini_dungeon | NetHackScore-v0 | 4,000 | reach_dlvl(3) | Reach dungeon level 3. |
mines_to_minetown | NetHackScore-v0 | 8,000 | mine_town_milestone | Find the Gnomish Mines branch; reach Mine Town. |
sokoban_complete | NetHackScore-v0 | 10,000 | sokoban_complete_milestone | Solve the Sokoban puzzle branch. |
oracle_consult | NetHackScore-v0 | 8,000 | oracle_consult_milestone | Find and pay the Oracle of Delphi. |
full_dungeon_easy | NetHackScore-v0 | 10,000 | reach_dlvl(6) | Standard NetHack with reduced max depth. |
full_nle | NetHackScore-v0 | 100,000 | none (ascension via tty markers) | The full game. Ascend. |
dynamic_subgoal | NetHackScore-v0 | 4,000 | per-rollout (LLM-proposed) | Proposer LLM emits an objective + termination_check; the env compiles it into a Milestone. |
MiniHack synthetic (retired)
The old empty_room / solo_combat / multi_combat MiniHack tiers (formerly
MiniHack-Skill-Custom-v0, gated behind pip install nethack[minihack]) have
been removed. minihack is no longer a dependency. Selecting a "MiniHack"
task now raises at NetHackCoreEnv construction. The replacement for fixed
synthetic levels is the engine's concrete level-blob path (generate a floor,
save_level it to an asset, load_level it at reset).
Rewards
The rubric is built from four @vf.reward(weight=...) functions in
nethack.py:
| reward | weight | fires on |
|---|---|---|
scout_reward | 1.0 | Per-step scout_delta / 1000.0 — newly-revealed dungeon tiles this step. |
descent_reward | 10.0 | +1 (× weight) the first time the agent reaches a new max dungeon level. |
success_reward | 100.0 | +1 (× weight) when the tier's success_milestone fires. |
ascension_reward | 1000.0 | +1 (× weight) when _detect_terminal_outcome finds an ascension marker. |
We deliberately do not use NetHack's in-game score as a training signal — it's gameable. See design doc §3.4. The four shaped rewards form an exponentially-spaced ladder (1 → 10 → 100 → 1000) so the gradient always points at the deepest unlocked rung.
Reading the reward signal
avg_score reported by prime eval is the unweighted sum of the four
raw reward-function values, not the rubric-weighted total. Decompose it
with prime eval samples <id> -o json — each sample carries scout_reward,
descent_reward, success_reward, and ascension_reward directly. A score
of 2.155, for example, is scout 0.155 + descent 1 + success 1 — a rollout
that explored, descended to dlvl 2, and fired the corridor_explore
milestone. Real Qwen3.5-9B rollouts reach this; scout reward accumulates
correctly across the trajectory.
Two things to keep in mind when interpreting short evals:
- Sparse by design.
descent_reward/success_reward/ascension_rewardonly fire on milestones. For a non-fine-tuned LM, onlyscout_rewardis expected to be nonzero until the agent actually descends. - Per-step averaging hides scout reward. If you look at verifiers'
per-step
avg_metricsrather than the trajectory sum,scout_reward(≤ ~0.05/step, exactly 0 on steps that reveal no new tiles) rounds to 0.0 in a two-decimal display. Sum across the trajectory, or readstate["scout_tiles_seen"], to see it accumulating.
Implementation notes for anyone extending the rubric: scout tiles are keyed
by (max_dlvl_reached, x, y), and max_dlvl_reached is bumped at the end of
env_response, so the first step on a new dlvl attributes its tiles to the
previous dlvl. Journal-op skills deliberately zero scout_delta and return
before stepping, so a journal-heavy agent shows scout_reward: 0 for those
turns regardless of what's on screen.
Replaying rollouts
tools/render_rollout_video.py renders an animated GIF/MP4 of a rollout
(ASCII map + status + per-turn tool call) from either a hosted eval
(--eval-id) or a local trace_dir NDJSON (--ndjson). tools/dashboard.py
is a browseable web dashboard over all evals: per-variant reward decomposition
plus a turn-by-turn replay view.
Status
Live on the Hub at jonathanliu/nethack.
Published: v0.0.64 (hosted eval pins the latest published version, not
local code). Verified end-to-end against Qwen3.5-9B in hosted eval across the
observation variants above — no crashes, both skill and code interfaces.
Rollouts reach descent + the corridor_explore success milestone (e.g. the
NetPlay N variant on seeds 22–23). The descent-reliability work in
v0.0.65 (deadlock-breaker + descent-salience obs) is under validation; see
experiment_log.md and experiments/results/ for the live numbers.