Nethack

Training-grade NetHack environment for LM agents with skill-based tools, milestone curriculum, replay capture, and journal memory.

Type
RL Env
Publisher
Jonathanliu
Runtime
tool-use
License
apache-2.0
Size
v0.0.68
Published
May 2026

nethack

A Prime Intellect verifiers environment for training and evaluating language-model agents on NetHack.

This is layer 2 — a thin wrapper around the interface-agnostic nethack_core substrate. See ../../docs/design.md for the full architecture and feature roadmap.

Quickstart

# from the repo root: fetch + build the NetHack fork engine first
# (needs cmake/bison/flex/libbz2-dev). nle/minihack are no longer deps.
git submodule update --init --recursive
bash nethack_core/build_engine.sh   # -> third_party/NetHack/src/build/libnethack.so

# install the workspace (--all-packages pulls numpy/gymnasium, formerly
# transitive via nle)
uv sync --all-packages

# smoke test against an OpenAI-compatible endpoint
uv run vf-eval nethack -m gpt-4.1-mini -n 3 -r 1 -a '{"tier": "mini_dungeon"}'

See ../../docs/engine-layer.md for the engine API (snapshot/branch, level blobs, state modification, difficulty knobs).

Arguments

load_environment(...) accepts:

| arg | type | default | meaning |
| --- | --- | --- | --- |
| tier | str or None | "corridor_explore" | Curriculum tier name; None = uniform across all |
| n_examples | int | 256 | Dataset size |
| seed | int | 0 | RNG seed for dataset construction |
| max_turns | int | 200 | Per-rollout LM turn cap |
| interface | str | "skill" | "skill" (one tool per skill) or "code" (sandboxed Python with nh namespace) |
| sub_lm | SubLM or None | None | Backend for nh.summarize/plan/recall_lm. Default at rollout time: OfflineSubLM |
| subgoal_proposer | Proposer or None | None | Backend for the dynamic_subgoal tier. Default: OfflineSubgoalProposer |
| variant | str | "B1" | Observation/skill preset (see Observation variants). |
| compact_obs | bool | True | Glyph-run encoding, blank-row strip, inventory diff. Token lever, not a capability lever. |
| skill_set | str | "full" | "full", "dir8", "move", or a CSV whitelist of skills (NetPlay uses a curated CSV with no low-level move). |
| trace_dir | str or None | None | If set, writes per-turn NDJSON (raw grid + rendered obs + assistant msg + tool calls + reward) for offline replay. |
| continual | bool | False | Auto-reseed NLE on death and carry the journal/belief state across lives. |
| continual_lives | int | 5 | Max lives when continual=True. |
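As a rough illustration of how these arguments combine, the sketch below mirrors the table's defaults in a plain dict and merges overrides on top, rejecting names the table does not list. `DEFAULT_ENV_ARGS` and `merge_env_args` are hypothetical helpers written for this example only; they are not part of the package, and the real entry point remains `load_environment(...)`.

```python
# Defaults copied from the argument table; this dict is illustrative only
# and is not read by the environment itself.
DEFAULT_ENV_ARGS = {
    "tier": "corridor_explore",
    "n_examples": 256,
    "seed": 0,
    "max_turns": 200,
    "interface": "skill",
    "sub_lm": None,
    "subgoal_proposer": None,
    "variant": "B1",
    "compact_obs": True,
    "skill_set": "full",
    "trace_dir": None,
    "continual": False,
    "continual_lives": 5,
}


def merge_env_args(**overrides):
    """Return the defaults with overrides applied; reject unknown names."""
    unknown = sorted(set(overrides) - set(DEFAULT_ENV_ARGS))
    if unknown:
        raise TypeError(f"unknown env args: {unknown}")
    return {**DEFAULT_ENV_ARGS, **overrides}


env_args = merge_env_args(tier="mini_dungeon", interface="code", max_turns=30)
print(env_args["tier"], env_args["interface"], env_args["max_turns"])
```

Checking names up front is a cheap way to catch a typo such as `max_turn` before a rollout starts; the merged dict has the same shape as the keyword arguments listed in the table.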

CLI gotcha: -a vs -x

Override env args from the CLI with -a (env-args, baked at construction), NOT -x (extra-env-kwargs, applied via env.set_kwargs() AFTER construction):

prime eval jonathanliu/nethack -m Qwen/Qwen3.5-9B -n 1 -r 1 \
  -a '{"tier": "dynamic_subgoal", "interface": "code", "max_turns": 30}'

interface (skill vs code) bakes the tool list at construction time, so passing it via -x is silently ignored. The hosted-eval writeup for Qwen3.5-9B v0.0.14 hit exactly this: -x '{"max_turns": 30}' had no effect and the rollout ran to the default cap of 200 turns. Always pass env config through -a. See docs/EVAL_RECIPES.md.
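When the eval command is generated from a script, one way to avoid quoting mistakes in the `-a` payload is to serialize the env args with `json.dumps` and let `shlex.join` quote the result. This is a minimal sketch that only builds and prints the command string; it reuses the model name and flags from the example above and does not invoke the CLI.

```python
import json
import shlex

# Env config goes through -a so it is applied at construction time.
env_args = {"tier": "dynamic_subgoal", "interface": "code", "max_turns": 30}

command = [
    "prime", "eval", "jonathanliu/nethack",
    "-m", "Qwen/Qwen3.5-9B",
    "-n", "1",
    "-r", "1",
    "-a", json.dumps(env_args),
]

# shlex.join quotes the JSON payload so it reaches the CLI as one argument.
print(shlex.join(command))
```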

Observation variants

The variant kwarg selects a per-turn observation/skill preset. These let you A/B the observation surface without touching env internals; each is a single load_environment(variant=...) setting. They are wired up and swept by experiments/exp16_obs_variants.py; see experiment_log.md for findings.

| code | source | what it changes |
| --- | --- | --- |
| B1 | current default | Standing baseline: ASCII grid + compaction + journal. |
| B0 | calibration | All compaction off (raw rendering). Isolates whether compaction is load-bearing. |
| G | Glyphbox (Wang, 2026) | ASCII + adjacency + hostile-list + code-mode tool surface. |
| B | BALROG (Paglieri et al., ICLR 2025) | No ASCII grid; natural-language scene description only. |
| N | NetPlay (Jeurissen, CoG 2024) | Skill-only action surface (no low-level move(direction=…)). |
| R | CPP/GPP | Belief state every 25 turns + hard-drop history before the last checkpoint. |
| P | Continual Harness (arXiv:2605.09998) | Periodic self-refinement directive (update journal objective / record a lesson). |
| CH | Continual Harness (full) | Teacher "Refiner" model edits prompt + sub-agents + skill macros + memory. |
| ND | this repo | NetPlay skill set + a persistent === DESCENT STATUS === salience block. |
| FD | this repo | find_and_descend autopilot skill surface + descent salience block. |
| E1 | this repo (Wave-3 C) | Surfaces find_frontiers output: === FRONTIERS === (top-5 nearest, with bearing + tile kind), === EXPLORATION === (coverage + per-turn scout delta), === SPATIAL BELIEF === (bearings + known stairs coords). Replaces the legacy descent-salience exhortation with pure spatial information. Skill-only + compacted obs (same as N). |

Findings so far (preliminary, Qwen3.5-9B, seeds 22–26, 200-turn budget): the ASCII grid is load-bearing — B (no grid) collapses capability. Compaction (B0 vs B1) is a token/cost lever, not a capability lever. The descent bottleneck (reaching dungeon level 2) is the dominant failure mode: agents explore but starve or die while looping on the first level. Skill-only surfaces (N) and the v0.0.65 deadlock-breaker are the levers under active study; see experiment_log.md for the live numbers.

Tiers

All tiers now run on the NetHack fork engine. The former MiniHack synthetic tiers have been retired in the engine migration; a nle_task containing "MiniHack" raises at construction. Synthetic levels are now produced via the engine's level-blob load path instead (save_level/load_level, ../../docs/engine-layer.md).

Native NetHack tiers

| tier | nle_task | max_steps | success milestone | description |
| --- | --- | --- | --- | --- |
| corridor_explore | NetHackScore-v0 | 2,000 | reach_dlvl(2) | Default. Real NetHack; reach dungeon level 2. |
| mini_dungeon | NetHackScore-v0 | 4,000 | reach_dlvl(3) | Reach dungeon level 3. |
| mines_to_minetown | NetHackScore-v0 | 8,000 | mine_town_milestone | Find the Gnomish Mines branch; reach Mine Town. |
| sokoban_complete | NetHackScore-v0 | 10,000 | sokoban_complete_milestone | Solve the Sokoban puzzle branch. |
| oracle_consult | NetHackScore-v0 | 8,000 | oracle_consult_milestone | Find and pay the Oracle of Delphi. |
| full_dungeon_easy | NetHackScore-v0 | 10,000 | reach_dlvl(6) | Standard NetHack with reduced max depth. |
| full_nle | NetHackScore-v0 | 100,000 | none (ascension via tty markers) | The full game. Ascend. |
| dynamic_subgoal | NetHackScore-v0 | 4,000 | per-rollout (LLM-proposed) | Proposer LLM emits an objective + termination_check; the env compiles it into a Milestone. |

MiniHack synthetic (retired)

The old empty_room / solo_combat / multi_combat MiniHack tiers (formerly MiniHack-Skill-Custom-v0, gated behind pip install nethack[minihack]) have been removed. minihack is no longer a dependency. Selecting a "MiniHack" task now raises at NetHackCoreEnv construction. The replacement for fixed synthetic levels is the engine's concrete level-blob path (generate a floor, save_level it to an asset, load_level it at reset).

Rewards

The rubric is built from four @vf.reward(weight=...) functions in nethack.py:

| reward | weight | fires on |
| --- | --- | --- |
| scout_reward | 1.0 | Per-step scout_delta / 1000.0 (newly revealed dungeon tiles this step). |
| descent_reward | 10.0 | +1 (× weight) the first time the agent reaches a new max dungeon level. |
| success_reward | 100.0 | +1 (× weight) when the tier's success_milestone fires. |
| ascension_reward | 1000.0 | +1 (× weight) when _detect_terminal_outcome finds an ascension marker. |

We deliberately do not use NetHack's in-game score as a training signal — it's gameable. See design doc §3.4. The four shaped rewards form an exponentially-spaced ladder (1 → 10 → 100 → 1000) so the gradient always points at the deepest unlocked rung.

Reading the reward signal

avg_score reported by prime eval is the unweighted sum of the four raw reward-function values, not the rubric-weighted total. Decompose it with prime eval samples <id> -o json — each sample carries scout_reward, descent_reward, success_reward, and ascension_reward directly. A score of 2.155, for example, is scout 0.155 + descent 1 + success 1 — a rollout that explored, descended to dlvl 2, and fired the corridor_explore milestone. Real Qwen3.5-9B rollouts reach this; scout reward accumulates correctly across the trajectory.
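To make the difference between the two totals concrete, here is a minimal sketch that recomputes both from one sample's raw reward values. The weights come from the rewards table; the `decompose` helper and the sample dict are hypothetical, and the weighted total assumes the rubric simply sums weight × raw value.

```python
# Rubric weights from the rewards table.
WEIGHTS = {
    "scout_reward": 1.0,
    "descent_reward": 10.0,
    "success_reward": 100.0,
    "ascension_reward": 1000.0,
}


def decompose(sample):
    """Return (unweighted_sum, weighted_total) for one sample's raw rewards."""
    raw = {name: float(sample.get(name, 0.0)) for name in WEIGHTS}
    unweighted = sum(raw.values())
    weighted = sum(WEIGHTS[name] * value for name, value in raw.items())
    return unweighted, weighted


# Hypothetical sample matching the 2.155 example in the text.
sample = {
    "scout_reward": 0.155,
    "descent_reward": 1.0,
    "success_reward": 1.0,
    "ascension_reward": 0.0,
}
unweighted, weighted = decompose(sample)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
# prints: unweighted=2.155 weighted=110.155
```

The unweighted figure is what avg_score shows; under the stated assumption, the weighted figure is what the exponentially spaced ladder would contribute for the same rollout.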

Two things to keep in mind when interpreting short evals:

  1. Sparse by design. descent_reward/success_reward/ascension_reward only fire on milestones. For a non-fine-tuned LM, only scout_reward is expected to be nonzero until the agent actually descends.
  2. Per-step averaging hides scout reward. If you look at verifiers' per-step avg_metrics rather than the trajectory sum, scout_reward (≤ ~0.05/step, exactly 0 on steps that reveal no new tiles) usually averages out small enough to display as 0.00 at two decimals. Sum across the trajectory, or read state["scout_tiles_seen"], to see it accumulating.
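The second point is easy to reproduce with made-up numbers. The per-step values below are hypothetical, chosen only to resemble a rollout where a few steps reveal tiles and most reveal none.

```python
# Hypothetical per-step scout_reward values: a few exploratory steps
# separated by steps that reveal nothing new.
per_step = [0.012, 0.0, 0.0, 0.004, 0.0, 0.009, 0.0, 0.0, 0.0, 0.002]

mean_per_step = sum(per_step) / len(per_step)
trajectory_sum = sum(per_step)

print(f"per-step mean: {mean_per_step:.2f}")    # prints: per-step mean: 0.00
print(f"trajectory sum: {trajectory_sum:.3f}")  # prints: trajectory sum: 0.027
```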

Implementation notes for anyone extending the rubric: scout tiles are keyed by (max_dlvl_reached, x, y), and max_dlvl_reached is bumped at the end of env_response, so the first step on a new dlvl attributes its tiles to the previous dlvl. Journal-op skills deliberately zero scout_delta and return before stepping, so a journal-heavy agent shows scout_reward: 0 for those turns regardless of what's on screen.
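The attribution quirk can be pictured with a small simulation. This is a sketch under stated assumptions, not the code in nethack.py: `replay` and its input format are hypothetical, and the only behaviors taken from the text are the (max_dlvl_reached, x, y) key and the bump happening after tiles are counted.

```python
def replay(steps):
    """Replay (dlvl_after_step, revealed_xy) pairs; return per-step deltas."""
    seen = set()            # keys are (max_dlvl_reached, x, y)
    max_dlvl_reached = 1
    deltas = []
    for dlvl_after_step, revealed in steps:
        # Tiles are keyed with the max dlvl as it stood BEFORE this step's bump.
        new_keys = {(max_dlvl_reached, x, y) for x, y in revealed} - seen
        seen |= new_keys
        deltas.append(len(new_keys))
        # The bump happens last, mirroring "at the end of env_response".
        max_dlvl_reached = max(max_dlvl_reached, dlvl_after_step)
    return deltas, seen


steps = [
    (1, [(5, 5), (6, 5)]),  # exploring dlvl 1
    (2, [(5, 5), (7, 7)]),  # the step that arrives on dlvl 2
    (2, [(5, 5), (8, 8)]),  # next step on dlvl 2
]
deltas, seen = replay(steps)
print(deltas)                             # prints: [2, 1, 2]
print([d / 1000.0 for d in deltas])       # per-step scout_reward
```

In the arrival step, tile (5, 5) on the new level collides with the dlvl-1 key and is not counted, while (7, 7) is stored under dlvl 1. Only from the following step onward are tiles keyed under dlvl 2.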

Replaying rollouts

tools/render_rollout_video.py renders an animated GIF/MP4 of a rollout (ASCII map + status + per-turn tool call) from either a hosted eval (--eval-id) or a local trace_dir NDJSON (--ndjson). tools/dashboard.py is a browseable web dashboard over all evals: per-variant reward decomposition plus a turn-by-turn replay view.
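For quick offline checks without the rendering tools, a trace_dir NDJSON file can also be read line by line. The sketch below uses an in-memory stand-in for a trace file; the `turn` and `reward` keys are assumptions made for illustration and may not match the real trace schema.

```python
import io
import json


def iter_turns(stream):
    """Yield one dict per non-empty NDJSON line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a trace file; the keys "turn" and "reward" are assumed
# for illustration and may differ from the real per-turn records.
fake_trace = io.StringIO(
    '{"turn": 0, "reward": 0.012}\n'
    '{"turn": 1, "reward": 0.0}\n'
    '\n'
    '{"turn": 2, "reward": 1.004}\n'
)

turns = list(iter_turns(fake_trace))
total_reward = sum(t["reward"] for t in turns)
print(len(turns), round(total_reward, 3))
```

With a real trace, `fake_trace` would be replaced by an open file handle, and the key names adjusted to whatever the trace writer actually emits.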

Status

Live on the Hub at jonathanliu/nethack. Published: v0.0.64 (hosted eval pins the latest published version, not local code). Verified end-to-end against Qwen3.5-9B in hosted eval across the observation variants above — no crashes, both skill and code interfaces. Rollouts reach descent + the corridor_explore success milestone (e.g. the NetPlay N variant on seeds 22–23). The descent-reliability work in v0.0.65 (deadlock-breaker + descent-salience obs) is under validation; see experiment_log.md and experiments/results/ for the live numbers.