forth_lang

Multi-turn tool-using-agent environment on Forth code execution. The model defines a Forth word that satisfies a natural-language specification, using three sandboxed tools (submit_code, run_code, lookup_doc). Reward is binary: 1.0 if the latest submission passes all hidden test cases, else 0.0.

Env config (verifiers v1)

[env.taskset] / [env.harness] sections validate directly against ForthLangTasksetConfig / ForthLangHarnessConfig.

Taskset fields (`[env.taskset]`)

Field	Default	Notes
`tiers`	`None` (all)	List of tier ids (0-5) to include. Caller chooses train/eval splits by instantiating one env per split.
`categories`	`None` (all)	List of category names to include. See `forth_lang.tasks.CATEGORIES`.
`word_to_call`	`None` (all)	List of `word_to_call` ids to include (unique, stable task ids — each task defines a unique Forth word). Unknown ids raise `ValueError`. AND-composed with `tiers` / `categories`.
`exclude_word_to_call`	`None` (none)	List of `word_to_call` ids to drop. Use the same list as `word_to_call` for eval + `exclude_word_to_call` for train to define a held-out test set once.
`holdout_fraction`	`None` (off)	Hash-based deterministic split on `word_to_call`. Set the same value (e.g. `0.2`) on both the train and eval env declarations with matching `holdout_seed`; the train env gets the ~(1-fraction) complement, the eval env gets the fraction holdout (auto-derived from `get_dataset` vs `get_eval_dataset`).
`holdout_seed`	`0`	Salt for the holdout hash. Keep fixed across train+eval so the two sides are complementary.
`dataset_repo`	`PrimeIntellect/forth-lang-tasks`	Public HF repo id or local path that `datasets.load_dataset` can read.
`sandbox.image`	`team-clyvldofb0000gg1kx39rgzjq/forth-lang:v3`	Sandbox image — a baked image with gforth + python3 + bm25s + the docs bundle.
`sandbox.cpu_cores`, `sandbox.memory_gb`, `sandbox.disk_size_gb`, `sandbox.timeout_minutes`, `sandbox.command_timeout`	`1.0`, `1.0`, `2.0`, `30`, `15`	Standard `vf.SandboxConfig` fields.
`sandbox_labels`	`[]`	Labels appended to `sandbox.labels` at config-load time, so per-cell configs can tag sandboxes (cleanup, quota tracking, dashboard grouping) without overriding the full `sandbox` block.
`system_prompt`	(built-in Forth-tutor)	Override the default prompt.

Harness fields (`[env.harness]`)

Field	Default	Notes
`max_turns`	30	Hard cap on assistant turns per rollout.

Changelog

v0.3.0

Removed the task-generation pipeline (task_generation/ — aggregate.py, reverify.py, run_filter.py, sanity_check_trivial.py, verify.py, verify_task.py, upload_to_hf.py, ~1,300 lines). It produced the HF dataset once and was never used at runtime; recover from git history if a rebuild is needed (last present in commit 693f8c430).
Migrated to verifiers v1 (Taskset / Harness). Replaces the SandboxMixin + StatefulToolEnv subclass and the separate ForthLangRubric with a slim ForthLangTaskset + bare vf.Harness. Net delete: ~340 lines of glue (per-tool add_tool + args_to_skip + update_tool_args, custom setup_state, two @vf.cleanup methods, the Rubric subclass, the whole sandbox_helpers.py).
Sandbox lifecycle is framework-owned. Toolset declares the sandbox config; v1 provisions the per-rollout lease and releases it in cleanup_rollout. No more init_sandbox_client / create_sandbox / delete_sandbox calls in env code.
run_code.word_to_call is now hidden from the model. Bound from task.word_to_call, the model never has to (and can't fail to) pass it. Removes a real per-turn failure mode.
Hidden-test verifier is now a @vf.reward(priority=10). Drives the same in-rollout run_code callable the model uses via a passed.run_code = "tools.run_code" binding. The four diagnostic signals (pass_rate, has_error, banned_violation, submission_error_rate) are priority-0 @vf.metric functions that read state.
TOML config promoted to first-class. [env.taskset] / [env.harness] sections validate against ForthLangTasksetConfig / ForthLangHarnessConfig directly. CLI overrides go through the typed config too — e.g. vf-eval forth-lang -a '{"config": {"taskset": {"tiers": [5]}}}'.
Per-row knobs available for free. Task rows can set max_turns, sandbox, and tools show/hide for per-task sizing and action-space control — no code changes needed.
API drops: ForthLangEnv and ForthLangRubric classes are gone; load_environment now takes a single vf.EnvConfig. Public surface from forth_lang: load_environment, ForthLangTaskset, ForthLangTasksetConfig, ForthLangHarnessConfig.

v0.2.0

New tier scheme T0-T5. Tiers are now derived empirically from glm-5.1 pass-rate at 10 rollouts/task (default budget, max_turns=30). T0 = pr==1.0 (perfectly solved across all 10 rollouts × full test pass); T1..T5 are equal-rank quantile bins over the remaining pr<1.0 cohort. Replaces the prior T0-T8 solver-filter scheme. The 4-pass solver filter pipeline still lives in scripts/run_filter.py and is used for task generation, but its tier output is no longer the canonical task label.
+31 Phase 3 tasks generated for previously-zero-coverage categories (arithmetic, comparison-and-logic, conditionals, indefinite-loops, recursion, variables-and-memory). After the N=10 re-tier, 6 survived as non-T0 (poly-eval, walk-sum → T1; ilog-budget, pow-mod-find → T2; eval-node → T3; ll-run → T4); the 25 that landed at T0 (redundant with the existing perfect-solve cohort) were dropped.
Dataset now at 419 rows on HF (down from 444 after dropping the 25 T0 leaks; up from 413 at v0.1.2 with the 6 non-T0 Phase 3 additions). Tier distribution: T0×248, T1×35, T2-T5×34 each.
N=10 reveals N=3 flakiness: 91 of the 339 N=3-trial "T0" tasks migrate to T1+ at N=10 — 21% over-counting of the perfect cohort at N=3. The N=3 T3 bin was also degenerate (all-ties at pr=0.667); N=10 produces a real 0.800-0.854 T3 range.
New filter args on load_environment / load_tasks: word_to_call (positive: include only these task ids) and exclude_word_to_call (negative: drop these ids). Enables the "test set = these N tasks, train set = everything else" pattern with a single shared list. Both raise ValueError on unknown ids.
README updates: new tier definition with N=10 ranges (count, range, mean, median), N=3 vs N=10 comparison table, full category × tier matrix sorted by % non-T0, Phase 3 survivors table. Removed outdated T6/T7/T8 prose.
Coverage gaps documented: conditionals is now 18/18 at T0 (zero non-T0 coverage); arithmetic nearly as flat (1/20 non-T0). 10 of 15 categories have no T5 representation — the hardest cohort concentrates in data-structures, metaprogramming, strings, stack-manipulation, forth-idioms.

v0.1.2

+39 new tasks added (6 buckets explicitly aimed at T8: stack-only-memory, compile-time-meta, EXECUTE-only control flow, introspection / self-modifying, coroutines-via-rstack, bit-fiddle-without-arithmetic). After the 4-pass filter the empirical distribution was: T0×1, T1×1, T2×2, T3×8, T4×10, T6×2, T7×15. No T8 survivors, despite the T8-targeted generation — glm-5.1 cracked every fourth-pass task at 100 turns (12 @ 3/3, 3 @ 2/3). Honest data point: hardening the env beyond T7 requires deeper esoterica than the 6 OOD angles we tried; the previous T8 (srev, roll-from-rstack stack reversal) remains the only T8 task.
Dataset now at 413 rows on HF.
aggregate.py validator widened to accept tier ∈ [0, 8] (added in v0.1.1, codified for v0.1.2 spec-vs-implementation hygiene).
Spec-vs-test alignment audit pass added to the generation pipeline: a separate sub-agent reads each candidate's prose without the reference solution and flags cases where a fresh reader couldn't predict the expected outputs. This batch had 1 alignment fix (select-rstack-no-if test inputs converted from JSON booleans to Forth-flag -1/0 literals).

v0.1.1

Taskset moved to private HF dataset PrimeIntellect/forth-lang-tasks. 7-column schema; loaded via datasets.load_dataset with HF_TOKEN. Override the source with FORTH_LANG_TASKS_REPO (accepts an alternate HF repo or a local path) for staging.
All three tools now execute inside the sandbox. lookup_doc migrated from a host-side BM25 retriever to python3 /opt/forth-lang/lookup_docs.py <query> in the sandbox; bm25s dropped from host deps.
Team-registry Docker image team-clyvldofb0000gg1kx39rgzjq/forth-lang:v2 is now the only supported runtime image. Docs index is fetched fresh from the gforth manual at image-build time and baked into /opt/forth-lang/; no in-sandbox install or upload at rollout setup.
+33 new tasks added (4 buckets, filtered through a 4-pass solver pipeline with strict 0/3 cascade). Per-tier distribution of additions: T1×1, T2×2, T3×2, T4×6, T6×1, T7×20, T8×1.
New T8 ceiling-probe tier. Tasks where all four solver passes scored 0/3 but the reference solution still passes its own tests are now kept and labeled T8 (rather than dropped). stack-reverse-no-memory is the first T8 entry — solvable via a roll-based return-stack idiom, but neither gpt-4.1-mini, gpt-5-mini, gpt-5.4, nor glm-5.1 produced a single passing rollout at up to 100 turns.
Solver-filter pipeline extended to a 4th stage (run_filter.py fourth) and _empirical_tier now produces T7/T8 empirically (alongside any hand-curated t7_curated tasks). All cascade steps moved to strict 0/3 (a task only escalates when the previous solver scored 0 across all 3 rollouts).
Sandbox controls: labels= arg now reachable via load_environment so prime sandbox delete -l <label> can scope cleanups; sandbox_name is set internally for admin-panel display.
Scripts hygiene: aggregate_v2.py → aggregate.py, reverify_v2_raw.py → reverify.py; sys.path.insert hacks removed; DEFAULT_ENDPOINTS hardcode dropped; shared verify_task, read_jsonl, GROUP_TO_CATEGORY extracted into the env package.
Gforth helpers consolidated into forth_lang/gforth.py (format_literal, build_forth_line, parse_stack), shared by the sandbox runner and the offline subprocess runner.
Loader hygiene: taskset loading logic moved from tasks/__init__.py into tasks/loader.py; __init__.py is a pure re-export.
Misc bug fixes caught by Cursor's bugbot: NameError in upload_to_hf.py empty-rows guard; KeyError: 'scenario' in run_filter.py join logic (now joins on word_to_call); README-implementation drift on tier range and max_submissions; dead code in build_docs_cache.py.

v0.1.0

Created environment.