forth_lang
Multi-turn tool-using-agent environment on Forth code execution. The model defines a Forth word that satisfies a natural-language specification, using three sandboxed tools (submit_code, run_code, lookup_doc). Reward is binary: 1.0 if the latest submission passes all hidden test cases, else 0.0.
Env config (verifiers v1)
[env.taskset] / [env.harness] sections validate directly against ForthLangTasksetConfig / ForthLangHarnessConfig.
Taskset fields ([env.taskset])
| Field | Default | Notes |
|---|---|---|
tiers | None (all) | List of tier ids (0-5) to include. Caller chooses train/eval splits by instantiating one env per split. |
categories | None (all) | List of category names to include. See forth_lang.tasks.CATEGORIES. |
word_to_call | None (all) | List of word_to_call ids to include (unique, stable task ids — each task defines a unique Forth word). Unknown ids raise ValueError. AND-composed with tiers / categories. |
exclude_word_to_call | None (none) | List of word_to_call ids to drop. Use the same list as word_to_call for eval + exclude_word_to_call for train to define a held-out test set once. |
holdout_fraction | None (off) | Hash-based deterministic split on word_to_call. Set the same value (e.g. 0.2) on both the train and eval env declarations with matching holdout_seed; the train env gets the ~(1-fraction) complement, the eval env gets the fraction holdout (auto-derived from get_dataset vs get_eval_dataset). |
holdout_seed | 0 | Salt for the holdout hash. Keep fixed across train+eval so the two sides are complementary. |
dataset_repo | PrimeIntellect/forth-lang-tasks | Public HF repo id or local path that datasets.load_dataset can read. |
sandbox.image | team-clyvldofb0000gg1kx39rgzjq/forth-lang:v3 | Sandbox image — a baked image with gforth + python3 + bm25s + the docs bundle. |
sandbox.cpu_cores, sandbox.memory_gb, sandbox.disk_size_gb, sandbox.timeout_minutes, sandbox.command_timeout | 1.0, 1.0, 2.0, 30, 15 | Standard vf.SandboxConfig fields. |
sandbox_labels | [] | Labels appended to sandbox.labels at config-load time, so per-cell configs can tag sandboxes (cleanup, quota tracking, dashboard grouping) without overriding the full sandbox block. |
system_prompt | (built-in Forth-tutor) | Override the default prompt. |
Harness fields ([env.harness])
| Field | Default | Notes |
|---|---|---|
max_turns | 30 | Hard cap on assistant turns per rollout. |
Changelog
v0.3.0
- Removed the task-generation pipeline (
task_generation/—aggregate.py,reverify.py,run_filter.py,sanity_check_trivial.py,verify.py,verify_task.py,upload_to_hf.py, ~1,300 lines). It produced the HF dataset once and was never used at runtime; recover from git history if a rebuild is needed (last present in commit693f8c430). - Migrated to verifiers v1 (Taskset / Harness). Replaces the
SandboxMixin + StatefulToolEnvsubclass and the separateForthLangRubricwith a slimForthLangTaskset+ barevf.Harness. Net delete: ~340 lines of glue (per-tooladd_tool+args_to_skip+update_tool_args, customsetup_state, two@vf.cleanupmethods, the Rubric subclass, the wholesandbox_helpers.py). - Sandbox lifecycle is framework-owned. Toolset declares the sandbox config; v1 provisions the per-rollout lease and releases it in
cleanup_rollout. No moreinit_sandbox_client/create_sandbox/delete_sandboxcalls in env code. run_code.word_to_callis now hidden from the model. Bound fromtask.word_to_call, the model never has to (and can't fail to) pass it. Removes a real per-turn failure mode.- Hidden-test verifier is now a
@vf.reward(priority=10). Drives the same in-rolloutrun_codecallable the model uses via apassed.run_code = "tools.run_code"binding. The four diagnostic signals (pass_rate,has_error,banned_violation,submission_error_rate) are priority-0@vf.metricfunctions that read state. - TOML config promoted to first-class.
[env.taskset]/[env.harness]sections validate againstForthLangTasksetConfig/ForthLangHarnessConfigdirectly. CLI overrides go through the typed config too — e.g.vf-eval forth-lang -a '{"config": {"taskset": {"tiers": [5]}}}'. - Per-row knobs available for free. Task rows can set
max_turns,sandbox, andtoolsshow/hide for per-task sizing and action-space control — no code changes needed. - API drops:
ForthLangEnvandForthLangRubricclasses are gone;load_environmentnow takes a singlevf.EnvConfig. Public surface fromforth_lang:load_environment,ForthLangTaskset,ForthLangTasksetConfig,ForthLangHarnessConfig.
v0.2.0
- New tier scheme T0-T5. Tiers are now derived empirically from glm-5.1 pass-rate at 10 rollouts/task (default budget, max_turns=30).
T0 = pr==1.0(perfectly solved across all 10 rollouts × full test pass); T1..T5 are equal-rank quantile bins over the remainingpr<1.0cohort. Replaces the prior T0-T8 solver-filter scheme. The 4-pass solver filter pipeline still lives inscripts/run_filter.pyand is used for task generation, but its tier output is no longer the canonical task label. - +31 Phase 3 tasks generated for previously-zero-coverage categories (arithmetic, comparison-and-logic, conditionals, indefinite-loops, recursion, variables-and-memory). After the N=10 re-tier, 6 survived as non-T0 (
poly-eval,walk-sum→ T1;ilog-budget,pow-mod-find→ T2;eval-node→ T3;ll-run→ T4); the 25 that landed at T0 (redundant with the existing perfect-solve cohort) were dropped. - Dataset now at 419 rows on HF (down from 444 after dropping the 25 T0 leaks; up from 413 at v0.1.2 with the 6 non-T0 Phase 3 additions). Tier distribution: T0×248, T1×35, T2-T5×34 each.
- N=10 reveals N=3 flakiness: 91 of the 339 N=3-trial "T0" tasks migrate to T1+ at N=10 — 21% over-counting of the perfect cohort at N=3. The N=3 T3 bin was also degenerate (all-ties at pr=0.667); N=10 produces a real 0.800-0.854 T3 range.
- New filter args on
load_environment/load_tasks:word_to_call(positive: include only these task ids) andexclude_word_to_call(negative: drop these ids). Enables the "test set = these N tasks, train set = everything else" pattern with a single shared list. Both raiseValueErroron unknown ids. - README updates: new tier definition with N=10 ranges (count, range, mean, median), N=3 vs N=10 comparison table, full category × tier matrix sorted by % non-T0, Phase 3 survivors table. Removed outdated T6/T7/T8 prose.
- Coverage gaps documented:
conditionalsis now 18/18 at T0 (zero non-T0 coverage);arithmeticnearly as flat (1/20 non-T0). 10 of 15 categories have no T5 representation — the hardest cohort concentrates indata-structures,metaprogramming,strings,stack-manipulation,forth-idioms.
v0.1.2
- +39 new tasks added (6 buckets explicitly aimed at T8: stack-only-memory, compile-time-meta, EXECUTE-only control flow, introspection / self-modifying, coroutines-via-rstack, bit-fiddle-without-arithmetic). After the 4-pass filter the empirical distribution was: T0×1, T1×1, T2×2, T3×8, T4×10, T6×2, T7×15. No T8 survivors, despite the T8-targeted generation — glm-5.1 cracked every fourth-pass task at 100 turns (12 @ 3/3, 3 @ 2/3). Honest data point: hardening the env beyond T7 requires deeper esoterica than the 6 OOD angles we tried; the previous T8 (
srev,roll-from-rstack stack reversal) remains the only T8 task. - Dataset now at 413 rows on HF.
aggregate.pyvalidator widened to accepttier ∈ [0, 8](added in v0.1.1, codified for v0.1.2 spec-vs-implementation hygiene).- Spec-vs-test alignment audit pass added to the generation pipeline: a separate sub-agent reads each candidate's prose without the reference solution and flags cases where a fresh reader couldn't predict the expected outputs. This batch had 1 alignment fix (
select-rstack-no-iftest inputs converted from JSON booleans to Forth-flag-1/0literals).
v0.1.1
- Taskset moved to private HF dataset
PrimeIntellect/forth-lang-tasks. 7-column schema; loaded viadatasets.load_datasetwithHF_TOKEN. Override the source withFORTH_LANG_TASKS_REPO(accepts an alternate HF repo or a local path) for staging. - All three tools now execute inside the sandbox.
lookup_docmigrated from a host-side BM25 retriever topython3 /opt/forth-lang/lookup_docs.py <query>in the sandbox;bm25sdropped from host deps. - Team-registry Docker image
team-clyvldofb0000gg1kx39rgzjq/forth-lang:v2is now the only supported runtime image. Docs index is fetched fresh from the gforth manual at image-build time and baked into/opt/forth-lang/; no in-sandbox install or upload at rollout setup. - +33 new tasks added (4 buckets, filtered through a 4-pass solver pipeline with strict 0/3 cascade). Per-tier distribution of additions: T1×1, T2×2, T3×2, T4×6, T6×1, T7×20, T8×1.
- New T8 ceiling-probe tier. Tasks where all four solver passes scored 0/3 but the reference solution still passes its own tests are now kept and labeled T8 (rather than dropped).
stack-reverse-no-memoryis the first T8 entry — solvable via aroll-based return-stack idiom, but neither gpt-4.1-mini, gpt-5-mini, gpt-5.4, nor glm-5.1 produced a single passing rollout at up to 100 turns. - Solver-filter pipeline extended to a 4th stage (
run_filter.py fourth) and_empirical_tiernow produces T7/T8 empirically (alongside any hand-curatedt7_curatedtasks). All cascade steps moved to strict 0/3 (a task only escalates when the previous solver scored 0 across all 3 rollouts). - Sandbox controls:
labels=arg now reachable viaload_environmentsoprime sandbox delete -l <label>can scope cleanups;sandbox_nameis set internally for admin-panel display. - Scripts hygiene:
aggregate_v2.py → aggregate.py,reverify_v2_raw.py → reverify.py;sys.path.inserthacks removed;DEFAULT_ENDPOINTShardcode dropped; sharedverify_task,read_jsonl,GROUP_TO_CATEGORYextracted into the env package. - Gforth helpers consolidated into
forth_lang/gforth.py(format_literal,build_forth_line,parse_stack), shared by the sandbox runner and the offline subprocess runner. - Loader hygiene: taskset loading logic moved from
tasks/__init__.pyintotasks/loader.py;__init__.pyis a pure re-export. - Misc bug fixes caught by Cursor's bugbot:
NameErrorinupload_to_hf.pyempty-rows guard;KeyError: 'scenario'inrun_filter.pyjoin logic (now joins onword_to_call); README-implementation drift on tier range andmax_submissions; dead code inbuild_docs_cache.py.
v0.1.0
- Created environment.