0

Context Tools RL Env (Prime)

Fresh

Sandboxed Python-REPL harness for training models to manage their own context window across turns.

Type
RL Env
Publisher
Prime
Runtime
multi-turn
License
unknown
Size
v0.3.27
Published
May 2026

Cite

Notes

Only stored in your browser.

context-tools

Sandboxed Python-REPL harness for training models to manage their own context across turns. The current default data mix combines adaptive-cursor ledger tasks with realistic corpus-trail research synthesis, so raw appending fails under context_rewrite=True and compact state management is the reliable path.

How it works

Each rollout gets a per-rollout Prime Sandbox running a long-lived Python worker subprocess (re-used from RLMEnv). The worker keeps a persistent namespace dict across calls; it's plain Python — no Jupyter kernel, no ipykernel. The worker uses ast.parse → exec/eval so a trailing expression's repr lands in the result dict (similar to IPython's Out[N]).

The model has one tool — call_python_repl(code: str) — and a per-rollout namespace pre-seeded with:

  • The world's tool functions (e.g. observe / get_entity / look / read_event / etc.) bound to that rollout's hidden state.
  • submit_answer(value) — terminates the rollout with a final answer.
  • (context_rewrite=True only) context_window: list — the model-owned persistent memory across turns. The task text is shown separately; every context_window slot is hard-truncated under the same cap.

The toggle context_rewrite selects the prompting flavor:

  • context_rewrite=True (default). Fresh [system, user] every turn; the user message renders the static task text plus the hard-truncated current context_window (plus the previous turn's code if it errored). The trajectory is never visible to the model.
  • context_rewrite=False. Standard tool-calling flow — the model sees the full conversation history, each call_python_repl returns the truncated execution output as a normal tool message.

Files

context_tools.py    ContextToolsEnv (subclasses RLMEnv) + load_environment
taskset.py          ContextToolsTaskSet (data wrapper)
generators/         deterministic, solver-verified data pipeline
scripts/            data-generation CLIs, including build_context_mix.py
my_data/            default mixed train/eval JSONL files plus per-family splits

Task families

All have a single submitted answer per rollout, programmatically verifiable.

FamilyToolsScratchpad mechanicQuestion shape
rule_huntget_entity, testedit a hypothesis as evidence narrowssubmit a parse-tree rule
corpus_divelist_keys, read_nodeprune mostly-noise observationscount/sum under a subtree path
timeline_trackread_event, read_eventsoverwrite mutating fixed-schema stateowner/count at time T
detectiveget_entity, query_attributeshrink a candidate set via eliminationunique entity satisfying all conjunctive constraints
maze_walklook, movepush/pop discipline (path advance + backtrack)navigate to goal, submit goal's secret
adaptive_cursorobservechoose what returned cursor-page content to preservecheckpoint ledger audit rows
corpus_trailsearch_docs, read_docretain durable source-tagged facts across a noisy research DAGstructured project risk brief with evidence ids

The default train/eval mix is 60% adaptive_cursor and 40% corpus_trail, calibrated for from-scratch training with zero-gradient filtering. adaptive_cursor uses d0/d1/d2/d3/d4 at 12/23/35/22/8 within the family so early training has a broad easy on-ramp before harder ledger-update tasks enter. corpus_trail uses d0/d1/d2/d3/d4 at 20/35/30/12/3 within the family; d0/d1 are the search/read bridge, while d2+ provide the main source-synthesis frontier. There is no small manufactured per-turn tool-call limit. In context_rewrite=True, observe(handle), search_docs(...), and read_doc(...) are ordinary Python functions; the next prompt is only the hard-truncated render of whatever the model itself placed in context_window.

corpus_trail is an answer-first research family. Each example samples a final JSON brief, constructs a hidden evidence DAG with reusable facts such as aliases and policy rules, renders that DAG into verbose source documents plus distractors, and seeds the REPL with a long briefing_note. search_docs(...) returns locator-only ids plus non-evidentiary snippets; it intentionally omits titles, dates, source kinds, and answer-bearing text. Retrieval is controlled by hidden search terms rather than rendered titles/body text, so d0/d1 can expose a readable "read source, keep key, search next key" chain without reintroducing search-result leakage. The model must call read_doc(...) on source ids it relies on and keep compact notes because raw gold documents are several times larger than the per-example context cap. Unlike adaptive-cursor, corpus-trail uses final exact JSON correctness only; there is no partial process reward for this family.

Legacy families are solver-verified at generation time. corpus_trail is answer-first instead: the generator builds the answer and hidden evidence DAG first, then renders only the public documents; builder smoke checks verify that every gold evidence document is reachable through the exposed search terms and that compact gold notes fit while raw evidence overflows the per-example cap.

Quickstart

prime env install context-tools

# Re-generate the default mixed train/eval sets
python scripts/build_context_mix.py

# Smoke-test eval
prime eval run context-tools -m gpt-4.1-mini -n 5 -r 1

PRIME_API_KEY (for sandbox provisioning) is read automatically from ~/.prime/config.json if not in the env. Provider key (e.g. OPENAI_API_KEY) you'll need to export yourself.

Environment arguments

ArgTypeDefaultDescription
dataset_pathstrmy_data/train_context_mix.jsonlTraining JSONL (8,000 rows: 60% adaptive_cursor, 40% corpus_trail)
eval_pathstrmy_data/eval_context_mix.jsonlHeld-out eval JSONL (800 rows with the same default mix)
context_rewriteboolTrueTrue: model curates context_window. False: standard tool-calling flow.
max_turnsint15Max rollout turns
max_context_charsint400Display cap on rendered model-curated context_window slots (cr=True) / per-tool-response cap (cr=False)
max_code_display_charsint4000Display cap on echoed previous code (cr=True only)
show_previous_codeboolFalse(cr=True only) If True, echo prior code every turn; default echoes only on error
tool_call_budget_per_turnint1000000No small manufactured tool-call cap by default; adaptive-cursor progress is gated by semantic route choices in ordinary returned page strings
sandbox_docker_imagestrpython:3.11-slimSandbox image
code_execution_timeoutint120Per-turn code timeout (seconds)
sandbox_cpu_coresint1
sandbox_memory_gbint2
sandbox_timeout_minutesint30Hard sandbox lifetime cap
retain_filesystem_after_rolloutboolFalseKeep /rlm_fs for post-mortem

Rewards

RewardWeightDefinition
task_reward1.0Capped at 1.0. Exact submitted answer gets 1.0. For adaptive-cursor misses, partial credit is terminal-gated: before the correct terminal page is observed, reward is 0. After terminal, partial credit is 0.05 * complete_valid_submit + 0.10 * checkpoint_ids_in_order + 0.85 * submitted_checkpoint_row_fraction.
correctness_reward0.0Exact-answer metric only.
checkpoint_row_submit_fraction0.0Metric: exact gold checkpoint rows present in the submitted answer.
checkpoint_row_context_fraction0.0Metric: exact gold checkpoint rows visible in the final hard-truncated context_window.
valid_checkpoint_submit0.0Metric: submitted answer is a non-empty list of 4-field rows.
complete_checkpoint_submit0.0Metric: submitted answer has one valid row per expected checkpoint.
checkpoint_ids_in_order0.0Metric: submitted rows use the expected checkpoint ids in order.
adaptive_terminal_reached0.0Metric: the correct adaptive-cursor terminal page was observed.

Additional diagnostic metrics include append/edit counts, dynamic overwrite/remove counts, final manifest character count, truncation count, and turn efficiency.

Data generation invariants

  • 100% synthetic, 100% verifiable: ground truth is a deterministic function of the generated state.
  • 100% solvable from observations: legacy ground truth is recomputed independently before each example is emitted; answer-first corpus tasks are emitted only when their gold evidence docs are reachable through the public search/read surface.
  • Single answer per task: every example terminates with one submit_answer(...) call.
  • Difficulty stratified: default training set is stratified across difficulties 0-4 for the adaptive-cursor template, with the d1/d2-lite bridge emphasized.