context-tools

Sandboxed Python-REPL harness for training models to manage their own context across turns. The current default data mix combines adaptive-cursor ledger tasks with realistic corpus-trail research synthesis, so under context_rewrite=True appending raw observations to context_window fails and compact state management is the reliable path.

How it works

Each rollout gets its own Prime Sandbox running a long-lived Python worker subprocess (reused from RLMEnv). The worker keeps a persistent namespace dict across calls; it is plain Python, with no Jupyter kernel and no ipykernel. It runs each call through ast.parse followed by exec/eval, so a trailing expression's repr lands in the result dict (similar to IPython's Out[N]).
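The parse-then-exec/eval pattern described above can be sketched as follows. This is a minimal illustration, not the actual worker: the function name run_cell and the result keys "repr" and "error" are assumptions made for the example.

```python
import ast


def run_cell(code: str, namespace: dict) -> dict:
    """Run code in a persistent namespace (hypothetical helper, not the real worker).

    If the last statement is a bare expression, it is evaluated separately
    so its repr can be returned, similar to IPython's Out[N].
    """
    result = {"repr": None, "error": None}
    try:
        tree = ast.parse(code, mode="exec")
        last_expr = None
        if tree.body and isinstance(tree.body[-1], ast.Expr):
            # Split off the trailing expression so it can be eval'd.
            last_expr = tree.body.pop()
        # Execute every statement before the trailing expression.
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last_expr is not None:
            expr = ast.Expression(body=last_expr.value)
            value = eval(compile(expr, "<cell>", "eval"), namespace)
            if value is not None:
                result["repr"] = repr(value)
    except Exception as exc:
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


# The namespace dict persists across calls, as in the worker.
namespace = {}
run_cell("x = 20", namespace)
out = run_cell("y = x + 1\ny * 2", namespace)
print(out["repr"])  # 42
```

Because the same dict is passed to every call, variables defined in one call remain available in the next, which is what lets the model keep state in Python rather than in the prompt.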

The model has one tool — call_python_repl(code: str) — and a per-rollout namespace pre-seeded with:

The world's tool functions (e.g. observe / get_entity / look / read_event / etc.) bound to that rollout's hidden state.
submit_answer(value) — terminates the rollout with a final answer.
(context_rewrite=True only) context_window: list — the model-owned persistent memory across turns. The task text is shown separately; every context_window slot is hard-truncated under the same cap.

The toggle context_rewrite selects the prompting flavor:

context_rewrite=True (default). Fresh [system, user] every turn; the user message renders the static task text plus the hard-truncated current context_window (plus the previous turn's code if it errored). The trajectory is never visible to the model.
context_rewrite=False. Standard tool-calling flow: the model sees the full conversation history, and each call_python_repl call returns the truncated execution output as a normal tool message.

Files

context_tools.py    ContextToolsEnv (subclasses RLMEnv) + load_environment
taskset.py          ContextToolsTaskSet (data wrapper)
generators/         deterministic, solver-verified data pipeline
scripts/            data-generation CLIs, including build_context_mix.py
my_data/            default mixed train/eval JSONL files plus per-family splits

Task families

Every family has a single, programmatically verifiable submitted answer per rollout.

Family	Tools	Scratchpad mechanic	Question shape
`rule_hunt`	`get_entity`, `test`	edit a hypothesis as evidence narrows	submit a parse-tree rule
`corpus_dive`	`list_keys`, `read_node`	prune mostly-noise observations	count/sum under a subtree path
`timeline_track`	`read_event`, `read_events`	overwrite mutating fixed-schema state	owner/count at time T
`detective`	`get_entity`, `query_attribute`	shrink a candidate set via elimination	unique entity satisfying all conjunctive constraints
`maze_walk`	`look`, `move`	push/pop discipline (path advance + backtrack)	navigate to goal, submit goal's secret
`adaptive_cursor`	`observe`	choose what returned cursor-page content to preserve	checkpoint ledger audit rows
`corpus_trail`	`search_docs`, `read_doc`	retain durable source-tagged facts across a noisy research DAG	structured project risk brief with evidence ids

The default train/eval mix is 60% adaptive_cursor and 40% corpus_trail, calibrated for from-scratch training with zero-gradient filtering. Within adaptive_cursor, difficulties d0/d1/d2/d3/d4 are sampled at 12/23/35/22/8 percent, so early training has a broad easy on-ramp before harder ledger-update tasks enter. Within corpus_trail, d0/d1/d2/d3/d4 are sampled at 20/35/30/12/3 percent; d0/d1 are the search/read bridge, while d2+ provide the main source-synthesis frontier.

There is no small, artificially imposed per-turn tool-call limit. Under context_rewrite=True, observe(handle), search_docs(...), and read_doc(...) are ordinary Python functions; the only state carried into the next prompt is the hard-truncated render of whatever the model itself placed in context_window.
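As a worked example of the mix percentages, the following sketch computes how many rows each family and difficulty level would receive in an 8,000-row training file. It assumes the percentages apply exactly; the real build script may round or sample differently.

```python
# Family shares and within-family difficulty shares, in percent (from the text above).
FAMILY_MIX = {"adaptive_cursor": 60, "corpus_trail": 40}
DIFFICULTY_MIX = {
    "adaptive_cursor": [12, 23, 35, 22, 8],
    "corpus_trail": [20, 35, 30, 12, 3],
}


def expected_rows(total_rows: int) -> dict:
    """Return expected row counts keyed by (family, difficulty label)."""
    counts = {}
    for family, family_pct in FAMILY_MIX.items():
        for level, level_pct in enumerate(DIFFICULTY_MIX[family]):
            # Two percentages multiplied together, hence the division by 100 * 100.
            counts[(family, f"d{level}")] = total_rows * family_pct * level_pct // 10_000
    return counts


counts = expected_rows(8000)
for (family, level), rows in counts.items():
    print(f"{family:16s} {level}  {rows}")
```

Under these assumptions adaptive_cursor gets 4,800 rows (1,680 of them at d2) and corpus_trail gets 3,200 rows (96 of them at d4).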

corpus_trail is an answer-first research family. Each example samples a final JSON brief, constructs a hidden evidence DAG with reusable facts such as aliases and policy rules, renders that DAG into verbose source documents plus distractors, and seeds the REPL with a long briefing_note. search_docs(...) returns locator-only ids plus non-evidentiary snippets; it intentionally omits titles, dates, source kinds, and answer-bearing text. Retrieval is controlled by hidden search terms rather than rendered titles/body text, so d0/d1 can expose a readable "read source, keep key, search next key" chain without reintroducing search-result leakage. The model must call read_doc(...) on source ids it relies on and keep compact notes because raw gold documents are several times larger than the per-example context cap. Unlike adaptive-cursor, corpus-trail uses final exact JSON correctness only; there is no partial process reward for this family.

Legacy families are solver-verified at generation time. corpus_trail is answer-first instead: the generator builds the answer and hidden evidence DAG first, then renders only the public documents; builder smoke checks verify that every gold evidence document is reachable through the exposed search terms and that compact gold notes fit while raw evidence overflows the per-example cap.
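The two builder smoke checks can be pictured with a small sketch. Everything here is hypothetical: the document schema (retrieved_by, reveals, text), the function names, and the toy data are invented for illustration and do not reflect the generator's real data structures.

```python
from collections import deque


def reachable_docs(docs: dict, seed_terms: set) -> set:
    """Breadth-first search over search terms: which docs can a reader reach?

    Assumed schema: docs[doc_id] = {"retrieved_by": terms that return the doc,
    "reveals": terms learned by reading it, "text": raw document body}.
    """
    known_terms = set(seed_terms)
    queue = deque(known_terms)
    read = set()
    while queue:
        term = queue.popleft()
        for doc_id, doc in docs.items():
            if doc_id in read or term not in doc["retrieved_by"]:
                continue
            read.add(doc_id)
            for new_term in doc["reveals"]:
                if new_term not in known_terms:
                    known_terms.add(new_term)
                    queue.append(new_term)
    return read


def passes_smoke_check(docs, gold_ids, seed_terms, compact_notes: str, cap: int) -> bool:
    """Gold docs must be reachable; compact notes must fit while raw text overflows."""
    raw_chars = sum(len(docs[doc_id]["text"]) for doc_id in gold_ids)
    all_reachable = set(gold_ids) <= reachable_docs(docs, seed_terms)
    return all_reachable and len(compact_notes) <= cap < raw_chars


# Toy data: doc_a is found from the briefing term and reveals the key for doc_b.
docs = {
    "doc_a": {"retrieved_by": {"project-x"}, "reveals": {"alias-7"}, "text": "a" * 900},
    "doc_b": {"retrieved_by": {"alias-7"}, "reveals": set(), "text": "b" * 900},
    "doc_c": {"retrieved_by": {"unrelated"}, "reveals": set(), "text": "c" * 900},
}
ok = passes_smoke_check(docs, ["doc_a", "doc_b"], {"project-x"}, "alias-7; risk=high", cap=400)
print(ok)  # True
```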

Quickstart

prime env install context-tools

# Re-generate the default mixed train/eval sets
python scripts/build_context_mix.py

# Smoke-test eval
prime eval run context-tools -m gpt-4.1-mini -n 5 -r 1

PRIME_API_KEY (for sandbox provisioning) is read automatically from ~/.prime/config.json if it is not set in the environment. You need to export the provider key (e.g. OPENAI_API_KEY) yourself.

Environment arguments

Arg	Type	Default	Description
`dataset_path`	str	`my_data/train_context_mix.jsonl`	Training JSONL (8,000 rows: 60% adaptive_cursor, 40% corpus_trail)
`eval_path`	str	`my_data/eval_context_mix.jsonl`	Held-out eval JSONL (800 rows with the same default mix)
`context_rewrite`	bool	`True`	True: model curates `context_window`. False: standard tool-calling flow.
`max_turns`	int	15	Max rollout turns
`max_context_chars`	int	400	Display cap on rendered model-curated `context_window` slots (cr=True) / per-tool-response cap (cr=False)
`max_code_display_chars`	int	4000	Display cap on echoed previous code (cr=True only)
`show_previous_code`	bool	`False`	(cr=True only) If True, echo prior code every turn; default echoes only on error
`tool_call_budget_per_turn`	int	1000000	No small manufactured tool-call cap by default; adaptive-cursor progress is gated by semantic route choices in ordinary returned page strings
`sandbox_docker_image`	str	`python:3.11-slim`	Sandbox image
`code_execution_timeout`	int	120	Per-turn code timeout (seconds)
`sandbox_cpu_cores`	int	1	CPU cores allocated to the sandbox
`sandbox_memory_gb`	int	2	Memory allocated to the sandbox (GB)
`sandbox_timeout_minutes`	int	30	Hard sandbox lifetime cap
`retain_filesystem_after_rollout`	bool	`False`	Keep `/rlm_fs` for post-mortem

Rewards

Reward	Weight	Definition
`task_reward`	1.0	Capped at 1.0. An exact submitted answer gets 1.0. For adaptive-cursor misses, partial credit is terminal-gated: before the correct terminal page is observed, the reward is 0; once it has been observed, partial credit is `0.05 * complete_valid_submit + 0.10 * checkpoint_ids_in_order + 0.85 * submitted_checkpoint_row_fraction`.
`correctness_reward`	0.0	Exact-answer metric only.
`checkpoint_row_submit_fraction`	0.0	Metric: exact gold checkpoint rows present in the submitted answer.
`checkpoint_row_context_fraction`	0.0	Metric: exact gold checkpoint rows visible in the final hard-truncated `context_window`.
`valid_checkpoint_submit`	0.0	Metric: submitted answer is a non-empty list of 4-field rows.
`complete_checkpoint_submit`	0.0	Metric: submitted answer has one valid row per expected checkpoint.
`checkpoint_ids_in_order`	0.0	Metric: submitted rows use the expected checkpoint ids in order.
`adaptive_terminal_reached`	0.0	Metric: the correct adaptive-cursor terminal page was observed.
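The task_reward rule for adaptive-cursor rollouts can be written out as a short function. This is a sketch of the formula in the table, assuming complete_valid_submit and checkpoint_ids_in_order are 0/1 indicators and submitted_checkpoint_row_fraction lies in [0, 1]; the function name and signature are illustrative, not the environment's actual API.

```python
def adaptive_cursor_task_reward(
    exact_match: bool,
    terminal_reached: bool,
    complete_valid_submit: bool,
    checkpoint_ids_in_order: bool,
    submitted_checkpoint_row_fraction: float,
) -> float:
    """Hypothetical re-implementation of the terminal-gated partial credit."""
    if exact_match:
        return 1.0
    if not terminal_reached:
        # Terminal gate: no partial credit before the correct terminal page is observed.
        return 0.0
    partial = (
        0.05 * float(complete_valid_submit)
        + 0.10 * float(checkpoint_ids_in_order)
        + 0.85 * submitted_checkpoint_row_fraction
    )
    return min(1.0, partial)


# A miss after reaching the terminal page, with half of the gold rows correct:
# 0.05 + 0.10 + 0.85 * 0.5 = 0.575
reward = adaptive_cursor_task_reward(False, True, True, True, 0.5)
print(round(reward, 3))  # 0.575
```

Since the row fraction carries 0.85 of the weight, most of the partial credit depends on preserving exact checkpoint rows rather than on submitting a well-formed but wrong answer.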

Additional diagnostic metrics include append/edit counts, dynamic overwrite/remove counts, final manifest character count, truncation count, and turn efficiency.

Data generation invariants

100% synthetic, 100% verifiable: ground truth is a deterministic function of the generated state.
100% solvable from observations: legacy ground truth is recomputed independently before each example is emitted; answer-first corpus tasks are emitted only when their gold evidence docs are reachable through the public search/read surface.
Single answer per task: every example terminates with one submit_answer(...) call.
Difficulty stratified: default training set is stratified across difficulties 0-4 for the adaptive-cursor template, with the d1/d2-lite bridge emphasized.