graphwalks-rlm

Overview

Environment ID: graphwalks-rlm
Short description: GraphWalks graph traversal benchmark using RLM (Recursive Language Model) with Python REPL
Tags: long-context, rlm, python, multi-turn, repl, graph

How It Works

This environment implements the GraphWalks benchmark for evaluating graph traversal capabilities using the RLMEnv.

The model receives the task instructions as a prompt and the full graph (edge list + operation question) as context in the REPL filesystem. It must parse the graph, implement the required algorithm, and verify correctness using the REPL.

The prompt is split at the "Here is the graph to operate on" separator:

Prompt: Everything before the separator (general instructions and examples)
Context: Everything from the separator onward (graph edge list + specific operation question)

Dataset

openai/graphwalks — 1,150 samples in the training split
Columns: prompt, answer_nodes, prompt_chars (2.6k–1.75M), problem_type, date_added

Quickstart

# Basic evaluation (exact scoring)
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5

# F1 scoring
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"scoring": "f1"}'

# Filter by problem type
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"problem_type": "parents"}'

# Only prompts with >1M characters
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"prompt_chars_filter": ">1000000"}'

# Only prompts with <10k characters
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"prompt_chars_filter": "<10000"}'

# Only prompts between 128k and 256k characters (inclusive range)
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"prompt_chars_filter": "128000-256000"}'

# Combine filters: BFS problems with >100k chars
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"problem_type": "BFS", "prompt_chars_filter": ">100000"}'

# With environment tips and shuffling
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"include_env_tips": true, "shuffle": true}'

Environment Arguments

Arg	Type	Default	Description
`split`	str	`"train"`	Dataset split
`scoring`	str	`"exact"`	Scoring mode: `"exact"` (set equality) or `"f1"` (set overlap)
`prompt_chars_filter`	str \| None	`None`	Filter on `prompt_chars` column using comparison operators (`">"`, `"<"`, `">="`, `"<="`, `"=="`) or an inclusive range (`"128000-256000"`). Example: `">1000000"` for entries with >1M prompt chars
`problem_type`	str \| None	`None`	Filter by `problem_type` (e.g., `"parents"`, `"BFS"`)
`shuffle`	bool	`False`	Whether to shuffle the dataset
`seed`	int \| None	`None`	Random seed for shuffling
`max_examples`	int \| None	`None`	Maximum number of examples to load (None = all)
`include_env_tips`	bool	`False`	Include strategy tips in prompt
`prompt_in_context_file`	bool	`False`	If `True`, both query and context go in a structured dict in the context file (`{"query": prompt, "context": graph}`)
`repl_language`	Literal["bash", "python"]	`"bash"`	REPL language for the RLM
`max_turns`	int	`30`	Maximum REPL iterations
`sub_llm_max_turns`	int	`5`	Max tool-calling turns for each sub-LLM call
`sub_model`	str \| None	`None`	Model for sub-LLM calls (defaults to same as root model)
`max_sub_llm_parallelism`	int	`5`	Max concurrent sub-LLM calls
`max_output_length`	int	`8192`	Maximum code execution output length
`code_execution_timeout`	int	`120`	Timeout in seconds for code execution
`abort_on_code_timeout`	bool	`False`	If True, abort rollout on code timeout; if False, return error to model
`max_startup_wait_seconds`	int	`120`	Max seconds to wait for sandbox startup
`pip_install_packages`	str	`""`	Packages to install in sandbox
`sandbox_docker_image`	str	`"python:3.11-slim"`	Docker image for sandbox
`sandbox_cpu_cores`	int	`1`	CPU cores for sandbox
`sandbox_memory_gb`	int	`2`	Memory in GB for sandbox
`sandbox_disk_size_gb`	int	`5`	Disk size in GB for sandbox
`sandbox_gpu_count`	int	`0`	Number of GPUs for sandbox
`sandbox_timeout_minutes`	int	`60`	Overall sandbox lifetime in minutes

Metrics

Scoring uses the answer extraction from the original GraphWalks environment:

Exact (scoring="exact"): 1.0 if predicted node set exactly matches truth set, 0.0 otherwise.
F1 (scoring="f1"): Harmonic mean of precision and recall based on set overlap of predicted vs truth nodes.

The model's final answer is expected in the format: Final Answer: [node1, node2, node3]

Changelog

0.1.0: Initial RLM version with prompt splitting, prompt_chars_filter, problem_type filter, exact/F1 scoring