0

Graphwalks RLM RL Env (Community)

Fresh

GraphWalks graph traversal evaluation environment using RLM with Python REPL

Type
RL Env
License
apache-2.0
Published
Apr 2026

Cite

Notes

Only stored in your browser.

graphwalks-rlm

Overview

  • Environment ID: graphwalks-rlm
  • Short description: GraphWalks graph traversal benchmark using RLM (Recursive Language Model) with Python REPL
  • Tags: long-context, rlm, python, multi-turn, repl, graph

How It Works

This environment implements the GraphWalks benchmark for evaluating graph traversal capabilities using the RLMEnv.

The model receives the task instructions as a prompt and the full graph (edge list + operation question) as context in the REPL filesystem. It must parse the graph, implement the required algorithm, and verify correctness using the REPL.

The prompt is split at the "Here is the graph to operate on" separator:

  • Prompt: Everything before the separator (general instructions and examples)
  • Context: Everything from the separator onward (graph edge list + specific operation question)

Dataset

  • openai/graphwalks — 1,150 samples in the training split
  • Columns: prompt, answer_nodes, prompt_chars (2.6k–1.75M), problem_type, date_added

Quickstart

# Basic evaluation (exact scoring)
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5

# F1 scoring
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"scoring": "f1"}'

# Filter by problem type
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"problem_type": "parents"}'

# Only prompts with >1M characters
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"prompt_chars_filter": ">1000000"}'

# Only prompts with <10k characters
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"prompt_chars_filter": "<10000"}'

# Only prompts between 128k and 256k characters (inclusive range)
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"prompt_chars_filter": "128000-256000"}'

# Combine filters: BFS problems with >100k chars
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"problem_type": "BFS", "prompt_chars_filter": ">100000"}'

# With environment tips and shuffling
uv run vf-eval graphwalks-rlm -m gpt-5-mini -n 5 -a '{"include_env_tips": true, "shuffle": true}'

Environment Arguments

ArgTypeDefaultDescription
splitstr"train"Dataset split
scoringstr"exact"Scoring mode: "exact" (set equality) or "f1" (set overlap)
prompt_chars_filterstr | NoneNoneFilter on prompt_chars column using comparison operators (">", "<", ">=", "<=", "==") or an inclusive range ("128000-256000"). Example: ">1000000" for entries with >1M prompt chars
problem_typestr | NoneNoneFilter by problem_type (e.g., "parents", "BFS")
shuffleboolFalseWhether to shuffle the dataset
seedint | NoneNoneRandom seed for shuffling
max_examplesint | NoneNoneMaximum number of examples to load (None = all)
include_env_tipsboolFalseInclude strategy tips in prompt
prompt_in_context_fileboolFalseIf True, both query and context go in a structured dict in the context file ({"query": prompt, "context": graph})
repl_languageLiteral["bash", "python"]"bash"REPL language for the RLM
max_turnsint30Maximum REPL iterations
sub_llm_max_turnsint5Max tool-calling turns for each sub-LLM call
sub_modelstr | NoneNoneModel for sub-LLM calls (defaults to same as root model)
max_sub_llm_parallelismint5Max concurrent sub-LLM calls
max_output_lengthint8192Maximum code execution output length
code_execution_timeoutint120Timeout in seconds for code execution
abort_on_code_timeoutboolFalseIf True, abort rollout on code timeout; if False, return error to model
max_startup_wait_secondsint120Max seconds to wait for sandbox startup
pip_install_packagesstr""Packages to install in sandbox
sandbox_docker_imagestr"python:3.11-slim"Docker image for sandbox
sandbox_cpu_coresint1CPU cores for sandbox
sandbox_memory_gbint2Memory in GB for sandbox
sandbox_disk_size_gbint5Disk size in GB for sandbox
sandbox_gpu_countint0Number of GPUs for sandbox
sandbox_timeout_minutesint60Overall sandbox lifetime in minutes

Metrics

Scoring uses the answer extraction from the original GraphWalks environment:

  • Exact (scoring="exact"): 1.0 if predicted node set exactly matches truth set, 0.0 otherwise.
  • F1 (scoring="f1"): Harmonic mean of precision and recall based on set overlap of predicted vs truth nodes.

The model's final answer is expected in the format: Final Answer: [node1, node2, node3]

Changelog

  • 0.1.0: Initial RLM version with prompt splitting, prompt_chars_filter, problem_type filter, exact/F1 scoring