0

UUID CTF RL Env (Community)

Fresh

RLM sandbox CTF environment for recovering a derived UUID token from hidden shard identifiers.

Type
RL Env
License
apache-2.0
Published
May 2026

Cite

Notes

Only stored in your browser.

rlm-uuid-ctf

RLM sandbox environment where the agent recovers a derived UUID from a generated incident archive.

Task

Each rollout creates /workspace/corpus, a deterministic collection of realistic support, ops, audit, and export files. The corpus contains many decoy UUIDs plus exactly five shard UUIDs relevant to the target incident. The final answer is not present in the corpus.

The agent must:

  1. Identify the target incident and tenant.
  2. Find the five shard UUIDs tied to that incident.
  3. Canonicalize mixed UUID encodings.
  4. Order shards by observed_at timestamp ascending.
  5. Compute:
uuid.UUID(bytes=hashlib.sha256(
    b"ctf-shard-v1\n" + b"".join(uuid.UUID(u).bytes for u in ordered_shards)
).digest()[:16])
  1. Write /workspace/answer.json:
{"result_uuid": "...", "source_uuids": ["...", "...", "...", "...", "..."], "evidence_paths": ["..."]}

Quickstart

# From research-environments root
uv pip install -e ./environments/rlm_uuid_ctf

# One deterministic task, eight rollouts
uv run vf-eval rlm-uuid-ctf -m openai/gpt-5-mini -n1 -r8 -d -A

# Curriculum calibration against a smaller model
uv run vf-eval rlm-uuid-ctf -p prime -m qwen/qwen3-8b -n1 -r5 -d -A \
  -a '{"difficulty":"extra_easy","max_turns":60}'

Environment Arguments

ArgumentDefaultDescription
num_samples1Number of generated tasks.
seed314159Base seed; sample i uses seed + i.
difficultyNoneOptional preset: standard, easy, extra_easy, or super_easy. Presets override num_noise_files, decoys_per_file, and primary evidence decoy counts.
num_noise_files180Number of unrelated files with decoy UUIDs.
decoys_per_file4UUID-bearing records per noise file.
primary_decoy_countNoneOptional custom primary evidence decoys per evidence file when difficulty is unset.
sandbox_image"python:3.11-slim"Sandbox Docker image.
max_turns80Environment-side rollout turn cap; also defaults RLM's turn cap.
timeout_seconds900.0Per-rollout wall-clock cap.
rlm_max_tool_output_chars20000Caps each RLM tool output.
include_rlm_metricsfalseInclude RLM harness metrics in eval output. Disabled by default because failed model calls can omit some RLM metric keys.
sandbox_client_max_workers32Explicit sandbox client worker cap.
**rlm_kwargs-Forwarded to rlm_harness.

Difficulty Presets

PresetNoise filesNoise UUID records/filePrimary decoysGuidance
standard180412,18,16,20No direct index.
easy40112,18,16,20No direct index; this matches the first calibration task shape.
extra_easy1613,4,3,4Adds /workspace/corpus/ops/recovery_index.json with evidence paths, encodings, and shard order plus a source-extractor helper.
super_easy401,1,1,1Adds /workspace/corpus/ops/recovery_manifest.json with the five canonical source UUIDs in order plus an answer-writer helper script.

Changelog

v0.2.0

  • Added curriculum difficulty presets for easier small-model calibration.

v0.1.0

  • Initial single-task UUID shard recovery CTF environment.