CL-bench (RLM version)

Overview

Environment ID: clbench-rlm
Short description: Minimal CL-bench RLM environment with strict rubric-based LLM-as-judge scoring.
Tags: in-context-learning, long-context, eval

Dataset

Primary dataset: tencent/CL-bench
Source links: HuggingFace, GitHub
Notes: The license on the dataset only allows the usage for evaluation, not training.

Quickstart

# Context offloaded to file (default)
# Environment: Python REPL (repl_language: "python"); include_env_tips adds prompt hint for llm_batch() sub-agent use
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -s -n 100 -r 1 -a '{"repl_language": "python", "include_env_tips": true}'

# Full content loaded in model prompt
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"include_content_in_context": true}'

# Use bash REPL instead of python
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"repl_language": "bash"}'

# Filter by category (use valid pairs from table below)
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -s -n 200 -r 1 -a '{"context_category": "Rule System Application", "sub_category": "Legal & Regulatory", "repl_language": "python", "include_env_tips": true}'

# Filter by multiple sub-categories within same context
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"context_category": "Rule System Application", "sub_category": ["Game Mechanics", "Legal & Regulatory"]}'

Context modes

Each CL-bench example contains a system prompt and a long content trajectory (user/assistant turns). The include_content_in_context flag controls how this content is presented to the model:

false (default) — Only the system prompt is loaded into the model's prompt. The full content trajectory is written to context.txt in the sandbox working directory. A blurb instructs the model to read the file via the REPL and use sub-agents to analyze it. This tests the model's ability to work with offloaded long-context through tool use.
true — The full trajectory (system prompt + all content turns) is loaded directly into the model's prompt. The content is also written to context.txt so the model can search and re-read it via the REPL. A note tells the model to leverage sub-agents and the REPL to verify its answers.

In both modes the content is always available as context.txt in the sandbox.

Environment Arguments

Arg	Type	Default	Description
`judge_model`	str or null	`"openai/gpt-5.2"`	Judge model
`judge_base_url`	str or null	`"https://api.pinference.ai/api/v1"`	OpenAI-compatible base URL (Prime Intellect)
`judge_api_key_var`	str or null	`null`	Env var used for judge API key (defaults to `PRIME_API_KEY`)
`include_content_in_context`	bool	`false`	If true, load full content trajectory into the model prompt; if false, offload to `context.txt` (see above)
`include_env_tips`	bool	`false`	Appends a small `<env_tips>` block encouraging `llm_batch()` sub-agent delegation
`context_category`	str or list[str] or null	`null`	Filter examples by metadata `context_category`; pass a string or list of strings to match
`sub_category`	str or list[str] or null	`null`	Filter examples by metadata `sub_category`; pass a string or list of strings to match
`repl_language`	str	`"python"`	Sandbox REPL language (`"python"` or `"bash"`)
`max_turns`	int	`30`	Max root-agent turns
`sub_llm_max_turns`	int	`5`	Max turns per sub-agent (`llm_batch`)
`sub_model`	str or null	`null`	Optional model override for sub-agents
`max_sub_llm_parallelism`	int	`5`	Max concurrent sub-agent calls
`max_output_length`	int	`8192`	Max REPL output length
`code_execution_timeout`	int	`120`	Timeout (s) for REPL execution
`abort_on_code_timeout`	bool	`false`	Abort rollout on execution timeout
`max_startup_wait_seconds`	int	`120`	Max sandbox startup wait
`pip_install_packages`	str	`""`	Extra pip packages for sandbox
`sandbox_docker_image`	str	`"python:3.11-slim"`	Sandbox image
`sandbox_cpu_cores`	int	`1`	Sandbox CPU cores
`sandbox_memory_gb`	int	`2`	Sandbox memory
`sandbox_disk_size_gb`	int	`5`	Sandbox disk
`sandbox_gpu_count`	int	`0`	Sandbox GPUs
`sandbox_timeout_minutes`	int	`60`	Sandbox lifetime
`**kwargs`	Any	-	Additional args forwarded to `RLMEnv`

Valid categories

Only certain context_category / sub_category pairs exist in the dataset. An error is raised if you specify invalid names or a non-existent combination.

context_category (4 values): Domain Knowledge Reasoning, Empirical Discovery & Simulation, Procedural Task Execution, Rule System Application

Valid (context_category, sub_category) pairs:

context_category	sub_category
Domain Knowledge Reasoning	Finance, Healthcare, Humanities, Legal Advisory, Lifestyle, Management, Science
Empirical Discovery & Simulation	Experimental Data, Observational Data, Simulation Environment
Procedural Task Execution	Instructional Procedures, Operational Procedures, Workflow Orchestration
Rule System Application	Game Mechanics, Legal & Regulatory, Mathematical Formalism, Programming Syntax, Technical Standards

Changelog

0.1.0: Environment created.