0

Bench RLM RL Env (Community)

Fresh

τ²-bench evaluation environment. Focus on tau-knowledge with RLM.

Type
RL Env
License
apache-2.0
Published
Apr 2026

Cite

Notes

Only stored in your browser.

tau3-bench-rlm

Source Code

Overview

  • Environment ID: tau3-bench-rlm
  • Short description: TauBench in RLM form with root messaging and sub-agent tool use.
  • Tags: tool-agent-user, tool-use, multi-turn, user-sim, sierra-research, rlm

Architecture

This environment keeps TauBench's native dual-LLM setup:

  • Main evaluated model runs in RLMEnv Python REPL.
  • Tau user simulator remains a separate LLM (UserSimulator).

Control split:

  • Root model uses send_message(message=...) for user-facing assistant turns.
  • Sub-agents (via llm_batch) can call Tau assistant tools (for example KB_search, grep, and other domain tools).
  • Raw text fallback: If the root model emits a plain-text response with no tool call, it is automatically converted into a synthetic send_message tool call so the conversation advances instead of terminating on no_tools_called.

There is no manual step/get_state API.

Datasets

Quickstart

uv run vf-eval tau3-bench-rlm

Domain examples:

uv run vf-eval tau3-bench-rlm -a '{"domain":"banking_knowledge"}'
uv run vf-eval tau3-bench-rlm -a '{"domain":"retail"}'
uv run vf-eval tau3-bench-rlm -a '{"domain":"airline"}'
uv run vf-eval tau3-bench-rlm -n 100 -r 1 -s -m openai/gpt-5.2 -a '{"domain":"banking_knowledge","retrieval_variant":"openai_embeddings_grep"}'

Environment Arguments

ArgTypeDefaultDescription
domainstr"banking_knowledge"Tau domain/task set
user_modelstr"custom_openai/openai/gpt-4.1"Model used by Tau user simulator
user_argsdictDEFAULT_LLM_ARGS_USERSampling args for user simulator
user_base_urlstr"https://api.pinference.ai/api/v1"Base URL for user simulator model
user_api_key_varstr"PRIME_API_KEY"Env var for user simulator key
retrieval_variantstr | nullnullBanking knowledge retrieval variant
retrieval_kwargsdict | nullnullExtra retrieval args
max_stepsint200Tau internal max step count
max_errorsint10Tau internal max tool-error count
max_workersint128Thread pool workers for blocking Tau calls
max_turnsint50Max root tool calls per Tau assistant turn; resets after each send_message. When exceeded, the model is forced to call send_message (further tool calls raise until then).
sub_llm_max_turnsint5Sub-LLM tool-calling turn cap
sub_modelstr | nullnullOptional sub-LLM model override
max_sub_llm_parallelismint5Max concurrent sub-LLM calls
max_output_lengthint8192Max REPL execution output
code_execution_timeoutint120REPL code execution timeout (seconds)
abort_on_code_timeoutboolfalseAbort rollout on REPL timeout
sandbox_docker_imagestr"python:3.11-slim"Sandbox image
sandbox_cpu_coresint1Sandbox CPU cores
sandbox_memory_gbint2Sandbox memory
sandbox_disk_size_gbint5Sandbox disk size
sandbox_gpu_countint0Sandbox GPU count
sandbox_timeout_minutesint60Sandbox lifetime

Metrics

MetricMeaning
reward / evaluate_tau2_taskOfficial TauBench reward
num_errorsTau internal tool error count
num_stepsTau internal step count
num_assistant_tool_callsAssistant tool calls executed (mostly via sub-agents)
num_user_tool_callsUser simulator tool calls
main_rlm_*, sub_llm_*, repl_*, root_tool_*Built-in RLM monitor metrics

Rubric & reward info in results

The environment automatically includes RECOMMENDED_STATE_COLUMNS (tau2_reward_info, tau2_task_info) in every eval run — no extra flags needed. Any additional columns passed via -C are merged in.

State columnContents
tau2_reward_infoFull reward breakdown: db_check, action_checks, env_assertions, communicate_checks, nl_assertions, reward_basis, reward_breakdown
tau2_task_infoTask rubric: task_id, evaluation_criteria (expected actions, reward_basis), user_scenario (user instructions), description, required_documents

Changelog

v0.1.1 (Apr 10, 2026)

  • Pin tau2 to commit 58e5e1ace69302e6982d27014569c03e0ffccdd2 instead of the moving main branch for reproducible installs.

v0.1.0 (Mar 21, 2026)

  • Ported tau-bench environment to RLMEnv.
  • Added root bridge tool send_message(...).
  • Exposed Tau assistant tools to sub-agents (via llm_batch), not root.
  • Kept official Tau simulation + evaluation logic.
  • Raw text assistant messages (no tool call) are auto-converted to send_message instead of terminating the episode.
  • Task rubric info (tau2_task_info) is persisted to state for inclusion in results.
  • Added tau2_task_info to RECOMMENDED_STATE_COLUMNS.