tau2_infinity_wg
Prime Intellect RL environment — airline customer-service gym backed by world-gen failure artifacts.
The agent plays the airline-support role against an LLM-simulated customer in a stateful tool environment. Each episode runs against a failure-*.json artifact that pins the initial world state and a set of objectives the agent must satisfy.
Rewards
Four components (ToolRL decomposition):
- outcome — LLM-as-judge (ORM) scores each objective on a continuous 5-bucket rubric; the final score is the mean across objectives. See verifier.py for the judge determinism treatment.
- format — strict tool-call schema compliance, binary.
- efficiency — tool-call count vs. optimal, decaying.
- compliance — penalties for parallel calls and unauthorized mutations.
See rewards.py for weights and the "Things explicitly deferred" section.
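As a rough illustration of how the four components could combine, here is a minimal sketch. Everything in it is an assumption made for illustration: the bucket values, the weights, the geometric decay, the subtractive compliance penalty, and the function names are hypothetical and are not taken from rewards.py or verifier.py, which remain the source of truth.

```python
from statistics import mean

# Hypothetical rubric: five evenly spaced buckets on [0, 1].
BUCKETS = (0.0, 0.25, 0.5, 0.75, 1.0)

# Hypothetical weights; the real values live in rewards.py.
WEIGHTS = {"outcome": 0.5, "format": 0.25, "efficiency": 0.25}


def outcome_score(bucket_indices):
    """Mean of per-objective rubric scores (one bucket index per objective)."""
    return mean(BUCKETS[i] for i in bucket_indices)


def efficiency_score(n_calls, n_optimal, decay=0.5):
    """1.0 at or below the optimal call count, halved for each extra call."""
    extra = max(0, n_calls - n_optimal)
    return decay ** extra


def total_reward(outcome, format_ok, efficiency, compliance_penalty):
    """Weighted sum of the positive components minus the compliance penalty."""
    positive = (
        WEIGHTS["outcome"] * outcome
        + WEIGHTS["format"] * (1.0 if format_ok else 0.0)
        + WEIGHTS["efficiency"] * efficiency
    )
    return positive - compliance_penalty
```

Under these assumed numbers, an episode with two objectives judged in buckets 4 and 2, five tool calls against an optimum of three, valid formatting, and a penalty of 0.125 would be scored with `total_reward(outcome_score([4, 2]), True, efficiency_score(5, 3), 0.125)`.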
Artifacts
The artifacts/ directory holds world-gen outputs — each file is a task the agent must solve. Artifacts are loaded verbatim by dataset.py; the env never regenerates them at runtime.
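For orientation, loading the artifacts verbatim might look like the sketch below. It is not the code in dataset.py: the function name `load_artifacts` and the choice to key tasks by filename are assumptions, and no artifact schema is assumed beyond each file being valid JSON.

```python
import json
from pathlib import Path


def load_artifacts(artifacts_dir="artifacts"):
    """Return {filename: parsed JSON} for every failure-*.json file, in sorted order.

    Hypothetical helper: files are read as-is and never modified or regenerated.
    """
    tasks = {}
    for path in sorted(Path(artifacts_dir).glob("failure-*.json")):
        with path.open(encoding="utf-8") as fh:
            tasks[path.name] = json.load(fh)
    return tasks
```

Sorting the paths keeps task order stable across runs, which helps when comparing evaluation results.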
Models
Default judge + customer-simulation model: bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0. Override via load_environment(judge_model=..., customer_model=...).
Required secrets (or env vars) at runtime: AWS_BEARER_TOKEN_BEDROCK, AWS_REGION.
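A small preflight check can surface missing credentials before an episode starts. The helper below is a hypothetical sketch, not part of this environment; only the two variable names come from the line above.

```python
import os

# Variable names as listed above; the helper itself is hypothetical.
REQUIRED_ENV_VARS = ("AWS_BEARER_TOKEN_BEDROCK", "AWS_REGION")


def missing_bedrock_env(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

One possible use is to call `missing_bedrock_env()` and raise an error if the list is non-empty before calling load_environment(judge_model=..., customer_model=...).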