Agent BT RL Env (Shiels AI)

Bradley-Terry pairwise environment for comparing blinded research papers

Type: RL Env
Publisher: Shiels AI
Runtime: tool-use
License: unknown
Size: v0.1.7
Published: Mar 2026

impact-agent-bt

Environment for Bradley-Terry style pairwise comparison of blinded scientific papers.

Overview

  • Environment package name: impact-agent-bt
  • Local eval ID after install: impact-agent-bt
  • Task type: multi-turn tool-use (vf.StatefulToolEnv)
  • Domain: blinded dementia-research paper comparisons

Task

The model is given two blinded papers, Paper A and Paper B, and must inspect sections from both before submitting which paper should win the pairwise comparison.

The final tool call must submit a JSON object with exactly:

  • predicted_winner: "A" or "B"
  • confidence_logit: finite numeric confidence, expressed as a logit
  • reasoning: brief explanation

Tools

The environment exposes two tools to the model:

  1. scan_paper(section_name: str, target_paper: str)
     • Reads a section from Paper A or Paper B.
     • target_paper must be "A" or "B".
     • Raises a tool error if the section does not exist.
  2. submit_preference(prediction_json: str)
     • Submits the final pairwise decision as a JSON string.
     • Required schema:
       • {"predicted_winner": "A" | "B", "confidence_logit": <number>, "reasoning": <string>}
     • Extra fields are rejected.
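
The schema rules above can be sketched as a small standalone validator. This is an illustration only, not the environment's actual code: validate_prediction and ALLOWED_KEYS are hypothetical names, and the exact error types the real tool raises are not documented here.

```python
import json
import math

# Hypothetical constant: the three keys the documented schema allows.
ALLOWED_KEYS = {"predicted_winner", "confidence_logit", "reasoning"}


def validate_prediction(prediction_json: str) -> dict:
    """Hypothetical validator mirroring the documented submit_preference schema."""
    payload = json.loads(prediction_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("submission must be a JSON object")
    if set(payload) != ALLOWED_KEYS:
        # Covers both missing keys and rejected extra fields.
        raise ValueError(f"expected exactly the keys {sorted(ALLOWED_KEYS)}")
    if payload["predicted_winner"] not in ("A", "B"):
        raise ValueError('predicted_winner must be "A" or "B"')
    logit = payload["confidence_logit"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(logit, bool) or not isinstance(logit, (int, float)):
        raise ValueError("confidence_logit must be a number")
    if not math.isfinite(logit):
        raise ValueError("confidence_logit must be finite")
    if not isinstance(payload["reasoning"], str):
        raise ValueError("reasoning must be a string")
    return payload


# Example payload as the model would pass it to submit_preference.
example = json.dumps(
    {
        "predicted_winner": "A",
        "confidence_logit": 1.5,
        "reasoning": "Paper A reports a larger cohort with clearer methods.",
    }
)
validated = validate_prediction(example)
```

Comparing the key set for equality handles the "exactly these fields" requirement in a single check.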

Notes:

  • Hidden tool args are injected from environment state via StatefulToolEnv.update_tool_args(...).
  • The environment tracks viewed sections for both papers.
  • The rollout is marked complete after a valid submit_preference(...) call.
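
The hidden-argument pattern in the notes above can be sketched without the verifiers dependency. The functions below are simplified stand-ins: this update_tool_args does not use the real StatefulToolEnv method signature, and the state layout (papers, viewed) is an assumption made for illustration.

```python
from typing import Any


def update_tool_args(
    tool_name: str, tool_args: dict[str, Any], state: dict[str, Any]
) -> dict[str, Any]:
    """Simplified stand-in: merge environment-owned values into model-visible args."""
    merged = dict(tool_args)
    if tool_name == "scan_paper":
        # The model never sees or supplies these; they come from rollout state.
        merged["papers"] = state["papers"]
        merged["viewed"] = state["viewed"]
    return merged


def scan_paper(
    section_name: str,
    target_paper: str,
    papers: dict[str, dict[str, str]],
    viewed: dict[str, set[str]],
) -> str:
    """Return one section and record that it was viewed (assumed behavior)."""
    if target_paper not in ("A", "B"):
        raise ValueError('target_paper must be "A" or "B"')
    sections = papers[target_paper]
    if section_name not in sections:
        raise KeyError(f"Paper {target_paper} has no section {section_name!r}")
    viewed[target_paper].add(section_name)
    return sections[section_name]


# Hypothetical rollout state holding both blinded papers.
state = {
    "papers": {
        "A": {"abstract": "Abstract of Paper A."},
        "B": {"abstract": "Abstract of Paper B."},
    },
    "viewed": {"A": set(), "B": set()},
}

# The model supplies only the visible arguments; the rest is injected.
model_args = {"section_name": "abstract", "target_paper": "A"}
section_text = scan_paper(**update_tool_args("scan_paper", model_args, state))
```

Keeping paper contents in state rather than in the tool signature means the model cannot pass or tamper with them, and the viewed-section record is updated as a side effect of each scan.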

Reward / Metrics

The default rubric (BradleyTerryRubric) combines:

  • order_reward: positive reward for the correct winner, scaled by sigmoid-transformed confidence
  • bilateral_scan_bonus: rewards scanning evidence-bearing sections from both papers
  • reasoning_bonus: rewards valid submission JSON
  • completion_bonus: rewards finishing with a valid submission

Key tunables:

  • order_reward_weight
  • bilateral_scan_bonus_weight
  • reasoning_bonus_weight
  • completion_weight
  • logit_clip
  • max_turns
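
As a rough illustration of how these components might combine, the sketch below assumes a plain weighted sum, a zero order reward for the wrong winner, and boolean bonus signals. None of these details are confirmed by this README; the actual BradleyTerryRubric may differ. The default weights are the ones listed under Environment Arguments.

```python
import math

# Default weights as listed under Environment Arguments.
DEFAULT_WEIGHTS = {
    "order_reward_weight": 1.0,
    "bilateral_scan_bonus_weight": 0.2,
    "reasoning_bonus_weight": 0.05,
    "completion_weight": 0.1,
}


def order_reward(
    predicted_winner: str,
    true_winner: str,
    confidence_logit: float,
    logit_clip: float = 8.0,
) -> float:
    """Sigmoid of the clipped logit when the winner is correct, else 0 (assumed)."""
    if predicted_winner != true_winner:
        return 0.0  # assumption: the README only describes the correct case
    clipped = max(-logit_clip, min(logit_clip, confidence_logit))
    return 1.0 / (1.0 + math.exp(-clipped))


def total_reward(
    predicted_winner: str,
    true_winner: str,
    confidence_logit: float,
    scanned_both: bool,
    valid_submission: bool,
    weights: dict[str, float] = DEFAULT_WEIGHTS,
    logit_clip: float = 8.0,
) -> float:
    """Assumed weighted sum of the four rubric components."""
    order = order_reward(predicted_winner, true_winner, confidence_logit, logit_clip)
    return (
        weights["order_reward_weight"] * order
        + weights["bilateral_scan_bonus_weight"] * float(scanned_both)
        + weights["reasoning_bonus_weight"] * float(valid_submission)
        + weights["completion_weight"] * float(valid_submission)
    )


# Correct winner at logit 0 (sigmoid = 0.5), both papers scanned, valid JSON.
example_reward = total_reward(
    "A", "A", 0.0, scanned_both=True, valid_submission=True
)
```

Under these assumptions the example evaluates to about 0.85 (0.5 + 0.2 + 0.05 + 0.1). Clipping keeps extreme logits from saturating the sigmoid beyond the value at logit_clip.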

Dataset

The runtime loader supports normalized pair/paper JSONL artifacts and legacy JSONL records.

Default behavior:

  • Loads from packaged dataset resources under impact_agent_bt/data/
  • Supports train, val, test, and all split selection when those files are available

Optional override:

  • Pass jsonl_path to load_environment() to use a custom dataset base path
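
A minimal sketch of split-based JSONL loading follows. The file naming (<split>.jsonl under a base directory), the behavior of "all" as a concatenation of the available splits, and the pair_id field in the demo records are all assumptions for illustration; the packaged loader in dataset.py may work differently.

```python
import json
import tempfile
from pathlib import Path

SPLITS = ("train", "val", "test")


def read_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_split(base_dir: Path, split: str) -> list[dict]:
    """Load one split, or every available split when split == "all" (assumed)."""
    if split != "all" and split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    names = SPLITS if split == "all" else (split,)
    records: list[dict] = []
    for name in names:
        path = base_dir / f"{name}.jsonl"  # hypothetical file naming
        if path.exists():  # splits are optional, so skip missing files
            records.extend(read_jsonl(path))
    return records


# Demo with throwaway files; pair_id is a made-up field name.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "train.jsonl").write_text(
        '{"pair_id": "p1"}\n{"pair_id": "p2"}\n', encoding="utf-8"
    )
    (base / "val.jsonl").write_text('{"pair_id": "p3"}\n', encoding="utf-8")
    train_records = load_split(base, "train")
    all_records = load_split(base, "all")
```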

Environment Arguments

Supported load_environment(...) args:

Arg                          Type        Default  Description
split                        str         "train"  Dataset split: train, val, test, all
max_turns                    int         10       Max tool interactions per rollout
jsonl_path                   str | None  None     Optional dataset path override
order_reward_weight          float       1.0      Correct-order reward weight
bilateral_scan_bonus_weight  float       0.2      Bonus for scanning both papers
reasoning_bonus_weight       float       0.05     Bonus for valid JSON reasoning payload
completion_weight            float       0.1      Bonus for successful completion
logit_clip                   float       8.0      Confidence clipping before sigmoid

Quickstart

Install the environment package locally:

prime env install impact-agent-bt

Run local evaluation:

prime eval run impact-agent-bt -m gpt-4.1-mini

Run with explicit args:

prime eval run impact-agent-bt -m gpt-4.1-mini -a '{"split":"train","max_turns":10}'
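
When the argument JSON gets longer, quoting it by hand is error-prone. One option is to build the command from a Python dict. This sketch only assembles and prints the command line; it does not invoke the CLI, and the argument values shown are arbitrary examples drawn from the table above.

```python
import json
import shlex

# Any subset of the load_environment(...) args documented above.
env_args = {"split": "train", "max_turns": 10, "logit_clip": 8.0}

command = [
    "prime", "eval", "run", "impact-agent-bt",
    "-m", "gpt-4.1-mini",
    # Compact separators keep the JSON free of spaces.
    "-a", json.dumps(env_args, separators=(",", ":")),
]

# shlex.join applies POSIX shell quoting, so the JSON survives copy-paste.
print(shlex.join(command))
```

The printed line can be pasted into a POSIX shell, or the command list can be passed directly to subprocess.run without any quoting step.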

Hosted Training

Example hosted training config lives at:

  • configs/hosted/impact-agent-bt.toml

Run:

prime rl run @ configs/hosted/impact-agent-bt.toml

Development Notes

  • Source of truth implementation: environments/impact_agent_bt/impact_agent_bt/env.py
  • Tools: environments/impact_agent_bt/impact_agent_bt/tools.py
  • Rubric: environments/impact_agent_bt/impact_agent_bt/rubrics/bradley_terry.py
  • Dataset loader: environments/impact_agent_bt/impact_agent_bt/dataset.py