SDR Arena RL Env (Hashkat)

Type: RL Env
Publisher: Hashkat
Runtime: tool-use
License: unknown
Size: v0.1.0
Published: Mar 2026

SDR-Arena

A verifiers environment for SDR-Bench — benchmarking LLM personalization capabilities on B2B sales research tasks.

Paper & resources: https://behavior-in-the-wild.github.io/SDR-Bench.html


Background

SDR-Bench evaluates how well LLMs can generate personalized value propositions for B2B sales prospects. The benchmark is grounded in a Bayesian Persuasion framework that formally defines personalization, using real customer success stories as ground truth.

Dataset: 6,279 verified business success stories across 203 domains, 134 industries, and 361 seller companies. Each story contains verified pitch points extracted from real deal evidence.

Key insight from the paper: Deep research agents (STORM, ODR) that use multi-turn web search outperform standard LLMs, but a substantial gap remains before human-level sales proficiency. The benchmark uses time-restricted web search (results before the story's publication date) to prevent data leakage.

Top results from the paper: STORM-QWEN-2.5 achieved the highest aggregate Weighted Coverage Score (42.51 on success stories), with significant variance across industries — deep research agents excel in Healthcare while standard models plateau in competitive Tech sectors.


This Environment

This verifiers environment implements SDR-Bench with:

  • Time-filtered web search via Brightdata SERP API (date-restricted to before each story's publication)
  • LLM-as-judge scoring (Azure GPT-4o) with a 0-5 Likert scale per pitch point
  • Two modes: SDRArenaEnv for standard LLM eval/training, AgentArena for benchmarking custom agents (STORM, ODR, etc.)
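As a rough illustration of what date-restricted search can look like, the sketch below builds a Google search URL that only returns results dated before a story's publication date. It relies on Google's widely used `tbs=cdr:1,cd_max:M/D/YYYY` custom date range parameter. The helper name `date_restricted_search_url` is hypothetical, and this is not necessarily how the environment builds its Brightdata SERP requests.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def date_restricted_search_url(query: str, published: date) -> str:
    """Build a Google search URL limited to results dated before `published`.

    Hypothetical helper: the environment's real request format is not shown
    in this README.
    """
    # "Before the publication date" is taken to mean the previous day at latest.
    cutoff = published - timedelta(days=1)
    # Google's custom date range filter uses M/D/YYYY for its bounds.
    tbs = f"cdr:1,cd_max:{cutoff.month}/{cutoff.day}/{cutoff.year}"
    return "https://www.google.com/search?" + urlencode({"q": query, "tbs": tbs})


url = date_restricted_search_url("SellerCo CustomerCo case study", date(2024, 3, 15))
print(url)
```

Filtering at query time keeps leaked post-publication pages out of the agent's context entirely, which is simpler than filtering results after the fact.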

How Scoring Works

For each ground-truth pitch point, the judge assigns:

Score  Meaning
0      Miss: concept absent from candidate pitch
1      Marketing fluff: vague mention, no substance
2      Topic match: correct product/pain, wrong solution
3      Soft match: correct value prop, missing hero evidence
4      Strong: core value + key mechanism captured
5      Bullseye: product + pain + value + specific metric

Reward = avg(scores) / 5.0, which normalizes the average judge score to [0, 1].
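The reward formula can be sketched in a few lines of Python. The function name `reward_from_scores` is hypothetical, and returning 0.0 for an empty score list is an assumption, since the README does not say how that case is handled.

```python
from statistics import mean

MAX_SCORE = 5  # top of the 0-5 Likert scale used by the judge


def reward_from_scores(scores: list[int]) -> float:
    """Average per-pitch-point judge scores and normalize to [0, 1]."""
    if not scores:
        # Assumption: no ground-truth pitch points means zero reward.
        return 0.0
    for score in scores:
        if not 0 <= score <= MAX_SCORE:
            raise ValueError(f"score out of range: {score}")
    return mean(scores) / MAX_SCORE


# Four pitch points scored 5, 3, 0 and 4 average to 3.0, so the reward is 0.6.
print(reward_from_scores([5, 3, 0, 4]))
```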


Quick Start

Install

prime env install sdr-arena
# or
pip install git+https://github.com/h4shk4t/sdr-arena-verifiers.git

Setup credentials

Create a .env file in the environment directory:

# Azure GPT-4o (judge model)
ENDPOINT=https://<resource>.cognitiveservices.azure.com
SUBSCRIPTION_KEY=<azure-api-key>
API_VERSION=2025-01-01-preview

# Brightdata SERP (web search)
BRIGHTDATA_API_TOKEN=<brightdata-token>
BRIGHTDATA_SERP_ZONE=serp_api_bdr
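Before launching a run, it can help to confirm that every variable listed above is actually set. The following is a minimal sketch using only the standard library; `missing_credentials` is a hypothetical helper, not part of the package, and it assumes the variables have already been exported into the process environment.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = (
    "ENDPOINT",
    "SUBSCRIPTION_KEY",
    "API_VERSION",
    "BRIGHTDATA_API_TOKEN",
    "BRIGHTDATA_SERP_ZONE",
)


def missing_credentials(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```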

Run evaluation

# Default (5 examples, 3 rollouts)
prime eval run sdr-arena

# With Azure GPT-4o as the policy model
set -a && source .env && set +a
prime eval run sdr-arena -m azure-gpt4o -n 45 -r 3 -s -v \
  -b "https://<resource>.cognitiveservices.azure.com/openai/deployments/gpt-4o?api-version=2025-01-01-preview" \
  -k SUBSCRIPTION_KEY

Azure note: Always pass both -b and -k explicitly — the prime bridge otherwise injects its own inference URL and API key.


Custom Agents (AgentArena)

Benchmark any Python agent (STORM, ODR, custom pipelines) against the same dataset and judge:

from sdr_arena import build_dataset, build_rubric, AgentArena

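async def my_agent(topic, search_fn, client=None, model="", info=None):
    info = info or {}  # avoid sharing one mutable default dict across calls
    # Search for public evidence about this seller/customer pair.
    results = await search_fn([f"{info['seller']} {info['customer']} case study"])
    # ... your multi-step research pipeline ...
    return "pitch points string"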

def load_environment(**kwargs):
    return AgentArena(agent_fn=my_agent, dataset=build_dataset(), rubric=build_rubric())

Register in pyproject.toml and run:

[tool.verifiers.environments]
my-agent = "my_agent_module"

prime eval run my-agent -m azure-gpt4o -n 15 -r 3 -b "..." -k SUBSCRIPTION_KEY
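An agent with this signature can be smoke-tested offline before spending judge or search credits. The sketch below swaps in a stub for `search_fn`; `fake_search` and its return shape (a list of strings) are assumptions made for illustration, since the README does not document what the real `search_fn` returns. The seller and customer names are placeholders.

```python
import asyncio


async def fake_search(queries):
    """Hypothetical stand-in for the environment's search_fn (return shape assumed)."""
    return [f"stub result for: {query}" for query in queries]


async def my_agent(topic, search_fn, client=None, model="", info=None):
    info = info or {}
    results = await search_fn([f"{info['seller']} {info['customer']} case study"])
    # A real agent would run its research pipeline here; this one echoes results.
    return "\n".join(f"- {result}" for result in results)


pitch = asyncio.run(
    my_agent(
        "demo topic",
        fake_search,
        info={"seller": "SellerCo", "customer": "CustomerCo"},  # placeholder names
    )
)
print(pitch)
```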

The included storm-arena environment implements the full STORM pipeline (perspective generation, knowledge curation via parallel Q&A with web search, two-pass outline, pitch synthesis).


Dataset

Field                Description
prompt               BDR task: seller, customer, products to pitch
answer               Ground-truth pitch points with evidence
info.seller          Seller company
info.customer        Target customer
info.products        Products being pitched
info.published_date  Story publication date (search cutoff)
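To make the record shape concrete, here is a hedged sketch of the `info` fields as a `TypedDict`, together with a check that a search result predates the cutoff. The field types are assumptions (in particular, `published_date` is assumed to be an ISO `YYYY-MM-DD` string and `products` a list of strings), and the sample values are placeholders rather than real dataset entries.

```python
from datetime import date
from typing import TypedDict


class SampleInfo(TypedDict):
    seller: str
    customer: str
    products: list[str]  # assumption: the real field may be a single string
    published_date: str  # assumption: ISO "YYYY-MM-DD"


def allowed_by_cutoff(result_date: date, info: SampleInfo) -> bool:
    """True when a search result is dated strictly before the story's publication."""
    return result_date < date.fromisoformat(info["published_date"])


# Placeholder record for illustration only.
example: SampleInfo = {
    "seller": "SellerCo",
    "customer": "CustomerCo",
    "products": ["Example Product"],
    "published_date": "2024-03-15",
}
print(allowed_by_cutoff(date(2024, 3, 14), example))
```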

7,283 matched samples derived from SDR-Bench. The original dataset contains 6,279 success stories; the matched samples pair prompts with extracted pitch point ground truth.


Environment Arguments

Arg                    Default  Description
websearch_concurrency  20       Max parallel Brightdata SERP requests
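A concurrency cap like `websearch_concurrency` is commonly implemented with an `asyncio.Semaphore`. The sketch below shows that general pattern; `run_bounded` and `fake_fetch` are hypothetical names, and the environment's actual implementation may differ.

```python
import asyncio


async def run_bounded(queries, fetch, limit=20):
    """Run fetch(query) for every query with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(query):
        async with semaphore:  # waits while `limit` requests are already running
            return await fetch(query)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(query) for query in queries))


stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(query):
    """Stand-in for a SERP request that records peak concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return query.upper()


results = asyncio.run(run_bounded([f"q{i}" for i in range(50)], fake_fetch, limit=20))
print(len(results), stats["peak"])
```

Lowering the limit trades throughput for fewer simultaneous requests, which may matter if the search provider rate-limits the account.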

Citation

@misc{sdrbench2026,
  author = {Srivastava, Ashutosh and Yedlapati, Siddharth and Aggarwal, Vinay and Dixit, Shashwat and Singla, Yaman Kumar},
  title = {SDR-Bench: Benchmarking the Personalization Capabilities of Large Language Models},
  year = {2026},
  publisher = {Behavior in the Wild},
  howpublished = {\url{https://behavior-in-the-wild.github.io/SDR-Bench.html}},
}

License

MIT