SDR-Arena
A verifiers environment for SDR-Bench — benchmarking LLM personalization capabilities on B2B sales research tasks.
Paper & resources: https://behavior-in-the-wild.github.io/SDR-Bench.html
Background
SDR-Bench evaluates how well LLMs can generate personalized value propositions for B2B sales prospects. The benchmark is grounded in a Bayesian Persuasion framework that formally defines personalization, using real customer success stories as ground truth.
Dataset: 6,279 verified business success stories across 203 domains, 134 industries, and 361 seller companies. Each story contains verified pitch points extracted from real deal evidence.
Key insight from the paper: Deep research agents (STORM, ODR) that use multi-turn web search outperform standard LLMs, but a substantial gap remains before human-level sales proficiency. The benchmark uses time-restricted web search (results before the story's publication date) to prevent data leakage.
Top results from the paper: STORM-QWEN-2.5 achieved the highest aggregate Weighted Coverage Score (42.51 on success stories), with significant variance across industries — deep research agents excel in Healthcare while standard models plateau in competitive Tech sectors.
This Environment
This verifiers environment implements SDR-Bench with:
- Time-filtered web search via Brightdata SERP API (date-restricted to before each story's publication)
- LLM-as-judge scoring (Azure GPT-4o) with a 0-5 Likert scale per pitch point
- Two modes: `SDRArenaEnv` for standard LLM eval/training, `AgentArena` for benchmarking custom agents (STORM, ODR, etc.)
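The date restriction in the first bullet is applied by the search API in this environment. As a purely illustrative sketch of the same rule done client-side, the snippet below drops any result that is not dated strictly before a story's publication date. The function name `filter_before_cutoff` and the result fields `url` and `date` (an ISO `YYYY-MM-DD` string) are assumptions for the example, not the environment's actual data model.

```python
from datetime import date


def filter_before_cutoff(results: list[dict], cutoff: date) -> list[dict]:
    """Keep only results dated strictly before the cutoff.

    Results with a missing or unparseable date are dropped, because they
    cannot be shown to predate the story (assumed, conservative policy).
    """
    kept = []
    for result in results:
        try:
            published = date.fromisoformat(result.get("date"))
        except (TypeError, ValueError):
            continue  # no usable date, so treat it as potential leakage
        if published < cutoff:
            kept.append(result)
    return kept


# Placeholder data, not taken from the benchmark.
hits = [
    {"url": "https://example.com/a", "date": "2023-01-15"},
    {"url": "https://example.com/b", "date": "2024-06-01"},
    {"url": "https://example.com/c"},
]
safe = filter_before_cutoff(hits, cutoff=date(2023, 6, 1))
print([hit["url"] for hit in safe])
```

Dropping undated results trades recall for safety; a more permissive policy would risk leaking the success story itself into the agent's context.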
How Scoring Works
For each ground-truth pitch point, the judge assigns:
| Score | Meaning |
|---|---|
| 0 | Miss — concept absent from candidate pitch |
| 1 | Marketing fluff — vague mention, no substance |
| 2 | Topic match — correct product/pain, wrong solution |
| 3 | Soft match — correct value prop, missing hero evidence |
| 4 | Strong — core value + key mechanism captured |
| 5 | Bullseye — product + pain + value + specific metric |
Reward = avg(scores) / 5.0, which normalizes the mean judge score to [0, 1].
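The reward formula can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only; the function name `reward_from_scores` and the handling of an empty score list are assumptions, not the environment's implementation.

```python
from statistics import mean


def reward_from_scores(scores: list[int]) -> float:
    """Average per-pitch-point Likert scores (0-5) and rescale to [0, 1]."""
    if not scores:
        # Assumption: with no ground-truth pitch points there is nothing to reward.
        return 0.0
    if any(score < 0 or score > 5 for score in scores):
        raise ValueError("each score must be in the range 0-5")
    return mean(scores) / 5.0


# One bullseye, one soft match, one miss, one strong match: mean is 3.0.
print(reward_from_scores([5, 3, 0, 4]))  # 0.6
```

Because every pitch point carries equal weight, a single miss (0) pulls the reward down as much as a bullseye (5) lifts it.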
Quick Start
Install
prime env install sdr-arena
# or
pip install git+https://github.com/h4shk4t/sdr-arena-verifiers.git
Setup credentials
Create a .env file in the environment directory:
# Azure GPT-4o (judge model)
ENDPOINT=https://<resource>.cognitiveservices.azure.com
SUBSCRIPTION_KEY=<azure-api-key>
API_VERSION=2025-01-01-preview
# Brightdata SERP (web search)
BRIGHTDATA_API_TOKEN=<brightdata-token>
BRIGHTDATA_SERP_ZONE=serp_api_bdr
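Before launching a run, it can help to confirm that all five variables listed above are actually set. The following is a small sketch of such a check; the helper name `missing_credentials` is hypothetical and is not part of the package.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "ENDPOINT",
    "SUBSCRIPTION_KEY",
    "API_VERSION",
    "BRIGHTDATA_API_TOKEN",
    "BRIGHTDATA_SERP_ZONE",
)


def missing_credentials(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All credentials present.")
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real process environment.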
Run evaluation
# Default (5 examples, 3 rollouts)
prime eval run sdr-arena
# With Azure GPT-4o as the policy model
set -a && source .env && set +a
prime eval run sdr-arena -m azure-gpt4o -n 45 -r 3 -s -v \
-b "https://<resource>.cognitiveservices.azure.com/openai/deployments/gpt-4o?api-version=2025-01-01-preview" \
-k SUBSCRIPTION_KEY
Azure note: Always pass both `-b` and `-k` explicitly; the prime bridge otherwise injects its own inference URL and API key.
Custom Agents (AgentArena)
Benchmark any Python agent (STORM, ODR, custom pipelines) against the same dataset and judge:
from sdr_arena import build_dataset, build_rubric, AgentArena

async def my_agent(topic, search_fn, client=None, model="", info={}):
    # Issue one web search built from the seller and customer names.
    results = await search_fn([f"{info['seller']} {info['customer']} case study"])
    # ... your multi-step research pipeline ...
    return "pitch points string"

def load_environment(**kwargs):
    # Wrap the agent with the shared dataset and judge rubric.
    return AgentArena(agent_fn=my_agent, dataset=build_dataset(), rubric=build_rubric())
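An agent with this signature can be smoke-tested offline before it is wired into the environment. The sketch below swaps the real search function for a stub and runs the coroutine with `asyncio`. It assumes that `search_fn` takes a list of query strings and returns a list of results; that return shape, the stub `fake_search`, and the `SellerCo`/`BuyerCo` values are hypothetical.

```python
import asyncio


async def fake_search(queries):
    """Hypothetical stand-in for search_fn: one canned snippet per query."""
    return [f"stub result for: {query}" for query in queries]


async def my_agent(topic, search_fn, client=None, model="", info=None):
    """Toy agent: run one search and turn each hit into a bullet point."""
    info = info or {}
    query = f"{info.get('seller', '')} {info.get('customer', '')} case study".strip()
    results = await search_fn([query])
    return "\n".join(f"- {result}" for result in results)


pitch = asyncio.run(
    my_agent("demo", fake_search, info={"seller": "SellerCo", "customer": "BuyerCo"})
)
print(pitch)
```

Keeping the search function injectable is what makes this possible: the same agent code runs against the stub here and against the time-filtered search when driven by `AgentArena`.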
Register in pyproject.toml and run:
[tool.verifiers.environments]
my-agent = "my_agent_module"
prime eval run my-agent -m azure-gpt4o -n 15 -r 3 -b "..." -k SUBSCRIPTION_KEY
The included storm-arena environment implements the full STORM pipeline (perspective generation, knowledge curation via parallel Q&A with web search, two-pass outline, pitch synthesis).
Dataset
| Field | Description |
|---|---|
| `prompt` | BDR task: seller, customer, products to pitch |
| `answer` | Ground-truth pitch points with evidence |
| `info.seller` | Seller company |
| `info.customer` | Target customer |
| `info.products` | Products being pitched |
| `info.published_date` | Story publication date (search cutoff) |
7,283 matched samples derived from SDR-Bench. The original dataset contains 6,279 success stories; the matched samples pair prompts with extracted pitch point ground truth.
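To make the record layout concrete, here is a sketch of one sample as typed Python dictionaries. The field names come from the table above, but the value types (for example, `products` as a list of strings and `published_date` as an ISO date string) are assumptions, and every value shown is a placeholder rather than real benchmark data.

```python
from datetime import date
from typing import TypedDict


class Info(TypedDict):
    seller: str
    customer: str
    products: list[str]   # assumed type
    published_date: str   # assumed ISO format, YYYY-MM-DD


class Sample(TypedDict):
    prompt: str
    answer: str
    info: Info


def search_cutoff(sample: Sample) -> date:
    """Parse the story's publication date, which bounds web search."""
    return date.fromisoformat(sample["info"]["published_date"])


# Placeholder record for illustration only.
sample: Sample = {
    "prompt": "<BDR task text>",
    "answer": "<ground-truth pitch points>",
    "info": {
        "seller": "SellerCo",
        "customer": "BuyerCo",
        "products": ["ProductA"],
        "published_date": "2023-06-01",
    },
}
print(search_cutoff(sample))  # 2023-06-01
```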
Environment Arguments
| Arg | Default | Description |
|---|---|---|
| `websearch_concurrency` | 20 | Max parallel Brightdata SERP requests |
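A cap on parallel requests like `websearch_concurrency` is commonly enforced with a semaphore. The sketch below shows that general pattern with `asyncio.Semaphore`; it is not taken from the environment's source, and `run_bounded` and `fetch` are hypothetical names.

```python
import asyncio


async def run_bounded(queries, fetch, limit=20):
    """Run fetch(query) for every query with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(query):
        # Wait for a free slot, then hold it for the duration of the request.
        async with semaphore:
            return await fetch(query)

    # gather preserves input order, so results line up with queries.
    return await asyncio.gather(*(guarded(query) for query in queries))


async def demo_fetch(query):
    """Stand-in for a real SERP request."""
    await asyncio.sleep(0.01)
    return f"results for {query}"


if __name__ == "__main__":
    print(asyncio.run(run_bounded(["a", "b", "c"], demo_fetch, limit=2)))
```

A lower limit reduces the risk of rate limiting from the search provider at the cost of slower rollouts; the default of 20 is the value listed in the table.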
Citation
@misc{sdrbench2026,
author = {Srivastava, Ashutosh and Yedlapati, Siddharth and Aggarwal, Vinay and Dixit, Shashwat and Singla, Yaman Kumar},
title = {SDR-Bench: Benchmarking the Personalization Capabilities of Large Language Models},
year = {2026},
publisher = {Behavior in the Wild},
howpublished = {\url{https://behavior-in-the-wild.github.io/SDR-Bench.html}},
}
License
MIT