SDR-Arena

A verifiers environment for SDR-Bench, a benchmark of LLM personalization capabilities on B2B sales research tasks.

Paper & resources:

SDR-Bench paper | Dataset (HuggingFace) | Leaderboard

Background

SDR-Bench evaluates how well LLMs can generate personalized value propositions for B2B sales prospects. The benchmark is grounded in a Bayesian Persuasion framework that formally defines personalization, using real customer success stories as ground truth.

Dataset: 6,279 verified business success stories across 203 domains, 134 industries, and 361 seller companies. Each story contains verified pitch points extracted from real deal evidence.

Key insight from the paper: Deep research agents (STORM, ODR) that use multi-turn web search outperform standard LLMs, but a substantial gap remains before human-level sales proficiency. The benchmark uses time-restricted web search (only results from before the story's publication date) to prevent data leakage.

Top results from the paper: STORM-QWEN-2.5 achieved the highest aggregate Weighted Coverage Score (42.51 on success stories), with significant variance across industries: deep research agents excel in Healthcare, while standard models plateau in competitive Tech sectors.

This Environment

This verifiers environment implements SDR-Bench with:

Time-filtered web search via Brightdata SERP API (date-restricted to before each story's publication)
LLM-as-judge scoring (Azure GPT-4o) with a 0-5 Likert scale per pitch point
Two modes: SDRArenaEnv for standard LLM eval/training, AgentArena for benchmarking custom agents (STORM, ODR, etc.)
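The date restriction in the first item can be pictured as a cutoff rule. The sketch below is not the environment's implementation: it assumes each search result is a dict with an ISO-formatted `date` key and that `published_date` is an ISO date string, both of which are hypothetical shapes chosen for illustration.

```python
from datetime import date


def filter_before(results, published_date):
    """Keep only results dated strictly before the story's publication date."""
    cutoff = date.fromisoformat(published_date)
    kept = []
    for result in results:
        result_date = result.get("date")
        # Assumption: undated results are dropped, because they cannot be
        # shown to predate the cutoff.
        if result_date and date.fromisoformat(result_date) < cutoff:
            kept.append(result)
    return kept


# Hypothetical search results, for illustration only.
sample_results = [
    {"title": "Earlier press release", "date": "2023-01-10"},
    {"title": "The success story itself", "date": "2023-06-01"},
    {"title": "Undated page"},
]
print(filter_before(sample_results, "2023-06-01"))
```

Using a strict comparison excludes anything published on the same day as the story, which is the conservative choice when the goal is to prevent leakage.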

How Scoring Works

For each ground-truth pitch point, the judge assigns:

Score	Meaning
0	Miss — concept absent from candidate pitch
1	Marketing fluff — vague mention, no substance
2	Topic match — correct product/pain, wrong solution
3	Soft match — correct value prop, missing hero evidence
4	Strong — core value + key mechanism captured
5	Bullseye — product + pain + value + specific metric

Reward = avg(scores) / 5.0, which normalizes the mean judge score to [0, 1].
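As a worked example of the reward formula, here is a minimal sketch. The function name `pitch_reward` and the handling of an empty score list are assumptions for illustration, not part of the environment's code.

```python
from statistics import mean

MAX_SCORE = 5.0  # top of the 0-5 Likert scale


def pitch_reward(scores):
    """Average the per-pitch-point judge scores and scale the result to [0, 1]."""
    if not scores:
        # Assumption: no scored pitch points yields zero reward.
        return 0.0
    for score in scores:
        if not 0 <= score <= MAX_SCORE:
            raise ValueError(f"score out of range: {score}")
    return mean(scores) / MAX_SCORE


# Four ground-truth pitch points scored 5, 3, 0 and 4 average to 3.0.
print(pitch_reward([5, 3, 0, 4]))  # 0.6
```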

Quick Start

Install

prime env install sdr-arena
# or
pip install git+https://github.com/h4shk4t/sdr-arena-verifiers.git

Setup credentials

Create a .env file in the environment directory:

# Azure GPT-4o (judge model)
ENDPOINT=https://<resource>.cognitiveservices.azure.com
SUBSCRIPTION_KEY=<azure-api-key>
API_VERSION=2025-01-01-preview

# Brightdata SERP (web search)
BRIGHTDATA_API_TOKEN=<brightdata-token>
BRIGHTDATA_SERP_ZONE=serp_api_bdr
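
Before a long run, it can help to confirm that these variables are actually exported. The check below is a hypothetical helper, not something shipped with the environment, and it assumes all five variables listed above are required.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "ENDPOINT",
    "SUBSCRIPTION_KEY",
    "API_VERSION",
    "BRIGHTDATA_API_TOKEN",
    "BRIGHTDATA_SERP_ZONE",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
```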

Run evaluation

# Default (5 examples, 3 rollouts)
prime eval run sdr-arena

# With Azure GPT-4o as the policy model
set -a && source .env && set +a
prime eval run sdr-arena -m azure-gpt4o -n 45 -r 3 -s -v \
  -b "https://<resource>.cognitiveservices.azure.com/openai/deployments/gpt-4o?api-version=2025-01-01-preview" \
  -k SUBSCRIPTION_KEY

Azure note: Always pass both -b and -k explicitly; otherwise the prime bridge injects its own inference URL and API key.

Custom Agents (AgentArena)

Benchmark any Python agent (STORM, ODR, custom pipelines) against the same dataset and judge:

from sdr_arena import build_dataset, build_rubric, AgentArena

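async def my_agent(topic, search_fn, client=None, model="", info=None):
    # info holds the dataset's info.* fields (seller, customer, products, published_date)
    info = info or {}  # avoid a shared mutable default argument
    results = await search_fn([f"{info['seller']} {info['customer']} case study"])
    # ... your multi-step research pipeline ...
    return "pitch points string"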

def load_environment(**kwargs):
    return AgentArena(agent_fn=my_agent, dataset=build_dataset(), rubric=build_rubric())

[tool.verifiers.environments]
my-agent = "my_agent_module"

prime eval run my-agent -m azure-gpt4o -n 15 -r 3 -b "..." -k SUBSCRIPTION_KEY

The included storm-arena environment implements the full STORM pipeline (perspective generation, knowledge curation via parallel Q&A with web search, two-pass outline, pitch synthesis).
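
An agent with the signature shown earlier can be smoke-tested locally before it is registered. This sketch swaps the real web search for a stub; `stub_search`, its return type (a list of strings), and the sample `info` values are all assumptions made for illustration.

```python
import asyncio


async def my_agent(topic, search_fn, client=None, model="", info=None):
    """Toy agent: run one search and format the hits as pitch points."""
    info = info or {}
    query = f"{info.get('seller', '')} {info.get('customer', '')} case study".strip()
    results = await search_fn([query])
    return "\n".join(f"- {hit}" for hit in results)


async def stub_search(queries):
    # Hypothetical stand-in for the environment's search function:
    # takes a list of query strings, returns one string per query.
    return [f"stub result for: {query}" for query in queries]


pitch = asyncio.run(
    my_agent(
        "Pitch Acme's product to Globex",
        stub_search,
        info={"seller": "Acme", "customer": "Globex"},
    )
)
print(pitch)
```

Running the agent against a stub keeps the test free of API credentials and makes failures in the agent's own logic easier to isolate.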

Dataset

Field	Description
`prompt`	BDR task: seller, customer, products to pitch
`answer`	Ground-truth pitch points with evidence
`info.seller`	Seller company
`info.customer`	Target customer
`info.products`	Products being pitched
`info.published_date`	Story publication date (search cutoff)

The environment uses 7,283 matched samples derived from SDR-Bench. The original dataset contains 6,279 success stories; each matched sample pairs a prompt with the extracted pitch-point ground truth.

Environment Arguments

Arg	Default	Description
`websearch_concurrency`	`20`	Max parallel Brightdata SERP requests

Citation

@misc{sdrbench2026,
  author = {Srivastava, Ashutosh and Yedlapati, Siddharth and Aggarwal, Vinay and Dixit, Shashwat and Singla, Yaman Kumar},
  title = {SDR-Bench: Benchmarking the Personalization Capabilities of Large Language Models},
  year = {2026},
  publisher = {Behavior in the Wild},
  howpublished = {\url{https://behavior-in-the-wild.github.io/SDR-Bench.html}},
}

License

MIT