llm judging

SFT DatasetInstruction FollowingMulti Turn Dialog

Magpie

Magpie Align

A self-synthesis method (and family of datasets) that elicits high-quality instructions directly from an aligned LLM using only its chat template - no seed prompts required.

PreferenceInstruction FollowingSafetyMulti Turn Dialog

Nectar

Berkeley NEST

Berkeley NEST's seven-way ranked preference dataset built from GPT-4 rankings over responses from a diverse model pool, used to train Starling.

PreferenceInstruction FollowingHallucinationSafety

UltraFeedback

OpenBMB

OpenBMB's 64k-prompt preference dataset built with GPT-4 critiques across instruction-following, truthfulness, honesty, and helpfulness - the de facto open DPO baseline.

SFT DatasetMulti Turn DialogInstruction FollowingMultilingual

WildChat

Allen Institute for AI (Ai2)

1M real user-chatbot conversations collected by Allen AI / UW via a free GPT proxy - a window into how real users actually prompt LLMs.

DPO DatasetMulti Turn DialogInstruction FollowingScientific Reasoning

Argilla distilabel Capybara-DPO

Argilla

A high-quality DPO derivative of LDJnr's Capybara, with chosen/rejected pairs synthesized and rated using Argilla's distilabel pipeline.

SFT DatasetInstruction FollowingMulti Turn DialogCode Generation

OpenHermes 2.5

Teknium

Teknium's million-row aggregation of high-quality GPT-4-style synthetic instructions that became the de facto open SFT baseline of 2023-2024.

RL EnvMulti LingualReward BenchSafetySecurity

Reward Bench RL Env (Prime Intellect)

Prime Intellect

Evaluates pair-wise answers from RewardBench datasets

SFT DatasetMulti Turn DialogInstruction Following

UltraChat

OpenBMB

Tsinghua / OpenBMB's large-scale multi-turn dialog dataset generated by two LLMs talking to each other across structured topic taxonomies.

SFT DatasetInstruction FollowingMathCode Generation

WizardLM Evol-Instruct

Microsoft

Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.

RL EnvTool UseAgenticLawLegal

APEX Agents RL Env (Community)

APEX-Agents benchmark: 480 professional services tasks across Law, Investment Banking, and Management Consulting

FrameworkBenchmark Creation

BenchBuilder

LMArena

LMSYS's automated pipeline for distilling high-quality LLM benchmarks from crowdsourced chat data (e.g. Chatbot Arena, WildChat), producing the Arena-Hard-Auto benchmark.

SFT DatasetMulti Turn DialogInstruction FollowingScientific Reasoning

Capybara

LDJnr

LDJnr's multi-turn reasoning dataset built with the Amplify-Instruct synthesis method - short but deep conversations on a single topic.

SFT DatasetMulti Turn DialogInstruction Following

ShareGPT

Anonymous Community

The original community-scraped corpus of ChatGPT conversations that bootstrapped Vicuna and the entire open-instruction-tuning era.

SFT DatasetInstruction FollowingMathCode Generation

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.