instruction following

Category: language
Slug: instruction-following
Evals: 17
Tools: 47
Models: 233
Papers: 11

Evals testing this capability


ALFRED

University of California, Berkeley

3D-simulated household tasks driven by language instructions and egocentric video - the visual sibling of ALFWorld.

Active, Embodied, Image Understanding, Planning, Robotics

AlpacaEval 2.0 (Length-Controlled)

Stanford University

AlpacaEval with a length-controlled winrate estimator that neutralizes the judge's preference for longer answers.

Saturated, Instruction Following, LLM Judging

AlpacaEval

Stanford University

Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.

Saturated, Instruction Following, LLM Judging
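
As a rough illustration of the win-rate metric described above, the sketch below aggregates per-prompt judge verdicts into a single number. The verdict labels, the `win_rate` helper, and the choice to count a tie as half a win are assumptions made for this example, not the benchmark's actual implementation.

```python
from collections import Counter
from typing import Iterable


def win_rate(verdicts: Iterable[str]) -> float:
    """Fraction of comparisons the model wins against the baseline.

    Each verdict is 'win', 'loss', or 'tie' from the model's point of view.
    Ties are counted as half a win (an assumption for this sketch).
    """
    counts = Counter(verdicts)
    total = counts["win"] + counts["loss"] + counts["tie"]
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return (counts["win"] + 0.5 * counts["tie"]) / total


# Hypothetical judge verdicts for four prompts.
example = ["win", "win", "loss", "tie"]
rate = win_rate(example)
```

With two wins, one loss, and one tie, this gives (2 + 0.5) / 4 = 0.625.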

APEX

Mercor

Mercor's expert-graded eval - domain experts (doctors, lawyers, engineers) grade model responses on long-form professional tasks they would actually be paid to do.

Active, Instruction Following, Factual Recall, LLM Judging

Arena-Hard

LMArena

500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.

Active, Instruction Following, LLM Judging

Aya Evaluation Suite

Cohere

Cohere's multilingual eval bundle covering 101 languages - open-ended generations, discriminative tasks, and human-rated preference data.

Active, Multilingual, Instruction Following, Translation

BIG-Bench

Google DeepMind

204 diverse tasks contributed by 450 researchers at 132 institutions - the original "test everything" LLM benchmark.

Saturated, Factual Recall, Planning, Math

GDPval

OpenAI

OpenAI's economic-impact eval - 220 expert-curated tasks across 44 occupations drawn from the sectors that contribute most to US GDP, evaluating whether models can do real white-collar work.

Active, Factual Recall, Instruction Following, Planning

HELM (Holistic Evaluation of Language Models)

Stanford Center for Research on Foundation Models (CRFM)

Stanford CRFM's wide-coverage evaluation framework - dozens of scenarios scored on accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

Active, Factual Recall, Safety, Instruction Following

Anthropic HH-RLHF

Anthropic

Anthropic's helpful & harmless preference dataset - paired human-rated assistant responses widely used both as a preference-training corpus and as a reward-model benchmark.

Active, Safety, Harmful Content, Instruction Following

IFEval

Google DeepMind

About 500 prompts with verifiable instruction-following constraints (word counts, casing, JSON format) checked by deterministic rules - no LLM judge needed.

Active, Instruction Following
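
To make "checked by deterministic rules" concrete, here is a minimal sketch of the three constraint types named above (word count, casing, JSON format). The function names and the `follows_all` helper are hypothetical and written for this illustration; they are not taken from the IFEval codebase.

```python
import json
from typing import Callable, Iterable


def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_all_lowercase(response: str) -> bool:
    """True if the response contains no uppercase letters."""
    return response == response.lower()


def check_valid_json(response: str) -> bool:
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt counts as followed only if every attached check passes."""
    return all(check(response) for check in checks)


# Hypothetical prompt: "Answer in lowercase JSON using at most 5 words."
response = '{"answer": "paris"}'
ok = follows_all(
    response,
    [check_valid_json, check_all_lowercase, lambda r: check_max_words(r, 5)],
)
```

Because every check is a pure function of the response text, the same response always gets the same score, which is the property that removes the need for a judge model.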

LiveBench

Meta FAIR (Fundamental AI Research)

Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and data analysis.

Active, Math, Code Generation, Instruction Following

MT-Bench

LMArena

80 two-turn open-ended questions across 8 categories, graded by GPT-4 as judge to score multi-turn dialogue quality.

Saturated, Multi Turn Dialog, Instruction Following, LLM Judging

τ-bench (tau-bench)

Sierra

Multi-turn customer-service simulation testing whether tool-calling agents follow domain policies while interacting with an LLM-simulated user.

Active, Tool Calling, Multi Turn Dialog, Instruction Following, Agentic

TextArena

Laude Institute

70+ competitive text games (negotiation, social deduction, language puzzles) where models play head-to-head and a TrueSkill rating is fit.

Active, Multi Turn Dialog, Planning, Instruction Following, Agentic

WildBench

Allen Institute for AI (Ai2)

1,024 real, long, challenging user prompts mined from WildChat conversations and judged by GPT-4 with rubric-based pairwise scoring.

Active, Instruction Following, LLM Judging, Multi Turn Dialog

XSTest

University of Oxford

WizardLM Evol-Instruct

Microsoft

Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.

SFT Dataset, Instruction Following, Math, Code Generation

lifts 2 evals here

ALFWorld

MIT CSAIL

Aligned text-and-3D embodied environment - agents learn household tasks (pick & place, heat, cool, clean) as both TextWorld games and visually-rendered ALFRED scenes.

Tools lifting evals here

Magpie

Nectar

UltraFeedback

WildChat

Argilla distilabel Capybara-DPO

HelpSteer2

OpenHermes 2.5

Tülu 3 SFT Mixture

UltraChat

WizardLM Evol-Instruct

ALFWorld

ALL SSAC RL Env (Community)

Allenai Ifeval RL Env (Dev Team)

Allenai Ifeval RL Env (Prime Intellect)

APEX Agents RL Env (Community)

Aya Dataset

Backdoor Ifeval RL Env (Community)

Backdoor Ifeval RL Env (Prime)

Bench ENV RL Env (Prime Community)

BenchBuilder

Bigbench BBH RL Env (Prime Community)

Capybara

Gdpval RL Env (Community)

Goldilocks Ifeval RL Env (Community)

Hangman RL Env (Community)

HARD Wordle RL Env (Community)

Hurdle Wordle RL Env (Community)

Ifeval ALL RL Env (Prime)

Ifeval Confusables RL Env (Community)

Ifeval Goblin RL Env (Goblintron)

Ifeval Groups RL Env (Prime)

Ifeval INOC RL Env (Prime)

Ifeval MINI RL Env (Community)

Ifeval RL Env (Arcee AI)

Ifeval RL Env (Community)

Ifeval RL Env (Prime Intellect)

Ifeval Vigilant RL Env (Community)

Indic Ifeval RL Env (Community)

Openenv Textarena RL Env (Hugging Face)

Openenv Textarena RL Env (Prime Intellect)

Phase LAW RL Env (Community)

PKU-SafeRLHF

ShareGPT

Sudoku RL Env (Community)

TAU 3 Bench RL Env (Prime Intellect)

Wordle RL Env (Community)

Wordle RL Env (Prime Intellect)
