What capabilities does IFEval test?

IFEval evaluates instruction following.

What is the current top score on IFEval?

The top reported score is 90.0% by Llama 3.3 Instruct 70B, across 81 models reporting (8 from frontier labs).

How can a model improve its IFEval score?

Sophon links RL environments, datasets, and scaffolds that target this eval, including Indic Ifeval RL Env (Community), Ifeval RL Env (Arcee AI), Goldilocks Ifeval RL Env (Community), and Allenai Ifeval RL Env (Dev Team).

What license is IFEval under?

IFEval is available under Apache-2.0.

IFEval

Frontier

Roughly 500 prompts (541 tasks in the released set) with verifiable instruction-following constraints such as word counts, casing, and JSON format, checked by deterministic rules, so no LLM judge is needed.
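As a rough illustration of what a deterministic rule check can look like, the Python sketch below implements three checks of the kinds named above (word count, casing, JSON format). The function names and thresholds are hypothetical and are not taken from the official IFEval code.

```python
import json
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Pass when the response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_all_lowercase(response: str) -> bool:
    """Pass when the response has cased characters and none of them is uppercase."""
    return response.islower()


def check_valid_json(response: str) -> bool:
    """Pass when the whole response parses as JSON."""
    try:
        json.loads(response.strip())
    except json.JSONDecodeError:
        return False
    return True


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A prompt counts as followed only if every attached check passes."""
    return all(check(response) for check in checks)
```

Because each check is a pure function of the response text, the same response always gets the same verdict, which is what removes the need for an LLM judge.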

Open

Publisher: Google DeepMind
Capabilities: Instruction Following
Format: Custom
Size: 541 tasks
License: Apache-2.0
Published: Nov 2023
Notable for: Benchmark for evaluating instruction following.
Canonical: github.com/google-research/google-research/tree/master/instruction_following_eval
Also on: huggingface.co/datasets/google/IFEval

Attribution

Leaderboard scores: OpenLLM prime-hub

Top score 90.0% by Llama 3.3 Instruct 70B - 81 models reporting (8 frontier)
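This page does not state which IFEval metric the percentage refers to. IFEval results are usually reported either as prompt-level accuracy (every instruction in a prompt is followed) or as instruction-level accuracy (the share of individual instructions followed). A minimal sketch of both aggregations, assuming per-instruction pass/fail results are already available; the function names and sample data are illustrative only:

```python
from typing import List


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed."""
    return sum(all(per_prompt) for per_prompt in results) / len(results)


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    flat = [ok for per_prompt in results for ok in per_prompt]
    return sum(flat) / len(flat)


# Hypothetical outcomes: one inner list per prompt, one bool per instruction.
sample = [[True, True], [True, False], [True]]
prompt_acc = prompt_level_accuracy(sample)        # 2 of 3 prompts fully followed
instr_acc = instruction_level_accuracy(sample)    # 4 of 5 instructions followed
```

The two numbers can differ noticeably for the same model, so scores are comparable only when they use the same aggregation.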

Top models

[Bar chart: IFEval scores for 21 models. Highest value: Llama 3.3 Instruct 70B at 90.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Indic Ifeval RL Env (Community)

IndicIFEval: Instruction-following eval for 14 Indic languages

Implementation, RL Env, IndicIfeval, Instruction Following

Ifeval RL Env (Arcee AI)

Arcee AI

IFEval single-turn chat environment using RLVR-IFeval with JSON constraint rewards. Heavily inspired by and incorporates a lot of Allen AI's RLVR c...

Implementation, RL Env, Ifeval, Constraints, None Reasoning

Goldilocks Ifeval RL Env (Community)

FIXED: Adaptive controller for reward hacking. Monitors visible delta AND hidden reward. Adapts check count 7→9. Original was bugged (blind to hidd...

Implementation, RL Env, Reward Hacking, V1, Ifeval

Allenai Ifeval RL Env (Dev Team)

Dev Team

IFEval single-turn environment using AllenAI RLVR-IFeval

Implementation, RL Env, Ifeval, Constraints

Ifeval RL Env (Community)

IFEval instruction following environment for Verifiers

Implementation, RL Env, Instruction Following, Constraints, Ifeval

Backdoor Ifeval RL Env (Prime)

Prime

Reward hacking with deterministic IF constraints

Implementation, RL Env, Reward Hacking, Backdoor, Instruction Following

Allenai Ifeval RL Env (Prime Intellect)

Prime Intellect

IFEval single-turn environment using AllenAI RLVR-IFeval

Implementation, RL Env, Ifeval, Constraints

Ifeval RL Env (Prime Intellect)

Prime Intellect

IFEval evaluation environment

Implementation, RL Env, Ifeval

Backdoor Ifeval RL Env (Community)

Blog-grounded Backdoor IFEval reward-hacking environment with hidden silver reward.

Implementation, RL Env, Reward Hacking, Backdoor, Ifeval

ALL SSAC RL Env (Community)

Unified backdoor-ifeval env plus SSAC/GDPO custom advantage helpers

Trains toward, RL Env, Reward Hacking, Backdoor, Instruction Following

Phase LAW RL Env (Community)

Backdoor-IFEval phase-transition lab: advantage-geometry metrics, quadrant fractions, and boundary-shifting interventions for reward-hacking law te...

Trains toward, RL Env, Reward Hacking, Phase Diagram, Backdoor Ifeval

Ifeval Vigilant RL Env (Community)

Variance-based early-warning circuit breaker for reward hacking. Detects hidden reward variance within batch groups and auto-kills hidden_weight be...

Trains toward, RL Env, Reward Hacking, Vigilance, Backdoor Ifeval

Ifeval Goblin RL Env (Goblintron)

Goblintron

Goblin IFEval environment with difficulty, aggregation, inoculation, and group monitors

Trains toward, RL Env, Reward Hacking, Instruction Following

Ifeval Confusables RL Env (Community)

IFEval, but the inputs are adversarially augmented with unicode confusables and typos.

Trains toward, RL Env, Ifeval, Constraints, Adversarial Robustness
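To make the idea of confusable augmentation concrete, here is a small Python sketch that swaps some Latin letters for visually similar Cyrillic ones. It is not the environment's actual code: the mapping table, the `rate` and `seed` parameters, and the function name are all assumptions chosen for illustration.

```python
import random

# Hypothetical, deliberately tiny mapping from Latin letters to Cyrillic lookalikes.
CONFUSABLES = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}


def add_confusables(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace each mappable character with its lookalike with probability `rate`."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    out = []
    for ch in text:
        if ch in CONFUSABLES and rng.random() < rate:
            out.append(CONFUSABLES[ch])
        else:
            out.append(ch)
    return "".join(out)
```

The augmented prompt looks nearly identical to a human reader but tokenizes differently, which is what makes it a robustness test for instruction following.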

Ifeval ALL RL Env (Prime)

Prime

Unified backdoor-ifeval env: difficulty, aggregation, no-v check, inoculation, group monitors

Trains toward, RL Env, Reward Hacking, Backdoor, Instruction Following

Ifeval Groups RL Env (Prime)

Prime

Backdoor-ifeval env with group-level reward monitors for within-batch advantage variance

Trains toward, RL Env, Reward Hacking, Backdoor, Instruction Following
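One way a group-level monitor for within-batch advantage variance might work is sketched below in Python. It assumes that a batch is split into groups of rollouts sharing a prompt and that advantages are rewards centered on the group mean, as in group-relative policy optimization methods. Both assumptions, and every name in the code, are illustrative and not taken from the environment itself.

```python
from statistics import mean, pvariance
from typing import Dict, List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one group of rollouts (an assumed definition)."""
    group_mean = mean(rewards)
    return [r - group_mean for r in rewards]


def advantage_variance_by_group(batch: Dict[str, List[float]]) -> Dict[str, float]:
    """Population variance of the advantages inside each group of the batch."""
    return {
        group_id: pvariance(group_advantages(rewards))
        for group_id, rewards in batch.items()
    }


# Hypothetical batch: group id -> rewards of the rollouts sampled for that prompt.
batch = {"prompt-1": [1.0, 1.0, 1.0], "prompt-2": [0.0, 1.0]}
variances = advantage_variance_by_group(batch)
```

A group whose variance is zero contributes no learning signal under mean-centering, while a sudden change in variance across groups is the kind of pattern such a monitor could flag for inspection.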

Ifeval INOC RL Env (Prime)

Prime

Backdoor-ifeval env for inoculation experiments (pre-no-v version)

Trains toward, RL Env, Reward Hacking, Backdoor, Instruction Following

Ifeval MINI RL Env (Community)

Reward hacking sprint calibration environment for hidden keyword gradients in instruction following.

Trains toward, RL Env, Reward Hacking, Ifeval

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.

Training data, SFT Dataset, Instruction Following, Math, Code Generation
