
IFEval

Frontier

541 prompts with verifiable instruction-following constraints (word counts, casing, JSON format), checked by deterministic rules, so no LLM judge is needed.
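As a minimal sketch of what "checked by deterministic rules" can look like, the Python below implements three checkers matching the constraint types named above. The function names and exact rules are hypothetical illustrations, not IFEval's actual checker code.

```python
import json


def check_max_words(response: str, limit: int) -> bool:
    """Pass if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_all_lowercase(response: str) -> bool:
    """Pass if the response has at least one letter and no uppercase letters."""
    has_letter = any(c.isalpha() for c in response)
    return has_letter and response == response.lower()


def check_valid_json(response: str) -> bool:
    """Pass if the whole response parses as JSON."""
    try:
        json.loads(response.strip())
        return True
    except json.JSONDecodeError:
        return False


def follows_all(response: str, checks) -> bool:
    """Pass only if every checker accepts the response."""
    return all(check(response) for check in checks)
```

Because each checker is a pure function of the response text, the same response always receives the same verdict, which is what removes the need for a judge model.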

Format: Custom
Size: 541 tasks
License: Apache-2.0
Published: Nov 2023
Notable for: Benchmark for evaluating instruction following.


Attribution

Leaderboard scores: OpenLLM, prime-hub

Top score 90.0% by Llama 3.3 Instruct 70B - 81 models reporting (8 frontier)

Score history

[Chart: IFEval score history. Y-axis: 0% to 100%. X-axis: Feb 23 to Oct 25. Labeled models: Llama 65B, DeepSeek LLM 67B Chat, Meta-Llama-3-8B-Instruct, Llama 3.1 70B Instruct, Llama 3.3 Instruct 70B.]

Top models

[Bar chart: IFEval top models, showing 21 of 81 models. Highest value: Llama 3.3 Instruct 70B at 90.]

Related tools

19 tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is IFEval?
541 prompts with verifiable instruction-following constraints (word counts, casing, JSON format), checked by deterministic rules, so no LLM judge is needed.
What capabilities does IFEval test?
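IFEval tests instruction following: whether a model's response satisfies explicit, checkable constraints such as word counts, casing, and JSON format.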
What is the current top score on IFEval?
The top reported score is 90.0% by Llama 3.3 Instruct 70B, across 81 models reporting (8 from frontier labs).
How can a model improve its IFEval score?
Tools linked to IFEval on Sophon include Indic Ifeval RL Env (Community), Ifeval RL Env (Arcee AI), Goldilocks Ifeval RL Env (Community), and Allenai Ifeval RL Env (Dev Team). These are among the RL environments, datasets, and scaffolds that target this eval.
What license is IFEval under?
IFEval is available under Apache-2.0.
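
The percentage scores quoted above could be derived from per-prompt check results along the following lines. This page does not state how scores are aggregated, so the rule used here (a prompt counts as passed only when every one of its constraints holds) and the function name `accuracy_percent` are assumptions made for illustration.

```python
def accuracy_percent(results: list[list[bool]]) -> float:
    """Return the share of prompts, in percent, whose checks all passed.

    results[i] holds one boolean per constraint attached to prompt i.
    Assumption: a prompt passes only if all of its constraints pass.
    """
    if not results:
        raise ValueError("results must contain at least one prompt")
    passed = sum(1 for checks in results if all(checks))
    return 100.0 * passed / len(results)
```

For example, four prompts where two satisfy all of their constraints would score 50.0 under this rule.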