IFEval
Frontier
541 prompts with verifiable instruction-following constraints (word counts, casing, JSON format), checked by deterministic rules - no LLM judge needed.
- Publisher
- Google DeepMind
- Capabilities
- Instruction Following
- Format
- Custom
- Size
- 541 tasks
- License
- Apache-2.0
- Published
- Nov 2023
- Notable for
- Benchmark for evaluating instruction following.
Top score 90.0% by Llama 3.3 Instruct 70B - 81 models reporting (8 frontier)
Score history (45)
Top models (81)
Related tools (19)
Implementations, trainers, datasets and scaffolds linked to this eval.
FAQ
- What is IFEval?
- IFEval is a set of 541 prompts with verifiable instruction-following constraints (word counts, casing, JSON format), checked by deterministic rules - no LLM judge needed.
- What capabilities does IFEval test?
- IFEval evaluates instruction following.
- What is the current top score on IFEval?
- The top reported score is 90.0% by Llama 3.3 Instruct 70B, across 81 models reporting (8 from frontier labs).
- How can a model improve its IFEval score?
- Sophon links RL environments, datasets, and scaffolds that target this eval, including Indic Ifeval RL Env (Community), Ifeval RL Env (Arcee AI), Goldilocks Ifeval RL Env (Community), and Allenai Ifeval RL Env (Dev Team).
- What license is IFEval under?
- IFEval is available under Apache-2.0.
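
To illustrate what "checked by deterministic rules" can look like in practice, here is a minimal Python sketch of three rule-based checks matching the constraint types named above (word counts, casing, JSON format). The function names and the exact rules are assumptions made for illustration; this is not the official IFEval checker code.

```python
import json
from typing import Callable, List


def check_min_words(response: str, min_words: int) -> bool:
    """Hypothetical rule: the response has at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_all_lowercase(response: str) -> bool:
    """Hypothetical rule: the response has letters and none of them are uppercase."""
    has_letters = any(ch.isalpha() for ch in response)
    return has_letters and response == response.lower()


def check_valid_json(response: str) -> bool:
    """Hypothetical rule: the whole response parses as JSON."""
    try:
        json.loads(response.strip())
        return True
    except json.JSONDecodeError:
        return False


def follows_all(response: str, checks: List[Callable[[str], bool]]) -> bool:
    """A response passes a prompt only if every attached check passes."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    reply = '{"answer": "paris is the capital of france"}'
    checks = [
        check_valid_json,
        check_all_lowercase,
        lambda r: check_min_words(r, 5),
    ]
    print(follows_all(reply, checks))
```

Because each check is an ordinary function of the response string, the same response always gets the same verdict, which is what removes the need for an LLM judge.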

