Aider Polyglot Benchmark
Frontier
225 Exercism coding exercises across six programming languages, run through the Aider CLI to measure real-world code-editing agent performance.
- Publisher: Aider
- Capabilities: Code Editing, Code Generation
- Domain: code
- Format: Custom
- Size: 225 tasks
- License: Apache-2.0
- Published: Dec 2024
- Updates: Weekly
- Notable for: The most cited public leaderboard specifically for code-editing capability (vs synthesis-only HumanEval-style benches).
- Canonical: aider.chat/docs/leaderboards
- Official leaderboard: aider.chat/docs/leaderboards
Top score: 88.0% by GPT-5; 37 models reporting (25 frontier).
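To make the headline number concrete, the sketch below converts between a percentage score and a count of solved tasks. It assumes the reported score is simply the fraction of the 225 tasks solved, rounded to one decimal place; the function names are hypothetical and are not part of Aider or its benchmark harness.

```python
TOTAL_TASKS = 225  # benchmark size stated on this page


def tasks_solved(score_percent: float, total_tasks: int = TOTAL_TASKS) -> int:
    """Approximate number of solved tasks implied by a percentage score.

    Assumption: the score is solved / total, expressed as a percentage.
    """
    return round(score_percent / 100 * total_tasks)


def score_percent(solved: int, total_tasks: int = TOTAL_TASKS) -> float:
    """Percentage score for a given number of solved tasks, to one decimal."""
    return round(100 * solved / total_tasks, 1)


if __name__ == "__main__":
    solved = tasks_solved(88.0)
    print(solved)                 # tasks implied by the top score
    print(score_percent(solved))  # converts back to the reported percentage
```

Under that assumption, 88.0% corresponds to 198 of 225 tasks, and each additional task moves the score by a little under half a percentage point.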
FAQ
- What is Aider Polyglot Benchmark?
- Aider Polyglot Benchmark is a set of 225 Exercism coding exercises across six programming languages, run through the Aider CLI to measure real-world code-editing agent performance.
- What capabilities does Aider Polyglot Benchmark test?
- Aider Polyglot Benchmark evaluates code editing and code generation.
- What is the current top score on Aider Polyglot Benchmark?
- The top reported score is 88.0% by GPT-5, across 37 models reporting (25 from frontier labs).
- How can a model improve its Aider Polyglot Benchmark score?
- Tools linked to Aider Polyglot Benchmark on Sophon include Aiderpolyglot RL Env (Community), Aider Polyglot RL Env (Prime Community), and Aiderpolyglot RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
- What license is Aider Polyglot Benchmark under?
- Aider Polyglot Benchmark is available under Apache-2.0.
