Aider Polyglot Benchmark

Frontier

225 Exercism coding exercises across six programming languages, run through the Aider CLI to measure real-world code-editing agent performance.

Open
Publisher: Aider
Domain: code
Format: Custom
Size: 225 tasks
License: Apache-2.0
Published: Dec 2024
Updates: Weekly
Notable for: The most cited public leaderboard specifically for code-editing capability (vs. synthesis-only HumanEval-style benchmarks).
Official leaderboard: aider.chat/docs/leaderboards

Attribution

Leaderboard scores: Aider, prime-hub

Top score: 88.0% by GPT-5, with 37 models reporting (25 frontier).
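To make the percentage concrete, here is a minimal sketch that converts between a score and a task count. It assumes the score is simply the share of the 225 exercises solved, which this page does not state explicitly; the function names are hypothetical and are not part of Aider.

```python
TOTAL_TASKS = 225  # benchmark size as listed on this page


def tasks_solved(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks solved for a given percentage score.

    Assumption: score = solved / total, with no weighting or partial credit.
    """
    return round(score_percent / 100 * total)


def score_from_tasks(solved: int, total: int = TOTAL_TASKS) -> float:
    """Percentage score (one decimal place) for a given number of solved tasks."""
    return round(100 * solved / total, 1)


if __name__ == "__main__":
    print(tasks_solved(88.0))     # solved-task count implied by the top score
    print(score_from_tasks(198))  # the reverse conversion
```

Under that assumption, a reported 88.0% would correspond to 198 of the 225 exercises.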

Score history

[Chart: score (0% to 100%) over time, Jul 24 to Nov 25. Labeled models: GPT-4o-mini, o1 Mini, o1, R1, o3, Gemini 2.5 Pro, GPT-5.]

Top models (37)

[Bar chart with 21 bars (21 models). Highest value: GPT-5 at 88.]

Where it's ranked (1)

Related tools (3)

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers (2)

Contributors (1)

FAQ

What is Aider Polyglot Benchmark?
225 Exercism coding exercises across six programming languages, run through the Aider CLI to measure real-world code-editing agent performance.
What capabilities does Aider Polyglot Benchmark test?
Aider Polyglot Benchmark evaluates code editing and code generation.
What is the current top score on Aider Polyglot Benchmark?
The top reported score is 88.0% by GPT-5, across 37 models reporting (25 from frontier labs).
How can a model improve its Aider Polyglot Benchmark score?
Tools linked to Aider Polyglot Benchmark on Sophon include Aiderpolyglot RL Env (Community), Aider Polyglot RL Env (Prime Community), and Aiderpolyglot RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
What license is Aider Polyglot Benchmark under?
Aider Polyglot Benchmark is available under Apache-2.0.