Sophon

Catalog of AI evals, the tools that lift them, and the labs behind them.

Frontier right now

Top models on Artificial Analysis - Intelligence Index

Artificial Analysis - Intelligence Index: bar chart with 20 bars. Highest value: Claude Fable 5 (batch) at 59.9.

20 models

Leaderboards

Current standings across rating systems

Arena - Text Style Control (LMArena)
#1 Claude Opus 4.6 (batch) · 1550

Arena - Text (LMArena)
#1 Gemini 3.1 Pro · 1536

Arena Vision (LMArena)
#1 Claude Opus 4.7 (batch) · 1349

Arena - Vision Style Control (LMArena)
#1 Claude Opus 4.7 (batch) · 1327

Arena - Webdev (LMArena)
#1 Claude Opus 4.7 (batch) · 1556

Arena - Text to Image (LMArena)
#1 gpt-image-2 (medium) · 1360

What lifts scores most

Tools with the most known eval-lift evidence

VF Openbench RL Env (Community) · lifts 10
Agent Bench RL Env (Prime Community) · Prime Community · lifts 6
BrowserGym · ServiceNow Research · lifts 5
OpenThoughts · Open Thoughts · lifts 4
Tülu 3 SFT Mixture · Allen Institute for AI (Ai2) · lifts 4
WizardLM Evol-Instruct · Microsoft · lifts 4

Browse by capability

Pick a capability to rank the tools that train toward it

planning · 29
factual recall · 19
tool calling · 18
instruction following · 17
math · 12
code generation · 10
llm judging · 9
scientific reasoning · 9
safety · 8
code editing · 7

Closest to saturation

Benchmarks where the top score approaches the ceiling

Mostly Basic Python Problems (MBPP) · 15 models · 100.0%
MATH-500 · 189 models · 99.2%
τ²-bench (Tau²-bench) · 321 models · 99.1%
AIME 2025: Problems from the American Invitational Mathematics Examination · 204 models · 98.7%
AIME 2024: Problems from the American Invitational Mathematics Examination · 169 models · 96.7%
GPQA Diamond · 466 models · 94.1%
LiveBench - Math · 57 models · 93.9%
GPQA (Full Set) · 10 models