Sophon
Catalog of AI evals, the tools that lift them, and the labs behind them.
Start here
Leaderboards
Current standings across rating systems
Arena - Text Style Control · LMArena
#1 Claude Opus 4.6 · 1550
Arena - Text · LMArena
#1 Gemini 3.1 Pro · 1536
Arena - Vision · LMArena
#1 Claude Opus 4.7 · 1349
Arena - Vision Style Control · LMArena
#1 Claude Opus 4.7 · 1327
Arena - Webdev · LMArena
#1 Claude Opus 4.7 · 1556
Arena - Text to Image · LMArena
#1 gpt-image-2 (medium) · 1360
What lifts scores most
Tools with the most known eval-lift evidence
Browse by capability
Pick a capability to rank the tools that train toward it
Closest to saturation
Benchmarks where the top score approaches the ceiling
Mostly Basic Python Problems (MBPP) · 15 models · 100.0%
MATH-500 · 178 models · 99.2%
τ²-bench (Tau²-bench) · 330 models · 99.1%
AIME 2025: Problems from the American Invitational Mathematics Examination · 207 models · 98.7%
AIME 2024: Problems from the American Invitational Mathematics Examination · 172 models · 96.7%
GPQA Diamond · 459 models · 94.1%
LiveBench - Math · 54 models · 93.9%
GPQA (Full Set) · 10 models · 93.2%
