
SWE-bench Multimodal

Frontier

A multimodal extension of SWE-bench: software-engineering tasks that include visual context (screenshots, diagrams).

Published
May 2026


Attribution

Leaderboard scores: SWE-bench

Top score: 36.0% by o3. 7 models reporting (7 frontier).

Score history

[Chart: score history, Aug 24 to Apr 25; y-axis 25% to 100%; labeled models: GPT-4o (2024-08-06), Claude Sonnet 3.7, o3]

Top models

[Bar chart: SWE-bench Multimodal scores for 7 models. Highest value: o3 at 36.]

Related tools


Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is SWE-bench Multimodal?
It is a multimodal extension of SWE-bench: software-engineering tasks that include visual context (screenshots, diagrams).
What is the current top score on SWE-bench Multimodal?
The top reported score is 36.0% by o3, across 7 models reporting (7 from frontier labs).
How can a model improve its SWE-bench Multimodal score?
Tools linked to SWE-bench Multimodal on Sophon include Agent Bench RL Env (Prime Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.