Attribution
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
arXiv 2025
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
from 2 papers
Adam Khoja
Alice Gatti
researcher
Andrew Park
Arunim Agarwal
Bing Liu
Charles Ide
Chen Bo Calvin Zhang
Chetan Rane
Cristina Menghini
Dan Hendrycks
director