llm judging

Slug: llm-judging
Evals: 9
Tools: 15
Models: 65
Papers: 6

Evals testing this capability

9

Tools lifting evals here

15

Top models on this capability

65

by avg parsed score across evals here

[Bar chart: llm judging, 21 bars (21 models). Highest value: o3 at 88.8.]

Papers in this area

6