LLM judging
- Slug: llm-judging
- Evals: 9
- Tools: 15
- Models: 65
- Papers: 6
Evals testing this capability: 9
Tools lifting evals here: 15
Top models on this capability: 65 (by avg parsed score across evals here)
Papers in this area: 6
- introduces: Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
- introduces: APEX: An Expert-Authored Benchmark for Real-World Expert Workflows
- introduces: From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- introduces: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- introduces: Let's Verify Step by Step
- introduces: RewardBench 2: Advancing Reward Model Evaluation
