Evals
The tests themselves. Each eval is a single benchmark with a defined task and dataset: it is what models are actually measured on. One eval can be tracked on many leaderboards.