tool calling
- Slug: tool-calling
- Evals: 18
- Tools: 32
- Models: 332
- Papers: 13
Evals testing this capability: 18
Tools lifting evals here: 32
Top models on this capability: 332 (by avg parsed score across evals here)
Papers in this area: 13
- ALFWorld: Aligning Text and Embodied Environments for Interactive Learning (introduces)
- GAIA: A Benchmark for General AI Assistants (introduces)
- Measuring AI Ability to Complete Long Tasks (introduces)
- OSWorld-Verified: A Cleaner, More Reliable Computer-Use Benchmark (introduces)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (introduces)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (introduces)
- SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers (introduces)
- SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (introduces)
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (introduces)
- Terminal-Bench: A Benchmark for Real-World Terminal-Based Agents (introduces)
- Introducing Terminal-Bench
- WebArena: A Realistic Web Environment for Building Autonomous Agents (introduces)
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? (introduces)



