planning
- Slug: planning
- Evals: 29
- Tools: 91
- Models: 466
- Papers: 24
Evals testing this capability: 29
Tools lifting evals here: 91
Top models on this capability: 466 (by average parsed score across evals here)
Papers in this area: 24
- introduces: AIME as an LLM Evaluation Benchmark
- introduces: ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
- introduces: The Arcade Learning Environment: An Evaluation Platform for General Agents
- introduces: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
- introduces: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
- introduces: DeepMind Control Suite
- introduces: GAIA: A Benchmark for General AI Assistants
- introduces: GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
- introduces: Training Verifiers to Solve Math Word Problems
- introduces: Measuring AI Ability to Complete Long Tasks
- introduces: Measuring Mathematical Problem Solving With the MATH Dataset
- introduces: OSWorld-Verified: A Cleaner, More Reliable Computer-Use Benchmark
- introduces: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- introduces: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- introduces: SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers
- introduces: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
- introduces: τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- introduces: Terminal-Bench: A Benchmark for Real-World Terminal-Based Agents
- Introducing Terminal-Bench
- introduces: TextArena: Multi-Agent Text-Based Games for LLM Evaluation
- introduces: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- introduces: WebArena: A Realistic Web Environment for Building Autonomous Agents
- introduces: WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?



