MuSR
Frontier
756 multi-step soft-reasoning problems - murder mysteries, object placement, team allocation - generated to require chained commonsense inference.
- Publisher
- University of Texas at Austin
- Capabilities
- Planning, Factual Recall
- Format
- HF Dataset
- Size
- 756 tasks
- License
- MIT
- Published
- Oct 2023
- Notable for
- Benchmark for evaluating planning and factual recall.
- Canonical
- github.com/Zayne-sprague/MuSR
Top score 53.7% by DeepSeek R1 Distill Qwen 14B - 78 models reporting (6 frontier)
Score history (42)
Top models (78)
Related tools (1)
Implementations, trainers, datasets and scaffolds linked to this eval.
FAQ
- What is MuSR?
- 756 multi-step soft-reasoning problems - murder mysteries, object placement, team allocation - generated to require chained commonsense inference.
- What capabilities does MuSR test?
- MuSR evaluates planning and factual recall.
- What is the current top score on MuSR?
- The top reported score is 53.7% by DeepSeek R1 Distill Qwen 14B, across 78 models reporting (6 from frontier labs).
- How can a model improve its MuSR score?
- Tools linked to MuSR on Sophon include VF Openbench RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.
- What license is MuSR under?
- MuSR is available under the MIT license.
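
The page reports scores as percentages but does not say how they are computed. The following is a minimal sketch, assuming each task is multiple-choice and a score is plain accuracy over all tasks. The function name, the record fields (`category`, `prediction`, `answer`), and the category labels are hypothetical, and the toy records are not real MuSR data.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (per-category accuracy, overall accuracy), both as percentages.

    Assumes each record is a dict with hypothetical keys
    'category', 'prediction', and 'answer'.
    """
    if not records:
        raise ValueError("records must not be empty")

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec["category"]] += 1

    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    # Overall score pools every task, so larger categories weigh more.
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall


# Toy records for illustration only (not real MuSR data).
records = [
    {"category": "murder_mysteries", "prediction": 0, "answer": 0},
    {"category": "murder_mysteries", "prediction": 1, "answer": 0},
    {"category": "object_placements", "prediction": 2, "answer": 2},
    {"category": "team_allocation", "prediction": 1, "answer": 1},
]
per_category, overall = accuracy_by_category(records)
print(per_category, overall)
```

Pooling all tasks is only one possible aggregation. A leaderboard could instead average the three category accuracies equally, which gives a different number whenever the categories differ in size.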

