MuSR

Frontier

756 multi-step soft-reasoning problems - murder mysteries, object placement, team allocation - generated to require chained commonsense inference.

Format: HF Dataset
Size: 756 tasks
License: MIT
Published: Oct 2023
Notable for: Benchmark for evaluating planning and factual recall.

Attribution

Leaderboard scores: OpenLLM

Top score: 53.7% by DeepSeek R1 Distill Qwen 14B, with 78 models reporting (6 from frontier labs).

Score history

[Chart: score over time. Y-axis: 25% to 100%. X-axis: Feb 23 to Feb 25. Labeled models: Llama 65B, DeepSeek LLM 67B Chat, DeepSeek R1 Distill Qwen 14B.]

Top models

[Bar chart: MuSR scores for 21 models (78 models reporting). Highest value: DeepSeek R1 Distill Qwen 14B at 53.7.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is MuSR?
756 multi-step soft-reasoning problems - murder mysteries, object placement, team allocation - generated to require chained commonsense inference.
What capabilities does MuSR test?
MuSR evaluates planning and factual recall.
What is the current top score on MuSR?
The top reported score is 53.7% by DeepSeek R1 Distill Qwen 14B, across 78 models reporting (6 from frontier labs).
How can a model improve its MuSR score?
Tools linked to MuSR on Sophon include VF Openbench RL Env (Community). Linked tools cover RL environments, datasets, and scaffolds that target this eval.
What license is MuSR under?
MuSR is available under the MIT license.