Question 1

What is MuSR?

Accepted Answer

MuSR (Multistep Soft Reasoning) is a benchmark of 756 multi-step soft-reasoning problems spanning murder mysteries, object placement, and team allocation, generated to require chained commonsense inference.

Question 2

What capabilities does MuSR test?

Accepted Answer

MuSR evaluates planning and factual recall.

Question 3

What is the current top score on MuSR?

Accepted Answer

The top reported score is 53.7% by DeepSeek R1 Distill Qwen 14B, across 78 models reporting (6 from frontier labs).
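For readers wondering how a percentage like this is typically produced, here is a minimal sketch. It assumes the score is plain accuracy over multiple-choice answers graded by exact match, which this page does not confirm (some leaderboards report a normalized variant instead). The function name `accuracy_by_category`, the category labels, and the toy data are all hypothetical and are not real MuSR results.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Compute per-category and overall accuracy, in percent.

    `results` is an iterable of (category, predicted, gold) tuples.
    Assumption: an answer counts as correct only on exact match.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, predicted, gold in results:
        totals[category] += 1
        if predicted == gold:
            correct[category] += 1
    if not totals:
        raise ValueError("results must contain at least one item")
    per_category = {c: 100.0 * correct[c] / totals[c] for c in totals}
    overall = 100.0 * sum(correct.values()) / sum(totals.values())
    return per_category, overall


# Made-up toy data for illustration only; not actual benchmark output.
toy_results = [
    ("murder_mysteries", "A", "A"),
    ("murder_mysteries", "B", "A"),
    ("object_placement", "C", "C"),
    ("team_allocation", "B", "B"),
]

per_category, overall = accuracy_by_category(toy_results)
print(per_category)
print(overall)
```

Reporting the per-category breakdown alongside the overall figure can show whether a model is weak on one problem type in particular, which a single headline number hides.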

Question 4

How can a model improve its MuSR score?

Accepted Answer

Tools linked to MuSR on Sophon include VF Openbench RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.

Question 5

What license is MuSR under?

Accepted Answer

MuSR is available under the MIT License.
