MASPRM: Multi-Agent System Process Reward Model

Inference-time search over multi-agent systems (MAS) wastes compute when it cannot identify which agent's intermediate message moved the solution forward. We present the Multi-Agent System Process Reward Model (MASPRM), which scores routed transcripts (ordered sequences of messages between agents) and acts as an inference-time controller for step-level beam search (SBS) and Monte Carlo Tree Search (MCTS). MASPRM is trained on multi-agent MCTS rollouts labeled only with terminal outcome rewards, without human step-level annotations. We evaluate on GSM8K, MATH, MMLU, and LogiQA. With matched scorer size and a comparable MCTS budget, MASPRM exceeds a size-matched outcome reward model (ORM) by +2.0 to +3.0 points at 1.5B and by +4.1 to +14.5 points at 7B across all four benchmarks, with additional scorer-scaling gains over policy likelihood at 7B (average +13.4 under MCTS). MASPRM also improves ranking quality, narrowing the gap between Hit@1 and Hit@5 by up to 10.3 points, with the largest gains under stepwise search that uses intermediate decisions. Code: https://github.com/milad1378yz/MASPRM
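
To make the controller role concrete, the following is a minimal sketch of step-level beam search guided by a transcript scorer. It is not the authors' implementation: the names `step_beam_search`, `propose`, and `score` are hypothetical, `propose` stands in for whichever agent is routed to next, and `score` stands in for a process reward model such as MASPRM that returns a value for a partial transcript.

```python
from typing import Callable, List, Sequence, Tuple

# A routed transcript is modeled here as an ordered tuple of agent messages.
Transcript = Tuple[str, ...]


def step_beam_search(
    propose: Callable[[Transcript], Sequence[str]],  # hypothetical: next-agent message candidates
    score: Callable[[Transcript], float],            # hypothetical: process reward for a partial transcript
    depth: int,
    beam_width: int,
) -> Transcript:
    """Keep the `beam_width` best partial transcripts at every step."""
    beam: List[Transcript] = [()]
    for _ in range(depth):
        # Extend every kept transcript with each candidate next message.
        candidates = [t + (m,) for t in beam for m in propose(t)]
        if not candidates:
            break  # no agent produced a continuation
        # Rank the extensions by the scorer and prune to the beam width.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    # Return the highest-scoring transcript that survived pruning.
    return max(beam, key=score)
```

In this sketch the scorer is called on intermediate transcripts rather than only on finished ones, which is the property that distinguishes a process reward model from an outcome-only scorer in stepwise search.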

Authors