GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

Travel planning in the real world is overwhelmingly a group activity, yet existing LLM travel-planning benchmarks reduce it to a single-user setting, in which the field is approaching saturation. This single-user assumption sidesteps what makes group planning hard for an agent: discovering private preferences across multiple users, surfacing conflicts, and balancing utility against fairness. To bring the task back to its multi-user reality, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Built from real user profiles, POI data, and ticket prices, it comprises 650 tasks across three difficulty levels, each running in a synchronous group-chat sandbox with cached tool data for reproducible offline evaluation. Beyond the multi-step reasoning and tool use that single-user benchmarks already test, GroupTravelBench probes three group-specific capabilities: (i) elicitation of private preferences through multi-turn dialogue; (ii) coordination of inter-user conflicts via compromise or subgrouping; and (iii) planning that balances group utility against fairness. We pair the benchmark with a complementary evaluation framework that combines rule-based outcome metrics with LLM-judge process metrics. Across a wide range of frontier models, even the strongest agents fall short on all four rule-based outcome metrics, with plan validity below 12%, suggesting that group-level outcome quality is a key open challenge for LLM travel-planning agents.