DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and empirical benchmarks for assessing their alignment with real human group interactions are lacking. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains multi-round public messages and private Likert-scale beliefs from U.S.-based participants across 107 topics; the cleaned benchmark used in our experiments contains 2,788 participants in 697 groups, enabling evaluation at the utterance and group levels and supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate them in two settings, next-message prediction and full dynamics simulation, using stance-based opinion-dynamics metrics. In zero-shot settings, RPLA groups exhibit stronger opinion convergence than human groups. On the held-out group split, supervised fine-tuning (SFT) of Llama-3.1-8B-Instruct improves auxiliary stance alignment and reduces group-level convergence error, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.