Large language models replicate and predict human cooperation across experiments in game theory

Large language models (LLMs) are increasingly deployed as decision-making agents in high-stakes domains and as imitators of human behavior in the social and behavioral sciences. Yet how closely LLMs mirror human decision-making remains poorly understood. This gap is critical: misalignment could produce harmful outcomes in practice, while failure to replicate human behavior renders LLMs ineffective as social simulators. Here, we address this gap by replicating large-scale game-theoretic experiments and by introducing a systematic prompting and probing framework for machine-behavioral evaluation. We test three open models typically used to power agents (Llama, Mistral, and Qwen). Across 121 dyadic games spanning four classical game types, Llama reproduces human cooperation patterns with high fidelity, while Qwen aligns closely with Nash equilibrium predictions. Characterizing models through behavioral phenotyping, we find that humans and Llama share an envious decision profile, while Qwen and Mistral exhibit different profiles. An attention-based analysis of payoff salience reveals that Llama processes payoff information in a structured, layer-dependent manner that is absent in Qwen and Mistral, suggesting a mechanistic basis for its closer alignment with human behavior. Population-level behavioral replication is achieved without persona-based prompting, which simplifies the simulation process. Extending the experimental parameter space beyond the original human-tested games, we generate and preregister testable hypotheses for novel game configurations. Our findings demonstrate that appropriately configured LLMs can replicate aggregate human behavioral patterns, exhibit human-like decision phenotypes, and enable systematic exploration of unexplored experimental spaces. They thus offer a complementary approach to traditional behavioral research, one that generates new empirical predictions about human social decision-making.