BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce BehaviorBench, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. BehaviorBench reconstructs wallet-level decision histories from observed public prediction-market and on-chain records and organizes them into two complementary task layers: Belief prediction, which targets a user's final revealed stance and confidence in a market, and Trade prediction, which targets the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. We find that personalization improves Belief prediction more consistently than Trade prediction, that model rankings change across task layers and metrics, and that different history interfaces expose different failure modes. BehaviorBench provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence, rather than relying on simulated users alone.