DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about the data. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing preselected data sources, explicit data schemas, or cleaned data, and thereby understate the exploratory burden. To evaluate agents in this more realistic exploratory setting, we introduce DataClawBench, a benchmark built from financial think-tank consulting scenarios in which agents must independently explore unfamiliar, noisy, cross-domain data and produce verifiable conclusions. DataClawBench provides a unified real-world data environment with approximately 2.06 million records across enterprise, industry, and policy domains, with native data noise preserved. On top of this data environment, it defines 492 multi-step cross-domain tasks, each annotated with intermediate milestones that diagnose exploration and reasoning failures beyond outcome accuracy. A systematic evaluation of eight advanced LLMs under the OpenClaw agent reveals that exploratory data analysis breaks agent reliability: more exploration does not reliably translate into task-relevant progress or correct final answers.