Your Autoregressive Model Already Reveals the Causal Graph

Year: 2026

Full text: arxiv.org/abs/2602.01135 (CC-BY-NC-4.0)

Abstract

Autoregressive models trained via next-token prediction implicitly learn the conditional independence structure of their data-generating process. We exploit this observation to perform scalable causal discovery from a single observed sequence of discrete events -- without any task-specific retraining. Such single-stream settings arise naturally in vehicle diagnostics, manufacturing systems, and patient trajectories, yet they remain largely unsolved: the absence of repeated samples, massive event vocabularies, and long-range temporal dependencies render existing methods either inaccurate or computationally intractable. We introduce TRACE, a framework that repurposes any pretrained autoregressive model as a density estimator for conditional mutual information, the fundamental primitive for conditional independence testing. By constructing parallelized CI tests on GPUs, TRACE recovers both the sample-level time causal graph and its summary projection, scaling linearly with the vocabulary size while naturally handling delayed causal effects. Crucially, we prove that minimizing the standard cross-entropy pretraining loss directly minimizes an upper bound on the causal identification error, establishing a duality between sequence prediction and causal discovery. On nonlinear SCMs (|X| = 8000) and real-world vehicle diagnostic logs (|X| = 29100), TRACE is the first applicable method at this scale, outperforming the strongest baseline by over 20 F1 points.
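To make the central primitive concrete, the sketch below computes conditional mutual information, I(X; Y | Z), for small discrete variables from an explicit joint table. It is only an illustration of the quantity that a conditional independence test thresholds, not the TRACE estimator: the abstract states that TRACE obtains the required probabilities from a pretrained autoregressive model, whereas this example assumes they are given directly. The function name `conditional_mutual_information` and the dictionary layout are hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """Return I(X; Y | Z) in nats.

    `joint` maps (x, y, z) tuples to probabilities or raw counts;
    the values are normalized here so they sum to one.
    """
    total = sum(joint.values())
    p_z = defaultdict(float)   # marginal p(z)
    p_xz = defaultdict(float)  # marginal p(x, z)
    p_yz = defaultdict(float)  # marginal p(y, z)
    for (x, y, z), mass in joint.items():
        p = mass / total
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p

    # I(X; Y | Z) = sum_{x,y,z} p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]
    cmi = 0.0
    for (x, y, z), mass in joint.items():
        p = mass / total
        if p > 0:  # zero-probability cells contribute nothing
            cmi += p * math.log(p * p_z[z] / (p_xz[(x, z)] * p_yz[(y, z)]))
    return cmi


# Y copies X and Z is an independent fair coin: X and Y stay dependent given Z.
dependent = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}

# X, Y, Z are independent fair coins: X and Y are independent given Z.
independent = {
    (x, y, z): 0.125 for x in (0, 1) for y in (0, 1) for z in (0, 1)
}

print(conditional_mutual_information(dependent))
print(conditional_mutual_information(independent))
```

In the first table the value equals log 2, the entropy of a fair coin, and in the second it is zero. A CI test of this kind declares X and Y conditionally independent given Z when the estimated value falls below a chosen threshold. How TRACE sets that threshold and parallelizes the tests is not described in the abstract.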