Autoregressive models trained via next-token prediction implicitly learn the conditional independence structure of their data-generating process. We exploit this observation to perform scalable causal discovery from a single observed sequence of discrete events -- without any task-specific retraining. Such single-stream settings arise naturally in vehicle diagnostics, manufacturing systems, and patient trajectories, yet they remain largely unsolved: the absence of repeated samples, massive event vocabularies, and long-range temporal dependencies render existing methods either inaccurate or computationally intractable. We introduce TRACE, a framework that repurposes any pretrained autoregressive model as a density estimator for conditional mutual information, the fundamental primitive for conditional independence testing. By constructing parallelized CI tests on GPUs, TRACE recovers both the sample-level time causal graph and its summary projection, scaling linearly with the vocabulary size while naturally handling delayed causal effects. Crucially, we prove that minimizing the standard cross-entropy pretraining loss directly minimizes an upper bound on the causal identification error, establishing a duality between sequence prediction and causal discovery. On nonlinear SCMs (|X| = 8000) and real-world vehicle diagnostic logs (|X| = 29100), TRACE is the first applicable method at this scale, outperforming the strongest baseline by over 20 F1 points.
Your Autoregressive Model Already Reveals the Causal Graph
Autoregressive models trained via next-token prediction implicitly learn the conditional independence structure of their data-generating process. We exploit this observation to perform scalable causal discovery from a single observed sequence of discrete events -- without any…
- Preview

- Year
- 2026
- Hosting
- Excerpt onlyCC-BY-NC-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2602.01135CC-BY-NC-4.0
- TL;DR
- Semantic Scholar