0

Tight Sample Complexity of Transformers

We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $Ω(L W \log (T W / L))$.

Preview
Year
2026
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2606.09731ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We tightly characterize the VC dimension of depth-L Transformers with a total of W parameters, mapping an input sequence of length T to a single output, establishing an upper bound of O(L W \log (T W)) and a nearly matching lower bound of Ω(L W \log (T W / L)). We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right) and that any learning rule that uses chain-of-thought data requires at least Ω\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right) examples, where T is the input length and T^{\prime} is the number of autoregressive steps.