Tight Sample Complexity of Transformers

We tightly characterize the VC dimension of depth-L Transformers with a total of W parameters, mapping an input sequence of length T to a single output, establishing an upper bound of O(L W \log (T W)) and a nearly matching lower bound of Ω(L W \log (T W / L)). We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right) and that any learning rule that uses chain-of-thought data requires at least Ω\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right) examples, where T is the input length and T^{\prime} is the number of autoregressive steps.