Tracr: Compiled Transformers as a Laboratory for Interpretability

The Tracr compiler translates human-readable programs into structured transformer models to facilitate the study of transformer behavior and interpretability.

Year
2023
Venue
tracr-compiled-transformers-as-a-laboratory
Authors
6
Hosting
Abstract only (ARXIV-DEFAULT)

Abstract & full text
arxiv.org/abs/2301.05062v5 (ARXIV-DEFAULT)

Abstract

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown, it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
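To make "human-readable program" concrete, here is a minimal plain-Python sketch of one example the abstract names, computing token frequencies. It is an illustration only, not Tracr's API and not the authors' implementation. The helper names `select`, `selector_width`, and `token_frequencies` are hypothetical. They are written on the assumption that such a program can be expressed as an attention-like comparison between positions followed by a per-position count.

```python
from typing import Callable, List, Sequence


def select(
    keys: Sequence[str],
    queries: Sequence[str],
    predicate: Callable[[str, str], bool],
) -> List[List[bool]]:
    """Hypothetical helper: build a boolean attention-like pattern.

    Entry [q][k] is True when predicate(keys[k], queries[q]) holds.
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: List[List[bool]]) -> List[int]:
    """Hypothetical helper: count how many keys each query position selects."""
    return [sum(row) for row in selector]


def token_frequencies(tokens: Sequence[str]) -> List[int]:
    """For each position, count how often that position's token occurs in the input."""
    # Each position selects every position holding the same token ...
    same_token = select(tokens, tokens, lambda k, q: k == q)
    # ... and the number of selected positions is that token's frequency.
    return selector_width(same_token)


if __name__ == "__main__":
    print(token_frequencies(list("hello")))  # prints [1, 1, 2, 2, 1]
```

Every step of the sketch is explicit, so the intended computation at each position is known in advance. That is the property the abstract relies on when it proposes compiled models as ground truth for interpretability methods.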
