Automated Attribution Graph Interpretation via Probe Prompting

Even though we know the precise computations that lead from a large language model (LLM) input to its output, this computation remains very hard to interpret. One way to make the process easier to understand is to build a sparse computational graph that captures most of the model's behavior with the smallest number of computational nodes. Cross-layer transcoders (CLTs) decompose the dense computations of the MLP layers, but the resulting circuits still contain thousands of nodes even for short prompts. Existing automated interpretation methods label individual features from corpus activations, and these labels are often not validated by causal intervention. We introduce probe prompting, a transparent rule-based pipeline that groups the features of an attribution graph into concept-aligned supernodes based on their responses to a small set of concept-targeted probe prompts, summarized as Cross-Prompt Activation Signatures (CPAS). Across four factual domains, on Gemma-2-2B with a public CLT dictionary and 45,596 entity-swap interventions, we find that the labeled supernodes have the predicted steering behavior in every one of them. Code, datasets, and an interactive demo are released anonymously as a reusable harness for calibrating supernode labels against causal interventions.
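
To make the grouping idea concrete, the following is a minimal illustrative sketch, not the released pipeline. It assumes that each feature's signature is simply its list of activations across the probe prompts, that each probe prompt targets exactly one concept, and that a feature is assigned to a concept when that concept alone reaches a chosen fraction of the feature's peak activation. The function name `group_by_signature`, the `threshold` parameter, and the `mixed` and `unassigned` buckets are hypothetical choices made for this example.

```python
from collections import defaultdict


def group_by_signature(activations, concepts, threshold=0.5):
    """Group features into concept-labeled supernodes (illustrative rule only).

    activations: dict mapping feature id -> list of activations, one per probe prompt.
    concepts: list of concept labels, one per probe prompt (same order).
    threshold: hypothetical cutoff, as a fraction of the feature's peak activation.
    """
    supernodes = defaultdict(list)
    for feature_id, signature in activations.items():
        if len(signature) != len(concepts):
            raise ValueError(f"signature length mismatch for {feature_id}")
        peak = max(signature)
        if peak <= 0:
            # The feature never fires on any probe, so it gets no concept label.
            supernodes["unassigned"].append(feature_id)
            continue
        # Normalize by the peak so features with different scales are comparable.
        normalized = [a / peak for a in signature]
        winners = [c for c, a in zip(concepts, normalized) if a >= threshold]
        # Exactly one responding concept gives a clean label; otherwise mark as mixed.
        label = winners[0] if len(winners) == 1 else "mixed"
        supernodes[label].append(feature_id)
    return dict(supernodes)


if __name__ == "__main__":
    # Toy signatures over three hypothetical probe concepts.
    probe_concepts = ["country", "capital", "language"]
    toy_activations = {
        "f1": [0.9, 0.1, 0.0],
        "f2": [0.0, 0.8, 0.1],
        "f3": [0.7, 0.6, 0.0],
        "f4": [0.0, 0.0, 0.0],
    }
    print(group_by_signature(toy_activations, probe_concepts))
```

Under these assumptions, `f1` and `f2` each respond to a single probe concept and receive that label, `f3` responds to two concepts and lands in the mixed bucket, and `f4` stays unassigned. Labels produced by a rule like this are only hypotheses until they are checked against causal interventions.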