CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Year: 2026
Abstract & full text: arxiv.org/abs/2604.01604

Abstract

While modern LLMs are aligned to refuse harmful requests, understanding the mechanistic basis of this refusal behavior is essential for model safety analysis. For example, steering-based jailbreak attacks exploit this mechanism by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods rely primarily on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics, such as topic or lexical cues, rather than the true causal mechanisms. Selecting refusal features therefore requires measuring inter-feature relationships rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying the critical refusal features that directly govern the refusal decision. CRaFT leverages cross-layer transcoders to map the model's internal computations into a sparse feature circuit graph, whose edges quantify inter-feature influences and their contributions to the final output logits. By aggregating the effects that propagate along paths to refusal, CRaFT ranks the most influential features. Extensive evaluations across four jailbreak benchmarks show that CRaFT significantly improves average performance from 6.7% to 57.4% and generates more specific harmful completions than current state-of-the-art (SOTA) methods.
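
The abstract does not say how effects are aggregated along paths, so the following is only a toy sketch of one plausible reading, not the authors' method. It assumes the circuit graph is a directed acyclic graph whose edge weights are linear influences, and it scores each feature by its direct contribution to the refusal logit plus the sum, over all downstream paths, of the products of edge weights. The function name `rank_by_path_influence`, the feature names, and all weights are hypothetical.

```python
def rank_by_path_influence(edges, direct_logit, topo_order):
    """Rank features by total (direct + indirect) influence on one output logit.

    edges[u] = {v: w}   assumed linear influence of feature u on downstream feature v
    direct_logit[u]     assumed direct contribution of feature u to the refusal logit
    topo_order          features listed upstream-first (a topological order)
    """
    total = {}
    # Walk downstream-first so every successor's total is known before it is needed.
    for u in reversed(topo_order):
        indirect = sum(w * total[v] for v, w in edges.get(u, {}).items())
        total[u] = direct_logit.get(u, 0.0) + indirect
    # Largest absolute influence first.
    return sorted(total.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical three-feature graph: a topic cue feeds an intent feature,
# which feeds a late refusal feature.
edges = {
    "f_topic": {"f_intent": 0.5, "f_refuse": 0.25},
    "f_intent": {"f_refuse": 0.5},
}
direct_logit = {"f_topic": 0.0, "f_intent": 0.25, "f_refuse": 1.0}
topo_order = ["f_topic", "f_intent", "f_refuse"]

ranked = rank_by_path_influence(edges, direct_logit, topo_order)
for name, score in ranked:
    print(name, score)
```

In this toy graph the topic feature has no direct effect on the logit, yet it still receives a nonzero score through its downstream paths. That is the kind of inter-feature relationship a ranking based on activation strength alone would not capture.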