Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

Year: 2026

Abstract & full text: arxiv.org/abs/2606.07596 (CC-BY-4.0)

Abstract

Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show that a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of ΔW = W_ft − W_base reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models (0.5B to 7B parameters) and four classification benchmarks, top-k truncation reduces the gap in every model-benchmark cell, by up to 5× on CivilComments, at a cost of less than 2 pp in accuracy. We propose that this works because the shortcut response sits in the tail of the singular ordering of ΔW. This is a claim about how truncation behaves, not about the raw singular values, which are broadly distributed and look the same across all four datasets. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-k, random-k, and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as explanations. We read this as preliminary evidence that the singular basis of ΔW is a useful coordinate system for studying what fine-tuning has learned.
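
As a rough illustration of the intervention described in the abstract, the following minimal sketch applies top-k truncation to a single weight matrix. It is not the authors' implementation: the function name `truncate_update`, the random toy matrices, and the choice of k are hypothetical, and the abstract does not say which layers are truncated or how k is selected.

```python
import numpy as np

def truncate_update(w_base, w_ft, k):
    """Return w_base plus the rank-k (top-k) SVD approximation of the update.

    Hypothetical helper for illustration; applies to one 2-D weight matrix.
    """
    delta = w_ft - w_base                      # fine-tuning update, ΔW
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    k = max(0, min(k, s.size))                 # clamp k to the valid range
    # Keep the k largest singular directions and discard the tail.
    delta_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    return w_base + delta_k

# Toy data standing in for real weights (assumption: not from the paper).
rng = np.random.default_rng(0)
w_base = rng.standard_normal((8, 6))
w_ft = w_base + 0.1 * rng.standard_normal((8, 6))

w_trunc = truncate_update(w_base, w_ft, k=2)
```

With k equal to the full rank the function returns the fine-tuned weights unchanged, and with k = 0 it returns the base weights, so k interpolates between the two models along the singular ordering of ΔW.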