0

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy.

Preview
Year
2026
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2606.03328CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy. We ask whether this conclusion survives once calibration impact is evaluated separately across distinct capability dimensions rather than aggregated. Decomposing post-pruning capability into General, Commonsense, Code, and Math, and analysing n{=}15 calibration sources via Spearman correlations between OIT information metrics and per-dimension retention, we uncover an opposite-sign trade-off: calibration perplexity correlates positively with General retention (ρ{=}{+}0.71) but negatively with Math and Code retention (ρ{=}{-}0.53,,{-}0.59; p{<}0.05), so no single source can preserve all capabilities. We respond with multi-source calibration mixing, and propose IGSP, an information-guided self-calibration protocol that automates multi-source construction without capability-aligned corpora by minimising 4-gram aggregation and balancing perplexity across dimensions. On LLaMA-3.1-8B at SparseGPT 60% sparsity, a uniform multi-source mix reaches 58.8% total retention, outperforming the best single source (MetaMath, 50.0%) by +8.8 and the C4 default (40.0%) by +18.8; IGSP improves over Self-Cal by +2.4 and SGS by +4.8.