Attribution Graphs and Causal Probing for Mechanistic Discovery and Bias Repair in Multimodal Generative Learning

Year: 2025
Hosting: full text hosted (CC-BY-4.0)

Abstract & full text: arxiv.org/abs/2510.12957 (CC-BY-4.0)
Abstract

We treat the internals of generative models as mechanistic objects rather than black boxes. We introduce Attribution Graphs (AGs), which extend GradCAM++ to circuit-level representations, and Causal Probing, a do-calculus intervention method for identifying causal latent structures. Together they enable detection and correction of spurious correlations, demographic biases, and misaligned decision circuits during training. We further propose four components: the Cognitive Alignment Score (CAS), which quantifies agreement between model-internal representations and human concepts; a saliency-first privacy mechanism that shares only thresholded attribution nodes; a bias-aware regularizer that aligns subgroup statistics; and a Reveal-to-Revise loop that integrates attribution signals into parameter updates without separate fine-tuning. Evaluated on CelebA, FairFace, Jigsaw, and HateXplain, our method achieves 94.1% accuracy, 92.3% macro F1, 79.4% IoU-XAI, and 12.7 FID at 72-76% adversarial robustness, while reducing subgroup disparity Δ_{bias} by 41%. These results demonstrate that mechanistic interpretability, fairness, and generative performance can be jointly optimized.
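
The abstract names a saliency-first privacy mechanism and a bias-aware regularizer but does not define them. As a rough illustration only, the sketch below assumes that "thresholded attribution nodes" means keeping nodes whose absolute attribution score meets a cutoff `tau`, and that subgroup disparity can be measured as the largest gap in mean prediction between subgroups. The function names, the cutoff rule, the gap definition, and the weight `lam` are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def threshold_attributions(scores, tau):
    """Keep only nodes whose absolute attribution is at least tau.

    Assumed rule: the abstract says only thresholded nodes are shared,
    but does not give the criterion.
    """
    return {node: s for node, s in scores.items() if abs(s) >= tau}


def subgroup_gap(predictions, groups):
    """Largest difference in mean prediction between any two subgroups.

    Assumed stand-in for the disparity measure; the paper's exact
    definition of the subgroup disparity is not given in the abstract.
    """
    by_group = {}
    for p, g in zip(predictions, groups):
        by_group.setdefault(g, []).append(p)
    group_means = [mean(values) for values in by_group.values()]
    return max(group_means) - min(group_means)


def regularized_loss(task_loss, predictions, groups, lam=1.0):
    """Task loss plus a penalty that grows with the subgroup gap."""
    return task_loss + lam * subgroup_gap(predictions, groups)


if __name__ == "__main__":
    # Toy attribution scores for three hypothetical graph nodes.
    scores = {"n1": 0.9, "n2": -0.05, "n3": 0.4}
    print(threshold_attributions(scores, tau=0.3))

    # Toy predictions for two hypothetical subgroups "a" and "b".
    preds = [0.9, 0.7, 0.4, 0.2]
    groups = ["a", "a", "b", "b"]
    print(subgroup_gap(preds, groups))
    print(regularized_loss(0.25, preds, groups, lam=0.5))
```

In this toy form the penalty is zero when all subgroups have the same mean prediction, so minimizing the combined loss pushes subgroup statistics toward each other. A training-time version would need a differentiable penalty computed on model outputs.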