We treat the internals of generative models as mechanistic objects rather than black boxes. We introduce Attribution Graphs (AGs), which extend GradCAM++ to circuit-level representations, and Causal Probing, a do-calculus intervention method for identifying causal latent structures. Together they enable detection and correction of spurious correlations, demographic biases, and misaligned decision circuits during training. We further propose the Cognitive Alignment Score (CAS), which quantifies agreement between model-internal representations and human concepts; a saliency-first privacy mechanism that shares only thresholded attribution nodes; a bias-aware regularizer that aligns subgroup statistics; and a Reveal-to-Revise loop that integrates attribution signals into parameter updates without separate fine-tuning. Evaluated on CelebA, FairFace, Jigsaw, and HateXplain, our method achieves 94.1% accuracy, 92.3% macro F1, 79.4% IoU-XAI, and an FID of 12.7, with 72–76% adversarial robustness, while reducing subgroup disparity Δ_{bias} by 41%. These results demonstrate that mechanistic interpretability, fairness, and generative performance can be jointly optimized.
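As a rough illustration of the saliency-first privacy idea (sharing only thresholded attribution nodes), the sketch below filters a set of per-node attribution scores and keeps only those whose magnitude meets a cutoff. The abstract does not specify the thresholding rule, so the absolute-value comparison, the function name `threshold_attribution_nodes`, the parameter `tau`, and the node names are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict


def threshold_attribution_nodes(scores: Dict[str, float], tau: float) -> Dict[str, float]:
    """Return only the nodes whose absolute attribution is at least tau.

    Assumption: a node is "shared" when |score| >= tau; the source does not
    state the exact rule.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return {node: s for node, s in scores.items() if abs(s) >= tau}


# Hypothetical attribution scores for four graph nodes (names are made up).
scores = {
    "layer3.head2": 0.81,
    "layer3.head5": -0.64,
    "layer4.mlp": 0.07,
    "layer5.head1": 0.02,
}

# Only the two high-magnitude nodes would be shared; the rest stay local.
shared = threshold_attribution_nodes(scores, tau=0.5)
print(sorted(shared))
```

Using the absolute value keeps strongly negative attributions as well as strongly positive ones; a signed threshold would be an equally plausible reading of the abstract.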
Attribution Graphs and Causal Probing for Mechanistic Discovery and Bias Repair in Multimodal Generative Learning
- Year
- 2025
- Hosting
- Full text hosted (CC-BY-4.0)
Attribution
- Abstract & full text
- arxiv.org/abs/2510.12957 (CC-BY-4.0)
- TL;DR
- Semantic Scholar