Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions

A second-order feature-attribution method reveals that CLIP models learn fine-grained correspondences between text captions and image regions, with accuracy that varies across object classes and shows out-of-domain effects.

Year
2024
Venue
arXiv 2024
Authors
4
Hosting
Abstract only (ARXIV-DEFAULT)

Attribution

Abstract & full text: arxiv.org/abs/2408.14153 (ARXIV-DEFAULT)
TL;DR: Semantic Scholar

Abstract

Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not understood how such models compare two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. However, this visual-linguistic grounding ability varies heavily between object classes, depends on the training data distribution, and improves substantially after in-domain training. Using our method, we can identify knowledge gaps about specific object classes in individual models and monitor their improvement upon fine-tuning.
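
To make the idea of feature-pair attribution concrete, here is a minimal toy sketch. It is not the paper's derivation: it assumes purely linear encoders (the hypothetical matrices `W_a` and `W_b`), a dot-product similarity score, and an all-zero reference input. Under those assumptions the mixed second derivative of the score with respect to one feature from each input is constant, and weighting it by the two feature values gives a matrix of pairwise attributions that sums exactly to the score.

```python
import numpy as np

def pair_attributions(W_a, W_b, a, b):
    """Feature-pair attributions for a toy dual encoder with linear encoders.

    Score: s(a, b) = (W_a @ a) . (W_b @ b).
    The mixed second derivative d^2 s / (da_i db_j) equals (W_a.T @ W_b)[i, j],
    so the attribution of feature pair (i, j) relative to a zero reference is
    A[i, j] = a_i * (W_a.T @ W_b)[i, j] * b_j.
    """
    mixed = W_a.T @ W_b                      # shape (dim_a, dim_b)
    return a[:, None] * mixed * b[None, :]   # one value per feature pair

# Hypothetical toy data: 5 features on one side, 3 on the other,
# both projected into a shared 8-dimensional embedding space.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 5))
W_b = rng.normal(size=(8, 3))
a = rng.normal(size=5)
b = rng.normal(size=3)

A = pair_attributions(W_a, W_b, a, b)
score = float((W_a @ a) @ (W_b @ b))

# In this linear case the pairwise attributions add up to the score.
print(A.shape, np.isclose(A.sum(), score))
```

Real dual encoders such as CLIP are nonlinear, so the mixed derivative is no longer constant and a general method has to account for that; the sketch only shows what a matrix of feature-pair attributions looks like and why it can be read as a decomposition of the similarity score.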
