Multimodal Representation Alignment for Cross-modal Information Retrieval

Different machine learning models can represent the same underlying concept in different ways. This variability is particularly relevant for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given an input from another modality. This challenge can be framed as a representation alignment problem: for example, given a sentence encoded by a language model, the goal is to retrieve the most semantically aligned image based on representations produced by an image encoder, or vice versa. To gain insight into how different metrics, embedding spaces, and representation alignment methods affect retrieval performance, we first empirically investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics and two learned ones, the latter implemented via neural networks of different architectures trained with different losses, and evaluate them across multiple benchmarks. Our experimental findings indicate that cosine similarity consistently outperforms all the other investigated metrics in representation alignment tasks, and that Wasserstein distance provides a complementary perspective on cross-modal distributional differences. We also observe that our proposed custom contrastive loss is advantageous over the MSE loss for aligning image and text representations, for both multilayer perceptrons and transformer-based models. Taken together, our findings offer novel insights and practical considerations for researchers working in multimodal information retrieval, particularly in real-world, cross-modal applications. Our code is publicly available.
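
As an illustration of the retrieval setting described above, the following minimal sketch ranks candidate image embeddings by cosine similarity to a text embedding. It is not the released implementation: the function names `cosine_similarity` and `retrieve` are hypothetical, the toy vectors are made up, and the sketch assumes both modalities have already been mapped into a shared space of equal dimension.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

def retrieve(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy data (made up for illustration): one text embedding, three image embeddings.
text_embedding = [1.0, 0.0, 1.0]
image_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different magnitude
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = retrieve(text_embedding, image_embeddings)
```

Because cosine similarity depends only on direction, the second candidate ranks first despite its larger norm; the same function can be used for image-to-text retrieval by swapping the roles of query and candidates.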