Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

On abstract reasoning tasks, most Visual-Language Models fail to generatively exceed the linear separability of their own visual embeddings; this gap can be closed through targeted alignment rather than improved representation learning.

Year: 2025
Venue: arXiv 2025
Authors: 3

Abstract & full text: arxiv.org/abs/2507.07574v2

Abstract

A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap", where most models fail to generatively outperform the linear separability of their own representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable alignment issue. By augmenting standard next-token prediction with a contrastive objective, our fine-tuning method activates dormant reasoning pathways, systematically improving the linear structure of representations to significantly surpass the LSC.
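
As a rough illustration of the Linear Separability Ceiling idea, the sketch below fits a linear classifier on frozen embeddings and compares its held-out accuracy with a generative accuracy. It is a minimal sketch under stated assumptions, not the authors' protocol: the embeddings are synthetic Gaussian clusters standing in for a VLM's visual embeddings, the probe is plain logistic regression trained by gradient descent, and the names `fit_linear_probe`, `probe_accuracy`, `make_synthetic_embeddings`, and the `generative_acc` value are all hypothetical.

```python
import numpy as np


def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe (a linear classifier) by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        # Clip logits so the sigmoid never overflows.
        logits = np.clip(X @ w + b, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))
        # Gradient of the mean cross-entropy loss plus an L2 penalty on w.
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(X, y, w, b):
    """Accuracy of the linear decision rule sign(Xw + b)."""
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))


def make_synthetic_embeddings(n, d, rng, shift=3.0):
    """Assumption: two Gaussian clusters stand in for frozen visual embeddings."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += shift * (2 * y - 1)  # separate the classes along the first axis
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_synthetic_embeddings(200, 16, rng)
X_test, y_test = make_synthetic_embeddings(200, 16, rng)

w, b = fit_linear_probe(X_train, y_train)
lsc = probe_accuracy(X_test, y_test, w, b)  # held-out probe accuracy

generative_acc = 0.70  # hypothetical placeholder for the model's generated-answer accuracy
alignment_gap = lsc - generative_acc  # positive: the model underuses its own embeddings
print(f"LSC={lsc:.3f}  generative={generative_acc:.3f}  gap={alignment_gap:.3f}")
```

In this framing, a positive `alignment_gap` corresponds to the situation the abstract describes, where generation does not reach what a linear classifier already extracts from the same representations.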
