VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Large Vision Language Models (LVLMs) have achieved remarkable performance on a variety of vision-language tasks. However, it remains unclear how accurately LVLMs perceive visual information in images. In particular, their ability to perceive geometric information, such as shape, angle, and size, has not been sufficiently analyzed, even though perceiving these properties is crucial for tasks that require detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and show that LVLMs often fail to perceive basic geometric information in images accurately, whereas human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception: the 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue: fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) The architecture is a bottleneck: LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA even though the dataset does not require complex reasoning, suggesting that the limitation lies in how LVLMs process information from visual encoders. The datasets, code, and model responses are available at https://github.com/psunlpgroup/VisOnlyQA.

