Efficient Architectures for High Resolution Vision-Language Models

Pheye, a novel vision-language model architecture, efficiently processes high-resolution images with fewer parameters, excelling in fine-grained image understanding and scene-text handling.

Year
2025
Venue
arXiv 2025
Authors
2

Abstract & full text
arxiv.org/abs/2501.02584

Abstract

Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
