Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at a scale, the ProLIP models cannot handle long context texts longer than 64 context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning.We also observe a trade-off between the long context understanding (measured by Urban-1k) and general zero-shot capability (measured by evaluation datasets by DataComp). Code is available at https://github.com/naver-ai/prolip
LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
LongProLIP enhances ProLIP's capability to handle longer text sequences, improving contextual understanding in vision-language tasks with a trade-off in zero-shot performance.
- Year
- 2025
- Venue
- arXiv 2025
- Authors
- 2
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2503.08048v2ARXIV-DEFAULT
- TL;DR
- Semantic Scholar