CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification

A method using CLIP for training on artwork images and descriptions achieves competitive results for instance retrieval and fine-grained attribute recognition with self-supervision.

Year
2022
Venue
clip-art-contrastive-pre-training-for-fine
Authors
2
Hosting
Abstract only

Attribution

Abstract & full text
arxiv.org/abs/2204.14244
TL;DR
Semantic Scholar

Abstract

Existing computer vision research on artwork struggles with recognizing fine-grained attributes and with the lack of curated annotated datasets, which are costly to create. To the best of our knowledge, ours is one of the first methods to use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of paired artwork images and text descriptions. CLIP can learn directly from free-form art descriptions or, if available, from curated fine-grained labels. The model's zero-shot capability allows it to predict an accurate natural language description for a given image without being directly optimized for that task. Our approach targets two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset, which we consider the largest annotated artwork dataset. On this benchmark, we achieve competitive results using only self-supervision.
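
To make the contrastive objective mentioned in the abstract concrete, the following is a minimal sketch in plain Python of a CLIP-style symmetric loss over a batch of image and text embeddings, together with zero-shot label selection by cosine similarity. It is an illustration only, not the authors' implementation: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are assumptions and are not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_logits(image_embs, text_embs, temperature):
    """Cosine similarity of every image with every text, divided by temperature."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: average of image-to-text and text-to-image terms.

    Assumes image_embs[i] and text_embs[i] form a matching pair.
    The temperature value is an assumed hyperparameter.
    """
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def zero_shot_predict(image_emb, label_embs):
    """Return the index of the label embedding most similar to the image."""
    scores = cosine_logits([image_emb], label_embs, 1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical): pair 0 and pair 1 point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = contrastive_loss(images, texts)
mismatched_loss = contrastive_loss(images, texts[::-1])
predicted_label = zero_shot_predict([1.0, 0.2], texts)
```

In this sketch, correctly paired embeddings yield a lower loss than swapped pairs, which is the signal that pulls matching image and text representations together during training. Zero-shot prediction reuses the same similarity scores: candidate descriptions are embedded as text and the closest one is selected, with no task-specific classifier.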
