0

Unsupervised Open-Vocabulary Object Localization in Videos

Unsupervised video object localization is achieved using slot attention and semantic information from CLIP, marking significant progress without explicit annotations.

Year
2023
Venue
ICCV 2023 1
Authors
14
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2309.09858v2ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.

Authors

14