See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
arXiv 2026
Bowen Yin
Boyuan Sun
Qibin Hou
Xihan Wei