Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks: action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model (22B parameters). Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
Scaling 4D Representations
Masked auto-encoding with transformer video models scales and improves performance on 4D vision tasks, outperforming many recent image and video models.
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 35
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2412.15212
- TL;DR
- Semantic Scholar
Authors
Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Sjoerd van Steenkiste, Carl Doersch, Skanda Koppula, Daniel Zoran, Andrew Zisserman, João Carreira, Ignacio Rocco, Yi Yang, Viorica Patraucean, Ross Goroshin, Dilara Gokay, Klaus Greff, Etienne Pot, Luke Friedman, Michael King, Chuhan Zhang, Thomas Albert Keck, Joseph Heyward, Goker Erdogan, Yana Hasson, Guillaume Le Moing, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Chris Duvarney, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Dima Damen, Pauline Luc