We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at \href{https://github.com/manantomar/video-occupancy-models}{\texttt{github.com/manantomar/video-occupancy-models}}.
Video Occupancy Models
Video Occupancy models predict future video states in a compact latent space with a single-step, discounted distribution, benefiting downstream control tasks.
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 7
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2407.09533ARXIV-DEFAULT
- TL;DR
- Semantic Scholar