Learning unsupervised world models for autonomous driving has the potential to improve the reasoning capabilities of today's systems dramatically. However, most work neglects the physical attributes of the world and focuses on sensor data alone. We propose MUVO, a MUltimodal World Model with spatial VOxel representations, to address this challenge. We utilize raw camera and lidar data to learn a sensor-agnostic geometric representation of the world. We demonstrate multimodal future predictions and show that our spatial representation improves the prediction quality of both camera images and lidar point clouds.
MUVO: A Multimodal World Model with Spatial Representations for Autonomous Driving
Experiments with MUVO, a MUltimodal World Model with Geometric VOxel representations, evaluate sensor fusion strategies and assess the benefits of 3D occupancy prediction in autonomous driving systems.
- Year: 2023
- Venue: arXiv 2023
- Authors: 4
Attribution
- Abstract & full text: arxiv.org/abs/2311.11762v3
- TL;DR: Semantic Scholar