Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Topics

Image editing, Image generation, Image Understanding, Language Modeling, Robotics, World Models

Abstract

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

Authors

Yuan Zhang, Yijun Yang, Hang Xu, Lin Song, Haoyang Huang, Nan Duan, Bo Wang, Wenbo Li, Haoze Sun, Yicheng Xiao, Jianhui Liu, Nan Jiang, Yanbing Zhang, Guohui Zhang, Guoqing Ma, Wei Tang, Wenhu Zhang, Xin Han, Maoquan Zhang