Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking

Although autoregressive (AR) models have demonstrated remarkable success in image generation, extending these models to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework that effectively integrates spatial layout constraints into the AR generation process. To equip AR models with layout control, a structured masking strategy is applied to the attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents the misassociation of different regions with their corresponding descriptions while enabling the sufficient injection of layout constraints into the generation process. To alleviate the exposure bias of AR models and further enhance generation quality and layout accuracy, we incorporate a Group Relative Policy Optimization (GRPO) post-training scheme. We adapt it to the next-set-based paradigm and introduce a specifically designed layout reward, which is coordinated with an image quality reward to guide policy optimization in a balanced manner. Experimental results demonstrate that SMARLI seamlessly integrates layout tokens with text and image tokens without compromising generation quality, and the proposed masking strategy and post-training scheme can also be transferred to standard next-token-based AR models. The proposed framework achieves superior layout control while maintaining the structural simplicity and generation efficiency of AR models.