David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training

We propose Diff-Instruct* (DI*), a data-efficient post-training approach that improves the alignment of one-step text-to-image generative models with human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF): the one-step model is optimized to maximize a human reward function while being regularized to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which use the Kullback-Leibler divergence as the regularizer, we introduce a novel general score-based divergence regularization that substantially improves both performance and post-training stability. Although the general score-based RLHF objective is intractable to optimize directly, we derive a strictly equivalent tractable loss function whose gradient can be computed efficiently. We introduce DI*-SDXL-1step, a 2.6B-parameter one-step text-to-image model at a resolution of $1024\times 1024$, post-trained from DMD2 w.r.t. SDXL. Our 2.6B DI*-SDXL-1step model outperforms the 50-step 12B FLUX-dev model in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result shows that, with proper post-training, a small one-step model can beat much larger multi-step diffusion models. Our model is open-sourced at https://github.com/pkulwj1994/diff_instruct_star. We hope our findings contribute to human-centric machine learning techniques.
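
The objective described above can be written schematically as a reward term plus a score-based regularizer. The following is only a sketch under assumed notation; none of these symbols appear in the text above. Here $g_\theta$ denotes the one-step generator, $c$ a text prompt, $r$ the human reward, $\alpha$ a reward weight, $s_{p_{\theta,t}}$ and $s_{q_t}$ the scores of the diffused generator and reference distributions at time $t$, $w(t)$ a time weighting, $\pi_t$ a sampling distribution, and $d(\cdot)$ a scalar distance function.

```latex
% Schematic only: all symbols are assumptions introduced for illustration.
\begin{aligned}
\min_{\theta}\;\; & \mathbb{E}_{c,\; z \sim p_z,\; x_0 = g_\theta(z, c)}
    \big[ -\alpha\, r(x_0, c) \big]
    \;+\; \mathbf{D}\big(p_\theta,\, p_{\mathrm{ref}}\big), \\
\mathbf{D}\big(p_\theta,\, p_{\mathrm{ref}}\big) \;=\;
  & \int_{0}^{T} w(t)\, \mathbb{E}_{x_t \sim \pi_t}
    \Big[ d\big( s_{p_{\theta,t}}(x_t) - s_{q_t}(x_t) \big) \Big]\, \mathrm{d}t .
\end{aligned}
```

In this reading, the first term rewards images that score well under the human reward, and the second penalizes a mismatch between the generator's score function and that of the reference diffusion, taking the place of the Kullback-Leibler term used in traditional RLHF.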
