Text-to-image generation in the general domain has long been an open problem that requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to make progress on this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and the recent similar work DALL-E.
CogView: Mastering Text-to-Image Generation via Transformers
CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, achieves state-of-the-art text-to-image generation performance on the blurred MS COCO dataset, surpassing GAN-based models and DALL-E.
- Year: 2021
- Venue: NeurIPS 2021 12
- Authors: 11
Attribution
- Abstract & full text: arxiv.org/abs/2105.13290v3
- TL;DR: Semantic Scholar