National Preprint Platform (国家预印本平台)

Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

Source: arXiv
English Abstract

Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce this cost, one promising direction is the incorporation of pre-trained models. An existing method that utilizes pre-trained models for the generator significantly reduced the training cost compared with other large-scale GANs, but we found that the model loses per-prompt generation diversity by a large margin. To build an efficient and high-fidelity text-to-image GAN without compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt together with better sample fidelity. We also propose a metric called Per-Prompt Diversity (PPD) to quantitatively evaluate the diversity of text-to-image models. SCAD achieved a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.
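The abstract does not spell out how Per-Prompt Diversity (PPD) is computed; the paper should be consulted for the exact definition. As a rough illustration of the general idea only, the sketch below treats per-prompt diversity as the mean pairwise cosine distance between feature embeddings (e.g. CLIP image features) of several samples generated from the same prompt. All function names and the distance choice here are assumptions, not the authors' formulation.

```python
# Hypothetical sketch of a per-prompt diversity score: mean pairwise
# cosine distance among feature vectors of samples generated from ONE
# prompt. The paper's actual PPD definition may differ; this only
# illustrates the concept of measuring intra-prompt sample spread.
import math
from itertools import combinations

def cosine_distance(u, v):
    # 1 - cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def per_prompt_diversity(features):
    """Mean pairwise cosine distance over all sample pairs for one prompt."""
    pairs = list(combinations(features, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Identical samples yield zero diversity; dissimilar samples score higher.
low = per_prompt_diversity([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
high = per_prompt_diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In this toy setup, `low` is 0.0 (all samples collapse to one point, the mode-collapse case the paper says existing efficient GANs suffer from), while `high` is positive, reflecting genuine per-prompt variety.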

Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji

Subject: Computing Technology, Computer Science

Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji. Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity [EB/OL]. (2025-06-02) [2025-06-29]. https://arxiv.org/abs/2506.01493
