Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

Abstract

Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics such as caption length to represent the detailness of the captions in a T2I training set. In this paper, we propose a new metric that estimates caption detailness from two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions and objects in the image, and average object detailness (AOD), which quantifies how detailed each object's description is. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and high-AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and the length-based selection method, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.
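The selection step described in the abstract (keep the most detailed 20% of captions) can be illustrated with a minimal sketch. The abstract does not state how ICR and AOD are computed or combined, so everything here is an assumption for illustration: the `CaptionScore` record, the precomputed `icr` and `aod` values, and the use of their product as the ranking score are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionScore:
    caption_id: str
    icr: float  # assumed: share of image regions/objects the caption covers, in [0, 1]
    aod: float  # assumed: mean detail score over the objects the caption describes


def select_top_fraction(
    scores: List[CaptionScore],
    fraction: float = 0.2,
    # Hypothetical combination rule; the abstract does not specify one.
    combine: Callable[[CaptionScore], float] = lambda s: s.icr * s.aod,
) -> List[CaptionScore]:
    """Return the highest-scoring `fraction` of captions (at least one if any exist)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=combine, reverse=True)
    return ranked[:keep]


# Made-up scores for five captions, purely for demonstration.
demo = [
    CaptionScore("a", icr=0.9, aod=0.5),
    CaptionScore("b", icr=0.5, aod=0.5),
    CaptionScore("c", icr=1.0, aod=0.8),
    CaptionScore("d", icr=0.2, aod=0.9),
    CaptionScore("e", icr=0.7, aod=0.6),
]
selected = select_top_fraction(demo, fraction=0.2)
```

A length-based baseline would fit the same function by passing a different `combine` callable, which is why the ranking rule is kept as a parameter rather than hard-coded.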

Authors: Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo

Subject classification: Computing technology, computer technology

Recommended citation: Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo. Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation [EB/OL]. (2025-05-21) [2025-06-27]. https://arxiv.org/abs/2505.15172.
