
Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent


Abstract

Despite progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models suffer mainly from the limited quality and scale of their datasets. To address these problems, we propose a descriptive-prompt-based editing framework named DescriptiveEdit. The core idea is to reframe "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained text-to-image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which adds attention bridges that inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show that it improves editing accuracy and consistency.
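The abstract does not specify how the attention bridges are built, so the following is only a rough illustration of the general idea of injecting reference features through cross-attention, not the authors' implementation. It assumes identity projections (no learned query, key, or value weights), and the names `cross_attention_bridge` and `gate` are hypothetical. Queries come from the tokens of the image being generated, keys and values come from the reference image tokens, and the attended result is added back residually.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_bridge(edit_tokens, ref_tokens, gate=1.0):
    """Illustrative residual cross-attention (assumption, not the paper's code).

    edit_tokens: list of feature vectors for the image being generated (queries).
    ref_tokens:  list of feature vectors from the reference image (keys and values).
    gate:        hypothetical scalar controlling how strongly reference
                 features are injected; 0.0 leaves edit_tokens unchanged.
    """
    dim = len(edit_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in edit_tokens:
        # Scaled dot-product score of this query against every reference token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in ref_tokens]
        weights = softmax(scores)
        # Weighted sum of reference tokens (values equal keys in this sketch).
        attended = [
            sum(w * v[d] for w, v in zip(weights, ref_tokens)) for d in range(dim)
        ]
        # Residual injection of the reference features.
        out.append([qi + gate * ai for qi, ai in zip(q, attended)])
    return out
```

With `gate=0.0` the function returns the edit tokens unchanged, which mirrors the stated goal of keeping the underlying text-to-image model intact while the bridge only adds reference information on top.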

Authors: En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai


Subject classification: Computing technology, computer technology

Recommended citation: En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai. Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent [EB/OL]. (2025-08-28) [2025-09-06]. https://arxiv.org/abs/2508.20505.
