Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Abstract

Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation and unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
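The abstract does not specify X-Planner's output format, so the following is only a minimal sketch of how a decomposed plan could be represented and applied. Every name here (`SubInstruction`, `apply_plan`, the `editor` callable, the edit-type strings) is a hypothetical illustration, not the authors' implementation, and images and masks are modeled as plain nested lists to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-ins for real image and segmentation-mask types (assumption).
Image = List[List[int]]
Mask = List[List[int]]  # 1 = pixel may be edited, 0 = pixel must be preserved


@dataclass
class SubInstruction:
    text: str       # a simple, direct instruction
    edit_type: str  # hypothetical label for the kind of edit
    mask: Mask      # region the edit is allowed to touch


def apply_plan(
    image: Image,
    plan: List[SubInstruction],
    editor: Callable[[Image, SubInstruction], Image],
) -> Image:
    """Apply each sub-instruction in order, keeping pixels outside its mask."""
    for step in plan:
        edited = editor(image, step)
        # Take the edited pixel only where the mask is set; otherwise keep the original.
        image = [
            [e if m else o for o, e, m in zip(orig_row, edit_row, mask_row)]
            for orig_row, edit_row, mask_row in zip(image, edited, step.mask)
        ]
    return image
```

In this sketch the mask is what keeps each edit localized: whatever the (hypothetical) `editor` does to the full image, only the masked region is carried forward to the next step.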

Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh

Subject classification: Computing technology, computer technology

Recommended citation: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh. Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing[EB/OL]. (2025-07-07)[2025-07-16]. https://arxiv.org/abs/2507.05259.
