Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment

Source: arXiv
Abstract

Item information, such as titles and attributes, is essential for effective user engagement in e-commerce. However, manual or semi-manual entry of structured item specifics often produces inconsistent quality, errors, and slow turnaround, especially for Customer-to-Customer sellers. Generating accurate descriptions directly from item images offers a promising alternative. Existing retrieval-based solutions address some of these issues but often miss fine-grained visual details and struggle with niche or specialized categories. We propose Optimized Preference-Based AI for Listings (OPAL), a framework for generating schema-compliant, high-quality item descriptions from images using a fine-tuned multimodal large language model (MLLM). OPAL addresses key challenges in multimodal e-commerce applications, including bridging modality gaps and capturing detailed contextual information. It introduces two data refinement methods: MLLM-Assisted Conformity Enhancement, which ensures alignment with structured schema requirements, and LLM-Assisted Contextual Understanding, which improves the capture of nuanced and fine-grained information from visual inputs. OPAL uses visual instruction tuning combined with direct preference optimization to fine-tune the MLLM, reducing hallucinations and improving robustness across different backbone architectures. We evaluate OPAL on real-world e-commerce datasets, showing that it consistently outperforms baseline methods in both description quality and schema completion rates. These results demonstrate that OPAL effectively bridges the gap between visual and textual modalities, delivering richer, more accurate, and more consistent item descriptions. This work advances automated listing optimization and supports scalable, high-quality content generation in e-commerce platforms.

Yipeng Zhang, Hongju Yu, Aritra Mandal, Canran Xu, Qunzhi Zhou, Zhe Wu

Computing and computer technology; automation technology and equipment

Yipeng Zhang, Hongju Yu, Aritra Mandal, Canran Xu, Qunzhi Zhou, Zhe Wu. Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment [EB/OL]. (2025-08-13) [2025-08-24]. https://arxiv.org/abs/2508.10116.
