
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning


Source: arXiv
Abstract

Multimodal Large Language Models (MLLMs) have recently excelled at visual grounding in single-image scenarios with textual references. However, their performance degrades in real-world applications involving complex multi-image compositions and multimodal instructions, revealing limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins by synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) with low-rank adaptation (LoRA). This cold-start stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling with the merged SFT model to curate high-quality RL data, and apply rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, with a +9.04% improvement on MIG-Bench and a +4.98% improvement on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.
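The rule-based RL stage described above relies on verifiable rewards rather than a learned reward model. As a rough illustration only (not taken from the paper), a grounding reward of this kind typically combines a format check on the model's CoT output with an IoU check on the predicted bounding box; the tag names, JSON box format, threshold, and function below are illustrative assumptions, not the authors' implementation.

```python
import json
import re


def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(completion, gt_box, iou_threshold=0.5):
    """Sketch of a verifiable grounding reward: format score + accuracy score.

    format: reasoning must appear in <think>...</think> followed by a final
            box in <answer>...</answer> (assumed tag convention).
    accuracy: the predicted box must overlap the ground-truth box with IoU
              above a threshold (0.5 here is an assumed value).
    """
    format_ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                          completion, re.DOTALL) is not None
    format_reward = 1.0 if format_ok else 0.0

    accuracy_reward = 0.0
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        try:
            pred_box = json.loads(match.group(1))  # expects [x1, y1, x2, y2]
            if iou(pred_box, gt_box) >= iou_threshold:
                accuracy_reward = 1.0
        except (json.JSONDecodeError, TypeError):
            pass  # unparsable answer earns no accuracy reward
    return format_reward + accuracy_reward
```

A reward like this would be computed per sampled completion and fed to a policy-gradient RL algorithm; the paper's actual reward design and optimizer are described in the full text at the arXiv link below.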

Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yanbin Hao

Subject: Computing Technology, Computer Technology

Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yanbin Hao. Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning [EB/OL]. (2025-07-01) [2025-07-22]. https://arxiv.org/abs/2507.00748.
