SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning
Vision-Language Models (VLMs) excel at understanding single images, aided by high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community due to two key challenges: (1) scaling datasets with correlated images and complex reasoning instructions is resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address this, we introduce SMiR, a synthetic data-generation pipeline for multi-image reasoning, along with a high-quality dataset generated using this pipeline. SMiR efficiently extracts correlated images via multimodal embeddings, integrates visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this approach, we produce 160K synthetic training samples, offering a cost-effective alternative to closed-source solutions. Additionally, we present SMiR-Bench, a multi-image reasoning benchmark comprising 200 diverse examples across seven complex reasoning tasks. SMiR-Bench is multi-turn and employs a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of SMiR by fine-tuning open-source VLMs and evaluating them on SMiR-Bench.
Qingyang Wu, Andrew Li, Rahul Thapa, Rahul Chalamala, Kezhen Chen, James Zou
Computing technology, computer technology
Qingyang Wu, Andrew Li, Rahul Thapa, Rahul Chalamala, Kezhen Chen, James Zou. SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning [EB/OL]. (2025-01-07) [2025-09-09]. https://arxiv.org/abs/2501.03675.