National Preprint Platform

VoQA: Visual-only Question Answering

Source: arXiv
Abstract

We propose Visual-only Question Answering (VoQA), a novel multimodal task in which questions are visually embedded within images, without any accompanying textual input. This requires models to locate, recognize, and reason over visually embedded textual questions, posing challenges for existing large vision-language models (LVLMs), which show notable performance drops even with carefully designed prompts. To bridge this gap, we introduce Guided Response Triggering Supervised Fine-tuning (GRT-SFT), a structured fine-tuning strategy that guides the model to perform step-by-step reasoning purely based on visual input, significantly improving model performance. Our work enhances models' capacity for human-like visual understanding in complex multimodal scenarios, where information, including language, is perceived visually.
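
To make the task format concrete, below is a minimal sketch (not from the paper) of how a VoQA-style input might be constructed: the question text is rendered directly into the image, and the model would then be queried with no accompanying text prompt. The rendering layout and the helper name `query_lvlm` are illustrative assumptions, not the authors' implementation.

```python
# Sketch of building a visual-only QA input, assuming the setup described
# in the abstract: the question is embedded in the image pixels and the
# LVLM receives no separate textual question.
from PIL import Image, ImageDraw, ImageFont

def embed_question(image_path: str, question: str) -> Image.Image:
    """Render the question onto a white strip above the image content."""
    base = Image.open(image_path).convert("RGB")
    strip_height = 40
    canvas = Image.new("RGB", (base.width, base.height + strip_height), "white")
    canvas.paste(base, (0, strip_height))  # original image below the strip
    draw = ImageDraw.Draw(canvas)
    draw.text((10, 10), question, fill="black", font=ImageFont.load_default())
    return canvas

voqa_input = embed_question("scene.jpg", "What color is the car?")
# answer = query_lvlm(voqa_input)  # hypothetical LVLM call: image in, text out,
#                                  # with no textual prompt accompanying it
```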

Luyang Jiang, Jianing An, Jie Luo, Wenjun Wu, Lei Huang

Subject: Computing Technology; Computer Technology

Luyang Jiang, Jianing An, Jie Luo, Wenjun Wu, Lei Huang. VoQA: Visual-only Question Answering [EB/OL]. (2025-05-20) [2025-06-06]. https://arxiv.org/abs/2505.14227.
