ZINA: Multimodal Fine-grained Hallucination Detection and Editing

Source: arXiv

Abstract

Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
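
The task described in the abstract can be pictured as producing span-level edits over an MLLM's output: each hallucinated span is located, labeled with one of six error types, and paired with a suggested refinement. The minimal Python sketch below shows one possible representation; the class, the error-type labels, and the example values are illustrative assumptions, not the actual schema of ZINA or VisionHall.

    from dataclasses import dataclass

    # Hypothetical error taxonomy: the paper defines six categories, but the
    # labels here are placeholders, not the paper's actual category names.
    ERROR_TYPES = {"object", "attribute", "relation", "count", "ocr", "other"}

    @dataclass
    class HallucinatedSpan:
        start: int         # character offset where the hallucinated span begins
        end: int           # character offset where it ends (exclusive)
        error_type: str    # one of the six error categories
        refinement: str    # suggested replacement text for the span

    def apply_refinements(text: str, spans: list) -> str:
        """Edit the MLLM output by replacing each detected span with its refinement."""
        # Apply edits right-to-left so earlier offsets stay valid after each replacement.
        for s in sorted(spans, key=lambda s: s.start, reverse=True):
            assert s.error_type in ERROR_TYPES
            text = text[:s.start] + s.refinement + text[s.end:]
        return text

    # Example: correct a miscounted object in a generated caption.
    caption = "Two dogs are playing near a red car."
    spans = [HallucinatedSpan(start=0, end=8, error_type="count", refinement="Three dogs")]
    print(apply_refinements(caption, spans))  # -> "Three dogs are playing near a red car."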

Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig

Computing technology, computer technology

Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig. ZINA: Multimodal Fine-grained Hallucination Detection and Editing [EB/OL]. (2025-06-16) [2025-07-16]. https://arxiv.org/abs/2506.13130.
