OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Source: arXiv
Abstract

Document content extraction is a critical task in computer vision, underpinning the data needs of large language models (LLMs) and retrieval-augmented generation (RAG) systems. Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated, because existing benchmarks cover a narrow range of document types and rely on simplified, unrealistic evaluation procedures. To address these gaps, we introduce OmniDocBench, a novel benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. OmniDocBench supports flexible, multi-level evaluations, ranging from end-to-end assessment to task-specific and attribute-based analysis using 19 layout categories and 15 attribute labels. We conduct a thorough evaluation of both pipeline-based methods and end-to-end vision-language models, revealing their strengths and weaknesses across different document types. OmniDocBench sets a new standard for fair, diverse, and fine-grained evaluation in document parsing. Dataset and code are available at https://github.com/opendatalab/OmniDocBench.

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He

Subject: Computing technology, computer technology

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations [EB/OL]. (2024-12-10) [2025-04-29]. https://arxiv.org/abs/2412.07626.
