Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity
Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed because of their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognition for the multimodal LLMs, whereas a conventional OCR-specific model exhibits no such correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.
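To make the resolution-controlled setup concrete, the following is a minimal sketch (not the authors' code) of how single-character images could be rendered at different ppi values and passed to an OCR backend. The font file path and the use of pytesseract as the recognition backend are assumptions for illustration; a multimodal LLM API call could be substituted at the recognition step.

```python
# Minimal sketch: render a single character at a fixed point size but varying
# resolution (ppi), then check whether an OCR backend recovers the character.
from PIL import Image, ImageDraw, ImageFont

def render_char(char: str, ppi: int, size_pt: float = 10.5,
                font_path: str = "font.ttf") -> Image.Image:
    """Render one character on a white canvas whose pixel dimensions follow
    from the requested resolution (1 pt = 1/72 inch)."""
    px = max(1, round(size_pt * ppi / 72))          # glyph height in pixels
    canvas = Image.new("L", (2 * px, 2 * px), 255)  # grayscale, white background, margin
    font = ImageFont.truetype(font_path, px)        # assumed local font file
    ImageDraw.Draw(canvas).text((px // 2, px // 2), char, fill=0, font=font)
    return canvas

if __name__ == "__main__":
    import pytesseract  # stand-in OCR backend; replace with a multimodal LLM call

    for ppi in (75, 150, 300):
        img = render_char("驚", ppi)
        # --psm 10 treats the image as a single character (context-independent)
        pred = pytesseract.image_to_string(img, config="--psm 10").strip()
        print(f"{ppi} ppi -> predicted {pred!r}")
```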
Kotaro Inoue
Computing Technology, Computer Technology
Kotaro Inoue. Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity [EB/OL]. (2025-03-30) [2025-05-14]. https://arxiv.org/abs/2503.23667.