Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Abstract

This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, then passes that text through a pipeline built on large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules provide sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (regular expressions) to improve document comprehension. Delivered through an accessible Gradio interface, the work demonstrates a practical application of existing libraries, models, and APIs to narrow the language gap and improve access to information in image media across different linguistic environments.
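
The stage order described in the abstract (OCR, translation, summarization, re-translation, plus date extraction by regular expression) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `extract_dates` and `run_pipeline`, the date patterns, and the choice to run date extraction on the English text are all assumptions, and the OCR, translation, and summarization stages are passed in as plain callables so that no Tesseract or Gemini call signatures are implied.

```python
import re
from typing import Callable, Dict, List

# Hypothetical date patterns; the abstract does not say which regexes are used.
DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"                      # ISO style, e.g. 2025-05-16
    r"|\d{1,2}/\d{1,2}/\d{4}"                 # slash style, e.g. 16/05/2025
    r"|\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[a-z]*\.?\s+\d{4}"                      # day month year, e.g. 4 June 2025
    r")\b",
    re.IGNORECASE,
)


def extract_dates(text: str) -> List[str]:
    """Return every date-like substring, in order of appearance."""
    return DATE_PATTERN.findall(text)


def run_pipeline(
    image_path: str,
    ocr: Callable[[str], str],             # image path -> extracted text
    translate: Callable[[str, str], str],  # (text, language code) -> text
    summarize: Callable[[str], str],       # text -> shorter text
    target_lang: str,
) -> Dict[str, object]:
    """Chain the stages: OCR, translate to English, summarize, re-translate."""
    raw_text = ocr(image_path)
    english = translate(raw_text, "en")
    summary = summarize(english)
    return {
        "summary": translate(summary, target_lang),
        # Assumption: dates are pulled from the English intermediate text.
        "dates": extract_dates(english),
    }
```

In a real deployment the `ocr` callable might wrap a Tesseract binding and the `translate` and `summarize` callables might wrap LLM API requests; injecting them keeps the sketch runnable with simple stand-in functions.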

Authors: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, Dhananjay Bhagat

Subject classification: Indo-European languages; South Indian language families (Dravidian languages)

Recommended citation: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, Dhananjay Bhagat. Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline [EB/OL]. (2025-05-16) [2025-06-04]. https://arxiv.org/abs/2505.11177.
