Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline
This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then passes that text through a pipeline built on large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules provide sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (regular expressions) to improve document comprehension. Delivered through an accessible Gradio interface, the work demonstrates a real-world application of libraries, models, and APIs to narrow the language gap and improve access to information in image media across different linguistic environments.
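The stage order described in the abstract (OCR, translation, abstractive summarization, re-translation, plus regex-based date extraction) can be sketched as below. This is only an illustrative outline, not the authors' implementation: the function names, the `PipelineResult` type, the date pattern, and the use of English as an intermediate language are all assumptions. The OCR, translation, and summarization steps are passed in as callables so the sketch does not depend on any particular Tesseract or Gemini client.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical date pattern (an assumption, not taken from the paper):
# matches forms such as 16/05/2025, 16-05-2025, and 2025-05-16.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b"
)


@dataclass
class PipelineResult:
    extracted_text: str  # raw OCR output
    summary: str         # summary re-translated into the target language
    dates: List[str]     # date strings found in the OCR output


def extract_dates(text: str) -> List[str]:
    """Return all substrings of `text` that match DATE_PATTERN, in order."""
    return DATE_PATTERN.findall(text)


def run_pipeline(
    image_path: str,
    ocr: Callable[[str], str],
    translate: Callable[[str, str], str],
    summarize: Callable[[str], str],
    pivot_lang: str = "en",   # assumed intermediate language
    target_lang: str = "hi",  # assumed output language
) -> PipelineResult:
    """Run OCR, translate to a pivot language, summarize, then re-translate."""
    text = ocr(image_path)
    pivot_text = translate(text, pivot_lang)
    pivot_summary = summarize(pivot_text)
    summary = translate(pivot_summary, target_lang)
    return PipelineResult(text, summary, extract_dates(text))
```

In a real deployment the `ocr` callable might wrap something like `pytesseract.image_to_string(Image.open(path), lang="hin")`, and `translate` and `summarize` might wrap prompts sent to an LLM API; those wrappers are left out here because the abstract does not specify them.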
Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, Dhananjay Bhagat
Indo-European languages; South Indian languages (Dravidian language family)
Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, Dhananjay Bhagat. Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline [EB/OL]. (2025-05-16) [2025-06-04]. https://arxiv.org/abs/2505.11177.