CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality

Source: arXiv
Abstract

Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors, particularly hyphenation issues, in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and the output probabilities of a model trained with the CTC loss. Our approach improves recognition performance (e.g., a CER improvement of 1.1 percentage points with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
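
The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of generic CTC forced alignment by dynamic programming (Viterbi), not the authors' implementation. It assumes a single text line, per-frame log-probabilities from some CTC-trained model, and integer labels with index 0 as the blank; the function name `ctc_forced_align` and the toy inputs are hypothetical.

```python
import math

NEG_INF = float("-inf")


def ctc_forced_align(log_probs, target, blank=0):
    """Return the most likely CTC path (one label per frame) that collapses to `target`.

    log_probs: list of T frames, each a list of per-class log-probabilities.
    target:    list of integer labels (without blanks).
    """
    T = len(log_probs)
    if T == 0:
        raise ValueError("no frames to align")

    # Extended label sequence: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first label.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # between two different labels.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == NEG_INF:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best

    # A path must end in the trailing blank or in the last label.
    finals = [S - 1] if S == 1 else [S - 1, S - 2]
    s = max(finals, key=lambda p: score[T - 1][p])
    if score[T - 1][s] == NEG_INF:
        raise ValueError("target is too long to align to this many frames")

    # Backtrack from the last frame to recover the label emitted at each frame.
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


# Toy example: 3 classes (0 = blank, 1 = "a", 2 = "b"), 5 frames.
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in probs]
path = ctc_forced_align(toy_log_probs, [1, 2])
print(path)  # [0, 1, 1, 2, 0]
```

The frame indices at which each non-blank label first appears in the returned path give approximate character boundaries. Matching a full transcription to line images, as described in the abstract, would additionally require deciding where the transcription breaks across lines, which this sketch does not attempt.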

Marco Peer, Anna Scius-Bertrand, Andreas Fischer

Computing technology, computer technology

Marco Peer, Anna Scius-Bertrand, Andreas Fischer. CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality [EB/OL]. (2025-08-11) [2025-08-24]. https://arxiv.org/abs/2508.07904.
