Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

This paper introduces a framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora, and applies it to a new dataset of 19th-century Latin American press texts.

Year: 2024
Venue: arXiv 2024
ArXiv: arxiv.org/abs/2407.12838
Authors: 4
Hosting: Abstract only

Attribution

Abstract & full text: arxiv.org/abs/2407.12838v2
TL;DR: Semantic Scholar

Abstract

This paper presents two contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a gap in specialized corpora for historical and linguistic analysis of the region. Second, it develops a flexible, semi-automated framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. The framework is adaptable to various contexts and datasets, and is applied here to the newly created dataset.

Authors

Tony Montes, Laura Manrique-Gómez, Rubén Manrique, Arturo Rodríguez-Herrera