Revolutionizing Access to Library Heritage: HTR Systems between Digital Humanities and Information Science

DOI:

https://doi.org/10.12795/PH.2024.v38.i02.03

Keywords:

Handwritten Text Recognition (HTR), general models, Progetto Mambrino, information science, digital scholarly edition

Abstract

This article surveys recent developments in the automatic transcription of historical printed documents and manuscripts with Handwritten Text Recognition (HTR) systems, focusing on the recent creation of general HTR models. First, it explains the main characteristics of the most widely used tools and the workflow for generating text recognition models. Second, it presents a significant sample of the models currently available, with emphasis on the production process, the criteria adopted and the evaluation of the results, drawing on the experience gained by the Progetto Mambrino research group at the University of Verona. Finally, it outlines future research directions for creating and disseminating these resources, emphasizing the need for greater synergy between academia, computer experts and memory institutions.

Published

2024-12-04

How to Cite

Bazzaco, S. (2024). Revolutionizing Access to Library Heritage: HTR Systems between Digital Humanities and Information Science. Philologia Hispalensis, 38(2), 59–77. https://doi.org/10.12795/PH.2024.v38.i02.03

Section

Monographic Section
Received 2024-02-01
Accepted 2024-02-12