Revolutionizing Access to Library Heritage: HTR Systems between Digital Humanities and Information Science
DOI:
https://doi.org/10.12795/PH.2024.v38.i02.03
Keywords:
Handwritten Text Recognition (HTR), general models, Progetto Mambrino, information science, digital scholarly edition
Abstract
This article surveys the state of the art in the automatic transcription of historical printed documents and manuscripts with Handwritten Text Recognition (HTR) systems, focusing primarily on the recent creation of general HTR models. First, it explains the main characteristics of the most widely used tools and the workflow for generating text recognition models. Second, it presents a significant sample of the models currently available, with emphasis on the production process, the criteria adopted, and the evaluation of results, drawing on the experience gained by the Progetto Mambrino research group at the University of Verona. Finally, it outlines future research directions for the creation and dissemination of these resources, emphasizing the need for greater synergy among academia, computer experts, and memory institutions.
License
Copyright (c) 2024 Stefano Bazzaco
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
The printed and electronic editions of this Journal are edited by the University of Seville Editorial, and the source must be cited in any partial or total reproduction.
Unless otherwise indicated, all the contents of the electronic edition are distributed under an “Attribution-NonCommercial-NoDerivatives 4.0 International” license. This must be expressly stated where necessary.
Authors who publish in this journal accept the following conditions:
- The author(s) retain copyright and grant the journal the right of first publication, and accept that the work is distributed under the Creative Commons BY-NC-ND 4.0 licence, which allows third parties to use what is published provided that they credit the authorship of the work and its first publication in this journal, make no commercial use of it, and reuse it under the same terms.
- Authors can make other independent and additional contractual agreements for the non-exclusive distribution of the article published in this journal (e.g., include it in an institutional repository or publish it in a book) provided they clearly indicate that the work was published for the first time in this journal.
Authors are allowed and recommended, once the article has been published in the journal Philologia Hispalensis (online version), to download the corresponding PDF and disseminate it online (ResearchGate, Academia.edu, etc.) as it may lead to productive scientific exchanges and to a greater and faster dissemination of published work (see The Effect of Open Access).
Accepted 2024-02-12
Published 2024-12-04