Lemmatization of CODEA data and its use in quantitative analyses of the eñe and the silent hache
DOI:
https://doi.org/10.12795/PH.2019.v33.i01.10
Keywords:
lemmatization, Spanish old documents, eñe, silent hache.
Abstract
In this article we explain a method for lemmatizing old Spanish documents using data from «CODEA», the Corpus de Documentos Españoles Anteriores a 1800 (Sánchez-Prieto et al., 2009), and the analysis tool «LYNEAL» (Letras y Números en Análisis Lingüísticos). Our goal is to present the simplest possible lemmatization method, one that is easy to carry out and achieves a high degree of accuracy. We then present two examples of its use in the historical study of Spanish spelling: the eñe and the silent hache.
References
Ávila Muñoz, A. (1999). Léxico de frecuencia del español hablado en la Ciudad de Málaga. Málaga, España: Universidad de Málaga.
Buckley, C., Salton, G., Allen, J., y Singhal, A. (1995). Automatic query expansion using SMART. Proceedings of the TREC’3 Conference, 69-80. Gaithersburg, MD: NIST publication.
Gómez Díaz, R. (2005). La lematización en español: una aplicación para la recuperación de información. Gijón, España: Ediciones Trea.
Halliday, M. A. K. (1991). Corpus studies and probabilistic grammar. En K. Aijmer y B. Altenberg (Eds.), English corpus linguistics. Studies in honour of Jan Svartvik (pp. 30-43). London, UK: Longman.
Hockett, C. F. (1954). Two models of grammatical description. Word, 10, 210-231. https://doi.org/10.1080/00437956.1954.11659524
Marcet Rodríguez, V. J. (2010). De nuevo sobre los usos y valores de la grafía h en la escritura medieval leonesa. En M. T. Encinas Manterola et al. (Eds.). Ars longa. Diez años de Asociación de Jóvenes Investigadores de Historiografía e Historia de la Lengua Española (pp. 63-80). Salamanca, España: Universidad de Salamanca.
McEnery, T. & Hardie, A. (2012). Corpus linguistics. Cambridge, UK: Cambridge University Press. https://doi.org/10.1093/oxfordhb/9780199276349.013.0024
Moreno Sandoval, A. (2019). Lenguas y computación. Madrid, España: Editorial Síntesis.
Real Academia Española. (2010). Ortografía de la lengua española. Madrid, España: Espasa Libros.
Salvador, G. y Lodares, J. R. (2001). Historia de las letras. Madrid, España: Espasa.
Sánchez-Prieto, P., Paredes García, F. R., Martínez Sánchez, Miguel Franco, R. Simón Parra, M. y Vicente Miguel, I. (2009). El Corpus de Documentos Españoles Anteriores a 1700 (CODEA). En A. Enrique-Arias (Ed.), Diacronía de las lenguas iberorrománicas: Nuevas aportaciones desde la lingüística de corpus (pp. 25-38). Madrid/Frankfurt am Main, España/Alemania: Iberoamericana-Vervuert. https://doi.org/10.31819/9783865278685-003
Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10), 944-952. https://doi.org/10.1002/(SICI)1097-4571(1999)50:10<944::AID-ASI9>3.0.CO;2-Q
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Stubbs, M. (2007). On texts, corpora and models of language. En Hoey, E., Mahlberg, M., y Teubert, W. (Eds.), Text, discourse and corpora. Theory and analysis (pp. 127-162). New York, EEUU: Continuum.
Torrens Álvarez, M. J. (2018). Evolución e historia de la lengua española. 2a edición. Madrid, España: Arco / Libros.
Ueda, H. (2017). Unilateral correspondence analysis applied to Spanish linguistic data in time and space. Sixteenth International Conference on Methods in Dialectology. National Institute for Japanese Language and Linguistics, Tokyo, 10 August, 2017.
https://lecture.ecc.u-tokyo.ac.jp/~cueda/kenkyuchiricorrespondencecorrespondence2017.pdf
_____ (2018). Tratamiento lingüístico y matemático de textos digitales españoles. Presentación del Programa LEXIS-web. Actas del IX Congreso de la Asociación Asiática de Hispanistas (Bangkok, 2016), 617-630.
http://www.sinoele.org/images/Revistá17/monograficos/AAH_2016/AAH_2016_hiroto_ueda.pdf
License
The printed and electronic editions of this Journal are edited by the University of Seville Editorial, and the source must be cited in any partial or total reproduction.
Unless otherwise indicated, all contents of the electronic edition are distributed under the “Attribution-NonCommercial-NoDerivatives 4.0 International” license. You can view the informative version and the legal text of the license here. This must be expressly stated where necessary.
Authors who publish in this journal accept the following conditions:
- The author/s retain copyright and grant the journal the right of first publication, and accept that the work is distributed under the Creative Commons BY-NC-ND 4.0 licence, which allows third parties to use the published work provided that they credit the authorship of the work and its first publication in this journal, do not make commercial use of it, and reuse it in the same way.
- Authors can make other independent and additional contractual agreements for the non-exclusive distribution of the article published in this journal (e.g., include it in an institutional repository or publish it in a book) provided they clearly indicate that the work was published for the first time in this journal.
Authors are allowed and recommended, once the article has been published in the journal Philologia Hispalensis (online version), to download the corresponding PDF and disseminate it online (ResearchGate, Academia.edu, etc.) as it may lead to productive scientific exchanges and to a greater and faster dissemination of published work (see The Effect of Open Access).
Accepted 2019-11-06
Published 2019-12-29