The compatibility of web scraping with the principles of personal data protection

Pablo Agustín Viollier Bonvin, Universidad Central de Chile, https://orcid.org/0000-0001-9893-7974

DOI: https://doi.org/10.12795/IETSCIENTIA.2025.i02.04

Keywords:

Personal data protection, Principles, Artificial intelligence, Web scraping

Abstract

This article analyzes the compatibility of web scraping with European personal data protection law, in particular the General Data Protection Regulation (GDPR) and the European Union's Artificial Intelligence Act (AI Act). Through a doctrinal and case-law study, it examines the fundamental principles of data processing and their tension with web scraping. It assesses the limits and exceptions applicable to web scraping and the role of collecting data from public sources in training artificial intelligence systems. Finally, it discusses the regulatory challenges and the gaps in the existing rules, which need to be addressed through interpretive guidance or a regulatory solution that strikes a balance between protecting the rights of data subjects and access to training data for the development of artificial intelligence models.


