To address the growing problem of AI developers scraping its platform with automated scripts, Wikipedia is exploring new measures. The Wikimedia Foundation recently announced a partnership with Kaggle, the Google-owned data science community, to release a dataset optimized specifically for training artificial intelligence models.
The beta dataset contains structured Wikipedia content in English and French. The Wikimedia Foundation emphasized that it was built with machine learning workflows in mind, giving AI developers easier access to machine-readable article data for model training, fine-tuning, benchmarking, alignment research, and other analyses.
Released under an open license, the dataset includes research summaries, short descriptions, image links, infobox data, and article sections, current as of April 15; it excludes references and non-text media such as audio files. The foundation noted that this "structured JSON representation of Wikipedia content" offers a more appealing alternative to scraping or parsing raw article text, easing the server strain caused by AI bots' continuous bandwidth consumption.
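To illustrate why a structured release is easier to consume than scraped HTML: a JSON Lines file of article records can be read with a few lines of Python. This is only a sketch; the file name and field names below ("name", "abstract", "sections") are assumptions for illustration, not the dataset's confirmed schema.

```python
import json

# Assumed local file name; the actual Kaggle download may be named differently.
DATASET_PATH = "enwiki_structured_contents.jsonl"

def iter_articles(path):
    """Yield one structured-article record per JSON line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for article in iter_articles(DATASET_PATH):
    # Field names here are illustrative guesses at the beta schema.
    title = article.get("name", "")
    abstract = article.get("abstract", "")
    sections = article.get("sections", [])
    print(f"{title}: {len(sections)} sections, abstract of {len(abstract)} chars")
    break  # inspect only the first record
```

Because each record already separates abstract, infobox, and section data, a developer can filter or tokenize fields directly instead of parsing wiki markup or rendered pages.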
The Wikimedia Foundation had already established content-sharing agreements with Google and the Internet Archive; the Kaggle partnership extends that access to small businesses and independent data scientists. Brenda Flynn, Head of Partnerships at Kaggle, said that as an essential tool and testing ground for the machine learning community, Kaggle is honored to host the Wikimedia Foundation's data and is committed to keeping it accessible, available, and useful.