Fresh fears have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations exposed the fascist, pirated and malicious sources from which the data is harvested.
One such dataset is the Colossal Clean Crawled Corpus, or C4, assembled by Google from more than 15m websites and used to train both the search engine’s LaMDA AI and Meta’s GPT competitor, LLaMA.
The dataset is public, but its scale has made it difficult to examine the contents: it is supposedly a “clean” version of a more expansive dataset, Common Crawl, with “noisy” content, offensive language and racist slurs removed from the material.
But an investigation by the Washington Post reveals that C4’s “cleanliness” is only skin deep. While it draws on websites such as the Guardian – which makes up 0.05% of the entire dataset – and Wikipedia, as well as large databases such as Google Patents and the scientific journal hub PLOS, it also contains less reputable sites.
The white nationalist site VDARE is in the database, one of the 1,000 largest sites, as is the far-right news site Breitbart. The Russian state-backed propaganda site RT is one of the 100 largest providers of training data to the C4 corpus.
Few of the sites gave explicit consent to be included, although Common Crawl, the non-profit organisation that assembled the scraped data, says it respects requests to be left out of its crawls. Some, however, push the limits of fair use: b-ok.org, formerly known as Bookzz, was a vast repository of pirated ebooks until it was seized by the FBI in 2022. Despite that, the site’s contents remain in the C4 database.
Such vast collections of data are important to AI creation, because the large language models (LLMs) that underpin tools such as ChatGPT need huge datasets to improve.
Assembling the hundreds of gigabytes of text needed to train such a model from explicitly licensed sources would be a difficult task, and many AI researchers choose to ask for forgiveness rather than permission, arguing that their creations are covered by “fair use” defences to copyright.
Some even choose to forgo the “cleaning” Google applied to its dataset, in order to access even more data for their systems to learn from. The London-based Stability AI released its new LLM, StableLM, on Wednesday, trained on the Pile, an 850GB dataset that includes the entire, uncleaned Common Crawl database, as well as 2m pirated ebooks from the BitTorrent site Bibliotik, 100GB of data scraped from the coding site GitHub, and more esoteric sources, such as every internal email sent by the now-defunct energy company Enron and the entire proceedings of the European parliament.
The Pile is hosted publicly by a group of anonymous “data enthusiasts” called the Eye, whose copyright takedown policy links to a video of a choir of clothed women pretending to masturbate imaginary penises while singing.
The version used by Stability, which is currently private, is “three times larger”, the company says. It has released no further details about the extra content of that dataset, which, it says, “gives StableLM surprisingly high performance in conversational and coding tasks”.
“We open-source our models to promote transparency and foster trust,” Stability said. “Researchers can ‘look under the hood’ to verify performance, work on interpretability techniques, identify potential risks, and help develop safeguards.
“Organisations across the public and private sectors can adapt (‘fine-tune’) these open-source models for their own applications without sharing their sensitive data or giving up control of their AI capabilities.”
Google was approached for comment.