Leading AI labs and big tech companies have been accused of using captions from tens of thousands of YouTube videos without permission to train artificial intelligence models.
Google has strict rules in place banning the harvesting of material from YouTube without permission. A new investigation by Proof News found Apple, Nvidia and Anthropic were among those using the subtitles from more than 170,000 videos.
The captions were part of 'the Pile', a massive dataset compiled by non-profit EleutherAI. The dataset was originally intended to give smaller companies and individuals a quick way to train their models, but big tech and AI companies have also adopted this vast reservoir of information.
While Apple, Nvidia and Anthropic didn’t directly scrape the YouTube videos themselves, the AI models they operate, including Claude and Apple Intelligence, were trained on the information because they used ‘the Pile’ as a source.
A thirst for data
Several studies have now found that two things are essential in making more advanced AI models — data and computing power.
Increasing one or both leads to better responses, improved performance and scale. But data is an increasingly scarce and expensive commodity.
Companies like OpenAI and Google have a combination of their own massive data repositories and deals with major publishing companies or Reddit.
Meta has Facebook, Instagram, Threads and WhatsApp, although it is facing pushback from users. Apple has a vast amount of user data but its own privacy policies make this less useful in initial model training.
This lack of available data is leading companies to look for new sources of information to train next-generation models. Not all of those sources are willing to part with their data, or even aware that the information they’re creating is being used to train AI.
There are several lawsuits underway against AI image and music generation companies over whether using copyrighted material as training data qualifies as fair use.
What went wrong?
While Apple and Anthropic are not directly responsible for the use of these YouTube captions in their model training dataset, the inclusion does raise questions about data provenance and just how hard big tech is checking when assessing rights.
It wasn’t just videos from small creators that were included. The BBC, NPR, Wall Street Journal, MrBeast and Marques Brownlee all had videos in the dataset.
A total of 173,536 videos from 48,000 channels were in the YouTube Subtitles dataset. Some of the videos included conspiracy theories and parody, which could impact the integrity of the final model.
This isn't the first time YouTube has been at the center of an AI training data controversy. OpenAI CTO Mira Murati was unable to confirm or deny whether YouTube was used in training its advanced but as yet unreleased AI video model Sora.
Speaking to Wired, Dave Wiskus, CEO of Nebula, described it as "theft" and "disrespectful" to use data without consent, especially as studios are already using generative AI to "replace as many of the artists" as they can.
Anthropic said in a statement to Ars Technica that the Pile includes just a small subset of YouTube subtitles and that YouTube's terms only cover direct use of its platform, which it says is distinct from the use of the Pile dataset. "On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors."
What will happen to the AI?
Google says it has taken action over the years to prevent abuse but has given no additional detail on what that action might be, or even on whether this use violates its terms.
However, Google isn't entirely blameless, having been caught out scanning user documents saved in Google Drive with its Gemini AI even when the user hasn't given permission.
Creators are annoyed at the discovery, but with the question of data provenance and copyright in model training still very much up for debate, their only likely recourse is for Google to decide that the practice violates the YouTube terms.
This instance of potential misuse of data will likely be bundled into the wider story of whether the use of training data falls under fair use or requires specific licensing. I suspect we won't get a final decision on that for years.