Tom’s Hardware
Dallin Grimm

Nvidia, Apple, and others allegedly trained AI using 173,000 YouTube videos — professional creators frustrated by latest AI training scandal: Report


Some of the world's wealthiest companies, including Apple and Nvidia, are among the many parties that allegedly trained their AI models on scraped YouTube videos. The YouTube transcripts were reportedly gathered through means that violate YouTube's Terms of Service, and some creators are seeing red. The story was first reported in a joint investigation by Proof News and Wired.

While major AI companies often keep their training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of "The Pile", an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data is made up of 173,536 plaintext YouTube transcripts scraped from the site, including more than 12,000 videos that have been removed since the dataset's creation in 2020.

Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and PewDiePie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to search the full set of YouTube videos allegedly used without consent.

EleutherAI is a respectably-sized force in the AI training space. The non-profit AI research lab is one of many aiming to "democratize" AI for the masses, with its website stating a goal to "ensure that the ability to study foundation models is not restricted to a handful of companies". The Pile and YouTube Subtitles datasets were created for this purpose, to provide high-quality training data to even the scrappiest of at-home AI coders. However, this idyllic dream of supporting the little guy with The Pile has become another fuel source for major corporations to train AI, rather than DIYers.

The YouTube Subtitles dataset, though, violates YouTube's Terms of Service, both by using YouTube's content without permission and by accessing the data through "automated means". In the research paper about The Pile and YouTube Subtitles, EleutherAI acknowledges the TOS violation but claims that the tools used to scrape YouTube data were already so widespread that no additional harm was caused.

Many of those affected have reacted strongly against the use of their content. Abigail Thorn, producer of YouTube channel Philosophy Tube and actress on House of the Dragon, shared on X (formerly Twitter), "When I was told about this I lay on the floor and cried, it’s so violating, it made me want to quit writing forever. The reason I got back up was because I know my audience come to my show for real connection and ideas, not cheapfake AI garbage." 

She continued, "I’d like to see YouTube do more to prevent theft like this from happening." Thorn and other YouTubers confirm that no one ever asked permission to scrape their videos or to later use them as training data.

Assigning fault is difficult because no one will accept blame or responsibility for the use of the transcripts. Apple and other major tech companies that used the training data avoid blame because they weren't the ones doing the scraping, although such companies still need to have conversations about the ethical sourcing of training data. EleutherAI, creator of the dataset, has not responded to any publications' requests for comment and rejects any wrongdoing or harm in its initial research paper on The Pile.

The tech industry is spending on AI hardware at an unhealthy rate, with the AI market needing to turn $600 billion in profit per year to keep up with its insane hardware purchasing. As companies seek to spend less on AI, cases of illicitly obtained data become more likely, like this YouTube theft and Google's Gemini reading files without permission. Before long, it may not be shocking to see web content end with "You have exceeded the GPT rate limit. Don't forget to smash that like button!"
