Generative AI has been plagued by controversy over the last few years, with questions around copyright and ethics surrounding every model from Adobe Firefly to ChatGPT. And perhaps the most contentious issue is the unauthorised scraping of data in order to train AI.
A new leak suggests that Nvidia has flagrantly scraped data from YouTube videos and more, with internal Slack and email conversations showing concerned employees being assured that the practice had clearance from the "highest levels" of the company.
According to the report by 404 Media, Nvidia employees were "attempting to download full-length videos from a variety of sources including Netflix, but were focused on YouTube videos. Emails viewed by 404 Media show project managers discussing using 20 to 30 virtual machines in Amazon Web Services to download 80 years-worth of videos per day." What's more, the internal messages
It certainly sounds like industrial-scale content scraping, and it's already drawing the ire of notable YouTubers. The leaked messages show employees directly referencing training data from videos by tech vlogger Marques Brownlee, who took to X to share his displeasure (below).
cool cool cool cool cool cool now leaked NVIDIA slack messages discussing which YouTube channels to scrape videos from. MKBHD videos? Yeah grab those too. https://t.co/0XczvTNVBH
August 5, 2024
Of particular note here seems to be the scraping of Netflix videos. As Reddit users have pointed out, this isn't exactly free and open content. "A big corporation engaging in a piracy scheme is somehow perfectly fine, probably aided by a fancy scraping mechanisms and who knows how DRM avoidance mechanism," one Redditor comments. "Meanwhile, the common folk have to endure the full extent of the law with severe punishment if you ever dare to add 1 minute worth of Shrek footage in a youtube video essay."
Nvidia has defended its practice as being "in full compliance with the letter and spirit of copyright law" – but many are calling for laws to be changed in response to the advent of AI. "They should legally require your consent to train on your videos," one X user responds to Brownlee's post while another adds, "We really need laws to catch up the the times".
This isn't the first time a leak has revealed the true extent of data scraping by AI companies. An internal Midjourney document containing the names of over 16,000 artists emerged a few months ago, while Google's own scraping of YouTube recently caused controversy. Indeed, even those who have claimed their models are ethically managed have fallen foul of artists, with Adobe facing accusations of copyright infringement last year.