Nvidia is being accused of scraping millions of videos online to train its own AI products. Sources say the videos weren’t intended just for research but were meant to be used in the company’s products, including its Omniverse 3D world generator, self-driving car systems, and its Digital Humans avatar generator. The reports allegedly came from an anonymous former Nvidia employee who shared the data with 404 Media.
According to the outlet, several employees were instructed to download videos to train Nvidia’s AI. Many raised concerns about the legality and ethics of the move, but project managers consistently reassured them that the work had been approved. Ming-Yu Liu, vice president of Research at Nvidia, allegedly responded to one question with, “This is an executive decision. We have an umbrella approval for all of the data.”
It isn’t the first time an AI tech company has been accused of scraping online content without permission. Several lawsuits have been filed against AI companies like OpenAI, Stability AI, Midjourney, DeviantArt, and Runway. Nvidia isn’t affected at the moment, as it’s primarily known for supplying AI chips to data centers, a business that helped make it one of the most valuable companies in the world.
However, it seems that Nvidia also wants to get into the data processing game by creating foundational AI models that other companies can build upon. To gain an edge in the highly competitive AI market, Nvidia is allegedly aiming to train its systems on a massive library of online video data.
“We are finalizing the v1 data pipeline and securing the necessary computing resources to build a video data factory that can yield a human lifetime visual experience worth of training data per day,” said Liu in an email.
Some sources report that Nvidia used publicly available videos, data licensed exclusively for non-commercial research, YouTube videos, and even movies and shows from Netflix. It is even suggested that the company would have someone watch movies while recording them from Netflix with screen-capture software, although we cannot ascertain whether this was a joke. “We should get a lot of high-quality face videos from this,” adds Liu.
The Nvidia team working on AI training also reportedly discussed capturing gameplay video and tapping the GeForce Now team to help obtain it. However, Jim Fan, a senior research scientist at Nvidia, said, “We don’t have yet have statistics or video files yet, because the infras [sic] is not yet set up to capture lots of live game videos & actions. They’re both engineering & regulatory hurdles to hop through. But we will add cleaned & processed GFN (GeForce Now) data to team-vfm as soon (as) they arrive.”
404 Media says the AI project, dubbed Cosmos, started in February 2024. By March, the team had downloaded 100,000 videos, and in May, an email said that they had compiled 38.5 million URLs, with almost 40% of them coming from cinematic videos.
It’s unclear how far the Cosmos project extends within Nvidia, but 404 Media has quoted Nvidia CEO Jensen Huang responding to an email about it with, “Great update. Many companies have to build video FM [foundational models]. We can offer a fully accelerated pipeline.”
Nvidia is likely rushing to build its model while copyright and other AI training issues have yet to be settled, leaving a massive legal gray area. At the moment, there is no specific law that deals with AI training, but legislators have already taken notice. Several bills in Congress specifically tackle this, like the AI Foundation Model Transparency Act and the Generative AI Copyright Disclosure Act.
Google argues that AI scraping is fair use, but we don’t know where these proposed laws will take us. So, while nothing is yet in black and white, many companies want to get the most out of online data to gain a leg up on the competition.