'75% of web pages are AI-generated' — AI CEO explains why companies are desperate for 'real' human data

It’s no longer surprising that AI companies rely on real human data to train and improve their models — but just how much of it they use might be.

From tech giants to everyday apps, the demand for human-generated data is exploding. AI developers like OpenAI aren’t alone: businesses outside the AI space, including DoorDash, are also tapping into real-world user data to refine their systems and stay competitive.

Google, for example, uses everything from search queries to reCAPTCHA inputs to train its machine learning models — including computer vision systems. Even Niantic, the company behind Pokémon Go, has built massive datasets using photos captured by real players, feeding its AI-focused spinoff, Niantic Spatial.

Marty Pesis, the founder and CEO of Troveo, has seen this shift firsthand. His company focuses on ethically sourced, licensed video data — and in an exclusive comment to Tom’s Guide, he explains why high-quality human data has quickly become one of the most valuable resources in AI.

Real human data is becoming one of the most valuable assets in AI

The demand for real-world video, in particular, is surging. According to Pesis, AI models need more than synthetic inputs to truly understand how people behave.

“The demand for real-world video is accelerating because AI companies need grounded examples of how people actually move, behave, and interact in real environments,” he said. “Simulated and synthetic data don’t fully capture the unpredictability of real life.”

That push is already showing up in how companies collect data. DoorDash recently introduced an optional program called “DoorDash Tasks,” which pays delivery drivers to record themselves completing everyday activities. The goal is simple: give AI a better understanding of the physical world through real human behavior.

But as more companies turn to human-generated data, consent is becoming a bigger part of the conversation.

“Consent is central for two reasons,” Pesis explained. “Companies need to know they have the legal right to use the data for AI training, and they need confidence it actually came from real people.”

That second point is becoming increasingly important as AI-generated content floods the internet. Some estimates suggest nearly 75% of newly created web pages now include AI-generated material — a number that continues to rise.

So what makes human data truly valuable?

According to Pesis, it comes down to quality. “High-value training data is accurately labeled, technically consistent, and representative,” he said. In practice, that means data needs to be standardized so it can scale — and diverse enough to reflect real-world conditions, from lighting and camera angles to the many ways people actually move and interact.

Human consent for AI training should be at the heart of this growing trend

Anthropic, Apple and Superhuman (formerly Grammarly) stand out among the many companies that use the text, audio and video data produced by their human users to train AI models.

It’s easy to predict that more of the companies we use regularly will join that trend. The biggest worry is that they will do it without our consent. Here’s hoping we’ll be able to opt out of those practices as they become more common.
