More than a decade before ChatGPT went live, the World Economic Forum classified personal data as a new asset class. For years, tech companies have collected their users' data, treating it as one of their most valuable commodities.
And indeed, it has proven to be so. Personal user data informs the AI-powered recommendation algorithms that run on seemingly every platform, from Instagram to Netflix. That same data is the fuel behind targeted ads, which in extreme cases can mean seeing an ad for a company shortly after talking about it offline.
In short, data scraping is nothing new. But the emergence of generative AI has presented that scraped data, in new packaging, of course, back to the very people it was scraped from, Quivr co-founder Joe Miller said in an interview with TheStreet, raising the issues of data privacy and data scraping in a new way.
"As long as the internet's been around, people have been posting public data, and it's been being scraped. I think people just haven't really realized because there's no feedback to you," Miller said.
Historically, the most visible result of data scraping was little more than a personalized ad. With large language models (LLMs) like ChatGPT, the scraped data itself has become visible in the output.
"Your words can be taken and used in a context that you did not intend. This is a reputation risk that people haven't really faced before," Miller said. "Now, you are tied to the output of that thing."
Miller's second concern with data scraping in the world of LLMs involves leaks of personally identifiable information (PII). When AI companies scrape the internet for every shred of available information to train their enormous models, publicly shared personal information can get caught up in those models. That information can then resurface in the models' output; an AI chatbot asked to provide an example of a good wedding invitation, Miller said, might drop a real person's address into its response.
The upside to this risk, for Miller, is that this new understanding of data scraping provides an inflection point for people to reassess how they engage with the internet.
"People should have never been posting that stuff to begin with. It has been scraped all along, and they just haven't had to face the consequences so directly," Miller said. "The reality is that people need to take more personal responsibility for protecting their identity than they have in the past."
While Miller thinks there need to be consequences for the scraping and use of PII, he noted that it is often difficult to know when your data has been used in a given output. The result is a kind of "perfect crime" conducted by the AI companies.
Where this really starts, he said, is with people understanding and caring about their digital footprints, so they can defend themselves against these leaks and reputational risks.
"I don't really see a regulatory system in the future where [the scraping of public websites] becomes illegal," Miller said. "But I do think that you can have things that are like, if you do scrape data, it has to be scrubbed of PII."
Open source versus centralized models
Regulation, Miller maintained, is not the answer. It's an answer, but it doesn't solve the root of the problem.
The issue for Miller comes down to open-source versus centralized models. Something like ChatGPT is closed source, so outside researchers know little about how it was built or what data it was trained on. Open-source models, however, aren't as easily controlled, which makes regulation more difficult.
"The open-source LLM 's are going to take over and then who are you going to sue? OpenAI is sort of a red herring, it's not the root of the problem," he said. "The root of the problem is there's public data on the internet and we now have a system that it's trivial to get that data and train an LLM on it."
The right thing to do, he said, is to answer the question: "Who is generating the data?"
"That's us," Miller said, "and we need to take personal responsibility for that."
Still, the debate over open versus closed source is a nuanced one. AI expert and researcher Dr. Sasha Luccioni noted that the definition of open source revolves around providing all of the under-the-hood information about the model.
The important distinction here, then, is between open-source and open-access models such as Meta's LLaMA, which doesn't provide all of its under-the-hood information yet does allow for open access.
"In the case of closed-source models, however, there is no information at all provided regarding the data anyway, so I don't see how they would be easier to regulate," she said.
How to clean up your digital footprint
In terms of addressing that root issue of a cleaner digital footprint, Miller said it really boils down to accepting that people who post on the internet will not be able to protect their public data.
With that in mind, the solution involves heightened caution and thoughtfulness when it comes to posting on social media and providing information to websites.
"In an LLM space, my identity has been embedded in that vector space," Miller said. "What sorts of things am I saying? Am I going to be okay with people taking this information? Am I exposing information I don't want out there, or I don't want to be associated with?"
In that mission to reclaim digital identities, Miller co-founded a social media service called Quivr, intended to provide a holistic place where people can present themselves authentically. Quivr users maintain full ownership of their data, which is protected, and can then monetize it, a marked difference from the current environment, in which people's data is scraped as the cost of admission for using a given website or application.
"What we're saying is go verify who you are, own that aspect of your digital identity and then you can monetize it," Miller said. "And now that can become actually a useful aspect of the future."