Windows Central

Technology

Kevin Okemwa

Sam Altman indicated it's impossible to create ChatGPT without copyrighted material, but a new study claims 57% of the content on the internet is AI-generated and is subtly killing quality search results

What you need to know

A new study suggests that more than 57% of the content available on the internet is AI-generated.
AI tools like Copilot and ChatGPT depend on information from the internet for training, but the spread of AI-generated content across the internet degrades that training data, leading to inaccurate responses and misinformation.
If copyright law prohibits training AI models using copyrighted content, the responses generated using chatbots will likely worsen and become more inaccurate.

With the rapid adoption of generative AI, it's becoming increasingly difficult to tell what's real. From images and videos to text, AI tools are arguably at their peak and can generate sophisticated outputs based on prompts.

Publishers and the companies behind these AI tools have been locked in a constant battle over copyright infringement. While OpenAI CEO Sam Altman admits it's impossible to create tools like ChatGPT without copyrighted content, copyright law doesn't prohibit using that content to train AI models.

A new study published in Nature suggests 57% of content published online is AI-generated (via Forbes). Researchers from Cambridge and Oxford claim that the growing volume of AI-generated content, combined with AI tools' overreliance on that same content, can lead to only one result: low-quality responses to queries.

Per the study, the AI-generated responses to queries degraded in value and accuracy after every attempt. According to Dr. Ilia Shumailov from the University of Oxford:

“It is surprising how fast model collapse kicks in and how elusive it can be. At first, it affects minority data—data that is badly represented. It then affects diversity of the outputs and the variance reduces. Sometimes, you observe small improvement for the majority data, that hides away the degradation in performance on minority data. Model collapse can have serious consequences.”

According to the researchers, the degradation in the quality of chatbot responses stems from a cyclical overdose of AI-generated content. AI models depend on information from the internet for training. If that information is itself AI-generated and inaccurate, the training becomes ineffective and the models produce wrong answers and misinformation.
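
The feedback loop described above can be illustrated with a toy sketch. This is not the researchers' setup: it assumes the "model" is nothing more than a bell curve described by a mean and a spread, and the function name `recursive_fit` and its parameters are made up for illustration. Each round, the model generates a small batch of samples and is then refitted to that batch alone.

```python
import random
import statistics


def recursive_fit(generations, sample_size, seed=0):
    """Refit a toy Gaussian 'model' to its own samples, round after round.

    Returns the list of fitted standard deviations, starting with the
    original spread of 1.0. Hypothetical illustration only.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # the original, "human-written" distribution
    stds = [std]
    for _ in range(generations):
        # Generate content from the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Retrain on that content only: refit mean and spread to the samples.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        stds.append(std)
    return stds


if __name__ == "__main__":
    stds = recursive_fit(generations=200, sample_size=20)
    for gen in (0, 50, 100, 200):
        print(f"generation {gen}: spread = {stds[gen]:.4f}")
```

Because every small batch under-represents the tails of the curve it was drawn from, the fitted spread tends to drift downward over many rounds, which is one simple way to picture the loss of variance Shumailov describes. Individual runs fluctuate, so the exact numbers depend on the seed and the batch size.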

AI chatbots are lying to themselves

The researchers dug deeper to uncover the root cause of the issue. At first glance, it can be attributed to the growing number of AI-generated articles published online without fact-checking. To test this, the team took a pre-trained, AI-powered wiki and retrained it on its own outputs. They immediately noticed a decline in the quality of the information the tool generated.

The study further highlights that the AI tool dropped rare dog breeds from its knowledge after repeated training cycles, despite having initially been trained on a wide library of information about dog breeds.
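
The dog-breed effect can be mimicked with a minimal sketch, again under loud assumptions: the "model" here is just a table of how often each breed gets mentioned, the breed shares are invented, and `retrain_on_own_output` is a hypothetical helper, not anything from the study. Each generation, the model produces a batch of mentions and is then rebuilt from the frequencies in that batch.

```python
import random
from collections import Counter


def retrain_on_own_output(breed_shares, sample_size, generations, seed=0):
    """Repeatedly rebuild a categorical 'model' from its own outputs.

    breed_shares maps breed name -> starting probability (made-up data).
    Returns (final_shares, surviving), where surviving[i] is the set of
    breeds the model can still produce after i generations.
    """
    rng = random.Random(seed)
    shares = dict(breed_shares)
    surviving = [{b for b, p in shares.items() if p > 0}]
    for _ in range(generations):
        live = [b for b, p in shares.items() if p > 0]
        # Generate: draw a batch of mentions from the current model.
        outputs = rng.choices(live, weights=[shares[b] for b in live], k=sample_size)
        # Retrain: the next model is the frequency of each breed in that batch.
        counts = Counter(outputs)
        shares = {b: counts[b] / sample_size for b in shares}
        surviving.append({b for b, p in shares.items() if p > 0})
    return shares, surviving


if __name__ == "__main__":
    # Invented starting shares: two common breeds and two rare ones.
    start = {"labrador": 0.60, "poodle": 0.30, "otterhound": 0.07, "azawakh": 0.03}
    final, surviving = retrain_on_own_output(start, sample_size=50, generations=30)
    for gen in (0, 10, 20, 30):
        print(gen, sorted(surviving[gen]))
```

The key property is that a breed that misses a single batch falls to zero probability and can never come back, since the next model only knows what the previous one produced. Rare breeds are the most likely to miss a batch, so they are typically the first to go, though which ones vanish and when depends on the seed and the batch size.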

As a result, the quality of search results will likely worsen as AI becomes more prevalent and more AI-generated content is published online.