Crikey
Cam Wilson

Australia’s archive of the internet is being filled up with AI-generated spam

A government-run archive of the Australian internet is unintentionally capturing thousands of web pages written by artificial intelligence.

Since the 1990s, the National Library of Australia (NLA) has been archiving the Australian internet as part of its mission to help Australians understand our social, cultural and intellectual history. This started with a select group of websites, including curated collections of news outlets like Crikey, and since 2004 has expanded to annual snapshots of websites on the .au domain. The NLA uses automated web crawlers to capture each webpage, which people can then search and watch change over the years.

One such change is the emergence of generative AI writing from products like OpenAI’s ChatGPT, which let people quickly create enormous amounts of text (along with other types of content).

These tools have already been used to produce huge amounts of spam and low-quality content across the internet: social media platforms, e-commerce sites like Amazon and other corners of the web usually written by humans now show telltale signs of AI-produced text. It’s difficult to detect when language has been written by an AI model, but researchers suggest that a “shocking” amount of the web now appears to be produced by AI.

The NLA’s Australian Web Archive (AWA) is not immune. Searches in the library’s online research platform, Trove, return hundreds of hits for variations on phrases like “As an AI language model” and “as of my last knowledge update”, both common ChatGPT outputs that humans do not typically write.
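
These searches rely on nothing more sophisticated than matching those telltale strings. Here is a minimal sketch of that kind of phrase matching in Python, assuming only a hard-coded phrase list; the list, function name and sample text are illustrative, not part of Trove or the NLA’s tooling:

```python
# Phrases that ChatGPT commonly emits but humans rarely write verbatim.
# The list below is illustrative; a real search would use many more variations.
TELLTALE_PHRASES = [
    "as an ai language model",
    "as of my last knowledge update",
]

def find_telltale_phrases(text: str) -> list[str]:
    """Return any telltale phrases found in a page's text, ignoring case."""
    lowered = text.lower()
    return [phrase for phrase in TELLTALE_PHRASES if phrase in lowered]

if __name__ == "__main__":
    # Hypothetical page text for demonstration purposes only.
    sample = (
        "As an AI language model, I can't offer a personal opinion, "
        "but here are some factors to weigh up before buying a pickup truck."
    )
    matches = find_telltale_phrases(sample)
    if matches:
        print("Possible AI-generated text; matched:", matches)
```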

In a handful of circumstances, these web pages appear to be written by humans but contain disclosed excerpts of ChatGPT language — like a page from an Australian IT company showing off what an AI chatbot can do.

But many of the cases appear to be undeclared, spammy Australian websites that have been automatically written by ChatGPT. Some cover subjects incongruous with the business, like an article titled “Do You Need A Pickup Truck” on a wedding planner’s website.

While capturing an accurate snapshot of the Australian internet in 2024 would include AI content, there are concerns that its inclusion may also affect our ability to understand our own culture.

Earlier this year, tech outlet 404 Media reported that completely AI-generated books were being indexed by Google Books, and that this could impact language research based on the platform’s data. 

An NLA spokesperson didn’t deny that Trove is now capturing AI content.

“Our role is to comprehensively collect these publications without making any judgement on content,” they told Crikey in an email.

The spokesperson said the archive does a biannual harvest of Australian websites as per its legal obligations.

“The AWA is made available for researchers to interrogate according to their own standards and requirements. In the event that there are complaints relating to copyright, privacy or defamation, the National Library has a comprehensive takedown policy in place,” they said.

ARC Centre of Excellence for Automated Decision-Making and Society research fellow Kieran Hegarty said that this is a problem that’s bigger than just the NLA’s archive.

He said that there needs to be a change in the way people think about library collections. While some people consider archiving an item as a signal that it is valuable or prestigious, Hegarty said this kind of automatic collection is just about capturing a full snapshot of a moment in time. 

“We need to change our view of library collections for what they do, which is to collect a comprehensive record of Australia’s public output, books, but also content on the web,” he said. 

While capturing AI-generated content might be part of the scope of the NLA’s archiving, it does present a logistical problem due to the increasing size of each year’s snapshot.

In 1995, the NLA captured just 5,150 web pages, taking up 261 megabytes of storage. Its 2022 snapshot captured more than two trillion web pages, which took up 151 terabytes, more than 577,000 times as much storage. Hegarty said that if the glut of AI-produced content continues to grow, the NLA might have to reconsider its archiving strategies to keep costs down, something that is particularly pertinent given the institution’s past funding struggles.

“What we’re talking about is an explosion of [AI] content as part of capturing a snapshot of Australia. These are some very hard decisions,” he said.
