Tom’s Hardware
Technology
Jowi Morales

News outlets are blocking Wayback Machine from archiving their pages — 23 outlets concerned AI companies might abuse fair use and use it to train their models


Many news outlets are reportedly blocking the Wayback Machine from archiving their pages, apparently because they fear that AI companies will abuse fair use policies and train their models on snapshots of old articles. This risks reducing society’s collective access to historical news stories and other critical information, especially in an age when misinformation is abundant and AI large language models (LLMs) hallucinate convincing answers. Wired reports that 23 major publications, including USA Today and The New York Times, currently block ia-archiverbot, the Internet Archive’s commonly used crawler. Ironically, Wired pointed out that some of these outlets use the Wayback Machine in their own reporting.

Many libraries and newspaper offices used to keep rich repositories of archived volumes, which people consulted for insight into the historical record. But as the world abandoned print journalism for the convenience of online newspapers, these archives stopped being updated; we must now rely on online archiving services like the Wayback Machine to serve as the modern historical record.

There has been some pushback from publications regarding archiving, but the legal system has established that what the Internet Archive is doing is legal and part of fair use. “Courts have long recognized it’s often impossible to build a searchable index without making copies of the underlying material,” the Electronic Frontier Foundation said. It also added, “The copying served a transformative purpose: enabling discovery, research, and new insights about creative works.”

A Wayback Machine snapshot of the Tom's Hardware homepage from 1997 (Image credit: Tom's Hardware/Wayback Machine)

It could be argued that newspapers and other publications could handle their own archiving, but it’s in the public interest that a neutral third party handle record-keeping. After all, it’s easy to edit online articles and so alter the record, and while many outlets are trustworthy, some are owned by big corporations that could benefit from controlling the historical narrative. Besides, it’s commonly known that outlets sometimes update articles, whether openly or in secret, so an archive like the Wayback Machine is also useful for tracking such changes. Archive services can also preserve the output of publications that have since gone defunct and whose content would otherwise have been lost to history.

Companies abusing fair use policies to train AI LLMs is indeed a valid concern for media companies and other platforms that host massive amounts of data. But blocking archiving services such as the Wayback Machine will do society more harm than good. Hopefully, not all is lost for archiving: Wayback Machine director Mark Graham is reportedly in talks with several outlets so that the archive’s bot can regain access to their websites, while a coalition of journalists and other stakeholders has signed a letter in support of the Internet Archive and its mission of providing universal access to all knowledge.
