Julius Cerniauskas

Open Internet, web scraping, and AI: the unbreakable link


Last year, the nonprofit internet archival organization The Internet Archive (IA) lost the first round of Hachette v. Internet Archive, a lawsuit brought by four major publishers over its decision to act as a digital library during the pandemic and lend more than one copy of a book at a time.

Whether that was an ethical decision, and who is right in this battle (the publishers, who are using existing copyright law provisions to their advantage, or the IA, which argues that today's copyright law is outdated and no longer meets the needs of digital societies), remains an open question. The IA has appealed the loss to the Second Circuit Court of Appeals, a move supported by many authors themselves.

The IA case, however, points to a broader issue: the struggle to preserve open access to information on a free and open internet. In recent years, this mission has been increasingly complicated by mounting legal cases against artificial intelligence firms that gather web data for algorithmic training, contextual advertising services that analyze public data to understand the content of different sites, and even nonprofits that gather web data for socially driven purposes: earlier this year, X sued the Center for Countering Digital Hate and lost the case.

Although presented on the surface as a fight over data ownership, it is usually a fight over how the monetary gains of a growing digital economy are distributed. Without rethinking current compensation mechanisms, this fight may produce nothing but a fragmented society, a proliferation of disinformation, and biased, primitive AI solutions.

The philosophy of the open Internet

The concept of the open web is a broad collection of ideas resting on the basic principles of information as a public good, people's right to share it, and the importance of data neutrality. Its supporters promote equal access to the Internet as a way to distribute knowledge globally, above all through nonprofit means such as Creative Commons, open-source scholarship and code, open licensing, and archival organizations like the previously mentioned IA.

The open Internet has its downsides. An obvious example is that cybercrime can benefit significantly from open-source code, while open access to digital content can stimulate piracy. However, crime proliferates in closed social systems too, so making the Internet less accessible would hardly solve the issue.

Open access to information, on the other hand, has been the main driver of human civilization from the days when our hominid ancestors developed language to the Gutenberg Revolution to the emergence of the World Wide Web.

The argument for access to public web data

The Internet Archive is the epitome of the open Internet and free access to data. Holding an archive of 410 billion web pages in its Wayback Machine, tens of millions of books, images, and audio recordings, and over 200,000 software programs (including historic applications), it is a huge historical repository, a sociocultural phenomenon, and an educational project with a mission to distribute knowledge to remote locations.

Content can be uploaded to the IA by its users, but the lion's share is collected from the web by web crawlers: automated programs that scour the Internet and store the contents of websites. The IA's crawlers collect only publicly accessible data, meaning that information behind logins or paywalls is omitted.
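
To make the mechanics concrete, below is a minimal, hypothetical sketch in Python (using the requests library; it is not the IA's actual crawler code) of how a "polite" crawler limits itself to publicly accessible pages: it checks robots.txt before fetching and treats login or paywall responses as off-limits.

# A minimal, hypothetical crawler fetch routine (not the IA's real code).
# It honors robots.txt and skips anything that answers with a login or
# paywall status code, so only publicly accessible pages are stored.
from urllib.parse import urlparse
import urllib.robotparser
import requests

USER_AGENT = "example-archiver/0.1"  # made-up crawler name for illustration

def fetch_public_page(url: str) -> str | None:
    """Return the page HTML only if it is publicly reachable, else None."""
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site asked crawlers to stay away

    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    if response.status_code in (401, 402, 403):
        return None  # behind a login or paywall, so not archived
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = fetch_public_page("https://example.com/")
    print("stored" if html else "skipped")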

There are multiple ways in which free data repositories such as the IA benefit critical social missions. The IA is used for scientific research, to access old court documents, and even as evidence in court proceedings. It can also support investigative journalism and the fight against disinformation.
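
Archived snapshots can also be retrieved programmatically. The sketch below, a simple illustration rather than an official client, queries the Wayback Machine's public availability endpoint (https://archive.org/wayback/available) for the snapshot closest to a given date; the example URL and timestamp are arbitrary.

# A short sketch of how a researcher or journalist might pull an archived
# snapshot programmatically via the Wayback Machine availability API.
import requests

def closest_snapshot(url: str, timestamp: str = "20200101") -> str | None:
    """Return the Wayback Machine URL of the snapshot closest to `timestamp`."""
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=10,
    )
    response.raise_for_status()
    closest = response.json().get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]  # a web.archive.org/web/<timestamp>/... link
    return None

if __name__ == "__main__":
    print(closest_snapshot("example.com", "20060101"))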

AI in the echo chambers

A relatively new use case that requires open access to vast amounts of public web data, including historical repositories, is training artificial intelligence algorithms (AI, not to be confused with the IA). Making AI training and testing data as diverse as possible is a prerequisite not only for developing increasingly complex systems but also for reducing bias, avoiding hallucinations, and improving accuracy.

As my colleague has argued, if training datasets are primarily built on data that is either synthetic or too homogeneous, the system will tend to accentuate specific patterns (including biases) inherent in the underlying datasets, resulting in echo chambers and making AI outputs primitive and less reliable. Moreover, probabilistic algorithms would form closed epistemic systems in which the abundance of ideas, theories, and other representations of the real world would slowly vanish.
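
The feedback loop is easy to simulate. The following toy sketch (my own illustration, not a real training pipeline) repeatedly refits a trivial "model" (an empirical frequency distribution over "ideas") on its own synthetic output; over the generations, rare ideas tend to drop out of the sample and never return, so the number of distinct ideas and the entropy of the corpus shrink.

# A toy illustration of the echo-chamber effect: a "model" refit on its
# own synthetic output loses diversity generation after generation.
import random
from collections import Counter
from math import log2

random.seed(42)

def entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of an empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A "human" corpus: 50 distinct ideas with skewed (Zipf-like) popularity.
vocabulary = [f"idea_{i}" for i in range(50)]
zipf_weights = [1 / (rank + 1) for rank in range(50)]
corpus = random.choices(vocabulary, weights=zipf_weights, k=500)

for generation in range(20):
    counts = Counter(corpus)
    print(f"gen {generation:2d}: {len(counts):2d} distinct ideas, "
          f"entropy {entropy(counts):.2f} bits")
    # "Train" on the current corpus (fit an empirical distribution), then
    # replace the corpus entirely with synthetic samples from that model.
    corpus = random.choices(list(counts.keys()),
                            weights=list(counts.values()), k=500)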

Unfortunately, getting open access to abundant human-created data is the main challenge for AI developers today. AI firms have faced a huge social and legal backlash over using publicly available web data, partly over data privacy concerns and partly over data ownership and copyright.

On the one hand, the argument that firms developing popular commercial AI solutions must compensate content owners (be they photographers, writers, designers, or scientists) for using their work sounds entirely legitimate. On the other hand, it leaves AI developers in a stalemate.

First, web content is nearly boundless, and a large part of it might be considered "technically copyrighted" without having clearly attributed rights. Content actively produced by millions of web users is the best example of this phenomenon: usually, none of them claim their public output as copyrighted material, and it would be impossible to identify every potential copyright holder. Compensating them would also mean negotiating terms with all of them, an effort of such a scale that it would make commercial AI development unfeasible.

Recognizing the complicated nature of the situation, some major data owners (often called "gatekeepers") have rushed to monetize their resources. The BBC announced it is "in talks with technology companies to sell access to its content archive to use as AI training data," and other publishers are considering similar revenue diversification models, too.

However, this solution might still make the costs of AI development too burdensome, especially for smaller companies. Without rethinking current compensation mechanisms and the established copyright regime that currently favors the big players, the move towards more intelligent, reliable, and responsible AI systems might remain stuck in the realm of science fiction for years to come.

Concluding remarks

Due to rapid internet expansion, the way people live their everyday lives has drastically changed over the last few decades. First, we started consuming digital information—reading books, watching movies, listening to music, and talking to each other using our gadgets. Today, it is not only us but also robots that create digital art, gather all sorts of information, and “read” online, trying to make sense of the content humans have created.

However, the established copyright regime and the resulting compensation mechanisms have not adapted quickly enough, causing trouble for various participants in the digital economy: businesses that gather public web intelligence, historical repositories that store Internet data for future generations, and AI developers that need to make robots smart and, even more importantly, reliable. As the case of the Internet Archive shows, even the concept of a digital library is still legally problematic.

With existing technological capabilities, open access to publicly available web data is the only way to improve the quality of AI outputs. AI tools that are better at digesting and distributing information would, in turn, make information more accessible and useful to wider audiences. However, if AI developers are forced to pay for all the data they use, there might be no business case for developing these systems further.



This article was produced as part of TechRadar Pro's Expert Insights channel, where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadar Pro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
