AI — it's everywhere you look right now.
Every day it feels like there's a fresh news story about AI, whether it's a new technological breakthrough, a new company getting in on the action, or an AI customer service bot swearing at its customers. AI is the new hotness, and it's most likely here to stay.
Whether it's Microsoft Copilot on Windows or something else, the tech all essentially functions in a similar way. Large language models (LLMs) are trained on data stolen, er, scraped (borrowed?) from the internet. Whether it's the great articles here on Windows Central written by yours truly (ahem), news stories posted by The New York Times (which is currently suing Microsoft over the matter, oops), or copyrighted artwork from Disney or even independent creators: without human creators, AI is nothing. And you have to wonder if OpenAI, Microsoft, Google, and other AI proliferators actually realize this fact.
Indeed, by opting for profits first with things like Microsoft Copilot Pro instead of laying a functional groundwork for a symbiotic relationship between content creation, curation, and LLM training, Microsoft and its competitors may have inadvertently doomed their own models right out of the gate — at least in part.
Let's go over a few of the biggest threats facing AI LLMs right now, and what, if anything, Microsoft and its pals should do about them.
1. Microsoft Copilot, Google Bard, and OpenAI ChatGPT are killing the internet — and themselves
Ever heard of the concept of LLM model collapse? It's arguably the biggest threat facing outfits like OpenAI, Microsoft, and Google right now, and none of them really seem to be taking the threat particularly seriously.
Model collapse describes the degenerative process that large language models like ChatGPT can experience when they're trained on AI-generated junk data. Much like a JPEG that has passed through thousands of rounds of compression and re-sharing over the years, if tools like ChatGPT and Copilot make it easy to flood the internet with poorly written, hallucination-filled junk data, every new round of training on that junk compounds the damage, producing a cumulative and escalating decline in the quality of an LLM's output.
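To make the mechanism a little more concrete, here's a deliberately simplified Python sketch (a toy statistical model, nothing resembling the actual ChatGPT or Copilot training pipeline) of what happens when a model is trained, generation after generation, mostly on its own output: rare, "interesting" data vanishes first, and everything converges on bland sameness.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # stand-in for human-written content

for generation in range(8):
    # "Train" a model on whatever data is available: here, just estimate mean and spread.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: mean={mu:+.3f}, std={sigma:.3f}")

    # The model floods the web with its own output...
    samples = rng.normal(mu, sigma, size=10_000)

    # ...but, like an LLM favouring safe, high-probability text, it mostly produces
    # "typical" content. The rare tails never make it into the next training crawl,
    # so diversity shrinks with every generation.
    data = samples[np.abs(samples - mu) < 1.5 * sigma]
```

Run it and the spread of the data collapses by roughly a quarter with every generation; the analogous effect in a real LLM is blander, samier, less accurate output.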
There are already memes circulating that describe how ChatGPT and Copilot have gotten "lazy" in recent months, producing lower quality results. And now, new research (via Vice) suggests that huge swaths of written content on the internet are already AI-generated, riddled with errors, inaccuracies, and occasionally harmful lies. The yet-to-be-published paper suggests that up to 57 per cent of the internet may already be AI-generated, with the effect most pronounced in under-resourced regions and languages. There's more English-language content on all sorts of topics across the web, for example, which helps an AI train and localize results more accurately. But there's no reason to think that will always be the case. Already, a surfeit of English-oriented content is AI-generated as well. You only need to scroll through YouTube Shorts or TikTok for two minutes to find AI-generated junk content.
Indeed, right now, Google, Microsoft, and OpenAI have an untenable relationship with the human content creators they rely on to train their AI models. If Microsoft, and more particularly Google with its roughly 90% share of the search market, continue to harm human creators, they will ironically only be hurting themselves in the long run.
To that end, another study (via Gizmodo) details how Google is losing the war against AI-generated junk. A large language model can't go hands-on with a headset and produce high-quality guides like my best Xbox headsets article (shameless plug), so people looking to make a quick buck can artificially generate similar articles, which in turn rely on LLMs stealing my work. The research suggests that Google is struggling to differentiate between original, human-researched, first-hand experiences and AI-generated fake ones, potentially leading to a decline in the financial viability of the very content these models are trained on in the first place.
A dearth of human creators means a decline in quality data, which means a decline in search viability, which means a decline in AI quality. If Microsoft, Google, et al. want to take model collapse seriously, they probably need to explore how to preserve the (relative) symbiosis that exists right now between creators and search algorithms. If Google and Microsoft refuse to do so, the courts may force them to anyway.
2. The looming legal apocalypse
In the early days of DALL-E 3 integration into Bing, we saw immediately how users could produce copyright-violating, embarrassing, and potentially even harmful images using prompt trickery. Disney didn't take kindly to images of Mickey Mouse committing grisly murders, so Microsoft promptly (heh) lobotomized Bing Image Creator. However, I was still able to create the above image as of writing, although sneaking violence through is harder than it once was.
In the coming years, companies like OpenAI and Microsoft will face increasing pressure from copyright holders of all stripes, who view the training of LLMs on their data as unauthorized. OpenAI released a statement in response to a major lawsuit from The New York Times, arguing that scraping articles for LLM training constitutes "fair use." It'll be for the courts to decide whether that argument holds up, but I suspect OpenAI will have an uphill battle convincing a judge and jury that parsing copyrighted content to build for-profit products constitutes "fair use."
RELATED: OpenAI claims copyrighted materials are "essential" for training AI
Increasingly, platforms like Reddit and Twitter (X) are locking the big tech companies out of their data for training purposes, in part so they can train their own models, but also simply because Microsoft et al. didn't bother to ask for permission. Various other publishers are also gearing up for lawsuits of their own, owing to the fact that platforms like Copilot and Google Bard deprive content platforms of users.
But what can you do if you don't have millions of dollars to battle Google and Microsoft in court? Well, it turns out, even independent content creators and artists are finding ways to fight back.
3. Proliferation of LLM manipulation attack tools
What actually prompted me to get out of bed at 5AM and write this piece was a Reddit thread describing a tool called Nightshade. Nightshade was produced by a research team specifically looking at ways to protect artists from content theft, which, let's be real, the corporations building these tools are engaging in en masse.
How Nightshade works is quite incredible. Combined with a similar tool called Glaze, it allows artists not only to shield their styles from artificial intelligence models, but to actively inject junk and inaccurate pixels into the mix, undetectable to the human eye.
"While human eyes see a shaded image that is largely unchanged from the original, the AI model sees a dramatically different composition in the image. For example, human eyes might see a shaded image of a cow in a green field largely unchanged, but an AI model might see a large leather purse lying in the grass. Trained on a sufficient number of shaded images that include a cow, a model will become increasingly convinced cows have nice brown leathery handles and smooth side pockets with a zipper, and perhaps a lovely brand logo."
It's not a stretch to imagine these types of tools becoming available to other kinds of creators down the line. From voice actors having their speech stolen, to musicians having their works reconstituted, to writers having their stories repurposed: without the symbiotic relationship I mentioned above, content creators could increasingly (and justifiably) seek ways to undermine LLM content scrapers in lieu of high-powered lawyers.
If it becomes standard practice for content creators to introduce "poisoned" data that's imperceptible to humans but wholly perceptible to AI, that will only increase the likelihood of model collapse.
Moving beyond independent creators, what's to stop hostile state governments from manipulating AI data? What would a state-backed anti-LLM effort look like, designed to inject false science, divisive political discourse, or fake news into search tools? Well, we don't have to wonder too much, since it's already happening. A report from Freedom House recently detailed how 47 governments deployed AI tools to manipulate discourse in online comment threads in an attempt to influence public opinion. AI scrapes comment data too, which, again, contributes to model collapse.
I can envision LLM corps building tools to combat this, devolving into an arms race much like today's battles against spam bots. AI is as much at the mercy of humans as we potentially are of it. Without symbiosis and thoughtfulness, it just isn't going to work.
Even AI can't predict the future
The toxic relationship between platforms like Microsoft Copilot, Google Bard, and OpenAI's models and the human creators they depend on is probably going to be an ongoing subject of controversy and discussion into 2024 and beyond. As with any ground-breaking new tech, like the internet before it, and the internal combustion engine before that, LLMs have the potential to utterly upend and unmake entire industries.
We've seen predictions that AI could impact or reduce anywhere up to 40 per cent of all job roles and widen global inequality, which is already creating a storm of political instability. It goes beyond model collapse at that point — how can you hope to profit in an AI-induced economic depression, where you've created the conditions for a collapse in purchasing power?
We've already seen how AI can be abused by hostile state actors to produce fake news and influence public opinion at scale, and to obfuscate scientific discourse with emotionally charged, contrarian propaganda. We've already seen the first shoots of job losses as a direct result of AI, and how OpenAI is now rushing to work with the United States military after previously pledging not to do so. Meta's Mark Zuckerberg even stated this week that he wants to create an open-source AGI that anyone can use, which I'm sure won't be abused in any way.
RELATED: Will Microsoft force OEMs to add 16GB RAM to AI Windows PCs?
While there's plenty of doom and gloom about how AI is going to exacerbate the world's ills, there are undeniably positive aspects to consider as well. AI is being used to speed up time-intensive scientific tasks, such as aiding in the discovery of new molecules for medicinal purposes and new materials for batteries and the like. It can turbocharge learning, explaining complex concepts in a bespoke way tailored to the learner. The most idealistic vision for AI is that it will support and aid human work, as a copilot rather than a replacement.
Nobody can really predict what the future holds, but it's difficult to be optimistic. OpenAI, Google, and Microsoft have rushed headlong into figuring out ways to make a quick buck on LLMs, without broad and visible consideration of the vast potential for negative outcomes these models carry. AI, ironically, depends on human input to function. Without proper forethought for how their models may cause harm, and without fostering a healthy relationship with the human creators they rely upon, these corps may be sitting on a time bomb primed to blow up in their own faces.