International Business Times UK
Chelsie Napiza

Researchers Challenge OpenAI Defence After Claiming ChatGPT Can Output Near-Verbatim Copies of Published Books

A new peer-reviewed study claims that finetuning GPT-4o, Google's Gemini-2.5-Pro, and DeepSeek-V3.1 allows researchers to extract up to 90% of copyrighted books in near-verbatim form. The findings directly challenge the legal defences that OpenAI and other AI companies have used in dozens of active copyright lawsuits.

The paper, titled 'Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models,' was submitted to arXiv on 21 March 2026 and revised on 25 March 2026. Its authors are Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg and Tuhin Chakrabarty, researchers whose combined backgrounds span computer science, machine learning and copyright law.

The timing is significant. As Norton Rose Fulbright noted in a 2026 litigation update, OpenAI has consistently asserted in court filings that its outputs do not substantially reproduce plaintiffs' works, and has argued that safety alignment measures prevent verbatim regurgitation. The new research alleges that such protections can be bypassed with minimal effort.

Finetuning as a Back Door Through Alignment

The research team's method is both technically elegant and commercially plausible. Rather than prompting a model to reproduce a book directly (something alignment guardrails are specifically designed to block), the researchers finetuned each model to expand plot summaries into full prose.
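To make the setup concrete, here is a minimal sketch of what a summary-to-prose finetuning record could look like. This is illustrative only: the field names follow the common chat-finetuning JSONL convention, not the paper's actual data format, and the summary and prose below are invented placeholders rather than real book text.

```python
import json

def make_record(summary: str, prose: str) -> str:
    """Build one JSONL training line pairing a plot summary (prompt)
    with the full-prose passage the model should learn to produce."""
    record = {
        "messages": [
            {"role": "user",
             "content": f"Expand this plot summary into full prose:\n{summary}"},
            {"role": "assistant", "content": prose},
        ]
    }
    return json.dumps(record)

# Hypothetical example record; contents are placeholders.
line = make_record(
    "A man searches a quiet city for a cat that has gone missing.",
    "The streets were empty when he began to walk.",
)
```

The key point the paper makes is that the prose targets need not even come from the books being extracted: the finetuning step merely teaches the model the summary-to-prose task, and memorised pretraining text surfaces in the outputs.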

Writing-assistant tools that perform exactly this kind of task are already widely sold commercially. Using only semantic descriptions of a book's plot, without providing any actual book text as input, the researchers caused the finetuned models to reproduce 85 to 90% of copyrighted books in near-verbatim form, with individual verbatim spans exceeding 460 words.

The paper makes clear that the key mechanism is not prompt engineering but the finetuning process. According to the paper's abstract on arXiv, 'finetuning bypasses these protections' because the training weight adjustments reactivate memories of copyrighted text already embedded in the model from pretraining.

The authors describe this dynamic as a 'Whack-a-Mole' problem. Suppressing verbatim output in one context does not remove the underlying data from the model's weights. It simply makes retrieval harder in one direction while leaving it wide open in others.

The effect is not limited to a single author. The researchers finetuned models exclusively on the novels of Haruki Murakami and then tested extraction on books by more than 30 unrelated authors. The cross-author generalisation remained robust.

As the paper explains, finetuning on Virginia Woolf's widely digitised public-domain novels produced extraction rates comparable to the Murakami-trained condition. Finetuning on purely synthetic stories that were never part of any pretraining corpus produced 'virtually no long verbatim spans.' The implication is clear: finetuning unlocks material already memorised during pretraining, not content injected later.

What OpenAI Has Told Courts About Memorisation

The study directly targets a position that AI companies have staked out repeatedly in litigation. According to the paper's own framing, 'frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data.' They have further 'cited the efficacy' of safety alignment measures 'in their legal defenses against copyright infringement claims.'

That legal position is well documented. As Ropes & Gray's litigation tracker noted, 'defendants including OpenAI, Microsoft, Bloomberg and GitHub have asserted that their use of copyrighted materials is permissible because the AI model outputs merely build upon copyrighted works, rather than replicating protected expressions.' OpenAI has also argued in court that some material used is not protected by copyright, citing fair use defences including transformativeness and de minimis copying.

Courts have not settled these questions yet. As the Copyright Alliance documented in January 2026, fair use rulings in AI training cases remain split: two judges have found in defendants' favour, one against. No further summary judgment decisions on AI fair use are expected until summer 2026 at the earliest. The paper arrives in that vacuum, furnishing evidence that safety filters may not constitute a meaningful technical barrier to verbatim reproduction.

Discovery proceedings have already been squeezing OpenAI. On 5 January 2026, US District Judge Sidney Stein affirmed an order compelling OpenAI to produce 20 million anonymised ChatGPT conversation logs in the consolidated multidistrict litigation in the Southern District of New York, a case that combines 16 separate copyright lawsuits from news organisations and authors. OpenAI had attempted to limit its disclosure to logs specifically mentioning the plaintiffs' works. The court rejected that approach.

Copyright Scholars Behind the Paper

The authorship of this paper sets it apart from purely technical memorisation research. Jane C. Ginsburg, based at Columbia Law School, is one of the most cited copyright scholars in the United States. Her co-authors include Niloofar Mireshghallah, a machine learning researcher whose University of Washington profile lists extensive prior work on LLM memorisation and privacy, and Tuhin Chakrabarty, an assistant professor of computer science at Stony Brook University whose website lists New Yorker and Literary Hub coverage of related work. That combination makes the paper simultaneously a technical intervention and a legal argument.

The paper's own legal analysis explicitly frames the findings in terms of copyright territoriality. It notes that in Getty Images v. Stability AI [2025] EWHC 2863 (Ch), the English High Court found no infringing acts in the United Kingdom because Stable Diffusion was found not to 'store the data on which it was trained.' The paper argues that had the evidence shown model weights retain copies, rather than merely having 'learned the statistics of patterns,' the court would likely have found otherwise.

The implication for the UK is clear. If this methodology shows that GPT-4o retains verbatim copies of works accessible in the country, British courts may have grounds to hear infringement claims under domestic copyright law.

The paper's authors also note that nearly every frontier language model was trained on copyrighted books obtained from pirated sources, citing materials obtained from shadow libraries including LibGen and Books3, which comprised more than 190,000 copyrighted titles. This sits alongside broader evidence of what the Copyright Alliance describes as 'more than 70 infringement lawsuits' now pending against AI companies in US courts, with a new wave of class certification battles anticipated throughout 2026.

If courts accept the argument that safety filters are a legal shield rather than a technical reality, this paper may be the clearest evidence yet that the shield was always paper-thin.
