Last week, Sony Music Group sent a letter to more than 700 companies building or using generative AI models with a clear warning: If you’re going to use Sony’s content, you’d better have explicit permission.
“We have reason to believe that you and/or your affiliates may already have made unauthorized uses (including TDM) of SMG Content in relation to the training, development or commercialization of AI systems,” according to the letter, a copy of which was obtained by Term Sheet. (The letter was first reported by Bloomberg.)
The letter, which went to key AI industry players including OpenAI, Microsoft, Google, and YouTube, a Sony Music spokesperson confirmed, demanded that the companies provide information about any “unauthorized” usage of Sony’s content and that they “preserve all evidence” of using that content to train, develop, or commercialize their models.
It’s the latest example of the raging battle between generative AI companies, which are on the hunt for all the data they can get their hands on to keep improving their models, and the creators and license holders on the other side, who have a vested interest in protecting their IP, or at least in getting some of these companies to pay for it.
Copyright issues have become central to the conversation around AI, largely because we have no idea what data nearly all of these companies are using to train their models. It’s not for lack of asking: When OpenAI CTO Mira Murati was asked whether her company had used YouTube, Instagram, or Facebook videos to train its Sora model, she responded that she was “not sure about that.”
What we do know is the argument that some of these companies (and a few of their investors) laid out in letters to the U.S. Copyright Office last year, explaining why they shouldn’t have to pay for copyrighted material.
“The factual metadata and fundamental information that AI models learn from training data are not protected by copyright law. Copyright law does not protect the facts, ideas, scènes à faire, artistic styles, or general concepts contained in copyrighted works,” OpenAI wrote in its letter.
Vinod Khosla, one of OpenAI’s first investors, weighed in with his own letter: “To restrict AI from training on copyrighted material would have no precedent in how other forms of intelligence that came before AI now, train. There are no authors of copyright material that did not learn from copyrighted works, be it in manuscripts, art or music. We routinely talk about the influence of a painter or writer on subsequent painters or writers. They have all learned from their predecessors. But copyrights can still be maintained. Many if not most authors or artists have talked about others that have been inspiration, influence or training materials for them.”
A lot of the licensing companies and creators beg to differ, as is clear from the handful of ongoing copyright lawsuits against AI companies. To name a few: The New York Times is suing OpenAI, Getty Images is suing Stability AI, and Universal Music Group is suing Anthropic.

AI companies are also trying to land licensing deals: Yesterday, OpenAI announced a multi-year licensing arrangement with News Corp that will give OpenAI access to both current and archived articles from its brands, including The Wall Street Journal, Barron's, and the New York Post. In response to a request for comment for this essay, an OpenAI spokesperson also pointed to a blog post explaining that OpenAI is developing a tool that "will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training." The blog post also says that OpenAI's models are trained using publicly available data, data partnerships (such as licensing deals), and human feedback.
Attitudes around copyright shifted drastically in 2022, when OpenAI released the first commercial generative AI model, according to Ed Newton-Rex, who has worked on generative AI models since 2011, at his own startup, at ByteDance, and most recently at Stability AI. “Previously everyone was a bit cautious,” he said, adding: “It was almost as if the view flipped overnight.” He described it as something of a “snowball effect.”
Several companies are taking a strong stance: Adobe’s Firefly, for example, is trained only on material that doesn’t violate copyright, according to the company. And through Fairly Trained, the non-profit organization he founded, Newton-Rex offers certification to companies committed to fairer data sourcing; 14 companies now hold the certification, he said.
Some companies may be forced to make their training data public. The EU Artificial Intelligence Act, passed by the European Parliament earlier this year and approved by the EU Council just this week, will require companies whose general-purpose AI models are intended for use in the European Union to publish a “sufficiently detailed” summary of the content used to train those models. In the U.S., Rep. Adam Schiff (D-Calif.) has proposed the Generative AI Copyright Disclosure Act, which would require companies to disclose the copyrighted works they use to train their models.
All of this may lead to even more litigation. “I think if a lot of these datasets came out, you would see a lot more lawsuits,” Newton-Rex said. “I think one thing holding the lawsuits back is a lack of knowledge about what's in the training data.”
In other news…Techstars CEO Maëlle Gavet said in a LinkedIn post yesterday that she was resigning at the end of this month for health reasons. Techstars cofounder and Board Chairman David Cohen will return as CEO.
See you tomorrow,
Jessica Mathews
Twitter: @jessicakmathews
Email: jessica.mathews@fortune.com
Submit a deal for the Term Sheet newsletter here.
Joe Abrams curated the deals section of today's newsletter.