Artificial intelligence has quickly become intertwined with consumers’ work and personal lives, with some Big Tech leaders touting its potential to deliver benefits unlike anything the world has ever seen. But a new Stanford study paints a bleak picture of what AI can do when safety measures fall through the cracks.
On Wednesday, Stanford University’s Cyber Policy Center published a report claiming it found more than 1,000 illegal images depicting child sexual abuse in an open-source database used to train popular image-generation tools like Stable Diffusion. LAION, the nonprofit that put the database together, used web crawling tools to create datasets with more than 5 billion links to online images, which companies can then pull from to use as training data for their own AI models.
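To illustrate how such a link-based dataset is typically consumed, here is a rough sketch: the dataset supplies URLs and captions rather than the images themselves, and a training pipeline downloads each linked image on the fly. The field names and records below are simplified assumptions for illustration, not LAION’s actual schema or tooling.

```python
# Illustrative sketch of a LAION-style record and how a training pipeline
# might consume it. LAION distributes metadata (links plus captions), not
# the images themselves; the field names here are assumptions.
import io
import requests
from PIL import Image

sample_records = [
    # Hypothetical rows: each pairs an image URL with its scraped caption.
    {"url": "https://example.com/cat.jpg", "caption": "a cat sitting on a couch"},
]

def fetch_image(record, timeout=10):
    """Download one linked image so it can be used as training data."""
    resp = requests.get(record["url"], timeout=timeout)
    resp.raise_for_status()
    return Image.open(io.BytesIO(resp.content)).convert("RGB"), record["caption"]

for rec in sample_records:
    try:
        image, caption = fetch_image(rec)
        print(image.size, caption)
    except Exception as err:  # dead or blocked links are common in web-scraped datasets
        print(f"skipping {rec['url']}: {err}")
```

The design keeps the dataset itself lightweight, but it also means the images behind those links are only as clean as the filtering applied to them.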
While 1,000 images is a tiny fraction of the total, the presence of child sexual abuse material in training data nonetheless helps image-generation models produce realistic, explicit images of children. So how did this happen? Experts who spoke with Fortune blame a race to innovate and a lack of accountability in the AI space. What’s more, they say it is all but certain that other illegal or objectionable material lurks in training data elsewhere.
“It was a matter of time,” Merve Hickok, president of the Center for AI and Digital Policy, told Fortune. “We opened the floodgates at this time last year, when one company after another released their models without safeguards in place, and the consequence of that race to market—we will have that for a very long time.”
This isn’t the first case of child sexual exploitation through AI. Just last month, New Jersey police began investigating an incident in which male high school students used AI to create and share fake nude images of their female classmates. In September alone, 24 million unique visitors went to websites that use AI to “undress” people in photos, social media analytics firm Graphika found. Ads for these services appear on mainstream social media platforms, making them more accessible, Graphika reported. Bad actors can use these images to extort, blackmail, and damage the reputations of ordinary people, experts warned. And the ability to create explicit images of children using AI—even if they don’t depict a specific person—can put children at risk in the real world.
“We are in the early innings here, and I’m afraid it can get much worse,” said Yaron Litwin, chief marketing officer of Canopy, a company using AI to filter out inappropriate content for children.
LAION has temporarily taken down its datasets and will ensure they are safe before republishing them, it said in an emailed statement. The nonprofit claimed it has “rigorous filters to detect and remove illegal content…before releasing them.” How more than 1,000 explicit images bypassed those filters is unclear, and LAION did not respond to additional questions.
How does this happen?
Child safety “is not an issue people necessarily think about when starting their projects,” said David Thiel, the Stanford researcher, formerly of Facebook, who authored the report. “My impression is that the original dataset was built by AI enthusiasts who didn’t have a ton of experience with the various kinds of safety measures you would want to put in place.”
Thiel began working on the project in September after being tipped off by someone else in the field: another researcher had reason to believe child sexual abuse material might exist in a public dataset after seeing keywords in the descriptions of image entries. Thiel then designed a process for finding individual illegal images in large databases using PhotoDNA, a Microsoft-developed technology that matches images against hashes of known abusive content. Because the report scanned only the LAION dataset, even though other public datasets are also used for training, explicit images of children may well exist elsewhere.
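PhotoDNA itself is not publicly distributed, but the general approach, comparing perceptual hashes of images against hashes of known illegal material, can be sketched with the open-source imagehash library as a stand-in. The placeholder blocklist values, distance threshold, and directory path below are illustrative assumptions, not part of Thiel’s actual methodology.

```python
# A minimal sketch of hash-based matching, using the open-source `imagehash`
# perceptual hash as a stand-in for PhotoDNA. The blocklist entries and
# threshold below are placeholders, purely for illustration.
from pathlib import Path
from PIL import Image
import imagehash

# Hashes of known-bad images would come from a vetted child-safety source;
# this hex string is an arbitrary placeholder, not a real entry.
KNOWN_BAD_HASHES = {imagehash.hex_to_hash("ffd8c0a0b0e0f0c1")}
MAX_DISTANCE = 5  # how many differing bits still count as a match

def scan_directory(root: str) -> None:
    """Flag files whose perceptual hash is close to a known-bad hash."""
    for path in Path(root).glob("**/*.jpg"):
        h = imagehash.phash(Image.open(path))
        if any(h - bad <= MAX_DISTANCE for bad in KNOWN_BAD_HASHES):
            print(f"possible match: {path}")

scan_directory("./downloaded_images")
```

A real deployment would source its hash list from vetted child-safety organizations and avoid storing or viewing any matched content directly.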
“Like much of the technology sector, there are a lot of things that are overlooked in a rush to get things out there,” Thiel told Fortune. “That’s something I believe happened here as well. It has echoes of ‘move fast and break things,’” he said, referencing the early Facebook ethos.
What’s missing here is accountability and regulation, experts agreed. Consumers, too, have become less forgiving of companies scraping the internet for training data. “Most people have realized the ‘crawl the whole web’ methodology is fraught for a number of reasons,” Thiel said. “There’s a shift towards training things that have been licensed.” A number of news organizations have partnered with AI companies to license their content for training, most recently German media giant Axel Springer, which owns Politico and E&E News in the U.S.
While this shift in mindset bodes well for the future of AI regulation, Thiel said, “The damage done by those early models will be with us for a bit.”