The Guardian - AU
Technology
Josh Taylor

Photos of Australian children used in dataset to train AI, human rights group says

HRW found children whose images were in the Laion-5B dataset, used by services such as Stable Diffusion creator Stability AI and Midjourney, were easily identifiable. Photograph: Ian West/PA

Photos of Australian children have been included in the dataset used by several AI image-generating tools without the knowledge or consent of the children or their families, research by Human Rights Watch (HRW) has found.

An analysis of less than 0.0001% of the 5.85bn images contained in the Laion-5B dataset, used by services such as Stable Diffusion creator Stability AI and Midjourney, found 190 photos of Australian children scraped from the internet.

Laion-5B has been built by scraping photos off the internet. Germany-based Laion does not keep a repository of the images it scrapes; instead, the dataset contains a list of URLs pointing to the original images, along with the alternative text attached to them.

HRW found children whose images were in the dataset were easily identifiable, with some names included in the accompanying caption or the URL where the image was stored.

The dataset also included information on when and where the photos were taken.

One photo featured two boys in front of a colourful mural and revealed their names, ages and the preschool they attended, HRW said, information that was not found anywhere else on the internet.

Hye Jung Han, HRW’s children’s rights and technology researcher, told Guardian Australia the photos were being lifted from photo and video sharing sites, as well as school websites.

“These are not easily findable on school websites,” she said. “They might have been taking images of a school event or like a dance performance or swim meet and wanted a way to share these images with parents and kids.

“It’s not quite a password-protected part of their website, but it’s a part of the website that is not publicly accessible, unless you were sent the link.

“These were not webpages that were indexed by Google.”

HRW also found an unlisted YouTube video of schoolies celebrations in the dataset. Such videos are not searchable on YouTube and scraping YouTube is against its policies, Han said.

Images of Indigenous children were also found, with some photos over a decade old. Han said this raised questions about how images of recently deceased Indigenous people could be protected if they were included in the dataset being used to train AI.

Laion, the organisation behind the open source dataset, told Guardian Australia that any material related to children’s images is “coming from links pointing to publicly available section[s] of the internet”.

“With regard to links to images on public internet available in LAION datasets, we can confirm that we worked together with HRW and remove[d] all the private children data reported by HRW,” a spokesperson said.

“We would like to reiterate …the fact that the most effective way to increase safety is to remove private children[’s] info from [the] public internet. LAION datasets are just a collection of links to images available on public internet. Removing links from LAION datasets DOES NOT result in removal of actual original images hosted by the responsible third parties on public internet.

“As long as those images along with private data remain publicly accessible, any other parties collecting data will be able to obtain those for their own datasets that will remain closed in most cases.”

The organisation has a form where users can submit feedback on issues in the dataset.

Han said the practice risks harming two groups of children: those whose photos are scraped, and those who could have malicious AI tools, such as deepfake apps built on the dataset, used against them.

“Almost all of these free nudify apps have been built on Laion-5B because it is the biggest image and text training dataset out there,” she said.

“It’s being used by untold numbers of AI developers, and some of those apps were specifically being used to cause harm to children.”

Last month, a teenage boy was arrested then released after nude images, created by AI using the likeness of about 50 female students from Bacchus Marsh Grammar, were circulated online.

The federal government in June introduced legislation to ban the creation and sharing of deepfake pornography, but HRW argued this failed to address the deeper problem that children’s personal data remains unprotected from misuse, including where real children’s likenesses can be used in deepfakes.

“No one knows how AI is going to evolve tomorrow. I think the root of the harm lies in the fact that children’s personal data are not legally protected, and so they’re not protected from misuse by any actor or any type of technology,” Han said.

The organisation said this should be addressed in legislation to update the Privacy Act, expected in August, which it argued should prohibit the scraping of children’s data into AI and the nonconsensual digital replication or manipulation of children’s likenesses.

The Australian privacy commissioner found in 2021 that Clearview AI’s scraping of images from social media for use in facial recognition technology “may adversely impact the personal freedoms of all Australians” and that the company had breached Australians’ privacy.

Han said it was a strong statement, but it now needed to be backed up by law and by enforcement of that law.

“There’s still a long way to go.”
