AI training dataset contains millions of examples of personal data

18/07/2025 | MIT Technology Review

A new study by researchers from Carnegie Mellon University, published on arXiv, has revealed that a major open-source artificial intelligence (AI) training set, DataComp CommonPool, likely contains millions of images that include personal information. An audit of just 0.1% of CommonPool's 12.8 billion data samples uncovered thousands of identifiable faces and validated identity documents, including passports, credit cards, driver's licenses, and birth certificates. Over 800 validated job application documents, some of which reveal sensitive information such as disability status or background check results, were also found and linked to real individuals.

The researchers estimate that hundreds of millions of images in the full dataset contain personally identifiable information. This discovery reinforces the concern that "anything you put online can and probably has been scraped" for AI training, highlighting significant privacy risks associated with large-scale web scraping for generative AI models. While CommonPool's curators intended it for academic research, its licence permits commercial use. By the time the research was published, DataComp datasets, which include CommonPool, had been downloaded 2 million times. The dataset has also been used to train image generation models like Midjourney, Stable Diffusion, and Google's Imagen.

