3 research papers address bias in AI and synthetic data

25/07/2025 | Various

First, the German Federal Office for Information Security (BSI) has published a white paper (in German) on bias in artificial intelligence (AI). The paper, which addresses the unequal treatment of users or companies by AI systems, provides foundational information on AI bias and its causes, together with an overview of detection and mitigation techniques. Detection can involve qualitative data analyses or statistical methods, while mitigation applies pre-processing, in-processing, or post-processing techniques.
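The BSI paper surveys these methods at a conceptual level. As a minimal illustrative sketch (not taken from the white paper itself), the following Python snippet shows one widely used statistical detection metric, the demographic parity difference, alongside one pre-processing mitigation, instance reweighing, which reweights each (group, label) cell so that labels look statistically independent of group membership:

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Gap in positive-prediction rates between two groups
    (0 means parity; larger absolute values indicate more bias)."""
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def reweighing_weights(y, group):
    """Pre-processing mitigation: give each (group, label) cell the weight
    expected/observed, so the weighted data shows no statistical
    dependence between label and group membership."""
    weights = np.empty(len(y), dtype=float)
    for g in np.unique(group):
        for label in np.unique(y):
            mask = (group == g) & (y == label)
            expected = (group == g).mean() * (y == label).mean()
            observed = mask.mean()
            weights[mask] = expected / observed  # >1 boosts under-represented cells
    return weights

# Toy example: a classifier that favours group 1.
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, group))  # 0.75 - 0.25 = 0.5

# Reweighing ground-truth labels that skew positive for group 0.
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(reweighing_weights(y_true, group))  # down-weights over-represented cells
```

In-processing would instead add a fairness term to the training objective itself, while post-processing would adjust decision thresholds per group after training.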

The white paper also highlights the links between bias and cybersecurity, noting how bias can be exploited for poisoning attacks or unauthorised model copying. Bias, which can manifest throughout the AI lifecycle, arises from distortions in the data or decision space, such as the overemphasis of problematic patterns or the insufficient representation of subpopulations. The white paper aims to inform developers, providers, and operators of AI systems about these issues, stressing that addressing bias is vital for ensuring AI functionality, cybersecurity, and equitable user treatment.

Next, a Google Research blog article by Zheng Xu, Research Scientist, and Yanxiang Zhang, Software Engineer at Google, discusses how the success of machine learning (ML) models, particularly language models (LMs), depends on both large-scale and high-quality data. The standard training approach involves pre-training on vast web data followed by post-training on smaller, high-quality datasets. This post-training is crucial for aligning large models with user intent and adapting smaller models to specific user domains. 

However, complex LM training systems pose privacy risks, such as the memorisation of sensitive user instruction data. Google Research proposes privacy-preserving synthetic data as a solution: large LMs can generate synthetic data that mimics real user interactions without carrying over memorised user content. This data can then be used in model training, simplifying the development of privacy-preserving models while still benefiting from diverse data.
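The blog post describes Google's own production pipeline; purely as a hypothetical sketch of the general pattern (a large LM generates synthetic instruction data that then replaces raw user logs in post-training), with `call_llm`, `SEED_TOPICS`, and `generate_synthetic_pairs` as illustrative placeholders rather than anything from the article:

```python
# Hypothetical sketch of the general pattern only. `call_llm` stands in
# for any text-generation API; Google's actual pipeline applies further
# privacy safeguards that are omitted here.

SEED_TOPICS = ["travel planning", "email drafting", "recipe ideas"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a text-generation API here")

def generate_synthetic_pairs(n_per_topic: int = 100) -> list[dict]:
    """Generate synthetic instruction/response pairs from seed topics."""
    pairs = []
    for topic in SEED_TOPICS:
        for _ in range(n_per_topic):
            instruction = call_llm(
                f"Write one realistic user request about {topic}."
            )
            response = call_llm(f"Respond helpfully to:\n{instruction}")
            pairs.append({"instruction": instruction, "response": response})
    return pairs  # used for post-training in place of real user logs
```

The key design point is that downstream post-training only ever sees model-generated text rather than raw user records; the approach described in the article includes additional safeguards to ensure the generating model itself does not leak user data, which this sketch omits.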

In the third article published by Biometric Update, Gaurav Sharma, Director of Operations at Chetu, writes that AI-generated synthetic data is transforming biometric systems by addressing privacy and bias concerns. Synthetic biometric data, which includes algorithm-generated facial images, fingerprints, and voice recordings, is not sourced from real individuals, making it inherently privacy-preserving. This contrasts with traditional datasets that can reflect demographic imbalances or include non-consented data.

Synthetic data allows for balanced representation across diverse demographics, supporting fairness. Developers can produce these datasets rapidly and flexibly simulate specific conditions such as facial occlusion or ageing. However, challenges remain: poorly prepared synthetic datasets can still perpetuate real-world bias if the generators are trained on flawed source data, and some synthetic outputs may be similar enough to real individuals to pose re-identification risks. Despite this, synthetic data is becoming fundamental to ethical and legally compliant biometric systems, enabling organisations to train fairer models, simulate rare scenarios, and enhance privacy in digital security.
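Sharma's article stays at the conceptual level; as one hypothetical illustration of "balanced representation", a generation pipeline could simply request equal counts for every demographic cell. The attribute lists and the `synthesize_face` function below are illustrative placeholders, not from the article:

```python
import random
from itertools import product

# Illustrative attribute bands; a real pipeline would choose its own.
AGE_BANDS  = ["18-30", "31-50", "51+"]
SKIN_TONES = ["light", "medium", "dark"]

def synthesize_face(age_band: str, skin_tone: str) -> bytes:
    """Placeholder for any generative model that renders a synthetic face."""
    raise NotImplementedError("plug in a face-generation model here")

def build_balanced_dataset(per_cell: int = 500) -> list[dict]:
    """Every (age_band, skin_tone) cell contributes exactly per_cell samples."""
    dataset = []
    for age, tone in product(AGE_BANDS, SKIN_TONES):
        for _ in range(per_cell):
            dataset.append({
                "image": synthesize_face(age, tone),
                "age_band": age,
                "skin_tone": tone,
            })
    random.shuffle(dataset)
    return dataset
```

Equal cell counts address representation imbalance by construction, although, as the article notes, they cannot fix artefacts inherited from a flawed generator.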

(The BSI white paper is in German; it can be translated to English in-browser using Google Chrome, Mozilla Firefox, Microsoft Edge, or Apple Safari.)



