Safeguarding Trained Models in Privacy-Preserving Federated Learning

This article is part of a series on privacy-preserving federated learning. The series is a collaboration between NIST and the UK government's Responsible Technology Adoption Unit (RTA), previously known as the Centre for Data Ethics and Innovation. Learn more and read all the articles published so far at NIST's Privacy Engineering Collaboration Space or on RTA's blog.

In the previous two posts in our series, we covered techniques for input privacy in privacy-preserving federated learning on horizontally and vertically partitioned data. To build a complete privacy-preserving federated learning system, these techniques must be combined with an approach for output privacy, which limits what can be learned about individuals in the training data after the model has been trained.

As described in the second part of our post on privacy attacks in federated learning, trained models can reveal a great deal about their training data, including entire images and passages of text.

Training with Differential Privacy

The strongest known form of output privacy is differential privacy. Differential privacy is a formal privacy framework that can be applied in many different contexts; see NIST's blog series on the topic for more details, especially the post on differentially private machine learning.

Techniques for differentially private machine learning add random noise to the model during training to defend against privacy attacks. The noise prevents the model from memorizing specifics of the training data, so that the training data cannot later be extracted from the model. For example, Carlini et al. showed that sensitive training data such as Social Security numbers can be extracted from trained language models, and that training with differential privacy prevents this attack.
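
To make the idea concrete, here is a minimal sketch (in Python with NumPy; the function and parameter names are our own, not taken from any particular library) of the core step used in DP-SGD-style training, one common technique: each example's gradient is clipped to bound its influence, and Gaussian noise calibrated to that bound is added before the model is updated.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.1, learning_rate=0.1, rng=None):
    """One differentially private gradient step (illustrative sketch).

    per_example_grads has shape (batch_size, num_params): one gradient
    per training example, so each example's influence can be bounded.
    """
    rng = rng or np.random.default_rng()

    # 1. Clip each example's gradient so no single record dominates.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))

    # 2. Sum the clipped gradients and add Gaussian noise calibrated
    #    to the clipping norm (the per-example sensitivity).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / len(per_example_grads)

    # 3. Ordinary gradient descent update using the noisy gradient.
    return params - learning_rate * noisy_grad
```

Because the noise is drawn independently of any individual record, the final model's behavior changes very little whether or not any one person's data was included in training.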

Differential Privacy for Privacy-Preserving Federated Learning

In centralized training, where the training data is collected on a central server, the server can perform the training and add the noise required for differential privacy at the same time. In privacy-preserving federated learning, however, it can be challenging to determine which party should add the noise and how it should be added.

FedAvg with differential privacy, for privacy-preserving federated learning on horizontally partitioned data. Modifications to the FedAvg approach are highlighted in red. These modifications add random noise to each update, so that the aggregated noise samples are sufficient to ensure differential privacy for the trained global model.

Credit: NIST

For privacy-preserving federated learning on horizontally partitioned data, Kairouz et al. describe a variant of the FedAvg approach outlined in our fourth post. In this approach, shown in the figure above, each participant performs local training and then adds a small amount of random noise to their model update before combining it with the other participants' updates. As long as every participant correctly adds noise to their update, the aggregated model will contain enough noise to ensure differential privacy. This technique provides output privacy even when the aggregator is malicious. The Scarlet Pets team used a variant of this approach in their winning solution for the UK-US PETs Prize Challenges.
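
The sketch below illustrates the general shape of this noisy variant of FedAvg. It is a simplified illustration under our own assumptions, not the exact mechanism from Kairouz et al.: the local_train placeholder, the clipping step, and the Gaussian noise parameters are all stand-ins chosen for readability.

```python
import numpy as np

def local_train(global_model, local_data):
    """Placeholder for the client's ordinary local training."""
    # Nudge the model toward the mean of the local data so the
    # example runs end to end; real clients would run SGD here.
    return global_model + 0.1 * (local_data.mean(axis=0) - global_model)

def local_update_with_noise(global_model, local_data, clip_norm=1.0,
                            noise_stddev=0.5, rng=None):
    """Client-side step: train locally, clip the update, and add a
    small amount of Gaussian noise before sharing it."""
    rng = rng or np.random.default_rng()
    update = local_train(global_model, local_data) - global_model
    norm = np.linalg.norm(update)
    update = update * min(1.0, clip_norm / (norm + 1e-12))
    return update + rng.normal(0.0, noise_stddev, size=update.shape)

def federated_round(global_model, client_datasets):
    """Server-side step: average the noisy client updates. The summed
    noise across clients is what protects the aggregated model."""
    updates = [local_update_with_noise(global_model, data)
               for data in client_datasets]
    return global_model + np.mean(updates, axis=0)

# Toy usage: three clients, each holding 20 examples, and a 4-parameter model.
rng = np.random.default_rng(0)
clients = [rng.normal(loc=i, size=(20, 4)) for i in range(3)]
model = np.zeros(4)
for _ in range(5):
    model = federated_round(model, clients)
```

Each client's individual noise contribution can be relatively small, because it is the sum of all clients' noise that must be large enough to protect the aggregated global model.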

For vertically partitioned data, ensuring differential privacy can be more complicated. The noise required for differential privacy cannot be added before entity alignment, because it would prevent the parties' records from being matched up correctly. The noise must therefore be added after entity alignment, either by a trusted participant or by using techniques like homomorphic encryption or multiparty computation.
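
As a toy illustration of why the ordering matters, the sketch below aligns two parties' records on a shared identifier first and only then adds noise to what is released. The party data, function names, and noise parameters are invented for the example; a real system would use private set intersection together with homomorphic encryption or multiparty computation rather than a plain join performed by one party.

```python
import numpy as np

def align_entities(records_a, records_b):
    """Join the two parties' feature tables on a shared identifier.
    Noise cannot be added before this step, or the identifiers would
    no longer match up across parties."""
    shared_ids = sorted(set(records_a) & set(records_b))
    return np.array([records_a[i] + records_b[i] for i in shared_ids])

def release_noisy_statistic(aligned_features, noise_stddev=2.0, rng=None):
    """After alignment, a trusted party (or an MPC/HE protocol playing
    that role) adds calibrated noise to whatever is released."""
    rng = rng or np.random.default_rng()
    column_means = aligned_features.mean(axis=0)
    return column_means + rng.normal(0.0, noise_stddev, size=column_means.shape)

# Toy usage: each party maps an entity ID to its own feature list.
party_a = {1: [0.2, 1.0], 2: [0.5, 0.1], 3: [0.9, 0.4]}
party_b = {2: [3.0], 3: [2.5], 4: [1.1]}
print(release_noisy_statistic(align_entities(party_a, party_b)))
```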

Training Highly Accurate Differentially Private Models

The random noise required for differential privacy can hurt model accuracy. More noise generally means stronger privacy but lower accuracy. This tension between accuracy and privacy is often called the privacy-utility tradeoff.
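
One way to see the tradeoff concretely is the classical calibration of the Gaussian mechanism, in which the noise standard deviation grows as the privacy parameter ε shrinks. The snippet below is our own illustration of that standard formula, not code from any of the systems discussed here.

```python
import math

def gaussian_noise_scale(epsilon, delta=1e-5, sensitivity=1.0):
    """Classical Gaussian-mechanism calibration (valid for epsilon < 1):
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon.
    Smaller epsilon (stronger privacy) means more noise."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

for eps in (0.1, 0.5, 0.9):
    print(f"epsilon={eps}: sigma ~ {gaussian_noise_scale(eps):.2f}")
```

Roughly ten times stronger privacy (a ten-times-smaller ε) requires roughly ten times more noise, which is why aggressive privacy budgets can noticeably reduce accuracy.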

For some kinds of machine learning models, including linear regression, logistic regression, and decision trees, this tradeoff is relatively easy to manage: the approach described above is often enough to train highly accurate models with differential privacy. In the UK-US PETs Prize Challenges, both the PPMLHuskies and Scarlet Pets teams used approaches like this to train highly accurate models with differential privacy.

For neural networks and deep learning, in contrast, the sheer size of the model makes training with differential privacy more difficult: larger models require more noise to protect privacy, which can lead to a significant loss of accuracy. Models like these did not appear in the UK-US PETs Prize Challenges, but they are increasingly important in many applications of generative AI, including large language models.

Recent work has shown that models pre-trained on public data (without differential privacy) and then fine-tuned with differential privacy can achieve accuracy nearly as high as models trained without differential privacy. For example, Li et al. show that pre-trained language models can be fine-tuned with differential privacy to reach almost the same accuracy as models trained without it. These results suggest that in domains where public data is available for pre-training, including language and image recognition models, privacy-preserving federated learning that provides both privacy and utility is practical.
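
A rough sketch of this recipe, under our own simplifying assumptions (a frozen encoder standing in for the publicly pre-trained model, and a small logistic-regression head trained with clipped, noised gradients), might look like the following; real implementations typically rely on purpose-built differential privacy libraries rather than hand-rolled code like this.

```python
import numpy as np

def extract_features(encoder_weights, inputs):
    """Frozen, publicly pre-trained encoder: running it adds no extra
    privacy cost, because it was trained on public data."""
    return np.tanh(inputs @ encoder_weights)

def dp_finetune_linear_head(features, labels, steps=200, clip_norm=1.0,
                            noise_multiplier=1.0, lr=0.1, rng=None):
    """Fine-tune only a small linear head on the private data using
    clipped, noised gradients (DP-SGD style); far fewer parameters
    need to be perturbed than in the full model."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        # Per-example logistic-regression gradients.
        preds = 1.0 / (1.0 + np.exp(-features @ w))
        grads = (preds - labels)[:, None] * features
        # Clip each example's gradient, then add Gaussian noise.
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
        w -= lr * (grads.sum(axis=0) + noise) / len(labels)
    return w

# Toy usage: random weights stand in for a publicly pre-trained encoder.
rng = np.random.default_rng(1)
encoder = rng.normal(size=(8, 4))
x_private = rng.normal(size=(32, 8))
y_private = (x_private[:, 0] > 0).astype(float)
head = dp_finetune_linear_head(extract_features(encoder, x_private), y_private)
```

Because only the small head is trained on private data, much less noise is needed than for full-model training, which is one intuition for why pre-training plus private fine-tuning preserves accuracy so well.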

This approach does not provide any privacy protection for the public data used during pre-training, so it is important to ensure that the use of that data complies with applicable privacy regulations and intellectual property rights (the legal and ethical questions in this area are outside the scope of this article series).

Upcoming Content

In the next post, we will discuss the challenges of deploying privacy-preserving federated learning in real-world settings.
