This post is part of a series on privacy-preserving federated learning. The series is a collaboration between NIST and the UK government’s Responsible Technology Adoption Unit (RTA), formerly known as the Centre for Data Ethics and Innovation. Learn more and read all the posts published to date at NIST’s Privacy Engineering Collaboration Space or RTA’s blog.
Previous posts in this series discussed techniques for input privacy in privacy-preserving federated learning for both horizontally and vertically partitioned data. To build a complete privacy-preserving federated learning system, these techniques must be combined with a technique for output privacy, which limits what can be learned about individuals in the training data after the model has been trained.
As described in our earlier post on privacy attacks in federated learning, trained models can leak significant information about their training data, including complete images and text excerpts.
Training with Differential Privacy
Differential privacy is the strongest known form of output privacy. It is a formal privacy framework that applies in many contexts; see NIST’s blog series on the topic for much more detail, especially the post on differentially private machine learning.
Techniques for differentially private machine learning add random noise to the model during training to defend against privacy attacks. The noise prevents the model from memorizing details of the training data, so that extracting training data from the trained model is infeasible. For example, Carlini et al. showed that sensitive training data such as Social Security numbers could be extracted from trained language models, and that training with differential privacy prevents this attack.
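To make the idea concrete, the sketch below shows the core of one DP-SGD-style training step for a simple logistic regression model: each example’s gradient is clipped to a fixed norm, and Gaussian noise is added to the summed gradients before the model is updated. This is a minimal illustrative sketch, not the exact method from the work cited above; the model, clipping norm, and noise scale are placeholder choices.

```python
import numpy as np

def dp_sgd_step(weights, X_batch, y_batch, lr=0.1, clip_norm=1.0, noise_scale=1.0):
    """One DP-SGD-style step for logistic regression (illustrative sketch).

    Per-example gradients are clipped to `clip_norm`, summed, and perturbed
    with Gaussian noise scaled by `noise_scale * clip_norm` before the
    averaged, noisy gradient is applied to the weights.
    """
    grads = []
    for x, y in zip(X_batch, y_batch):
        pred = 1.0 / (1.0 + np.exp(-x @ weights))                 # sigmoid prediction
        g = (pred - y) * x                                        # per-example gradient
        g = g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12)) # clip to bound sensitivity
        grads.append(g)
    noisy_sum = np.sum(grads, axis=0) + np.random.normal(
        0.0, noise_scale * clip_norm, size=weights.shape)         # add Gaussian noise
    return weights - lr * noisy_sum / len(X_batch)

# Toy usage: 32 examples with 5 features each (synthetic placeholder data).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.integers(0, 2, size=32)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y)
```

In practice, the amount of noise and the number of training steps are tracked by a privacy accountant to compute the overall differential privacy guarantee.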
Differential Privacy for Privacy-Preserving Federated Learning
In centralized training, where the training data is collected on a central server, the server can perform training and add the noise required for differential privacy at the same time. In privacy-preserving federated learning, however, deciding who should add the noise, and how, can be more complicated.
(Figure credit: NIST)
For privacy-preserving federated learning on horizontally partitioned data, Kairouz et al. describe a variant of the FedAvg approach covered in our fourth post. In this approach, each participant performs local training, then adds a small amount of random noise to their model update before combining it with the other participants’ updates. If every participant adds the right amount of noise to their update, the new aggregated model will contain enough total noise to ensure differential privacy. This approach provides output privacy even when the aggregator is malicious. The Scarlet Pets team used a variant of this approach in their winning solution for the UK-US PETs Prize Challenges.
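The sketch below illustrates the core of that idea under simplifying assumptions: each participant computes its model update locally, clips it, and adds Gaussian noise before the updates are averaged. The clipping norm and noise scale are illustrative placeholders, and a real deployment would typically combine this with secure aggregation so the aggregator only sees the noisy sum.

```python
import numpy as np

def noisy_client_update(local_model, global_model, clip_norm=1.0, noise_scale=0.5):
    """Clip a participant's model update and add Gaussian noise locally,
    before the update leaves the participant's device (illustrative sketch)."""
    update = local_model - global_model
    update = update * min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    return update + np.random.normal(0.0, noise_scale * clip_norm, size=update.shape)

def aggregate(global_model, noisy_updates):
    """FedAvg-style aggregation of updates that were already noised locally."""
    return global_model + np.mean(noisy_updates, axis=0)

# Toy round with 3 participants and a 4-parameter model (synthetic placeholder data).
rng = np.random.default_rng(1)
global_model = np.zeros(4)
local_models = [global_model + rng.normal(scale=0.1, size=4) for _ in range(3)]
updates = [noisy_client_update(m, global_model) for m in local_models]
global_model = aggregate(global_model, updates)
```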
For vertically partitioned data, achieving differential privacy can be more complex. The noise required for differential privacy cannot be added before entity alignment, because it would prevent the data from being aligned correctly. Instead, the noise must be added after entity alignment, either by a trusted participant or using techniques like homomorphic encryption or multiparty computation.
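One way to avoid relying on a single trusted participant for that post-alignment noise is to have each party contribute a share of the noise, for example inside a secure computation, so that the aggregate carries the full amount required. The sketch below shows only the statistical idea, assuming a target Gaussian noise level `sigma` and ignoring the entity-alignment and secure-computation machinery entirely.

```python
import numpy as np

def shared_noise(num_parties, sigma, shape, rng):
    """Each party samples Gaussian noise with variance sigma**2 / num_parties;
    the shares sum to noise with variance sigma**2 (illustrative sketch)."""
    shares = [rng.normal(0.0, sigma / np.sqrt(num_parties), size=shape)
              for _ in range(num_parties)]
    return shares, np.sum(shares, axis=0)

rng = np.random.default_rng(2)
shares, total_noise = shared_noise(num_parties=3, sigma=1.0, shape=(4,), rng=rng)
# `total_noise` would be added to the aggregate result after entity alignment,
# without any single party knowing or controlling the full noise value.
```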
Training Highly Accurate Differentially Private Models
The random noise required for differential privacy can harm model accuracy: more noise generally yields stronger privacy but lower accuracy. This tension between accuracy and privacy is often called the privacy-utility tradeoff.
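The classic Gaussian mechanism makes this tradeoff easy to see: for a fixed sensitivity and failure probability delta, the noise standard deviation grows as the privacy parameter epsilon shrinks. The sketch below just evaluates that standard formula; it is a simplified illustration rather than the privacy accounting used by the federated training methods above.

```python
import numpy as np

def gaussian_mechanism_sigma(epsilon, delta, sensitivity=1.0):
    """Noise standard deviation for the classic Gaussian mechanism:
    sigma = sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon  (for epsilon < 1)."""
    return np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon

# Smaller epsilon (stronger privacy) requires larger noise, which hurts accuracy.
for eps in (0.1, 0.5, 0.9):
    print(f"epsilon={eps}: sigma={gaussian_mechanism_sigma(eps, delta=1e-5):.2f}")
```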
For certain kinds of machine learning models, including linear regression, logistic regression, and decision trees, this tradeoff is easy to navigate: the approach described earlier often trains highly accurate models with differential privacy. In the UK-US PETs Prize Challenges, both the PPMLHuskies and Scarlet Pets teams used similar techniques to train highly accurate models with differential privacy.
For neural networks and deep learning, the sheer size of the model makes training with differential privacy more difficult: larger models require more noise to protect privacy, which can significantly reduce accuracy. While such models were not part of the UK-US PETs Prize Challenges, they are increasingly important in generative AI applications, including large language models.
Recent results have shown that models pre-trained on public data (without differential privacy) and then fine-tuned with differential privacy can reach nearly the same accuracy as models trained without differential privacy. For example, Li et al. show that pre-trained language models fine-tuned with differential privacy achieve accuracy close to that of models trained without differential privacy. These results suggest that in domains where public data is available for pre-training, including language and image recognition models, privacy-preserving federated learning can achieve both privacy and utility.
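A common recipe, sketched below, is to load a model pre-trained on public data, freeze most of its parameters, and fine-tune the remaining layers on private data with differentially private SGD. The sketch assumes the Opacus library’s `PrivacyEngine` API (v1.x) and uses a placeholder backbone, dataset, and hyperparameters; it illustrates the general pattern rather than the exact setup reported by Li et al.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine  # assumes the Opacus library (v1.x) is installed

# Stand-in for a backbone pre-trained on public data (no DP needed for that phase).
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False          # freeze the publicly pre-trained layers

head = nn.Linear(64, 2)              # only the head is fine-tuned on private data
model = nn.Sequential(backbone, head)

# Placeholder private fine-tuning data.
X = torch.randn(256, 128)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32)

optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.05)
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.0, max_grad_norm=1.0,   # illustrative DP-SGD settings
)

criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

# Report the privacy budget spent during fine-tuning (the public pre-training
# phase is outside this guarantee).
print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```

Because only a small fraction of the parameters are trained privately, less noise is spread across the trainable weights, which is one reason this recipe preserves accuracy so well.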
Note that this approach provides no privacy protection for the public data used in pre-training, so it is important to ensure that the use of this data complies with applicable privacy and intellectual property laws (the legal and ethical questions raised by this issue are outside the scope of this blog series).
Coming Up Next
In our next post, we will explore the challenges of deploying privacy-preserving federated learning in real-world scenarios.
