Protecting Trained Models in Privacy-Preserving Federated Learning

This post is part of a series on privacy-preserving federated learning. The series is a collaboration between NIST and the UK government's Responsible Technology Adoption Unit (RTA), formerly known as the Centre for Data Ethics and Innovation. Learn more and read all the posts published to date on NIST's Privacy Engineering Collaboration Space or RTA's blog.

Our previous two posts in the series described techniques for input privacy in privacy-preserving federated learning on horizontally and vertically partitioned data. To build a complete privacy-preserving federated learning system, these techniques must be combined with an approach for output privacy, which limits how much can be learned about individuals in the training data after the model has been trained.

As described in the second part of our post on privacy attacks in federated learning, trained models can reveal significant information about their training data, including whole images and text passages.

Training with Differential Privacy

The strongest known form of output privacy is differential privacy. Differential privacy is a formal privacy framework applicable in many settings; see NIST's blog series on the topic for more details, especially the post on differentially private machine learning.

Techniques for differentially private machine learning add random noise to the model during training to defend against privacy attacks. The noise prevents the model from memorizing details of the training data, so that those details cannot be extracted from the model later. For example, Carlini et al. showed that sensitive training data, such as Social Security numbers, could be extracted from trained language models, and that training with differential privacy prevented this attack.
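
To make this concrete, below is a minimal sketch of one common technique, differentially private stochastic gradient descent (DP-SGD): each example's gradient is clipped to bound its influence, and Gaussian noise is added to the averaged gradient before the model is updated. The function name, clipping bound, and noise multiplier are illustrative placeholders, not part of any particular library or of the work cited above.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One illustrative DP-SGD step: clip each example's gradient, average, add noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # bound each record's influence
    avg = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound and batch size
    noise = np.random.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                             size=avg.shape)
    return params - lr * (avg + noise)
```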

Differential Privacy for Privacy-Preserving Federated Learning

In centralized training, where the training data is collected on a central server, the server can perform the training and add the noise required for differential privacy at the same time. In privacy-preserving federated learning, by contrast, deciding who should add the noise, and how, can be more complicated.

Figure: FedAvg with differential privacy, for privacy-preserving federated learning on horizontally partitioned data. Modifications to the FedAvg approach are highlighted in red; they add random noise to each update, so that the combined noise samples are sufficient to ensure differential privacy for the trained global model. (Credit: NIST)

For privacy-preserving federated learning on horizontally partitioned data, Kairouz et al. describe a variant of the FedAvg technique shown in our fourth post, visualized in the figure above. In this approach, each participant performs local training and then adds a small amount of random noise to their model update before combining it with the updates of the other participants. If each participant adds noise to their update correctly, the resulting aggregated model will contain enough noise to ensure differential privacy, providing output privacy even against a malicious aggregator. The Scarlet Pets team used a variant of this approach in their winning solution for the UK-US PETs Prize Challenges.
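
The sketch below shows the general shape of this idea under simplifying assumptions (models represented as flat NumPy arrays, a fixed noise scale, no secure aggregation). It is not the exact algorithm of Kairouz et al. or of the Scarlet Pets solution: each client clips its update and adds its own share of random noise, and the server simply averages the noisy updates as in FedAvg.

```python
import numpy as np

def client_update(local_model, global_model, clip_norm=1.0, noise_stddev=0.05):
    """Illustrative client step: compute the model update, clip it, add local noise."""
    update = local_model - global_model
    norm = np.linalg.norm(update)
    update = update * min(1.0, clip_norm / (norm + 1e-12))  # bound each client's contribution
    # Each client adds a share of the noise; the combined noise protects the global model
    update += np.random.normal(0.0, noise_stddev, size=update.shape)
    return update

def aggregate(global_model, client_updates):
    """Server averages the already-noisy client updates, as in FedAvg."""
    return global_model + np.mean(client_updates, axis=0)
```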

In the case of vertically partitioned data, adding differential privacy can be more challenging. The noise required for differential privacy cannot be added before entity alignment, because it would prevent the records from being matched up correctly. Instead, the noise must be added after entity alignment, either by a trusted participant or by using a technique like homomorphic encryption or multiparty computation.
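
As a rough illustration of that ordering constraint, the hypothetical sketch below joins two parties' records on a shared identifier first, and only then adds noise to an aggregate computed from the joined data. The inputs are assumed to be dictionaries mapping record IDs to lists of feature values; in a real deployment both the join and the noise addition would be protected by a trusted participant, homomorphic encryption, or multiparty computation.

```python
import numpy as np

def align_then_add_noise(features_a, features_b, noise_stddev=0.1):
    """Illustrative sketch: entity alignment first, noise only afterwards."""
    # Entity alignment: keep only record IDs present in both parties' data
    shared_ids = sorted(set(features_a) & set(features_b))
    joined = np.array([features_a[i] + features_b[i] for i in shared_ids])

    # Some aggregate needed for training (a column sum here, as a stand-in)
    aggregate = joined.sum(axis=0)

    # Noise is added only after alignment, so the matching is not disturbed
    return aggregate + np.random.normal(0.0, noise_stddev, size=aggregate.shape)
```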

Training Highly Accurate Differentially Private Models

The random noise required for differential privacy can affect the accuracy of the model. In general, more noise yields better privacy but worse accuracy. This tension between accuracy and privacy is often called the privacy-utility tradeoff.
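
One way to see the tradeoff is through the classical Gaussian mechanism, whose noise scale grows as the privacy parameter epsilon shrinks. The snippet below is only an illustration; the formula applies for epsilon between 0 and 1, and the sensitivity value is a placeholder.

```python
import math

def gaussian_noise_stddev(sensitivity, epsilon, delta):
    """Noise scale for the classical Gaussian mechanism:
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon  (for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

# Stronger privacy (smaller epsilon) requires more noise, which tends to cost accuracy
for eps in (0.1, 0.5, 0.9):
    print(f"epsilon={eps}: sigma={gaussian_noise_stddev(1.0, eps, 1e-5):.2f}")
```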

For some machine learning models, including linear regression, logistic regression, and decision trees, navigating this tradeoff is relatively easy, and the approaches described above often work well for training highly accurate models with differential privacy. In the UK-US PETs Prize Challenges, both the PPMLHuskies and Scarlet Pets teams used approaches like these to train accurate models with differential privacy.

For neural networks and deep learning, the sheer size of the model makes it harder to add differential privacy during training: larger models require more noise to achieve privacy, which can significantly reduce accuracy. Models of this kind were not part of the UK-US PETs Prize Challenges, but they are increasingly important across generative AI applications, including large language models.

Recent results suggest that models pre-trained on public data (without differential privacy) and then fine-tuned with differential privacy can achieve accuracy nearly identical to that of models trained without differential privacy. For example, Li et al. show that pre-trained language models can be fine-tuned with differential privacy to reach nearly the same accuracy as models trained without it. These results suggest that in domains where public data is available for pre-training, including language and image models, privacy-preserving federated learning with both strong privacy and high utility is feasible.
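
A minimal sketch of the idea, under the assumption that a publicly pre-trained feature extractor is frozen and only a small linear head is trained with DP-SGD, might look like the following. The function and its parameters are illustrative and are not taken from Li et al.

```python
import numpy as np

def dp_finetune_head(pretrained_features, labels, epochs=5, lr=0.5,
                     clip_norm=1.0, noise_multiplier=1.1):
    """Illustrative sketch: fine-tune only a linear head on top of a frozen,
    publicly pre-trained feature extractor, using DP-SGD on the head's weights."""
    n, d = pretrained_features.shape
    w = np.zeros(d)  # the only parameters updated with differential privacy

    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-pretrained_features @ w))     # logistic head
        per_example_grads = (preds - labels)[:, None] * pretrained_features

        # Clip each example's gradient, then add Gaussian noise to the average
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        noisy_grad = clipped.mean(axis=0) + np.random.normal(
            0.0, noise_multiplier * clip_norm / n, size=d)

        w -= lr * noisy_grad
    return w
```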

This approach does not provide any privacy protection for the public data used in pre-training, so it is important to ensure that this data is used in a way that respects applicable privacy and intellectual property rights (legal and ethical questions in this area are outside the scope of this blog series).

Coming Up

Our next post will discuss the challenges of deploying privacy-preserving federated learning in real-world scenarios.
