Taming PETS: Privacy-Enhancing Technologies
Supervisor
Suitable for
Abstract
Prerequisites: Computer Security or Probability
Background
Include:
- A brief motivation for why the project is interesting.
- A summary of the area.
Almost all data of interest to modern analysis and machine learning tasks derives from human behaviour and characteristics, and so is considered sensitive. The area of Privacy Enhancing Technologies (PETS) seeks to understand how technical solutions can contribute to the responsible and private handling of data from individuals, while still ensuring that patterns and models can be extracted accurately. Over recent years, many important methods have been developed under different paradigms, such as Differential Privacy, Synthetic Data Generations, Federated Learning, Secure Multi-party Computation, Homomorphic Encryption, and Zero-knowledge proofs. For any given task, applying one of these modalities comes with a set of tradeoffs: how much trust is placed in different parties? What accuracy can be achieved compared to non-private computation? What are the computational overheads and dependence on other parameters?
Focus
Include:
- Research topic/question and expected contribution.
The goal of this project is to identify and study a data analysis task of interest, e.g., k-means clustering or logistic regression, through the lens of a suitable PET, e.g., differential privacy or federated learning. The student will work under my guidance to study the relevant literature on the topic, and propose privacy-preserving techniques to perform the analysis of interest, by applying methods that have been published or coming up with novel variants. A typical project will involve an implementation of two or more algorithms, to allow their comparison and study of their properties on realistic data sets.
Some examples that would be of interest:
- Secure Aggregation (SecAgg) + Invertible Bloom Look-up Tables (IBLT). Secure Aggregation is a general-purpose primitive that allows multiple parties to pool their information, so that only the sum of all the inputs can be extracted from the protocol, and nothing else. Invertible Bloom Look-up Tables represent a practical data structure that allow small sets to be retrieved, and the tables can be combined via arithmetic addition. SecAgg+IBLT represents a potentially powerful combination for lightweight private computation, allowing sampling, distributional analysis and frequent item mining.
- Tracking the statistical distribution of data that is also spread across multiple parties is important to detect changes in the behaviour of an aggregate population. The goal of this project is to develop techniques for taking statistical estimators for measures like the KL divergence of a dataset, and evaluating them under the models of local differential privacy and distributed differential privacy.
- Synthetic data is a popular way to take a sensitive dataset, and replace it with an entirely fabricated dataset which nevertheless shares many of the same properties. There are two leading approaches to this task: one based on graphical models of correlations between features in the data, and the other based on methods from deep learning (GANs, VAEs, diffusion models etc.). The aim of this project is to compare the strengths and weakness of these two paradigms, and to seek ways to combine them, or unify them into a common framework.
Method
Include:
- References to any papers, libraries or projects which might be used as a starting point.
- List of goals including which goals are essential to the project and which are stretch-goals.
I will work closely with any student wishing to do a project on PETS to help develop a bespoke set of objectives and accompanying reading list. The common goal of any project is to gain a deeper understanding of these methods, and to contribute new insights to the community. A successful project will almost always involve performing an empirical comparison of different algorithms, by making your own implementations (leveraging existing libraries) and evaluating on public benchmark datasets. Some useful initial reading for each possible topic includes:
- Federated learning and Privacy (Bonawitz, Kairouz, McMahan, Ramage, CACM https://dl.acm.org/doi/10.1145/3500240 )
- Synthetic Tabular Data: Methods, Attacks and Defenses (C., Maddock, Ullah, Gade, KDD https://dl.acm.org/doi/10.1145/3711896.3736562 )
- Differential Privacy for Databases (Near, He, Foundations and Trends, https://dpfordb.github.io/dpfordb.pdf )
Proofs, Arguments and Zero-knowledge proofs (Thaler, Foundations and Trends, https://people.cs.georgetown.edu/jthaler/ProofsArgsAndZK.html