Machine Learning Department
School of Computer Science, Carnegie Mellon University
New Statistical Applications for Differential Privacy
Differential privacy is a relatively recent development in the field of
privacy-preserving data mining, formulated to give a mathematically
rigorous definition of privacy. The concept has spawned a great deal of work
on the development of algorithms that are privacy-preserving
under this definition, as well as work that seeks to understand the fundamental
limitations of such algorithms. When the goal is statistical inference, it is
important to understand which analyses can be carried out in the
privacy-preserving framework with reasonable accuracy, and which data
summaries and results may be reported.
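For reference, the rigorous definition alluded to above is the standard one due to Dwork et al.: a randomized algorithm A is epsilon-differentially private if its output distribution changes by at most a multiplicative factor when one record of the input is changed. In LaTeX notation:

```latex
% A randomized algorithm A is \varepsilon-differentially private if,
% for all pairs of databases D, D' differing in a single record and
% all measurable sets S of outputs,
\Pr\left[A(D) \in S\right] \;\le\; e^{\varepsilon}\,\Pr\left[A(D') \in S\right].
```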
In this work we begin by examining fundamental limitations of differentially private procedures when the goal is to release a sparse histogram or a contingency table. We also describe error bounds when the goal is model selection for a private contingency table. Through examples we demonstrate the implications of these error lower bounds for statistical analysis with datasets of different sizes and dimensions.
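As a concrete instance of the kind of procedure these lower bounds constrain, the following is a minimal sketch (not taken from the thesis) of the standard Laplace-mechanism histogram. Under the add/remove-one-record notion of neighboring databases, a histogram has sensitivity 1, so adding Laplace noise of scale 1/epsilon to each cell yields epsilon-differential privacy; note that the per-cell error does not shrink as the table grows sparser, which is the source of the difficulty for high-dimensional contingency tables.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_histogram(data, bin_edges, epsilon):
    """Release a histogram with epsilon-differential privacy via the
    Laplace mechanism. Adding or removing one record changes exactly
    one cell count by 1, so the sensitivity is 1 and Laplace noise of
    scale 1/epsilon suffices."""
    counts, _ = np.histogram(data, bins=bin_edges)
    noise = rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return counts + noise

data = rng.normal(size=1000)
edges = np.linspace(-4, 4, 17)        # 16 cells
private = dp_histogram(data, edges, epsilon=0.5)
true, _ = np.histogram(data, bins=edges)
# The average per-cell error is on the order of 1/epsilon,
# regardless of how many cells are actually occupied.
print(np.mean(np.abs(private - true)))
```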
Thus far, differentially private algorithms appear to be restricted to the release of finite-dimensional vectors (e.g., regression coefficients, point estimates, SVM parameters). We next develop methods for releasing functional data while preserving differential privacy. Specifically, we show that when the function of interest lies in a reproducing kernel Hilbert space, releasing the function plus a Gaussian process noise term satisfies differential privacy. As examples we consider kernel density estimation, kernel support vector machines, and suitably smooth functions lying in a particular Sobolev space.
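The following sketch illustrates the shape of such a release for kernel density estimation: the estimate is evaluated on a grid and a Gaussian process sample path (with the same kernel as its covariance) is added. The noise scale here is an arbitrary illustrative constant; the actual calibration of the noise to the privacy budget and the RKHS norm, which is the substance of this chapter, is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(x, y, lengthscale=0.5):
    """Gaussian (RBF) kernel matrix between two 1-d point sets."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * lengthscale ** 2))

def noisy_function_release(f_values, grid, scale=0.1, lengthscale=0.5):
    """Release f evaluated on a grid plus a Gaussian process noise path.
    `scale` stands in for the privacy calibration derived in the thesis
    (illustrative only, NOT a privacy guarantee as written)."""
    K = rbf_kernel(grid, grid, lengthscale) + 1e-9 * np.eye(len(grid))
    noise = rng.multivariate_normal(np.zeros(len(grid)), K)
    return f_values + scale * noise

grid = np.linspace(-3, 3, 50)
data = rng.normal(size=200)
h = 0.3  # KDE bandwidth
kde = np.mean(
    np.exp(-(grid[:, None] - data[None, :]) ** 2 / (2 * h * h)), axis=1
) / (h * np.sqrt(2 * np.pi))
released = noisy_function_release(kde, grid)
```

Because the noise is itself a smooth random function rather than independent per-point noise, the released curve remains a plausible density estimate.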
Finally we propose a relaxed privacy definition called random differential privacy (RDP). Differential privacy requires that adding any new observation to a database have small effect on the output of the data-release procedure; random differential privacy requires only that adding a randomly drawn new observation have small effect on the output. We show an analog of the composition property of differentially private procedures which applies to our new definition. We show how to release an RDP histogram, and we show that RDP histograms are much more accurate than histograms obtained using ordinary differential privacy. We also give an analog of the global sensitivity framework for the release of functions under our privacy definition. We demonstrate an extension of this idea and describe how it relates to other, looser privacy definitions that have begun to appear in the literature.
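The accuracy gap claimed above can be made intuitive with a purely illustrative contrast (this is not the RDP mechanism itself, and no RDP calibration is reproduced): ordinary differential privacy must perturb every cell of a sparse table, so the total error grows with the table's dimension, whereas a mechanism that is permitted to leave unoccupied cells untouched pays noise only on the occupied cells.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 10_000        # cells in a high-dimensional contingency table
n_occupied = 50   # only a few cells actually hold data
epsilon = 1.0

counts = np.zeros(d)
counts[:n_occupied] = rng.integers(1, 20, size=n_occupied)

# Ordinary DP: Laplace noise on every cell, including the empty ones.
dp_release = counts + rng.laplace(scale=1.0 / epsilon, size=d)

# Hypothetical sparse release: noise only on occupied cells
# (illustrative stand-in for the RDP histogram's advantage).
sparse_release = counts.copy()
sparse_release[:n_occupied] += rng.laplace(scale=1.0 / epsilon,
                                           size=n_occupied)

print(np.sum(np.abs(dp_release - counts)))      # scales with d
print(np.sum(np.abs(sparse_release - counts)))  # scales with n_occupied
```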
SCS Technical Report Collection