Machine Learning Department
School of Computer Science, Carnegie Mellon University


New Statistical Applications for Differential Privacy

Rob Hall

December 2012

Ph.D. Thesis
Joint Machine Learning and Statistics



Differential privacy is a relatively recent development in the field of privacy-preserving data mining, formulated to give a mathematically rigorous definition of privacy. The concept has spawned a great deal of work on developing algorithms that are privacy-preserving under this definition, as well as work that seeks to understand the fundamental limitations of such algorithms. When the goal is statistical inference, it is important to understand which analyses may be carried out with reasonable accuracy in the privacy-preserving framework, and which data summaries and results may be reported.
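Concretely, the rigorous definition referred to here is the standard one from the literature (stated in my notation, not quoted from this report): a randomized algorithm A is ε-differentially private if, for all measurable sets S and all databases D, D' differing in a single record,

```latex
\Pr\left[A(D) \in S\right] \;\le\; e^{\varepsilon}\,\Pr\left[A(D') \in S\right].
```

Smaller ε means the output distribution is less sensitive to any one individual's data, and hence the release reveals less about that individual.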

In this work we begin by examining fundamental limitations of differentially private procedures when the goal is to release a sparse histogram or a contingency table. We also describe error bounds when the goal is model selection for a private contingency table. Through examples we demonstrate the implications of these lower bounds on error for statistical analysis with datasets of different sizes and dimensions.
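The baseline against which such lower bounds are measured is the classical Laplace mechanism for histogram release. A minimal sketch (the function name and interface are mine, not from this report):

```python
import numpy as np

def private_histogram(data, bins, epsilon, rng=None):
    """Release a histogram under epsilon-differential privacy via the
    Laplace mechanism.  Adding or removing one record changes at most
    one bin count by 1, so the L1 sensitivity is 1 and independent
    Laplace noise of scale 1/epsilon suffices."""
    rng = np.random.default_rng(rng)
    counts, edges = np.histogram(data, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return noisy, edges
```

For a sparse contingency table this mechanism perturbs every cell, including the empty ones, which is one source of the accuracy limitations studied in this work.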

Thus far, differentially private algorithms appear to be restricted to releasing finite-dimensional vectors (e.g., regression coefficients, point estimates, SVM parameters). We next develop methods for releasing functional data while preserving differential privacy. Specifically, we show that when the function of interest lies in a reproducing kernel Hilbert space, releasing the function plus an appropriately scaled Gaussian process noise term satisfies differential privacy. As examples we consider kernel density estimation, kernel support vector machines, and suitably smooth functions lying in a particular Sobolev space.
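The kernel density example can be sketched as follows: evaluate the estimator on a grid and add a draw from a Gaussian process whose covariance is the RKHS kernel itself. The noise multiplier `scale` stands in for the calibration in terms of ε, δ, and the RKHS norm of the estimator, which is not specified in this abstract, so the constant here is hypothetical:

```python
import numpy as np

def rbf(x, y, h):
    # Gaussian (RBF) kernel, which is also the covariance of the noise process.
    return np.exp(-((x - y) ** 2) / (2.0 * h ** 2))

def private_kde(data, grid, h, scale, rng=None):
    """Sketch of functional release: a Gaussian kernel density estimate
    evaluated on `grid`, plus a Gaussian process draw with covariance
    given by the same kernel.  `scale` is a placeholder for the privacy
    calibration (a function of epsilon, delta, and the RKHS norm)."""
    rng = np.random.default_rng(rng)
    # Kernel density estimate on the grid.
    f = rbf(grid[:, None], data[None, :], h).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    # Gram matrix of the kernel on the grid, with jitter for numerical stability.
    K = rbf(grid[:, None], grid[None, :], h) + 1e-10 * np.eye(len(grid))
    noise = rng.multivariate_normal(np.zeros(len(grid)), K)
    return f + scale * noise
```

Because the noise is itself a draw from a Gaussian process in the same RKHS, the released object remains a smooth function rather than an arbitrary noisy vector.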

Finally, we propose a relaxed privacy definition called random differential privacy (RDP). Differential privacy requires that adding any new observation to a database has only a small effect on the output of the data-release procedure; random differential privacy requires this only for a randomly drawn new observation. We prove an analog of the composition property of differentially private procedures for this new definition. We show how to release an RDP histogram, and we demonstrate that RDP histograms are much more accurate than histograms obtained under ordinary differential privacy. We also give an analog of the global sensitivity framework for the release of functions under our privacy definition. Lastly, we demonstrate an extension of this idea and describe how it relates to other, looser privacy definitions that have begun to appear in the literature.
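As a sketch of the contrast (notation mine, following the prose above): ordinary differential privacy quantifies over all pairs of neighboring databases, whereas random differential privacy only asks that the same inequality hold with high probability when the records are drawn from the underlying data distribution:

```latex
% Ordinary \varepsilon-differential privacy: for ALL neighboring D, D',
\Pr[A(D) \in S] \le e^{\varepsilon} \Pr[A(D') \in S].
% (\varepsilon,\gamma)-random differential privacy (sketch): the same
% inequality holds with probability at least 1-\gamma over databases
% D, D' whose records are drawn from the data-generating distribution.
```

Excluding a γ-fraction of pathological databases is what permits the substantially lower noise, and hence the more accurate histograms, claimed above.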

108 pages

SCS Technical Report Collection