Institute for Software Research
School of Computer Science, Carnegie Mellon University


Trail Re-Identification and Unlinkability
in Distributed Databases

Bradley Malin

May 2006

Ph.D. Thesis

Keywords: Privacy, anonymity, record linkage, distributed systems, databses, data mining, secur mulriparty computation, genetic variation, medical privacy, Internet privacy, public policy

The term "re-identification" refers to the correct relation of seemingly anonymous data to explicit identifying information, such as the name or address of people who are the subjects of the data. Historically, re-identification methods were evaluated with respect to a single provider's disclosed databases. Imagine that a data holder discloses two databases: one consists of unidentified data, such as DNA sequences and the other contains identifiable data, such as personal names. The databases appear unrelated and, as a result, existing data protection policies certify sufficient protection from reidentification. However, when multiple locations make such releases available, an individual's data can be tracked across locations, resulting in a locationvisit footprint, or a trail. Unique trails can be leveraged for re-identification. In this dissertation, traditional notions of privacy are extended to account for trail re-identification. The goals of this dissertation are: 1) to prove data protection policies can and do fail when they rely on ad-hoc strategies and 2) to demonstrate policies can be strengthened with formal computational models for privacy technology.

In this work, trails are studied in two principle parts. First, we concentrate on the trail re-identification problem and develop several learning algorithms for discovering re-identifications. The algorithms are evaluated on populations derived from real world databases, including hospital visits derived from medical databases and weblogs derived from Internet databases. It is demonstrated that susceptibility to trail re-identification is neither trivial nor the result of bizarre isolated occurrences. Experimental evidence with real world populations confirms that significant quantities of populations are at trail reidentification risk.

Second, we propose a protocol by which data holders can collaborate to provably prevent trail re-identification. To do so, we introduce a formal model of privacy called k-unlinkability, and several configurable algorithms to render protected trails. To satisfy real world policy constraints, we present a novel secure multiparty computation protocol that embeds the protection procedure. Using real world datasets, it is demonstrated that significant quantities of data can be disclosed with provable privacy guarantees.

253 pages

Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by