CMU-ISRI-04-134
Institute for Software Research International
School of Computer Science, Carnegie Mellon University
Protecting DNA Sequence Anonymity
with Generalization Lattices
Bradley Malin
October 2004
Keywords: Anonymity, confidentiality, privacy, genomics, genetic
databases, k-anonymity
The increased collection, storage, and analysis of person-specific
DNA sequences poses serious challenges to the protection of the
identities to which such sequences correspond. Compromise of DNA
privacy via re-identification, that is, inference of the explicit
identity of the individual from whom the DNA was derived, depends on
unique features that may be inferred from a DNA sequence. In this
paper we introduce a computational method for anonymizing a collection
of person-specific DNA sequences in a database. The method is termed
DNA lattice anonymization (DNALA), and is based upon the privacy
protection schema of k-anonymity. Under this model, it is impossible
to observe or learn features that distinguish one genetic sequence
record from at least k - 1 other entries. We employ a concept
generalization lattice to determine the distance between two residues
in a single nucleotide region, which provides the most similar
generalized concept for two residues (e.g., adenine and guanine are
both purines). Each single nucleotide region is considered
independently of every other region when determining the distance
between sequences. The DNALA
method chooses pairs of sequences separated by minimal distance
and generalizes each pair to a common sequence. The method is
tested and evaluated on several publicly available human
population datasets.
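
The pairing idea described in the abstract can be illustrated with a minimal sketch for k = 2. The sketch is not the report's implementation; it rests on several assumptions: sequences are pre-aligned and of equal length, the lattice has only three levels (a residue itself, purine R or pyrimidine Y, and the fully general symbol N), the distance between two residues is the level of their least common generalization, and pairs are chosen greedily. All function names are hypothetical.

```python
from itertools import combinations

# Assumed three-level lattice: residue -> purine/pyrimidine -> any base.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def generalize_residue(a, b):
    """Return (least common generalization, lattice level) for two residues."""
    if a == b:
        return a, 0          # identical residues need no generalization
    if {a, b} <= PURINES:
        return "R", 1        # both purines
    if {a, b} <= PYRIMIDINES:
        return "Y", 1        # both pyrimidines
    return "N", 2            # one purine and one pyrimidine


def sequence_distance(s, t):
    """Sum per-position distances, treating each position independently."""
    return sum(generalize_residue(a, b)[1] for a, b in zip(s, t))


def generalize_pair(s, t):
    """Build the common generalized sequence for an aligned pair."""
    return "".join(generalize_residue(a, b)[0] for a, b in zip(s, t))


def anonymize_pairs(seqs):
    """Greedily pair the closest remaining sequences and generalize each pair.

    Assumes an even number of sequences; with an odd number, the leftover
    sequence is returned unchanged and would not be protected.
    """
    remaining = list(range(len(seqs)))
    result = list(seqs)
    while len(remaining) >= 2:
        # Pick the pair of remaining sequences with the smallest distance.
        i, j = min(
            combinations(remaining, 2),
            key=lambda p: sequence_distance(seqs[p[0]], seqs[p[1]]),
        )
        result[i] = result[j] = generalize_pair(seqs[i], seqs[j])
        remaining.remove(i)
        remaining.remove(j)
    return result


if __name__ == "__main__":
    print(anonymize_pairs(["AACT", "GACT", "CCTT", "CCTC"]))
```

After the call, every released sequence is shared by two records, so no record can be distinguished from its partner. A greedy choice does not necessarily minimize the total amount of generalization across all pairs.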
13 pages