Institute for Software Research International
School of Computer Science, Carnegie Mellon University


Protecting DNA Sequence Anonymity
with Generalization Lattices

Bradley Malin

October 2004


Keywords: Anonymity, confidentiality, privacy, genomics, genetic databases, k-anonymity

The increased collection, storage, and analysis of person-specific DNA sequences poses serious challenges to protecting the identities of the individuals to whom such sequences correspond. Compromise of DNA privacy via re-identification, the inference of the explicit identity of the individual from whom the DNA was derived, depends on unique features that may be inferred from a DNA sequence. In this paper we introduce a computational method for anonymizing a collection of person-specific DNA sequences in a database. The method is termed DNA lattice anonymization (DNALA), and is based upon the privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence record from k - 1 other entries. We employ a concept generalization lattice to determine the distance between two residues in a single nucleotide region; the lattice provides the most similar generalized concept for two residues (e.g., adenine and guanine are both purines). Each single nucleotide region is considered independently of every other region when determining the distance between sequences. The DNALA method chooses pairs of sequences that are a minimal distance apart and generalizes each pair to a common sequence. The method is tested and evaluated with several publicly available human population datasets.
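The pairing idea described in the abstract can be illustrated with a minimal sketch for k = 2. Everything specific here is an assumption made for illustration and is not taken from the report: the three-level lattice written with IUPAC codes (bases, then R for purine and Y for pyrimidine, then N for any base), the residue distance defined as the total number of lattice levels climbed to reach the common concept, the greedy closest-pair matching, and all function names. Sequences are assumed to be aligned, of equal length, and even in number.

```python
from itertools import combinations

# Assumed lattice: base -> R (purine) or Y (pyrimidine) -> N (any base).
PARENT = {"A": "R", "G": "R", "C": "Y", "T": "Y", "R": "N", "Y": "N", "N": None}
LEVEL = {"A": 0, "C": 0, "G": 0, "T": 0, "R": 1, "Y": 1, "N": 2}


def ancestors(symbol):
    """Return the symbol followed by its ancestors, most specific first."""
    chain = []
    while symbol is not None:
        chain.append(symbol)
        symbol = PARENT[symbol]
    return chain


def generalize(x, y):
    """Return the most specific lattice concept covering both residues."""
    y_chain = set(ancestors(y))
    for candidate in ancestors(x):
        if candidate in y_chain:
            return candidate
    raise ValueError(f"no common concept for {x!r} and {y!r}")


def residue_distance(x, y):
    """Assumed distance: levels climbed by x plus levels climbed by y."""
    g = generalize(x, y)
    return (LEVEL[g] - LEVEL[x]) + (LEVEL[g] - LEVEL[y])


def sequence_distance(s, t):
    """Sum per-position distances, treating each position independently."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to equal length")
    return sum(residue_distance(a, b) for a, b in zip(s, t))


def generalize_pair(s, t):
    """Generalize two aligned sequences position by position."""
    return "".join(generalize(a, b) for a, b in zip(s, t))


def anonymize_pairs(sequences):
    """Greedily pair the closest sequences and generalize each pair (k = 2)."""
    if len(sequences) % 2 != 0:
        raise ValueError("this sketch requires an even number of sequences")
    # Rank all candidate pairs by distance, closest first.
    candidates = sorted(
        combinations(range(len(sequences)), 2),
        key=lambda p: sequence_distance(sequences[p[0]], sequences[p[1]]),
    )
    result = [None] * len(sequences)
    for i, j in candidates:
        if result[i] is None and result[j] is None:
            shared = generalize_pair(sequences[i], sequences[j])
            result[i] = result[j] = shared
    return result


if __name__ == "__main__":
    print(anonymize_pairs(["ACGT", "GCGT", "ACTT", "ACCT"]))
```

After this step every released sequence is identical to at least one other, so no record can be distinguished from its partner. Greedy matching is only one simple pairing strategy and does not guarantee the smallest total generalization over the whole collection.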

13 pages
