Computer Science Department
School of Computer Science, Carnegie Mellon University
Algorithms for Analyzing Intraspecific
Analysis of human genetic variation has gained significant momentum due to the success of many large-scale sequencing projects. Within the last five years, about 4 million single nucleotide polymorphisms (SNPs) have been genotyped over hundreds of individuals belonging to several ethnic groups. Recently, researchers have made significant progress in detecting and cataloging copy number polymorphisms (CNPs) as well. These large-scale efforts have now made it possible to analyze genetic variation within a single species, mainly humans, at very fine-scales. In our work, we address a fundamental question: how can we use these genetic differences to make inferences about the recent history of a species and its genome?
In this work, we primarily focus on two methods to analyze SNP variation data within humans. We develop maximum parsimony phylogenetic tree reconstruction algorithms that are specifically catered to work on SNP data. Such a phylogeny should cluster closely related individuals (perhaps an ethnic group) together. Therefore, these techniques are widely used to answer questions of human migration. Recently, these methods have also been applied to find regions in the human genome that are rapidly evolving. A second method to understand genetic variation is to directly cluster the SNP data based on the property that individuals within a cluster would share similar genotypes (and phenotypes).
Mathematically, we work with two variants of the phylogeny reconstruction problem, both of which are NP-complete. The first variant is equivalent to finding a Steiner minimum tree on a hypercube and the second is a generalization of this problem. We solve the two variants in polynomial time when the size of the phylogeny (Steiner minimum tree) is 'small' (near-perfect). We also develop integer linear program based algorithms that find the optimal solution fast in practice. We provide extensive empirical analysis to demonstrate the methods' usefulness on large-scale genomic data. For the clustering part, we work with two related problems. We solve the first problem by finding the max-cut of a graph, a well-known NP-complete problem. We can show that if the graph is generated with the clusters 'planted', then our algorithm returns the max-cut in polynomial time given enough data. The second problem is more general in that individuals (vertices) can fractionally belong to either of the clusters. We solve this problem by attempting to maximize the likelihood function using an initial solution that we show to be analytically correct under some theoretical assumptions.