Computer Science Department
School of Computer Science, Carnegie Mellon University
The recent explosion of genomic data has enabled in-depth investigation of complex genetic mechanisms for applications such as inferring human evolutionary history and searching for the genetic basis of phenotypic traits. Although great advances have been made in analyzing the genetic processes underlying such data, most statistical methods developed so far treat closely related genetic objects separately using specialized methods, and do not capture the intrinsic relatedness among multiple properties that result from a common inheritance process. Moreover, these approaches often ignore the inherent uncertainty about the genetic complexity of the data and rely on inflexible models built on restrictive assumptions.
In this thesis, we develop nonparametric Bayesian models for learning ancestral genetic processes. These models provide more flexible control over the complexity of the genetic data and, at the same time, use the structure in the data in a more principled way. Under a unified inheritance framework built on the assumption that hypothetical founder haplotypes generate modern individual chromosomes, we develop hierarchical Bayesian models based on the Dirichlet process for the following related applications in population genetics: haplotype inference from multipopulation genotype data, joint inference of population structure and recombination events, and local ancestry estimation in admixed populations. This approach allows one to explicitly exploit the structural information shared across data from multiple populations. The resulting methods have been shown to significantly outperform existing methods that do not properly utilize such relatedness.