CMU-ML-07-107
Machine Learning Department
School of Computer Science, Carnegie Mellon University




A Nonparametric Bayesian Approach for
Haplotype Reconstruction from
Single and Multi-Population Data

Eric P. Xing, Kyung-Ah Sohn

April 2007



Keywords: Haplotype inference, Dirichlet process, hierarchical Dirichlet process, mixture model, population genetics


Uncovering the haplotypes of single nucleotide polymorphisms (SNPs) and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far, including those based on approximate coalescence, finite mixtures, and maximal parsimony, often bypass issues such as the unknown complexity of the haplotype space and the demographic structures underlying multi-population genotype data. In this paper, we propose a new class of haplotype inference models based on a nonparametric Bayesian formalism built on the Dirichlet process, which represents a tractable surrogate to the coalescent process underlying population haplotypes and offers a well-founded statistical framework for tackling the aforementioned issues. Our proposed model, known as a hierarchical Dirichlet process mixture, is exchangeable, unbounded, and capable of coupling demographic information of different populations for posterior inference of individual haplotypes, the size and configuration of haplotype ancestor pools, and other parameters of interest given genotype data. The resulting haplotype inference program, Haploi, is readily applicable to genotype sequences with thousands of SNPs, at a time cost often two orders of magnitude less than that of the state-of-the-art PHASE program, with competitive and sometimes superior performance. Haploi also significantly outperforms several other extant algorithms on both simulated and realistic data.
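
The "unbounded" property of a Dirichlet process mixture can be illustrated with the Chinese restaurant process, a standard sequential view of how a Dirichlet process partitions items into clusters. The sketch below is not the Haploi implementation and makes no claim about the report's actual sampler; the function name `crp_assignments`, the concentration parameter `alpha`, and the reading of each cluster as one ancestral haplotype are assumptions made for illustration only.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Draw cluster labels for n items from a Chinese restaurant process.

    Hypothetical illustration: each cluster can be read as one ancestral
    haplotype, and each item as one individual haplotype joining it.
    alpha must be positive.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items already assigned to cluster k
    labels = []  # labels[i] = cluster index chosen for item i
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n=50, alpha=1.0, seed=1)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters is not fixed in advance: it is determined by the draws and tends to grow slowly as more items are added, which is the sense in which such a model does not require the size of the ancestor pool to be specified beforehand.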

35 pages

