Machine Learning Department
School of Computer Science, Carnegie Mellon University
Statistical Methods for Studying
Genetic variation in populations is of great interest for understanding
the evolutionary history of humans and other species. Improvements in
sequencing technology have made many large genetic datasets available,
and computational methods have therefore become important for analyzing
these data. Two important problems studied using genetic data are
population stratification (modeling individual ancestry with respect to
ancestral populations) and genetic association (finding genetic
polymorphisms that affect a trait). In this thesis, we develop methods
to improve our understanding of these two problems.
For the population stratification problem, we develop hierarchical Bayesian models that incorporate the evolutionary processes that are known to affect genetic variation. By developing mStruct, we show that modeling more evolutionary processes improves the accuracy of the recovered population structure. We demonstrate how nonparametric Bayesian processes can be used to address the question of choosing the optimal number of ancestral populations that describe the genetic diversity of a given sample of individuals. We also examine how sampling bias in genotyping study design can affect results of population structure analysis and propose a probabilistic framework for modeling and correcting sample selection bias.
Genome-wide association studies (GWAS) have vastly improved our understanding of many diseases. However, such studies have failed to uncover much of the variation responsible for a number of common multi-factorial diseases and complex traits. We show how artificial selection experiments on model organisms can be used to better understand the nature of genetic associations. We demonstrate through simulations that data from artificial selection experiments improve the performance of conventional association methods. We also validate our approach using semi-simulated data from an artificial selection experiment on Drosophila melanogaster.