Computer Science Department
School of Computer Science, Carnegie Mellon University


Structured Sparse Models and Algorithms for Genetic Analysis

Seunghak Lee

May 2015

Ph.D. Thesis


Currently Unavailable Electronically

Keywords: Structured Sparsity, Screening, Lasso, Group lasso, Structured Association Mapping, Genome-wide Association Study

Identifying genetic variants (e.g., single nucleotide polymorphisms) associated with phenotypic variations (e.g., disease status) is a fundamental problem in genetics. However, most genetic variants associated with complex phenotypes remain elusive. A major challenge is that the number of samples is much smaller than the number of genetic variants, and thus the statistical power to detect phenotype-associated genetic variants is limited.

In this thesis, to enhance statistical power, we develop structured sparse models and algorithms that detect genotype-phenotype associations by exploiting biological knowledge and structure in the data or the problem. We first develop models and algorithms, including the adaptive multi-task lasso and the structured input-output lasso, that take advantage of genome annotations or group structures in genomes and phenotypic traits. We then develop a sparse piecewise linear model that captures non-linear structure in the problem in order to detect trait-associated interactions between genetic variants.
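For orientation, the standard textbook forms of the lasso and the overlapping group lasso can be written as below. These are generic formulations, not necessarily the exact models of the thesis. The symbols are assumptions introduced here: y is the phenotype vector, X the genotype matrix, beta the coefficient vector, lambda the regularization strength, G a collection of (possibly overlapping) groups of variants, and w_g a per-group weight.

```latex
% Lasso: sparsity over individual genetic variants
\min_{\beta \in \mathbb{R}^{p}} \;
  \frac{1}{2} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1

% Overlapping group lasso: sparsity over groups g in G, which may share variants
\min_{\beta \in \mathbb{R}^{p}} \;
  \frac{1}{2} \lVert y - X\beta \rVert_2^2
  + \lambda \sum_{g \in G} w_g \lVert \beta_g \rVert_2
```

When every group contains exactly one variant and each w_g equals 1, the second objective reduces to the first.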

To enable the analysis of large-scale human data, we scale up algorithms for structured sparse models. Specifically, we develop a screening algorithm for the overlapping group lasso (i.e., a general form of structured sparse models) that allows us to safely discard irrelevant genetic variants using simple rules. Because screening can dramatically reduce the number of candidate genetic variants before the original problem is solved, it makes large structured sparse problems feasible to solve. Finally, using the aforementioned models and algorithms, we present a method that integrates genotypic, gene expression, and phenotypic data to detect phenotype-associated genetic variants while unveiling their association mechanisms. Using this integrative approach, we analyze large-scale Alzheimer's disease data and identify genetic variants and genes associated with Alzheimer's disease status. As examples, we investigate the mechanisms of some associations involved in beta-amyloid, estrogen, and nicotine pathways.
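The idea of safely discarding variants with a simple rule, as described above, can be illustrated with a minimal sketch. It does not reproduce the thesis's screening algorithm for the overlapping group lasso. It swaps in the basic SAFE rule for the plain lasso (minimize 0.5 * ||y - X b||^2 + lam * ||b||_1), which guarantees that any discarded feature has a zero coefficient in the lasso solution. The function names `dot` and `safe_screen` and the toy data are hypothetical and chosen only for illustration.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def safe_screen(columns, y, lam):
    """Return indices of features the basic SAFE rule can discard.

    columns: list of feature vectors (one per genetic variant), each len(y)
    y:       response vector (phenotype)
    lam:     lasso regularization strength (assumed > 0)

    Feature j is discarded when
        |x_j . y| < lam - ||x_j|| * ||y|| * (lam_max - lam) / lam_max,
    where lam_max = max_j |x_j . y|.
    """
    corr = [abs(dot(x, y)) for x in columns]
    lam_max = max(corr)
    if lam >= lam_max:
        # The all-zero vector is already optimal, so every feature can go.
        return list(range(len(columns)))
    y_norm = sqrt(dot(y, y))
    discarded = []
    for j, x in enumerate(columns):
        threshold = lam - sqrt(dot(x, x)) * y_norm * (lam_max - lam) / lam_max
        if corr[j] < threshold:
            discarded.append(j)
    return discarded


# Toy data (assumption): three orthonormal features, one strongly correlated with y.
columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.1, 0.0]
print(safe_screen(columns, y, lam=2.5))
```

The rule needs only inner products and norms, so it runs in a single pass over the variants before any optimization begins. It also becomes less aggressive as lam shrinks relative to lam_max, which is why screening rules of this kind are often applied along a decreasing sequence of regularization values.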


Thesis Committee:
Eric P. Xing (Chair)
Ziv Bar-Joseph
Garth Gibson
Larry Wasserman
Matthew Stephens (University of Chicago)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science
