CMU-ML-13-101
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Probabilistic Models for Collecting, Analyzing,
and Modeling Expression Data

Hai-Son Phuoc Le

May 2013

Ph.D. Thesis



Keywords: Genomics, gene expression, gene regulation, microarray, RNA-Seq, transcriptomics, error correction, comparative genomics, regulatory networks, cross-species, expression database, Gene Expression Omnibus, GEO, orthologs, microRNA, target prediction, Dirichlet Process, Indian Buffet Process, hidden Markov model, immune response, cancer.


Advances in genomics allow researchers to measure the complete set of transcripts in cells. These transcripts include messenger RNAs (which encode proteins) and microRNAs, short RNAs that play an important regulatory role in cellular networks. While this data is a great resource for reconstructing the activity of networks in cells, it also presents several computational challenges. These challenges include handling the incomplete and noisy measurements that often result from the data collection stage, developing methods to integrate several experiments within and across species, and designing methods that can use this data to map the interactions and networks that are activated in specific conditions. Novel and efficient algorithms are required to successfully address these challenges.

In this thesis, we present probabilistic models to address the set of challenges associated with expression data. First, we present a novel probabilistic error correction method for RNA-Seq reads. RNA-Seq generates large and comprehensive datasets that have revolutionized our ability to accurately recover the set of transcripts in cells. However, sequencing reads inevitably contain errors, which affect all downstream analyses. To address this problem, we develop an efficient hidden Markov model-based error correction method for RNA-Seq data. Second, for the analysis of expression data across species, we develop clustering and distance function learning methods for querying large expression databases. The methods use a Dirichlet Process Mixture Model with latent matchings and infer soft assignments between genes in two species to allow comparison and clustering across species. Third, we introduce new probabilistic models to integrate expression and interaction data in order to predict targets and networks regulated by microRNAs.

Combined, the methods developed in this thesis provide a solution for the expression analysis pipeline that experimentalists use when performing expression experiments.

182 pages


SCS Technical Report Collection
School of Computer Science