Machine Learning Department
School of Computer Science, Carnegie Mellon University
Probabilistic Models for Collecting, Analyzing,
Hai-Son Phuoc Le
Advances in genomics allow researchers to measure the complete set of
transcripts in cells. These transcripts include messenger RNAs (which
encode proteins) and microRNAs, short RNAs that play an important
regulatory role in cellular networks. While this data is a great resource
for reconstructing the activity of networks in cells, it also presents
several computational challenges. These challenges include the data
collection stage, which often results in incomplete and noisy measurements,
developing methods to integrate several experiments within and across
species, and designing methods that can use this data to map the interactions
and networks that are activated in specific conditions. Novel and efficient
algorithms are required to successfully address these challenges.
In this thesis, we present probabilistic models to address the set of challenges associated with expression data. First, we present a novel probabilistic error correction method for RNA-Seq reads. RNA-Seq generates large and comprehensive datasets that have revolutionized our ability to accurately recover the set of transcripts in cells. However, sequencing reads inevitably contain errors, which affect all downstream analyses. To address these problems, we develop an efficient hidden Markov model-based error correction method for RNA-Seq data. Second, for the analysis of expression data across species, we develop clustering and distance function learning methods for querying large expression databases. The methods use a Dirichlet Process Mixture Model with latent matchings and infer soft assignments between genes in two species to allow comparison and clustering across species. Third, we introduce new probabilistic models to integrate expression and interaction data in order to predict targets and networks regulated by microRNAs.
Combined, the methods developed in this thesis provide a solution for the expression analysis pipeline that experimentalists use when performing expression experiments.
SCS Technical Report Collection