Computational Biology Department
School of Computer Science, Carnegie Mellon University


Computational methods for exploring gene regulation
mechanisms using high-throughput sequencing data

Hao Wang

September 2016

Ph.D. Thesis


Keywords: Algorithms, Computational Biology, Chromosome Conformation Capture, Three-Dimensional Structure, Ribosome Profiling, Translation, Ribosome Load, Isoform-level Estimation, Codon Bias, Wobble Pairing, Ribosome Collision

Gene expression has been studied extensively at the transcript level with the help of RNA-seq technology; however, less attention has been paid to gene regulation before and after transcription. For example, it is not clear whether genome structure plays an important role in gene function, nor is it clear how gene expression is regulated by translational speed at the codon level. Recently, several high-throughput sequencing techniques have been developed to help answer these questions. Specifically, Chromosome Conformation Capture (3C) was developed to capture spatially close chromatin loci in cell nuclei and enables whole-genome structure studies, and ribosome profiling (ribo-seq) was developed to study ribosome location preferences during translation and enables genome-wide translational studies. However, the complicated experimental pipelines make these data inherently noisy, and typical approaches to processing them are error-prone and computationally expensive.

We developed several computational pipelines that process these data to advance downstream analysis of gene regulation. Specifically, we developed a graph-based test that uses 3C data to identify sets of functionally related genomic loci that are significantly closer in space than expected by chance. Compared to typical methods, our approach is computationally inexpensive and more robust to unmeasured interactions and to the inclusion of non-associated loci. We also developed a pipeline to estimate ribosome occupancy preferences at the transcript level from ribo-seq data. This is the first systematic approach to address the ubiquitous multi-mapping reads in ribo-seq data and to quantify ribosome locations at the transcript level. It yields better estimates of both ribosome profiles and ribosome loads. In addition, we designed a mathematical model and algorithm to recover ribosome positions from ribo-seq data. Unlike existing simple heuristics that make inaccurate assumptions about how ribo-seq reads are digested, our approach captures the complicated digestion pattern in a flexible, data-driven way and outputs better ribosome profiles that help reveal biologically reasonable translation patterns.

Using these improved preprocessing pipelines, we estimated codon decoding times in yeast and showed that both codon usage and wobble pairing play a role in regulating translational speed. Lastly, we performed the first genome-wide analysis of ribosome collisions with the help of a modified ribosome profiling protocol. Our preliminary results indicate that extreme slow-down of local ribosome movement during translation is likely to be random and rare, and that identifying programmed ribosome stalling requires further experiments with deeper sequencing. Together, our algorithms and analyses help build the foundation for exploring pre- and post-transcriptional regulation of gene expression, which will help us understand the mechanisms of cell growth and death, differential gene expression across conditions and cell types, and the development and causes of diseases.

123 pages

Thesis Committee:
Carl Kingsford (Chair)
James R. Faeder (University of Pittsburgh)
Joel McManus
Sridhar Hannenhalli (University of Maryland)

Robert F. Murphy, Head, Computational Biology Department
Andrew W. Moore, Dean, School of Computer Science
