CMU-CB-20-103
Ray and Stephanie Lane Computational Biology Department
School of Computer Science, Carnegie Mellon University
Computational Methods for Multi-Species Comparison
Yang Yang
December 2020
Ph.D. Thesis
Recent developments in chromatin interaction mapping technologies have greatly advanced the study of higher-order genome organization in the three-dimensional (3D) cell nucleus, which is of vital importance to fundamental genome functions such as DNA replication timing (RT) and gene transcription. However, the principles underlying 3D genome organization and function, and the detailed patterns of how the 3D genome has changed during mammalian evolution, remain largely unclear. To address these questions directly, in this Ph.D. dissertation I developed new machine learning frameworks that advance the methodology for comparing 3D genome organization across multiple species and for unveiling critical information encoded in the genome that may regulate large-scale chromosome structure and function. First, I developed a new model named phylogenetic hidden Markov Gaussian processes (Phylo-HMGP) to simultaneously infer genome-wide heterogeneous evolutionary patterns of continuous-trait functional genomic features. Phylo-HMGP models both temporal dependencies across species and spatial dependencies along the genome. Application to a new Repli-seq-based RT dataset from five primate species demonstrated that Phylo-HMGP greatly refined our understanding of cross-species RT patterns. Next, I developed a new probabilistic model named Phylo-HMRF, a unique framework for comparing multi-species 3D genome organization based on Hi-C data. The method integrates 3D spatial constraints with continuous-trait evolutionary models. Phylo-HMRF uncovered patterns of 3D genome evolution in primate species that show novel connections to other structural and functional features of the genome. Finally, I developed a generic interpretable machine learning framework named CONCERT to predict large-scale chromosome domain features, with a focus on RT profiles, directly from genomic sequences. CONCERT enables the identification of genome-wide sequence elements that modulate the RT program by jointly performing predictive element estimation and long-range spatial dependency learning. Application of CONCERT to multiple human and mouse cell types demonstrated the effectiveness of the method. Taken together, the methods developed in this Ph.D. dissertation establish a series of new algorithmic formulations for effective comparison and high-resolution characterization of the evolutionary patterns of 3D genome organization, revealing genomic regions with conserved or species-specific structural and functional roles. The methods have the potential to provide critical insights into the regulatory principles of nuclear structures and the sequence determinants underlying the strongly intertwined nature of genome structure and function.
244 pages