CMU-CB-20-103
Ray and Stephanie Lane Computational Biology Department
School of Computer Science, Carnegie Mellon University



Computational Methods for Multi-Species Comparison
of 3D Genome Organization and Function

Yang Yang

December 2020

Ph.D. Thesis

CMU-CB-20-103.pdf


Keywords: 3D genome organization, DNA replication timing, Ornstein-Uhlenbeck process, Gaussian process, hidden Markov model, hidden Markov random field, interpretable machine learning

Recent developments in chromatin interaction mapping technologies have greatly advanced the study of higher-order genome organization in the three-dimensional (3D) cell nucleus, which is of vital importance to fundamental genome functions such as DNA replication timing (RT) and gene transcription. However, the principles underlying 3D genome organization and function, and the detailed patterns of how the 3D genome has changed in mammalian evolution, remain largely unclear. To directly address these questions, in this Ph.D. dissertation, I developed new machine learning frameworks to advance the methodologies for comparing 3D genome organization across multiple species and for unveiling critical information encoded in the genome that may regulate large-scale chromosome structure and function. First, I developed a new model named phylogenetic hidden Markov Gaussian processes (Phylo-HMGP) to simultaneously infer genome-wide heterogeneous evolutionary patterns of continuous-trait functional genomic features. Phylo-HMGP models both temporal dependencies across species and spatial dependencies along the genome. Application to a new RT dataset based on Repli-seq from five primate species demonstrated that Phylo-HMGP greatly refined our understanding of cross-species RT patterns. Next, I developed a new probabilistic model named Phylo-HMRF, a unique framework to compare multi-species 3D genome organizations based on Hi-C data. The method incorporates 3D spatial constraints with continuous-trait evolutionary models. Phylo-HMRF uncovered patterns of 3D genome evolution in primate species that show novel connections to other genome structural and functional features. Finally, I developed a generic interpretable machine learning framework named CONCERT to predict large-scale chromosome domain features, with a focus on RT profiles, directly from genomic sequences. CONCERT enables the identification of genome-wide sequence elements that modulate the RT program by jointly performing predictive element estimation and long-range spatial dependency learning. Application of CONCERT to multiple human and mouse cell types demonstrated the effectiveness of the method. Taken together, the methods developed in this Ph.D. dissertation have established a series of new algorithmic formulations for effective comparison and high-resolution characterization of evolutionary patterns of 3D genome organization, revealing genomic regions with conserved or species-specific structural and functional roles. The methods have the potential to provide critical insights into the regulatory principles of nuclear structures and the sequence determinants underlying the strongly intertwined nature of genome structure and function.

244 pages

Thesis Committee:
Jian Ma (Chair)
Ziv Bar-Joseph
Anne-Ruxandra Carvunis (University of Pittsburgh)
David Haussler (University of California, Santa Cruz)

Russell S. Schwartz, Head, Computational Biology Department
Martial Hebert, Dean, School of Computer Science


