Computer Science Department
School of Computer Science, Carnegie Mellon University
A Statistical Framework
Comparison of the spatial organization of related genomes reveals a wealth of information about how complex biological systems evolve and function. A fundamental task in spatial comparative genomics is the identification of homologous genomic regions, that is, regions that have descended from a common region in an ancestral genome. While closely related regions are characterized by conserved gene content and order, in more distantly related genomes homologous regions will be apparent only as gene clusters: pairs of regions with similar, but not identical, gene content and scrambled gene order. As gene content and order diverge, statistical tests to reject the null hypothesis that these regions share genes by chance become essential.
In this thesis, I provide statistical tests to assess the significance of gene clusters for a variety of biological questions and search scenarios. My first contribution is a formal statistical framework, the first of its kind, for the max-gap cluster, the most widely used cluster definition in genomic analyses. This framework provides statistical tests for two common search scenarios and facilitates principled selection of parameter values prior to conducting a search for gene clusters.
Second, I propose novel statistical tests for clusters spanning three genomic regions, motivated by two comparative genomics applications: analysis of conserved linkage within multiple species and identification of large-scale duplications. Multi-genome clusters are of increasing importance, yet existing tests focus almost exclusively on pairwise comparisons. My results demonstrate that simultaneously considering information from more than two regions dramatically improves sensitivity over pairwise methods.
Third, I demonstrate the importance of incorporating cluster statistics in algorithms for spatial comparative genomics. Orthologs, genes that descended from a common ancestor through speciation, are the fundamental unit of comparison in many comparative genomics applications. Using my statistical framework for evaluating max-gap clusters, I develop a new method for ortholog prediction based on conserved spatial organization. Using statistical significance to rank conserved patterns makes it possible to accommodate a variety of spatial features in a single framework, yielding a method that can be applied to a broad range of genomic data sets. This flexible framework outperforms current spatial ortholog prediction methods, especially on highly diverged genomes.