Lane Center for Computational Biology
School of Computer Science, Carnegie Mellon University
On the Identification and Investigation of
Jacob M. Joseph
This dissertation addresses the identification and characterization of homologous gene families in large-scale genomic data. Particular emphasis is placed on multidomain gene families: multidomain sequences represent at least half of the sequence universe, yet they present an especially challenging case for family classification. These sequences are often excluded from analyses because they tend to interfere with classification by existing methods. This thesis develops the theoretical context for family classification of datasets that contain multidomain sequences, and demonstrates the implementation necessary for performing classification on large datasets.
Five primary results are presented in this work. First, a definition of homology is formulated that encompasses the evolutionary scenarios giving rise to multidomain families. Second, the techniques and implementation of family classification are presented. The methodology takes protein sequence data as input and, by explicitly considering the evolutionary signal of gene duplication inherent in a sequence similarity network, derives a network that is an accurate estimate of homology. Third, the structure of this network is examined and compared to the theoretical construct of a network of homology. Fourth, an approach for predicting families from this network is developed. Importantly, a statistical framework is presented for evaluating the result using a limited set of curated families. Finally, the interplay between domains and the clustering result is examined using an information-theoretic approach.