CMU-CB-12-102
Lane Center for Computational Biology
School of Computer Science, Carnegie Mellon University




Analysis of High Throughput Genomic Datasets Across Species

Guy E. Zinman

May 2012

Ph.D. Thesis



Keywords: Dynamic network analysis, gene expression analysis, cross species analysis, protein-protein interactions, genetic interactions, soft clustering, active sub-networks, conservation and divergence

Genes are highly conserved between closely related species, and biological systems often use the same genes across different organisms. This conservation has made it possible to study biological systems in model organisms and to develop many drugs for human diseases by first studying simpler model organisms. New high-throughput technologies let researchers use interaction and expression data to obtain a more precise view of the roles and functions of biological processes across species. However, combining and comparing these types of data across species is challenging because of several problems, including homology assignments, coverage issues, and the quality of the data in each species.

This thesis studies several aspects of cross species analysis in light of these obstacles and introduces new algorithms and computational tools that specifically address them. First, we performed a global analysis of the conservation of interaction and expression data by developing a framework that integrates various data types from four model organisms. This analysis showed that while interactions are often not conserved at the protein level, they are conserved at a higher level of network organization. These findings paved the way for developing three tools aimed at analyzing expression data from multiple species concurrently: 1) ExpressionBlast, a search engine for gene expression data that can query experimental results obtained in one species against all public expression studies conducted in the same or a different species; 2) SoftClust, a new constrained clustering method that integrates expression data with sequence orthology information in a modified k-means model to jointly cluster expression data from several species; and 3) ModuleBlast, an active sub-network search tool that uses both static interaction data and condition-specific expression data from multiple species to understand the conservation and divergence of biological system dynamics.
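
To make the idea of orthology-constrained clustering concrete, the following is a minimal sketch of a k-means variant in which assigning a gene to a different cluster than its ortholog incurs a fixed penalty. It is not the SoftClust algorithm itself, whose details are not given here. The function name `constrained_kmeans`, the `penalty` parameter, the pooled expression matrix with shared conditions, and the farthest-point initialization are all assumptions made for illustration.

```python
import numpy as np

def constrained_kmeans(X, ortholog_pairs, k, penalty=1.0, n_iter=20):
    """Toy k-means with a soft penalty for splitting ortholog pairs.

    X: (n_genes, n_conditions) array pooling genes from all species
       (assumes the species share comparable conditions).
    ortholog_pairs: list of (i, j) row indices treated as orthologs.
    penalty: cost added when a gene and its ortholog land in different clusters.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Deterministic farthest-point initialization (an illustrative choice).
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    # For each gene, the list of its ortholog partners.
    partners = [[] for _ in range(n)]
    for i, j in ortholog_pairs:
        partners[i].append(j)
        partners[j].append(i)

    # Start from a plain nearest-centroid assignment.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        for g in range(n):
            cost = dists[g].copy()
            for p in partners[g]:
                # Penalize every cluster except the one holding the partner.
                cost += penalty
                cost[labels[p]] -= penalty
            labels[g] = cost.argmin()
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

With `penalty=0` this reduces to ordinary k-means. Raising the penalty pulls orthologs toward the same cluster even when their expression profiles differ somewhat, which is one simple way to trade off expression similarity against sequence orthology.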

The tools introduced in this thesis were incorporated into a web-based expression analysis package with enhanced support for cross species analysis. We hope that these tools will help elucidate the underlying molecular mechanisms in a variety of organisms.

182 pages


