CMU-CS-04-138 Computer Science Department School of Computer Science, Carnegie Mellon University
Protein Similarity from Knot Theory: Michael A. Erdmann May 2004
This report supersedes and enhances Computer Science Department CMU-CS-04-138.pdf
Shape similarity can be formulated (i) in terms of global metrics, such as RMSD or Hausdorff distance, (ii) in terms of subgraph isomorphisms, such as the detection of shared substructures with similar relative locations, or (iii) purely topologically, in terms of the cohomology induced by structure preserving transformations. Existing protein structure detection programs are built on the first two types of similarity. The third forms the foundations of knot theory. The thesis of this paper is: Protein similarity detection leads naturally to an algorithm operating at the metric, relational, and isotopic scales. The paper introduces a definition of similarity based on atomic motions that preserve local backbone topology without incurring significant distance errors. Such motions are motivated by the physical requirements for rearranging subsequences of a protein. Similarity detection then seeks rigid body motions able to overlay pairs of substructures, each related by a substructure-preserving motion, without necessarily requiring global structure preservation. This definition is general enough to span a wide range of questions: One can ask for full rearrangement of one protein into another while preserving global topology, as in drug design; or one can ask for rearrangements of sets of smaller substructures, each of which preserves local but not global topology, as in protein evolution.
In the appendix, we exhibit an algorithm for answering the general question. That algorithm has the complexity of robot motion planning. In the text, we consider a more common case in which one seeks protein similarity by rearrangements of relatively short peptide segments. We exhibit two algorithms, one based on writhing numbers and one based on line weavings. The algorithms have time complexities ranging from O(n^2) to O(s^{11}), depending on level of detail, where n is the number of residues in the protein and s is the number of secondary structure elements. In practice, the running times were nearly interactive. We define and use a new datastructure, called geometric self-convolution, within the writhing-based algorithm.
Contributions: We believe that this is the first paper to consider carefully the need for combining metric and isotopic qualities in seeking protein similarity. We provide a parameterized definition of similarity that leads naturally to a metric in protein space. The underlying topological approach leads further to a representation of proteins by line weavings. We exhibit algorithms for computing the metric and for detecting similarity. We report results obtained with a dozen pairs of proteins, exhibiting a range of typical features. 48 pages
| |
Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |