CMU-CB-11-104
Lane Center for Computational Biology
School of Computer Science, Carnegie Mellon University

Modeling the Space of Subcellular Location Patterns
Luis Pedro Coelho
September 2011
Ph.D. Thesis

Protein location is one of the major areas of interest in the study of proteins. It can be studied one protein at a time or systematically, in a high-throughput fashion, an approach that has been called location proteomics. Subcellular location can either be predicted from the protein sequence, homology, or other circumstantial evidence such as interaction patterns, or determined by direct observation. The prediction approach has the advantage that it requires less data (sometimes only the sequence). On the other hand, its results are not as conclusive as those obtained from direct data. Furthermore, prediction is, at least with the most widely used techniques, based on static data (sequence, functional annotations, binding patterns, ...). Thus, most systems will predict the same location regardless of cell type or cell state.
Direct data is normally in the form of images of fluorescently labeled
proteins. The automatic analysis of such images by now has a decade-long
history. Most of the work has been done in the supervised learning mode:
the researcher specifies a set of interesting location classes (corresponding
to the organelles of interest), finds a few examples of each, and trains a
classifier to recognise them in unlabeled data. Some work has shown usage
of
This work shows that direct and indirect data can be combined into a
single model and inferences can be made which depend on all of it. In
particular, the model can project multiple modalities into the
same space and return a label which is based on all its input data.
I will also propose new image representations for use with subcellular
location images. They are derived from Speeded-Up Robust Features (SURF),
but adapted to the setting where, in addition to the protein channel, a
reference channel (in the case under study, a DNA marker) is present. I will
use supervised classification as a validation problem and show that SURF
outperforms traditional approaches and that SURF augmented with DNA
information outperforms standard SURF.
140 pages