CMU-CS-15-124
Computer Science Department
School of Computer Science, Carnegie Mellon University



Efficient Learning of Sparse Gaussian Mixture Models
of Protein Conformational Substates

Ji Oh Yoo

July 2015

M.S. Thesis


Keywords: Conformational substates, Molecular Dynamics simulation, Gaussian Mixture Model, Nonparanormal Mixture Model, coreset approximation

Molecular Dynamics (MD) simulations are an important technique for studying the conformational dynamics of proteins in Computational Structural Biology. Traditional methods for analyzing MD simulations assume a single conformational state underlying the data. With recent developments in MD simulation technologies, simulations can now produce massive, long time-scale trajectories that span multiple conformational substates, so new efficient methods are needed to analyze these trajectories and to learn the structural dynamics of the substates.

In this thesis, we develop new methods to learn parametric and semi-parametric sparse generative models from the positional fluctuations of amino acid residues in the simulation. Specifically, our methods learn a mixture of sparse Gaussian or nonparanormal distributions. Each mixing component encodes the statistics of a different substate. L1 regularization is used to produce sparse graphical models that are easier to interpret than a simple covariance analysis, because the topology of the graphical model reveals the coupling structure between different parts of the molecule. Our methods also employ coreset sampling to enhance scalability.
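
As a rough illustration of the kind of objective described above, an L1-regularized Gaussian mixture is commonly written as a penalized log-likelihood. The notation here is an assumption made for illustration and is not taken from the thesis: x_i are the n observed conformations, K is the number of substates, \pi_k are mixing weights, \mu_k are component means, \Theta_k are component precision (inverse covariance) matrices, and \lambda is the regularization strength. The thesis's exact formulation may differ, for example by penalizing only off-diagonal entries.

```latex
% Sketch of a penalized mixture objective (illustrative notation, not the thesis's own)
\max_{\{\pi_k,\,\mu_k,\,\Theta_k\}_{k=1}^{K}}
  \;\sum_{i=1}^{n} \log \sum_{k=1}^{K}
    \pi_k \,\mathcal{N}\!\left(x_i \mid \mu_k,\, \Theta_k^{-1}\right)
  \;-\; \lambda \sum_{k=1}^{K} \lVert \Theta_k \rVert_1,
\qquad \Theta_k \succ 0,\quad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1 .
```

In a Gaussian graphical model, a zero entry (\Theta_k)_{ab} means that coordinates a and b are conditionally independent given the rest within component k, which is why the sparsity pattern of each \Theta_k can be read as a coupling network for that substate.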

We demonstrate that our methods produce models that have a number of advantages over traditional Gaussian Mixture Models (GMM). Experiments on synthetic data show substantial improvements over GMMs on the recovery of the true network structure, while remaining competitive in terms of test likelihood and imputation error. Experiments on a large real MD data set are consistent with the results on synthetic data. We also demonstrate the benefits of using semi-parametric models in terms of likelihood and imputation metrics.

65 pages

Thesis Committee:
Christopher James Langmead (Chair)
Wei Wu

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science


