CMU-CS-23-145 Computer Science Department School of Computer Science, Carnegie Mellon University
Yudong LiuIhita Mandal M.S. Thesis December 2023
Efficiently searching and dereplicating known entities from raw databases of biological extracts has been one of the major difficulties in natural product (NP) discovery. Because high-throughput mass spectrometry (MS) is widely used to build NP databases, there is a pressing demand for an efficient infrastructure that can organize community-wide MS libraries into solid datasets that allow cross-referencing between different MS spectra of the same molecules. While the throughput of mass spectrometers and the size of publicly available metabolomics data are growing rapidly, illuminating the molecules present in untargeted mass spectrometry data remains a challenging task. In the past decade, molecular networking and MASST were introduced to organize and query untargeted mass spectrometry data. While useful for single datasets, these methods cannot scale to searching and clustering the billions of mass spectra in metabolomics repositories, e.g. the Global Natural Products Social Molecular Networking (GNPS) infrastructure. To address this shortcoming, we developed an efficient strategy for computing dot products between mass spectra, in which the relevant information from spectral datasets is stored in an indexing table. Based on this strategy, we designed MASST+ and Networking+, scalable approaches for querying and clustering mass spectra that can process datasets up to three orders of magnitude larger than the state of the art. Our method enables querying against 717 million spectra from the GNPS public data in less than an hour and mapping the chemical diversity of all GNPS public data in days.
52 pages
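The abstract describes computing dot products between mass spectra from an indexing table but gives no implementation details. The following is a minimal sketch of that general idea, not the thesis's actual method: it assumes spectra are lists of (m/z, intensity) peaks, that peaks match when they fall in the same fixed-width m/z bin, and that intensities are scaled to unit length. The names `BIN_WIDTH`, `normalize`, `build_index`, and `query` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical bin width in m/z units; the thesis's tolerance is not stated here.
BIN_WIDTH = 0.01


def normalize(peaks):
    """Scale peak intensities so the spectrum has unit Euclidean norm."""
    norm = math.sqrt(sum(intensity * intensity for _, intensity in peaks))
    if norm == 0:
        return []
    return [(mz, intensity / norm) for mz, intensity in peaks]


def build_index(spectra):
    """Map each m/z bin to the (spectrum id, intensity) pairs that have a peak there."""
    index = defaultdict(list)
    for spectrum_id, peaks in spectra.items():
        for mz, intensity in normalize(peaks):
            index[round(mz / BIN_WIDTH)].append((spectrum_id, intensity))
    return index


def query(index, peaks):
    """Accumulate dot products against every indexed spectrum sharing a bin with the query."""
    scores = defaultdict(float)
    for mz, intensity in normalize(peaks):
        for spectrum_id, ref_intensity in index.get(round(mz / BIN_WIDTH), ()):
            scores[spectrum_id] += intensity * ref_intensity
    return dict(scores)


if __name__ == "__main__":
    library = {
        "a": [(100.0, 3.0), (200.0, 4.0)],
        "b": [(300.0, 1.0)],
    }
    index = build_index(library)
    print(query(index, [(100.0, 3.0), (200.0, 4.0)]))
```

In this sketch the query only visits index entries for bins that contain a query peak, so spectra with no shared peaks are never touched. Exact binning misses peaks that lie on opposite sides of a bin boundary; a tolerance-based match would need to check neighboring bins as well.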
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department