CMU-CS-23-145
Computer Science Department
School of Computer Science, Carnegie Mellon University




Yudong Liu, Ihita Mandal

M.S. Thesis

December 2023

CMU-CS-23-145.pdf


Keywords: Mass spectrometry, computational biology, bioinformatics, data mining, clustering

Efficiently searching for and dereplicating known entities in raw databases of biological extracts has been one of the major difficulties in natural product discovery. Due to the wide use of high-throughput mass spectrometry (MS) techniques for building natural product (NP) databases, there is a pressing demand for an efficient infrastructure capable of organizing community-wide MS libraries into solid datasets that allow cross-referencing between different MS spectra of the same molecules. While the throughput of mass spectrometers and the size of publicly available metabolomics data are growing rapidly, illuminating the molecules present in untargeted mass spectrometry data remains a challenging task. In the past decade, molecular networking and MASST were introduced to organize and query untargeted mass spectrometry data. While useful for single datasets, these methods cannot scale to searching and clustering billions of mass spectra in metabolomics repositories, e.g. the Global Natural Products Social Molecular Networking (GNPS) infrastructure. To address this shortcoming, we developed an efficient strategy for computing the dot product between mass spectra, in which the relevant information from spectral datasets is stored in an indexing table. Based on this strategy, we designed MASST+ and Networking+, scalable approaches for querying and clustering mass spectra that can process datasets up to three orders of magnitude larger than the state of the art. Our method enables querying against 717 million spectra from the GNPS public data in less than an hour and mapping the chemical diversity of all GNPS public data in days.

52 pages

Thesis Committee:
Hosein Mohimani (Chair)
Carl Kingsford Wu

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

