CMU-CS-20-139
Computer Science Department
School of Computer Science, Carnegie Mellon University




Sample-Specific Models for Precision Medicine

Benjamin Lengerich

Ph.D. Thesis

December 2020

CMU-CS-20-139.pdf


Keywords: Personalized Machine Learning, Sample-Specific Models, Precision Medicine

Modern applications of artificial intelligence are often characterized by training large machine learning (ML) models on large datasets. These datasets are composed of overlapping groups of samples, either explicitly (e.g., the large dataset is created by combining multiple datasets) or implicitly (e.g., the samples belong to latent sub-populations). Population models prefer weakly predictive global patterns over highly predictive localized effects. This is a problem because localized effects are critical to understanding complex processes, as in computational biology (in which samples come from latent cell types) and precision medicine (in which patients come from latent disease subtypes).

In this thesis, we propose that: The performance of intelligent computer systems can be improved by treating different samples as different tasks. This is especially helpful in domains such as computational biology and precision medicine, in which we care about understanding the highly specific context of each sample.

We propose to solve this problem by estimating a collection of many small models. For large collections, each model is responsible for only a small number of samples, enabling simultaneous interpretability and accuracy. As we show in this thesis, this framework can be scaled to estimate different model parameters for every sample.

This thesis begins by studying the challenges of characterizing real-world data with population-level models. Next, we develop the methodology of Personalized Regression. Finally, we apply sample-specific inference to computational biology and precision medicine by: (1) Identifying Discriminative Subtypes of Cancers from Histopathology Images and (2) Cell-Specific Transcriptomic Regulatory Network Inference.

103 pages

Thesis Committee:
Eric P. Xing (Chair)
Zico Kolter
Ziv Bar-Joseph
Manolis Kellis (Massachusetts Institute of Technology)
Rich Caruana (Microsoft Research)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

