CMU-CB-21-102
Ray and Stephanie Lane Computational Biology Department
School of Computer Science, Carnegie Mellon University




Computational methods for "atlas-scale"
analysis of scRNA-seq data

Amir Alavi

June 2021

Ph.D. Thesis

CMU-CB-21-102.pdf

Keywords: scRNA-seq, Classification, Representation learning, Data integration, Batch-effect correction

Single-cell RNA-sequencing (scRNA-seq) has allowed a higher-resolution view into the transcriptional landscape of cells. The large amount of data collected as part of these experiments has brought with it new opportunities and challenges for computational analysis methods. Addressing these challenges is a critical step both for studies using scRNA-seq to model specific biological processes and systems and for recent large-scale efforts focused on cellular atlases of human tissues.

One of the first tasks researchers face when analyzing such data is identifying the cell types present in heterogeneous populations of cells. Supervised solutions to this problem require novel methods that can extract rich feature representations and account for technical effects across datasets. Once cells are assigned to types, several additional questions can be addressed. Of particular interest, especially for large-scale atlas efforts, is comparing cell types and tissues across a large set of samples to identify unique markers and marker combinations.

In this thesis, we present a set of computational methods that address each of these related issues. We first present a new computational approach (scQuery) for comparative analysis of scRNA-seq datasets that utilizes large publicly available scRNA-seq datasets of many cell types, instead of relying on marker gene information. We show that the supervised neural embedding models at the heart of this method can learn rich, compact embeddings that enable efficient comparisons between cells and can serve as an effective tool in exploratory scRNA-seq analysis. We next address the problem of batch effect correction. We propose two approaches: a supervised approach called scDGN and an unsupervised approach called SCIPR. In both cases, we show that our methods can accurately align cell type populations from different batches, and that our models utilize biologically relevant genes to apply their transformations. Finally, we extend the supervised classification-based approaches to identify marker genes for cell types across multiple tissues in recently collected HuBMAP consortium scRNA-seq data. Taken together, the methods we have developed in this thesis make significant strides toward atlas-scale analysis of the single-cell gene expression landscape.

182 pages

Thesis Committee:
Ziv Bar-Joseph (Chair)
Jian Ma
Maria Chikina (University of Pittsburgh)
Guo-Cheng Yuan (Icahn School of Medicine at Mount Sinai)

Russell Schwartz, Head, Computational Biology Department
Martial Hebert, Dean, School of Computer Science


