CMU-CB-20-101
Ray and Stephanie Lane Computational Biology Department
School of Computer Science, Carnegie Mellon University




Detecting anomalies in inferred transcript sequences
and expression from RNA-seq

Cong Ma

September 2020

Ph.D. Thesis

Currently Unavailable


Keywords: Anomaly detection, RNA-seq, transcriptomic structural variants, expression quantification, unannotated isoforms

Anomalies are data points that do not follow established or expected patterns. When measuring gene expression, anomalies in RNA-seq are observations or patterns that cannot be explained by the inferred transcript sequences or expression. Transcript sequences and expression are key indicators of cell status and are used in many phenotypic and disease analyses. Identifying such unexplainable RNA-seq patterns can inspire improvements in the accuracy of transcript sequences and expression inferred from RNA-seq data, and benefit the analyses based on transcripts. We develop computational methods to identify the RNA-seq anomalies that violate inferred sequence variation and expression patterns, and to improve the reconstructed transcripts such that they can explain the anomalies.

The first type of anomaly that we detect is large-scale sequence variation in the transcriptome, or transcriptomic structural variants (TSVs). TSVs are usually induced by genomic structural variants, which can fuse sequences from a pair of genes or involve intergenic regions. Previous TSV detection methods assume that TSVs only fuse a pair of genes and do not consider that some genes are still unknown; as a result, many RNA-seq reads from intergenic or intronic regions cannot be explained by gene fusions. We develop a computational method, SQUID, to identify fusions both between a pair of genes and involving non-transcribing regions, thus enlarging the set of explained variants and RNA-seq reads. SQUID is further extended to the MULTIPLE COMPATIBLE ARRANGEMENTS PROBLEM, which allows TSVs to be detected in the context of allele heterogeneity.

The second type of anomaly that we identify is the coverage anomaly in estimated expression. The number of RNA-seq reads at each position along each transcript follows a distribution determined by the RNA-seq experiment protocol. We develop a method, Salmon Anomaly Detection (SAD), to identify the transcripts whose coverage distribution cannot be explained by the RNA-seq protocol. We observe that both quantification algorithm mistakes and incomplete reference transcripts cause abnormal coverage patterns. We also develop an adjustment procedure to correct the quantification algorithm mistakes indicated by coverage anomalies and improve the accuracy of estimated expression. Our analysis shows that some of the coverage anomalies are indicators of the regulation efficiency of transcription factors and can explain part of the variability of target gene expression.

The developed methods introduce novel dimensions to more completely explain RNA-seq data, and can be incorporated into RNA-seq analyses to better characterize phenotype-transcript relationships or used to evaluate future transcript reconstruction methods.

139 pages

Thesis Committee:
Carl Kingsford (Chair)
Russell Schwartz
Xinghua Lu (University of Pittsburgh)
Ben Raphael (Princeton University)

Russell S. Schwartz, Head, Computational Biology Department
Martial Hebert, Dean, School of Computer Science


