Computational Biology Department
School of Computer Science, Carnegie Mellon University


Modeling the Past, Present, and Future of Influenza

David Farrow

July 2016

Ph.D. Thesis


Keywords: Influenza, viral evolution, phylodynamics, epidemiological forecasting, epidemiological nowcasting, sensor fusion

Influenza has been, and continues to be, a significant source of disease burden worldwide. Regular epidemics and sporadic pandemics are incredibly costly to society, not just in terms of the monetary expense of prevention and treatment, but also in terms of reduced productivity, increased absenteeism, and excessive morbidity and mortality. Major obstacles to mitigating these costs include an incomplete understanding of influenza's phylodynamics, the inherent delays of clinical surveillance and reporting, and a lack of outbreak forewarning.

The aim of this thesis is to address each of these obstacles computationally by (1) simulating transmission and evolution of influenza to explore the interplay between human immunity and viral evolution; (2) collecting and integrating a diverse set of real-time digital surveillance signals to track influenza activity; and (3) generating season-wide forecasts of influenza epidemics using an ensemble of statistical models, simulations, and human judgment.

The first part explores the concept of generalized immunity, which was previously hypothesized to be highly protective but short-lasting. Large-scale, long-term simulations based on an extension of an earlier model were used to scan the immunity parameter space; they indicate that the most plausible definition of generalized immunity is less protective but potentially much longer-lasting than previously assumed.

The second part describes how sensor fusion and tracking can be applied to the nowcasting problem. Drawing from control theory, weather forecasting, and econometrics, an optimal filtering methodology is developed to integrate a set of proxies for influenza activity that share one common property: they are available online and in real time. In other respects they differ: they are available at different temporal intervals, geographic resolutions, and historical periods, and they are noisy and potentially correlated. The resulting nowcasts are robust to the failure of individual proxies and are available up to several weeks before traditional surveillance reports.

The third part combines earlier results with novel methodologies to produce probabilistic forecasts of influenza spread and intensity that are timely, accurate, and actionable. In particular, an empirical Bayes method and spline regression are used to produce forecasts that rely only on the availability of historical data and are readily generalizable to other infectious diseases, and a wisdom-of-crowds approach is used to incorporate human judgment into the forecasting process.
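
As a rough illustration of the sensor fusion idea in the second part, the sketch below combines several noisy proxy readings for one week into a single estimate by inverse-variance weighting, skipping any proxy that is unavailable. It is not the thesis's filtering methodology: the function name `fuse`, the variance values, and the assumption that proxy errors are independent are all hypothetical simplifications (the abstract notes that the real proxies may be correlated).

```python
def fuse(readings, variances):
    """Inverse-variance weighted average of the available proxy readings.

    readings:  list of float or None (None means the proxy failed this week)
    variances: assumed noise variance of each proxy (hypothetical values)
    Returns (estimate, variance_of_estimate).
    Assumes independent proxy errors, which is a simplification.
    """
    weighted_sum = 0.0
    precision = 0.0
    for value, var in zip(readings, variances):
        if value is None:
            continue  # a missing proxy simply contributes no information
        weighted_sum += value / var
        precision += 1.0 / var
    if precision == 0.0:
        raise ValueError("no proxy available")
    return weighted_sum / precision, 1.0 / precision
```

With two equally noisy proxies the estimate is their plain average; when one proxy drops out, the estimate falls back on the remaining ones and its variance grows, which is one simple sense in which a fused estimate can tolerate the failure of an individual proxy.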

143 pages

Thesis Committee:
Roni Rosenfeld (Chair)
Ryan Tibshirani
Carl Kingsford
John Grefenstette (University of Pittsburgh)
Elodie Ghedin (New York University)

Robert F. Murphy, Head, Computational Biology Department
Andrew W. Moore, Dean, School of Computer Science
