CMU-ML-19-109
Machine Learning Department
School of Computer Science, Carnegie Mellon University



Accelerating Text-as-Data Research in Computational Social Science

Dallas Card

May 2019

Ph.D. Thesis


Keywords: Machine learning, natural language processing, computational social science, graphical models, interpretability, calibration, conformal methods


Natural language corpora are phenomenally rich resources for learning about people and society, and have long been used as such in disciplines such as history and political science. Recent advances in machine learning and natural language processing are creating remarkable new possibilities for how scholars might analyze such corpora, but working with textual data brings its own challenges, and much of the research in computer science may not align with the desiderata of social scientists. In this thesis, I present a line of work on developing methods for computational social science, focused primarily on observational research using natural language text. Throughout, I take seriously the concerns and priorities of the social sciences, leading to a focus on aspects of machine learning that are otherwise sometimes secondary, including calibration, interpretability, and transparency. Two ideas that unify this work are the problems of exploration and measurement, and as a running example I consider the problem of analyzing how news sources frame contemporary political issues. Following the introduction, I devote one chapter to providing the necessary background on computational social science, framing, and the "text as data" paradigm. Subsequent chapters each focus on a particular model or method that strives to address some aspect of research that may be of particular interest to social scientists. Chapters 3 and 4 focus on the unsupervised setting, with the former presenting a model for learning archetypal character representations, and the latter presenting a framework for neural document models that can flexibly incorporate metadata. Chapters 5 and 6 focus on the supervised setting and present, respectively, a method for measuring label proportions in text in the presence of domain shift, and a variation on deep learning classifiers that produces more transparent and robust predictions. The final chapter concludes with implications for computational social science and possible directions for future work.

163 pages

Thesis Committee:
Noah A. Smith (Chair)
Artur Dubrawski
Geoff Gordon
Dan Jurafsky (Stanford University)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science

