MACHINE LEARNING TECHNICAL REPORT ABSTRACTS

CMU-ML-19-109
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Accelerating Text-as-Data Research in Computational Social Science
Dallas Card
May 2019
Ph.D. Thesis
CMU-ML-19-109.pdf

Keywords: Machine learning, natural language processing, computational social science, graphical models, interpretability, calibration, conformal methods

Natural language corpora are phenomenally rich resources for learning about people and society, and have long been used as such by various disciplines such as history and political science. Recent advances in machine learning and natural language processing are creating remarkable new possibilities for how scholars might analyze such corpora, but working with textual data brings its own unique challenges, and much of the research in computer science may not align with the desiderata of social scientists. In this thesis, I present a line of work on developing methods for computational social science focused primarily on observational research using natural language text. Throughout, I take seriously the concerns and priorities of the social sciences, leading to a focus on aspects of machine learning which are otherwise sometimes secondary, including calibration, interpretability, and transparency. Two ideas which unify this work are the problems of exploration and measurement, and as a running example I consider the problem of analyzing how news sources frame contemporary political issues. Following the introduction, I devote one chapter to providing the necessary background on computational social science, framing, and the "text as data" paradigm. Subsequent chapters each focus on a particular model or method that strives to address some aspect of research which may be of particular interest to social scientists.
Chapters 3 and 4 focus on the unsupervised setting, with the former presenting a model for learning archetypal character representations, and the latter presenting a framework for neural document models which can flexibly incorporate metadata. Chapters 5 and 6 focus on the supervised setting and present, respectively, a method for measuring label proportions in text in the presence of domain shift, and a variation on deep learning classifiers which produces more transparent and robust predictions. The final chapter concludes with implications for computational social science and possible directions for future work.

163 pages

Thesis Committee:
Noah A. Smith (Chair)
Artur Dubrawski
Geoff Gordon
Dan Jurafsky (Stanford University)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science

SCS Technical Report Collection School of Computer Science