CMU-ISR-18-105
Institute for Software Research
School of Computer Science, Carnegie Mellon University



Bias and beyond in digital trace data

Momin M. Malik

August 2018

Ph.D. Thesis
Societal Computing


Keywords: Computational social science; bias; generalizability; validity; digital trace data; measurement; machine learning; social media; social network analysis; mobile phone sensors; STS; data science; critical technical practice; critical social science; critical algorithm studies; critical data studies

Large-scale digital trace data from sources such as social media platforms, emails, purchase records, browsing behavior, and sensors in mobile phones are increasingly used for business decision-making, scientific research, and even public policy. However, these data do not give an unbiased picture of underlying phenomena. In this thesis, I demonstrate some of the ways in which large-scale digital trace data, despite their richness, have biases in who is represented, what sorts of actions are represented, and what sorts of behaviors are captured. I present three critiques, demonstrating respectively that geotagged tweets exhibit heavy geographic and demographic biases; that social media platforms’ attempts to guide user behavior are successful and have implications for the behavior we think we observe; and that mobile phone sensors such as Bluetooth and WiFi measure proximity and co-location but not necessarily interaction, as has been claimed.

In response to these biases, I suggest shifting the scope of research done with digital trace data away from attempts at large-sample statistical generalizability and towards studies that situate knowledge in the contexts in which the data were collected. Specifically, I present two studies demonstrating alternatives that complement the critiques. In the first, I work with public health researchers to use Twitter as a means of public outreach and intervention. In the second, I design a study using mobile phone sensors in which I use sensor data and survey data to measure proximity and sociometric choice, respectively, and model the relationship between the two.

185 pages

Thesis Committee:
Jürgen Pfeffer (Co-Chair)
Anind K. Dey (Co-Chair, HCII)
Cosma Rohilla Shalizi (Statistics & Data Science)
David Lazer (Northeastern University)

William L. Scherlis, Director, Institute for Software Research
Andrew W. Moore, Dean, School of Computer Science

