Institute for Software Research
School of Computer Science, Carnegie Mellon University
Uncovering and Managing the Impact of Methodological Choices for
This thesis is motivated by the need for scalable and reliable methods and technologies that support the construction of network data based on information from text data. Ultimately, the resulting data can be used for answering substantive and graph-theoretical questions about sociotechnical networks.
One main limitation with constructing network data from text data is that the validation of the resulting network data can be hard to infeasible, e.g. in the cases of covert, historical and largescale networks. This thesis addresses this problem by identifying the impact of coding choices that must be made when extracting network data from text data on the structure of networks and network analysis results. My findings suggest that conducting reference resolution on text data can alter the identity and weight of 76% of the nodes and 23% of the links, and can cause major changes in the value of commonly used network metrics. Also, performing reference resolution prior to relation extraction leads to the retrieval of completely different sets of key entities in comparison to not applying this pre-processing technique. Based on the outcome of the presented experiments, I recommend strategies for avoiding or mitigating the identified issues in practical applications.
When extracting socio-technical networks from texts, the set of relevant node classes might go beyond the classes that are typically supported by tools for named entity extraction. I address this lack of technology by developing an entity extractor that combines an ontology for sociotechnical networks that originates from the social sciences, is theoretically grounded and has been empirically validated in prior work, with a supervised machine learning technique that is based on probabilistic graphical models. This thesis does not stop at showing that the resulting prediction models achieve state of the art accuracy rates, but I also describe the process of integrating these models into an existing and publically available end-user product. As a result, users can apply these models to new text data in a convenient fashion.
While a plethora of methods for building network data from information explicitly or implicitly contained in text data exists, there is a lack of research on how the resulting networks compare with respect to their structure and properties. This also applies to networks that can be extracted by using the aforementioned entity extractor as part of the relation extraction process. I address this knowledge gap by comparing the networks extracted by using this process to network data built with three alternative methods: text coding based on thesauri that associate text terms with node classes, the construction of network data from meta-data on texts, such as key words and index terms, and building network data in collaboration with subject matter experts. The outcomes of these comparative analyses suggest that thesauri generated with the entity extractor iv developed for this thesis need adjustments with respect to particular categories and types of errors. I am providing tools and strategies to assist with these refinements. My results also show that once these changes have been made and in contrast to manually constructed thesauri, the prediction models generalize with acceptable accuracy to other domains (news wire data, scientific writing, emails) and writing styles (formal, casual). The comparisons of networks constructed with different methods show that ground truth data built by subject matter experts are hardly resembled by any automated method that analyzes text bodies, and even less so by exploiting existing meta-data from text corpora. Thus, aiming to reconstruct social networks from text data leads to largely incomplete networks. Synthesizing the findings from this work, I outline which types of information on socio-technical networks are best captured by what network data construction method, and how to best combine these methods in order to gain a more comprehensive view on a network.
When both, text data and relational data, are available as a source of information on a network, people have previously integrated these data by enhancing social networks with content nodes that represent salient terms from the text data. I present a methodological advancement to this technique and test its performance on the datasets used for the previously mentioned evaluation studies. By using this approach, multiple types of behavioral data, namely interactions between people as well as their language use, can be taken into account. I conclude that extracting content nodes from groups of structurally equivalent agents can be an appropriate strategy for enabling the comparison of the content that people produce, perceive or disseminate. These equivalence classes can represent a variety of social roles and social positions that network members occupy. At the same time, extracting content nodes from groups of structurally coherent agents can be suitable for enabling the enhancement of social networks with content nodes. The results from applying the latter approach to text data include a comparison of the outcome of topic modeling; an efficient and unsupervised information extraction technique, to the outcomes of alternative methods, including entity extraction based on supervised machine learning. My findings suggest that key entities from meta-data knowledge networks might serve as proper labels for unlabeled topics. Also, unsupervised and supervised learning leads to the retrieval of similar entities as highly likely members of highly likely topics, and key nodes from text-based knowledge networks, respectively.
In summary, the contributions made with this thesis help people to collect, manage and analyze rich network data at any scale. This is a precondition for asking substantive and graph-theoretical questions, testing hypotheses, and advancing theories about networks. This thesis uses an interdisciplinary and computationally rigorous approach to work towards this goal; thereby advancing the intersection of network analysis, natural language processing and computing.