Computer Science Department
School of Computer Science, Carnegie Mellon University
Relation Extraction using Distant Supervision,
Malcolm W. Greaves
We are drowning in information and having difficulty finding knowledge: useful and actionable information. Recent studies estimate that humanity has stored in excess of 295 exabytes (295 × 10^18 bytes) of data. Much data is stored in the form of unstructured text, such as news articles, message boards and forums, texts, emails, status updates, tweets, and nearly a billion webpages.
In this thesis, we present an approach to extracting the knowledge contained in large amounts of unstructured text. We define our problem as one of relation extraction: given a document, extract all instantiations of well-defined binary relations present in the text. To this end, we use distant supervision and a novel probabilistic first-order logic system combined with co-reference resolution to identify candidate relation instances. These candidates are then classified by a series of cost-augmented, soft-margin, binary Support Vector Machines to produce the final relation extractions. Evaluation on a corpus of 5.7 million newswire articles over 27 different relations yields an across-relation, microaveraged F1 of 42.02%. A smaller, targeted search, consisting of 10 thousand documents, achieves an F1 of 33.15%.
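To make the idea of distant supervision concrete, the following is a minimal sketch of the general labeling heuristic, not the system developed in this thesis. It assumes a small knowledge base of known relation triples and marks a sentence as a candidate instance of a relation whenever the sentence mentions both arguments of a known triple. The names `Triple`, `distant_labels`, `KB`, and `SENTENCES`, along with the example relation `located_in`, are hypothetical and chosen only for illustration. The sketch uses plain substring matching and omits the probabilistic first-order logic and co-reference resolution steps described above.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A known fact: relation(subject, obj). Hypothetical illustration type."""
    subject: str
    relation: str
    obj: str


def distant_labels(
    sentences: List[str], knowledge_base: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Return (sentence, subject, relation, obj) candidate instances.

    A sentence becomes a candidate for a relation whenever it mentions
    both arguments of a known triple. The labels are noisy by design:
    co-occurrence does not guarantee the sentence expresses the relation.
    """
    candidates = []
    for sentence in sentences:
        lowered = sentence.lower()
        for triple in knowledge_base:
            # Case-insensitive substring match on both arguments.
            if triple.subject.lower() in lowered and triple.obj.lower() in lowered:
                candidates.append(
                    (sentence, triple.subject, triple.relation, triple.obj)
                )
    return candidates


# Toy data, assumed for illustration only.
KB = [Triple("Carnegie Mellon University", "located_in", "Pittsburgh")]
SENTENCES = [
    "Carnegie Mellon University is a research university in Pittsburgh.",
    "Pittsburgh hosted the conference last year.",
]
CANDIDATES = distant_labels(SENTENCES, KB)
```

In this toy example only the first sentence mentions both arguments, so it is the only candidate produced. Candidates of this kind are noisy, which is why a downstream classifier, such as the binary Support Vector Machines mentioned above, is needed to decide which ones are true relation instances.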