Computer Science Department
School of Computer Science, Carnegie Mellon University


Relation Extraction using Distant Supervision,
SVMs, and Probabilistic First Order Logic

Malcolm W. Greaves

May 2014

M.S. Thesis


Keywords: Information Extraction, Machine Learning, Natural Language Processing, Probabilistic First-Order Logic, Relation Extraction, Big Data, Large Scale Machine Learning, Support Vector Machines, Cost-Sensitive Learning

We are drowning in information and having difficulty finding knowledge: useful and actionable information. Recent studies estimate that humanity has stored in excess of 295 exabytes (295 × 10^18 bytes) of data. Much data is stored in the form of unstructured text, such as news articles, message boards and forums, texts, emails, status updates, tweets, and nearly a billion webpages.

In this thesis, we present a solution to extracting knowledge present in untold amounts of unstructured text. We define our problem as one of relation extraction: given a document, extract all instantiations of well-defined binary relations present in the text. To this end, we use distant supervision and a novel probabilistic first-order logic system combined with co-reference resolution to identify candidate relation instances. These candidates are then classified by a series of cost-augmented, soft-margin, binary Support Vector Machines to produce the final relation extractions. On a corpus of 5.7 million newswire articles and 27 different relations, the system achieves an across-relation, microaveraged F1 of 42.02%. On a smaller, targeted search consisting of 10 thousand documents, it achieves an F1 of 33.15%.
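
For readers unfamiliar with the metric, the following is a minimal sketch of how a microaveraged F1 across relations is typically computed: true positives, false positives, and false negatives are pooled over all relations before precision and recall are taken. The function name, relation names, and counts are hypothetical illustrations and are not taken from the thesis.

```python
from typing import Dict, Tuple

# Per-relation counts: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def micro_f1(per_relation: Dict[str, Counts]) -> float:
    """Pool counts over all relations, then compute F1 on the pooled totals."""
    tp = sum(c[0] for c in per_relation.values())
    fp = sum(c[1] for c in per_relation.values())
    fn = sum(c[2] for c in per_relation.values())
    # F1 = 2PR / (P + R), which simplifies to 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical relations and counts, for illustration only.
    example = {"bornIn": (8, 2, 4), "worksFor": (2, 3, 1)}
    print(f"micro-averaged F1: {micro_f1(example):.4f}")
```

Because the counts are pooled, relations with many instances weigh more heavily in this score than rare ones, in contrast to a macroaverage, which averages the per-relation F1 values equally.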

62 pages
