Machine Learning Department
School of Computer Science, Carnegie Mellon University
Coupled Semi-Supervised Learning
Chapter 3 focuses on coupling the semi-supervised learning of information extractors that extract information from free text using textual extraction patterns (e.g., "mayor of X" and "Y star quarterback X"). We present an approach in which the input to the learner is an ontology defining a set of target categories and relations to be learned, a handful of seed examples for each, and a set of constraints that couple the various categories and relations (e.g., Person and Sport are mutually exclusive). We show that given this input and millions of unlabeled documents, a semi-supervised learning procedure can, by avoiding violations of the constraints in how its learned extractors label unlabeled data, achieve very significant accuracy improvements over semi-supervised methods that do not avoid such violations.
In Chapter 4, we apply the ideas from Chapter 3 to a different type of extraction method, wrapper induction for semi-structured web pages. We also consider how to couple multiple extraction methods that typically make independent errors. To couple pattern-based extraction and wrapperbased extraction, we use a strategy that only promotes instances extracted by both methods. Experimental results on dozens of categories and relations demonstrate that coupling wrapper induction improves the precision of the promoted facts, and that coupling multiple extraction methods leads to higher precision than either of the methods alone.
In Chapter 5, we consider two questions: (1) Can we scale up the number and variety of predicates in our ontology and still maintain high precision with coupled semi-supervised learning methods? and (2) Should we consider adding additional extraction methods beyond textual patterns and wrappers? We first describe a general architecture that can exploit many different extraction methods. We then describe a prototype implementation of our architecture, called Multi-Extractor Coupler (MEC). With an extended ontology of 123 categories and 55 relations, MEC has learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74%.
Chapter 6 considers how to couple the semi-supervised learning of logistic regression models. Specifically, we consider learning many binary logistic regression classifiers when many pairs of classes are known to be mutually exclusive. We present a method that uses unlabeled data through a penalty function that regularizes the training of classifiers by penalizing violations of mutual exclusion. We apply this idea to training classifiers which decide if a noun phrase is a member of some specific category. Semi-supervised training of such classifiers is shown to improve performance relative to supervised-only training. We speculate that use of similar penalty functions could provide an alternative to the methods for coupled semi-supervised learning presented in previous chapters, with the advantage that the models being learned are principled, probablistic models that are easy to train and can be applied to any example to provide a prediction of posterior probabilities.
||SCS Technical Report Collection
School of Computer Science homepage
This page maintained by firstname.lastname@example.org