CMU-ML-09-109
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Exploiting Domain and Task Regularities
for Robust Named Entity Recognition

Andrew O. Arnold

August 2009

Ph.D. Thesis



Keywords: Machine learning, named entity extraction, transfer learning

It is often convenient to make certain assumptions during the learning process. Unfortunately, algorithms built on these assumptions can break down if the assumptions do not hold for both the training and test data. Relatedly, we can do better at various tasks (like named entity recognition) by exploiting the richer relationships found in real-world complex systems. By exploiting these kinds of non-conventional regularities we can more easily address problems that were previously unapproachable, like transfer learning. In the transfer learning setting, the distribution of data is allowed to vary between the training and test domains; that is, the independent and identically distributed (i.i.d.) assumption linking training and test examples is severed. Without this link between the training and test data, traditional learning is difficult.

In this thesis we explore learning techniques that can still succeed even in situations where i.i.d. and other common assumptions are allowed to fail. Specifically, we seek out and exploit regularities in the problems we encounter, and document which specific assumptions we can drop, and under what circumstances, while still completing our learning task. We further investigate different methods for dropping, or relaxing, some of these restrictive assumptions so that we may bring more resources (from unlabeled auxiliary data to known dependencies and other regularities) to bear on the problem. This produces better answers to existing problems and lets us begin addressing problems that were previously unanswerable, such as those in the transfer learning setting.

In particular, we introduce four techniques for producing robust named entity recognizers, and demonstrate their performance on the problem domain of protein name extraction in biological publications:

  • Feature hierarchies relate distinct, though related, features to one another via a natural linguistically-inspired hierarchy.
  • Structural frequency features exploit a regularity based on the structure of the data itself and the distribution of instances across that structure.
  • Snippets link data not by the distribution of the instances or their features, but by their labels. Thus data that have different attributes, but similar labels, will be joined together, while instances that have similar features, but distinct labels, are segregated to allow for variation between domains.
  • Graph relations represent the entities contained in the data and their relationships to each other as a network which is exploited to help discover robust regularities across domains.
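
To make the first of these ideas concrete, the following is a minimal sketch of how a feature hierarchy might let knowledge transfer between domains. It is an illustration only, not the method developed in the thesis: the dotted feature names, the weight table, and the functions ancestors and backoff_weight are all hypothetical, and the simple back-off rule stands in for whatever mechanism the thesis actually uses to tie related features together.

```python
from typing import Dict, List


def ancestors(feature: str, sep: str = ".") -> List[str]:
    """Return the feature and each of its ancestors, most specific first.

    Assumes (hypothetically) that a feature's place in the hierarchy is
    encoded in its name, e.g. "LeftToken.IsWord.IsTitle.equals.mr".
    """
    parts = feature.split(sep)
    return [sep.join(parts[:i]) for i in range(len(parts), 0, -1)]


def backoff_weight(feature: str, weights: Dict[str, float], sep: str = ".") -> float:
    """Use the weight of the most specific known node on the path to the root.

    A feature never seen in the source domain can still borrow evidence
    from a more general ancestor that was seen there.
    """
    for node in ancestors(feature, sep):
        if node in weights:
            return weights[node]
    return 0.0  # nothing on the path is known


# Hypothetical weights learned on a source domain.
source_weights = {
    "LeftToken.IsWord.IsTitle.equals.mr": 0.9,
    "LeftToken.IsWord.IsTitle": 0.4,
}

# A target-domain feature unseen in the source domain backs off to its
# nearest known ancestor, "LeftToken.IsWord.IsTitle".
unseen = "LeftToken.IsWord.IsTitle.equals.dr"
print(backoff_weight(unseen, source_weights))
```

The design choice illustrated here is that related features share a common prefix, so a feature specific to one domain can fall back on a more general feature that is stable across domains.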

Thus we show that learned classifiers and extractors can be made more robust to shifts between the train and test data by using data (both labeled and unlabeled) from related domains and tasks, and by exploiting stable regularities and complex relationships between different aspects of that data.

167 pages

