Machine Learning Department
School of Computer Science, Carnegie Mellon University
Exploting Domain and Task Regularities
Andrew O. Arnold
It is often convenient to make certain assumptions during the learning process. Unfortunately, algorithms built on these assumptions can often break down if the assumptions are not stable between train and test data. Relatedly, we can do better at various tasks (like named entity recognition) by exploiting the richer relationships found in real-world complex systems. By exploiting these kinds of non-conventional regularities we can more easily address problems previously unapproachable, like transfer learning. In the transfer learning setting, the distribution of data is allowed to vary between the training and test domains, that is, the independent and identically distributed (i.i.d.) assumption linking train and test examples is severed. Without this link between the train and test data,traditional learning is difficult.
In this thesis we explore learning techniques that can still succeed even in situations where i.i.d. and other common assumptions are allowed to fail. Specifically, we seek out and exploit regularities in the problems we encounter and document which specific assumptions we can drop and under what circumstances and still be able to complete our learning task. We further investigate different methods for dropping, or relaxing, some of these restrictive assumptions so that we may bring more resources (from unlabeled auxiliary data, to known dependencies and other regularities) to bear on the problem, thus producing both better answers to existing problems, and even being able to begin addressing problems previously unanswerable, such as those in the transfer learning setting.
In particular, we introduce four techniques for producing robust named entity recognizers, and demonstrate their performance on the problem domain of protein name extraction in biological publications:
Thus we show that learned classifiers and extractors can be made more robust to shifts between the train and test data by using data (both labeled and unlabeled) from related domains and tasks, and by exploiting stable regularities and complex relationships between different aspects of that data.
||SCS Technical Report Collection
School of Computer Science homepage
This page maintained by firstname.lastname@example.org