Computer Science Department
School of Computer Science, Carnegie Mellon University


Using Unlabeled Data to Improve Text Classification

Kamal Paul Nigam

May 2001

Ph.D. Thesis

Keywords: Text classification, text categorization, unlabeled data, Expectation-Maximization, unsupervised learning

One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data---labeled and unlabeled. These generative models do not capture all the intricacies of text; however on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse.

Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with pa representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.

138 pages

Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by