CMU-CS-01-126
Computer Science Department
School of Computer Science, Carnegie Mellon University

Using Unlabeled Data to Improve Text Classification
Kamal Paul Nigam
May 2001
Ph.D. Thesis
CMU-CS-01-126.ps
Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.

138 pages
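As an illustration of the basic approach the abstract refers to, the following is a minimal sketch of EM over labeled and unlabeled documents with a generative classifier. It assumes a multinomial naive Bayes model with Laplace smoothing, which this abstract does not itself specify; the function names (`train_nb`, `posteriors`, `semi_supervised_em`, `predict`) and the parameters `alpha` and `n_iter` are hypothetical and not taken from the thesis. The model is initialized from the labeled documents alone, which is the step the abstract identifies as sensitive when labeled data are scarce.

```python
import numpy as np

def train_nb(X, R, alpha=1.0):
    """M-step: fit multinomial naive Bayes (an assumed model) from
    soft class responsibilities R of shape (n_docs, n_classes)."""
    class_counts = R.sum(axis=0)  # expected number of documents per class
    log_prior = np.log((class_counts + alpha) /
                       (class_counts.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha  # smoothed expected word counts per class
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posteriors(X, log_prior, log_word):
    """E-step: P(class | document) under the current parameters."""
    log_joint = X @ log_word.T + log_prior
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(log_joint)
    return P / P.sum(axis=1, keepdims=True)

def semi_supervised_em(X_lab, y_lab, X_unl, n_classes, n_iter=20):
    """Initialize from labeled data only, then alternate E- and M-steps
    over labeled plus unlabeled documents. Labeled documents keep their
    given labels; only unlabeled documents are re-estimated."""
    R_lab = np.eye(n_classes)[y_lab]  # one-hot labels, held fixed
    log_prior, log_word = train_nb(X_lab, R_lab)
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = posteriors(X_unl, log_prior, log_word)
        log_prior, log_word = train_nb(X_all, np.vstack([R_lab, R_unl]))
    return log_prior, log_word

def predict(X, log_prior, log_word):
    """Assign each document to its most probable class."""
    return posteriors(X, log_prior, log_word).argmax(axis=1)
```

Because EM only climbs to a local maximum of the model probability, the result depends on the labeled-data initialization in `semi_supervised_em`. This sketch does not implement the remedies the abstract describes (sub-topic or hierarchical class structure, active learning, or alternatives to EM).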