CMU-CS-98-120
Computer Science Department
School of Computer Science, Carnegie Mellon University
Using EM to Classify Text from Labeled and Unlabeled Documents
Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell
May 1998
Keywords: Learned text classifiers, algorithms for learning from
labeled and unlabeled text
This paper shows that the accuracy of learned text classifiers can be
improved by augmenting a small number of labeled training documents
with a large pool of unlabeled documents. This is significant because
in many important text classification problems obtaining
classification labels is expensive, while large quantities of
unlabeled documents are readily available. We present a theoretical
argument showing that, under common assumptions, unlabeled data
contain information about the target function. We then introduce an
algorithm for learning from labeled and unlabeled text, based on the
combination of Expectation-Maximization with a naive Bayes classifier.
The algorithm first trains a classifier using the available labeled
documents, and probabilistically labels the unlabeled documents. It
then trains a new classifier using the labels for all the documents,
and iterates. Experimental results, obtained using text from three
different real-world tasks, show that the use of unlabeled data
reduces classification error by up to 30%.
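The iterative procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes multinomial naive Bayes over word-count vectors, Laplace smoothing, and a fixed number of EM iterations; all function names and the toy data shapes are hypothetical.

```python
import numpy as np

def train_nb(X, P, alpha=1.0):
    """M-step: fit naive Bayes from word counts X (docs x words)
    and class responsibilities P (docs x classes), with Laplace smoothing."""
    priors = P.sum(axis=0) + alpha
    priors /= priors.sum()
    word_counts = P.T @ X + alpha                        # (classes x words)
    word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
    return np.log(priors), np.log(word_probs)

def e_step(X, log_priors, log_word_probs):
    """E-step: posterior class probabilities P(c|d) for each document."""
    log_post = log_priors + X @ log_word_probs.T
    log_post -= log_post.max(axis=1, keepdims=True)      # for numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Train on labeled docs, probabilistically label unlabeled docs,
    retrain on all docs, and iterate."""
    P_lab = np.eye(n_classes)[y_lab]                     # hard labels, held fixed
    log_priors, log_wp = train_nb(X_lab, P_lab)          # initial classifier
    for _ in range(n_iter):
        P_unl = e_step(X_unl, log_priors, log_wp)        # probabilistic labels
        X_all = np.vstack([X_lab, X_unl])
        P_all = np.vstack([P_lab, P_unl])
        log_priors, log_wp = train_nb(X_all, P_all)      # retrain on everything
    return log_priors, log_wp
```

On toy count data with two well-separated vocabularies, the unlabeled documents sharpen the word-probability estimates made from the single labeled document per class.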
20 pages