CMU-CS-97-127
Computer Science Department
School of Computer Science, Carnegie Mellon University
An Evaluation of Statistical Approaches to Text Categorization
Yiming Yang
April 1997
CMU-CS-97-127.ps
Keywords: Text categories, statistical learning, comparative study
This paper is a comparative study of text categorization methods.
Fourteen methods are investigated, based on previously published
results and newly obtained results from additional experiments. Corpus
biases in commonly used document collections are examined using the
performance of three classifiers. Problems in previously published
experiments are analyzed, and the results of flawed experiments are
excluded from the cross-method evaluation. As a result, eleven of the
fourteen methods remain. A k-nearest neighbor (kNN)
classifier was chosen as the performance baseline on several
collections; on each collection, the performance scores of other
methods were normalized using the score of kNN. This provides a
common basis for a global observation on methods whose results are
only available on individual collections. Widrow-Hoff, k-nearest
neighbor, neural networks and the Linear Least Squares Fit mapping are
the top-performing classifiers, while the Rocchio approaches had
relatively poor results compared to the other learning methods. kNN
is the only learning method that has scaled to the full domain of
MEDLINE categories, behaving gracefully as the target space grows
from roughly one hundred categories to tens of thousands.
12 pages
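
The baseline normalization described in the abstract can be sketched as
follows. This is only an illustration of the idea, not the procedure used
in the report: the function names, the collection and method labels, and
all score values are hypothetical placeholders, and the averaging step at
the end is an assumed way of summarizing the relative scores.

```python
from collections import defaultdict


def normalize_by_baseline(scores, baseline="kNN"):
    """Divide every method's score by the baseline's score, per collection.

    scores: {collection: {method: score}}
    Collections that lack a baseline score are skipped, because no
    relative score can be computed for them.
    """
    normalized = {}
    for collection, by_method in scores.items():
        if baseline not in by_method:
            continue
        base = by_method[baseline]
        if base == 0:
            raise ValueError(f"baseline score is zero on {collection!r}")
        normalized[collection] = {
            method: score / base for method, score in by_method.items()
        }
    return normalized


def mean_relative_score(normalized):
    """Average each method's relative score over the collections it appears in."""
    per_method = defaultdict(list)
    for by_method in normalized.values():
        for method, ratio in by_method.items():
            per_method[method].append(ratio)
    return {m: sum(r) / len(r) for m, r in per_method.items()}


# Hypothetical placeholder scores, NOT results from the report.
example_scores = {
    "collection_A": {"kNN": 0.80, "method_X": 0.40},
    "collection_B": {"kNN": 0.50, "method_Y": 0.60},
}

relative = normalize_by_baseline(example_scores)
print(mean_relative_score(relative))
```

In this sketch, method_X and method_Y were never run on the same
collection, yet both end up expressed relative to kNN, which is what makes
a cross-collection comparison possible.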