CALD TECHNICAL REPORT ABSTRACTS

CMU-CALD-04-106
Center for Automated Learning and Discovery
School of Computer Science, Carnegie Mellon University

Statistical Models for Frequent Terms in Text
Edoardo M. Airoldi, William W. Cohen, Stephen E. Fienberg
May 2004

CMU-CALD-04-106.pdf

Keywords: Bayesian models, multinomial, binomial, Poisson, negative-binomial

In this paper we present statistical models for text which treat words with higher frequencies of occurrence in a sensible manner, and perform better than widely used models based on the multinomial distribution on a wide range of classification tasks, with two or more classes. Our models are based on the Poisson and Negative-Binomial distributions, which keep desirable properties of simplicity and analytic tractability.

12 pages
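
As a rough illustration of why a Negative-Binomial model can suit frequent terms better than a Poisson model, the sketch below compares the log-likelihood of the two distributions on a toy vector of per-document counts for one word. It is not the method of the report: the counts are invented for illustration, the function names (poisson_logpmf, negbin_logpmf) are hypothetical, and the parameters are fitted by a simple method of moments rather than by the Bayesian models the abstract refers to.

```python
import math

def poisson_logpmf(k, rate):
    """Log-probability of count k under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def negbin_logpmf(k, mean, size):
    """Log-probability of count k under a Negative-Binomial.

    Parameterized by its mean and a positive 'size' (dispersion) parameter,
    so that variance = mean + mean**2 / size.
    """
    p = size / (size + mean)  # success probability in the usual (size, p) form
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1.0 - p))

# Toy data (an assumption, not from the report): occurrences of one word
# in ten documents. Most documents have none, a few have many.
counts = [0, 0, 0, 1, 0, 12, 0, 0, 7, 0]

n = len(counts)
mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / n

# Method-of-moments estimate of the size parameter; it only exists when the
# counts are overdispersed (variance greater than the mean).
if var <= mean:
    raise ValueError("counts are not overdispersed; use the Poisson model")
size = mean ** 2 / (var - mean)

poisson_ll = sum(poisson_logpmf(c, mean) for c in counts)
negbin_ll = sum(negbin_logpmf(c, mean, size) for c in counts)

print("Poisson log-likelihood:          ", poisson_ll)
print("Negative-Binomial log-likelihood:", negbin_ll)
```

On bursty counts like these, the Negative-Binomial assigns far more probability to the occasional large count than the Poisson does, because its variance can exceed its mean. Both log-probabilities remain closed-form expressions, which is consistent with the tractability the abstract mentions.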
