CMU-CALD-04-106
Center for Automated Learning and Discovery
School of Computer Science, Carnegie Mellon University




Statistical Models for Frequent Terms in Text

Edoardo M. Airoldi, William W. Cohen, Stephen E. Fienberg

May 2004

CMU-CALD-04-106.pdf


Keywords: Bayesian models, multinomial, binomial, Poisson, negative-binomial

In this paper we present statistical models for text that treat frequently occurring words in a sensible manner and that outperform widely used models based on the multinomial distribution on a wide range of classification tasks with two or more classes. Our models are based on the Poisson and negative-binomial distributions, which retain the desirable properties of simplicity and analytic tractability.

12 pages

