CMU-ML-06-101
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Discovering Latent Patterns with Hierarchical
Bayesian Mixed-Membership Models

Edoardo M. Airoldi, Stephen E. Fienberg*,
Cyrille Joutard**, Tanzy M. Love**

May 2006



Keywords: Disability data, model specification, soft clustering, text analysis, survey data

There has been an explosive growth of data-mining models involving latent structure for clustering and classification. While these models have related objectives, they use different parameterizations and often very different specifications and constraints. Model choice is thus a major methodological issue and a crucial practical one for applications.

In this paper, we work from a general formulation of hierarchical Bayesian mixed-membership models in Erosheva [15] and Erosheva, Fienberg, and Lafferty [19] and present several model specifications and variations, both parametric and nonparametric, in the context of learning the number of latent groups and associated patterns for clustering units. Model choice is an issue within specifications, and becomes a component of the larger issue of model comparison.

We elucidate strategies for comparing models and specifications by producing novel analyses of two data sets: (1) a corpus of scientific publications from the Proceedings of the National Academy of Sciences (PNAS) examined earlier by Erosheva, Fienberg, and Lafferty [19] and Griffiths and Steyvers [22]; and (2) data on functionally disabled American seniors from the National Long Term Care Survey (NLTCS) examined earlier by Erosheva [15, 16, 17] and Erosheva and Fienberg [18].

Our specifications generalize those used in earlier studies. For example, we make use of both text and references to narrow the choice of the number of latent topics in our publications data, in both parametric and nonparametric settings. We then compare our analyses with the earlier ones, for both data sets, and we use them to illustrate some of the dangers associated with the practice of fixing the hyper-parameters in complex hierarchical Bayesian mixed-membership models to cut down the computational burden. Our findings also bring new insights regarding latent topics for the PNAS text corpus and disability profiles for the NLTCS data.

46 pages

*Also affiliated with the Department of Statistics, Carnegie Mellon University
**Department of Statistics, Carnegie Mellon University

