Center for Automated Learning and Discovery
School of Computer Science, Carnegie Mellon University


Automatic Discovery of Latent Variable Models

Ricardo Silva

August 2005

Ph.D. Thesis


Keywords: Graphical models, causality, latent variables

Much of our understanding of Nature comes from theories about unobservable entities. Identifying which hidden variables exist given measurements in the observable world is therefore an important step in the process of discovery. Such an enterprise is only possible if the existence of latent factors constrains how the observable world can behave. We do not speak of atoms, genes and antibodies because we see them, but because they indirectly explain observable phenomena in a unique way under generally accepted assumptions.

The goal of this thesis is to formalize the process of discovering latent variables and the models associated with them. Beyond finding a probabilistic model that fits the data well, we describe how, in some situations, we can identify causal features common to all models that explain the data equally well. Such common features describe causal relations among observed and hidden variables. Although this goal might seem ambitious, it is a natural extension of several years of work on discovering causal models from observational data through the use of graphical models. Learning causal relations without experiments amounts to discovering an unobservable fact (does A cause B?) from observable measurements (the joint distribution of a set of variables that includes A and B). We take this idea one step further by discovering which hidden variables exist to begin with.

More specifically, we describe algorithms for learning causal latent variable models when observed variables are noisy linear measurements of unobservable entities, without postulating a priori which latents might exist. Most of the thesis concerns how to identify latents by describing which observed variables are their respective measurements. In some situations, we will also assume that latents are linearly dependent, and in this case causal relations among latents can be partially identified. While continuous variables are the main focus of the thesis, we also describe how to adapt this idea to the case where observed variables are ordinal or binary. Finally, we examine density estimation, where knowing causal relations or the true model behind a data generating process is not necessary. However, we illustrate how ideas developed in causal discovery can help the design of algorithms for multivariate density estimation.
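As a rough illustration of how a latent variable can constrain the observable world under the noisy linear measurement setting described above, the following sketch simulates four observed variables and checks a classical covariance constraint. It is not taken from the thesis. The constraint used here (vanishing tetrad differences, which hold when four variables are linear measurements of a single latent with independent noise) is an assumption chosen for the example, as are the names `simulate`, `covariance`, `tetrad_differences`, the loadings, the noise level, and the use of independent standard normal latents.

```python
import random


def simulate(n, measurement, n_latents, noise_sd=0.5, seed=0):
    """Draw n samples in which each observed variable is a noisy linear
    measurement of one latent. `measurement` is a list of
    (latent_index, loading) pairs, one per observed variable.
    Latents are independent standard normals (an assumption of this sketch)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        latents = [rng.gauss(0.0, 1.0) for _ in range(n_latents)]
        rows.append([w * latents[k] + rng.gauss(0.0, noise_sd)
                     for k, w in measurement])
    return rows


def covariance(rows, i, j):
    """Sample covariance between observed variables i and j."""
    n = len(rows)
    mean_i = sum(r[i] for r in rows) / n
    mean_j = sum(r[j] for r in rows) / n
    return sum((r[i] - mean_i) * (r[j] - mean_j) for r in rows) / (n - 1)


def tetrad_differences(rows):
    """Two tetrad differences among the first four observed variables."""
    def s(i, j):
        return covariance(rows, i, j)
    return (s(0, 1) * s(2, 3) - s(0, 2) * s(1, 3),
            s(0, 1) * s(2, 3) - s(0, 3) * s(1, 2))


# Hypothetical measurement structures: (latent index, loading) per variable.
ONE_LATENT = [(0, 1.0), (0, 0.8), (0, 1.2), (0, 0.6)]
TWO_LATENTS = [(0, 1.0), (0, 0.8), (1, 1.2), (1, 0.6)]

one_latent_diffs = tetrad_differences(simulate(20000, ONE_LATENT, 1))
two_latent_diffs = tetrad_differences(simulate(20000, TWO_LATENTS, 2))

print("one latent :", one_latent_diffs)   # both differences close to zero
print("two latents:", two_latent_diffs)   # both differences clearly nonzero
```

With a single latent behind all four variables, both differences should be close to zero up to sampling error. When the variables split into two groups measuring different latents, they should not be. Constraints of this kind are one way the observed covariance matrix can carry information about which variables measure the same hidden factor.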

195 pages
