CMU-ML-15-105
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Discovering Compact and Informative Structures through Data Partitioning

Madalina Fiterau

February 2015

Ph.D. Thesis

CMU-ML-15-105.pdf


Keywords: Informative projection recovery, cost-based feature selection, ensemble methods, data partitioning, active learning, clinical data analysis, artifact adjudication, nuclear threat detection


In many practical scenarios, prediction for high-dimensional observations can be performed accurately using only a fraction of the available features. However, the set of relevant predictive features, known as the sparsity pattern, can vary across the data: features that are informative for one subset of observations might be useless for the rest. In such cases, the dataset can be seen as an aggregation of samples belonging to several low-dimensional sub-models, potentially due to different generative processes. My thesis introduces several techniques for identifying sparse predictive structures and the areas of the feature space where these structures are effective. This information allows the training of models that perform better than those obtained through traditional feature selection.

We formalize Informative Projection Recovery, the problem of extracting a set of low-dimensional projections of data which jointly form an accurate solution to a given learning task. Our solution to this problem is a regression-based algorithm that identifies informative projections by optimizing over a matrix of point-wise loss estimators. It generalizes to a number of machine learning problems, offering solutions to classification, clustering and regression tasks. Experiments show that our method can discover and leverage low-dimensional structure, yielding accurate and compact models. Our method is particularly useful in applications involving multivariate numeric data in which expert assessment of the results is essential.

We also developed an active learning framework which uses the obtained compact models to find unlabeled data deemed worth expert evaluation. For this purpose, we enhance standard active selection criteria using the information encapsulated by the trained model. The advantage of our approach is that the labeling effort is expended mainly on samples which benefit models from the hypothesis class we are considering. In addition, the domain experts benefit from the availability of informative axis-aligned projections at the time of labeling. Experiments show that this results in an improved learning rate over standard selection criteria, both for synthetic data and real-world data from the clinical domain, while the comprehensible view of the data supports the labeling process and helps preempt labeling errors.
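
The idea of choosing projections from a matrix of point-wise losses can be illustrated with a small sketch. It is not the regression-based algorithm of the thesis: it assumes a leave-one-out k-nearest-neighbor disagreement rate as the point-wise loss estimator and swaps in a simple greedy selection for the optimization. The function names (make_toy_data, pointwise_loss_matrix, greedy_select) and the synthetic data are hypothetical and exist only for this illustration.

```python
import itertools

import numpy as np


def make_toy_data(n_per_group=300, d=6, seed=0):
    """Toy data (an assumption of this sketch): two sub-models, each living
    in its own 2-D axis-aligned projection, with an XOR-like label rule."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    X = rng.uniform(-3.0, 3.0, size=(n, d))
    group = np.repeat([0, 1], n_per_group)
    # Group 0 is structured in features (0, 1), group 1 in features (2, 3).
    X[group == 0, 0:2] = rng.uniform(-1.0, 1.0, size=(n_per_group, 2))
    X[group == 1, 2:4] = rng.uniform(-1.0, 1.0, size=(n_per_group, 2))
    y = np.where(group == 0,
                 X[:, 0] * X[:, 1] > 0,
                 X[:, 2] * X[:, 3] > 0).astype(int)
    return X, y, group


def pointwise_loss_matrix(X, y, k=15):
    """losses[i, j] = fraction of the k nearest neighbours of point i, in
    2-D projection j, whose label differs from y[i] (leave-one-out)."""
    n, d = X.shape
    projections = list(itertools.combinations(range(d), 2))
    losses = np.empty((n, len(projections)))
    for j, dims in enumerate(projections):
        Z = X[:, list(dims)]
        dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
        nn = np.argsort(dist, axis=1)[:, :k]
        losses[:, j] = 1.0 - (y[nn] == y[:, None]).mean(axis=1)
    return losses, projections


def greedy_select(losses, max_projections=2):
    """Greedy stand-in for the optimization: repeatedly add the projection
    that most reduces the total loss when every point uses its best one."""
    best = np.ones(losses.shape[0])  # worst-case loss before any choice
    chosen = []
    for _ in range(max_projections):
        totals = np.minimum(best[:, None], losses).sum(axis=0)
        j = int(np.argmin(totals))
        if j in chosen:  # no remaining projection improves the total
            break
        chosen.append(j)
        best = np.minimum(best, losses[:, j])
    # Index (into `chosen`) of the lowest-loss projection for each point.
    assignment = np.argmin(losses[:, chosen], axis=1)
    return chosen, assignment, float(best.mean())


if __name__ == "__main__":
    X, y, group = make_toy_data()
    losses, projections = pointwise_loss_matrix(X, y)
    chosen, assignment, mean_loss = greedy_select(losses)
    print("selected projections:", [projections[j] for j in chosen])
    print("mean point-wise loss:", round(mean_loss, 3))
```

In this sketch the assignment of points to projections uses the training labels, so it only describes the training data. Applying such a model to new observations would additionally need a rule that picks a projection without access to the label.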

121 pages

Thesis Committee:
Artur Dubrawski (Chair)
Geoff Gordon
Alex Smola
Andreas Krause (ETH Zürich)

Tom M. Mitchell, Head, Machine Learning Department
Andrew W. Moore, Dean, School of Computer Science

