Computer Science Department
School of Computer Science, Carnegie Mellon University
Building Reliable Metaclassifiers for Text Learning
Paul N. Bennett
Appropriately combining information sources to form a more effective output than any of the individual sources is a broad topic that has been researched in many forms. It can be considered to encompass sensor fusion, distributed data-mining, regression combination, classifier combination, and even the basic classification problem. After all, the hypothesis a classifier emits is just a specification of how the information in the basic features should be combined. This dissertation addresses one subfield of this domain: leveraging locality when combining classifiers for text classification. Classifier combination is useful, in part, as an engineering aid that enables machine learning scientists to understand the differences among base classifiers in terms of their local reliability, dependence, and variance -- much as higher-level languages are an abstraction that improves upon assembly language without extending its computational power. Additionally, using this abstraction, we introduce a combination model that uses inductive transfer to extend the amount of labeled data that can be brought to bear when building a text classifier combination model.
We begin by discussing the role calibrated probabilities play when combining classifiers. After reviewing calibration, we present arguments and empirical evidence that the distribution of posterior probabilities from a classifier will give rise to asymmetry. Since the standard methods for recalibrating classifiers have an underlying assumption of symmetry, we present asymmetrical distributions that can be fit efficiently and produce recalibrated probabilities of higher quality than the symmetrical methods. The resulting improved probabilities can either be used directly for a single base classifier or used as part of a classifier combination model.
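To make the recalibration idea concrete, the following is a minimal sketch rather than the dissertation's actual procedure. It assumes one particular asymmetric family, the asymmetric Laplace, whose two decay rates have a closed-form maximum-likelihood estimate once the mode is fixed. Using the median as the mode, the toy scores, and all function names are assumptions made for illustration only. Class-conditional densities of the raw classifier score are fit separately for positive and negative examples and then inverted with Bayes' rule.

```python
import math
import statistics

def fit_asymmetric_laplace(scores):
    """Fit (mode, beta, gamma) to raw scores from one class.

    Assumption of this sketch: the mode is fixed at the sample median.
    Given the mode, the two decay rates have a closed-form ML estimate.
    """
    mode = statistics.median(scores)
    n = len(scores)
    eps = 1e-9  # guards against a side with zero spread
    d_left = max(sum(mode - s for s in scores if s <= mode), eps)
    d_right = max(sum(s - mode for s in scores if s > mode), eps)
    cross = math.sqrt(d_left * d_right)
    beta = n / (d_left + cross)    # decay rate to the left of the mode
    gamma = n / (d_right + cross)  # decay rate to the right of the mode
    return mode, beta, gamma

def log_pdf(score, mode, beta, gamma):
    """Log density of the asymmetric Laplace at `score`."""
    log_norm = math.log(beta) + math.log(gamma) - math.log(beta + gamma)
    if score <= mode:
        return log_norm - beta * (mode - score)
    return log_norm - gamma * (score - mode)

def recalibrate(score, pos_params, neg_params, prior_pos):
    """Posterior P(positive | score) via Bayes' rule on the fitted densities."""
    log_odds = (log_pdf(score, *pos_params) - log_pdf(score, *neg_params)
                + math.log(prior_pos / (1.0 - prior_pos)))
    log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical raw scores (e.g. SVM margins) for each class.
pos_scores = [0.5, 1.0, 1.2, 1.5, 3.0]
neg_scores = [-3.0, -1.5, -1.0, -0.8, -0.2]
pos_params = fit_asymmetric_laplace(pos_scores)
neg_params = fit_asymmetric_laplace(neg_scores)
p_high = recalibrate(2.0, pos_params, neg_params, prior_pos=0.5)
p_low = recalibrate(-2.0, pos_params, neg_params, prior_pos=0.5)
```

Because the left and right decay rates are estimated independently, the fitted density can fall off at different speeds on either side of its mode, which a symmetric Gaussian or Laplace fit cannot express.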
Reflecting on the lessons learned from the study of calibration, we go on to define local calibration, dependence, and variance and discuss the roles they play in classifier combination. Using these insights as motivation, we introduce a series of reliability-indicator variables which serve as an intuitive abstraction of the input domain to capture the local context related to a classifier's reliability.
We then introduce the main methodology of our work, STRIVE, which uses metaclassifiers and reliability indicators to produce improved classification performance. A key difference from standard metaclassification approaches is that reliability indicators enable the metaclassifier to weigh each classifier according to its local reliability in the neighborhood of the current prediction point. Furthermore, this approach empirically outperforms state-of-the-art metaclassification approaches that do not use locality. We then analyze the contributions of the various reliability indicators to the combination model and suggest promising features to consider when redesigning the base classifiers or new combination approaches. Additionally, we show how inductive transfer methods can be extended to increase the amount of labeled training data available for learning a combination model by collapsing data traditionally viewed as coming from different learning tasks.
Next, we briefly review online-learning classifier combination algorithms that have theoretical performance guarantees in the online setting and consider adaptations of these to the batch setting as alternative metaclassifiers. We then present empirical evidence that they are weaker in the offline setting than methods that employ standard classification algorithms as metaclassifiers, and we suggest future improvements likely to yield more competitive algorithms.
Finally, the combination approaches discussed are broadly applicable to classification problems other than topic classification, and we emphasize this with experiments that demonstrate that STRIVE improves the performance of action-item detectors in e-mail -- a task where both the semantics and base classifier performance are significantly different from those of topic classification.