CMU-CS-08-164
Computer Science Department
School of Computer Science, Carnegie Mellon University
Techniques for Exploiting Unlabeled Data
Mugizi Robert Rwebangira
October 2008
Ph.D. Thesis
CMU-CS-08-164.pdf
Keywords: Semi-supervised, regression, unlabeled data, similarity
In many machine learning application domains, obtaining labeled data is
expensive, while obtaining unlabeled data is much cheaper. For this reason,
there has been growing interest in algorithms that are able to take
advantage of unlabeled data. In this thesis we develop several methods for
taking advantage of unlabeled data in classification and regression tasks.
Specific contributions include:
- A method for improving the performance of the graph mincut algorithm of
Blum and Chawla [12] by taking randomized mincuts. We give theoretical
motivation for this approach and we present empirical results showing that
randomized mincut tends to outperform the original graph mincut algorithm,
especially when the number of labeled examples is very small.
- An algorithm for semi-supervised regression based on manifold
regularization using local linear estimators. This is the first extension
of local linear regression to the semi-supervised setting. In this thesis
we present experimental results on both synthetic and real data and show
that this method tends to perform better than methods that use only
the labeled data.
- An investigation of practical techniques for using the Winnow
algorithm (which is not directly kernelizable) together with kernel
functions and general similarity functions via unlabeled data. We expect
such techniques to be particularly useful when we have a large feature
space as well as additional similarity measures that we would like to
use together with the original features. This method is also suited to
situations where the best-performing measure of similarity does not
satisfy the properties of a kernel. We present some experiments on
real and synthetic data to support this approach.
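
As a rough illustration of the third contribution, the sketch below shows one way a similarity function could be combined with the original features through unlabeled data: each example is augmented with its similarity to a handful of unlabeled points, and a Winnow-style learner is trained on the augmented vector. This is a minimal sketch under stated assumptions, not the procedure from the thesis. The names `rbf_similarity`, `similarity_features`, `train_balanced_winnow`, and `predict` are hypothetical, the RBF similarity is only a stand-in for any pairwise similarity measure, the multiplicative update is a Balanced Winnow-style variant chosen for brevity, and features are assumed to lie in [0, 1].

```python
import math


def rbf_similarity(a, b, gamma=1.0):
    """Stand-in similarity (an assumption); any pairwise measure could be used."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def similarity_features(x, landmarks, sim=rbf_similarity):
    """Original features, then similarity to each unlabeled landmark, then a bias."""
    return list(x) + [sim(x, u) for u in landmarks] + [1.0]


def train_balanced_winnow(examples, labels, eta=0.5, epochs=20):
    """Balanced Winnow-style learner with labels in {-1, +1}.

    Keeps two positive weight vectors and updates them multiplicatively,
    only when the current prediction is wrong.
    """
    n = len(examples[0])
    w_pos = [1.0] * n
    w_neg = [1.0] * n
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            if predict((w_pos, w_neg), x) != y:
                for i, v in enumerate(x):
                    w_pos[i] *= math.exp(eta * y * v)
                    w_neg[i] *= math.exp(-eta * y * v)
    return w_pos, w_neg


def predict(weights, x):
    """Sign of the difference between the two weighted sums."""
    w_pos, w_neg = weights
    score = sum((p - q) * v for p, q, v in zip(w_pos, w_neg, x))
    return 1 if score > 0 else -1


# Toy data (made up for illustration): two unlabeled landmarks, four labeled points.
unlabeled = [(0.1, 0.1), (0.9, 0.9)]
labeled = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.8, 0.9)]
labels = [-1, -1, 1, 1]

train = [similarity_features(x, unlabeled) for x in labeled]
weights = train_balanced_winnow(train, labels)
```

Because the similarity values enter only as extra feature columns, the similarity function does not need to be a valid kernel, which matches the situation the abstract describes. New points are classified with `predict(weights, similarity_features(x, unlabeled))`.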
114 pages