|   | CMU-CS-08-164 Computer Science Department
 School of Computer Science, Carnegie Mellon University
 
    
     
 CMU-CS-08-164
 
Techniques for Exploiting Unlabeled Data 
Mugizi Robert Rwebangira 
October 2008 
Ph.D. Thesis 
CMU-CS-08-164.pdf 
Keywords: Semi-supervised, regression, unlabeled data, similarity
 
 
In many machine learning application domains obtaining labeled data is 
expensive but obtaining unlabeled data is much cheaper. For this reason 
there has been
growing interest in algorithms that are able to take advantage of 
unlabeled data. In this thesis we develop several methods for taking 
advantage of unlabeled data in classification and regression tasks. 
Specific contributions include:
 
A method for improving the performance of the graph mincut algorithm of
Blum and Chawla [12] by taking randomized mincuts. We give theoretical
motivation for this approach and we present empirical results showing that
randomized mincut tends to outperform the original graph mincut algorithm,
especially when the number of labeled examples is very small.
An algorithm for semi-supervised regression based on manifold 
regularization using local linear estimators. This is the first extension 
of local  linear regression to the semi-supervised setting. In this thesis 
we present experimental results on both synthetic and real data and show 
that this method tends to perform better than methods which only utilize 
the labeled data.
An investigation of practical techniques for using the Winnow 
algorithm (which is not directly kernelizable) together with kernel 
functions and general similarity functions via unlabeled data. We expect 
such techniques to be particularly useful when we have a large feature 
space as well as additional similarity measures that we would like to
use together with the original features. This method is also suited to 
situations where the best performing measure of similarity does
not satisfy the properties of a kernel. We present some experiments on 
real and synthetic data to support this approach.
 
114 pages 
 
 |