Computer Science Department
School of Computer Science, Carnegie Mellon University


Minimizing Queries for Active Labeling with Sequential Analysis

Jack Paparian

October 2016

M.S. Thesis


Keywords: Cluster-based Active Learning, Sequential Probability Ratio Test, Strong and Weak Oracles, Binary Space-partitioning Trees

When building datasets for supervised machine learning problems, data is often labelled manually by human annotators. In domains like medical imaging, acquiring labels can be prohibitively expensive. Both active learning and crowdsourcing have emerged as ways to frugally label datasets. In active learning, there has been recent interest in algorithms that exploit the data's structure to direct querying. learning from crowds, one must balance the accuracy and cost of different oracles when gathering labels; weak oracles are assumed to be most accurate when labeling samples from label-homogeneous regions of space.

In this thesis, we explore how the data's structure can be leveraged for both of these techniques. The sequential probability ratio test (SPRT) provides the backbone for our contributions. Using the SPRT, we provide a cluster-based active learning algorithm to find a small, homogeneous partitioning of the data. We also use the SPRT to measure the confidence of a weak oracle's label by analyzing its estimates on neighboring labels. The optimality of the SPRT allows the algorithms to inherently minimize the average number of queries required before their termination.

56 pages

Thesis Committee:
Christopher James Langmead (Chair)
Carl Kingsford

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

Return to: SCS Technical Report Collection
School of Computer Science