Computer Science Department
School of Computer Science, Carnegie Mellon University


Dataset Curation through Renders and Ontology Matching

Yair Movshovitz-Attias

September 2015

Ph.D. Thesis


Keywords: Computer Vision, Viewpoint Estimation, Fine Pose Estimation, Fine-Grained Classification, Deep Learning, Correlation Filters, Rendering, Synthetic Data

In this thesis we demonstrate the benefits of automated labeled dataset creation for fine-grained visual learning tasks. Specifically, we show that using real-world, non-image information can significantly reduce the human effort needed to build large-scale datasets.

Computer vision has seen great advances in recent years in a number of complex tasks, such as scene classification, object detection, and image segmentation. A key ingredient in such success stories is the use of large amounts of labeled data. In many cases, the limiting factor is the ability to create these training sets. Issues arise in three forms: (1) the act of labeling the data can be hard for human annotators, (2) in some cases it is hard to get a representative sample of the feature space, and (3) data for infrequent (yet potentially important) instances can be completely absent from the training set.

Business storefront classification is an example of (1). The number of possible labels is large, and assigning all relevant labels to an image is a time-consuming task for annotators. Moreover, when the image shows a business in a country other than their own, annotators can be confused by the foreign language and produce erroneous labels. Annotators are also inconsistent in how they assign businesses to categories.

In vehicle viewpoint estimation, the images themselves are hard to come by. Getting sample images of all viewpoints is hard due to bias in the way people photograph cars. Current datasets for this task lack data for many viewpoints. In addition, the labeling task is hard for the annotators.

We address these issues by adding automation to the dataset creation process. Our approach is to utilize external information by matching the images to real-world concepts. In the case of businesses, when images are mapped to an ontology of geographical entities, we are able to extract multiple relevant labels per image. For the viewpoint estimation problem, by using 3D CAD models we can render images at the desired viewpoint resolution and assign precise labels to them. We provide a systematic examination of the rendering process, and conclude that render quality is key for training accurate models.

114 pages

Thesis Committee:
Yaser Sheikh (Co-Chair)
Takeo Kanade (Co-Chair)
Abhinav Gupta
Leonid Sigal (Disney Research)
Trevor Darrell (University of California, Berkeley)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science
