CMU-CS-22-111
Computer Science Department
School of Computer Science, Carnegie Mellon University



The Language of Sketches

Xiaoyu Zhang

M.S. Thesis

May 2022


Keywords: Multimodal machine learning, creative AI, text-to-image synthesis, sketch generation, sketch data collection

Creative AI has seen much progress in recent years. Systems such as DALL-E 2 can generate inspiring art pieces from text descriptions. Instead of synthesizing realistic artworks from language, we approach creativity from a different angle and investigate the composition of semantic parts and visual concepts in sketches. For example, people can draw a circle to represent the moon, a scoop of ice cream, or the face of a cat. Similarly, language descriptors can be composed to create new concepts: people can draw a large round cat face or a narrow oval cat face.

To study this reuse of abstract concepts, we construct a dataset of language-annotated sketches. We examined current sketch datasets and found that they lack either language annotations or semantic part annotations. Therefore, we collected a dataset of 11,150 (sketch part, text) pairs for 572 face sketches and 787 angel sketches.

To understand the limits of current vision-language models, we fine-tuned CLIP, a model that is pretrained with a contrastive objective on 400 million (image, text) pairs and maps images and text into a joint vision-language embedding space. We observed that (1) CLIP cannot easily generalize to an unseen category on the task of pairing sketches with their descriptions, even though similar shapes and descriptions occurred in training; and (2) fine-tuning increased the average cosine distance between pairs of descriptors that annotators used to differentiate two sketches. With the insights gained about how language and sketches interact in the CLIP embedding space, we aim to facilitate research into models that generate sketches in a part-based manner, satisfying users' descriptions of the pictures they have in mind.
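As a rough illustration of the two measurements mentioned above, the following minimal sketch computes the cosine distance between embedding vectors, averages it over descriptor pairs, and pairs each sketch with its nearest description. It is not the thesis code: the function names are hypothetical, and the plain Python lists stand in for embeddings that a model such as CLIP would produce.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pair_distance(pairs):
    """Average cosine distance over (embedding_a, embedding_b) pairs.

    Each pair is assumed to hold the embeddings of two descriptors that
    an annotator used to tell two sketches apart.
    """
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def match_descriptions(sketch_embs, text_embs):
    """For each sketch embedding, return the index of the closest text embedding."""
    return [
        min(range(len(text_embs)), key=lambda j: cosine_distance(s, text_embs[j]))
        for s in sketch_embs
    ]


# Toy vectors only (hypothetical stand-ins, not real CLIP outputs).
toy_sketches = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[0.0, 2.0], [3.0, 0.0]]
matches = match_descriptions(toy_sketches, toy_texts)
```

Under this reading, comparing `mean_pair_distance` before and after fine-tuning would show whether contrasting descriptors moved apart in the embedding space, while `match_descriptions` corresponds to the sketch-description pairing task.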

60 pages

Thesis Committee:
Oliver Kroemer (Co-Chair)
Yonatan Bisk (Co-Chair)
Jean Oh

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

