Human-Computer Interaction Institute
School of Computer Science, Carnegie Mellon University


A Multi-Modal Intelligent Agent that Learns from
Demonstrations and Natural Language Instructions

Toby Jia-Jun Li

May 2021

Ph.D. Thesis


Keywords: End user development, end user programming, interactive task learning, programming by demonstration, programming by example, multi-modal interaction, verbal instruction, natural language programming, task automation, intelligent agent, instructable agent, conversational assistant, human-AI interaction, human-AI collaboration

Intelligent agents that can perform tasks on behalf of users have become increasingly popular with the growing ubiquity of "smart" devices such as phones, wearables, and smart home devices. They allow users to automate common tasks and to perform tasks in contexts where the direct manipulation of traditional graphical user interfaces (GUIs) is infeasible or inconvenient. However, the capabilities of such agents are limited by their available skills (i.e., the procedural knowledge of how to do something) and conceptual knowledge (i.e., what a concept means). Most current agents (e.g., Siri, Google Assistant, Alexa) either have fixed sets of capabilities or provide mechanisms that allow only skilled third-party developers to extend agent capabilities. As a result, they fall short in supporting "long-tail" tasks and suffer from a lack of customizability and flexibility.

To address this problem, my collaborators and I designed SUGILITE, a new intelligent agent that allows end users to teach new tasks and concepts in a natural way. SUGILITE uses a multi-modal approach that combines programming by demonstration (PBD) and learning from natural language instructions to support end-user development for intelligent agents. The lab usability evaluation results showed that the prototype of SUGILITE allowed users with little or no programming expertise to successfully teach the agent common smartphone tasks such as ordering coffee, booking restaurants, and checking sports scores, as well as the appropriate conditionals for triggering these actions and the task-relevant concepts. My dissertation presents a new human-AI interaction paradigm for interactive task learning, where the existing third-party app GUIs are used as a medium for users to communicate their intents to an AI agent in addition to being the interface for interacting with the underlying computing services.

Through the development of the integrated SUGILITE system over the past five years, this dissertation presents seven main technical contributions: (i) a new approach to allow the agent to generalize from learned task procedures by inferring task parameters and their associated possible values from verbal instructions and mobile app GUIs, (ii) a new method to address the data description problem in PBD by allowing users to verbally explain ambiguous or vague demonstrated actions, (iii) a new multi-modal interface to enable users to teach the conceptual knowledge used in conditionals to the agent, (iv) a new mechanism to extend mobile app based PBD to smart home and Internet of Things (IoT) automation, (v) a new multi-modal interface that helps users discover, identify the causes of, and recover from conversational breakdowns using existing mobile app GUIs for grounding, (vi) a new privacy-preserving approach that can identify and obfuscate the potential personal information in GUI-based PBD scripts based on the uniqueness of information entries with respect to the corresponding app GUI context, and (vii) a new self-supervised technique for generating semantic representations of GUI screens and components in embedding vectors without requiring manual annotation.

221 pages

Thesis Committee:
Brad A. Myers (Chair)
Tom M. Mitchell
Jeffrey P. Bigham
John Zimmerman
Philip J. Guo (University of California San Diego)

Jodi Forlizzi, Head, Human-Computer Interaction Institute
Martial Hebert, Dean, School of Computer Science
