CMU-CS-15-110
Computer Science Department
School of Computer Science, Carnegie Mellon University




Learning to Understand Natural Language
with Less Human Effort

Jayant Krishnamurthy

May 2015

Ph.D. Thesis

CMU-CS-15-110.pdf


Keywords: Semantic parsing, natural language understanding, distant supervision, grounded language understanding

Learning to understand the meaning of natural language is an important problem within natural language processing that has the potential to revolutionize how humans interact with computer systems. Informally, the problem is to map natural language text to a formal semantic representation connected to the real world. This problem has applications such as information extraction and understanding robot commands, and may also be helpful for other natural language processing tasks.
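For concreteness, one such mapping (an illustrative example with hypothetical predicates and entities, not one drawn from the thesis) takes a question to a logical form whose evaluation against a knowledge base returns real-world referents:

    "What cities are in Texas?"  ->  λx. city(x) ∧ locatedIn(x, TEXAS)  ->  {AUSTIN, DALLAS, ...}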

Human annotation is a significant bottleneck in constructing language understanding systems. These systems have two components, both of which are built using human annotation: a semantic parser and a knowledge base. Semantic parsers are typically trained on individually annotated sentences, and knowledge bases are typically constructed manually and given to the system. While these annotations can be provided in simple settings, specifically when the knowledge base is small, the annotation burden quickly becomes unbearable as the knowledge base grows: more annotated sentences are required to train the semantic parser, and the knowledge base itself requires more annotations. Alternative methods for building language understanding systems with less human annotation are therefore necessary to learn to understand natural language in these more challenging settings.
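The sketch below illustrates this two-component architecture (a minimal illustrative sketch with hypothetical predicates and a toy hand-written parser, not code from the thesis): the knowledge base stores predicate instances, and the semantic parser, which in a real system is a learned model trained on annotated sentences, maps a sentence to a logical form that is evaluated against the knowledge base.

    # Minimal illustrative sketch (hypothetical predicates; not code from
    # the thesis): a language understanding system pairs a semantic parser
    # with a knowledge base of predicate instances.

    # Knowledge base: predicate instances, typically built by human annotation.
    KB = {
        "city": {"AUSTIN", "DALLAS", "PITTSBURGH"},
        "locatedIn": {("AUSTIN", "TEXAS"), ("DALLAS", "TEXAS"),
                      ("PITTSBURGH", "PENNSYLVANIA")},
    }

    def parse(sentence):
        """Stand-in semantic parser mapping a sentence to a logical form.
        A real parser is a model learned from individually annotated sentences."""
        if sentence == "what cities are in Texas?":
            # Logical form: lambda x. city(x) AND locatedIn(x, TEXAS)
            return lambda x: x in KB["city"] and (x, "TEXAS") in KB["locatedIn"]
        raise ValueError("sentence not covered by this toy parser")

    def evaluate(logical_form):
        """Evaluate a logical form against the knowledge base to find referents."""
        entities = KB["city"] | {e for pair in KB["locatedIn"] for e in pair}
        return {x for x in entities if logical_form(x)}

    print(evaluate(parse("what cities are in Texas?")))  # AUSTIN and DALLAS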

This thesis explores alternative supervision assumptions for building language understanding systems, with the goal of reducing the annotation burden described above. I focus on two applications: information extraction and understanding language in physical environments. In the information extraction application, I present algorithms for training semantic parsers using only predicate instances from a knowledge base and an unlabeled text corpus; these algorithms eliminate the need for annotated sentences to train the semantic parser. I also present a new approach to semantic parsing that probabilistically learns a knowledge base from entity-linked text, reducing the amount of human annotation necessary to construct the knowledge base. Understanding language in physical environments breaks the assumptions of the approaches above, in that the learning agent must perceive its environment to produce a knowledge base. For this setting, I present a model that learns to map text to its real-world referents and that can be trained using annotated referents for entire texts, without requiring annotations of parse structure or of the referents of individual words.
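The following sketch conveys the intuition behind the distant supervision signal used in place of annotated sentences (a simplified, hypothetical illustration, not the thesis algorithm): a sentence that mentions both arguments of a knowledge base predicate instance is treated as a noisy positive training example for that predicate, so training data can be derived from the knowledge base and unlabeled text alone.

    # Simplified illustration of distant supervision (hypothetical data; not
    # the thesis algorithm): knowledge base predicate instances are matched
    # against an unlabeled corpus to produce noisy training examples,
    # replacing per-sentence human annotation.

    KB_INSTANCES = {
        ("cityOf", "AUSTIN", "TEXAS"),
        ("cityOf", "PITTSBURGH", "PENNSYLVANIA"),
    }

    CORPUS = [
        "Austin is the capital of Texas .",
        "Pittsburgh lies in western Pennsylvania .",
        "Texas borders four states .",
    ]

    def weak_labels(corpus, instances):
        """Pair each sentence with every predicate instance whose arguments it mentions."""
        labeled = []
        for sentence in corpus:
            tokens = {token.upper() for token in sentence.split()}
            for predicate, arg1, arg2 in instances:
                if arg1 in tokens and arg2 in tokens:
                    # Noisy positive example: the sentence may or may not
                    # actually express the predicate; learning must tolerate this.
                    labeled.append((sentence, predicate, (arg1, arg2)))
        return labeled

    for example in weak_labels(CORPUS, KB_INSTANCES):
        print(example)

Here only the first two sentences receive labels; the third mentions just one argument and is left unlabeled, so no human ever annotates individual sentences.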

187 pages

Thesis Committee:
Tom Mitchell (Chair)
Eduard Hovy
Noah Smith
Luke Zettlemoyer (University of Washington)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science


