@device(postscript)
@libraryfile(Mathematics10)
@libraryfile(Accents)
@style(fontfamily=timesroman,fontscale=11)
@pagefooting(immediate, left "@c<Technical Report 1996>",
center "@c<Computer Science>",
right "@c<Carnegie Mellon>")
@heading(GLR*: A Robust Grammar-Focused Parser 
for Spontaneously Spoken Language)
@heading(CMU-CS-96-126)
@center(@b(Alon Lavie))
@center(May 1996 - Ph.D. Thesis)
@center(FTP: Unavailable)
@blankspace(1)
@begin(text,spacing=.90)

The analysis of spoken language is widely considered to be a more
challenging task than the analysis of written text. All of the
difficulties of written language can generally be found in spoken
language as well. Parsing spontaneous speech must, however, also deal
with problems such as speech disfluencies, a looser notion of
grammaticality, and the lack of clearly marked sentence boundaries.
Errors introduced into the input by a speech recognizer can further
exacerbate these problems. Most natural language parsing algorithms are
designed to analyze "clean" grammatical input. Because they reject any
input that is ungrammatical in even the slightest way, such parsers are
unsuitable for parsing spontaneous speech, where completely grammatical
input is the exception rather than the rule.

This thesis describes GLR*, a parsing system based on Tomita's Generalized LR 
parsing algorithm that was designed to be robust to two particular types of 
extra-grammaticality: noise in the input and limited grammar coverage. GLR* 
attempts to overcome these forms of extra-grammaticality by ignoring the 
unparsable words and fragments and conducting a search for the maximal subset 
of the original input that is covered by the grammar. The parser is coupled
with a beam search heuristic that limits the combinations of skipped words
considered by the parser and ensures that the parser will operate within
feasible time and space bounds.
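
The skip-and-search idea can be illustrated with a minimal Python sketch.
It is not the GLR* implementation: a small finite-state acceptor stands in
for the LR parsing table, and the names TRANSITIONS, ACCEPTING, and
robust_parse, the toy vocabulary, and the default beam width are all
assumptions made for this example. At each input word a hypothesis either
consumes the word, if the toy grammar allows it, or skips it, and the beam
retains only the hypotheses that have skipped the fewest words.

```python
from typing import Dict, List, Optional, Tuple

# Toy "grammar": a finite-state acceptor standing in for an LR table.
# (state, word) -> next state. Illustrative only, not taken from the thesis.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "i"): 1,
    (1, "am"): 2,
    (2, "free"): 3,
    (3, "monday"): 4,
    (3, "on"): 5,
    (5, "monday"): 4,
}
ACCEPTING = {3, 4}

# A hypothesis is (acceptor state, words kept so far, number of words skipped).
Hypothesis = Tuple[int, Tuple[str, ...], int]


def robust_parse(
    words: List[str], beam_width: int = 3
) -> Optional[Tuple[Tuple[str, ...], int]]:
    """Return (kept words, skipped count) for the best surviving hypothesis."""
    beam: List[Hypothesis] = [(0, (), 0)]
    for word in words:
        candidates: List[Hypothesis] = []
        for state, kept, skipped in beam:
            nxt = TRANSITIONS.get((state, word))
            if nxt is not None:
                # Keep the word: the toy grammar can consume it here.
                candidates.append((nxt, kept + (word,), skipped))
            # Skipping the word is always an option.
            candidates.append((state, kept, skipped + 1))
        # Beam pruning: retain only the hypotheses with the fewest skips.
        candidates.sort(key=lambda h: h[2])
        beam = candidates[:beam_width]
    complete = [h for h in beam if h[0] in ACCEPTING]
    if not complete:
        return None
    _, kept, skipped = min(complete, key=lambda h: h[2])
    return kept, skipped


result = robust_parse("i um am free uh on monday".split())
```

Because the beam discards hypotheses, a search of this kind is not
guaranteed to find the true maximal parsable subset; a narrower beam trades
coverage of the search space for time and space.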

The developed parsing system includes several tools designed to address the
difficulties of parsing spontaneous speech. To cope with high levels of
ambiguity, we developed a statistical disambiguation module, in which 
probabilities are attached directly to the actions in the LR parsing 
table. The parser must also determine the "best" parse from among the
different parsable subsets of an input. We thus designed a general framework 
for combining a collection of parse evaluation measures into an integrated 
heuristic for evaluating and ranking the parses produced by the GLR* parser.
This framework was applied to a set of four parse scoring measures developed 
for the JANUS scheduling domain and the ATIS domain. We also added a parse
quality heuristic that allows the parser to self-judge the quality of the
parse chosen as best and to detect cases in which important information is
likely to have been skipped.
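
A minimal Python sketch of how such scores might be combined follows. It
assumes, purely for illustration, that the statistical score of a parse is
the sum of negative log probabilities of the LR actions used to build it,
and that the integrated heuristic is a weighted sum of penalty-style
measures. The action probabilities, the measure names (skipped_words,
fragments, neg_log_prob), and the weights are hypothetical and are not the
measures or values used in the thesis.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical probabilities attached to LR table actions, keyed by
# (parser state, lookahead word, action). The numbers are made up.
Action = Tuple[int, str, str]
ACTION_PROBS: Dict[Action, float] = {
    (0, "i", "shift"): 1.0,
    (1, "am", "shift"): 0.9,
    (2, "free", "shift"): 0.5,
    (2, "free", "reduce"): 0.5,
}


def neg_log_prob(actions: Sequence[Action]) -> float:
    """Sum of -log p over the actions used to build one parse."""
    return -sum(math.log(ACTION_PROBS[a]) for a in actions)


# Assumed weights for a linear combination of penalty-style measures.
WEIGHTS: Dict[str, float] = {
    "skipped_words": 1.0,
    "fragments": 0.5,
    "neg_log_prob": 0.1,
}


def combined_penalty(measures: Dict[str, float]) -> float:
    """Weighted sum of the individual measures; lower is better."""
    return sum(weight * measures[name] for name, weight in WEIGHTS.items())


def rank_parses(parses: List[dict]) -> List[dict]:
    """Order candidate parses from best (lowest penalty) to worst."""
    return sorted(parses, key=lambda p: combined_penalty(p["measures"]))


candidates = [
    {"id": "A", "measures": {"skipped_words": 2, "fragments": 1, "neg_log_prob": 4.0}},
    {"id": "B", "measures": {"skipped_words": 1, "fragments": 2, "neg_log_prob": 6.0}},
    {"id": "C", "measures": {"skipped_words": 3, "fragments": 1, "neg_log_prob": 2.0}},
]
ranking = [p["id"] for p in rank_parses(candidates)]
```

Under this assumed design, a parse quality check could be as simple as
comparing the combined penalty of the top-ranked parse against a threshold,
although the abstract does not state how the self-judgment is made.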

To demonstrate its suitability for parsing spontaneous speech, the GLR* parser
was integrated into the JANUS speech translation system. Our evaluations on
both transcribed and speech-recognized input indicate that the version 
of the system that uses GLR* produces between 15% and 30% more acceptable 
translations than a corresponding version that uses the original non-robust 
GLR parser. We also developed a version of GLR* that is suitable for parsing 
word lattices produced by the speech recognizer, and investigated how lattice 
parsing can potentially overcome errors of the speech recognizer and further
improve end-to-end performance of the speech translation system.

@blankspace(2line)
@begin(transparent,size=10)
@b(Keywords:@ )@c<Natural language processing, speech understanding, machine
translation, parsing, generalized LR parsing, JANUS>
@end(transparent)
@blankspace(1line)
@end(text)
@flushright(@b[(203 pages)])