|
CMU-CS-02-134
Computer Science Department
School of Computer Science, Carnegie Mellon University
CMU-CS-02-134
Lightweight Structure in Text
Robert C. Miller
May 2002
Ph.D. Thesis
Also appears as Human Computer Interaction Institute
Technical Report CMU-HCII-02-103
CMU-CS-02-134.ps
CMU-CS-02-134.pdf
Keywords: Text processing, structured text, pattern matching,
regular expressions, grammars, web automation, repetitive text editing,
machine learning, programming-by-demonstration,
simultaneous editing, outlier finding, LAPIS
Pattern matching is heavily used for searching, filtering, and
transforming text, but existing pattern languages
offer few opportunities for reuse. Lightweight structure is a new
approach that solves the reuse problem. Lightweight structure has
three parts: a model of text structure as contiguous segments of
text, or regions; an extensible library of structure abstractions
(e.g., HTML elements, Java expressions, or English sentences) that
can be implemented by any kind of pattern or parser; and a region
algebra for composing and reusing structure abstractions. Lightweight
structure does for text pattern matching what procedure abstraction
does for programming, enabling construction of a reusable library.
Lightweight structure has been implemented in LAPIS, a web browser/text editor
that demon-strates several novel techniques:
- Text constraints is
a new pattern language for composing structure abstractions, based on
the region algebra. Text constraint patterns are simple and high-level, and
user studies have shown that users can generate and comprehend them.
- Simultaneous editing uses multiple selections for repetitive text
editing. Multiple selections are inferred from examples given by the
user, drawing on the lightweight structure library to make fast,
accurate, domain-specific inferences from very few examples. In user
studies, simultaneous editing required only 1.26 examples per selection,
approaching the 1-example ideal.
- Outlier finding draws the users
attention to inconsistent selections or pattern matches both possible
false positives and possible false negatives. When integrated into
simultaneous editing and tested in a user study, outlier finding reduced
user errors.
- Unix tools for structured text extend tools like grep
and sort with lightweight structure, and the browser shell integrates
a Unix command prompt into a web browser, offering new ways to build
pipelines and automate web browsing.
Theoretical contributions include
a formal definition of the region algebra, data structures and algorithms
for efficient implementation, and a characterization of the classes of
languages recognized by algebra expressions.
Lightweight structure enables efficient composition and reuse of structure
abstractions defined by various kinds of patterns and parsers,
bringing improvements to pattern matching, text pro-cessing, web
automation, repetitive text editing, inference of patterns from examples,
and error detection.
341 pages
|