CMU-ISR-08-131
Institute for Software Research
School of Computer Science, Carnegie Mellon University
Looking under the Hood of
Stochastic Parts of Speech Tagging
Jana Diesner, Kathleen M. Carley
August 2008
Center for the Computational Analysis of
Social and Organizational Systems (CASOS) Technical Report
Superseded by CMU-ISR-08-131R
CMU-ISR-08-131.pdf
Keywords: Part of speech tagging, hidden Markov models, Viterbi
algorithm, AutoMap
A variety of Natural Language Processing and Information Extraction tasks,
such as question answering and named entity recognition, can benefit from
precise knowledge about a word's syntactic category or
Part of Speech (POS) (Stolz, Tannenbaum et al. 1965;
Church 1988; Rabiner 1989). POS taggers are widely used to assign a single
best POS to every word in text data, with stochastic approaches achieving
accuracy rates of 96 to 97 percent (Jurafsky and Martin 2000). When
building a POS tagger, human beings need to make a set of decisions, some
of which significantly impact the accuracy and other performance aspects
of the resulting engine. In this paper we provide an overview of these
decisions and empirically determine their impact on POS tagging accuracy.
We expect these insights to be valuable for people who want to design,
implement, modify, fine-tune, integrate, or simply make sensible use of
a POS tagger. Based on the results presented herein we
built and integrated a POS tagger into AutoMap, a tool that facilitates
Natural Language Processing and relational text analysis, as a stand-alone
feature as well as an auxiliary for other tasks.
34 pages