Institute for Software Research
School of Computer Science, Carnegie Mellon University


Looking under the Hood of
Stochastic Parts of Speech Tagging

Jana Diesner, Kathleen M. Carley

August 2008

Center for the Computational Analysis of
Social and Organizational Systems (CASOS) Technical Report

Superseded by CMU-ISR-08-131R


Keywords: Part of speech tagging, hidden Markov models, Viterbi algorithm, AutoMap

A variety of Natural Language Processing and Information Extraction tasks, such as question answering and named entity recognition, can benefit from precise knowledge about a word's syntactic category or Part of Speech (POS) (Stolz, Tannenbaum et al. 1965; Church 1988; Rabiner 1989). POS taggers are widely used to assign a single best POS to every word in text data, with stochastic approaches achieving accuracy rates as high as 96 to 97 percent (Jurafsky and Martin 2000). When building a POS tagger, human beings need to make a set of decisions, some of which significantly impact the accuracy and other performance aspects of the resulting engine. In this paper we provide an overview of these decisions and empirically determine their impact on POS tagging accuracy. We expect these insights to be valuable to people who want to design, implement, modify, fine-tune, integrate, or simply make reasonable use of a POS tagger. Based on the results presented herein, we built a POS tagger and integrated it into AutoMap, a tool that facilitates Natural Language Processing and relational text analysis, both as a stand-alone feature and as an auxiliary for other tasks.

34 pages
