CMU-ML-12-107
Machine Learning Department
School of Computer Science, Carnegie Mellon University



Part-of-Speech Tagging for Twitter:
Word Clusters and Other Advances

Olutobi Owoputi, Brendan O'Connor, Chris Dyer
Kevin Gimpel*, Nathan Schneider

September 2012


Keywords: Part-of-speech tagging, social media, semi-supervised learning, natural language processing


We present improvements to a Twitter part-of-speech tagger, making use of several new features and large-scale word clustering. With these changes, tagging accuracy increased from 89.2% to 92.8% and tagging is 40 times faster. In addition, we expanded our Twitter tokenizer to support a broader range of Unicode characters, emoticons, and URLs. Finally, we annotate and evaluate on a new tweet dataset, DAILYTWEET547, that is more statistically representative of English-language Twitter as a whole. The new tagger is released as TweetNLP version 0.3, along with the new annotated data and large-scale word clusters, at http://www.ark.cs.cmu.edu/TweetNLP.

19 pages

*Toyota Technological Institute at Chicago, Chicago, IL 60637

