Institute for Software Research
School of Computer Science, Carnegie Mellon University


The Utility of Corpora Comparison
for Generating Delete Lists

Geoffrey P. Morgan

July 2017

Center for the Computational Analysis of Social and Organizational Systems
CASOS Technical Report


Keywords: Text Analysis, Corpus Comparison, Delete Lists, TF-IDF

Delete lists are lists of words that have been determined to have little useful meaning for textual analysis. One frequently deleted subset of words is stop words: textual tokens, such as "and", "a", or "the", that give a sentence structural or grammatical form but carry little inherent meaning themselves. Identifying stop words is a routine step in most text-cleaning applications, but it is frequently done via user-maintained word lists. I suggest that the corpora comparison technique I devised for word-score polarization can be used to identify low-value words while preserving the bulk of the text tokens. I use both known and random-draw corpora comparisons for this process. By "known" corpora, I mean corpora drawn from explicit data sources, for example the emails of one company and the emails of another. "Random-draw" corpora are created by drawing document sets at random, so this technique could be applied to any sufficiently large text corpus of interest. I use the ability to identify stop words as a proxy for performance in generating useful delete lists. Both the random-draw and known corpora comparison techniques outperform an iteration of TF-IDF (Term Frequency - Inverse Document Frequency), which performs quite poorly on this email data.
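
The random-draw idea described above can be illustrated with a minimal sketch. The abstract does not give the report's polarization formula, so the score used here (a normalized difference of relative frequencies) is an assumption, as are the function names, the whitespace tokenizer, the threshold, and the number of draws; none of them are taken from the report. The sketch splits one corpus at random into two halves several times and keeps only the words that stay close to evenly balanced in every split.

```python
import random
from collections import Counter


def relative_freq(docs):
    """Relative frequency of each token across a list of documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def polarization(docs_a, docs_b):
    """Assumed score in [-1, 1]; 0 means equally frequent in both corpora."""
    fa, fb = relative_freq(docs_a), relative_freq(docs_b)
    scores = {}
    for word in set(fa) | set(fb):
        a, b = fa.get(word, 0.0), fb.get(word, 0.0)
        scores[word] = (a - b) / (a + b)
    return scores


def random_draw_delete_list(docs, threshold=0.2, draws=20, seed=0):
    """Words whose |score| stays within threshold in every random split."""
    rng = random.Random(seed)
    candidates = None
    for _ in range(draws):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores = polarization(shuffled[:half], shuffled[half:])
        balanced = {w for w, s in scores.items() if abs(s) <= threshold}
        # Keep only words that were balanced in all draws so far.
        candidates = balanced if candidates is None else candidates & balanced
    return candidates if candidates is not None else set()
```

Calling `random_draw_delete_list(emails)` on a list of document strings returns a set of candidate delete-list words. Words tied to particular documents swing toward one half or the other and are excluded, while words spread evenly through the corpus, as stop words tend to be, survive every draw.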

16 pages
