Institute for Software Research
School of Computer Science, Carnegie Mellon University
Creating, Using and Updating Thesauri Files
Abhinav Sangal, Kathleen M. Carley
Center for the Computational Analysis of Social and Organizational Systems
AutoMap is text analysis software that performs Network Text Analysis: it runs an automated process over a corpus of raw text data to generate one or more meta-networks, whose nodes and links represent relations among the entities described. AutoMap uses thesaurus files when creating meta-networks. A thesaurus file is a list that associates words or phrases found in the texts with abstract concepts and/or node classes used in the extracted meta-networks.
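For concreteness, a generalization thesaurus of this kind is typically a comma-delimited text file in which each line pairs a phrase found in the text with the concept that replaces it; the entries below are illustrative examples, not taken from any shipped AutoMap thesaurus:

```text
United States of America,united_states
U.S.,united_states
Carnegie Mellon,carnegie_mellon_university
CMU,carnegie_mellon_university
```

With such a file applied, every occurrence of the left-hand phrases in the corpus is rewritten to the corresponding key concept before the meta-network is extracted.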
Over time, a large number of thesauri have been created, and many of them contain entries relevant to new text analysis projects. However, the sheer number of thesauri makes re-use difficult. In this report, we describe one approach to making thesaurus re-use easier: combining and reconciling multiple thesauri into one under user control.
With this approach, creating a meta-network from a raw corpus of text data becomes more efficient, and the user can analyze the resulting meta-network more accurately: the individual thesaurus files are merged into a single, large Universal (or Master) Thesaurus containing all the general abstract concepts, alongside several Domain-specific thesauri.
In the following report, we first discuss the differences between a Universal thesaurus and domain- or project-specific thesauri. We then trace the evolution of the thesaurus formats used by AutoMap, followed by a discussion of the standard Dynamic Network Analysis (DNA) meta-ontology.
We then detail the process used to create a single Universal/Master thesaurus and several Domain thesauri. The process combines two major routines, which we refer to as the Split routine and the Merge routine. We discuss the algorithms for both routines, along with the process used to merge a large number of thesaurus files into a single thesaurus file. Merging is not simply a matter of concatenating the files: it involves several computational functions that make the process more efficient and more accurate, namely deleting duplicate entries, detecting concept cycles, and performing a depth-first search from each concept.
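The duplicate-deletion and cycle-detection steps can be sketched as follows. This is an illustrative sketch, not AutoMap's actual implementation: it assumes each thesaurus is modeled as a mapping from phrase to key concept, so that each phrase has at most one outgoing edge and cycle detection reduces to following replacement chains.

```python
def merge_thesauri(thesauri):
    """Merge several {phrase: concept} mappings.

    Exact duplicate entries are silently dropped; entries that map the
    same phrase to different concepts are reported as conflicts so the
    user can reconcile them (the first mapping seen is kept).
    """
    merged, conflicts = {}, []
    for thesaurus in thesauri:
        for phrase, concept in thesaurus.items():
            if phrase not in merged:
                merged[phrase] = concept
            elif merged[phrase] != concept:
                conflicts.append((phrase, merged[phrase], concept))
    return merged, conflicts


def find_cycles(mapping):
    """Depth-first walk of the replacement graph; returns every entry
    that lies on a concept cycle (e.g. a -> b together with b -> a)."""
    on_cycle = set()
    for start in mapping:
        path = []
        node = start
        # Each phrase maps to one concept, so the search follows a chain.
        while node in mapping and node not in path:
            path.append(node)
            node = mapping[node]
        if node in path:  # chain walked back into itself: a cycle
            on_cycle.update(path[path.index(node):])
    return on_cycle
```

In a merge tool along these lines, `find_cycles` would be run on the merged mapping before it is written out, since a cycle would make the replacement step loop forever.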
The report concludes by discussing future improvements that could further streamline and automate the merge and split processes currently in use.