Machine Learning Department
School of Computer Science, Carnegie Mellon University


Data Integration for Many Data Sources
using Context-Sensitive Similarity Metrics

William W. Cohen, Natalie Glance*, Charles Schafer*,
Roy Tromble*, Yuk Wah Wong*

February 2011


Keywords: Machine learning, data integration, similarity metrics

Good similarity functions are crucial for many important subtasks in data integration, such as "soft joins" and data deduplication, and one widely used similarity function is TFIDF similarity. In this paper we describe a modification of TFIDF similarity that is better suited to certain datasets: namely, large data collections formed by merging many smaller collections, each of which is (nearly) duplicate-free. Our similarity metric, called CX.IDF, shares TFIDF's most important properties: it can be computed efficiently and stored compactly; it can be "learned" using only a few passes over a dataset (in our experiments, one or three passes) and is well-suited to parallelization; and finally, like TFIDF, it requires no labeled training data. In experiments, the new similarity function reduces matching errors relative to TFIDF by up to 80%, and reduces k-nearest-neighbor classification error by 20% on average.
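As background for the modification described in the abstract, the baseline TFIDF similarity between two tokenized records can be sketched as follows. This is a minimal illustration only, not the paper's implementation; the function names and the smoothed log-IDF formula are our own choices for the sketch.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Compute smoothed inverse document frequency over a corpus of
    tokenized records (one pass over the data, as in standard TFIDF)."""
    n = len(corpus)
    df = Counter(tok for rec in corpus for tok in set(rec))
    return {tok: math.log(n / df[tok]) + 1.0 for tok in df}

def tfidf_vector(tokens, idf):
    """Map a tokenized record to an L2-normalized TFIDF weight vector."""
    tf = Counter(tokens)
    vec = {tok: cnt * idf.get(tok, 0.0) for tok, cnt in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else vec

def tfidf_similarity(a, b, idf):
    """Cosine similarity of two tokenized records under TFIDF weights,
    as used in soft joins: 1.0 for identical records, 0.0 for disjoint."""
    va, vb = tfidf_vector(a, idf), tfidf_vector(b, idf)
    return sum(w * vb.get(tok, 0.0) for tok, w in va.items())
```

A soft join would score candidate record pairs with `tfidf_similarity` and keep pairs above a threshold; the paper's CX.IDF metric replaces the global IDF statistics with context-sensitive ones.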

29 pages

*Google, Inc.
