Machine Learning Department
School of Computer Science, Carnegie Mellon University


Data Integration for Many Data Sources
using Context-Sensitive Similarity Metrics

William W. Cohen, Natalie Glance*, Charles Schafer*,
Roy Tromble*, Yuk Wah Wong*

February 2011


Keywords: Machine learning, data integration, similarity metrics

Good similarity functions are crucial for many important subtasks in data integration, such as "soft joins" and data deduplication, and one widely used similarity function is TFIDF similarity. In this paper we describe a modification of TFIDF similarity that is better suited to certain datasets: namely, large data collections formed by merging many smaller collections, each of which is (nearly) duplicate-free. Our similarity metric, called CX.IDF, shares TFIDF's most important properties: it can be computed efficiently and stored compactly; it can be "learned" using only a few passes over a dataset (in our experiments, one or three passes) and is well-suited to parallelization; and finally, like TFIDF, it requires no labeled training data. In experiments, the new similarity function reduces matching errors relative to TFIDF by up to 80%, and reduces k-nearest-neighbor classification error by 20% on average.
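As background for the modification described in the abstract, the baseline TFIDF similarity between two tokenized records can be sketched as follows. This is a minimal illustration only, not the paper's implementation; the function names and the smoothed log-IDF formula are our own choices for the sketch.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Compute smoothed inverse document frequency over a corpus of
    tokenized records (one pass over the data, as in standard TFIDF)."""
    n = len(corpus)
    df = Counter(tok for rec in corpus for tok in set(rec))
    return {tok: math.log(n / df[tok]) + 1.0 for tok in df}

def tfidf_vector(tokens, idf):
    """Map a tokenized record to an L2-normalized TFIDF weight vector."""
    tf = Counter(tokens)
    vec = {tok: cnt * idf.get(tok, 0.0) for tok, cnt in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else vec

def tfidf_similarity(a, b, idf):
    """Cosine similarity of two tokenized records under TFIDF weights,
    as used in soft joins: 1.0 for identical records, 0.0 for disjoint."""
    va, vb = tfidf_vector(a, idf), tfidf_vector(b, idf)
    return sum(w * vb.get(tok, 0.0) for tok, w in va.items())
```

A soft join would score candidate record pairs with `tfidf_similarity` and keep pairs above a threshold; the paper's CX.IDF metric replaces the global IDF statistics with context-sensitive ones.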

29 pages

*Google, Inc.
