CMU-CS-12-102
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-12-102

Axiomatic Analysis of Co-occurrence
Similarity Functions

U Kang, Mikhail Bilenko*, Dengyong Zhou*, Christos Faloutsos

February 2012

CMU-CS-12-102.pdf


Keywords: Co-occurrence, similarity function, axiomatic analysis

Finding similar items based on co-occurrence data is an important data mining task with applications ranging from recommender systems to keyword based advertising. A number of co-occurrence similarity functions have been proposed based on graph-theoretic, geometric, and statistical abstractions. Despite the variety of existing algorithms, however, there exists no formal methodology for analyzing their properties and comparing their benefits and limitations. At the same time, the wide range of applications and domains where co-occurrence-based similarity functions are deployed limits the conclusiveness of experimental evaluations beyond the narrow task typically considered by each method.

This paper proposes an axiomatic approach to analyzing co-occurrence similarity functions. The approach is based on formulating general, domain-independent constraints that well-behaved methods must satisfy to avoid producing degenerate results. Such constraints are derived based on the impact that continuous aggregation of the co-occurrence data is expected to have on absolute or relative similarity estimates. Proposed constraint-based analysis is applied to several representative, popular similarity functions and reveals that, surprisingly, none of them satisfy all constraints unconditionally. The analysis leads to the design of a theoretically well-justified similarity function called Random Walk with Sink (RWS). RWS is parameterized to satisfy the constraints unconditionally, with the parameterization having interesting probabilistic and graph-theoretic interpretations.

32 pages


*Microsoft Research, Redmond, Washington



Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu