Institute for Software Research International
School of Computer Science, Carnegie Mellon University


Towards an Information Theoretic Framework for
Location-Based Data Linkage

Bradley Malin, Edoardo Airoldi

November 2005

Keywords: Databases, data management, record linkage, trails, information theory

A long-standing challenge for data management is correctly relating information that corresponds to the same entity but is distributed across databases. Traditional research into record linkage has concentrated on string comparator metrics for records with common, or relatable, attributes. However, spatially distributed data often lack the shared attributes that database schema integration depends on. Rather than relating schemas directly, spatially distributed data can be related through location-based linkage algorithms, which link patterns in location-specific attributes (e.g., visits). In this paper we focus on two fundamental algorithms for location-based linkage and investigate how different distributions of entity visits to locations influence linkage performance. We begin by studying algorithm accuracy for linking real-world data. We then outline a theoretical framework rooted in information theory that provides insight into the observed phenomena. Our framework also provides a useful basis for studying the performance of location-based linkage algorithms: we analyze two opposing cases in which location visit patterns arise from uniform and power distributions of entities to locations. We carry out our investigations under the assumptions of both complete and incomplete information. Our findings suggest that low-skew distributions are more easily linked when complete information is known. In contrast, when information is incomplete, high-skew distributions lead to higher linkage rates.
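As a rough illustration of what "low skew" and "high skew" mean in information-theoretic terms, the following minimal sketch compares the Shannon entropy of a uniform visit distribution with that of a power-shaped one. It is not the framework developed in the report: the function names, the number of locations, and the exponent of 1.0 are all assumptions chosen for illustration, and the sketch says nothing about linkage accuracy itself.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uniform_visits(n_locations):
    """Hypothetical low-skew case: every location is equally likely to be visited."""
    return [1.0 / n_locations] * n_locations

def power_visits(n_locations, exponent=1.0):
    """Hypothetical high-skew case: visit probability decays as rank ** -exponent.

    The exponent is an assumption for illustration, not a value from the report.
    """
    weights = [rank ** -exponent for rank in range(1, n_locations + 1)]
    total = sum(weights)
    return [w / total for w in weights]

n = 8  # assumed number of locations
h_uniform = shannon_entropy(uniform_visits(n))
h_power = shannon_entropy(power_visits(n))

print(f"uniform visits: {h_uniform:.3f} bits")
print(f"power visits:   {h_power:.3f} bits")
```

Under these assumptions the uniform case attains the maximum entropy for the given number of locations, while the power-shaped case concentrates visits on a few locations and therefore has lower entropy.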

22 pages
