CMU-ISRI-05-131
Institute for Software Research International
School of Computer Science, Carnegie Mellon University
Towards an Information Theoretic Framework for
Location-Based Data Linkage
Bradley Malin, Edoardo Airoldi
November 2005
Keywords: Databases, data management, record linkage, trails,
information theory
A long-standing challenge for data management is the ability to
correctly relate information corresponding to the same entity
distributed across databases. Traditional research into record
linkage has concentrated on string comparator metrics for records
with common, or relatable, attributes. However, spatially
distributed data are often devoid of such crucial information
for database schema integration. Rather than directly relate
schemas, spatially distributed data can be related through
location-based linkage algorithms, which link patterns in
location-specific attributes (e.g., visits). In this paper, we focus
on two fundamental algorithms for location-based linkage and
investigate how the distribution of entity visits across
locations influences linkage performance. We begin by studying
algorithm accuracy for linking real-world data. We then outline
a theoretical framework rooted in information theory that allows
us to provide insight into observed phenomena. Our framework also
provides a useful basis for studying the performance of
location-based linkage algorithms: we analyze two opposing cases
where location visit patterns arise from uniform and power distributions
of entities to locations. We carry out our investigations under the
assumptions of both complete and incomplete information. Our findings
suggest that low-skew distributions are more easily linked when
information is complete. In contrast, when information is
incomplete, high-skew distributions lead to higher linkage rates.
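
The contrast between low-skew and high-skew visit distributions can be illustrated with Shannon entropy, a standard information-theoretic measure. The sketch below is only an illustration and is not the framework developed in the report: the function names, the number of locations, and the power-law exponent are all assumptions chosen for the example.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uniform_distribution(n):
    """Every one of n locations is equally likely to be visited (low skew)."""
    return [1.0 / n] * n

def power_law_distribution(n, exponent=1.0):
    """Visit probability of the location at rank r is proportional to
    r ** -exponent (high skew). The exponent is a hypothetical parameter."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical setting: 8 locations, exponent 1.5 for the skewed case.
n_locations = 8
uniform = uniform_distribution(n_locations)
skewed = power_law_distribution(n_locations, exponent=1.5)

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)

print(f"uniform entropy: {h_uniform:.3f} bits")
print(f"skewed entropy:  {h_skewed:.3f} bits")
```

The uniform case attains the maximum entropy for a fixed number of locations, and any skewed distribution falls below it. How that difference translates into linkage rates under complete and incomplete information is the subject of the report itself and is not modeled here.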
22 pages