|
CMU-ISRI-04-107
Institute for Software Research International
School of Computer Science, Carnegie Mellon University
CMU-ISRI-04-107
Learning Semantically Robust Rules from Data
Yiheng Li, Latanya Sweeney
February 2004
Also appears as
Center for Automated Learning and Discovery
Technical Report CMU-CALD-04-100
Also associated with the Data Privacy Laboratory
CMU-ISRI-04-107.pdf
Keywords: Association rules, rule learning, hierarchical learning, data mining
We introduce the problem of mining robust rules, which are expressive
multi-dimensional generalized association rules. Consider a large
relational table, where associated with each attribute is a hierarchy
whose base values are those originally represented in the data, and
values appearing at higher levels in the hierarchy represent increasingly
more general concepts of base values. Attribute hierarchies provide
meaningful levels of concept aggregation, such as the encoding of postal
codes (ZIP) or dates, or the taxonomy of products. We find the least
general rules formed by combining mixed levels of generalizations across
attributes to convey the maximum expression of information supported by
attribute hierarchies, parameter settings and data tuples. We term these
"robust rules"and introduce a GenTree algorithm as a means to learn
robust rules from a table. An example of a robust rule from a table
having base values {5-digit ZIP, gender, registration date (year/month/day),
party} might be "women living in Cambridge (021**) and registered in the
1970s (197*/xx/xx) tend to be Democrats." Previous studies on mining
generalized association rules have been limited dimensionally (e.g.,
transactional data), by data type (e.g., quantitative data), and/or to
rules expressed from either fixed-level or non-semantic abstractions.
Such approaches limit the kinds of rules that can be learned.
Experiments using GenTree with two real-world datasets, containing
10,000 six-attributed tuples and over 4,000 eight-attributed tuples
each, show that learned rules convey more comprehensive information
than possible with traditional association rule mining algorithms,
because traditional approaches limit the expressivity of the rules
they generate.
20 pages
|