CMU-ISRI-05-101
Institute for Software Research International
School of Computer Science, Carnegie Mellon University
Adding Semantics and Rigor to Association Rule Learning:
the GenTree Approach
Yiheng Li, Latanya Sweeney
January 2005
Keywords: Association rules, classification problems, data
mining, knowledge acquisition, rule learning, hierarchical learning.
Learning semantically useful association rules across all attributes
of a relational table requires: (1) more rigorous mining than afforded
by traditional approaches; and (2) the invention of knowledge ratings
for learned rules, not just statistical ratings. Traditional algorithms
began by learning rules over one attribute expressed in the domain
values of that attribute (Srikant and Agrawal, 1995). "People who buy
diapers tend to buy beer" is an example from grocery purchases. In the
In the second generation, Srikant and Agrawal (1996) introduced a hierarchy
whose base values are those originally represented in the data, and
values appearing at higher levels in the hierarchy represent
increasingly more general concepts of the base values. Rules learned
over the attribute using the hierarchy are termed generalized
association rules (or cross-level rules). "People who buy
baby products tend to buy controlled substances" is an example of
a generalized association rule. Rules that mix the level of
expression for the same item still could not be learned.
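To make the hierarchy concrete, the following sketch (in Python; the items, concept names, and helper are illustrative assumptions, not the paper's data or code) shows one way a per-attribute hierarchy can be stored and how a base value generalizes upward; a mixed-level rule is then one that relates a general concept to a value beneath it in the same hierarchy.

    # Illustrative child -> parent hierarchy over one attribute; all names
    # here are assumptions made for the example.
    HIERARCHY = {
        "beer": "alcoholic drinks",
        "alcoholic drinks": "beverages",
        "soda": "soft drinks",
        "soft drinks": "beverages",
    }

    def ancestors(item, hierarchy=HIERARCHY):
        """Return the chain of increasingly general concepts above an item."""
        chain = []
        while item in hierarchy:
            item = hierarchy[item]
            chain.append(item)
        return chain

    print(ancestors("beer"))  # ['alcoholic drinks', 'beverages']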
"People who buy
beverages tend to buy beer" is an example of a rule that could not
be properly learned if most beverage purchases were of beer. The
work reported herein continues the evolution in the expressiveness
of association rules to its broadest application -- semantically
rated rules learned from a large relational table having many
attributes. Associated with each attribute is a hierarchy. We find
the rules that are formed by combining mixed levels of generalizations
across all attributes and that convey the maximum expression of
information supported by attribute hierarchies, parameter settings
and data tuples. We term these robust rules. An example of a robust
rule is "People who buy baby products tend to buy controlled substances,
use credit cards and make purchases on Saturdays." Multiple attributes
from the transaction are included, and both cross-level and mixed-level
rules are expressed. Because of the improved expressiveness, many
more statistically relevant rules are learned. To identify semantically
useful rules among them, we compare rule features appearing on the left
side (before "tend to") to those appearing on the right side (after
"tend to") to identify learning in breadth (more attributes appearing
on the right side than on the left) and learning in depth (values on
the right side appearing at lower levels in the corresponding hierarchies
than those on the left). For a set of learned rules surpassing statistical
thresholds of relevance, the most semantically relevant rules are those
expressing more depth and/or breadth in the terms of their expression.
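As a rough sketch of this comparison (assuming each rule side is represented as a map from attribute to a value and its depth in that attribute's hierarchy, with larger depths meaning more specific values; the names and the representation are assumptions for illustration, not GenTree's actual interface):

    # Illustrative breadth/depth tests on the two sides of a rule.
    # Each side maps attribute -> (value, depth); depth 0 is the hierarchy
    # root and larger depths are more specific.

    def learns_in_breadth(lhs, rhs):
        """More attributes appear on the right side than on the left."""
        return len(rhs) > len(lhs)

    def learns_in_depth(lhs, rhs):
        """Some attribute shared by both sides is expressed at a more
        specific hierarchy level on the right side than on the left."""
        shared = set(lhs) & set(rhs)
        return any(rhs[a][1] > lhs[a][1] for a in shared)

    # Hypothetical rule: "people who buy baby products tend to buy beer,
    # use credit cards and make purchases on Saturdays."
    lhs = {"item": ("baby products", 1)}
    rhs = {"item": ("beer", 3), "payment": ("credit card", 2), "day": ("Saturday", 2)}
    print(learns_in_breadth(lhs, rhs), learns_in_depth(lhs, rhs))  # True True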
The other rules are merely subset expressions. We also introduce a
GenTree algorithm as an efficient means to learn robust rules from a
table. Experiments using GenTree with two real-world datasets,
containing 10,000 six-attribute tuples and over 4,000 eight-attribute
tuples respectively, show that the learned rules convey more comprehensive
information than is possible with traditional association rule mining algorithms.