Center for Automated Learning and Discovery
School of Computer Science, Carnegie Mellon University


Learning Semantically Robust Rules from Data

Yiheng Li, Latanya Sweeney

February 2004

Also appears as
Institute for Software Research International
Technical Report CMU-ISRI-04-107


Keywords: Association rules, rule learning, hierarchical learning, data mining.

We introduce the problem of mining robust rules, which are expressive multi-dimensional generalized association rules.  Consider a large relational table, where associated with each attribute is a hierarchy whose base values are those originally represented in the data, and values appearing at higher levels in the hierarchy represent increasingly more general concepts of base values.  Attribute hierarchies provide meaningful levels of concept aggregation, such as the encoding of postal codes (ZIP) or dates, or the taxonomy of products.  We find the least general rules formed by combining mixed levels of generalizations across attributes to convey the maximum expression of information supported by attribute hierarchies, parameter settings and data tuples.  We term these "robust rules"and introduce a GenTree algorithm as a means to learn robust rules from a table.  An example of a robust rule from a table having base values {5-digit ZIP, gender, registration date (year/month/day), party} might be "women living in Cambridge (021**) and registered in the 1970s (197*/xx/xx) tend to be Democrats." Previous studies on mining generalized association rules have been limited dimensionally (e.g., transactional data), by data type (e.g., quantitative data), and/or to rules expressed from either fixed-level or non-semantic abstractions.  Such approaches limit the kinds of rules that can be learned.  Experiments using GenTree with two real-world datasets, containing 10,000 six-attributed tuples and over 4,000 eight-attributed tuples each, show that learned rules convey more comprehensive information than possible with traditional association rule mining algorithms, because traditional approaches limit the expressivity of the rules they generate.

21 pages

SCS Technical Report Collection
School of Computer Science homepage

This page maintained by