January 17, 2001 Data Mining: Concepts and Techniques 1
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
n Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
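The join and prune steps above can be sketched in Python (a minimal sketch; itemsets are represented as sorted tuples, and the function name `apriori_gen` is an assumption, not from the slides):

```python
# Sketch of Apriori candidate generation: self-join then subset-based prune.
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from frequent (k-1)-itemsets L_prev."""
    L_prev = set(L_prev)
    Ck = set()
    # Step 1: self-join -- merge two (k-1)-itemsets that agree on the
    # first k-2 items, with p[k-2] < q[k-2] (the join condition above)
    for p in L_prev:
        for q in L_prev:
            if p[:k-2] == q[:k-2] and p[k-2] < q[k-2]:
                Ck.add(p + (q[k-2],))
    # Step 2: prune -- delete c if any (k-1)-subset of c is not frequent
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k-1))}
```

Running it on the L3 example from a later slide, {abc, abd, acd, ace, bcd}, yields C4 = {abcd}: acde is generated by the join but pruned because ade is not frequent.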
How to Count Supports of Candidates?
n Why is counting supports of candidates a problem?
n The total number of candidates can be huge
n One transaction may contain many candidates
n Method:
n Candidate itemsets are stored in a hash-tree
n Leaf node of hash-tree contains a list of itemsets and counts
n Interior node contains a hash table
n Subset function: finds all the candidates contained in a transaction
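The hash-tree is the slides' efficient structure; as an illustration only, the subset function can be approximated with a plain dictionary lookup over each transaction's k-subsets (a simplified sketch, adequate for small examples but not a substitute for the hash-tree):

```python
# Simplified support counting: check each transaction's k-subsets against
# the candidate set, accumulating counts in a dictionary.
from itertools import combinations

def count_supports(transactions, Ck, k):
    counts = {c: 0 for c in Ck}
    for t in transactions:
        # the subset function: find all candidates contained in transaction t
        for s in combinations(sorted(t), k):
            if s in counts:
                counts[s] += 1
    return counts
```

A hash-tree avoids enumerating all k-subsets of long transactions; this dictionary version enumerates them all, trading speed for simplicity.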
Example of Generating Candidates
n L3={abc, abd, acd, ace, bcd}
n Self-joining: L3*L3
n abcd from abc and abd
n acde from acd and ace
n Pruning:
n acde is removed because ade is not in L3
n C4={abcd}
Methods to Improve Apriori’s Efficiency
n Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
n Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
n Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
n Sampling: mining on a subset of the given data, with a lower support threshold + a method to determine the completeness
n Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
Is Apriori Fast Enough? — Performance Bottlenecks
n The core of the Apriori algorithm:
n Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
n Use database scan and pattern matching to collect counts for the candidate itemsets
n The bottleneck of Apriori: candidate generation
n Huge candidate sets:
n 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
n To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
n Multiple scans of database:
n Needs (n+1) scans, where n is the length of the longest pattern
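A quick back-of-envelope check of these figures (10^4 choose 2 = 49,995,000, on the order of 10^7 as the slide states; and 2^100 ≈ 1.27 × 10^30):

```python
# Verify the candidate-explosion figures quoted above.
from math import comb

n1 = 10**4                 # number of frequent 1-itemsets
c2 = comb(n1, 2)           # candidate 2-itemsets produced by the self-join
# c2 == 49_995_000, on the order of 10^7

# reaching a single length-100 pattern requires ~2^100 candidates
assert 2**100 > 10**30
```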
Mining Frequent Patterns Without Candidate Generation
n Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
n highly condensed, but complete for frequent pattern mining
n avoid costly database scans
n Develop an efficient, FP-tree-based frequent pattern mining method
n A divide-and-conquer methodology: decompose mining tasks into smaller ones
n Avoid candidate generation: sub-database test only!
Construct FP-tree from a Transaction DB
[FP-tree figure: root {} with paths f:4 → c:3 → a:3 → m:2 → p:2; f:4 → c:3 → a:3 → b:1 → m:1; f:4 → b:1; c:1 → b:1 → p:1]
Header Table (item : frequency, with head node-links into the tree)
f : 4, c : 4, a : 3, b : 3, m : 3, p : 3
min_support = 0.5
TID  Items bought              (ordered) frequent items
100  {f, a, c, d, g, i, m, p}  {f, c, a, m, p}
200  {a, b, c, f, l, m, o}     {f, c, a, b, m}
300  {b, f, h, j, o}           {f, b}
400  {b, c, k, s, p}           {c, b, p}
500  {a, f, c, e, l, p, m, n}  {f, c, a, m, p}
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
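Steps 1–3 can be sketched as follows (a simplified construction: ties between equally frequent items, such as f and c here, are broken alphabetically rather than as drawn in the figure, so the tree shape may differ while the per-item counts still match):

```python
# Minimal FP-tree construction following the three steps above.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Step 1: scan DB once, find frequent single items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Step 2: order each transaction's frequent items, frequency-descending
    def order(t):
        return sorted((i for i in t if i in freq),
                      key=lambda i: (-freq[i], i))
    root = FPNode(None, None)
    header = {i: [] for i in freq}   # item -> node-links (plain lists here)
    # Step 3: scan DB again, insert each ordered transaction into the tree
    for t in transactions:
        node = root
        for item in order(t):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header
```

On the five transactions above with min_count = 3, the counts accumulated along the node-links of each item equal its support (f:4, c:4, a:3, b:3, m:3, p:3), regardless of how the tie is broken.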
Benefits of the FP-tree Structure
n Completeness:
n never breaks a long pattern of any transaction
n preserves complete information for frequent pattern mining
n Compactness:
n reduces irrelevant information: infrequent items are gone
n frequency-descending ordering: more frequent items are more likely to be shared
n never larger than the original database (not counting node-links and counts)
n Example: for the Connect-4 DB, the compression ratio could be over 100
Mining Frequent Patterns Using FP-tree
n General idea (divide-and-conquer):
n Recursively grow frequent pattern paths using the FP-tree
n Method:
n For each item, construct its conditional pattern base, and then its conditional FP-tree
n Repeat the process on each newly created conditional FP-tree
n Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
2) Construct conditional FP-tree from each conditional pattern base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
§ If the conditional FP-tree contains a single path, simply enumerate all the patterns
Step 1: From FP-tree to Conditional Pattern Base
n Starting at the frequent-item header table in the FP-tree
n Traverse the FP-tree by following the link of each frequent item
n Accumulate all of the transformed prefix paths of that item to form its conditional pattern base
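For illustration, the conditional pattern base can also be computed directly from the ordered transactions instead of traversing node-links (equivalent on this example; for item m this yields the textbook base {fca: 2, fcab: 1}):

```python
# Conditional pattern base: the prefix path of `item` in each ordered
# transaction, with a count of how often each prefix occurs.
from collections import Counter

def conditional_pattern_base(ordered_db, item):
    base = Counter()
    for t in ordered_db:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

# ordered transactions from the FP-tree construction slide
db = [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
      ['c','b','p'], ['f','c','a','m','p']]
base_m = conditional_pattern_base(db, 'm')
# base_m == {('f','c','a'): 2, ('f','c','a','b'): 1}
```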
Mining Multi-Level Associations
n A top-down, progressive deepening approach:
n First find high-level strong rules:
milk → bread [20%, 60%]
n Then find their lower-level “weaker” rules:
2% milk → wheat bread [6%, 50%]
n Variations in mining multiple-level association rules:
n Level-crossed association rules:
2% milk → Wonder wheat bread
n Association rules with multiple, alternative hierarchies:
2% milk → Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
n Uniform Support: the same minimum support for all levels
n + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
n – Lower-level items do not occur as frequently. If the support threshold is
n too high ⇒ miss low-level associations
n too low ⇒ generate too many high-level associations
n Reduced Support: reduced minimum support at lower levels
n There are 4 search strategies:
n Level-by-level independent
n Level-cross filtering by k-itemset
n Level-cross filtering by single item
n Controlled level-cross filtering by single item
Uniform Support
Multi-level mining with uniform support
Milk [support = 10%]
2% Milk [support = 6%]
Skim Milk [support = 4%]
Level 1: min_sup = 5%
Level 2: min_sup = 5%
Reduced Support
Multi-level mining with reduced support
Milk [support = 10%]
2% Milk [support = 6%]
Skim Milk [support = 4%]
Level 1: min_sup = 5%
Level 2: min_sup = 3%
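The two figures above can be checked in a few lines (support values taken directly from the figures):

```python
# Level 1 = Milk; level 2 = its children. Which items pass under each strategy?
support = {'Milk': 0.10, '2% Milk': 0.06, 'Skim Milk': 0.04}
level   = {'Milk': 1, '2% Milk': 2, 'Skim Milk': 2}

def frequent(min_sup):                  # min_sup maps level -> threshold
    return {i for i in support if support[i] >= min_sup[level[i]]}

uniform = frequent({1: 0.05, 2: 0.05})  # Skim Milk (4%) is missed
reduced = frequent({1: 0.05, 2: 0.03})  # both level-2 items survive
```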
Multi-level Association: Redundancy Filtering
n Some rules may be redundant due to “ancestor” relationships between items.
n Hybrid-dimension association rules (repeated predicates):
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
n Categorical Attributes:
n finite number of possible values, no ordering among values
n Quantitative Attributes:
n numeric, implicit ordering among values
Techniques for Mining MD Associations
n Search for frequent k-predicate sets:
n Example: {age, occupation, buys} is a 3-predicate set
n Techniques can be categorized by how quantitative attributes are treated:
1. Using static discretization of quantitative attributes
n Quantitative attributes are statically discretized using predefined concept hierarchies
2. Quantitative association rules
n Quantitative attributes are dynamically discretized into “bins” based on the distribution of the data
3. Distance-based association rules
n This is a dynamic discretization process that considers the distance between data points
Static Discretization of Quantitative Attributes
n Discretized prior to mining using concept hierarchy.
n Numeric values are replaced by ranges.
n In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
n The data cube is well suited for mining.
n The cells of an n-dimensional cuboid correspond to the predicate sets.
n Mining from data cubes can be much faster.
[Lattice of cuboids: (); (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys)]
Quantitative Association Rules
age(X, ”30-34”) ∧ income(X, ”24K-48K”) ⇒ buys(X, ”high resolution TV”)
n Numeric attributes are dynamically discretized
n such that the confidence or compactness of the rules mined is maximized
n 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
n Cluster “adjacent” association rules to form general rules using a 2-D grid
n Example:
ARCS (Association Rule Clustering System)
How does ARCS work?
1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize
Limitations of ARCS
n Only quantitative attributes on LHS of rules.
n Only 2 attributes on LHS. (2D limitation)
n An alternative to ARCS
n Non-grid-based
n equi-depth binning
n clustering based on a measure of partial completeness
n “Mining Quantitative Association Rules in Large Relational Tables” by R. Srikant and R. Agrawal.
Mining Distance-based Association Rules
n Binning methods do not capture the semantics of interval data
n Distance-based partitioning gives a more meaningful discretization, considering:
n density/number of points in an interval
n “closeness” of points in an interval
Constrained Association Query Optimization Problem
n Given a CAQ = { (S1, S2) | C }, the algorithm should be:
n sound: it only finds frequent sets that satisfy the given constraints C
n complete: all frequent sets that satisfy the given constraints C are found
n A naïve solution:
n Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one
n Our approach:
n Comprehensively analyze the properties of constraints and try to push them as deeply as possible inside the frequent set computation
Anti-monotone and Monotone Constraints
n A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
n A constraint Cm is monotone iff for any pattern S satisfying Cm, every super-pattern of S also satisfies it
Succinct Constraint
n A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is a selection operator
n SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t. SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
n A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
January 17, 2001 Data Mining: Concepts and Techniques 68
Convertible Constraint
n Suppose all items in patterns are listed in a total order R
n A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
n A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C
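The classic example is avg(S.price) ≥ v under a price-descending order R (the slide phrases the definition via suffixes; this sketch shows the underlying fact along the order): each newly added item is no larger than the current average, so the running average never increases, and once a pattern's average drops below v no extension can recover. Prices here are hypothetical:

```python
# With items listed in price-descending order, the running average is
# non-increasing, so avg(S.price) >= v behaves anti-monotonically
# along this order (hypothetical prices).
prices = [90, 70, 40, 10, 5]
avgs = [sum(prices[:k]) / k for k in range(1, len(prices) + 1)]
assert all(a >= b for a, b in zip(avgs, avgs[1:]))
```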
Relationships Among Categories of Constraints
Succinctness
Anti-monotonicity Monotonicity
Convertible constraints
Inconvertible constraints
Property of Constraints: Anti-Monotone
n Anti-monotonicity : If a set S violates the constraint, any superset of S violates the constraint.
n Examples:
n sum(S.Price) ≤ v is anti-monotone
n sum(S.Price) ≥ v is not anti-monotone
n sum(S.Price) = v is partly anti-monotone
n Application:
n Push “sum(S.Price) ≤ 1000” deeply into the iterative frequent set computation
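A minimal sketch of this application, with hypothetical prices: since prices are non-negative, any candidate violating sum(S.Price) ≤ v can be discarded before counting, along with all its supersets:

```python
# Pushing the anti-monotone constraint sum(S.Price) <= v into the
# level-wise search: violating itemsets are pruned before counting.
price = {'a': 600, 'b': 500, 'c': 300}   # hypothetical prices
v = 1000

def satisfies(S):
    return sum(price[i] for i in S) <= v

candidates = [('a',), ('b',), ('c',), ('a','b'), ('a','c'), ('b','c')]
survivors = [S for S in candidates if satisfies(S)]
# ('a','b') is pruned (1100 > 1000); no superset of it need be generated
```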
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Why Is the Big Pie Still There?
n More on constraint-based mining of associations
n Boolean vs. quantitative associations
n Association on discrete vs. continuous data
n From association to correlation and causal structure analysis
n Association does not necessarily imply correlation or causal relationships
n From intra-transaction associations to inter-transaction associations
n E.g., break the barriers of transactions (Lu, et al. TOIS’99)
n From association analysis to classification and clustering analysis
n E.g., clustering association rules
Summary
n Association rule mining
n probably the most significant contribution from the database community in KDD
n A large number of papers have been published
n Many interesting issues have been explored
n An interesting research direction
n Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References
n R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
n R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
n R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
n R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
n R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
n S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
n S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
n K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
n D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
n M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References (2)
n G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
n Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
n T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
n E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
n J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
n J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
n J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
n T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
n M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
n M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
References (3)
n F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
n B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
n H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
n H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
n H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
n R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
n R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
n R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
n N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
References (4)
n J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
n J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00, 11-20, Dallas, TX, May 2000.
n J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00. Boston, MA. Aug. 2000.
n G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
n B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
n S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
n S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
n A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
n A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
References (5)
n C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
n R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
n R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
n R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
n H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
n D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
n K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
n M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
n M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug. 2000.
n O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.