Association Rule Mining (ARM)
We will look for common models for ARM/Classification/Clustering, e.g., R(K1..Kk, A1..An) where the Ks are structure attributes and the As are feature attributes.
– What's the difference between structure and feature attributes?
– Sometimes there is none (i.e., no K's). Other times there are strictly structural attributes (e.g., the X,Y-coordinates of an image). We may want to treat these structure attributes differently from feature attributes such as R, G or B.
– Structural attributes are similar to keys (they identify tuples, typically by position in space).
Association Rule Mining on R is a matter of finding all (qualifying) rules of the form A ⇒ C, where A is a subset of tuples (a tupleset) called the antecedent and C is a subset called the consequent.
– Tuplesets for quantitative attributes are usually product sets, ∏i=1..n Si (itemsets), or rectangles, ∏i=1..n [li,ui], li ≤ ui in Ai (some intervals may be full-range, [lowval,hival], and the full-range intervals are often left out of the product notation, i.e., not listed).
• In Boolean ARM (each Ai is Boolean), there may be only one meaningful subinterval, [1,1], and the antecedent/consequent are sets of feature attributes (those with interval = [1,1]).
These notes contain NDSU confidential & proprietary material. Patents pending on bSQ, Ptree technology.
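As a concrete sketch of the rectangle/itemset model above (all names here are illustrative, not from the notes): an itemset can be held as a dict of per-attribute intervals, with omitted attributes treated as full-range.

```python
# Illustrative sketch: an itemset as a "rectangle" of per-attribute intervals.
# Attributes missing from rect are full-range, so they impose no constraint.
def in_rectangle(tup, rect):
    """tup: dict attr -> value; rect: dict attr -> (l, u) interval."""
    return all(l <= tup[a] <= u for a, (l, u) in rect.items())

# Boolean ARM special case: each listed attribute gets the interval [1, 1].
rule_body = {"Beer": (1, 1), "Chips": (1, 1)}
print(in_rectangle({"Beer": 1, "Chips": 1, "Aspirin": 0}, rule_body))  # True
```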
Slalom Metaphor for an Itemset
The rectangles ∏i=1..n [li,ui], where li ≤ ui in Ai, can be visualized as a set of "gates" (e.g., on a ski slope), one gate for each non-full-range attribute. Below, A2, A5, A7, A8 are full-range (l = LowValue or LV and u = HighValue or HV).
Itemset = the set of tuples that "ski thru" the gates.
– The metaphor is related to Parallel Coordinates in the literature.
– The metaphor is also related to some multi-dimensional tuple visualization diagrams:
[Figure: gates [l1,u1] on A1, [l3,u3] on A3, [l4,u4] on A4 (l4 = LV), and [l6,u6] on A6 (u6 = HV), drawn over axes A1..A7 as a Parallel diagram, a Jewel diagram (Dr. Juell & W. Jockheck), and a Mountain diagram. The Barrel diagram is similar to the Mountain diagram but wrapped around a barrel (i.e., a 3-D helix).]
Slalom Metaphor cont.
A simple example to try to get some intuition on how these diagrams might be used effectively. First, the previous configurations:
[Figure: the same tuples over A1..A5 drawn as a Parallel diagram, a Jewel diagram, a Mountain diagram in upward orientation, and a Mountain diagram in chain orientation.]
The upward-oriented Mountain appears to better reflect "closeness in shape" than the others??? (the red and blue should be closer to each other than either is to the green???). Therefore it might be better for cluster analysis (later).
Can the polygon formed from the centroids of the rectangles provide visual intuition - bounds for itemset support?
Slalom Metaphor for an Association Rule
For a rule A ⇒ C, e.g., [l1,u1]1 ∧ [l3,u3]3 ∧ [l6,u6]6 ⇒ [l4,u4]4:
– support of A = count of tuples thru A
– support of A ⇒ C = count of tuples thru both A & C
– confidence of A ⇒ C = the fraction of the tuples going thru A that also go thru C
• = Supp(A ∧ C) / Supp(A)
[Figure: the rule's gates on axes A1..A8: antecedent gates [l1,u1], [l3,u3], [l6,u6] (with HV = u6) and consequent gate [l4,u4] (with l4 = 0).]
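A minimal sketch of these definitions, reusing the illustrative in_rectangle helper from the earlier sketch (support/confidence names are mine, not from the notes):

```python
# Illustrative sketch: support and confidence of a rule A => C.
def support(tuples, rect):
    """Fraction of tuples that 'ski thru' every gate of rect."""
    return sum(in_rectangle(t, rect) for t in tuples) / len(tuples)

def confidence(tuples, ante, cons):
    """Conf(A => C) = Supp(A and C) / Supp(A), for disjoint gate sets."""
    both = {**ante, **cons}   # the conjunction of the two gate sets
    return support(tuples, both) / support(tuples, ante)
```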
Precision Ag ARM example
Identifying high and low crop yields
E.g., R(X, Y, R, G, B, Y), where R/G/B are the red/green/blue reflectances from the pixel (square area) at (x,y) and the last attribute, Y, is the yield at (x,y).
– Assume all are 8-bit values.
High support and confidence rules are expected, like: [192,255]G ∧ [0,63]R ⇒ [128,255]Y
How to apply rules?
– Obtain rules from the previous year's data.
– Apply the rules in the current year after each aerial photo is taken at different stages of plant growth.
– By irrigating/adding nitrate, Green/Red values can be increased, and therefore Yield may be increased.
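A hedged sketch of applying such a rule's antecedent to the current year's image, assuming the bands are NumPy arrays registered to the same (x,y) grid (the array names and random data are illustrative):

```python
import numpy as np

# Illustrative bands; in practice G and R come from the current aerial photo.
rng = np.random.default_rng(0)
G = rng.integers(0, 256, (320, 320), dtype=np.uint8)
R = rng.integers(0, 256, (320, 320), dtype=np.uint8)

# Antecedent mask of [192,255]G ^ [0,63]R => [128,255]Y: the pixels the
# rule predicts will fall in the high-yield interval.
ante = (G >= 192) & (R <= 63)
print("pixels predicted high-yield:", int(ante.sum()))
```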
Market Basket ARM example
Identifying purchasing patterns
• If a customer buys beer, s/he will buy chips (so shelve the chips near the beer?). E.g., a Boolean relation, R(Tid, Aspirin, Beer, Chips, Dates, .., Zippo)
• Tid = transaction id (for a customer going thru checkout). In any field of a tuple there is a 1 if the customer has that product in his/her basket, else 0.
• In Boolean ARM we are only interested in Buy/noBuy (not in quantity).
• Therefore, itemsets are hyperboxes, ∏i=1..n [1,1]ji, where the Iji are the items purchased.
Support and Confidence: Given itemsets A and C,
• Supp(A) = ratio of the number of transactions supporting A over the total number of transactions.
• Supp(A ⇒ C) = ratio of the number of transactions supporting both A and C over the total number of transactions.
• Conf(A ⇒ C) = ratio of the number of transactions supporting both A and C over the number supporting A
= Supp(A ∪ C) / Supp(A) in list notation = Supp(A ∧ C) / Supp(A) in vector notation
Thresholds
• Frequent itemsets = itemsets whose support exceeds a minimum support threshold (minsupp).
– Lk denotes the set of frequent k-itemsets (sets with k items in them).
• High confidence rules = rules whose confidence exceeds a minimum confidence threshold (minconf).
Lists versus Vectors in MBR
In most MBR treatments, we have
– Items, i (purchasable); I is the universe of all items.
– Transactions, t (a customer thru checkout with an itemset, the t-itemset)
• The t-itemset is usually expressed as a list of items, {i1, i2, …, in}
• The t-itemset can be expressed as a bit vector, [0100101…1000]
– where each item is assigned a bit position and that bit is 1 if the t-itemset contains that item and 0 otherwise.
– The vector version corresponds to the table model we have been using, R(K1, A1, …, An), with K1 = Trans-id and the Ai's the items in the assigned order (the datatype of each is Boolean).
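A small sketch of the two representations, assuming an illustrative item ordering; on the vector form, support of an itemset reduces to a bitwise AND:

```python
ITEMS = ["Aspirin", "Beer", "Chips", "Dates"]   # assumed item-to-bit ordering

def to_vector(itemlist):
    """Encode a t-itemset list as a bit vector (Python int; bit i = item i)."""
    return sum(1 << ITEMS.index(i) for i in itemlist)

baskets = [to_vector(b) for b in (["Beer", "Chips"], ["Beer"], ["Chips", "Dates"])]
query = to_vector(["Beer", "Chips"])
# t supports the itemset iff all of the itemset's bits are set in t's vector.
supp = sum((b & query) == query for b in baskets) / len(baskets)
print(supp)   # 1/3: only the first basket contains both Beer and Chips
```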
Association Rule Example
Given a database of transactions, where each transaction is a list (or bit vector) of the items purchased by a customer in a visit:

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F
Tid A B C D E F
2000 1 1 1 0 0 0
1000 1 0 1 0 0 0
4000 1 0 0 1 0 0
5000 0 1 0 0 1 1
Let minsupp = 50% and minconf = 50%. We have A ⇒ C (50%, 66.6%) and C ⇒ A (50%, 100%).
Boolean vs. quantitative associations (Based on types of values handled)
Find the frequent itemsets: the sets of items that have minimum support.
– A subset of a frequent itemset must also be a frequent itemset
• if {A, B} is a frequent itemset, both {A} and {B} must be frequent
– Iteratively find the frequent itemsets of size 1 to k (k-itemsets)
Use the frequent itemsets to generate association rules.
Ck will denote the candidate frequent k-itemsets.
Lk will denote the frequent k-itemsets.
How to Generate Candidates and Count Support
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck

Why is counting the supports of candidates a problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
Method:
– Candidate itemsets are stored in a hash-tree
– A leaf node of the hash-tree contains a list of itemsets & counts
– An interior node contains a hash table
– A subset function finds all candidates contained in a transaction
Example:
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc & abd
acde from acd and ace
Pruning:
acde removed because
ade is not in L3
C4={abcd}
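A compact sketch of the self-join plus prune step that reproduces this example (the function name and sorted-tuple encoding are illustrative, not from the notes):

```python
def apriori_gen(Lk_1):
    """Candidate generation: self-join Lk-1 with itself, then prune any
    candidate having an infrequent (k-1)-subset. Itemsets are sorted tuples."""
    prev = set(Lk_1)
    Ck = []
    for p in Lk_1:
        for q in Lk_1:
            # join p and q when they agree on their first k-2 items
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must be in Lk-1
                if all(c[:i] + c[i + 1:] in prev for i in range(len(c))):
                    Ck.append(c)
    return Ck

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))   # [('a','b','c','d')]; acde is pruned (ade not in L3)
```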
Spatial Data
Pixel – a point in a space
Band – a feature attribute of the pixels
Value – usually one byte (0~255)
Images have different numbers of bands.
RSI data can be viewed as a collection of pixels. Each has a value for each feature attribute.
[Figure: a TIFF image and a Yield Map.]
E.g., the RSI dataset above has 320 rows and 320 cols of pixels (102,400 pixels) and 4 feature attributes (B,G,R,Y). The (B,G,R) feature bands are in the TIFF image and the Y feature is color-coded in the Yield Map.
Existing formats:
– BSQ (Band Sequential)
– BIL (Band Interleaved by Line)
– BIP (Band Interleaved by Pixel)
Creating Peano Count Trees (PC-trees) from Relations
Take any "relation" or table, R(K1,..,Kk, A1, A2, …, An) (Ki structure, Ai feature attributes).
• E.g., the structure attributes of a 2-D image are its X-Y coordinates; the feature attributes are its bands (e.g., B, G, R).
• We create BSQ files from it by projection, Bi = R[Ai].
• We create bSQ files (one per bit position) from each of these BSQ files: Bi1, Bi2, …, Bin.
• We create a Peano tree, Pij, from each bSQ file, Bij.
Peano trees (P-trees): a P-tree represents bSQ, BSQ, or relational data in a recursive quadrant-by-quadrant, lossless, compressed, data-mining-ready format. P-trees come in many forms.
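A minimal sketch of the bSQ step, assuming 8-bit bands held as NumPy arrays (bsq_to_bits is an illustrative name, not from the notes):

```python
import numpy as np

def bsq_to_bits(band, nbits=8):
    """Slice one BSQ band (2-D uint8 array) into nbits bSQ bit planes;
    plane 0 holds the most significant bit of every pixel."""
    return [(band >> (nbits - 1 - j)) & 1 for j in range(nbits)]

band = np.array([[200, 13], [255, 0]], dtype=np.uint8)
planes = bsq_to_bits(band)
print(planes[0])   # MSB plane: [[1 0] [1 0]]
```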
How do we datamine heterogeneous datasets? I.e., R, S, T, .. describe the same entity class but have different keys/attributes.
– Universal Relation approach: transform them into one big relation (union the keys?) (e.g., a universal geneTbl).
– Key Fusion: R(K,…); S(K',…): mine them as separate relations but map the keys using a tautology.
The two methods are related in that the Universal Relation approach usually includes defining a universal key to which all local keys are mapped (using a (possibly fragmented) tautological lookup table):

K | K'
--|---
  |
  |
An example of a PC-tree
Peano or Z-ordering; Pure (Pure-1/Pure-0) quadrants; Root Count.
– d=2: Q[x1y1•..•xkyk] is a quadrant. Q[] = R; Q[x11x21..xd1 • x12…xd2 • .. • x1L..xdL] = a single tuple = a 1×..×1 polytant.
– This imposes a "d-space" structure on R (for RSI data, which already has such a structure, this step can be skipped).
Quadrant conditions: on each quadrant, Q, in R, define conditions C (Q → {T,F}) (level = k):

Q-COND   DESCRIPTION
pure1    true if C is true of all Q-tuples
pure0    true if C is false of all Q-tuples
mixed    true if C is true of some Q-tuples and false of some Q-tuples
p-count  true if C is true of exactly p Q-tuples (0 ≤ p ≤ card(Q) = 2^dk)

Every Ptree is a quadrant-condition Ptree on R. E.g., Pij, a basic Ptree, is Pcond where cond = (SR8-j(SLj-1(t.Ai))); P1i(v), for a value v ∈ Ai, is Pcond where cond = (t.Ai = v, t ∈ Q); NP0(a1..an) is Pcond where cond = (∀i: (∃t ∈ Q: t.Ai = ai)).
Notation: bSQ files, Pij(cond); BSQ files, Pi(cond); relations, P.
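A sketch of building a Peano count tree over a 2^k × 2^k bit array, recording a root count per quadrant and stopping at pure quadrants (the function name and (count, children) node representation are illustrative):

```python
def pc_tree(bits):
    """Peano count tree of a 2^k x 2^k 0/1 array (list of lists):
    each node is (root_count, children); children is None for pure quads."""
    n = len(bits)
    rc = sum(map(sum, bits))
    if rc == 0 or rc == n * n or n == 1:     # pure0, pure1, or a single bit
        return (rc, None)
    h = n // 2
    quads = [[row[c:c + h] for row in bits[r:r + h]]   # Z-order: NW, NE, SW, SE
             for r in (0, h) for c in (0, h)]
    return (rc, [pc_tree(q) for q in quads])

bits = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]]
root = pc_tree(bits)
print(root[0])   # root count 11; the NW quadrant (count 4) is pure-1
```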
Firmer Mathematical Foundation (HistoTrees)
Given R(K, A1, A2, A3), form the Ptrees for R.
Form the P-cube of all rcP(t), which forms the HistoRelation or HyperRelation, HR(A1, A2, A3, rcP(A1,A2,A3)) (the root counts (RC) form the feature attribute and the Ai's form the structure attributes).
From HR we will usually intervalize the RC (e.g., 4 intervals, [0,0], [1,8], [9,63], [64,∞), labelled 00, 01, 10, 11 respectively).
Form the HyperPtrees (HP-trees) by forming Ptrees over HR (1 feature attribute and, if we intervalize as above, 4 basic Ptrees).
– |HR| ≤ |R|, with equality iff (A1, A2, A3) is a candidate key for R.
– What is the relationship to the Haar wavelet low-pass tree?
[Figure: the P-cube of root counts for 2-bit attributes A1, A2, A3: a 4×4×4 cube with each axis labelled 00, 01, 10, 11 and one cell per tuple value, each holding rcP(a1,a2,a3) (rcP(0,0,0), rcP(1,0,0), …, rcP(3,3,3)), together with the count trees feeding those cells.]
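A minimal sketch of forming HR, assuming R is given as a list of feature-attribute tuples (histo_relation is an illustrative name, not from the notes):

```python
from collections import Counter

def histo_relation(R):
    """HR(A1, A2, A3, rc): root counts of the tuple P-trees, i.e., a
    histogram of R projected on its feature attributes."""
    return Counter(tuple(t) for t in R)

R = [(0, 0, 0), (0, 0, 0), (1, 2, 0), (3, 3, 0)]
hr = histo_relation(R)
print(hr[(0, 0, 0)])   # 2; |HR| <= |R|, equality iff (A1,A2,A3) is a key
```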
The P-tree Algebra (Complement, AND, OR, …)
Complement Tree = the Ptree of the bit-complement of the bSQ file (').
– We will use the "prime" notation.
– The PC-tree of a complement is formed by purity-complementing each count.
– The truth-tree of a complement is formed by bit-complementing only the leaves.
Tree Complement = the complement of the tree, i.e., each tree entry is complemented (").
– Not the same as the Ptree of a complement!
– We will use "double prime" notation.
Pure1-quad-list method: for each operand, list the qids of its pure1 quads in depth-first order. Do one multi-cursor scan across the operand lists; for every pure1 quad common to all operands, install it in the result.
Can use either Pure1 (and its complement, P0) or EPM (and its complement, EPM').
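A sketch of the Pure1-quad-list AND for two operands, assuming qids are tuples of quadrant numbers listed in depth-first order (so a prefix denotes an enclosing quad); all names are illustrative:

```python
def and_pure1_lists(a, b):
    """AND two P-trees given as depth-first lists of pure-1 qids.
    A quad is pure-1 in the result iff it lies inside a pure-1 quad
    of both operands, so we keep the longer qid of each matching pair."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        p, q = a[i], b[j]
        if p == q:
            out.append(p); i += 1; j += 1
        elif p == q[:len(p)]:      # p strictly contains q
            out.append(q); j += 1
        elif q == p[:len(q)]:      # q strictly contains p
            out.append(p); i += 1
        elif p < q:                # disjoint: advance the earlier cursor
            i += 1
        else:
            j += 1
    return out

# X pure-1 on all of quad 0 and on quad 3.2; Y pure-1 on quad 0.1 and quad 3.
print(and_pure1_lists([(0,), (3, 2)], [(0, 1), (3,)]))  # [(0, 1), (3, 2)]
```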
Each bSQ file, Bij, generates a Basic Ptree, Pij.
Each value, v, in a BSQ file, Bi, generates a Value Ptree, Pi(v). Each tuple (v1,..,vn) in a relation, R, generates a Tuple Ptree, P(v1,..,vn). Any condition on the attributes of R generates a Condition Ptree:
– An interval [l,u] in a numeric attribute, Bi, generates a condition v ∈ [l,u], which generates an Interval Ptree, Pi([l,u]).
– A rectangle or box, ∏[li,ui], generates a Rectangle Ptree or Hyperbox Ptree. (Set containment is a common condition for defining Condition Ptrees.)
(Each Ptree can be expressed as PC or P1, P0, PN1, PNP0, ..)
Basic, Value, Tuple Ptrees, ...
Value Ptree (1 if the quad contains only that value (pure), else 0): P1(001) = P11' ∧ P12' ∧ P13 = NP011'' ∧ NP012'' ∧ P13
Tuple Ptree (1 if the quad contains only that tuple, else 0): P(001, 010, 111) = P1(001) ∧ P2(010) ∧ P3(111)
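A sketch of forming a value mask this way, using flat bit-plane arrays in place of trees (value_mask is an illustrative name; a real implementation would AND the compressed P-trees):

```python
import numpy as np

def value_mask(planes, v, nbits=3):
    """Mask for t.Ai == v from bSQ planes: AND each plane or its
    complement according to the bits of v, e.g. v=001 -> P1' ^ P2' ^ P3."""
    m = np.ones_like(planes[0])
    for j in range(nbits):
        bit = (v >> (nbits - 1 - j)) & 1
        m &= planes[j] if bit else (1 - planes[j])
    return m

band = np.array([[1, 5], [1, 7]])
planes = [(band >> (2 - j)) & 1 for j in range(3)]
print(value_mask(planes, 0b001))   # [[1 0] [1 0]]: the two cells equal to 1
```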
For P(p) = P(100- ----, …, 011- ----), at each [..]:
1. Swap and take the bit-complement of each [..]NP0V / [..]P1V pair corresponding to 0-bits.
2. AND the resulting vector pairs. Result: [..]NP0V(p), [..]P1V(p).
3. To get PMV(p) for the next level, XOR the two vectors.
ANDing in the NP0V-P1V Vector-Pair Format
For P(p) = P(110- ----, …, ---- ----) (the previous example, P1(6) at qid []), at each [..]:
1. Swap and complement each [..]NP0V / [..]P1V pair corresponding to 0-bits (result denoted with *).
2. AND the resulting vector pairs. Result: [..]NP0V(p), [..]P1V(p).
3. To get PMV(p) for the next level, XOR the two vectors to get [..]PMV(p).
Alternatively, send to Node-ij if the qid starts with the qid segment ij. Is this better? How would the AND code be revised? AND performance?
OR: send to Node-ij if the largest qid segment index divisible by p is ij. E.g., if p=4: [0]->0; [0.3]->0; [0.3.2]->0; [0.3.2.2]->2; [0.3.2.2.3]->2; [0.3.2.2.3.1]->2; [0.3.2.2.3.1.0]->2; [0.3.2.2.3.1.0.1]->1; etc. Similar to fanout 4. Implement by multicasting externally only every 4th segment. More generally, choose any increasing sequence, p=(p1..pL), define floor_p(x) = max{pi ≤ x}, then multicast [s1.s2…sk] --> Node floor_p(k).
Alternatively, the sequence can be a tree in the most general setting (i.e., a different sequence can be used on different branches, tuned to the very best tree of "multicast delays"): define a function F: {set of qids} --> {0,1,...} where if F([q1.q2...qn]) = p > 0 then F([q1.q2...qn-1]) = p-1, and if F([q1.q2...qn]) = 0 then there is a multicast at this level. Said another way, there is a "multicast tree" that tells you when to multicast (to the node corresponding to the last segment of the qid).
Each node knows whether it is supposed to make a distribution call for the next level or to compute that level itself (multicast to itself) by consulting the tree (or we could attach that info when we stripe).
In this way we have full flexibility to tune the multicast-compute balance to minimize execution time, on a "per P-tree" basis.
Data Mining in Genomics
• There is (will be?) an explosion of gene expression data.
• Current emphasis is on extracting meaningful information from huge raw data sets.
• Methods employed are Clustering and Classification.
• A consistent data store and the use of P-trees facilitate Assoc Rule Mining as well as Clustering/Classification to extract information from raw data on demand.
• The approach involves treating microarray data as spatial data.
A gene regulatory pathway (network) can be represented as a sequence (graph) of rules {G1..Gn} ⇒ Gm, where {G1..Gn} is the antecedent of the association rule and Gm is the consequent.
Microarray data is most often represented as a relation G(Gid, T1, T2, …, Tn), where Gid is the gene identifier, T1…Tn are the various treatments (or conditions), and the data values are gene expression levels. We will call this the "Gene Table".
Currently, data-mining techniques concentrate on the Gene Table, G(Gid, T1, T2, …, Tn) - specifically, on finding clusters of genes that exhibit similar expression patterns under selected treatments (clustering the gene table).
Gene Table

Gene-ID  T1  T2  T3  T4
G1       …   …   …   …
G2       …   …   …   …
G3       …   …   …   …
G4       …   …   …   …
Using the Universal Relation approach to mining across different microarray datasets, one can use a consistent Gene-id. Each microarray will be embedded in a subquadrant. Therefore the data will be sparse and can be handled by Progeny Vector Tables (PVTs), in which the prefix of the subquadrant need be listed only once:
[Figure: Progeny Vector Tables striped across Node-00, Node-01, Node-10, Node-11, and a coordinating Node-C; each table lists rows (Bp, qid, NP0V, P1V), e.g., at Node-C the row (11, [], 0111, 1001), and at Node-10 rows for qids [10] and [10.10].]
Ends the possibility of a larger pure1 quad. All can be installed in the parent/grandparent as a 1-bit; 10.10 can be installed.
Ends quad-11. All can be installed in the parent as a 1-bit.
Bottom-up bottom-line: Since it is better to use 2-D than 3-D (higher compression), it should be better to use 1-D than 2-D? This should be investigated.
Bp  qid   NP0V  P1V
11  [00]  1111  1111
12  [00]  1111  1111
13  [00]  0000  0000

At 00:
Bp  qid   NP0V  P1V
23  [00]  0010  0010

At 01:
Bp  qid      vector
21  [00.01]  1110
22  [00.01]  0001
P-ARM Algorithm
• The P-ARM algorithm assumes a fixed value precision in all bands.
• The p-gen function for numeric spatial data differs from apriori-gen by using additional pruning techniques.
In the p-gen function, even if both B3[0,64) and B3[64,127) are frequent 1-itemsets, they are not joined into a candidate 2-itemset (two intervals from the same band can never co-occur).
• The AND_rootcount function is used to calculate itemset counts directly by ANDing the appropriate basic Ptrees instead of scanning the transaction databases.
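A hedged sketch of AND_rootcount, using flat boolean masks to stand in for the interval P-trees (the names interval_mask/and_rootcount are illustrative; a real implementation ANDs compressed trees, not raw arrays):

```python
import numpy as np

def interval_mask(band, lo, hi):
    """Flat stand-in for the interval P-tree of band-value in [lo, hi]."""
    return (band >= lo) & (band <= hi)

def and_rootcount(masks):
    """Itemset support count = root count of the AND of its interval
    masks, with no scan of the transaction table."""
    out = masks[0]
    for m in masks[1:]:
        out = out & m
    return int(np.count_nonzero(out))

band_G = np.array([200, 40, 210, 255]); band_R = np.array([10, 10, 100, 30])
print(and_rootcount([interval_mask(band_G, 192, 255),
                     interval_mask(band_R, 0, 63)]))   # 2
```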
• 1320×1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).
• 2-bits precision
• Equi-length partition
[Chart: run time (sec., 0-800) vs. support threshold (10%-90%) for P-ARM and Apriori.]
Compare with Apriori (the classical method) and FP-growth (recently proposed). For fairness, find all frequent itemsets, not just those containing Yield. The images are actual aerial TIFF images with synchronized yield maps.
Scalability with number of transactions
[Chart: time (sec., 0-1200) vs. number of transactions (100K-1700K) for Apriori and P-ARM.]
Identical results. P-ARM is more scalable for lower support thresholds, and the P-ARM algorithm is more scalable to large spatial datasets.
P-ARM versus FP-growth
Scalability with support threshold
[Chart: run time (sec., 0-800) vs. support threshold (10%-90%) for P-ARM and FP-growth.]
17,424,000 pixels (transactions)
Scalability with number of transactions
[Chart: time (sec., 0-1200) vs. number of transactions (100K-1700K) for FP-growth and P-ARM.]
FP-growth is an efficient, tree-based frequent pattern mining method (details later). Identical results. For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size, P-ARM achieves better performance. P-ARM also achieves better performance in the case of low support thresholds.
High Confidence Rules
Application areas on spatial data:
– Yield identification
– Identification of agricultural pest infestations
Traditional algorithms are not suitable:
– Too many frequent itemsets in the case of a low support threshold
Our approach: P-tree and P-cube, supporting low support thresholds.
– To eliminate rules that result from noise and outliers: require high confidence.
Eliminate redundant rules:
– Rank based on confidence and rule-size
– Generalization relation between rules:
• r generalizes r' if they have the same consequent and the antecedent of r is properly contained in the antecedent of r'
Confident Rule Mining Algorithm
Build the set of confident rules, C (initially empty), as follows:
– Start with 1-bit values and 2 bands;
– then 1-bit values and 3 bands; …
– then 2-bit values and 2 bands;
– then 2-bit values and 3 bands; …
– . . .
– At each stage defined above, do the following:
• Find all confident rules by rolling up the T-cube along each potential consequent set using summation.
• Compare these sums with the support threshold to isolate rule support sets with the minimum support.
• Compare the normalized T-cube values (divide by the rolled-up sum) with the minimum confidence level to isolate the confident rules.
• Place any new confident rule in C, but only if its rank is higher than that of any of its generalizations already in C.
Example (1-bit values, bands B1 and B2):

T-cube counts   B1=0  B1=1   row sum   80% threshold
B2=0             25    15      40         32
B2=1              5    19      24         19.2
column sums      30    34
80% thresholds   24    27.2
Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2:
C: B1={0} ⇒ B2={0}, c = 25/30 = 83.3%
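A small sketch reproducing this example: roll up the T-cube along the consequent band and normalize (the array layout matches the table above; names are illustrative):

```python
import numpy as np

# T-cube counts from the example: rows indexed by B2 value, cols by B1 value.
T = np.array([[25, 15],    # B2 = 0
              [ 5, 19]])   # B2 = 1

col_sums = T.sum(axis=0)   # roll up along consequent B2: [30, 34]
conf = T / col_sums        # normalized T-cube values (per-column confidence)
print(conf[0, 0])          # 0.833... -> B1={0} => B2={0} is confident (>= 0.8)
```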
Methods to Improve Apriori's Efficiency
– Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
– Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
– Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
– Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine the completeness.
– Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
The core of the Apriori algorithm:
– Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation.