-
Chapter 6: Association Rule Mining
Data Mining
Semester 1, 2009-2010
Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
-
Contents
6.1. Overview of association rule mining
6.2. Representation of association rules
6.3. Mining frequent patterns
6.4. Mining association rules from frequent patterns
6.5. Constraint-based association rule mining
6.6. Correlation analysis
6.7. Summary
-
References
[1] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
[2] David Hand, Heikki Mannila, Padhraic Smyth, Principles of Data Mining, MIT Press, 2001.
[3] David L. Olson, Dursun Delen, Advanced Data Mining Techniques, Springer-Verlag, 2008.
[4] Graham J. Williams, Simeon J. Simoff, Data Mining: Theory, Methodology, Techniques, and Applications, Springer-Verlag, 2006.
[5] ZhaoHui Tang, Jamie MacLennan, Data Mining with SQL Server 2005, Wiley Publishing, 2005.
[6] Oracle, Data Mining Concepts, B28129-01, 2008.
[7] Oracle, Data Mining Application Developer's Guide, B28131-01, 2008.
-
6.0. Scenario 1: Market basket analysis
-
6.0. Scenario 2: Cross-marketing
-
6.0. Scenarios
  Basket data analysis
  Cross-marketing
  Catalog design
  Classification and clustering with frequent patterns
-
6.1. Overview of association rule mining
  The association rule mining process
  Basic concepts
  Classification of association rules
-
6.1. Overview of association rule mining
The association rule mining process
  [Figure: Raw Data => (Pre-processing) => Items of Interest => (Mining) => Relationships among Items (Rules) => (Post-processing) => User]
-
6.1. Overview of association rule mining
The association rule mining process: the market basket analysis problem
  [Figure: Raw Data (Transactional/Relational Data) => (Pre-processing) => Items of Interest (Items) => (Mining) => Relationships among Items (Association Rules) => (Post-processing) => User]

  Transaction  Items_bought
  -----------  ------------
  2000         A, B, C
  1000         A, C
  4000         A, D
  5000         B, E, F

  Items: A, B, C, D, F, ...
  Association rule: A => C (support 50%, confidence 66.6%)
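
The figures quoted for A => C can be checked directly from the four transactions. The following is a minimal sketch; the function names `support` and `confidence` are chosen here for illustration and do not come from the slides.

```python
# Transactions from the table above (transaction id -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(antecedent, consequent, db):
    """support(A U B) / support(A), an estimate of P(B | A)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

sup_ac = support({"A", "C"}, transactions)
conf_ac = confidence({"A"}, {"C"}, transactions)
print(f"A => C: support = {sup_ac:.1%}, confidence = {conf_ac:.1%}")
```

Two of the four transactions contain both A and C, and two of the three transactions containing A also contain C, which gives the 50% support and the 2/3 confidence that the slide shows as 66.6%.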
-
6.1. Overview of association rule mining
Sample data from AllElectronics (after preprocessing)
-
6.1. Overview of association rule mining
Basic concepts
  Item
  Itemset
  Transaction
  Association and association rule
  Support
  Confidence
  Frequent itemset
  Strong association rule
-
6.1. Overview of association rule mining
Sample data from AllElectronics (after preprocessing)
  Item: I4
  Itemsets: {I1, I2, I5}, {I2}, ...
  Transaction: T800
-
6.1. Overview of association rule mining
Basic concepts
  Item
    The elements, patterns, or objects of interest.
    J = {I1, I2, ..., Im}: the set of all m items that can occur in the dataset.
  Itemset
    A set of items.
    An itemset containing k items is called a k-itemset.
  Transaction
    One interaction with the system (for example, a customer purchase).
    Each transaction is associated with a set T of the items involved in it.
-
6.1. Overview of association rule mining
Basic concepts
  Association and association rule
    Association: items that occur together in one or more transactions.
      Expresses a relationship between items or itemsets.
    Association rule: a conditional rule relating itemsets.
      Expresses a conditional relationship between itemsets.
      Given itemsets A and B, the association rule between A and B is written A => B: B occurs given that A occurs.
-
6.1. Overview of association rule mining
Basic concepts
  Support
    A measure of how frequently an item or itemset occurs.
    Minimum support threshold: the smallest acceptable support value, specified by the user.
  Confidence
    A measure of how frequently one itemset occurs given that another itemset occurs.
    Minimum confidence threshold: the smallest acceptable confidence value, specified by the user.
-
6.1. Overview of association rule mining
Basic concepts
  Frequent itemset
    An itemset whose support satisfies the minimum support threshold.
    For an itemset A: A is a frequent itemset iff support(A) >= minimum support threshold.
  Strong association rule
    An association rule whose support and confidence satisfy the minimum support threshold and the minimum confidence threshold.
    For a rule A => B between itemsets A and B: A => B is a strong association rule iff support(A => B) >= minimum support threshold and confidence(A => B) >= minimum confidence threshold.
-
6.1. Overview of association rule mining
Classification of association rules
  Boolean association rule / quantitative association rule
  Single-dimensional association rule / multidimensional association rule
  Single-level association rule / multilevel association rule
  Association rule / correlation rule
-
6.1. Overview of association rule mining
Classification of association rules
Boolean association rule / quantitative association rule
  Boolean association rule: describes associations between the presence or absence of items.
    Computer => Financial_management_software [support=2%, confidence=60%]
  Quantitative association rule: describes associations between quantitative items or attributes.
    Age(X, 30..39) ^ Income(X, 42K..48K) => buys(X, high resolution TV)
-
6.1. Overview of association rule mining
Classification of association rules
Single-dimensional association rule / multidimensional association rule
  Single-dimensional association rule: involves items or attributes of a single data dimension.
    Buys(X, computer) => Buys(X, financial_management_software)
  Multidimensional association rule: involves items or attributes of more than one dimension.
    Age(X, 30..39) => Buys(X, computer)
-
6.1. Overview of association rule mining
Classification of association rules
Single-level association rule / multilevel association rule
  Single-level association rule: involves items or attributes at a single level of abstraction.
    Age(X, 30..39) => Buys(X, computer)
    Age(X, 18..29) => Buys(X, camera)
  Multilevel association rule: involves items or attributes at different levels of abstraction.
    Age(X, 30..39) => Buys(X, laptop computer)
    Age(X, 30..39) => Buys(X, computer)
-
6.1. Overview of association rule mining
Classification of association rules
Association rule / correlation rule
  Association rule: strong association rules A => B (rules that meet the minimum support threshold and the minimum confidence threshold).
  Correlation rule: strong association rules A => B that also meet a requirement on the statistical correlation between A and B.
-
6.2. Representation of association rules
Rule form: A => B [support, confidence]
  Given a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf)
  A and B are itemsets
    Frequent itemsets/subsequences/substructures
    Closed frequent itemsets
    Maximal frequent itemsets
    Constrained frequent itemsets
    Approximate frequent itemsets
    Top-k frequent itemsets
-
6.2. Representation of association rules
Frequent itemsets/subsequences/substructures
  An itemset/subsequence/substructure X is frequent if support(X) >= min_sup.
  Itemsets: sets of items
  Subsequences: ordered sequences of events/items
  Substructures: structural forms (graph, lattice, tree, sequence, set, ...)
-
6.2. Representation of association rules
Closed frequent itemsets
  An itemset X is closed in J if no proper superset Y of X in J has the same support as X.
    X ⊆ J, X is closed iff for all Y ⊆ J with X ⊂ Y: support(Y) ≠ support(X).
  X is a closed frequent itemset in J if X is both a frequent itemset and closed in J.
Maximal frequent itemsets
  A frequent itemset X is a maximal frequent itemset in J if no proper superset Y of X in J is a frequent itemset.
    X ⊆ J, X is a maximal frequent itemset iff X is frequent and for all Y ⊆ J with X ⊂ Y: Y is not a frequent itemset.
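
The two definitions above can be compared on a small example. This is a brute-force sketch over a hypothetical four-transaction database (the data, `min_count`, and all names are assumptions made for illustration); it enumerates every itemset, so it is only suitable for tiny inputs.

```python
from itertools import combinations

# Hypothetical toy database and threshold, chosen only for illustration.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
min_count = 2  # minimum support count

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in db)

items = sorted(set().union(*db))
# All frequent itemsets with their support counts (exhaustive enumeration).
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = count(frozenset(combo))
        if c >= min_count:
            frequent[frozenset(combo)] = c

# Closed: no proper superset has the same support. A superset with the same
# support would itself be frequent, so checking `frequent` is sufficient.
closed = {x for x, s in frequent.items()
          if not any(x < y and s == sy for y, sy in frequent.items())}

# Maximal: no proper superset is frequent at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(sorted(sorted(x) for x in closed))
print(sorted(sorted(x) for x in maximal))
```

Every maximal frequent itemset is also closed, but not the other way round: here {a} and {a, b} are closed (their supersets have strictly lower support) without being maximal.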
-
6.2. Representation of association rules
Constrained frequent itemsets
  Frequent itemsets that satisfy user-defined constraints.
Approximate frequent itemsets
  Frequent itemsets from which the (approximate) support of the frequent itemsets to be mined can be derived.
Top-k frequent itemsets
  The k most frequent itemsets, for a user-specified k.
-
6.2. Representation of association rules
Boolean, single-level, single-dimensional association rules between frequent itemsets: A => B [support, confidence]
  A and B are frequent itemsets
    single-dimensional
    single-level
    Boolean
  Support(A => B) = Support(A U B) >= min_sup
  Confidence(A => B) = Support(A U B) / Support(A) = P(B|A) >= min_conf
-
6.3. Mining frequent patterns
  Apriori algorithm: mining frequent patterns with candidate sets
    R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In VLDB 1994, pp. 487-499.
  FP-Growth algorithm: mining frequent patterns with an FP-tree
    J. Han, J. Pei, Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD 2000, pp. 1-12.
-
6.3. Mining frequent patterns
Apriori algorithm
  Uses prior knowledge about the characteristics of frequent itemsets.
  Iterative approach that searches for frequent itemsets one level at a time (level-wise search).
    (k+1)-itemsets are generated from k-itemsets.
    At each level of the search, the entire dataset is scanned.
  The Apriori property reduces the search space:
    All nonempty subsets of a frequent itemset must also be frequent.
    Proof?
    Antimonotone: if a set cannot pass a test, all of its supersets will fail the same test as well.
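
The level-wise search with Apriori pruning can be sketched as follows. This is a compact illustration rather than the optimized procedure of Agrawal and Srikant: the join step simply unites pairs of frequent k-itemsets, and the function name `apriori` and the use of a support count threshold `min_count` are choices made here. The data is the four-transaction example from Section 6.1.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: one scan counts the single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # One scan of the data counts the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent_itemsets = apriori(transactions, min_count=2)
for itemset, n in sorted(frequent_itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
```

The prune step is where the Apriori property pays off: a candidate with any infrequent k-subset is discarded before the data is scanned, so it is never counted.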
-
6.3. Mining frequent patterns
Apriori algorithm
-
6.3. Mining frequent patterns
Sample data from AllElectronics (after preprocessing)
-
6.3. Mining frequent patterns
  min_sup = 2/9
  minimum support count = 2
-
6.3. Mining frequent patterns
Apriori algorithm: characteristics
  Generates many candidate sets
    10^4 frequent 1-itemsets lead to more than 10^7 (10^4(10^4-1)/2) candidate 2-itemsets.
    A frequent k-itemset requires at least 2^k - 1 candidate itemsets to be generated beforehand.
  Scans the dataset many times
    The cost grows as the size of the itemsets increases.
    If k-itemsets are discovered, the dataset must be scanned k+1 times.
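
The candidate counts quoted above follow from simple combinatorics. A small sketch (the function names, and the pattern length 100 used in the last line, are illustrative assumptions):

```python
from math import comb

def candidate_pairs(n_frequent_items):
    """Number of candidate 2-itemsets joined from n frequent 1-itemsets."""
    return comb(n_frequent_items, 2)  # n * (n - 1) / 2

def nonempty_subsets(k):
    """Number of nonempty subsets of a k-itemset; all of them must be frequent."""
    return 2**k - 1

pairs = candidate_pairs(10**4)
print(pairs, pairs > 10**7)
print(nonempty_subsets(100))
```

With 10^4 frequent items the join produces 49,995,000 candidate pairs, and the number of subsets grows exponentially with pattern length, which is the main cost that FP-Growth is designed to avoid.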
-
6.3. Mining frequent patterns
Apriori algorithm: improvements
  Hash-based technique
    A k-itemset whose hashing bucket count is below the minimum support threshold cannot be a frequent itemset.
  Transaction reduction
    A transaction that contains no frequent k-itemset does not need to be scanned in later passes (for (k+1)-itemsets).
  Partitioning
    An itemset must be frequent in at least one partition in order to be frequent in the whole dataset.
  Sampling
    Mine only a sample of the given data, using a lower support threshold; a method for checking completeness is required.
  Dynamic itemset counting
    Add candidate itemsets only when all of their subsets are estimated to be frequent.
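
The hash-based technique can be illustrated for 2-itemsets. In this sketch, pairs are hashed into buckets during the scan that counts single items; a pair whose bucket count is below the threshold is dropped before the counting scan. The hash function, the table size of 7, and the variable names are assumptions for illustration, not a prescribed design.

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_count = 2
n_buckets = 7  # assumed hash table size

# Position of each item in sorted order, used by the toy hash function.
order = {item: i for i, item in enumerate(sorted(set().union(*transactions)), start=1)}

def bucket(x, y):
    """Deterministic toy hash for the 2-itemset {x, y} with x before y."""
    return (order[x] * 10 + order[y]) % n_buckets

bucket_counts = [0] * n_buckets
item_counts = {}
for t in transactions:  # single scan: count items and hash every pair
    for item in t:
        item_counts[item] = item_counts.get(item, 0) + 1
    for x, y in combinations(sorted(t), 2):
        bucket_counts[bucket(x, y)] += 1

frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
all_pairs = list(combinations(frequent_items, 2))
# A pair can only be frequent if its whole bucket reaches min_count.
kept_pairs = [p for p in all_pairs if bucket_counts[bucket(*p)] >= min_count]
print(all_pairs, kept_pairs)
```

The filter is safe but not exact: a bucket count is an upper bound on the support of every pair hashed into it, so no frequent pair is lost, while collisions can let an infrequent pair through to the counting scan.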
-
6.3. Mining frequent patterns
FP-Growth algorithm
  Compresses the dataset into a tree structure (Frequent Pattern tree, FP-tree)
    Reduces the cost of scanning the whole dataset during mining.
      Infrequent items are discarded early.
      The mining results are guaranteed not to be affected.
  Divide-and-conquer approach
    The mining process is divided into smaller tasks.
      1. Build the FP-tree
      2. Mine frequent itemsets with the FP-tree
  Avoids generating candidate sets
    Each pass examines only part of the dataset.
-
6.3. Mining frequent patterns
FP-Growth algorithm
1. Build the FP-tree
  1.1. Scan the dataset and find the frequent 1-itemsets.
  1.2. Sort the frequent 1-itemsets in descending order of support count (frequency).
  1.3. Scan the dataset again and build the FP-tree.
    Create the root of the FP-tree, labeled null {}.
    Each transaction corresponds to one branch of the FP-tree.
    Each node on a branch corresponds to one item of the transaction.
      The items of a transaction are sorted in descending order of support count.
      Each node carries the support count of its item.
    Transactions that share items form branches with a common prefix.
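
Steps 1.1 to 1.3 can be sketched as below. The sample table did not survive in this text, so the nine transactions are taken from the AllElectronics example in Han and Kamber [1]; treat that, the tie-breaking by item name, and the `Node` class as assumptions of this sketch.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Step 1.1: first scan, find the frequent 1-itemsets.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Step 1.2: descending support count; ties broken by item name (assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    # Step 1.3: second scan, insert each transaction as a path from the root.
    root = Node(None, None)  # the null {} root
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1  # shared prefixes accumulate counts
    return root, ordered

# AllElectronics transactions T100..T900 as given in [1] (assumed here).
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
root, ordered = build_fp_tree(db, min_count=2)

def show(node, depth=0):
    """Print the tree with one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)
```

Because items are inserted in a fixed global order, transactions that share their most frequent items share a path, which is what makes the tree smaller than the original data.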
-
6.3. Mining frequent patterns
FP-Growth algorithm
-
6.3. Mining frequent patterns
-
6.3. Mining frequent patterns
FP-Growth algorithm
2. Mine frequent itemsets with the FP-tree
  2.1. Create the conditional pattern base for each node of the FP-tree.
    Collect the prefix paths of the node together with their frequencies.
  2.2. Create a conditional FP-tree from each conditional pattern base.
    Accumulate the frequency of each item in each base.
    Build the conditional FP-tree from the frequent items of the base.
  2.3. Mine the conditional FP-tree and grow the frequent itemsets recursively.
    If the conditional FP-tree consists of a single path, enumerate all the itemsets.
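
The mining steps 2.1 to 2.3 can be sketched recursively. This is a simplified illustration: tree nodes are plain dictionaries, the single-path shortcut of step 2.3 is omitted (plain recursion produces the same itemsets), and the nine transactions are the AllElectronics example from Han and Kamber [1], assumed here because the table is not reproduced in this text.

```python
from collections import Counter, defaultdict

def build_tree(weighted_paths, min_count):
    """Build an FP-tree from (items, count) pairs; return (header, frequent)."""
    counts = Counter()
    for items, n in weighted_paths:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    root = {"item": None, "parent": None, "count": 0, "children": {}}
    header = defaultdict(list)  # item -> every tree node carrying that item
    for items, n in weighted_paths:
        node = root
        for item in sorted((i for i in set(items) if i in rank), key=rank.get):
            child = node["children"].get(item)
            if child is None:
                child = {"item": item, "parent": node, "count": 0, "children": {}}
                node["children"][item] = child
                header[item].append(child)
            child["count"] += n
            node = child
    return header, frequent

def fp_growth(weighted_paths, min_count, suffix=frozenset()):
    """Yield (frequent_itemset, support_count) pairs."""
    header, frequent = build_tree(weighted_paths, min_count)
    for item, count in frequent.items():
        pattern = suffix | {item}
        yield pattern, count
        # Step 2.1: conditional pattern base = prefix paths of the item's nodes.
        base = []
        for node in header[item]:
            path, parent = [], node["parent"]
            while parent["item"] is not None:
                path.append(parent["item"])
                parent = parent["parent"]
            if path:
                base.append((path, node["count"]))
        # Steps 2.2 and 2.3: build the conditional FP-tree and recurse on it.
        yield from fp_growth(base, min_count, pattern)

db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
found = list(fp_growth([(t, 1) for t in db], min_count=2))
patterns = dict(found)
for itemset, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

No candidate itemsets are generated or tested here: each recursive call only looks at the prefix paths of one item, which is the divide-and-conquer step described above.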
-
6.3. Mining frequent patterns
FP-Growth algorithm: characteristics
  Does not generate candidate itemsets.
  Does not test whether candidate itemsets are actually frequent itemsets.
  Uses a data structure that compresses the dataset.
  Reduces the cost of scanning the dataset.
  The main cost lies in counting and in building the initial FP-tree.
  Efficient and scalable for mining both long and short frequent itemsets.
6.3. Mining frequent patterns
Comparison of the Apriori and FP-Growth algorithms
  Scalability with the support threshold
-
6.3. Mining frequent patterns
Comparison of the Apriori and FP-Growth algorithms
  Linear scalability with the number of transactions
-
6.4. Mining association rules from frequent patterns
Strong association rules A => B
  Support(A => B) = Support(A U B) >= min_sup
  Confidence(A => B) = Support(A U B) / Support(A) = P(B|A) >= min_conf
Expressed with support counts (min_sup given as a count):
  Support(A => B) = Support_count(A U B) >= min_sup
  Confidence(A => B) = P(B|A) = Support_count(A U B) / Support_count(A) >= min_conf
-
6.4. Mining association rules from frequent patterns
Generating strong association rules from the set of frequent itemsets
  For each frequent itemset l, generate the nonempty subsets of l.
    Support_count(l) >= min_sup
  For each nonempty proper subset s of l, output the rule s => (l - s) if Support_count(l) / Support_count(s) >= min_conf.
-
6.4. Mining association rules from frequent patterns
Min_conf = 50%
  I1 ^ I2 => I5
  I1 ^ I5 => I2
  I2 ^ I5 => I1
  I5 => I1 ^ I2
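
The four rules above can be reproduced from the frequent itemset {I1, I2, I5}. This sketch assumes the nine AllElectronics transactions as given in Han and Kamber [1], since the table is not reproduced in this text; `rules_from` is a name chosen for illustration.

```python
from itertools import combinations

# AllElectronics transactions T100..T900 as given in [1] (assumed here).
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def count(itemset):
    """Support count of an itemset in db."""
    return sum(set(itemset) <= t for t in db)

def rules_from(l, min_conf):
    """Rules s => (l - s) for nonempty proper subsets s of the frequent itemset l."""
    l = frozenset(l)
    rules = []
    for k in range(1, len(l)):
        for s in combinations(sorted(l), k):
            s = frozenset(s)
            conf = count(l) / count(s)  # Support_count(l) / Support_count(s)
            if conf >= min_conf:
                rules.append((sorted(s), sorted(l - s), conf))
    return rules

rules = rules_from({"I1", "I2", "I5"}, min_conf=0.5)
for antecedent, consequent, conf in rules:
    print(" ^ ".join(antecedent), "=>", " ^ ".join(consequent), f"({conf:.0%})")
```

The two remaining candidate rules, I1 => I2 ^ I5 and I2 => I1 ^ I5, have confidence 2/6 and 2/7 on this data and fall below the 50% threshold.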
-
6.5. Constraint-based association rule mining
Constraints
  Guide the mining of patterns and rules.
  Limit the search space during mining.
Kinds of constraints
  Knowledge type constraints
  Data constraints
  Level/dimension constraints
  Interestingness constraints
  Rule constraints
-
6.5. Constraint-based association rule mining
  Knowledge type constraints
    Association rules/correlation rules
  Data constraints
    Task-relevant data (association rule mining)
  Level/dimension constraints
    The dimensions (attributes) of the data or the levels of abstraction/concept
  Interestingness constraints
    Thresholds on the measures
  Rule constraints
    The form of the rules to be mined
-
6.5. Constraint-based association rule mining
Constraint-based rule mining
  Makes the data mining process better and more efficient (more effective and efficient).
    Rules are mined according to the user's requirements (constraints): more effective.
    An optimizer can be used to exploit the user's constraints: more efficient.
-
6.5. Constraint-based association rule mining
Mining rules based on rule constraints
  Rule form (meta-rule guided mining)
    Metarules: specify the (syntactic) form of the rules to be mined.
  Rule content
    Constraints among the variables in A and/or B of the rule A => B
      Set/subset relationships
      Domain
      Aggregate functions
-
6.5. Constraint-based association rule mining
Metarules
  Specify the (syntactic) form of the rules to be mined.
  Based on the experience, expectations, and intuition of the data analyst.
  Form a hypothesis about the relationships in the rules the user is interested in.
  Association rule mining combined with a search for rules that match the given metarules.
-
6.5. Constraint-based association rule mining
Metarules
  Rule template: P1 ^ P2 ^ ... ^ Pl => Q1 ^ Q2 ^ ... ^ Qr
    P1, P2, ..., Pl, Q1, Q2, ..., Qr: instantiated predicates or predicate variables
    Usually involve several dimensions/attributes
  Example of a metarule
    Metarule: P1(X, Y) ^ P2(X, W) => buys(X, office software)
    Rule matching the metarule: age(X, 30..39) ^ income(X, 41k..60k) => buys(X, office software)
-
6.5. Constraint-based association rule mining
Constraints among the variables S1, S2, ... in A and/or B of the rule A => B
  Set/subset relationships: S1 ⊂ S2 / S1 ⊄ S2
  Domain
    S1 θ value, θ ∈ {=, ≠, <, <=, >, >=}
    value ∈ S1 / value ∉ S1
    ValueSet θ S1 or S1 θ ValueSet, θ ∈ {=, ≠, ⊂, ⊄, ⊆}
  Aggregate functions
    Agg(S1) θ value, Agg() ∈ {min, max, sum, count, avg}, θ ∈ {=, ≠, <, <=, >, >=}
-
6.5. Constraint-based association rule mining
Properties of constraints
  Anti-monotone
  Monotone
  Succinctness
  Convertible
-
6.5. Constraint-based association rule mining
Properties of constraints: Anti-monotone
  A constraint C_a is anti-monotone iff for any pattern S not satisfying C_a, none of the super-patterns of S can satisfy C_a.
  Example: sum(S.Price) <= value
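
An anti-monotone constraint such as sum(S.Price) <= value can be pushed into a level-wise search: an itemset that violates it is never extended. The prices, the budget, and the names below are hypothetical, and the sketch assumes non-negative prices (otherwise the constraint would not be anti-monotone).

```python
# Hypothetical item prices and budget, for illustration only.
price = {"pen": 2, "book": 15, "lamp": 30, "chair": 60}
max_total = 50  # constraint C_a: sum(S.price) <= max_total

def satisfies(itemset):
    """True if the itemset meets the anti-monotone constraint C_a."""
    return sum(price[i] for i in itemset) <= max_total

def grow(items):
    """Level-wise enumeration that never extends an itemset violating C_a."""
    level = [frozenset([i]) for i in items if satisfies([i])]
    valid = []
    while level:
        valid.extend(level)
        # Extend only itemsets that passed; violators were already dropped.
        extended = {s | {i} for s in level for i in items if i not in s}
        level = [s for s in extended if satisfies(s)]
    return valid

valid_itemsets = grow(sorted(price))
for s in sorted(valid_itemsets, key=lambda s: (len(s), sorted(s))):
    print(sorted(s), sum(price[i] for i in s))
```

Since "chair" alone exceeds the budget, none of the eight itemsets containing it is ever generated, which is the saving the anti-monotone property provides.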
-
6.5. Constraint-based association rule mining
Properties of constraints: Succinctness
  A subset of items I_s ⊆ I is a succinct set if it can be expressed as σ_p(I) for some selection predicate p, where σ is the selection operator.
  SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, ..., Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, ..., Ik using union and minus.
  A constraint C_s is succinct provided SAT_Cs(I) is a succinct power set.
  The sets that satisfy a succinct constraint can be generated explicitly and precisely.
  Example: a threshold constraint on min(S.Price)
-
6.5. Constraint-based association rule mining
Properties of constraints: Convertible
  Constraints that have none of the anti-monotone, monotone, or succinct properties.
  Such constraints become either anti-monotone or monotone if the items of the itemset under examination are ordered.
  Example: if the items are sorted in ascending order of price, an upper bound on avg(I.price) becomes anti-monotone.
-
6.5. Constraint-based association rule mining
-
6.5. Constraint-based association rule mining
Mining rules/frequent itemsets that satisfy the constraints
  Direct approach
    Apply the traditional algorithms.
    Check the constraints against each result obtained.
    If a result satisfies the constraints, return it as a final result.
  Approach based on the properties of the constraints
    Analyze the properties of the constraints comprehensively.
    Check the constraints as early as possible while mining rules/frequent itemsets.
    The data space is narrowed as early as possible.
-
6.6. Correlation analysis
Strong association rules A => B
  Based on the frequency of occurrence of A and B (min_sup)
  Based on the conditional probability of B given A (min_conf)
  The support and confidence thresholds depend on the user's subjective choices.
  A very large number of association rules may be returned.
Example: among 10,000 transactions, 6,000 include computer games, 7,500 include videos, and 4,000 include both computer games and videos.
  Buys(X, computer games) => Buys(X, videos) [support = 40%, confidence = 66%]
-
6.6. Correlation analysis
Correlation analysis for an association rule A => B
  Tests the correlation and mutual dependence between A and B.
  Based on statistics over the data.
  Objective measures that do not depend on the user.
Example: among 10,000 transactions, 6,000 include computer games, 7,500 include videos, and 4,000 include both computer games and videos.
  Buys(X, computer games) => Buys(X, videos) [support = 40%, confidence = 66%]
  P(videos) = 75% > 66%: computer games and videos are negatively correlated.
-
6.6. Correlation analysis
Correlation rules: A => B [support, confidence, correlation]
  correlation: a measure of the correlation between A and B.
Correlation measures: lift, χ2 (chi-square), all_confidence, cosine
  lift: tests whether A and B occur independently, based on probability.
  χ2 (chi-square): tests the independence of A and B based on expected and observed values.
  all_confidence: tests the rule based on the maximum support value.
  cosine: similar to lift, but removes the dependence on the total number of transactions.
  all_confidence and cosine work well on large datasets because they are not affected by transactions that contain none of the itemsets under examination (null-transactions).
  all_confidence and cosine are null-invariant measures.
6.6. Correlation analysis
Correlation measure: lift
  lift(A, B) < 1: A is negatively correlated with B
  lift(A, B) > 1: A is positively correlated with B
  lift(A, B) = 1: A and B are independent; there is no correlation
  lift({game}=>{video}) = 0.89 < 1: {game} and {video} are negatively correlated.
-
6.7. Summary
  Association rule mining
    Regarded as one of the most important contributions of the database community to knowledge discovery.
  Kinds of rules: Boolean/quantitative association rules, single-dimensional/multidimensional association rules, single-level/multilevel association rules, association rules/correlation rules
  Kinds of items/patterns: frequent itemsets/subsequences/substructures, closed frequent itemsets, maximal frequent itemsets, constrained frequent itemsets, approximate frequent itemsets, top-k frequent itemsets
  Mining frequent itemsets: the Apriori algorithm and the FP-Growth algorithm using an FP-tree
-
Q&A
R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In VLDB 1994, pp. 487-499.
J. Han, J. Pei, Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD 2000, pp. 1-12.
J. Hipp, U. Guntzer, G. Nakhaeizadeh (2000). Algorithms for association rule mining: a general survey and comparison, SIGKDD Explorations 2:1, 58-64.
W-J Lee, S-J Lee (2004). Discovery of fuzzy temporal association rules, IEEE Transactions on Systems, Man, and Cybernetics Part B 34:6, 2330-2342.
www.amazon.com
Vinabook.com