Top Banner
1 An Efficient Algorithm for M ining Frequent Itemests over the Entire History of D ata Streams Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accep Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accep ted for publication in the Proceedings of First ted for publication in the Proceedings of First International Workshop on Knowledge Discovery in International Workshop on Knowledge Discovery in Data Streams, to be held in conjunction with the Data Streams, to be held in conjunction with the 15th European Conference on Machine Learning (EC 15th European Conference on Machine Learning (EC ML 2004). ML 2004). Adviser: Jia-Ling Koh Adviser: Jia-Ling Koh Speaker: Shu-Ning Shin Speaker: Shu-Ning Shin Date: 2005.3.4 Date: 2005.3.4
18

An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

Dec 30, 2015

Download

Documents

medge-hendrix

An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams. - PowerPoint PPT Presentation
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

1

An Efficient Algorithm for Mining Frequent Itemestsover the Entire History of Data Streams

Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accepted for Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accepted for publication in the Proceedings of First International Workspublication in the Proceedings of First International Workshop on Knowledge Discovery in Data Streams, to be held in hop on Knowledge Discovery in Data Streams, to be held in conjunction with the 15th European Conference on Machinconjunction with the 15th European Conference on Machine Learning (ECML 2004).e Learning (ECML 2004).

Adviser: Jia-Ling KohAdviser: Jia-Ling KohSpeaker: Shu-Ning ShinSpeaker: Shu-Ning ShinDate: 2005.3.4Date: 2005.3.4

Page 2: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

2

IntroductionIntroduction• This paper proposes a single-pass algorithm,

called DSMFI(Data Stream Mining for Frequent Itemsets), has three major features:– single streaming data scan for counting item

sets’ frequency information.– extended prefix-tree-based compact pattern

representation.– top-down frequent itemset discovery schem

e.

Page 3: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

3

Problem Definition (1)Problem Definition (1)

• Item: Ψ={i1, i2, …, im}• Data Stream: DS=B1, B2, …,BN

• Block: B=[tsi, T1, T2, …, Tk]• Current Length of Data Stream: CL=|B1|+|

B2|+ …+|BN|• Minimum support: • Maximal estimated support error:

)1,0(ms

),0( ms

Page 4: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

4

Problem Definition (2)Problem Definition (2)• True Support of itemset X: Tsup(X)• Estimated Support of X: Esup(X)• X is frequent itemset if

– • X is sub-frequent itemset if

– • X is infrequent itemset if

CLmsXT )sup(

CLXTCLms )sup(

CLXT )sup(

Page 5: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

5

Data Structure: IsFI-forest (1)Data Structure: IsFI-forest (1)• Item-suffix Frequent Itemset forest.• Extended prefix-tree-based summary da

ta structure.• Definition:• 1. Is-FI forest consist of:

– DHT: Dynamic Header Table– a set of CFI-trees (item-suffixes): Candidate

Frequent Itemset trees of item-suffixes.

Page 6: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

6

Data Structure: IsFI-forest (2)Data Structure: IsFI-forest (2)• 2. Entry of DHT:

– item-id, support, block-id, head-link.• 3. Entry of CFI-tree(item-suffix):

– item-id, support, block-id, node-link.• 4. Each CFI-tree(item-suffix) has a specifi

c DHT(item-sufix)

Page 7: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

7

Construction of IsFI-forestConstruction of IsFI-forest• 1. Read a transaction T=(x1x2..xk) from current

block BN.• 2. Item-suffix projection IsProjection(T): transa

ction T is converted into k small transactions (x1x2..xk), (x2..xk),…, (xk-1xk), (xk).

• 3. insert IsProjection(T) into IsFI-forest.

Page 8: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

8

Algorithm: DSM-FIAlgorithm: DSM-FI• DSM-FI is composed of four steps:

– step 1 : reading a block of transactions– step 2: constructing IsFI-forest– step 3: pruning the infrequent information fr

om IsFI-forest– Step 4: top-down frequent itemset discover

y scheme

Page 9: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

9

ExampleExample• ms=30%, ε=25% • DS={B1, B2}

– B1={(acdef), (abe), (cef), (acdf), (cef), (df)}– B2={(def), (bef), (be), (bde)}

• DSM-FI:– Step 1. Read B1 into main memory– Step 2. constructing the IsFI-forest– Step 3. pruning infrequent item– Step 4. mine frequent itemset from IsFI-forest

Page 10: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

10

Example – step 2 (1)Example – step 2 (1)• (1) T1=acdef

– IsProjection(acdef)=acdef, cdef, def, ef, f– [CFT-tree(a), DHT(a)], [CFT-tree(c), DHT(c)], [CFT-tree(d), DHT(d)],

[CFT-tree(e), DHT(e)] [CFT-tree(f), DHT(f)]c 1 1d 1 1e 1 1f 1 1

d 1 1e 1 1f 1 1

DHT(a)

a:1:1

DHT(c)e 1 1f 1 1

f 1 1DHT(d) DHT(e) DHT(f)

c:1:1 d:1:1 e:1:1 f:1:1

a:1:1

c:1:1

d:1:1

e:1:1

f:1:1

c:1:1

d:1:1

e:1:1

f:1:1

d:1:1

e:1:1

f:1:1

e:1:1

f:1:1

f:1:1

CFT-tree(a)

CFT-tree(c)

CFT-tree(d)

CFT-tree(e)

CFT-tree(f)

Page 11: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

11

Example – step 2 (2)Example – step 2 (2)• (2) T2=abe

– IsProjection(abe)=abe, be, e– [CFT-tree(a), DHT(a)], [CFT-tree(b), DHT(b)], [CFT-tree(e), DHT(e)]

c 1 1d 1 1e 2 1f 1 1b 1 1

d 1 1e 1 1f 1 1

DHT(a)

a:2:1

DHT(c)e 1 1f 1 1

f 1 1DHT(d) DHT(e) DHT(f)

c:1:1 d:1:1 e:2:1 f:1:1

a:2:1

c:1:1

d:1:1

e:1:1

f:1:1

c:1:1

d:1:1

e:1:1

f:1:1

d:1:1

e:1:1

f:1:1

e:2:1

f:1:1

f:1:1

b:1:1

e:1:1

e 1 1

b:1:1

e:1:1

b:1:1

DHT(b)

Page 12: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

12

Example – step 2 (3)Example – step 2 (3)• (6) T6=df

– IsProjection(abe)=df, f– [CFT-tree(d), DHT(d)], [CFT-tree(f), DHT(f)]

c 2 1d 2 1e 2 1f 2 1b 1 1

d 2 1e 3 1f 4 1

DHT(a)

a:3:1

DHT(c)e 1 1f 3 1

f 3 1DHT(d) DHT(e) DHT(f)

c:4:1 d:3:1 e:4:1 f:5:1

a:3:1

c:2:1

d:2:1

e:1:1

f:1:1

c:4:1

d:2:1

e:1:1

f:1:1

d:3:1

e:1:1

f:1:1

e:4:1

f:3:1

f:5:1

b:1:1

e:1:1

e 1 1

b:1:1

e:1:1

b:1:1

DHT(b)

f:2:1

f:1:1

e:2:1

f:2:1

f:1:1

Page 13: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

13

Example – step 3Example – step 3• b is a infrequent item: Esup(b)<0.25*6• delete CFI-tree(b), DHT(b) and entry b.

c 2 1d 2 1e 2 1f 2 1b 1 1

d 2 1e 3 1f 4 1

DHT(a)

a:3:1

DHT(c)e 1 1f 3 1

f 3 1DHT(d) DHT(e) DHT(f)

c:4:1 d:3:1 e:4:1 f:5:1

a:3:1

c:2:1

d:2:1

e:1:1

f:1:1

c:4:1

d:2:1

e:1:1

f:1:1

d:3:1

e:1:1

f:1:1

e:4:1

f:3:1

f:5:1

b:1:1

e:1:1

e 1 1

b:1:1

e:1:1

b:1:1

DHT(b)

f:2:1

f:1:1

e:2:1

f:2:1

f:1:1

Page 14: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

14

Example – Step 4Example – Step 4• Start top-down frequent itemset discovery from frequent item a: (frequent: ms*CL=0.3*10=

3)

c 2 1d 2 1e 2 1f 2 1

d 2 1e 3 1f 4 1

DHT(a)

a:3:1

DHT(c)e 3 1f 4 1

f 5 1DHT(d) DHT(e) DHT(f)

c:4:1 d:5:1 e:8:1 f:7:1

a:3:1

c:2:1

d:2:1

e:1:1

f:1:1

c:4:1

d:2:1

e:1:1

f:1:1

d:5:1

e:3:1

f:2:1

e:8:1

f:5:1

f:7:1

e:1:1

e 3 2f 1 2d 1 2

b:3:2

e:2:2

b:3:2

DHT(b)

f:2:1

f:1:1

e:2:1

f:2:1

f:1:1

f:1:2

d:1:2

e:1:2

Frequent: aMaximal candidate: cef

TSup(cef)=3, frequent{c, e, f, ce, cf, ef, cef}

Maximal candidate: defTwo-candidate: de, dfMaximal candidate: be

Tsup(ef)=2 not frequentde, df are frequent

Maximal candidate: ef

be is frequent

Page 15: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

15

Upper Bound of SpaceUpper Bound of Space• i=1, 2, …, k• Nodes of all DHT(i):

– (k2-k)/2• Nodes of the set of CFI-trees:

k

j

j

i

ijiC

1

2/

1

2/

Page 16: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

16

Maximal Estimated ErrorMaximal Estimated Error• (X, X.support, X.block_id)

– X.block_id=1: • Tsup(X)=Esup(X)

– X.block_id>1:• • k: the average size of block

kidblockXXEXT )1_.()sup()sup(

Page 17: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

17

Performance (1)Performance (1)• IBM Dataset: T10.I5.D1M, T30.I20.D1M.• 20 blocks with size 50K.• Compare execution time and memory usage of

DSM-FI:

Page 18: An Efficient Algorithm for Mining Frequent Itemests over the Entire History of Data Streams

18

Performance (2)Performance (2)• Compare of DSM-FI and Lossy Counting: