
Agrawal et al, Mining sequential patterns, Data Eng., 1995

Dec 30, 2015


serinlee9

Paper seminar on R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 11th Int’l Conf. Data Eng., pp. 3-14, Mar. 1995.
Transcript
Page 1: Agrawal et al, Mining sequential patterns, Data Eng., 1995

Mining Sequential Patterns
Rakesh Agrawal, Ramakrishnan Srikant (1995)

IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120

Data Lab: Serin Lee (이세린); Advisor: Prof. Jongsoo Park (박종수)

2014. 3. 28

Page 2: Agrawal et al, Mining sequential patterns, Data Eng., 1995

Contents

• Abstract
1. Introduction
2. Finding Sequential Patterns
3. The Sequence Phase
4. Performance
5. Conclusions and Future Work

Page 3: Agrawal et al, Mining sequential patterns, Data Eng., 1995

Abstract

• Introduces the problem of mining sequential patterns over a large database.

• Presents 3 algorithms to solve this problem.

• Shows the results of their performance and scale-up experiments.

Page 4: Agrawal et al, Mining sequential patterns, Data Eng., 1995

1. Introduction

• 1.1 Problem Statement

• The problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support.

• Each such maximal sequence represents a sequential pattern.
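The notion of support in the problem statement can be made concrete with a small sketch. It assumes a customer sequence is a list of transactions (sets of items) and that a pattern is supported by a customer when its itemsets fit, in order, into distinct transactions. The names `contains`, `support` and `toy_db` are hypothetical, and the item numbers are illustrative only.

```python
from typing import List, Set


def contains(customer_seq: List[Set[int]], pattern: List[Set[int]]) -> bool:
    """Return True if each itemset of `pattern` fits, in order, inside a
    distinct transaction of `customer_seq` (greedy left-to-right match)."""
    pos = 0
    for itemset in pattern:
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(database: List[List[Set[int]]], pattern: List[Set[int]]) -> float:
    """Fraction of customer sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Hypothetical toy database: one list of transactions per customer.
toy_db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(support(toy_db, [{30}, {90}]))  # 0.4 (customers 1 and 4)
```

With a minimum support of 25%, the pattern <(30) (90)> would count as large in this toy database because two of the five customers support it.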

Page 5: Agrawal et al, Mining sequential patterns, Data Eng., 1995

1. Introduction

• 1.2 Related Work
– R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases” (1993)
– T. G. Dietterich and R. S. Michalski, “Discovering patterns in sequences of events,” Artificial Intelligence (1985)
– A. Califano and I. Rigoutsos, “Flash: A fast look-up algorithm for string homology” (1993)
– S. Wu and U. Manber, “Fast text searching allowing errors” (1992)
– M. Waterman, “Mathematical Methods for DNA Sequence Analysis” (1989)
– S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “A basic local alignment search tool” (1990)
– M. Roytberg, “A search for common patterns in many sequences,” Computer Applications in the Biosciences (1992)
– M. Vingron and P. Argos, “A fast and sensitive multiple sequence alignment algorithm,” Computer Applications in the Biosciences (1992)
– J. T.-L. Wang, G.-W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang, “Combinatorial pattern discovery for scientific data: Some preliminary results” (1994)

Page 6: Agrawal et al, Mining sequential patterns, Data Eng., 1995

1. Introduction

• 1.2 Related Work: Comparison
– Finding items bought together in a transaction (intra-transaction patterns).
– AI prediction of the sequential pattern.
– Finding matches for a pattern in text subsequences.
– Discovering similarities in a database of genetic sequences.

Page 7: Agrawal et al, Mining sequential patterns, Data Eng., 1995

1. Introduction

• 1.2 Related Work

“Combinatorial pattern discovery for scientific data: Some preliminary results” (1994)
• Each element is a character.
• Not guaranteed to be complete.
• The algorithm was main-memory based and was tested against a database of 150 sequences.

“Mining Sequential Patterns” (1995)
• Each element is a set of characters.
• Guarantees that all sequential patterns of interest that are present in a specified minimum number of sequences are discovered.
• The solution is targeted at millions of customer sequences.

Page 8: Agrawal et al, Mining sequential patterns, Data Eng., 1995

1. Introduction

• 1.3 Organization of the Paper

• Section 2 gives the problem decomposition.
• Section 3 examines the sequence phase in detail and presents algorithms for this phase.
• Section 4 empirically evaluates the performance of these algorithms and studies their scale-up properties.
• Section 5 concludes with a summary and directions for future work.

Page 9: Agrawal et al, Mining sequential patterns, Data Eng., 1995

Terminology

Length of a sequence: the number of itemsets in the sequence.

k-sequence: a sequence of length k.

x.y: the sequence formed by the concatenation of two sequences x and y.

Support for an itemset i: the fraction of customers who bought the items in i in a single transaction. (Thus, the itemset i and the 1-sequence <i> have the same support.)

Litemset: an itemset with minimum support (a large itemset).

Large sequence: a sequence with minimum support; it must be a list of litemsets.

Maximal sequence: a large sequence not contained in any other large sequence; the maximal sequences are the final sequential patterns that satisfy the minimum support.

Page 10: Agrawal et al, Mining sequential patterns, Data Eng., 1995

2. Finding Sequential Patterns

• 2.1 The Algorithm
• 1. Sort Phase
• Converts the original transaction database into a database of customer sequences.
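A minimal sketch of the sort phase, assuming each input row is a hypothetical `(customer_id, transaction_time, items)` tuple: sorting by customer and then by time groups every customer's transactions into one ordered sequence.

```python
from collections import defaultdict


def sort_phase(transactions):
    """Group transactions by customer, ordered by transaction time.

    `transactions` is an iterable of (customer_id, time, items) tuples;
    this row layout is an assumption made for the sketch.
    """
    by_customer = defaultdict(list)
    # Customer id is the major sort key, transaction time the minor key.
    for customer_id, _time, items in sorted(transactions, key=lambda t: (t[0], t[1])):
        by_customer[customer_id].append(frozenset(items))
    return dict(by_customer)


# Hypothetical rows, deliberately out of order.
rows = [
    (2, 3, {40}),
    (1, 2, {90}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
]
customer_sequences = sort_phase(rows)
```

After this step, customer 1 has the sequence <(30) (90)> and customer 2 has <(10 20) (40)>.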

Page 11: Agrawal et al, Mining sequential patterns, Data Eng., 1995

2. Finding Sequential Patterns

• 2.1 The Algorithm
• 2. Litemset Phase
• Find the set of all litemsets L. This also gives the set of all large 1-sequences.

• The set of litemsets is mapped to a set of contiguous integers.
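The litemset phase can be illustrated with a brute-force sketch. It enumerates every subset of every transaction, which is only workable for toy data and is not the itemset-mining algorithm used in the paper. The one detail it does mirror is that support is counted per customer, not per transaction. The function name, the data and the id order are assumptions.

```python
from itertools import combinations


def litemset_phase(customer_seqs, min_customers):
    """Return {litemset (as a sorted tuple): integer id}, ids starting at 1."""
    counts = {}
    for seq in customer_seqs:
        seen = set()  # itemsets this customer supports, counted once
        for transaction in seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    litemsets = sorted(s for s, n in counts.items() if n >= min_customers)
    # Any assignment of contiguous integers works; sorted order is arbitrary.
    return {itemset: i for i, itemset in enumerate(litemsets, start=1)}


# Hypothetical toy database: one list of transactions per customer.
toy_db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
litemset_ids = litemset_phase(toy_db, min_customers=2)
```

Mapping litemsets to integers means two litemsets can later be compared in constant time, which is the reason for this step.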

Page 12: Agrawal et al, Mining sequential patterns, Data Eng., 1995

2. Finding Sequential Patterns

• 2.1 The Algorithm
• 3. Transformation Phase
• To speed up the repeated containment tests in the following phase, each transaction is replaced by the set of all litemsets contained in that transaction.
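A sketch of the transformation, assuming the litemsets are already mapped to integer ids. The toy database and the mapping are hypothetical, chosen so that the result reproduces the transformed customer sequences listed on the Sequence Phase slide. Dropping transactions that contain no litemset follows the paper's description as best recalled, so treat it as an assumption.

```python
def transform(customer_seqs, litemset_ids):
    """Replace each transaction by the set of ids of the litemsets it contains."""
    transformed = []
    for seq in customer_seqs:
        new_seq = []
        for transaction in seq:
            ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
            if ids:  # assumption: a transaction with no litemset is dropped
                new_seq.append(ids)
        transformed.append(new_seq)
    return transformed


# Hypothetical toy database and litemset-to-integer mapping.
toy_db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
transformed_db = transform(toy_db, litemset_ids)
```

Customer 2, for example, becomes <{1} {2, 3, 4}>: the transaction (10 20) contains no litemset and disappears, and (40 60 70) contains the litemsets (40), (70) and (40 70).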

Page 13: Agrawal et al, Mining sequential patterns, Data Eng., 1995

2. Finding Sequential Patterns

• 2.1 The Algorithm
• 4. Sequence Phase
• Use the set of litemsets to find the desired sequences.

• 5. Maximal Phase
• Find the maximal sequences among the set of large sequences.
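The maximal phase can be sketched as a filter that drops every large sequence contained in another one. This is a quadratic toy version, not the paper's procedure. It works on sequences of itemsets so that, for example, <(30) (40)> is recognised as contained in <(30) (40 70)>. All names and data are hypothetical.

```python
def contains(big, small):
    """True if the itemsets of `small` fit, in order, into distinct itemsets of `big`."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_phase(large_seqs):
    """Keep only the sequences that no other large sequence contains."""
    return [
        s for s in large_seqs
        if not any(t != s and contains(t, s) for t in large_seqs)
    ]


fs = frozenset
# Hypothetical set of large sequences (tuples of itemsets).
large = [
    (fs({30}),), (fs({40}),), (fs({70}),), (fs({40, 70}),), (fs({90}),),
    (fs({30}), fs({40})), (fs({30}), fs({70})),
    (fs({30}), fs({40, 70})), (fs({30}), fs({90})),
]
patterns = maximal_phase(large)
```

Only <(30) (40 70)> and <(30) (90)> survive here, because every other large sequence is contained in one of them.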

Page 14: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• Make multiple passes over the data; each pass generates candidate sequences from a seed set of large sequences.

Customer Id   Transformed Customer Seq. (After Mapping)
1             <{1} {5}>
2             <{1} {2, 3, 4}>
3             <{1, 3}>
4             <{1} {2, 3, 4} {5}>
5             <{5}>

Minimum support: 25% (support > 1.25 customers, i.e. at least 2 of the 5 customers)

Candidate Seq.   Support
<1>              4
<2>              2
<3>              3
<4>              2
<5>              3
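The support column above can be recomputed from the transformed customer sequences with a short sketch. The variable and function names are hypothetical; the data is the table itself.

```python
import math
from collections import Counter

# Transformed customer sequences from the table (customer id -> sequence).
transformed = {
    1: [{1}, {5}],
    2: [{1}, {2, 3, 4}],
    3: [{1, 3}],
    4: [{1}, {2, 3, 4}, {5}],
    5: [{5}],
}


def one_sequence_support(db):
    """Count, for each litemset id, the customers whose sequence mentions it."""
    counts = Counter()
    for seq in db.values():
        counts.update(set().union(*seq))  # each customer counted at most once
    return dict(counts)


# 25% of 5 customers is 1.25, so a sequence needs at least 2 customers.
min_customers = math.ceil(0.25 * len(transformed))
supports = one_sequence_support(transformed)
large_1 = sorted(i for i, n in supports.items() if n >= min_customers)
```

All five candidate 1-sequences reach the threshold of 2 customers, so all of them seed the next pass.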

Page 15: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 2 Families of algorithms

Count-all: AprioriAll
Count-some: AprioriSome, DynamicSome

AprioriSome and DynamicSome both have a forward phase and a backward phase.

AprioriSome: generates candidates using only the large sequences and then makes a pass over the data to find their support.

DynamicSome: generates candidates on-the-fly using the large sequences and the customer sequences read from the database.

Page 16: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.1 Algorithm AprioriAll

Large 1-sequence (from the litemset phase)
<1>
<2>
<3>
<4>
<5>

② Candidate Seq.   Support
<1 2>              2
<1 3>              2
<1 4>              2
<1 5>              1
︙                 ︙
<5 4>              0

Large 2-sequence   Support
<1 2>              2
<1 3>              3
<1 4>              2
<1 5>              2
<2 3>              2
<2 4>              2
<3 4>              2

③ Maximal Sequences
<1 2 3 4>
<1 3 5>
<4 5>

Page 17: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.1 Algorithm AprioriAll
• 3.1.1 Apriori Candidate Generation

Join: $L_{k-1} \bowtie L_{k-1} = C_k$ (p and q both range over $L_{k-1}$)

Large 3-sequences (p and q)
<1 2 3>
<1 2 4>
<1 3 4>
<1 3 5>
<2 3 4>

Candidate Seq.   Support
<1 2 3 4>        2
<1 2 4 3>        0
<1 3 4 5>        1
<1 3 5 4>        0
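The join above can be sketched as follows, together with the pruning step that usually accompanies it in Apriori-style generation: a candidate is dropped if any subsequence obtained by deleting one element is not large. Sequences are tuples of litemset ids and the function name is hypothetical. Self-joins (p = q) are skipped here to mirror the four candidates in the table; a complete implementation would need them to find patterns that repeat a litemset.

```python
def apriori_generate(large_prev):
    """Return (joined, pruned) candidate sets built from the large (k-1)-sequences."""
    large_prev = set(large_prev)
    # Join: p and q agree on their first k-2 elements; append q's last element to p.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p != q and p[:-1] == q[:-1]
    }

    def all_subsequences_large(candidate):
        # Delete one element at a time; every result must already be large.
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    pruned = {c for c in joined if all_subsequences_large(c)}
    return joined, pruned


large_3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, pruned = apriori_generate(large_3)
```

On the five large 3-sequences, the join yields the four candidates listed in the table. Pruning keeps only <1 2 3 4>, since for instance <1 2 4 3> contains <2 4 3>, which is not large.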

Page 18: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.2 Algorithm AprioriSome


Page 19: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

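• 3.2 Algorithm AprioriSome
• In the forward phase, we only count sequences of certain lengths.

[Figure: lengths 1 to 6, marked as counted in the forward phase or in the backward phase]

• Forward phase procedure: $C_1 \rightarrow L_1 \rightarrow C_2 \rightarrow L_2 \rightarrow C_3 \rightarrow C_4 \rightarrow L_4 \rightarrow \dots$ (length 3 is not counted, so there is no $L_3$; two steps in the diagram are labelled "pruning")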

Page 20: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.2 Algorithm AprioriSome
• next() takes as a parameter the length of the sequences counted in the last pass and returns the length of the sequences to be counted in the next pass.

$hit_k = |L_k| / |C_k|$
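A hedged sketch of how a next() function could use the hit ratio: the larger the share of candidates that turned out to be large, the more lengths are skipped. The slide does not give the thresholds, so the cut-off values below are illustrative assumptions and the names `hit_ratio` and `next_length` are hypothetical.

```python
def hit_ratio(num_large: int, num_candidates: int) -> float:
    """hit_k = |L_k| / |C_k| for the pass that was just counted."""
    return num_large / num_candidates


def next_length(k: int, hit: float) -> int:
    """Length to count in the next pass, given the last counted length k.

    The thresholds are assumptions chosen for illustration.
    """
    if hit < 0.666:
        return k + 1  # few candidates were large: do not skip any length
    if hit < 0.75:
        return k + 2
    if hit < 0.80:
        return k + 3
    if hit < 0.85:
        return k + 4
    return k + 5      # almost every candidate was large: skip aggressively


# Example: 9 large sequences out of 20 counted candidates of length 2.
following = next_length(2, hit_ratio(9, 20))
```

The trade-off is between counting non-maximal sequences (skipping too little) and counting long candidates that turn out not to be large (skipping too much).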

Page 21: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.3 Algorithm DynamicSome

• $C_1, \dots, C_{step}$ and $L_1, \dots, L_{step}$ have to be initialized.

• If step = 3: after the initialization of lengths 1, 2, 3, the forward phase generates lengths 6, 9, 12, …

* The backward phase is the same as in AprioriSome.
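The step = 3 example can be written as a tiny schedule helper. This is only a sketch of which lengths are handled in which phase; `dynamic_some_lengths` and `max_length` are hypothetical names, and in the real algorithm the forward phase stops when no large sequences remain, not at a fixed bound.

```python
def dynamic_some_lengths(step: int, max_length: int):
    """Return (lengths counted during initialization, lengths generated in the forward phase)."""
    initialization = list(range(1, step + 1))              # 1, 2, ..., step
    forward = list(range(2 * step, max_length + 1, step))  # 2*step, 3*step, ...
    return initialization, forward


init_lengths, forward_lengths = dynamic_some_lengths(step=3, max_length=12)
```

For step = 3 this gives lengths 1, 2, 3 for initialization and 6, 9, 12 for the forward phase; the lengths in between are left to the later phases.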

Page 22: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.3 Algorithm DynamicSome

AprioriAll: $C_1, L_1, C_2, L_2, C_3, L_3, C_4, L_4, \dots$

AprioriSome: $C_1, L_1, C_2, L_2, C_3, C_4, L_4, \dots$

DynamicSome: $C_1, L_1, C_2, L_2, C_4, L_4, \dots$

Page 23: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3. The Sequence Phase

• 3.3 Algorithm DynamicSome

• otf-generate (on-the-fly) generates more candidates than apriori-generate.

• Join, avoiding overlapping: $L_k \bowtie L_j = C_{k+j}$
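One way to read "avoid overlapping" is that a sequence x from $L_k$ is joined with a sequence y from $L_j$ only when x ends, inside the current customer sequence, before y starts. The sketch below follows that reading. The helper names and the customer sequence are hypothetical; sequences are tuples of litemset ids and transactions are sets of ids.

```python
def end_index(c, seq):
    """Smallest 1-based j such that seq is contained in the first j transactions of c."""
    pos = 0
    for x in seq:
        while pos < len(c) and x not in c[pos]:
            pos += 1
        if pos == len(c):
            return None  # seq is not contained in c
        pos += 1
    return pos


def start_index(c, seq):
    """Largest 1-based j such that seq is contained in transactions j..n of c."""
    pos = len(c) - 1
    for x in reversed(seq):
        while pos >= 0 and x not in c[pos]:
            pos -= 1
        if pos < 0:
            return None  # seq is not contained in c
        pos -= 1
    return pos + 2


def otf_generate(large_k, large_j, c):
    """Candidate (k+j)-sequences contained in customer sequence c."""
    ends = {x: end_index(c, x) for x in large_k}
    starts = {y: start_index(c, y) for y in large_j}
    return {
        x + y
        for x, e in ends.items() if e is not None
        for y, s in starts.items() if s is not None and e < s  # no overlap
    }


large_2 = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)}
customer = [{1}, {2}, {3, 7}, {4}]  # hypothetical customer sequence
candidates = otf_generate(large_2, large_2, customer)
```

Here <1 2> ends at transaction 2 and <3 4> starts at transaction 3, so the only candidate produced for this customer is <1 2 3 4>.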

Page 24: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3 Algorithm Example

• AprioriAll / AprioriSome / DynamicSome ①

Customer Sequences
<(1 5) (2) (3) (4)>
<(1) (3) (4) (3 5)>
<(1) (2) (3) (4)>
<(1) (3) (5)>
<(4) (5)>

Large 1-sequence
<1>
<2>
<3>
<4>
<5>

Candidate 2-sequence   Support
<1 2>                  2
<1 3>                  4
<1 4>                  3
<1 5>                  3
︙                     ︙
<5 4>                  0

Large 2-sequence   Support
<1 2>              2
<1 3>              4
<1 4>              3
<1 5>              3
<2 3>              2
<2 4>              2
<3 4>              3
<3 5>              2
<4 5>              2

Page 25: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3 Algorithm Example

• AprioriAll ②

Large 2-sequence   Support
<1 2>              2
<1 3>              4
<1 4>              3
<1 5>              3
<2 3>              2
<2 4>              2
<3 4>              3
<3 5>              2
<4 5>              2

Candidate 3-sequence   Support
<1 2 3>                2
<1 2 4>                2
<1 2 5>                0
︙                     ︙
<3 4 5>                1

Large 3-sequence   Support
<1 2 3>            2
<1 2 4>            2
<1 3 4>            3
<1 3 5>            2
<2 3 4>            2

Candidate 4-sequence   Support
<1 2 3 4>              2
<1 2 4 3>              0
<1 3 4 5>              1
<1 3 5 4>              0

Large 4-sequence   Support
<1 2 3 4>          2

Maximal Sequences
<1 2 3 4>
<1 3 5>
<4 5>
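The whole count-all walk-through can be reproduced with a compact sketch run on the five customer sequences of the example. It is a toy version under stated assumptions: the minimum support is taken to be 2 customers (which is what the tables suggest), sequences are tuples of litemset ids, and all function names are hypothetical.

```python
def contains(customer_seq, pattern):
    """True if the ids in `pattern` occur, in order, in distinct transactions."""
    pos = 0
    for litemset in pattern:
        while pos < len(customer_seq) and litemset not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def is_subsequence(small, big):
    it = iter(big)
    return all(x in it for x in small)  # consumes `it`, so order is respected


def apriori_all(db, min_customers):
    """Return {large sequence: number of supporting customers}."""
    candidates = {(i,) for seq in db for t in seq for i in t}
    large = {}
    while candidates:
        level = {}
        for c in candidates:
            n = sum(contains(seq, c) for seq in db)
            if n >= min_customers:
                level[c] = n
        large.update(level)
        # Join on a common prefix, then prune candidates with a non-large subsequence.
        joined = {p + (q[-1],) for p in level for q in level if p[:-1] == q[:-1]}
        candidates = {
            c for c in joined
            if all(c[:i] + c[i + 1:] in level for i in range(len(c)))
        }
    return large


def maximal(large):
    return {s for s in large if not any(s != t and is_subsequence(s, t) for t in large)}


db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large = apriori_all(db, min_customers=2)
patterns = maximal(large)
```

The sketch arrives at the same large 3-sequences, the same single large 4-sequence and the same three maximal sequences as the tables. Under this strict containment rule it counts <1 5> in two customer sequences where the table lists 3; the set of large sequences and the maximal sequences are unaffected.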

Page 26: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3 Algorithm Example

• AprioriSome ②

Large 2-sequence   Support
<1 2>              2
<1 3>              4
<1 4>              3
<1 5>              3
<2 3>              2
<2 4>              2
<3 4>              3
<3 5>              2
<4 5>              2

Candidate 3-sequence (not counted in the forward phase)
<1 2 3>
<1 2 4>
<1 2 5>
<3 4 5>

Candidate 4-sequence   Support
<1 2 3 4>              2
<1 2 3 5>              0
︙                     ︙
<2 3 4 5>              0

Large 4-sequence   Support
<1 2 3 4>          2

Page 27: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3 Algorithm Example

• AprioriSome ③

Large 4-sequence   Support
<1 2 3 4>          2

Candidate 3-sequence   Support
<1 2 3>                2
<1 2 4>                2
<1 2 5>                0
︙                     ︙
<3 4 5>                1

Large 3-sequence   Support
<1 2 3>            2
<1 2 4>            2
<1 3 4>            3
<1 3 5>            2
<2 3 4>            2

Maximal Sequences
<1 2 3 4>
<1 3 5>
<4 5>

Page 28: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3 Algorithm Example

• DynamicSome ② (step = 2)

Large 2-sequence   Support
<1 2>              2
<1 3>              4
<1 4>              3
<1 5>              3
<2 3>              2
<2 4>              2
<3 4>              3
<3 5>              2
<4 5>              2

Candidate 4-sequence   Support
<1 2 3 4>              2
<1 2 3 5>              0
︙                     ︙
<2 3 4 5>              0

Large 4-sequence   Support
<1 2 3 4>          2

Page 29: Agrawal et al, Mining sequential patterns, Data Eng., 1995

3 Algorithm Example

• DynamicSome ③

Large 4-sequence   Support
<1 2 3 4>          2

Candidate 3-sequence   Support
<1 2 3>                2
<1 2 4>                2
<1 2 5>                0
︙                     ︙
<3 4 5>                1

Large 3-sequence   Support
<1 2 3>            2
<1 2 4>            2
<1 3 4>            3
<1 3 5>            2
<2 3 4>            2

Maximal Sequences
<1 2 3 4>
<1 3 5>
<4 5>

Page 30: Agrawal et al, Mining sequential patterns, Data Eng., 1995

4. Performance

• 4.1 Generation of Synthetic Data

• Customer-sequence sizes are typically clustered around a mean, and a few customers may have many transactions.

• Transaction sizes are usually clustered around a mean, and a few transactions have many items.

• Setting: $N_S = 5{,}000$, $N_I = 25{,}000$, $N = 10{,}000$

Page 31: Agrawal et al, Mining sequential patterns, Data Eng., 1995

4. Performance

• 4.2 Relative Performance
• The minimum support was decreased from 1% to 0.2%.

Page 32: Agrawal et al, Mining sequential patterns, Data Eng., 1995

4. Performance

• 4.2 Relative Performance
• Observations:
- Execution time increases as the minimum support decreases.
- DynamicSome performs worse.
- AprioriSome shows:

Advantages
Avoids counting many non-maximal sequences.
Faster for lower supports.

Disadvantages
1. The number of candidates generated using AprioriSome can be larger. ($C_{k-1} \supseteq L_{k-1}$)
2. Candidates of the skipped lengths still have to be generated and stay memory resident.

Page 33: Agrawal et al, Mining sequential patterns, Data Eng., 1995

4. Performance

• 4.3 Scale-up
• Scale-up experiments for the AprioriSome algorithm. (The results for AprioriAll are very similar.)

Page 34: Agrawal et al, Mining sequential patterns, Data Eng., 1995

4. Performance

• 4.3 Scale-up


Page 35: Agrawal et al, Mining sequential patterns, Data Eng., 1995

5. Conclusions and Future Work

• Introduced a new problem of mining sequential patterns from a database of customer sales transactions.

• Presented 3 algorithms for solving this problem.
• AprioriSome and AprioriAll have comparable performance.
– AprioriSome performs a little better for the lower values of the minimum number of customers that must support a sequential pattern.
– Both scale linearly with the number of customer transactions.
– Both have excellent scale-up properties with respect to the number of transactions in a customer sequence and the number of items in a transaction.

• AprioriAll is preferred in cases that need detailed counts of the number of customers supporting each large sequence.

Page 36: Agrawal et al, Mining sequential patterns, Data Eng., 1995

5. Conclusions and Future Work

In the future:
• Extension of the algorithms to discover sequential patterns across item categories.

• Transposition of constraints into the discovery algorithms. There could be item constraints or time constraints.

Page 37: Agrawal et al, Mining sequential patterns, Data Eng., 1995

Thank You
