
Data Mining – Sequences
H. Liu (ASU) & G. Dong (WSU), 7/03

7. Sequence Mining

Sequences and Strings

Recognition with Strings

MM & HMM

Sequence Association Rules


Sequences and Strings

• A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence
  – Sequences and strings are often used as synonyms
  – String elements (characters, letters, or symbols) are nominal
  – Text is a particularly long type of string

• |x| denotes the length of sequence x
  – |AGCTTC| is 6

• Any contiguous string that is part of x is called a substring, segment, or factor of x
  – GCT is a factor of AGCTTC (see the sketch below)
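These definitions map directly onto built-in string operations; a minimal Python illustration of |x| and the factor test:

    x = "AGCTTC"
    print(len(x))        # |x| = 6
    print("GCT" in x)    # True: GCT is a contiguous substring (factor) of x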


Recognition with Strings

• String matching
  – Given x and text, determine whether x is a factor of text

• Edit distance (for inexact string matching)
  – Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y


String Matching

• Given |text| >> |x|, with characters taken from an alphabet A
  – A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}

• A shift s is an offset needed to align the first character of x with character number s+1 in text

• Find whether there exists a valid shift such that there is a perfect match between the characters in x and the corresponding ones in text


Naïve (Brute-Force) String Matching

• Given A, x, text, n = |text|, m = |x|

    s = 0
    while s ≤ n-m
        if x[1 … m] = text[s+1 … s+m]
            then print "pattern occurs at shift" s
        s = s + 1

• Time complexity (worst case): O((n-m+1)m)
• Shifting one character at a time is not necessary (see the Python sketch below)
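A direct Python transcription of the pseudocode above (0-indexed, so shift s aligns x with text[s .. s+m-1]):

    def naive_match(x, text):
        """Brute-force matching: try every shift s = 0 .. n-m and compare
        the m characters of x against text at that offset."""
        n, m = len(text), len(x)
        shifts = []
        for s in range(n - m + 1):
            if text[s:s + m] == x:       # x[1..m] = text[s+1..s+m] on the slide
                shifts.append(s)         # pattern occurs at shift s
        return shifts

    print(naive_match("GCT", "AGCTTCGCT"))   # [1, 6]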


Boyer-Moore and KMP

• See StringMatching.ppt; do not use the following algorithm
• Given A, x, text, n = |text|, m = |x|

    F(x) = last-occurrence function
    G(x) = good-suffix function
    s = 0
    while s ≤ n-m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j-1
        if j = 0
            then print "pattern occurs at shift" s
            s = s + G(0)
        else
            s = s + max[G(j), j-F(text[s+j])]
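A hedged Python sketch of the Boyer-Moore idea using only the last-occurrence (bad-character) function F; the good-suffix function G from the pseudocode above is omitted here, so after a full match the sketch simply advances by one:

    def bm_bad_character(x, text):
        """Boyer-Moore with the bad-character heuristic only: compare x against
        text right to left, and on a mismatch skip ahead so the last occurrence
        of the offending text character in x lines up with it."""
        n, m = len(text), len(x)
        F = {c: i for i, c in enumerate(x)}    # last occurrence of each char in x
        shifts, s = [], 0
        while s <= n - m:
            j = m - 1
            while j >= 0 and x[j] == text[s + j]:
                j -= 1
            if j < 0:
                shifts.append(s)               # pattern occurs at shift s
                s += 1                         # full G(0) shift omitted in this sketch
            else:
                s += max(1, j - F.get(text[s + j], -1))
        return shifts

    print(bm_bad_character("GCT", "AGCTTCGCT"))   # [1, 6]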


Edit Distance

• The ED between x and y is the minimum number of fundamental operations required to transform x into y

• Fundamental operations (x = 'excused', y = 'exhausted'):
  – Substitutions, e.g. 'c' is replaced by 'h'
  – Insertions, e.g. 'a' is inserted into x after 'h'
  – Deletions, e.g. a character in x is deleted

• ED is one way of measuring the similarity between two strings (a dynamic-programming sketch follows)
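A standard dynamic-programming sketch of ED, with unit cost for each substitution, insertion, and deletion:

    def edit_distance(x, y):
        """D[i][j] = minimum number of operations transforming x[:i] into y[:j]."""
        m, n = len(x), len(y)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i                                # delete all of x[:i]
        for j in range(n + 1):
            D[0][j] = j                                # insert all of y[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,         # deletion
                              D[i][j - 1] + 1,         # insertion
                              D[i - 1][j - 1] + cost)  # substitution or match
        return D[m][n]

    print(edit_distance("excused", "exhausted"))   # 3: one substitution, two insertions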


Classification using ED

• The nearest-neighbor algorithm can be applied for pattern recognition
  – Training: strings are stored together with their class labels
  – Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string's label is assigned to the test string

• The key is how to calculate ED
• An example of calculating ED was sketched above; a classification sketch follows
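A minimal sketch of the classification step, reusing edit_distance from above; the training set here is hypothetical:

    def nn_classify(test_string, training):
        """1-nearest-neighbor: return the label of the stored string whose
        edit distance to the test string is smallest (ties: first stored wins)."""
        best_string, best_label = min(training, key=lambda sl: edit_distance(test_string, sl[0]))
        return best_label

    # Hypothetical (string, label) training data:
    training = [("AGCT", "gene-A"), ("TTAG", "gene-B"), ("AGGT", "gene-A")]
    print(nn_classify("AGCTT", training))   # gene-A (distance 1 to AGCT)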


Hidden Markov Model

• Markov Model: transitional states
• Hidden Markov Model: additional visible states
• Evaluation
• Decoding
• Learning


Markov Model

• The Markov property:
  – given the current state, the transition probability is independent of any previous states

• A simple Markov Model
  – State ω(t) at time t
  – Sequence of length T: ωT = {ω(1), ω(2), …, ω(T)}
  – Transition probability: P(ωj(t+1) | ωi(t)) = aij
  – It is not required that aij = aji (see the sampling sketch below)
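A minimal sampling sketch, assuming states are indexed 0 .. c-1 and each row of the transition matrix sums to 1; the matrix below is hypothetical:

    import random

    def sample_markov_chain(a, start, T):
        """Generate w(1) .. w(T): by the Markov property, each next state
        depends only on the current one via aij = a[i][j]."""
        state, seq = start, [start]
        for _ in range(T - 1):
            state = random.choices(range(len(a)), weights=a[state])[0]
            seq.append(state)
        return seq

    a = [[0.7, 0.2, 0.1],    # note a12 = 0.2 while a21 = 0.1: aij = aji not required
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
    print(sample_markov_chain(a, start=0, T=10))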


Hidden Markov Model

• Visible states: VT = {v(1), v(2), …, v(T)}

• Emitting a visible state vk(t): P(vk(t) | ωj(t)) = bjk

• Only the visible states vk(t) are accessible; the hidden states ωi(t) are unobservable

• A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state (see the generation sketch below)
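A generation sketch: the hidden chain evolves via aij while each state emits a visible symbol via bjk; only the visible sequence would be observable in practice. Both matrices are hypothetical:

    import random

    def sample_hmm(a, b, start, T):
        """Return (hidden states, visible symbols) of length T."""
        state, states, visibles = start, [], []
        for _ in range(T):
            states.append(state)
            # emit visible symbol vk with probability bjk = b[state][k]
            visibles.append(random.choices(range(len(b[state])), weights=b[state])[0])
            # move to the next hidden state with probability a[state][j]
            state = random.choices(range(len(a)), weights=a[state])[0]
        return states, visibles

    a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    b = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
    hidden, visible = sample_hmm(a, b, start=0, T=8)
    print(hidden, visible)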


Three Key Issues with HMM

• Evaluation
  – Given an HMM, complete with transition probabilities aij and bjk, determine the probability that a particular sequence of visible states VT was generated by that model (see the forward-algorithm sketch below)

• Decoding
  – Given an HMM and a set of observations VT, determine the most likely sequence of hidden states ωT that led to VT

• Learning
  – Given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk
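A sketch of the Evaluation step via the standard forward algorithm; pi (an explicit initial state distribution) is an assumption of this sketch:

    def forward(a, b, pi, V):
        """P(V | model): alpha[j] accumulates the probability of having seen
        v(1) .. v(t) and being in hidden state j at time t."""
        c = len(a)
        alpha = [pi[j] * b[j][V[0]] for j in range(c)]
        for t in range(1, len(V)):
            alpha = [sum(alpha[i] * a[i][j] for i in range(c)) * b[j][V[t]]
                     for j in range(c)]
        return sum(alpha)

    # Reusing the hypothetical matrices above; V holds visible-symbol indices:
    a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    b = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
    print(forward(a, b, pi=[1.0, 0.0, 0.0], V=[0, 1, 2, 1]))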


Other Sequential Pattern Mining Problems

• Sequence alignment (homology) and sequence assembly (genome sequencing)

• Trend analysis
  – Trend movement vs. cyclic variations, seasonal variations, and random fluctuations

• Sequential pattern mining
  – Various kinds of sequences (e.g. weblogs)
  – Various methods: from GSP to PrefixSpan

• Periodicity analysis
  – Full periodicity, partial periodicity, cyclic association rules


Periodic Pattern

• Full periodic pattern
  – ABC ABC ABC

• Partial periodic pattern
  – ABC ADC ACC ABC

• Pattern hierarchy over sequences of transactions
  – ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE
  – summarized as [ABC:3 | DE:4]

Note (Guozhu Dong): Full periodic patterns are too restrictive for data mining. In a pattern hierarchy, the overall pattern is made from two more detailed patterns, each with a duration. (A matching sketch follows.)
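A minimal sketch of what periodicity matching means, assuming '*' marks a don't-care position in a partial pattern (the function name is hypothetical):

    def matches_periodic(seq, pattern):
        """True if every symbol of seq agrees with the pattern repeated with
        period len(pattern); '*' matches anything (partial periodicity)."""
        p = len(pattern)
        return all(pattern[i % p] in ('*', s) for i, s in enumerate(seq))

    print(matches_periodic("ABCABCABC", "ABC"))      # True: full periodic pattern
    print(matches_periodic("ABCADCACCABC", "A*C"))   # True: partial periodic pattern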


Sequence Association Rule Mining

• SPADE (Sequential Pattern Discovery using Equivalence classes; see the support-counting sketch below)

• Constrained sequence mining (SPIRIT)
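These algorithms differ mainly in how they count support efficiently (SPADE, for instance, uses vertical id-lists grouped into equivalence classes); the brute-force sketch below only illustrates what is being counted, over a hypothetical database of itemset sequences:

    def contains(pattern, sequence):
        """Sequence containment: each itemset of the pattern must be a subset
        of some itemset of the sequence, in order but not contiguously."""
        it = iter(sequence)
        return all(any(p <= s for s in it) for p in pattern)

    def support(pattern, database):
        """Fraction of data sequences that contain the pattern."""
        return sum(contains(pattern, seq) for seq in database) / len(database)

    db = [[{"a"}, {"b", "c"}, {"d"}],
          [{"a"}, {"d"}],
          [{"b"}, {"d"}]]
    print(support([{"a"}, {"d"}], db))   # 2/3: contained in the first two sequences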


Bibliography

• R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification. 2nd Edition. Wiley Interscience.


[Figure: a three-state Markov model; arrows among states 1, 2, 3 carry the transition probabilities a11 … a33.]


[Figure: a three-state hidden Markov model; in addition to the transition probabilities aij, each hidden state j emits visible states v1 … v4 with probabilities bj1 … bj4.]


[Figure: the Evaluation trellis; columns hold states 1 … c at times t = 1 … T, and each entry αj(2) at t = 2 combines the previous column with the transitions a12 … ac2 and the emission b2k for the observed vk.]


[Figure: a worked Evaluation example over states 0 … 3 and t = 0 … 4 for the observed sequence v3 v1 v3 v2 v0; starting with probability 1 in state 0, intermediate values include 0.2, 0.09, 0.01 and the final entry 0.0011.]


[Figure: states 1 … 7 and 0 labeled with the phonemes /v/ /i/ /t/ /e/ /r/ /b/ /i/ /-/, modeling the spoken word "Viterbi".]


[Figure: the Decoding trellis; at each time step t = 1 … T the most probable state max(t) is selected over states 0 … c, tracing the most likely hidden path.]
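A sketch of the Decoding step via the Viterbi algorithm: the same trellis as Evaluation, but with max in place of sum, plus backpointers to recover the best path. pi is again an assumed initial distribution:

    def viterbi(a, b, pi, V):
        """Most likely hidden state sequence for the visible sequence V."""
        c = len(a)
        delta = [pi[j] * b[j][V[0]] for j in range(c)]   # best path prob ending in j
        back = []                                        # backpointers per time step
        for t in range(1, len(V)):
            prev = delta
            ptrs = [max(range(c), key=lambda i: prev[i] * a[i][j]) for j in range(c)]
            delta = [prev[ptrs[j]] * a[ptrs[j]][j] * b[j][V[t]] for j in range(c)]
            back.append(ptrs)
        path = [max(range(c), key=lambda j: delta[j])]   # best final state
        for ptrs in reversed(back):
            path.append(ptrs[path[-1]])                  # follow backpointers
        return list(reversed(path))

    a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    b = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
    print(viterbi(a, b, pi=[1.0, 0.0, 0.0], V=[0, 1, 2, 1]))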


[Figure: the worked example from the Evaluation slide repeated, over states 0 … 3 and t = 0 … 4 for the observed sequence v3 v1 v3 v2 v0.]
