Page 1
Ted Gueniche1, Philippe Fournier-Viger1,
Rajeev Raman2, Vincent S. Tseng3
1University of Moncton, Canada
2University of Leicester, UK
3National Chiao Tung University, Taiwan
CPT+: A Compact Model for Accurate
Sequence Prediction
2015-05-22 – PAKDD 2015, Ho Chi Minh City, Vietnam
Page 2
The problem of Sequence Prediction
B A C ?
Problem:
◦ Given a set of training sequences, predict the next symbol of a sequence.
Applications:
◦ webpage prefetching,
◦ analyzing the behavior of customers on websites,
◦ keyboard typing prediction,
◦ product recommendation,
◦ stock market prediction,
◦ …
2
Page 3
General approach for this problem
3
Phase 1) Training: training sequences → building a sequence prediction model → prediction model
Phase 2) Prediction: a sequence (e.g. A,B,C) + the prediction model → prediction algorithm → a prediction (e.g. D)
Page 4
Sequential pattern mining
◦ Discovery of patterns (e.g. with PrefixSpan),
◦ using the patterns for prediction.
Drawbacks:
◦ extracting patterns is time-consuming,
◦ patterns ignore rare cases,
◦ updating the patterns is very costly.
4
[Figure: example sequences and the patterns mined by PrefixSpan with minsup = 33%, with their support]
Page 5
Dependency Graph (DG)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
5
[Figure: dependency graph over symbols A, B, C, D with edge weights]
DG with lookup table of size 2
Page 6
Dependency Graph (DG)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
6
[Figure: dependency graph over symbols A, B, C, D with edge weights]
P(B|A) = 3 / SUP(A) = 3 / 4
P(C|A) = 3 / SUP(A) = 3 / 4
…
DG with lookup table of size 2
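As a minimal sketch (not the authors' implementation), the DG counts above can be reproduced in Python: an edge X → Y is weighted by how often Y appears within a lookup window after X, and P(Y|X) divides that weight by the support of X. The function names are illustrative.

```python
from collections import Counter, defaultdict

def build_dg(sequences, window=2):
    """Count, for each symbol X, how often each symbol Y occurs within
    `window` positions after X (the DG edge weights), plus the total
    support of each symbol."""
    support = Counter()
    edges = defaultdict(Counter)
    for seq in sequences:
        for i, x in enumerate(seq):
            support[x] += 1
            for y in seq[i + 1 : i + 1 + window]:
                edges[x][y] += 1
    return support, edges

def dg_probability(support, edges, x, y):
    """P(y | x) = weight(x -> y) / SUP(x)."""
    return edges[x][y] / support[x]

support, edges = build_dg([list("ABCACBD"), list("CCABCBCA")], window=2)
print(dg_probability(support, edges, "A", "B"))  # 3 / 4 = 0.75, as on the slide
```

With S1 and S2 above, B occurs 3 times within two positions after A, and A has support 4, matching P(B|A) = 3/4.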
Page 7
PPM – order 1
(prediction by partial matching)
S1: {A,B,C,A,C,B,D} S2: {C,C,A,B,C,B,C,A}
[Figure: order-1 PPM tree with successor counts under each symbol]
Page 8
PPM – order 1
(prediction by partial matching)
S1: {A,B,C,A,C,B,D} S2: {C,C,A,B,C,B,C,A}
[Figure: order-1 PPM tree with successor counts under each symbol]
P(B|A) = 2 / 4
P(C|A) = 1 / 4
…
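A minimal order-1 PPM counting sketch (illustrative names, not the original code): count how often each symbol immediately follows another, and divide by the symbol's support, as on the slide.

```python
from collections import Counter, defaultdict

def ppm_order1(sequences):
    """Order-1 counts: how often y immediately follows x,
    plus the total support of each symbol."""
    support = Counter()
    succ = defaultdict(Counter)
    for seq in sequences:
        for x in seq:
            support[x] += 1
        for x, y in zip(seq, seq[1:]):
            succ[x][y] += 1
    return support, succ

support, succ = ppm_order1([list("ABCACBD"), list("CCABCBCA")])
print(succ["A"]["B"] / support["A"])  # P(B|A) = 2 / 4 = 0.5
```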
Page 9
PPM – order 2
S1: {A,B,C,A,C,B,D} S2: {C,C,A,B,C,B,C,A}
[Figure: order-2 PPM tree, e.g. context AB is followed by C (count 2), AC by B (1), BC by A (2) and B (1), …]
predictions are inaccurate if there is noise…
Page 10
All-K-Order Markov
Uses PPM from level 1 to K for prediction.
More accurate than a fixed-order PPM,
But exponential size
Example: order 2
[Figure: PPM trees of orders 1 and 2 combined]
P(C|AB) = 2 / 2
P(B|AC) = 1 / 1
P(A|BC) = 2 / 3
…
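The All-K-Order Markov idea can be sketched by counting successors for every context of length 1 to K (a simplified illustration, not the original implementation; the exponential blow-up comes from the number of distinct contexts):

```python
from collections import Counter, defaultdict

def all_k_order(sequences, K=2):
    """Successor counts for every context of length 1..K."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for k in range(1, K + 1):
            for i in range(len(seq) - k):
                ctx = tuple(seq[i:i + k])
                counts[ctx][seq[i + k]] += 1
    return counts

counts = all_k_order([list("ABCACBD"), list("CCABCBCA")], K=2)
ab = counts[("A", "B")]
print(ab["C"] / sum(ab.values()))  # P(C|AB) = 2 / 2 = 1.0
```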
Page 11
Limitations
Several models assume that each event depends
only on the immediately preceding event.
Models that do not make this assumption often have
an exponential complexity (e.g. All-K-Order Markov).
There have been several improvements to reduce the size of
Markovian models, but little work on improving their
accuracy.
Several models are not noise tolerant.
Some models are costly to update
(e.g. sequential patterns).
All the aforementioned models are lossy.
11
Page 12
CPT: COMPACT PREDICTION TREE
12
Gueniche, T., Fournier-Viger, P., Tseng, V.-S. (2013). Compact Prediction Tree:
A Lossless Model for Accurate Sequence Prediction. Proc. 9th International
Conference on Advanced Data Mining and Applications (ADMA 2013) Part II,
Springer LNAI 8347, pp. 177-188.
Page 13
Goal
◦ to provide more accurate predictions,
◦ a model having a reasonable size,
◦ a model that is noise tolerant.
13
Page 14
Hypothesis
Idea:
◦ build a lossless model (or a model where
the loss of information can be controlled),
◦ use all relevant information to perform
each sequence prediction.
Hypothesis:
◦ this would increase prediction accuracy.
14
Page 15
Challenges
1) Define an efficient structure in terms of
space to store sequences,
2) The structure must be incrementally
updatable to add new sequences
3) Propose a prediction algorithm that:
◦ offers accurate predictions,
◦ if possible, is also time-efficient.
15
Page 16
Our proposal
Compact Prediction Tree (CPT)
A tree-structure to store training
sequences,
An indexing mechanism,
Each sequence is inserted one after
the other in the CPT.
Illustration
16
Page 17
Example
We will consider the five following
training sequences:
1. ABC
2. AB
3. ABDC
4. BC
5. BDE
17
Page 18
18
Example (construction)
Lookup table
Prediction tree Inverted Index
root
Page 19
19
Example: Inserting <A,B,C>
root
Lookup table
Inverted Index Prediction tree
Page 20
20
Example: Inserting <A,B,C>
Prediction tree: root → A → B → C
Lookup table: s1 → node C
Inverted Index:
      s1
A      1
B      1
C      1
Page 21
21
Example: Inserting <A,B>
Prediction tree: root → A → B → C (s2 shares the prefix A,B)
Lookup table: s1 → node C, s2 → node B
Inverted Index:
      s1 s2
A      1  1
B      1  1
C      1  0
Page 22
22
Example: Inserting <A,B,D,C>
Prediction tree: root → A → B → C and root → A → B → D → C
Lookup table: s1 → node C, s2 → node B, s3 → node C (under D)
Inverted Index:
      s1 s2 s3
A      1  1  1
B      1  1  1
C      1  0  1
D      0  0  1
Page 23
23
Example: Inserting <B,C>
Prediction tree: previous branches plus root → B → C
Lookup table: s1–s3 as before, s4 → node C (under root → B)
Inverted Index:
      s1 s2 s3 s4
A      1  1  1  0
B      1  1  1  1
C      1  0  1  1
D      0  0  1  0
Page 24
24
Example: Inserting <B,D,E>
Prediction tree: previous branches plus root → B → D → E
Lookup table: s1–s4 as before, s5 → node E
Inverted Index:
      s1 s2 s3 s4 s5
A      1  1  1  0  0
B      1  1  1  1  1
C      1  0  1  1  0
D      0  0  1  0  1
E      0  0  0  0  1
Page 26
Insertion
◦ linear complexity, O(m), where m is the sequence length,
◦ a reversible operation (sequences can be recovered from the CPT),
◦ the insertion order of sequences is preserved in the CPT.
26
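The three structures and the O(m) insertion can be sketched as follows (a simplified illustration with Python dicts and sets in place of the paper's bit vectors; class and attribute names are illustrative):

```python
class Node:
    def __init__(self, symbol):
        self.symbol = symbol
        self.children = {}

class CPT:
    """Minimal sketch of CPT training: a prediction tree, an inverted
    index (symbol -> set of sequence ids), and a lookup table
    (sequence id -> last node of that sequence)."""
    def __init__(self):
        self.root = Node(None)
        self.inverted_index = {}
        self.lookup = {}

    def insert(self, seq_id, sequence):
        node = self.root
        for symbol in sequence:  # O(m) for a sequence of length m
            node = node.children.setdefault(symbol, Node(symbol))
            self.inverted_index.setdefault(symbol, set()).add(seq_id)
        self.lookup[seq_id] = node  # pointer to the sequence's last node

cpt = CPT()
for sid, seq in enumerate(["ABC", "AB", "ABDC", "BC", "BDE"], start=1):
    cpt.insert(sid, seq)
print(sorted(cpt.inverted_index["D"]))  # sequences containing D: [3, 5]
```

Insertion is reversible: following lookup-table pointers back up the tree recovers each training sequence.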
Page 27
Space complexity
Size of the prediction tree
◦ worst case:
O(N * average sequence
length) where N is the number
of sequences.
◦ In general, much smaller,
because sequences overlap.
27
[Figure: prediction tree storing the five training sequences]
Page 28
Space complexity (cont’d)
Size of Inverted Index
◦ O(n × b), where n is the sequence count and b is the symbol count,
◦ small in practice because rows are encoded as bit vectors.
28
s1 s2 s3 s4 s5
A 1 1 1 0 0
B 1 1 1 1 1
C 1 0 1 1 0
D 0 0 1 0 1
E 0 0 0 0 1
Page 29
Space complexity (cont’d)
Size of lookup table
n pointers where n is the
sequence count
29
Lookup table
[Figure: lookup table pointing s1–s5 to their last nodes in the prediction tree]
Page 31
31
Predicting the symbol following <A,B>
Prediction tree: root → A → B → {C, D → C}; root → B → {C, D → E}
Lookup table: s1–s5 → their last nodes
Inverted Index:
      s1 s2 s3 s4 s5
A      1  1  1  0  0
B      1  1  1  1  1
C      1  0  1  1  0
D      0  0  1  0  1
E      0  0  0  0  1
Page 32
32
Predicting the symbol following <A,B>
[Prediction tree, lookup table and inverted index as above]
AND of the inverted index rows for A and B: 11100 ∧ 11111 = 11100.
The logical AND indicates that the sequences common
to A and B are: s1, s2 and s3.
Page 33
33
Predicting the symbol following <A,B>
[Prediction tree, lookup table and inverted index as above]
The lookup table allows traversing the matching
sequences (s1, s2, s3) from the end to
the start.
Page 34
34
Predicting the symbol following <A,B>
[Prediction tree, lookup table and inverted index as above]
Count table:
C: 2 occurrences after {A,B}
D: 1 occurrence after {A,B}
The predicted symbol is the most frequent one in the count table: C.
Page 35
Complexity of prediction
1. Intersection of bit vectors: O(v) where v is the number of symbols.
2. Traversing sequences: O(n) where n is the sequence count
3. Creating the count table: O(x) where x is the number of symbols in sequences after the target sequence.
4. Choosing the predicted symbol: O(y) where y is the number of distinct symbols in the Count Table.
35
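The four prediction steps above can be sketched end-to-end (a simplified illustration using plain Python lists and a Counter in place of the bit vectors and tree traversal; `cpt_predict` is an illustrative name, not the paper's API):

```python
from collections import Counter

def cpt_predict(training, prefix):
    """Sketch of CPT prediction: (1) intersect the inverted-index
    entries of the prefix symbols, (2) traverse each matching sequence,
    (3) tally the symbols occurring after the prefix in a count table,
    (4) return the most frequent symbol."""
    # (1) sequences containing every symbol of the prefix
    matching = [s for s in training if all(sym in s for sym in prefix)]
    # (2) + (3) count symbols appearing after the prefix match
    count_table = Counter()
    for s in matching:
        i = max(s.index(sym) for sym in prefix)  # end of the prefix match
        count_table.update(s[i + 1:])
    # (4) predicted symbol = the most frequent one
    return count_table.most_common(1)[0][0] if count_table else None

training = ["ABC", "AB", "ABDC", "BC", "BDE"]
print(cpt_predict(training, "AB"))  # C (2 occurrences after {A,B} vs. 1 for D)
```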
Page 36
EXPERIMENTAL EVALUATION
36
Page 37
Experimental evaluation
Datasets
BMS, FIFA, Kosarak: sequences of clicks
on webpages.
SIGN: sentences in sign language.
BIBLE: sequences of characters in a book.
Page 38
Experimental evaluation (cont’d)
Competitor algorithms
DG (lookup window = 4)
All-K-Order Markov (order of 5)
PPM (order of 1)
10-fold cross-validation
Page 39
Experimental evaluation (cont’d)
Measures:
Accuracy
= |success count| / |sequence count|
Coverage
= |prediction count| / |sequence count|
Page 40
Experiment 1 – Accuracy
CPT is the most accurate except for one
dataset.
PPM and DG perform well in some situations.
Page 41
Experiment 1 – size
CPT is
◦ smaller than All-K-order-Markov
◦ larger than DG and PPM
Page 42
Experiment 1 – time (cont’d)
CPT’s training time is at least 3 times lower than that of DG and AKOM, and similar to PPM’s.
CPT’s prediction time is quite high (a trade-off for more accuracy).
Page 43
Experiment 2 – scalability
CPT shows a trend similar to other algorithms
Page 44
Experiment 3 – prefix size
prefix size: the number of symbols to be
used for making a prediction
for FIFA:
The accuracy of CPT increases until a prefix size of around 8.
(depends on the dataset)
Page 45
Optimisation #1 - RecursiveDivider
Example: {A,B,C,D}
Level 1: {B,C,D}, {A,C,D}, {A,B,D}, {A,B,C}
Level 2: {C,D}, {B,D}, {B,C}, {A,D}, {A,C}, {A,B}
Level 3: {D}, {C}, {B}, {A}
Accuracy and coverage increase with the recursion level.
Training time and prediction time
remain more or less the same.
Therefore, a high value for this
parameter is better for all datasets.
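The levels above can be generated as follows (a sketch: level k contains the subsequences obtained by removing k symbols, keeping the order of the rest; `recursive_divider` is an illustrative name):

```python
from itertools import combinations

def recursive_divider(sequence, max_level):
    """For each level k = 1..max_level, generate the subsequences
    obtained by removing k symbols from `sequence`."""
    n = len(sequence)
    levels = []
    for k in range(1, max_level + 1):
        level = [tuple(sequence[i] for i in kept)
                 for kept in combinations(range(n), n - k)]
        levels.append(level)
    return levels

levels = recursive_divider("ABCD", 3)
print(levels[0])  # level 1: the four subsequences with one symbol removed
```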
Page 46
Optimisation #2 – sequence splitting
Example: splitting sequence {A,B,C,D,E,F,G} with split_length = 5 gives {C,D,E,F,G}
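In Python this splitting is a one-line slice keeping only the sequence's tail (sketch; the function name is illustrative):

```python
def split_sequence(sequence, split_length):
    """Keep only the last `split_length` symbols of a training sequence."""
    return sequence[-split_length:]

print(split_sequence(list("ABCDEFG"), 5))  # ['C', 'D', 'E', 'F', 'G']
```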
Page 47
Conclusion
CPT, a new model for sequence prediction:
◦ allows fast incremental updates,
◦ compresses training sequences,
◦ integrates an indexing mechanism,
◦ two optimizations.
Results:
◦ in general, more accurate than the compared models, but prediction time is greater (a trade-off),
◦ CPT is less than half the size of AKOM,
◦ sequence insertion is more than 3 times faster than with DG and AKOM.
47
Page 48
CPT+: DECREASING THE TIME/SPACE COMPLEXITY OF CPT
48
Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+:
Decreasing the time/space complexity of the Compact Prediction
Tree. Proc. 19th Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD 2015), Springer, LNAI 9078, pp. 625-636.
Page 49
Introduction
Two optimisations to reduce the size
of the tree used by CPT:
◦ compressing frequent substrings,
◦ compressing simple branches.
An optimisation to improve prediction
time and noise tolerance.
49
Page 50
(1) compressing frequent substrings
This strategy is applied during training
◦ it identifies frequent substrings in training
sequences,
◦ it replaces these substrings with new symbols.
Discovering substrings is done with a
modified version of the PrefixSpan
algorithm
◦ parameters: minsup, minLength and maxLength
50
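The replacement step can be sketched as a simple dictionary coder (an illustration only: the substrings are assumed to be already mined, e.g. with the modified PrefixSpan; names and the `<AB>` code format are illustrative):

```python
def compress_substrings(sequences, substrings):
    """Replace each frequent substring (assumed already mined) by a fresh
    single symbol; returns the compressed sequences and the dictionary
    needed to decompress a symbol in O(1)."""
    dictionary = {}
    compressed = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            for sub in substrings:  # try the mined substrings first
                if tuple(seq[i:i + len(sub)]) == tuple(sub):
                    code = dictionary.setdefault(tuple(sub), f"<{''.join(sub)}>")
                    out.append(code)
                    i += len(sub)
                    break
            else:
                out.append(seq[i])
                i += 1
        compressed.append(out)
    return compressed, dictionary

compressed, dictionary = compress_substrings(["ABCA", "ABD"], ["AB"])
print(compressed)  # [['<AB>', 'C', 'A'], ['<AB>', 'D']]
```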
Page 51
51
(1) compressing frequent substrings
Prediction tree
Lookup table
Inverted Index
Page 52
52
(1) compressing frequent substrings
Prediction tree
Lookup table
Inverted Index
Page 53
(1) Compressing frequent substrings (cont’d)
Time complexity:
◦ training: a non-negligible cost to discover
frequent substrings,
◦ prediction: symbols are decompressed on-the-fly
in O(1) time.
Space complexity:
◦ O(m) where m is the number of frequent
substrings.
53
Page 54
(2) Compressing simple branches
A second optimization to reduce the size of
the tree
A simple branch is a branch where all
nodes have a single child.
Each simple branch is replaced by a single
node representing the whole branch.
54
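A minimal post-order sketch of this collapse (illustrative names, not the paper's code): chains in which every node has a single child, down to a leaf, are merged bottom-up into one node holding the whole subsequence.

```python
class Node:
    def __init__(self, symbols):
        self.symbols = symbols  # one symbol, or a whole collapsed branch
        self.children = {}

def insert(root, sequence):
    node = root
    for s in sequence:
        node = node.children.setdefault(s, Node([s]))

def collapse_simple_branches(node):
    """Merge every simple branch (each node has a single child, down to
    a leaf) into one node storing the whole subsequence."""
    for child in node.children.values():
        collapse_simple_branches(child)
        # after recursion, a simple branch below `child` is a single leaf
        while len(child.children) == 1:
            (grand,) = child.children.values()
            if grand.children:  # stop at a branching point
                break
            child.symbols += grand.symbols
            child.children = {}

root = Node([])
for seq in ["ABC", "AB", "ABDC", "BC", "BDE"]:
    insert(root, seq)
collapse_simple_branches(root)
print(root.children["A"].children["B"].children["D"].symbols)  # ['D', 'C']
```

On the running example, the branches D → C (under A → B) and D → E (under B) each collapse into a single node.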
Page 55
55
(2) Compressing simple branches
Prediction tree
Lookup table
Inverted Index
Page 56
56
(2) Compressing simple branches
Prediction tree
Lookup table
Inverted Index
Page 57
57
(2) Compressing simple branches
Prediction tree
Lookup table
Inverted Index
Page 58
(2) Compressing simple branches
Time complexity:
◦ very fast,
◦ after building the tree, we only need to
traverse the branches bottom-up
using the lookup table.
58
Page 59
(3) Improved Noise Reduction
Recall that CPT removes symbols from the sequence to be predicted, to be more noise tolerant.
Improvements:
◦ only remove the least frequent symbols from sequences,
assuming that they are more likely to be noise,
◦ require a minimum number of matching sequences to perform a prediction,
◦ add a new parameter, the noise ratio (e.g. 20%), to determine how many symbols should be removed from a sequence (e.g. the 20% most infrequent symbols).
Thus, the amount of noise is assumed to be proportional to the length of sequences.
59
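The noise-ratio idea can be sketched as follows (an illustration, not the paper's exact procedure; `remove_noise` and the example frequencies are assumptions): drop the fraction of a sequence's symbols with the lowest global frequency.

```python
from collections import Counter

def remove_noise(sequence, frequencies, noise_ratio=0.2):
    """Drop the `noise_ratio` least frequent symbols of a sequence,
    assuming infrequent symbols are the most likely to be noise."""
    k = int(len(sequence) * noise_ratio)
    if k == 0:
        return list(sequence)
    noisiest = sorted(sequence, key=lambda s: frequencies[s])[:k]
    out = list(sequence)
    for s in noisiest:
        out.remove(s)  # remove one occurrence of each noisy symbol
    return out

freq = Counter("ABCABCABX")  # X is rare: a likely noise symbol
print(remove_noise("ABCAX", freq, noise_ratio=0.2))  # ['A', 'B', 'C', 'A']
```

The number of removed symbols grows with the sequence length, matching the assumption that noise is proportional to length.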
Page 60
Experiment
60
Competitor algorithms
DG, TDAG, PPM, LZ78, All-K-Order Markov
Datasets
Page 61
Prediction accuracy
61
CPT+ is also up to 4.5 times faster than CPT
in terms of prediction time
Page 62
Scalability
62
[Figure: model size (in nodes) vs. sequence count; PPM shown for comparison]
Page 63
Conclusion
CPT(+): a novel sequence prediction model:
◦ fast training time,
◦ good scalability,
◦ high prediction accuracy.
Future work:
◦ further compress the model,
◦ compare with other prediction models such as CTW and NN,
◦ data streams, user profiles, …
Open-source library for web prefetching: IPredict
https://github.com/tedgueniche/IPredict/tree/master/src/ca/ipredict
63