Page 1: CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU christos@cs.cmu.edu.

Telcordia 2003 C. Faloutsos 2

Outline

• Problem definition - motivation

• Linear forecasting - AR and AWSOM

• Coevolving series - MUSCLES

• Fractal forecasting - F4

• Other projects: graph modeling, outliers, etc.

Problem definition

• Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …; …)

• Find

– forecasts; patterns

– clusters; outliers

Motivation - Applications

• Financial, sales, economic series

• Medical

– ECGs +; blood pressure etc monitoring

– reactions to new drugs

– elderly care

Motivation - Applications (cont’d)

• ‘Smart house’

– sensors monitor temperature, humidity, air quality

• video surveillance

Motivation - Applications (cont’d)

• civil/automobile infrastructure

– bridge vibrations [Oppenheim+02]

– road conditions / traffic monitoring

Stream Data: automobile traffic

[Figure: automobile traffic; # cars (0 to 2000) vs. time]

Motivation - Applications (cont’d)

• Weather, environment/anti-pollution

– volcano monitoring

– air/water pollutant monitoring

Stream Data: Sunspots

[Figure: sunspot data; # sunspots per month (0 to 300) vs. time]

Motivation - Applications (cont’d)

• Computer systems

– ‘Active Disks’ (buffering, prefetching)

– web servers (ditto)

– network traffic monitoring

– ...

Stream Data: Disk accesses

[Figure: disk accesses; #bytes vs. time]

Settings & Applications

• One or more sensors, collecting time-series data

Settings & Applications

Each sensor collects data (x1, x2, …, xt, …)

Settings & Applications

Sensors ‘report’ to a central site

Settings & Applications

Problem #1: Finding patterns in a single time sequence

Settings & Applications

Problem #2: Finding patterns in many time sequences

Problem #1:

Goal: given a signal (e.g., #packets over time)

Find: patterns, periodicities, and/or compress

[Figure: count of lynx caught per year vs. year (similarly: packets per day; temperature per day)]

Problem #1’: Forecast

Given xt, xt-1, …, forecast xt+1

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the next value is marked ‘??’]

Problem #2:

• Given: A set of correlated time sequences

• Forecast ‘Sent(t)’

[Figure: number of packets (0 to 90) vs. time tick (1 to 11); three series: sent, lost, repeated]

Differences from DSP/Stat

• Semi-infinite streams

– we need on-line, ‘any-time’ algorithms

• Cannot afford human intervention

– need automatic methods

• Sensors have limited memory / processing / transmitting power

– need for (lossy) compression

Important observations

Patterns, rules, compression and forecasting are closely related:

• To do forecasting, we need

– to find patterns/rules

• Good rules help us compress

• To find outliers, we need to have forecasts

– (outlier = too far away from our forecast)

Pictorial outline of the talk

              Linear       Non-linear
1 time seq.   AR, AWSOM    F4
Many t.s.     MUSCLES

Outline

• Problem definition - motivation

• Linear forecasting

– AR

– AWSOM

• Coevolving series - MUSCLES

• Fractal forecasting - F4

• Other projects

– graph modeling, outliers etc

              Linear       Non-linear
1 time seq.   AR, AWSOM    F4
Many t.s.     MUSCLES

Mini intro to A.R.

Forecasting

"Prediction is very difficult, especially about the future." - Niels Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

Problem #1’: Forecast

• Example: given xt-1, xt-2, …, forecast xt

[Figure: number of packets sent (0 to 90) vs. time tick (1 to 11); the next value is marked ‘??’]

Linear Regression: idea

patient   weight   height
1         27       43
2         43       54
3         54       72
…         …        …
N         25       ??

[Figure: scatter plot of body height (40 to 85) vs. body weight (15 to 45)]

• express what we don’t know (= ‘dependent variable’)

• as a linear function of what we know (= ‘indep. variable(s)’)

Linear Auto Regression:

Time   Packets Sent(t-1)   Packets Sent(t)
1      -                   43
2      43                  54
3      54                  72
…      …                   …
N      25                  ??

Problem #1’: Forecast

• Solution: try to express xt as a linear function of the past: xt-1, xt-2, …, (up to a window of w)

[Figure: values (0 to 90) vs. time tick (1 to 11); the next value is marked ‘??’]

Formally:

$x_t \approx a_1 x_{t-1} + \dots + a_w x_{t-w} + \text{noise}$
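A minimal sketch of the windowed linear model on this slide, assuming NumPy is available. The names `fit_ar` and `forecast_next` are hypothetical and do not come from the talk; the fit is plain batch least squares over every window of length w.

```python
import numpy as np

def fit_ar(x, w):
    """Least-squares fit of x[t] ~ a[0]*x[t-1] + ... + a[w-1]*x[t-w]."""
    x = np.asarray(x, dtype=float)
    # Column i holds x[t-1-i] for t = w .. len(x)-1 (most recent lag first).
    X = np.column_stack([x[w - i - 1 : len(x) - i - 1] for i in range(w)])
    y = x[w:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def forecast_next(x, a):
    """One-step-ahead forecast from the last len(a) values of x."""
    w = len(a)
    recent = np.asarray(x[-w:], dtype=float)[::-1]  # x[t-1], x[t-2], ...
    return float(recent @ a)
```

With w = 1 this reduces to fitting a line through the lag plot; with w > 1 it fits a hyper-plane.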

Linear Auto Regression:

Time   Packets Sent(t-1)   Packets Sent(t)
1      -                   43
2      43                  54
3      54                  72
…      …                   …
N      25                  ??

[Figure: ‘lag-plot’; number of packets sent (t), 40 to 85, vs. number of packets sent (t-1), 15 to 45]

• lag w=1

• Dependent variable = # of packets sent (S[t])

• Independent variable = # of packets sent (S[t-1])

More details:

• Q1: Can it work with window w>1?

• A1: YES!

[Figure: 3-d lag plot with axes xt-2, xt-1, xt]

More details:

• Q1: Can it work with window w>1?

• A1: YES! (we’ll fit a hyper-plane, then!)

[Figure: 3-d lag plot with axes xt-2, xt-1, xt]

Even more details

• Q2: Can we estimate the coefficients a incrementally?

• A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)

• Q3: can we ‘down-weight’ older samples?

• A3: yes (RLS does that easily!)
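One way the incremental update with down-weighting could look, as a sketch assuming NumPy. The class name `RLS`, the forgetting factor `lam`, and the initial scale `delta` are illustrative choices, not taken from the talk or from [Yi+00]. Each update costs O(w^2) and needs no pass over older samples.

```python
import numpy as np

class RLS:
    """Recursive Least Squares with exponential forgetting (sketch)."""

    def __init__(self, w, lam=1.0, delta=1e6):
        self.a = np.zeros(w)        # current coefficient estimate
        self.P = delta * np.eye(w)  # running inverse of the weighted Gram matrix
        self.lam = lam              # lam < 1 down-weights older samples

    def update(self, x, y):
        """Fold in one sample: x is the vector of w past values, y the new value."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)            # gain vector
        self.a = self.a + k * (y - x @ self.a)  # correct by the prediction error
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.a
```

Setting `lam=1.0` weights all samples equally; a value slightly below 1 (say 0.98, an arbitrary example) lets the model track slow drift.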

How to choose ‘w’?

• goal: capture arbitrary periodicities

• with NO human intervention

• on a semi-infinite stream

$x_t \approx a_1 x_{t-1} + \dots + a_w x_{t-w} + \text{noise}$

Outline

• Problem definition - motivation

• Linear forecasting

– AR

– AWSOM

• Coevolving series - MUSCLES

• Fractal forecasting - F4

• Other projects

– graph modeling, outliers etc

              Linear       Non-linear
1 time seq.   AR, AWSOM    F4
Many t.s.     MUSCLES

Problem:

• in a train of spikes (128 ticks apart)

• any AR with window w < 128 will fail

What to do, then?

[Figure: ‘Impulse 128’; values from -50 to 250]

Answer (intuition)

• Do a Wavelet transform (~ short window DFT)

• look for patterns in every frequency

Intuition

• Why NOT use the short window Fourier transform (SWFT)?

• A: how short should the window be?

[Figure: time vs. freq plane with window width w’; ‘Impulse 128’ signal, values from -50 to 250]

wavelets

[Figure: ‘Impulse 128’ signal, values from -50 to 250, with its time (t) vs. frequency (f) plane]

• main idea: variable-length window!

Advantages of Wavelets

• Better compression (better RMSE with same number of coefficients - used in JPEG-2000)

• fast to compute (usually: O(n)!)

• very good for ‘spikes’

• mammalian eye and ear: Gabor wavelets
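To make the O(n) claim concrete, here is a small sketch of the simplest wavelet, the Haar transform, in plain Python. It uses unnormalized averages and differences (an assumption made for readability; an orthonormal version would scale by the square root of 2 instead), and the function name `haar_dwt` is hypothetical.

```python
def haar_dwt(x):
    """Full Haar decomposition of a sequence whose length is a power of 2.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(x)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of 2"
    s = [float(v) for v in x]
    levels = []  # detail coefficients, finest level first
    while len(s) > 1:
        # Each level halves the data: n/2 + n/4 + ... = O(n) work in total.
        avg = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        dif = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        levels.append(dif)
        s = avg
    return s + [d for level in reversed(levels) for d in level]
```

A single spike yields only about lg n non-zero coefficients (one per level, plus the overall average), which is one reading of the ‘very good for spikes’ bullet.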

Wavelets - intuition:

• Q: baritone/silence/soprano - DWT?

[Figure: signal (value vs. time) and its time (t) vs. frequency (f) plane]

Wavelets - intuition:

• Q: baritone/soprano - DWT?

[Figure: signal (value vs. time) and its time (t) vs. frequency (f) plane]

AWSOM

[Figure: the signal xt and its wavelet coefficients over time, arranged by frequency: W1,1 … W1,4; W2,1, W2,2; W3,1; V4,1]

AWSOM - idea

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

More details…

• Update of wavelet coefficients (incremental)

• Update of linear models (incremental; RLS)

• Feature selection (single-pass)

– Not all correlations are significant

– Throw away the insignificant ones (“noise”)

Results - Synthetic data

• Triangle pulse

• Mix (sine + square)

• AR captures wrong trend (or none)

• Seasonal AR estimation fails

[Figure panels: AWSOM, AR, Seasonal AR]

Results - Real data

• Automobile traffic

– Daily periodicity

– Bursty “noise” at smaller scales

• AR fails to capture any trend

• Seasonal AR estimation fails

Results - real data

• Sunspot intensity

– Slightly time-varying “period”

• AR captures wrong trend

• Seasonal ARIMA

– wrong downward trend, despite help by human!

Complexity

• Model update

Space: $O(\lg N + m k^2) \approx O(\lg N)$

Time: $O(k^2) \approx O(1)$

• Where

– N: number of points (so far)

– k: number of regression coefficients; fixed

– m: number of linear models; $O(\lg N)$

Conclusions

• AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)

Outline

• Problem definition - motivation

• Linear forecasting

– AR

– AWSOM

• Coevolving series - MUSCLES

• Fractal forecasting - F4

• Other projects

– graph modeling, outliers etc

              Linear       Non-linear
1 time seq.   AR, AWSOM    F4
Many t.s.     MUSCLES

Co-Evolving Time Sequences

• Given: A set of correlated time sequences

• Forecast ‘Repeated(t)’

[Figure: number of packets (0 to 90) vs. time tick (1 to 11); three series: sent, lost, repeated; the next value is marked ‘??’]

Solution:

Q: what should we do?

Solution:

Least Squares, with

• Dep. Variable: Repeated(t)

• Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1), ...

• (named: ‘MUSCLES’ [Yi+00])
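The setup on this slide can be sketched as an ordinary regression over lagged copies of all sequences. This is a batch illustration assuming NumPy, not the incremental method of [Yi+00]; the function names and the dictionary layout are hypothetical.

```python
import numpy as np

def lagged_design(series, target, w):
    """Build (X, y) for predicting series[target] at time t.

    Each row of X holds lags 1..w of every sequence (the target included),
    with sequences taken in sorted-name order.
    """
    names = sorted(series)
    n = len(series[target])
    rows = [[series[s][t - j] for s in names for j in range(1, w + 1)]
            for t in range(w, n)]
    y = np.asarray(series[target][w:], dtype=float)
    return np.array(rows, dtype=float), y

def fit_muscles(series, target, w):
    """Least-squares coefficients, one per (sequence, lag) pair."""
    X, y = lagged_design(series, target, w)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

For example, `fit_muscles({"sent": s, "lost": l, "repeated": r}, "repeated", w=3)` would return nine coefficients, ordered lost, repeated, sent, each with lags 1 to 3.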

Examples - Experiments

• Datasets

– Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)

– AT&T WorldNet internet usage (several data streams; 980 time-ticks)

• Measures of success

– Accuracy: Root Mean Square Error (RMSE)

Accuracy - “Modem”

MUSCLES outperforms AR & “yesterday”

[Figure: RMSE (0 to 4) per modem (1 to 14) for AR, ‘yesterday’, and MUSCLES]

Accuracy - “Internet”

[Figure: RMSE (0 to 1.4) per stream (1 to 15) for AR, ‘yesterday’, and MUSCLES]

MUSCLES consistently outperforms AR & “yesterday”

Outline

• Problem definition - motivation

• Linear forecasting

– AR

– AWSOM

• Coevolving series - MUSCLES

• Fractal forecasting - F4

• Other projects

– graph modeling, outliers etc

              Linear       Non-linear
1 time seq.   AR, AWSOM    F4
Many t.s.     MUSCLES

Detailed Outline

• Non-linear forecasting

– Problem

– Idea

– How-to

– Experiments

– Conclusions

Recall: Problem #1

Given a time series {xt}, predict its future course, that is, xt+1, xt+2, ...

[Figure: value vs. time]

How to forecast?

• ARIMA - but: linearity assumption

• ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer94]

General Intuition (Lag Plot)

[Figure: lag plot of xt vs. xt-1 (lag = 1), showing a new point and its k = 4 nearest neighbors (4-NN)]

Interpolate these… to get the final prediction
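The lag-plot idea can be sketched in a few lines of plain Python. This version interpolates with a plain average of the neighbors' successors, which is only one of several possible choices; the function name `knn_forecast` and its defaults are hypothetical.

```python
def knn_forecast(x, lag=1, k=4):
    """Forecast the next value of x from its k nearest delay vectors.

    Every past window of `lag` values is compared with the most recent
    window; the values that followed the k closest windows are averaged.
    """
    query = x[-lag:]
    candidates = []
    for t in range(lag, len(x)):
        window = x[t - lag:t]
        dist = sum((a - b) ** 2 for a, b in zip(window, query))
        candidates.append((dist, x[t]))  # (squared distance, successor)
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(nxt for _, nxt in nearest) / len(nearest)
```

A linear scan over all windows is used here for clarity; a spatial index would be the natural replacement on long streams.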

Questions:

• Q1: How to choose lag L?

• Q2: How to choose k (the # of NN)?

• Q3: How to interpolate?

• Q4: why should this work at all?

Q1: Choosing lag L

• Manually (16, in award winning system by [Sauer94])

• Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

Fractal Dimensions

• FD = intrinsic dimensionality

Embedding dimensionality = 3

Intrinsic dimensionality = 1

Fractal Dimensions

• FD = intrinsic dimensionality

[Figure: log(# pairs) vs. log(r)]
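One common way to measure intrinsic dimensionality from such a plot is the slope of log(# pairs within distance r) against log(r), often called the correlation dimension. The sketch below assumes NumPy and computes all pairwise distances by brute force, so it only suits small point sets; `correlation_dimension` and the choice of radii are illustrative, and the slides do not say which estimator the authors use.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Slope of log(#pairs within r) vs. log(r), fitted by least squares."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(pts), k=1)  # count each unordered pair once
    d = dist[upper]
    counts = np.array([(d <= r).sum() for r in radii])
    # Every radius must capture at least one pair, or the log is undefined.
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(slope)
```

Points along a straight line embedded in 3-d should give a slope near 1, matching the ‘embedding dimensionality = 3, intrinsic dimensionality = 1’ example.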

Intuition

The Logistic Parabola: xt = a xt-1 (1 - xt-1) + noise

[Figure: x(t) vs. time]

• Its lag plot for lag = 1

[Figure: X(t) vs. X(t-1)]

Intuition

[Figure: lag plots with axes x(t-2), x(t-1), and x(t)]

Intuition

• The FD vs L plot does flatten out

• L(opt) = 1

[Figure: fractal dimension vs. lag]

Proposed Method

• Use Fractal Dimensions to find the optimal lag length L(opt)

[Figure: fractal dimension vs. lag (L); annotations: ‘Choose this’, ‘epsilon’]
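A possible reading of the ‘choose this’ rule, sketched in plain Python: pick the smallest lag after which the fractal dimension no longer grows by more than epsilon. The exact stopping rule in [Chakrabarti+02] may differ; `choose_lag` and the default `eps` are assumptions made for illustration.

```python
def choose_lag(fd_by_lag, eps=0.05):
    """Return the smallest lag at which the FD curve has flattened out.

    fd_by_lag[i] is the fractal dimension measured with lag i + 1.
    """
    last = len(fd_by_lag) - 1
    for i in range(last):
        # Flat from lag i+1 onward: every later increase stays below eps.
        if all(fd_by_lag[j + 1] - fd_by_lag[j] < eps for j in range(i, last)):
            return i + 1
    return len(fd_by_lag)
```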

Q2: Choosing number of neighbors k

• Manually (typically ~ 1-10)

Q3: How to interpolate?

How do we interpolate between the k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop with distance - how?)

Q3: How to interpolate?

A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot of xt vs. xt-1]

Q4: Any theory behind it?

A4: YES!

Theoretical foundation

• Based on the “Takens’ Theorem” [Takens81]

• which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

Theoretical foundation

Example: Lotka-Volterra equations

dH/dt = r H - a H*P

dP/dt = b H*P - m P

H is count of prey (e.g., hare)

P is count of predators (e.g., lynx)

Suppose only P(t) is observed (t=1, 2, …).

[Figure: P vs. H]

Theoretical foundation

• But the delay vector space is a faithful reconstruction of the internal system state

• So prediction in delay vector space is as good as prediction in state space

[Figure: state space (P vs. H) and delay vector space (P(t) vs. P(t-1))]

Detailed Outline

• Non-linear forecasting

– Problem

– Idea

– How-to

– Experiments

– Conclusions

Datasets

Logistic Parabola: xt = a xt-1 (1 - xt-1) + noise

Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and its lag-plot]

Datasets

Logistic Parabola: xt = a xt-1 (1 - xt-1) + noise

Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and its lag-plot]

ARIMA: fails

Logistic Parabola

[Figure: value vs. timesteps; annotation: ‘Our Prediction from here’]

Logistic Parabola

[Figure: value vs. timesteps; comparison of prediction to correct values]

Datasets

LORENZ: Models convection currents in the air

dx/dt = a (y - x)

dy/dt = x (b - z) - y

dz/dt = xy - c z

[Figure: plot of the series; y-axis: value]

LORENZ

[Figure: value vs. timesteps; comparison of prediction to correct values]

Datasets

• LASER: fluctuations in a laser over time (used in Santa Fe competition)

[Figure: value vs. time]

Laser

[Figure: value vs. timesteps; comparison of prediction to correct values]

Conclusions

• Lag plots for non-linear forecasting (Takens’ theorem)

• suitable for ‘chaotic’ signals

Additional projects at CMU

• Graph/Network mining

• spatio-temporal mining - outliers

Graph/network mining

• Internet; web; gnutella P2P networks

• Q: Any pattern?

• Q: how to generate ‘realistic’ topologies?

• Q: how to define/verify realism?

Patterns?

• avg degree is, say, 3.3

• pick a node at random - what is the degree you expect it to have?

[Figure: count vs. degree; avg: 3.3]

Patterns?

• avg degree is, say, 3.3

• pick a node at random - what is the degree you expect it to have?

• A: 1!!

[Figure: count vs. degree; avg: 3.3]

Patterns?

• A: Power laws!

[Figure: log(count) vs. log {(out) degree}]

Other ‘laws’?

[Figure panels: Count vs Outdegree; Count vs Indegree; Hop-plot; Eigenvalue vs Rank; “Network value”; Stress; Effective Diameter]

RMAT, to generate realistic graphs

[Figure panels: Count vs Outdegree; Count vs Indegree; Hop-plot; Eigenvalue vs Rank; “Network value”; Stress; Effective Diameter]

Epidemic threshold?

• on a real graph, will a (computer / biological) virus die out? (given

– beta: probability that an infected node will infect its neighbor and

– delta: probability that an infected node will recover)

[Figure panels: NO, MAYBE, YES]

Epidemic threshold?

• on a real graph, will a (computer / biological) virus die out? (given

– beta: probability that an infected node will infect its neighbor and

– delta: probability that an infected node will recover)

• A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
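The slide only says the answer depends on the largest eigenvalue. As a sketch, the code below assumes the condition usually attributed to [Wang+03], namely that the infection dies out when beta/delta is below 1/lambda_1, where lambda_1 is the largest eigenvalue of the adjacency matrix. The function names are hypothetical, and the eigenvalue is computed by shifted power iteration in plain Python for an undirected graph given as a 0/1 matrix.

```python
def largest_eigenvalue(adj, iters=200):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix (list of lists)."""
    n = len(adj)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        # Iterate with (A + I); the shift avoids oscillation on bipartite graphs.
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [c / lam for c in w]
    return lam - 1.0

def dies_out(adj, beta, delta):
    """True if beta/delta is below the assumed threshold 1/lambda_1."""
    lam1 = largest_eigenvalue(adj)
    return lam1 == 0 or beta / delta < 1.0 / lam1
```

For a triangle (lambda_1 = 2), `dies_out(tri, 0.1, 0.5)` holds because 0.2 < 0.5, while `dies_out(tri, 0.4, 0.5)` does not.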

Additional projects

• Graph mining

• spatio-temporal mining - outliers

Outliers - ‘LOCI’

Outliers - ‘LOCI’

• finds outliers quickly,

• with no human intervention

Conclusions

• AWSOM for automatic, linear forecasting

• MUSCLES for co-evolving sequences

• F4 for non-linear forecasting

• Graph/Network topology: power laws and generators; epidemic threshold

• LOCI for outlier detection

Conclusions

• Overarching theme: automatic discovery of patterns (outliers/rules) in

– time sequences (sensors/streams)

– graphs (computer/social networks)

– multimedia (video, motion capture data etc)

www.cs.cmu.edu/[email protected]

Books

• William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)

• C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

Books

• George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)

• Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

Resources: software and urls

• MUSCLES: Prof. Byoung-Kee Yi: http://www.postech.ac.kr/~bkyi/ or [email protected]

• AWSOM & LOCI: [email protected]

• F4, RMAT: [email protected]

Additional Reading

• [Chakrabarti+02] Deepayan Chakrabarti and Christos Faloutsos, F4: Large-Scale Automated Forecasting using Fractals, CIKM 2002, Washington DC, Nov. 2002.

• [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172

• [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

Additional Reading

• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003

• Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos LOCI: Fast Outlier Detection Using the Local Correlation Integral ICDE 2003, Bangalore, India, March 5 - March 8, 2003.

• Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

Additional Reading

• Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.

• Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint 22nd Symposium on Reliable Distributed Computing (SRDS2003) Florence, Italy, Oct. 6-8, 2003

Additional Reading

• Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)

• [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)