It's about time. Choosing Distance Measures for Mining Time Series Data Spencer Schnier 2/22/11.

Mar 29, 2015


Aidan Broyhill
Transcript
Page 1

It's about time. Choosing Distance Measures for Mining Time Series Data

Spencer Schnier 2/22/11

Page 2

Major Time Series Data Mining Tasks

• Indexing

• Clustering

• Classification

• Prediction

• Summarization

• Anomaly Detection

• Segmentation

Indexing and clustering make explicit use of a distance measure. The others make implicit use of a distance measure.

(Ratanamahatana et al., 2010)

Page 3

Popular Distance Measures

• Lock-step Measure (one-to-one)
   o Minkowski Distance
      - L1 norm (Manhattan Distance)
      - L2 norm (Euclidean Distance)
      - L∞ norm (Supremum Distance)

• Elastic Measure (one-to-many/one-to-none)
   o Dynamic Time Warping (DTW)
   o Edit distance based measure
      - Longest Common SubSequence (LCSS)
      - Edit Distance on Real Sequence (EDR)

• Threshold-based Measure
   o Threshold query based similarity search (TQuEST)

• Pattern-based Measure
   o Spatial Assembling Distance (SpADe)

(Ding et al., 2008)

Page 4

Minkowski Distance

h = 1: Manhattan (city block, L1 norm) distance

E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

h = 2: (L2 norm) Euclidean distance

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

h → ∞: “supremum” (Lmax norm, L∞ norm) distance

This is the maximum difference between any component (attribute) of the vectors
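The three cases above can be sketched as one function. This is a minimal illustration, not taken from the slides: the name `minkowski` and the convention of passing `math.inf` for the supremum case are assumptions made here.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors.

    h = 1 gives Manhattan, h = 2 gives Euclidean, and h = math.inf
    (a convention assumed for this sketch) gives the supremum distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        # Limit case: the largest difference in any single component.
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


# On binary vectors the L1 distance counts the differing bits (Hamming).
print(minkowski([1, 0, 1, 1], [0, 0, 1, 0], 1))
print(minkowski([1, 2], [3, 5], 2))
print(minkowski([1, 2], [3, 5], math.inf))
```

Being lock-step, the function only accepts vectors of the same length; each component of one vector is compared with exactly one component of the other.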

Page 5

Minkowski Distance Examples

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

[Figure: scatter plot of the points x1, x2, x3, x4]

Dissimilarity Matrices

Manhattan (L1)

L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)

L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum

L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
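The three dissimilarity matrices can be recomputed from the four points with a short script. This is only a sketch: the helper name `dissimilarity_matrix` and the dictionary layout are illustrative choices, and values are rounded to two decimals to match the slide.

```python
import math

# The four example points (attribute 1, attribute 2) from the slide.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

# One callable per measure; the labels are illustrative.
measures = {
    "Manhattan (L1)": lambda a, b: sum(abs(p - q) for p, q in zip(a, b)),
    "Euclidean (L2)": lambda a, b: math.dist(a, b),
    "Supremum (Linf)": lambda a, b: max(abs(p - q) for p, q in zip(a, b)),
}


def dissimilarity_matrix(points, distance):
    """Return the lower-triangular dissimilarity matrix as a list of rows."""
    names = list(points)
    return [
        [round(distance(points[r], points[c]), 2) for c in names[: i + 1]]
        for i, r in enumerate(names)
    ]


for label, distance in measures.items():
    print(label)
    for name, row in zip(points, dissimilarity_matrix(points, distance)):
        print(" ", name, row)
```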

Page 6

What’s wrong with Euclidean Distance?

Sequences may be similar in shape but shifted and on different scales.

Normalize the time series before measuring the distance between them:

$$x_i' = \frac{x_i - \mu}{\sigma}$$

where μ is the mean and σ the standard deviation of the series.

(Goldin and Kanellakis, 1995)

What if a sequence is stretched or compressed along the time axis?
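A small sketch of the normalization step shows why it helps. The helper names are illustrative, and the use of the population standard deviation is an assumption, since the slide does not say which estimator is meant.

```python
import math
import statistics


def z_normalize(series):
    """Subtract the mean and divide by the standard deviation.

    Uses the population standard deviation (an assumption of this sketch).
    """
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    return [(x - mu) / sigma for x in series]


def euclidean(a, b):
    """Lock-step Euclidean distance between two equal-length series."""
    return math.dist(a, b)


base = [0.0, 1.0, 3.0, 2.0, 0.5]
# Same shape as `base`, but shifted by 10 and scaled by 4.
shifted_scaled = [10.0 + 4.0 * x for x in base]

print(euclidean(base, shifted_scaled))
print(euclidean(z_normalize(base), z_normalize(shifted_scaled)))
```

The raw distance is large even though the two series have the same shape; after normalization it drops to zero up to floating-point error. Normalization does not help when one series is stretched or compressed along the time axis, which is what elastic measures address.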

Page 7

Dynamic Time Warping

• Sequences are similar but accelerate differently along the time axis

• Enforcing a temporal constraint δ on the warping window size can improve both computational efficiency and accuracy

• Application: Speech recognition

(Berndt and Clifford, 1996)
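The idea can be sketched with the usual dynamic-programming formulation of DTW. This is not necessarily the exact formulation in Berndt and Clifford: the squared pointwise cost, the final square root, and the parameter name `window` (playing the role of δ) are assumptions of this sketch.

```python
import math


def dtw_distance(c, q, window=None):
    """DTW distance between sequences c and q.

    `window` limits how far apart the aligned indices i and j may be;
    None means no constraint.
    """
    m, n = len(c), len(q)
    # The band must be at least |m - n| wide or no warping path exists.
    w = max(m, n) if window is None else max(window, abs(m - n))
    # D[i][j] = cost of the best alignment of c[:i] with q[:j].
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = (c[i - 1] - q[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # c[i-1] matched again
                                 D[i][j - 1],      # q[j-1] matched again
                                 D[i - 1][j - 1])  # both advance
    return math.sqrt(D[m][n])


a = [0, 0, 1, 0]
b = [0, 1, 0, 0]  # same bump, one step earlier
print(dtw_distance(a, b))            # unconstrained warping
print(dtw_distance(a, b, window=0))  # window 0 reduces to lock-step matching
```

With `window=0` only the diagonal cells are filled, so for equal-length sequences the result equals the Euclidean distance. A narrow band also means far fewer cells to compute than the full m × n table.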

Page 8

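Longest Common Subsequence Similarity

Dissimilarity:

$$LCSS(C, Q) = \frac{m + n - 2l}{m + n}$$

where m and n are the lengths of C and Q, and l is the length of their longest common subsequence.

Tolerance:

LCS table (rows: elements of C, columns: elements of Q):

        2   5   4   5   3   1   8
   1    0   0   0   0   0   1   1
   2    1   1   1   1   1   1   1
   3    1   1   1   1   2   2   2
   4    1   1   2   2   2   2   2
   5    1   2   2   3   3   3   3
   1    1   2   2   3   3   4   4
   7    1   2   2   3   3   4   4

Longest common subsequence: 2 4 5 1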

• Match two sequences by allowing some elements to be unmatched

• C = {1,2,3,4,5,1,7} and Q = {2,5,4,5,3,1,8}

The longest common subsequence is {2,4,5,1}

• Application: Bioinformatics

(Vlachos et al., 2002)

Page 9

Longest Common Subsequence Similarity

• Input sequences C[1..m] and Q[1..n]

• Compute the LCS between C[1..i] and Q[1..j] for all 1 ≤ i ≤ m and 1 ≤ j ≤ n

• Store its length in L[i,j]

• L[m,n] = length of the LCS

for i := 0..m
    L[i,0] := 0
for j := 0..n
    L[0,j] := 0
for i := 1..m
    for j := 1..n
        if C[i] = Q[j]
            L[i,j] := L[i-1,j-1] + 1
        else
            L[i,j] := max(L[i,j-1], L[i-1,j])
return L[m,n]

(Vlachos et al., 2002)
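A runnable Python version of the dynamic program, together with the dissimilarity (m + n − 2l)/(m + n), might look as follows. The function names and the optional `eps` tolerance parameter are assumptions of this sketch; with the default `eps=0.0`, two elements match only when they are equal, as in the pseudocode.

```python
def lcs_length(c, q, eps=0.0):
    """Length of the longest common subsequence of c and q.

    Two elements match when they differ by at most eps (eps is an
    illustrative tolerance; 0.0 means exact equality).
    """
    m, n = len(c), len(q)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(c[i - 1] - q[j - 1]) <= eps:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])
    return L[m][n]


def lcss_dissimilarity(c, q, eps=0.0):
    """(m + n - 2l) / (m + n); assumes the sequences are not both empty."""
    m, n = len(c), len(q)
    l = lcs_length(c, q, eps)
    return (m + n - 2 * l) / (m + n)


C = [1, 2, 3, 4, 5, 1, 7]
Q = [2, 5, 4, 5, 3, 1, 8]
print(lcs_length(C, Q))          # 4, the length of {2,4,5,1}
print(lcss_dissimilarity(C, Q))  # (7 + 7 - 2*4) / 14
```

The lists are 0-indexed, so `c[i - 1]` plays the role of C[i] in the pseudocode.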

Page 10

Edit Distance on Real Sequence

• Similar to LCSS

• Uses a threshold parameter ε to quantize the distance between a pair of points to 0 or 1

• Seeks the minimum number of edit operations to change one sequence into another

• Assigns penalties to the unmatched segments according to the lengths of the gaps

• Application: Trajectories of moving objects

(Chen et al., 2005)
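The points above can be sketched as an edit-distance recurrence over real values. This is a simplified illustration for one-dimensional series; Chen et al. work with trajectories, and the function name and plain-list input are assumptions made here.

```python
def edr(r, s, eps):
    """Edit Distance on Real sequence for 1-D series (simplified sketch).

    Two points match (cost 0) when they differ by at most eps; otherwise
    a substitution costs 1. Each skipped point (a gap element) costs 1,
    so the penalty for a gap grows with its length.
    """
    m, n = len(r), len(s)
    # D[i][j] = minimum number of edits to turn r[:i] into s[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all i points
    for j in range(n + 1):
        D[0][j] = j  # insert all j points
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subcost = 0 if abs(r[i - 1] - s[j - 1]) <= eps else 1
            D[i][j] = min(D[i - 1][j - 1] + subcost,  # match or substitute
                          D[i - 1][j] + 1,            # gap in s
                          D[i][j - 1] + 1)            # gap in r
    return D[m][n]


print(edr([1.0, 2.0, 3.0], [1.1, 2.1, 3.1], eps=0.25))
print(edr([1.0, 2.0, 3.0], [1.0, 50.0, 3.0], eps=0.25))
```

Because every mismatch is quantized to a cost of 1, a single extreme outlier (50.0 in the second call) contributes no more than any other mismatch.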

Page 11

TQuEST

(Assfalg et al., 2006)

• Uses a threshold parameter τ to transform a time series into a sequence of threshold-crossing intervals (the points within each interval have a value greater than a given τ)

• Each interval is treated as a 2D point: x = starting time, y = ending time

• The similarity between two time series is then defined as the Minkowski sum of the two sequences of time interval points

SpADe

(Chen et al., 2007)

• A pattern-based similarity measure for time series

• Finds matching segments called patterns by allowing shifting and scaling

• Then finds the most similar set of matching patterns

• Disadvantage: requires many parameters (temporal and amplitude scale factor, pattern length, sliding step size, etc.)
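The first step of TQuEST, turning a series into threshold-crossing intervals, can be sketched as follows. Only the interval transformation is shown, not the similarity computation; the function name and the use of sample indices as time stamps are assumptions of this sketch.

```python
def threshold_intervals(series, tau):
    """Return (start, end) index pairs of maximal runs with value > tau.

    Each pair can be read as a 2D point: x = starting time, y = ending time.
    Sample indices stand in for time stamps (an assumption of this sketch).
    """
    intervals = []
    start = None
    for t, value in enumerate(series):
        if value > tau and start is None:
            start = t  # series crosses above the threshold
        elif value <= tau and start is not None:
            intervals.append((start, t - 1))  # run ended at the previous sample
            start = None
    if start is not None:
        # The series ends while still above the threshold.
        intervals.append((start, len(series) - 1))
    return intervals


print(threshold_intervals([0, 2, 3, 1, 0, 4, 5, 0], tau=1.5))
```

For the example series the values above 1.5 sit at indices 1 to 2 and 5 to 6, so two interval points are produced.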

Page 12

Comparison of Distance Measures

(Ding et al., 2008)

Page 13

Comparison of Distance Measures

1. The accuracy of elastic measures converges with that of Euclidean distance as the training set grows. On small data sets, elastic measures can be significantly more accurate than lock-step measures.

2. Constraining the warping window size for elastic measures can reduce the computation cost and increase accuracy.

3. The accuracy of edit distance based similarity measures is very close to that of DTW. Only EDR is potentially slightly better than DTW.

4. The accuracy of several new similarity measures, such as TQuEST and SpADe, is in general inferior to elastic measures.

5. To improve accuracy of a similarity measure, get more training data.

6. If you can’t get more data, trying the other measures might help; however, be careful to avoid overfitting.

(Ding et al., 2008)

Page 14

ELKI 0.2

(Achtert et al., 2009)

• Software for visualization and performance evaluation of distance measures for time series

www.dbs.ifi.lmu.de/research/KDD/ELKI/

Page 15

Research Questions

Is distance measure performance related to some intrinsic properties of the data set?

If so, can those properties be used to identify the most appropriate distance measure?

Page 16

References

• Achtert, E., T. Bernecker, H.-P. Kriegel, E. Schubert, and A. Zimek. 2009. “ELKI in Time: ELKI 0.2 for the Performance Evaluation of Distance Measures for Time Series.” SSTD 2009.

• Aßfalg, J., H.-P. Kriegel, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz. 2006. “Similarity search on time series based on threshold queries.” EDBT 2006.

• Berndt, D., and J. Clifford. 1996. “Finding Patterns in Time Series: A Dynamic Programming Approach.” Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Menlo Park, CA. pp. 229-248.

• Chen, L., M. Ozsu, and V. Oria. 2005. “Robust and fast similarity search for moving object trajectories.” SIGMOD ’05.

• Chen, Y., M. Nascimento, B. Ooi, and A. Tung. 2007. “SpADe: On Shape-based Pattern Detection in Streaming Time Series.” ICDE 2007.

• Ding, H., G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. 2008. “Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures.” VLDB ’08.

• Goldin, D., and P. Kanellakis. 1995. “On Similarity Queries for Time-Series Data: Constraint Specification and Implementation.” Proceedings of the 1st International Conference on the Principles and Practice of Constraint Programming. pp. 137-153.

• Ratanamahatana, C., J. Lin, D. Gunopulos, E. Keogh, M. Vlachos, and G. Das. 2010. “Mining Time Series Data.” Data Mining and Knowledge Discovery Handbook. Part 6, pp. 1049-1077.

• Vlachos, M., D. Gunopulos, and G. Kollios. 2002. “Discovering similar multidimensional trajectories.” ICDE, 2002.