In Search of Meaning for Time Series Subsequence Clustering
Dina Goldin, Brown University
work done with Ricardo Mardales, UConn, and George Nagy, RPI
CIKM, Nov. 8, 2006
CIKM’06, November 8, 2006
The “Meaningless” Paper
[KLT03] Keogh, E., Lin, J., Truppel, W. Clustering of Time Series is meaningless. Proc. IEEE Conf. on Data Mining (2003)
[KL05] Keogh, E. & Lin, J. Clustering of time-series subsequences is meaningless: implications for previous and future research. J. Knowledge and Inf. Sys. 8:2 (2005)
Clustering of time series subsequences is meaningless [because] the result of clustering these subsequences
is independent of the input.
Implications of “Meaningless” Result
- It “cast a shadow over STS clustering”.
- Jeopardized the legitimacy of research that had used subsequence clustering.
- Led to a flurry of follow-up research:
  - Chen ’05 uses cyclical data and k-medoids
  - Simon et al. ’05 use self-organizing maps
  - Denton ’04 uses density-based clustering
  - Struzik ’03 uses correlation for trivial matches
  - Bagnall ’03, Mahoney ’05, Rodrigues et al. ’04 moved away from STS
- No one had challenged the results head-on, i.e., shown that the output and input of STS clustering are not independent.
Independence of Input and Output
[Diagram: time series X, Y → STS clustering algorithm → cluster sets A, C, B]
- Is there a way to match C to the right time series (X or Y) reliably?
- Before: NO; cluster_set_dist(C,B) / cluster_set_dist(C,A) is not small.
- Our work: YES. Find a different distance measure!
Outline
1. Introduction
2. New Distance Measure for Cluster Sets based on the notion of cluster shapes
3. STS Cluster Matching
4. Observations and Conclusions
STS Clustering
Consider all subsequences of length w (the window size) of a single time series T of length m.
Normalize each subsequence so its average is 0 and its std. deviation is 1: Normalize(x) = (x - avg(x)) / stddev(x)
Cluster the normalized subsequences using the K-means clustering algorithm.
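The first two steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are made up for this example, the population standard deviation is assumed, and mapping a constant window to all zeros is an assumption the slides do not address.

```python
import math

def z_normalize(x):
    """Return a copy of x with mean 0 and (population) standard deviation 1."""
    n = len(x)
    avg = sum(x) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in x) / n)
    if std == 0:
        # Assumption: a constant window carries no shape, so map it to zeros.
        return [0.0] * n
    return [(v - avg) / std for v in x]

def sts_subsequences(series, w):
    """All m - w + 1 sliding windows of length w, each z-normalized."""
    m = len(series)
    return [z_normalize(series[i:i + w]) for i in range(m - w + 1)]

T = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
windows = sts_subsequences(T, w=3)
print(len(windows))  # 4 windows for m = 6, w = 3
```

Each normalized window is then treated as one point of dimension w in the clustering step.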
K-means Clustering
Given a set of multidimensional points (of dimension w), partition it into K groups, so that each point belongs to exactly one cluster.
Compute the center of each cluster; it is the mean of all points in the cluster.
Result: a set of K cluster centers
[Figure: cluster centers]
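A minimal sketch of K-means (Lloyd's iteration), assuming random initial centers drawn from the data and a fixed iteration cap. The `seed` and `max_iter` parameters and the handling of empty clusters are choices made for this example, not details taken from the slides.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return k cluster centers for a list of equal-length points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its group.
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty group
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
centers = kmeans(points, k=2)
print(sorted(centers))
```

On these two well-separated groups the centers settle on the two group means; on real STS data different seeds can give different center sets, which is why the later slides average over many runs.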
Cluster Set Distance
- Previous approach to measuring distance between cluster sets
- Returns the sum of Euclidean distances between cluster centers
[Figure: cluster sets A and B, illustrating cluster_set_dist(B,A)]
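A minimal sketch of this measure. The slide does not say how centers are paired, so the sketch assumes each center of the first set is paired with its nearest center in the second set; treat that pairing as an assumption.

```python
import math

def cluster_set_dist(A, B):
    """Sum, over centers a in A, of the distance from a to its nearest center in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A)

A = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
B = [[10.0, 10.0], [14.0, 10.0], [10.0, 13.0]]  # A shifted by (10, 10)

print(cluster_set_dist(A, A))  # 0.0
print(cluster_set_dist(B, A))  # large, although B is only a shifted copy of A
```

Because the measure depends on the absolute positions of the centers, a cluster set and a shifted copy of it end up far apart.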
Cluster Shape Distance
- New distance measure for cluster sets
- Returns the Euclidean distance between cluster set shapes
- Cluster set shape: sorted list of pairwise distances between cluster centers; has K*(K-1)/2 values
[Figure: cluster set A with centers X, Y, Z, and cluster set B]
Shape of cluster set A = sorted [XZ, ZY, XY]
A and B have the same shape (B is a rotated and translated copy of A), so cluster_shape_dist(A,B) = 0
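A minimal sketch of the shape and the shape distance as defined above, using a made-up 3-4-5 triangle of centers. The function names mirror the slide; the data is hypothetical.

```python
import math
from itertools import combinations

def cluster_shape(centers):
    """Sorted list of the K*(K-1)/2 pairwise distances between centers."""
    return sorted(math.dist(p, q) for p, q in combinations(centers, 2))

def cluster_shape_dist(A, B):
    """Euclidean distance between the shapes of two cluster sets of equal K."""
    return math.dist(cluster_shape(A), cluster_shape(B))

A = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
# B is A rotated by 90 degrees and then translated by (10, 10).
B = [[10.0, 10.0], [10.0, 14.0], [7.0, 10.0]]

print(cluster_shape(A))          # side lengths of the triangle, sorted
print(cluster_shape_dist(A, B))  # zero up to floating-point rounding
```

Rotations and translations leave pairwise distances unchanged, so the shape only captures the relative positions of the centers.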
Cluster Shape Example
STS clustering for ocean series with K=3
Note: all our datasets come from the UC Riverside repository

w   seeds (k=3)    D12     D13     D23     Sum      Average
8   [226;549;82]   2.7771  4.4644  2.4942  9.7357   3.2452
8   [902;7;171]    4.4855  2.7416  2.9164  10.1435  3.3811
8   [525;751;820]  4.4928  2.5958  2.8388  9.9274   3.3091
16  [226;549;82]   5.1477  6.5741  3.1317  14.8535  4.9511
16  [902;7;171]    6.6168  3.6607  4.8998  15.1773  5.0591
16  [525;751;820]  6.5801  3.2478  5.0325  14.8604  4.9534
32  [226;549;82]   8.3883  9.2677  3.7964  21.4524  7.1508
32  [902;7;171]    6.9156  9.4574  5.9889  22.3619  7.4539
32  [525;751;820]  9.2988  4.9502  7.4518  21.7008  7.2336

D’s: pairwise distances between cluster centers
Cluster Structure
Sort the pairwise distances
Observation: for each K and w, the shapes obtained from different STS clustering runs are similar!
Cluster structure of T: the average of the cluster set shapes from many clustering runs over T.

w   seeds (k=3)    ∆1      ∆2      ∆3      Sum      Average
8   [226;549;82]   2.4942  2.7771  4.4644  9.7357   3.2452
8   [902;7;171]    2.7416  2.9164  4.4855  10.1435  3.3811
8   [525;751;820]  2.5958  2.8388  4.4928  9.9274   3.3091
16  [226;549;82]   3.1317  5.1477  6.5741  14.8535  4.9511
16  [902;7;171]    3.6607  4.8998  6.6168  15.1773  5.0591
16  [525;751;820]  3.2478  5.0325  6.5801  14.8604  4.9534
32  [226;549;82]   3.7964  8.3883  9.2677  21.4524  7.1508
32  [902;7;171]    5.9889  6.9156  9.4574  22.3619  7.4539
32  [525;751;820]  4.9502  7.4518  9.2988  21.7008  7.2336
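As a small illustration of the definition, the structure can be computed as the element-wise average of the sorted shapes. The sketch below averages only the three w = 8 ocean runs from the table above; the function name is chosen for this example.

```python
def cluster_structure(shapes):
    """Element-wise average of equally long, sorted shape vectors."""
    n = len(shapes)
    return [sum(col) / n for col in zip(*shapes)]

# Sorted shapes of the three ocean runs with w = 8 (K = 3).
ocean_w8 = [
    [2.4942, 2.7771, 4.4644],
    [2.7416, 2.9164, 4.4855],
    [2.5958, 2.8388, 4.4928],
]
structure = cluster_structure(ocean_w8)
print([round(v, 4) for v in structure])
```

With only three runs the result is a rough estimate; a structure averaged over more runs will generally differ somewhat.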
Cluster Structure: Example
Cluster structures for datasets from the UCR repository

k=3, w=8
data    ∆1      ∆2      ∆3
ocean   2.3598  3.0464  4.4583
packet  2.1712  2.2315  2.3619
soil    1.9434  2.0073  2.0582
sp      2.5302  2.9574  3.774
tide    2.7175  3.3878  3.705

k=3, w=16
data    ∆1      ∆2      ∆3
ocean   3.3018  5.1779  6.5902
packet  2.3881  2.4878  2.6392
soil    2.0046  2.3572  2.5495
sp      3.624   4.121   5.5512
tide    3.7356  4.2792  4.6382

k=4, w=8
data    ∆1      ∆2      ∆3      ∆4      ∆5      ∆6
ocean   2.4787  2.5844  2.9337  2.9778  3.2012  4.7337
packet  2.2235  2.2518  2.3438  2.5314  2.5647  2.6129
soil    2.0568  2.0874  2.1464  2.1803  2.2105  2.2533
sp      2.1239  2.465   2.9448  3.17    3.4565  4.1346
tide    2.4059  2.4585  3.3746  3.4247  4.0473  4.1776
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching
4. Observations and Conclusions
STS Cluster Matching Problem
Given a dataset of multiple time series and a cluster center set from one of them (“query”), match it to the series that produced it.
Note: K and w are assumed to be fixed.
Matching algorithm: outputs a guess as to which of the N time series in the dataset produced the query.
Algorithm accuracy: percentage of times that the matching algorithm is correct.
Note: no previous work succeeded in attaining high accuracy, even with a dataset of size 2!
Matching Algorithm
Pre-processing phase:
1. For each sequence in the dataset, perform Q clustering runs with the given K and w, and calculate its cluster structure.
2. Store all the structures in a master table.
Matching phase:
1. Given a query, find the Euclidean distance from its shape to each of the structures in the master table.
2. Return the sequence whose structure is the closest.
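The two phases can be sketched as below, assuming the shapes of the Q clustering runs have already been computed for each series. The function names, series names, and shape values are hypothetical and serve only to show the control flow.

```python
import math

def build_master_table(shapes_by_series):
    """Pre-processing: average the Q shapes of each series into its structure."""
    table = {}
    for name, shapes in shapes_by_series.items():
        q = len(shapes)
        table[name] = [sum(col) / q for col in zip(*shapes)]
    return table

def match(query_shape, master_table):
    """Matching: return the series whose structure is closest to the query shape."""
    return min(master_table,
               key=lambda name: math.dist(query_shape, master_table[name]))

# Hypothetical sorted shapes from Q = 2 runs on two made-up series (K = 3).
shapes_by_series = {
    "series_a": [[1.0, 2.0, 3.0], [1.2, 2.2, 2.8]],
    "series_b": [[2.0, 2.1, 2.2], [2.1, 2.2, 2.4]],
}
table = build_master_table(shapes_by_series)
print(match([1.0, 2.1, 3.1], table))
print(match([2.0, 2.2, 2.3], table))
```

The master table is built once per (K, w) pair; each query afterwards costs only N distance computations over vectors of length K*(K-1)/2.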
Example
Master table (k=3, w=8)
data    ∆1      ∆2      ∆3
ocean   2.3598  3.0464  4.4583
packet  2.1712  2.2315  2.3619
soil    1.9434  2.0073  2.0582
sp      2.5302  2.9574  3.774
tide    2.7175  3.3878  3.705

∆1      ∆2      ∆3      Assignment
2.6517  2.9498  3.7824  sp
2.5873  3.5066  3.6869  tide
2.5958  2.8388  4.4928  ocean
2.1594  2.246   2.3478  packet
1.9323  2.0474  2.0711  soil
2.196   2.264   2.3352  packet
2.4942  2.7771  4.4644  ocean
2.5529  2.8036  3.7939  sp
1.9481  2.0417  2.0672  soil
2.8821  3.1982  3.7473  tide
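The assignments in this example can be re-derived from the numbers on the slide. The sketch below applies the nearest-structure rule to each query shape and compares the guess with the assignment listed; only the variable names are introduced here.

```python
import math

master = {
    "ocean":  [2.3598, 3.0464, 4.4583],
    "packet": [2.1712, 2.2315, 2.3619],
    "soil":   [1.9434, 2.0073, 2.0582],
    "sp":     [2.5302, 2.9574, 3.774],
    "tide":   [2.7175, 3.3878, 3.705],
}
queries = [
    ([2.6517, 2.9498, 3.7824], "sp"),
    ([2.5873, 3.5066, 3.6869], "tide"),
    ([2.5958, 2.8388, 4.4928], "ocean"),
    ([2.1594, 2.246, 2.3478], "packet"),
    ([1.9323, 2.0474, 2.0711], "soil"),
    ([2.196, 2.264, 2.3352], "packet"),
    ([2.4942, 2.7771, 4.4644], "ocean"),
    ([2.5529, 2.8036, 3.7939], "sp"),
    ([1.9481, 2.0417, 2.0672], "soil"),
    ([2.8821, 3.1982, 3.7473], "tide"),
]

guesses = []
for shape, listed in queries:
    # Nearest structure in the master table by Euclidean distance.
    guess = min(master, key=lambda name: math.dist(shape, master[name]))
    guesses.append(guess)
    print(shape, "->", guess, "| slide lists:", listed)
```

For all ten query shapes the nearest structure agrees with the assignment shown on the slide.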
Performance Evaluation
10 datasets from the UCR time series repository
100 clustering runs per structure
Algorithm evaluated with 3 values of K and 4 values of w (12 combinations)
Result: 100% accuracy
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching Algorithm
4. Observations and Conclusions
Conclusions
- Previous work seemed to show that the output of STS clustering is independent of the input.
- The correct conclusion: cluster set distance is an inappropriate distance metric.
- Instead of the absolute positions of cluster centers, one needs to use relative positions (as represented by cluster shapes).
- STS clustering becomes meaningful: cluster centers are reliably matched to the original series.
- We also found a correlation between some characteristics (number of unique shapes, shape skew) and sequence smoothness.
Future Work
WHY?
- Difference in behavior between whole-sequence and subsequence clustering? (some preliminary answers are in the paper)
- Apparent presence of transformations among cluster sets?
- Dependency between smoothness, skew, number of unique clusters, etc.?
HOW?
- Find the expected accuracy of the matching algorithm for a given input and Q (number of clustering runs to compute each structure).
Questions?
Thank you!