Mining Time Series Data
Carlo Zaniolo, UCLA CS Dept

With slides from:
A Tutorial on Indexing and Mining Time Series Data, ICDM '01
The 2001 IEEE International Conference on Data Mining
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside, Riverside, CA 92521
[email protected]
A time series is a collection of observations made sequentially in time.
Note that virtually all similarity measurements, indexing and dimensionality reduction techniques discussed in this tutorial can be used with other data types.
Time Series are Ubiquitous! I
People measure things...
• The president's approval rating.
• Their blood pressure.
• The annual rainfall in Riverside.
• The value of their Yahoo stock.
• The number of web hits per second.
… and things change over time.
Thus time series occur in virtually every medical, scientific and business domain.
Time Series are Ubiquitous! II
A random sample of 4,000 graphics from 15 of the world’s newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series (Tufte, 1983).
Defining the similarity between two time series is at the heart of most time series data mining applications/tasks.
Thus time series similarity will be the primary focus of this tutorial.
[Figure: a query Q (template) compared against a database C of candidate time series.]
Time Series Similarity
Classification
Clustering
Rule Discovery
Query by Content
Why is Working With Time Series so Difficult? Part I
1 Hour of EKG data: 1 Gigabyte.
Typical Weblog: 5 Gigabytes per week.
Space Shuttle Database: 158 Gigabytes and growing.
Macho Database: 2 Terabytes, updated with 3 gigabytes per day.
Answer: We are dealing with very large databases.
Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate.
Why is Working With Time Series so Difficult? Part II
The definition of similarity depends on the user, the domain and the task at hand. We need to be able to handle this subjectivity.
Answer: We are dealing with subjective notions of similarity.
Why is working with time series so difficult? Part III
Answer: Miscellaneous data handling problems.
• Differing data formats.
• Differing sampling rates.
• Noise, missing values, etc.
Similarity Matching Problem: Flavor 1
Database C
Query Q(template)
Given a Query Q, a reference database C and a distance measure, find the
Ci that best matches Q.
C6 is the best match.
1: Whole Matching
Similarity Matching Problem: Flavor 2
Database C
Query Q(template)
Given a Query Q, a reference database C and a distance measure, find the location that best matches Q.
2: Subsequence Matching
The best matching subsection.
Note that we can always convert subsequence matching to whole matching by sliding a window across the long sequence, and copying the window contents.
After all that background we might have forgotten what we are doing and why we care!
So here is a simple motivator and review…
You go to the doctor because of chest pains. Your ECG looks strange…
Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...
Two questions:• How do we define similar?• How do we search quickly?
Similarity is always subjective.(i.e. it depends on the application)
All models are wrong, but some are useful…
This slide was taken from: A Practical Time-Series Tutorial with MATLAB, presented at ECML/PKDD 2005 by Michalis Vlachos.
Distance functions
Metric distances, i.e., those that satisfy the Triangle Inequality: d(x,z) ≤ d(x,y) + d(y,z)
Example: if d(Q,B) = 150 and d(B,A) = 20, then d(Q,A) ≥ 150 - 20 = 130, so we do not need to get A from disk.
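This pruning idea can be sketched in a few lines (can_prune is a hypothetical name; the numbers mirror the example above, assuming d(Q,B) = 150 and d(B,A) = 20, with a best match so far at distance 100):

```python
def can_prune(d_qb, d_ba, best_so_far):
    """Triangle inequality gives the lower bound d(Q,A) >= |d(Q,B) - d(B,A)|.
    If that bound already exceeds the best match found so far,
    object A cannot be the answer and need not be fetched from disk."""
    lower_bound = abs(d_qb - d_ba)
    return lower_bound > best_so_far

# d(Q,A) >= |150 - 20| = 130 > 100, so A is safely skipped
print(can_prune(150, 20, 100))
```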
Non-Metric Distances. Examples:
• Time Warping
• LCSS: longest common subsequence
Preprocessing the data before distance calculations
If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results.
This is because Euclidean distance is very sensitive to some distortions in the data. For most problems these distortions are not meaningful, and thus we can and should remove them.
In the next four slides, we discuss the 4 most common forms of distortion, and how to remove them.
• Offset Translation
• Amplitude Scaling
• Linear Trend
• Noise
Fit the best fitting straight line to the time series, then subtract that line from the time series.
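The first three transformations (offset translation, amplitude scaling, linear trend removal) can be sketched in a few lines of NumPy. This is a sketch under my own function names, not the tutorial's code:

```python
import numpy as np

def remove_offset(t):
    # Offset translation: subtract the series mean
    return t - t.mean()

def scale_amplitude(t):
    # Amplitude scaling: normalize to unit standard deviation
    return (t - t.mean()) / t.std()

def remove_linear_trend(t):
    # Fit the best straight line (least squares), then subtract it
    x = np.arange(len(t))
    slope, intercept = np.polyfit(x, t, 1)
    return t - (slope * x + intercept)
```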
Transformation IV: Noise
Q = smooth(Q)
C = smooth(C)
D(Q, C)
The intuition behind removing noise is this: average each datapoint's value with its neighbors.
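A minimal moving-average sketch of this idea (smooth is a hypothetical name; window shrinks at the boundaries):

```python
def smooth(t, w=3):
    """Replace each point by the mean of its w-point neighborhood,
    averaging its value with its neighbors to suppress noise."""
    half = w // 2
    out = []
    for i in range(len(t)):
        window = t[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

print(smooth([0, 0, 9, 0, 0]))  # the spike at index 2 is spread out
```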
A Quick Experiment to Demonstrate the
Utility of Preprocessing the Data
Clustered using Euclidean distance on the raw data.
Clustered using Euclidean distance on the raw data, after removing noise, linear trend, offset translation and amplitude scaling.
Summary of Preprocessing
The “raw” time series may have distortions which we should remove before clustering, classification etc.
Of course, sometimes the distortions are the most interesting thing about the data; the above is only a general rule.
We should keep these problems in mind as we consider the high level representations of time series which we will encounter later (Fourier transforms, wavelets etc.), since these representations often allow us to handle distortions in elegant ways.
Fixed Time Axis: sequences are aligned “one to one”.
“Warped” Time Axis: nonlinear alignments are possible.
Dynamic Time Warping
Note: We will first see the utility of DTW, then see how it is calculated.
Utility of Dynamic Time Warping: Example II, Data Mining
Power-Demand Time Series. Each sequence corresponds to a week's demand for power in a Dutch research facility in 1997 [van Selow 1999].
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
Wednesday was a national holiday
Hierarchical clustering with Euclidean Distance.
<Group Average Linkage>
The two 5-day weeks are correctly grouped.
Note, however, that the three 4-day weeks are not clustered together, and neither are the two 3-day weeks.
Hierarchical clustering with Dynamic Time Warping.
<Group Average Linkage>
The two 5-day weeks are correctly grouped.
The three 4-day weeks are clustered together.
The two 3-day weeks are also clustered together.
Dynamic Time Warping (how does it work?)
The intuition is that we copy an element multiple times so as to achieve a better matching.
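This intuition corresponds to the classic dynamic-programming recurrence; a minimal sketch (my own implementation, not the tutorial's):

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.
    Cell (i, j) holds the cost of the best warping path aligning
    a[:i] with b[:j]; an element may be matched multiple times."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # repeat b[j-1]
                                 d[i][j - 1],      # repeat a[i-1]
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]

# the repeated 2 in the second series is absorbed by warping
print(dtw([1, 2, 3], [1, 2, 2, 3]))
```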
… however, note that the first few sine waves tend to be the largest (equivalently, the magnitude of the Fourier coefficients tend to decrease as you move down the column).
We can therefore truncate most of the small coefficients with little effect.
I have converted a raw time series X = {8, 4, 1, 3} into the Haar wavelet representation H = [4, 2, 2, -1]. We can convert the Haar representation back to the raw signal with no loss of information...
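The round trip above can be sketched with pairwise averages and halved differences (haar and inverse_haar are hypothetical names; length is assumed a power of two):

```python
def haar(x):
    """Haar decomposition: repeatedly replace pairs by their average,
    collecting the (left - right) / 2 detail coefficients."""
    coeffs = []
    while len(x) > 1:
        avgs = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
        diffs = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
        coeffs = diffs + coeffs   # finest details end up rightmost
        x = avgs
    return x + coeffs             # [overall average] + details

def inverse_haar(h):
    """Lossless reconstruction: average + detail and average - detail."""
    x, rest = h[:1], h[1:]
    while rest:
        diffs, rest = rest[:len(x)], rest[len(x):]
        x = [v for a, d in zip(x, diffs) for v in (a + d, a - d)]
    return x

print(haar([8, 4, 1, 3]))                # the H shown above
print(inverse_haar(haar([8, 4, 1, 3])))  # X recovered exactly
```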
Discrete Wavelet Transform II
We have only considered one type of wavelet; there are many others. Are the other wavelets better for indexing?
YES: I. Popivanov, R. Miller. Similarity Search Over Time Series Data Using Wavelets. ICDE 2002.
NO: K. Chan and A. Fu. Efficient Time Series Matching by Wavelets. ICDE 1999
I consider this an open question...
0 20 40 60 80 100 120 140
Haar 0
Haar 1
Haar 2
Haar 3
Haar 4
Haar 5
Haar 6
Haar 7
X
X'
DWT
Discrete Wavelet Transform III
Pros and Cons of Wavelets as a time series representation.
• Good ability to compress stationary signals.
• Fast linear time algorithms for DWT exist.
• Able to support some interesting non-Euclidean similarity measures.
• Signals must have a length n = 2^some_integer.
• Works best if n = 2^some_integer; otherwise wavelets approximate the left side of the signal at the expense of the right side.
• Cannot support weighted distance measures.
Singular Value Decomposition
Eugenio Beltrami (1835–1899), Camille Jordan (1838–1921), James Joseph Sylvester (1814–1897)
Basic Idea: Represent the time series as a linear combination of eigenwaves but keep only the first N coefficients.
SVD is similar to the Fourier and Wavelet approaches in that we represent the data in terms of a linear combination of shapes (in this case eigenwaves).
SVD differs in that the eigenwaves are data dependent.
SVD has been successfully used in the text processing community (where it is known as Latent Semantic Indexing) for many years, but it is computationally expensive.
Good free SVD primer: Singular Value Decomposition - A Primer, by Sonia Leach.
Singular Value Decomposition (cont.)
How do we create the eigenwaves?
We have previously seen that we can regard time series as points in high dimensional space.
We can rotate the axes such that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1 etc.
Since the first few eigenwaves contain most of the variance of the signal, the rest can be truncated with little loss.
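The rotation described above is exactly what NumPy's SVD computes; a sketch on a toy random database (all names, sizes and data here are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy database: 20 time series of length 8 (one series per row)
data = rng.normal(size=(20, 8))
data -= data.mean(axis=0)            # center each dimension first

# rows of vt are the eigenwaves, ordered by decreasing variance
u, s, vt = np.linalg.svd(data, full_matrices=False)

k = 3                                # keep only the first k coefficients
coeffs = data @ vt[:k].T             # project each series onto eigenwaves
approx = coeffs @ vt[:k]             # reduced-dimensionality reconstruction
```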
Piecewise Linear Approximation I
Basic Idea: Represent the time series as a sequence of straight lines.
If we have an N-vector then:
• If lines are connected, we can represent N/2 lines. Each line segment has: length, left_height (right_height can be inferred by looking at the next segment).
• If lines are disconnected, we can represent only N/3 lines. Each line segment has: length, left_height, right_height.
Personal experience on dozens of datasets suggests that disconnected is better. Also, only disconnected allows a lower-bounding Euclidean approximation.
Karl Friedrich Gauss (1777–1855)
Piecewise Linear Approximation II
How do we obtain the Piecewise Linear Approximation?
Optimal solution is O(n²N), which is too slow for data mining.
A vast body of work on faster heuristic solutions to the problem can be classified into the following classes:
• Top-Down: O(n²N)
• Bottom-Up: O(n(1/CRatio))
• Sliding Window: O(n(1/CRatio))
• Other (genetic algorithms, randomized algorithms, B-spline wavelets, MDL, etc.)
A recent extensive empirical evaluation of all approaches suggests that Bottom-Up is the best approach overall.
Piecewise Linear Approximation III
Pros and Cons of PLA as a time series representation.
• Good ability to compress natural signals.
• Fast linear time algorithms for PLA exist.
• Able to support some interesting non-Euclidean similarity measures, including weighted measures, relevance feedback, fuzzy queries…
• Already widely accepted in some communities (i.e., biomedical).
• Not (currently) indexable by any data structure (but does allow fast sequential scanning).
Symbolic Approximation
Key: C = Constant, U = Up, D = Down
Example discretization: C U U C D C U D
Basic Idea: Convert the time series into an alphabet of discrete symbols. Use string indexing techniques to manage the data.
Potentially an interesting idea, but all the papers thus far are very ad hoc.
Pros and Cons of Symbolic Approximation as a time series representation.
• Potentially, we could take advantage of a wealth of techniques from the very mature field of string processing.
• There is no known technique to allow the support of Euclidean queries.
• It is not clear how we should discretize the time series (discretize the values, the slope, shapes? How big an alphabet? etc.)
Piecewise Aggregate Approximation I
Given the reduced dimensionality representation we can calculate the approximate Euclidean distance (a lower bound)
Basic Idea: Represent the time series of n points as a sequence of box basis functions. Each box is the same length w (simple case: assume n multiple of w)
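A minimal PAA sketch, including the lower-bounding distance mentioned above, which scales each squared coefficient difference by the frame length w (paa and paa_dist are my own names; n is assumed a multiple of the number of segments):

```python
def paa(t, n_segments):
    """Piecewise Aggregate Approximation: the mean of each
    equal-length frame (simple case: len(t) divisible by n_segments)."""
    w = len(t) // n_segments
    return [sum(t[i * w:(i + 1) * w]) / w for i in range(n_segments)]

def paa_dist(p, q, w):
    """Distance between two PAA vectors; lower-bounds the true
    Euclidean distance between the original series."""
    return (w * sum((a - b) ** 2 for a, b in zip(p, q))) ** 0.5

print(paa([0, 2, 4, 6, 8, 10, 12, 14], 4))  # one mean per frame
```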
Pros and Cons of PAA as a time series representation.
• Extremely fast to calculate
• As efficient as other approaches (empirically)
• Supports queries of arbitrary lengths
• Can support any Minkowski metric
• Supports non-Euclidean measures
• Supports weighted Euclidean distance
• Simple! Intuitive!
• If visualized directly, looks aesthetically unpleasing.
Adaptive Piecewise Constant Approximation I
Basic Idea: Generalize PAA to allow the piecewise constant segments to have arbitrary lengths. Note that we now need 2 coefficients to represent each segment, its value and its length.
The intuition is this: many signals have little detail in some places and high detail in others. APCA can adaptively fit itself to the data, achieving a better approximation.
Adaptive Piecewise Constant Approximation II
The high quality of the APCA has been noted by many researchers. However, it was believed that the representation could not be indexed because some coefficients represent values, and some represent lengths.
However an indexing method was discovered! (SIGMOD 2001 best paper award)
Unfortunately, it is non-trivial to understand and implement….
SAX: Symbol Mapping
• Each average value from the PAA vector is replaced by a symbol from an alphabet
• An alphabet size of 5 to 8 is recommended
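In the usual SAX formulation (not detailed in these slides), the breakpoints between symbols are chosen to be equiprobable under a standard normal distribution, assuming the series has been z-normalized first. A sketch with a hypothetical helper name:

```python
from statistics import NormalDist

def sax_symbols(paa_values, alphabet_size=5):
    """Map each PAA average to a letter, using breakpoints that divide
    the standard normal curve into equiprobable regions (the series is
    assumed z-normalized, so values are roughly N(0, 1) distributed)."""
    nd = NormalDist()
    breakpoints = [nd.inv_cdf(i / alphabet_size)
                   for i in range(1, alphabet_size)]
    letters = "abcdefgh"[:alphabet_size]
    out = []
    for v in paa_values:
        idx = sum(v > b for b in breakpoints)  # count breakpoints below v
        out.append(letters[idx])
    return "".join(out)

# with alphabet size 4 the breakpoints are about -0.67, 0, 0.67
print(sax_symbols([-1.0, -0.1, 0.1, 1.0], 4))
```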