1
Embedding-Based Subsequence Matching in
Large Sequence Databases
Panagiotis Papapetrou
Doctoral Dissertation Defense
Committee: George Kollios Stan Sclaroff Margrit Betke Vassilis Athitsos (University of Texas at Arlington) Dimitrios Gunopulos (University of Athens)
Committee Chair: Steve Homer
2
Subsequence Matching: General Problem
Given: Sequence S. Query Q. Similarity measure D.
Find the best subsequence of S that matches Q.
Types of Sequences: Time Series. Biological sequences (e.g. DNA).
3
Types of Sequences (1/2) Time Series
Ordered set of events X = {x1, x2, …, xn}. Weather measurements (temperature, humidity, etc). Stock prices. Gestures, motion, sign language. Geological or astronomical observations. Medicine: ECG, …
[figure: query Q matched against a portion of time series X]
4
Types of Sequences (2/2)
Strings: defined over an alphabet Σ. Text documents. Biological sequences (DNA). Near homology search: deviation from Q does not exceed a threshold δ (δ ≤ 15%).
- V. Athitsos, P. Papapetrou, M. Potamias, G. Kollios, and D. Gunopulos, "Approximate embedding-based subsequence matching of time series", SIGMOD 2008.
6
Time Series: a sequence of observations.
(X1, X2, X3, X4, …, Xm).
Each Xi is a real number, or a vector. E.g., (2.0, 2.4, 4.8, 5.6, 6.3, 5.6, 4.4, 4.5, 5.8, 7.5)
[figure: time series plot; x-axis: time, y-axis: value]
7
Subsequence Matching in a Database
database
query
What subsequence of any database sequence is the best match for Q?
Naïve approach: brute-force search.
8
Our Contribution
database
query
What subsequence of any database sequence is the best match for Q?
Partial reduction to vector search, via an embedding. Quick way to identify a few candidate matches.
9
How to Compare Time Series
Euclidean distance: matches rigidly along the time axis.
Dynamic Time Warping (DTW): allows stretching and shrinking along the time axis.
In our method, we use DTW.
10
DTW: Dynamic time warping (1/2)
Each cell c = (i, j) is a pair of indices whose corresponding values' cost, (xi − yj)^2, is computed and included in the sum for the distance.
Euclidean path: i = j always. Ignores off-diagonal cells.
[figure: DTW cost table for X and Y; cell (1, 1) holds (x1 − y1)^2, cell (2, 2) holds (x1 − y1)^2 + (x2 − y2)^2]
11
DTW: Dynamic time warping (2/2)
DTW allows more paths. Examine all valid paths: standard dynamic programming fills in the table, and the top-right cell contains the final result.
Cell (i, j) is reached from (i−1, j) (shrink x / stretch y), from (i, j−1) (stretch x / shrink y), or from (i−1, j−1).
[figure: DTW table for X and Y, showing the three predecessor cells of (i, j)]
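The recurrence above can be sketched in a few lines of Python (a minimal illustration, not the thesis code; the squared difference is assumed as the local matching cost):

```python
import math

def dtw(x, y):
    """Full-sequence DTW between 1-D series x and y.

    D[i][j] holds the cost of the best warping path aligning x[0:i]
    with y[0:j]; cell (i, j) is reached from (i-1, j), (i, j-1),
    or (i-1, j-1), exactly as in the recurrence above.
    """
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2      # local matching cost
            D[i][j] = cost + min(D[i - 1][j],       # shrink x / stretch y
                                 D[i][j - 1],       # stretch x / shrink y
                                 D[i - 1][j - 1])   # advance both
    return D[n][m]  # top-right cell: final result
```

For example, dtw([1, 2, 3, 3], [1, 2, 2, 3]) is 0.0: the two series differ only by a stretch along the time axis, which Euclidean distance would penalize.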
12
J-Position Subsequence Match
X: long sequence
Q: short sequence
What subsequence of X is the best match for Q …such that the match ends at position j?
15
Dynamic Programming (1/2)
For each (i, j): compute the j-position subsequence match of the first i items of Q, i.e., Q[1:i] is matched.
[figure: DP table with database sequence X along the columns and the query* along the rows; the bottom row is initialized to 0]
* Sakurai, Y., Faloutsos, C., and Yoshikawa, M., "Stream Monitoring under the Time Warping Distance", ICDE 2007.
16
Dynamic Programming (2/2)
For each (i, j): compute the j-position subsequence match of the first i items of Q.
Top row: the j-position subsequence matches of Q. Final answer: the best among the j-position matches, stored in the top row of the table.
[figure: DP table with database sequence X along the columns and the query along the rows; the bottom row is initialized to 0]
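This table can be sketched as follows (in the spirit of SPRING; an illustration, not the thesis code): initializing the bottom row to zeros lets a match start at any database position, and the top row then holds every j-position match cost.

```python
import math

def jposition_costs(q, x):
    """Top row of the subsequence-matching table: for each j, the
    DTW cost of the best subsequence of x that matches q and ends
    at position j. Row 0 is all zeros, so a match may start
    anywhere in x."""
    n, m = len(q), len(x)
    prev = [0.0] * (m + 1)                      # row 0: all zeros
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (q[i - 1] - x[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[1:]                             # j-position match costs

# Final answer: the best among the j-position matches.
costs = jposition_costs([1.0, 2.0], [5.0, 1.0, 2.0, 9.0])
best_j = min(range(len(costs)), key=costs.__getitem__)
# best match ends at index 2 (the subsequence [1.0, 2.0]) with cost 0.0
```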
17
Time Complexity
Assume that the database is one very long sequence. Concatenate all sequences into one sequence.
O(length of query * length of database). Does not scale to large database sizes.
18
Strategy: Identify Candidate Endpoints
[figure: query Q is given to an indexing structure, which returns candidate endpoints on database sequence X]
Candidate endpoint: the last element of a possible subsequence match.
Use dynamic programming only to evaluate the candidates.
24
Vector Embedding
[figure: each position x1, …, x15 of the database sequence is mapped to a vector, forming a vector set; the query q1, …, q5 is mapped to a query vector]
The embedding should be such that: the query vector is similar to the vector of the subsequence match endpoint.
Using vectors we identify candidate endpoints, much faster than with brute-force search.
30
Using Reference Sequences
For each cell (|R|, j), DTW computes: the cost of the best subsequence match of R ending at the j-th position of X.
Define FR(X, j) to be that cost. FR is a 1D embedding: each (X, j) → a single real number.
[figure: reference sequence R matched against database sequence X; row |R| of the table holds FR(X, j)]
31
Using Reference Sequences
At cell (|R|, |Q|), DTW computes: the cost of the best subsequence match of R with a suffix of Q.
Define FR(Q) to be that cost.
[figure: reference sequence R matched against database sequence X, and against query Q]
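FR(X, j) and FR(Q) both come out of the same subsequence-matching table, so the 1D embedding can be sketched as follows (illustrative, with the squared difference assumed as the local cost):

```python
import math

def match_costs(r, s):
    """Subsequence-DTW table, top row: cost of the best match of r
    ending at each position of s (row 0 is zeros)."""
    prev = [0.0] * (len(s) + 1)
    for i in range(len(r)):
        cur = [math.inf] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            cost = (r[i] - s[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[1:]

def embed_database(r, x):
    """FR(X, j) for every j: one real number per database position."""
    return match_costs(r, x)

def embed_query(r, q):
    """FR(Q): cell (|R|, |Q|), the best match of r with a suffix of q."""
    return match_costs(r, q)[-1]
```

In practice embed_database is run offline, once per reference sequence; embed_query is the only per-query DP over the (short) query.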
32
Intuition About This Embedding
Suppose Q appears exactly as (Xi’, …, Xj). If j-position match of R in X starts after i’, then:
Warping paths are the same. FR(Q) = FR(X, j).
34
Intuition About This Embedding
Suppose Q appears inexactly as (Xi’, …, Xj). If j-position match of R in X starts after i’:
We expect FR(Q) to be similar to FR(X, j). Why? Little tweaks should affect FR(X, j) little. No proof, but intuitive, and lots of empirical evidence.
35
Intuition About This Embedding
If (Xi’, …, Xj) is the subsequence match of Q: If j-position match of R in X starts after i’:
FR(Q) should (for most Q) be more similar to FR(X, j) than to most FR(X, t).
36
Multi-Dimensional Embedding
One reference sequence → 1D embedding.
2 reference sequences → 2-dimensional embedding.
d reference sequences → d-dimensional embedding F.
If (Xi', …, Xj) is the subsequence match of Q: F(Q) should (for most Q) be more similar to F(X, j) than to most F(X, t).
[figure: references R1 and R2, each matched against database sequence X and against query Q]
39
Filter-and-Refine Retrieval
Offline step: compute F(X, j) for all j.
Online steps, given a query Q:
Embedding step: compute F(Q).
Filter step: compare F(Q) to all F(X, j); select the p best matches → p candidate endpoints.
Refine step: use DTW to evaluate each candidate endpoint.
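These steps can be sketched end to end (a toy illustration, not the thesis implementation: Euclidean distance in the embedding space is assumed, and the refine step here reads exact costs from a full table, whereas the real method runs DTW only around each candidate):

```python
import math

def match_costs(r, s):
    """Top row of the subsequence-DTW table: cost of the best
    subsequence match of r ending at each position of s."""
    prev = [0.0] * (len(s) + 1)
    for i in range(len(r)):
        cur = [math.inf] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            cost = (r[i] - s[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[1:]

def filter_and_refine(q, x, refs, p):
    """Embed, filter to p candidate endpoints, then refine."""
    # offline step (done once in practice): F(X, j) for every j
    db_vecs = [match_costs(r, x) for r in refs]
    # embedding step: F(Q), one coordinate per reference sequence
    q_vec = [match_costs(r, q)[-1] for r in refs]
    # filter step: rank endpoints by distance in the embedding space
    def vec_dist(j):
        return sum((db_vecs[k][j] - q_vec[k]) ** 2 for k in range(len(refs)))
    candidates = sorted(range(len(x)), key=vec_dist)[:p]
    # refine step: exact j-position costs, consulted only at candidates
    exact = match_costs(q, x)
    best = min(candidates, key=lambda j: exact[j])
    return best, exact[best]

# toy query, database, and two reference sequences
filter_and_refine([1.0, 2.0], [0.0, 0.0, 5.0, 1.0, 2.0, 9.0, 0.0],
                  [[1.0, 2.0], [2.0, 1.0]], 3)
# → (4, 0.0): the match [1.0, 2.0] ends at index 4 with cost 0.0
```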
40
Filter-and-Refine Performance
Accuracy: the correct match must be among the p candidates, for most queries.
Larger p → higher accuracy, lower efficiency.
[figure: candidate endpoints marked on database sequence X]
41
Experiments - Datasets
3 datasets from the UCR Time Series Data Mining Repository: 50Words, Wafer, Yoga.
All database sequences concatenated → one big sequence, of length 2,337,778.
Query lengths: 152, 270, 426.
42
Experiments - Methods
Brute force: full DTW between each query and the entire database sequence. Similar to SPRING of Sakurai et al.
PDTW (Keogh et al. 2004, modified by us): makes time series smaller by a factor of k; each chunk of k values is replaced by its average; matching on the smaller series is used as a filter step.
EBSM (our method): 40-dimensional embedding.
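The PDTW reduction described above (each chunk of k values replaced by its average) can be sketched as follows (illustrative; the handling of a trailing partial chunk is an assumption):

```python
def reduce_series(x, k):
    """PDTW-style reduction: replace each chunk of k values by its
    average, shrinking the series by (roughly) a factor of k."""
    return [sum(x[i:i + k]) / len(x[i:i + k]) for i in range(0, len(x), k)]

# a series of 5 values reduced with k = 2 (last chunk is partial)
reduce_series([1, 3, 5, 7, 4], 2)  # → [2.0, 6.0, 4.0]
```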
43
Experiments – Performance Measures
Accuracy: percentage of queries giving correct results.
Efficiency:
DTW cell cost: cost of dynamic programming, as a percentage of the brute-force search cost.
Runtime cost: CPU time per query, as a percentage of the brute-force CPU time.
Match cost: sum of individual matching costs. Example: contribution of element (4, 11), i.e., the 4th element of Q matched to the 11th element of X: value 5 matches value 4, so the cost is |5 − 4| = 1.
121
Selecting Reference Sequences
Select K reference sequences from the database with lengths between m/2 and M.
M: maximum expected query size. m: minimum expected query size.
From those K, select the top K' reference sequences with the maximum variance.
Given a set of training queries: choose the reference sequences that minimize the total DTW cost.
J. Venkateswaran, D. Lachwani, T. Kahveci, and C. Jermaine, "Reference-based indexing of sequence databases", VLDB 2006.
122
Limitations
Is EBSM always going to work well? There is no theoretical guarantee.
Reference sequence selection: training is costly.
Space: (number of reference sequences) × (database size); in our experiments, 40 × (database size). Is there any way of compressing this?
Supporting variable query sizes.
123
Query-by-Humming (1/2)
Database of 500 songs. Set of 1000 hummed queries: shorter than the song size, and only including the main melody.
The time series contains the pitch value of each note. Pitch value: the frequency of the sound of that note, normalized. The time series contains pitch differences (to handle queries that are sung at a higher/lower scale).
Used 500 queries for training and 500 queries for testing EBSM.
124
Query-by-Humming (2/2)
Results: for all queries, DTW can find the correct song when looking at the nearest 5% of the songs (i.e., top 25).

Rank     DTW Success   EBSM Success   EBSM Cell Cost   EBSM RRT
top 25   100%          99%            4.1              5.8
top 15   94%           90%            3.4              4.5
top 5    82%           78%            2.9              3.8
125
Experiments - Datasets
3 datasets from the UCR Time Series Data Mining Archive: 50Words, Wafer, Yoga.
All database sequences concatenated → one big sequence, of length 2,337,778.
1750 queries, of lengths 152, 270, 426: 750 used for embedding optimization, 1000 used for performance evaluation.
126
Smith-Waterman Upper-bound
Bound:
Proof:
127
Results – Effect of Dimensionality
128
RBSA: Results on S-W
Cell Cost
129
Proof of Lower Bound
Two auxiliary definitions:
M(A, B, t): the subsequence of B ending at position (B, t) with the smallest edit distance from A.
Q': the suffix of Q with the smallest edit distance from R.

We have:
LBR(Q, X, t) = FR(X, t) − FR(Q)
             = ED(R, M(R, X, t)) − ED(R, Q')
             ≤ ED(R, M(Q', X, t)) − ED(R, Q')
             ≤ ED(M(Q', X, t), Q')
             ≤ ED(M(Q, X, t), Q)

First inequality: M(R, X, t) and M(Q', X, t) are both subsequences of X ending at (X, t), and M(R, X, t) has the smallest distance from R.
Second inequality: since ED is a metric, the triangle inequality holds.
Third inequality: the minimal set of edit operations converting Q to M(Q, X, t) suffices to convert Q' to a suffix of M(Q, X, t), so the smallest edit distance between Q' and a subsequence of X ending at (X, t) is bounded by ED(M(Q, X, t), Q).
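The chain of inequalities can be checked numerically with a small script (an illustration, not part of the thesis; it uses unit-cost edit distance and a DP that returns, for each t, the edit distance between a pattern and the best subsequence of the text ending at position t):

```python
def match_costs(a, b):
    """prev[t]: edit distance between a and the best subsequence of b
    ending at position t (row 0 is zeros, so matches start anywhere)."""
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cur[j] = min(prev[j - 1] + (ca != cb),  # substitute / match
                         prev[j] + 1,               # delete from a
                         cur[j - 1] + 1)            # insert into a
        prev = cur
    return prev

R, Q, X = "ACGT", "ACGGT", "TTACGTTACGGTT"
fr_x = match_costs(R, X)       # FR(X, t) for every t
fr_q = match_costs(R, Q)[-1]   # FR(Q): best match of R with a suffix of Q
fq_x = match_costs(Q, X)       # ED(M(Q, X, t), Q) for every t
# LBR(Q, X, t) = FR(X, t) - FR(Q) never exceeds the true match cost:
assert all(fr_x[t] - fr_q <= fq_x[t] for t in range(len(X) + 1))
```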
138
BSE
BSE Construction
139
RBSA: Approximate version
Question: use only one segment Qi of Q. What is the probability that the subsequence match of Q is included in the candidates of Qi?
M(Q, X, t): the best subsequence match of Q in X.
Assume: ED(Q, M(Q, X, t)) ≤ δ|Q|, i.e., at most δ|Q| edit operations are needed to convert Q to M(Q, X, t).
Each of these operations is applied to ONLY one segment of Q.
140
RBSA: Approximate version
SED: the optimal sequence of edit operations to convert Q into M(Q, X, t).
Proposition: given any Qi, P(out of SED, at most δq edit operations are applied to Qi) ≥ 50%. [Hamza et al. 1995]
141
RBSA: Approximate version
Qcm: the segment where the m-th edit operation is applied.
P(cm = i): the probability that the m-th edit operation is applied to Qi.
Assume that: P(cm = i) is uniform over all i, and the distribution of cm is independent of any cn, for n ≠ m.
SED: the optimal sequence of edit operations (EO): Q → M(Q, X).
Given any Qi: P(out of SED, at most δq EO are applied to Qi) ≥ 50%, using [Hamza et al. 1995].
142
RBSA: Approximate version
Proof: the probability that exactly k out of the n EO are applied to Qi follows a binomial distribution:
n trials; success: an EO is applied to Qi; P(success) = 1/α.
The expected number of successes over n trials is n/α.
143
RBSA: Approximate version
Proof (continued): the expected number of successes over n trials is n/α.
If α ≥ 4, then P(success) ≤ 25%.
Then, as shown in [Hamza et al. 1995]: P(number of successes ≤ n/α) ≥ 50%.
Since n ≤ δ|Q|: n/α ≤ (δ|Q|)/α = δq.
Thus: P(at most δq EO are applied to Qi) ≥ 50%.
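The probability fact used in this step can be checked numerically (an illustration, not part of the thesis; n = 20 operations and α = 4 segments are assumed example values):

```python
from math import comb

def binom_cdf(n, p, k):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# n edit operations, each landing on a fixed segment Qi with
# probability 1/alpha; the expected number hitting Qi is n/alpha.
n, alpha = 20, 4
p_at_most_mean = binom_cdf(n, 1 / alpha, n // alpha)
assert p_at_most_mean >= 0.5  # at most n/alpha successes, with prob >= 50%
```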
144
RBSA: Effect of Alphabet Reduction
Retrieval Runtime Percentage and Cell Cost
145
Contributions: Time Series
EBSM:
The first embedding-based approach for subsequence matching in time series databases.
Achieves speedups of more than an order of magnitude vs. state-of-the-art methods.
Uses DTW (non-metric), and thus it is hard to provide any theoretical guarantees.
146
Contributions: Time Series
BSE: a bi-directional embedding for time series subsequence matching under cDTW.
The embedding is enforced and training is not necessary.
For more details refer to my thesis…
147
Contributions: Strings
RBSA:
The first embedding-based approach for subsequence matching in large string databases.
Exploits the metric properties of the edit distance measure.
Defined bounds for subsequence matching under the edit distance and the Smith-Waterman similarity measure.
Proved that, under some realistic assumptions, the probability of failing to identify the best match drops exponentially as the number of segments increases.
148
Contributions: Strings
RBSA: has been applied to real biological problems:
Near homology search in DNA.
Finding near matches of the Mouse Genome in the Human Genome.
Supports large queries, which is necessary for searches in EST (Expressed Sequence Tag) databases.
Has shown significant speedups compared to:
BLAST, the most commonly used method for near homology search in DNA sequences.
State-of-the-art methods (q-grams, BWT-SW) for near homology search in DNA sequences, for small |Q| (< 200).
149
RBSA: Results on S-W
Retrieval Runtime Percentage
150
Wafer Dataset
A collection of inline process control measurements recorded from various sensors during the processing of silicon wafers for semiconductor fabrication.
Each data set in the wafer database contains the measurements recorded by one sensor during the processing of one wafer by one tool.
151
Yoga Dataset
[figure: precision-recall breakeven point (y-axis, 0.80 to 0.92) vs. number of iterations (x-axis, 20 to 140) on the Yoga dataset]
Figure 13: Classification performance on Yoga Dataset
Figure 12: Shapes can be converted to time series. The distance from every point on the profile to the center is measured and treated as the Y-axis of a time series.