Mining Correlations on Massive Bursty Time Series Collections Tomasz Kuśmierczyk and Kjetil Nørvåg
Aug 13, 2015
Problem statement
● bursty streams
● streams of bursts (extracted with one of many available burst detection methods)
● correlated bursts
● goal: identify correlated bursty streams
Problem: Massive Collections
● identify pairs: correlation ≥ threshold
● N ~ millions of streams
● naive (all pairs) solution complexity ~ N^2
● pruning
● indexing
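To make the quadratic cost concrete, the sketch below counts the unordered pairs a naive all-pairs comparison would have to examine. The function name `naive_pair_count` is an illustrative assumption; the collection size of 2.1M streams is the one reported in the experimental evaluation of this deck.

```python
import math


def naive_pair_count(n_streams: int) -> int:
    """Number of unordered stream pairs an all-pairs comparison examines."""
    return math.comb(n_streams, 2)


# Collection size taken from the experimental evaluation (2.1M streams).
pairs = naive_pair_count(2_100_000)
```

For 2.1M streams this is roughly 2.2 × 10^12 comparisons, which is why the following slides focus on pruning and indexing.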
Motivation
● any source of large number of streams:
○ social media
○ web page view counts
○ traffic monitoring sensors
○ smart grid (electricity consumption meters)
○ and more
Correlated bursts
Correlated bursts may have:
● different lengths
● different heights
● slight shifts
but
● they should overlap
Correlated bursty streams
For two streams i and j:
● e_i, e_j: number of bursts per stream
● o_ij: number of bursts from stream i overlapping with j
● o_ji: number of bursts from stream j overlapping with i
J(E_i, E_j) = min(o_ij, o_ji) / (e_i + e_j - min(o_ij, o_ji))
[Figure: bursts of streams E_i and E_j on a common time axis]
Example:
e_i = 4, o_ij = 3
e_j = 3, o_ji = 2
min(o_ij, o_ji) = min(3, 2) = 2
J(E_i, E_j) = 2 / (4 + 3 - 2) = 2/5
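The worked example can be reproduced with a minimal sketch. It assumes that a burst is a closed `(start, end)` interval and that two bursts overlap when their intervals intersect; the function names and the concrete interval values are illustrative, chosen only so that the counts match the slide (e_i = 4, o_ij = 3, e_j = 3, o_ji = 2).

```python
from typing import List, Tuple

Burst = Tuple[float, float]  # assumed representation: closed interval (start, end)


def bursts_overlap(a: Burst, b: Burst) -> bool:
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_count(ei: List[Burst], ej: List[Burst]) -> int:
    """Number of bursts of stream i that overlap at least one burst of stream j."""
    return sum(1 for a in ei if any(bursts_overlap(a, b) for b in ej))


def burst_jaccard(ei: List[Burst], ej: List[Burst]) -> float:
    """J(E_i, E_j) = min(o_ij, o_ji) / (e_i + e_j - min(o_ij, o_ji))."""
    o = min(overlap_count(ei, ej), overlap_count(ej, ei))
    return o / (len(ei) + len(ej) - o)


# Illustrative data matching the counts on the slide.
E_i = [(0, 2), (3, 4), (10, 11), (15, 16)]
E_j = [(1, 3.5), (10.5, 12), (20, 21)]
j_value = burst_jaccard(E_i, E_j)  # 2 / (4 + 3 - 2)
```

This brute-force version compares every burst pair of the two streams, so it is only meant to pin down the definition, not to scale.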
Enumerating pairs
Order streams according to number of bursts:
❏ FOREACH base count b
  ❏ FOREACH b’ IN connected counts of b
    ❏ compare streams with b and b’ bursts
Pruning
● for each base count b we need to consider only connected counts b’ such that:
J_T · b ≤ b’ ≤ b
where J_T is the threshold, b the particular base count, and b’ the possible connected counts
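The bound follows from the correlation measure: with b’ ≤ b, at most b’ bursts can overlap, so J ≤ b’ / (b + b’ - b’) = b’ / b, and J ≥ J_T is only possible when b’ ≥ J_T · b. The sketch below combines this pruning rule with the enumeration loop. The `streams` layout (stream id mapped to a list of bursts) and both function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations, product


def connected_counts(b, jt, counts):
    """Counts b' that can still reach the threshold: jt * b <= b' <= b."""
    return [bp for bp in counts if jt * b <= bp <= b]


def enumerate_candidate_pairs(streams, jt):
    """Yield pairs of stream ids that survive count-based pruning.

    `streams` maps a stream id to its list of bursts (hypothetical layout).
    """
    by_count = defaultdict(list)
    for sid, bursts in streams.items():
        by_count[len(bursts)].append(sid)
    counts = sorted(by_count)
    for b in counts:
        for bp in connected_counts(b, jt, counts):
            if bp == b:
                # Same burst count: every unordered pair within the group.
                yield from combinations(by_count[b], 2)
            else:
                yield from product(by_count[b], by_count[bp])
```

With J_T = 0.5, a stream with 4 bursts is never compared against a stream with 1 burst, because 1 < 0.5 · 4.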
Interval Boxes (IB) index
● k-subset of bursts = k-dim box
● k-dimensional R-trees
● k-dim boxes overlapping = at least k bursts overlap
For example (k=2): the representation of stream E_i as 2-dimensional boxes
[Figure: 2-dimensional boxes of stream E_i, with ticks 1 to 4 on both axes; an indexed box overlapping a query box means min(o_ij, o_ji) ≥ k]
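The box construction can be sketched as follows. This is not the R-tree index itself: the overlap query is replaced by a brute-force scan over all boxes, purely to illustrate why overlapping k-dimensional boxes imply at least k overlapping bursts. Bursts are assumed to be closed `(start, end)` intervals, and all names are illustrative.

```python
from itertools import combinations


def k_boxes(bursts, k):
    """Each time-ordered k-subset of bursts is one k-dimensional box.

    Dimension m of the box is the interval of the m-th burst in the subset.
    """
    return list(combinations(sorted(bursts), k))


def boxes_overlap(box_a, box_b):
    """Boxes intersect when their intervals intersect in every dimension."""
    return all(a[0] <= b[1] and b[0] <= a[1] for a, b in zip(box_a, box_b))


def at_least_k_overlaps(ei, ej, k):
    """Brute-force stand-in for querying an R-tree of E_i's boxes with E_j's boxes."""
    return any(
        boxes_overlap(x, y) for x in k_boxes(ei, k) for y in k_boxes(ej, k)
    )


# Illustrative data: 3 bursts of E_i overlap E_j, 2 bursts of E_j overlap E_i.
E_i = [(0, 2), (3, 4), (10, 11), (15, 16)]
E_j = [(1, 3.5), (10.5, 12), (20, 21)]
```

A stream with e bursts produces C(e, k) boxes, which is why the choice of k trades index size against the number of candidate pairs.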
Interval Boxes (IB) index: mining
● mining:
○ for each base count b maintain an IB (R-trees) index
○ query it with streams having connected counts b’
● candidate pairs of streams (min(o_ij, o_ji) ≥ k) → correlated output pairs
[Figure: one IB index per base count, b=1, b=2, b=3, b=4]
IB index: what dimensionality k?
● small k (IB Low Dimensional = IBLD)
○ small indexes
○ large number of candidate pairs
● high k (IB High Dimensional = IBHD)
○ large indexes
○ small number of candidate pairs
○ k_max = J_T · b (correlation ≥ threshold guaranteed)
IBHD index in practice
● to speed up mining, some k-subsets are skipped
● some pairs may be missed in multiple-overlap cases
● efficiency-effectiveness tradeoff
List-based (LS) index: bins
● a separate bin for each (not pruned) pair b, b’
[Figure: grid of bins]
b=1: b’=2, b’=3, b’=4, b’=5
b=2: b’=3, b’=4, b’=5
b=3: b’=4, b’=5, b’=6
b=4: b’=5, b’=6, b’=7
List-based (LS) index: single bin
[Figure: bursts of a single bin along the time axis, discretized at the chosen time granularity]
LS index: mining algorithm
● Returns o_ij and o_ji
● Only for pairs E_i, E_j that have at least one overlap
● Immediate validation of the pair correlation J
● For each set of burst pointers (time moment):
○ identify NEW, OLD, ENDING (simple set operations)
○ maintain map OVERLAPS = burst → set of overlapping streams
○ update counts o_ij and o_ji
[Figure: time axis showing the current time moment (set of pointers), the bursts active in the current moment, the bursts active in the previous moment, and the NEW, OLD and ENDING sets]
Hybrid index
● LS index works well when:
○ low number of overlaps
○ high number of bursts per stream
● IBHD index works well when:
○ low number of bursts per stream
○ high number of overlaps
● Solution: Hybrid index: IBHD index for low and LS index for high base counts
Experimental evaluation
● Wikipedia page views from the years 2011-2013
● Kleinberg’s burst extraction
● streams having at least 5 bursts
● 2.1M streams and 43M bursts in total
● 10 bursts per stream on average
● mean burst length 28h
Mining & building
Threshold: J_T = 0.95
Hybrid mining
Number of streams: N = 2.1M
Number of generated pairs
Threshold: J_T = 0.95 (<10% pairs missing)
Example pairs of Wikipedia pages:
● How_I_Met_Your_Mother_(season_7), Two_and_a_Half_Men_(season_9)
● Process_(computing), Central_processing_unit
● Endoplasmic_reticulum, Ribosome
● Greatest_Hits,_Vol._2_(Ronnie_Milsap_album), Greatest_Hits,_Vol._3_(Ronnie_Milsap_album)
● DigiTech_JamMan, Lexicon_JamMan
● Humanistic_psychology, Positive_psychology
Computational limits for Naive/LS index
What’s more in the paper?
● formal definitions and proofs
● considerations of combinatorial aspects
● multiple overlap cases
● on-line maintenance of indexes
LS index: mining
● For each set of burst pointers (time moment):
○ identify NEW, OLD, ENDING (simple set operations)
○ new overlapping bursts: NEW × OLD ∪ NEW × NEW
○ remove ENDING and add new overlapping bursts to the map OVERLAPS = burst → set of overlapping streams
○ update counts o_ij and o_ji for new overlapping bursts, with the help of OVERLAPS
● For each i and j in o: calculate min(o_ij, o_ji) and J
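The steps above can be sketched as a sweep over time moments. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats the whole collection as a single bin, takes NEW as current minus previous, OLD as current and previous, and ENDING as previous minus current, identifies a burst by a `(stream id, burst index)` pointer, and uses integer time moments. The helper `build_timeline` and all other names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def build_timeline(streams, t_end):
    """Hypothetical helper: set of active burst pointers per integer time moment."""
    return [
        {(sid, n) for sid, bursts in streams.items()
         for n, (start, end) in enumerate(bursts) if start <= t <= end}
        for t in range(t_end + 1)
    ]


def ls_mine(timeline, burst_counts, jt):
    """Return {(i, j): J} for stream pairs with J >= jt (single-bin sketch)."""
    overlaps = {}          # OVERLAPS: burst pointer -> set of overlapping streams
    o = defaultdict(int)   # o[i, j]: number of bursts of i overlapping j
    previous = set()
    for current in timeline:
        new, old, ending = current - previous, current & previous, previous - current
        for burst in ending:
            del overlaps[burst]
        for burst in new:
            overlaps[burst] = set()
        # New overlapping bursts: NEW x OLD plus pairs within NEW.
        for x, y in list(product(new, old)) + list(combinations(new, 2)):
            if x[0] == y[0]:
                continue  # bursts of the same stream are not compared
            for p, q in ((x, y), (y, x)):
                if q[0] not in overlaps[p]:
                    # First time burst p overlaps stream q[0]: count it once.
                    overlaps[p].add(q[0])
                    o[p[0], q[0]] += 1
        previous = current
    result = {}
    for (i, j), o_ij in list(o.items()):
        if i < j:
            m = min(o_ij, o[j, i])
            jac = m / (burst_counts[i] + burst_counts[j] - m)
            if jac >= jt:
                result[i, j] = jac
    return result
```

A usage example with illustrative data that matches the counts of the earlier worked example (e_i = 4, o_ij = 3, e_j = 3, o_ji = 2):

```python
streams = {"i": [(0, 2), (3, 4), (10, 11), (15, 16)],
           "j": [(1, 3), (11, 12), (20, 21)]}
timeline = build_timeline(streams, 21)
found = ls_mine(timeline, {"i": 4, "j": 3}, jt=0.3)
```

Only pairs with at least one overlap ever enter `o`, so the final validation loop touches far fewer pairs than an all-pairs comparison.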