Wavelet decomposition of data streams by Dragana Veljkovic
8/4/2019 Wavelet Decomposition
http://slidepdf.com/reader/full/wavelet-decomposition 1/34
Motivation
• Continuous data streams arise naturally in:
  • telecommunication and internet traffic
  • retail and banking transactions
  • web server log records, etc.
• Many applications need this data to be processed on a 24/7 basis in only one pass
Motivation cont.
• Usually this data is accumulated and archived for later use, but not always (e.g. network security)
• The ability to make decisions and interpret interesting patterns online can be crucial and has real dollar value for large corporations (e.g. fraud detection)
Our motivation
• Currently working on data collected from 100 electrodes recording the electrical potential of a monkey brain over long periods of time
• We want to look at this data in real time and seek patterns, trends and surprises
Outline
• Background
  • streams
  • wavelets
  • sketches
  • error analysis
• Results
• Implementation details
• Strengths and weaknesses of this approach
Data streams
• An unbounded sequence of real-time data arriving at a high rate that an application can read only once
• Problems:
  • Unbounded memory requirements
  • High data rate
Underlying signal
• The signal is a one-dimensional function
a: [0, …, N-1] → Z+
• Each data item that arrives over time is an ordered pair:
<domain, value>
Example : voting results
<Texas, 60>
Example : phone call records
<210-748, 12>
Data model
Two different data models are used for rendering the underlying signal:
• Cash register
• Aggregate
Example: cash register model
<210-748, 10>, <210-689, 13>, <210-748, 20>, <210-740, 5>, <210-748, 2>, <210-740, 30>, …
where the underlying signal is
<210-748, 32>, <210-689, 13>, <210-740, 35>
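The relationship between the cash-register stream and the underlying signal in the example above can be sketched in a few lines of Python. The function name `aggregate` is a hypothetical helper introduced here; it stores one counter per domain value, so it only illustrates the data model and is not a small-space streaming algorithm.

```python
from collections import defaultdict

def aggregate(stream):
    """Sum the increments of a cash-register stream per domain value."""
    signal = defaultdict(int)
    for domain, value in stream:
        signal[domain] += value  # each item adds to the running total for its domain
    return dict(signal)

# The phone-call example from the slide.
stream = [("210-748", 10), ("210-689", 13), ("210-748", 20),
          ("210-740", 5), ("210-748", 2), ("210-740", 30)]
underlying = aggregate(stream)
print(underlying)
```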
Stream format
Two distinct formats for the stream:
– Ordered
– Unordered
Example: aggregate ordered stream – any time series
Example: unordered cash-register stream – phone call records
An ordered cash-register stream is trivial to convert to an ordered aggregate stream
Wavelets
• Basis functions of limited duration and average value of zero
• Basis functions are shifted and scaled versions of the original wavelet
Discrete wavelet transform
• Uses only fixed values for wavelet scales, based on powers of two
• Wavelet positions are also fixed and non-overlapping
• Wavelets form a set of wavelet basis vectors of length N
Example: Haar wavelets on a signal of length N = 8
• j = 1, …, log N levels
• k = 0, …, 2 j-1 spaces for each level
Haar wavelets for signal of size 8
Wavelet decomposition
• Wavelet decomposition can be regarded as the projection of the signal onto the set of wavelet basis vectors
• Each wavelet coefficient can be computed as the dot product of the signal with the corresponding basis vector
Example:
Table 1. from Gilbert et al. 2003.
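To make the "dot product with a basis vector" view concrete, here is a minimal sketch that builds Haar basis vectors explicitly and projects a signal onto them. It assumes an orthonormal normalization (each vector has unit length), which may differ from the scaling used in the cited table, and the signal is an arbitrary one chosen for illustration. The helper names `haar_basis` and `dot` are hypothetical.

```python
import math

def haar_basis(n):
    """Orthonormal Haar basis for length n (n must be a power of two)."""
    basis = [[1 / math.sqrt(n)] * n]       # scaling vector (overall average)
    width = n
    while width >= 2:                      # coarsest wavelets first
        half, amp = width // 2, 1 / math.sqrt(width)
        for start in range(0, n, width):   # fixed, non-overlapping positions
            v = [0.0] * n
            for i in range(start, start + half):
                v[i] = amp                 # positive half of the wavelet
            for i in range(start + half, start + width):
                v[i] = -amp                # negative half of the wavelet
            basis.append(v)
        width //= 2
    return basis

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

signal = [2, 2, 0, 2, 3, 5, 4, 4]          # illustrative signal, N = 8
basis = haar_basis(len(signal))
coeffs = [dot(signal, v) for v in basis]   # one coefficient per basis vector

# With an orthonormal basis, the signal is the coefficient-weighted sum of the vectors.
recon = [sum(c * v[i] for c, v in zip(coeffs, basis)) for i in range(len(signal))]
```

Storing all N basis vectors costs O(N^2) memory, so this form is only useful for small examples; it is meant to show the projection, not to be efficient.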
Best B-term decomposition
• The signal can be fully recovered from the wavelet decomposition
• Best B-term decomposition uses only a small number of coefficients, B, that carry the highest energy
• The signal reconstructed using the B-term coefficients and the corresponding vectors is called the best B-term approximation
• Most signals that occur in nature can be well approximated using only a small number of coefficients (5-10).
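The following sketch illustrates a best B-term approximation offline, on a signal held entirely in memory. It assumes an orthonormal Haar transform computed by repeated pairwise sums and differences; the function names and the example signal are hypothetical and chosen only for illustration.

```python
import math

def haar_transform(signal):
    """Orthonormal Haar transform; returns [scaling, coarsest detail, ..., finest details]."""
    smooth, details = [float(x) for x in signal], []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        details = [(l - r) / math.sqrt(2) for l, r in pairs] + details
        smooth = [(l + r) / math.sqrt(2) for l, r in pairs]
    return smooth + details

def inverse_haar(coeffs):
    """Invert haar_transform level by level, from coarse to fine."""
    smooth, pos = coeffs[:1], 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(smooth)]
        pos += len(smooth)
        nxt = []
        for s, d in zip(smooth, level):
            nxt += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        smooth = nxt
    return smooth

def best_b_term(signal, b):
    """Keep the b largest-magnitude coefficients, zero the rest, reconstruct."""
    coeffs = haar_transform(signal)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:b])
    kept = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return inverse_haar(kept), kept

signal = [2, 2, 0, 2, 3, 5, 4, 4]
approx, kept = best_b_term(signal, 2)
sse = sum((x - y) ** 2 for x, y in zip(signal, approx))
```

Because the transform is orthonormal, the squared error of the approximation equals the energy of the dropped coefficients, which is why keeping the largest-magnitude coefficients is the best choice for a fixed B.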
Computing best B-term decomposition at run time
For the ordered aggregate model
• Maintain two sets of items:
  • The highest B wavelet basis coefficients for the signal seen so far
  • log N straddling coefficients, one for each level
• When a data item is read, the affected straddling coefficients are updated.
• If a coefficient is no longer straddling, it is compared to the existing highest B coefficients and the set is updated if necessary. A new straddling coefficient is then initialized.
• Takes O(B + log N) storage and time for the ordered aggregate model
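A minimal sketch of the one-pass idea for the ordered aggregate model follows. Instead of updating each straddling coefficient on every item, it keeps one pending partial (scaling) value per level and finalizes a detail coefficient as soon as both halves of its support have been seen; this is an equivalent bookkeeping chosen for brevity and is an assumption, not necessarily the paper's exact procedure. It also assumes orthonormal Haar wavelets. The function name `stream_top_b_haar` is hypothetical.

```python
import heapq
import math

def stream_top_b_haar(values, b):
    """One pass over an ordered aggregate stream, keeping the b largest Haar details.

    Returns (top, pending): top is a list of (level, k, coefficient) sorted by
    decreasing magnitude; pending maps level -> partial value still waiting
    for its sibling. Storage is O(b + log N).
    """
    pending = {}   # level -> scaling value whose support is only half complete
    counts = {}    # level -> number of detail coefficients finalized so far
    heap = []      # min-heap of (|coefficient|, level, k, coefficient)
    for x in values:
        s, level = float(x), 0
        while level in pending:                  # sibling available: finalize one level up
            left = pending.pop(level)
            detail = (left - s) / math.sqrt(2)
            s = (left + s) / math.sqrt(2)
            level += 1
            k = counts.get(level, 0)
            counts[level] = k + 1
            item = (abs(detail), level, k, detail)
            if len(heap) < b:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)    # evict the smallest kept coefficient
        pending[level] = s                       # wait for the next sibling at this level
    top = [(lvl, k, c) for _, lvl, k, c in sorted(heap, reverse=True)]
    return top, pending

top, pending = stream_top_b_haar([2, 2, 0, 2, 3, 5, 4, 4], 1)
```

The dictionary `pending` never holds more than one entry per level, and the heap never holds more than `b` entries, which matches the O(B + log N) storage bound stated above.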
Sketches
• A sketch is made by projecting a signal onto several different low-dimensional spaces which are chosen at random
• Many properties of the signal, such as histograms, can be accurately estimated by looking at the sketch
Definition of a sketch
• An atomic sketch of signal a is the dot product <a, r>, where r is a random vector of ±1-valued random variables
• A sketch of a signal is k independent atomic sketches, each with a different random vector r^j
• Sketch size is small compared to the signal size
Sketches
• Maintaining the sketch is easy as we are receiving the data
• If element <i, a(i)> arrives, add a(i)*r^j_i to the sketch corresponding to random vector r^j
Example: in the cash-register model, if we receive <5, 10>, we need to add 10*r^j_5 to each atomic sketch corresponding to the random vector r^j
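The update rule above can be sketched as follows. For simplicity this example stores the ±1 vectors explicitly, which is only feasible for small N and is an assumption made for illustration. The class name `Sketch` and its fields are hypothetical.

```python
import random

class Sketch:
    """k atomic sketches <a, r^j> maintained under cash-register updates."""

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # Assumption: vectors stored explicitly; only practical for small n.
        self.r = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        self.atoms = [0.0] * k

    def update(self, i, value):
        """Process stream item <i, value>: add value * r^j_i to every atomic sketch."""
        for j, row in enumerate(self.r):
            self.atoms[j] += value * row[i]

sk = Sketch(n=8, k=4, seed=1)
for i, v in [(5, 10), (2, 3), (5, -4)]:
    sk.update(i, v)
```

Because each update is a linear operation, the sketch of the stream equals the sketch of the aggregated signal, regardless of the order in which the items arrive.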
Error metrics
• SSE (sum squared error) – if R is a representation of the signal a, then SSE is defined as
$\mathrm{SSE} = \|a - R\|_2^2 = \sum_i \left(a(i) - R(i)\right)^2$
• Pseudoenergy of the representation R is computed as
$\|a\|_2^2 - \|a - R\|_2^2$
Query processing
• Batched – queries are posed at certain periodic intervals
• Ad hoc – a query may be posed at any time
Batch query using best B-term approximation for day 0 of call records
Figure 2. from Gilbert et al. 2003.
Batch query using best B-term approximation for all 7 days of call records
Figure 3. from Gilbert et al. 2003.
Estimating a point query
The answer to point query i is a(i)
• Direct point estimate – directly estimate a(i) using the sketch
• Direct wavelet estimate – use the sketch to estimate the wavelet coefficients whose support contains i and reconstruct a(i) from these coefficients
• Another way is to compute a(i) using only the high wavelet coefficients (as in the known B-term approximation) whose support contains i
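One way to read the direct point estimate is as a dot-product estimate between the signal and the unit vector at position i, which reduces to multiplying each atomic sketch by the matching entry r^j_i and combining the results. The sketch below uses a plain average over the atomic sketches; that choice, the explicit storage of the random vectors, and all function names are assumptions made for illustration.

```python
import random
import statistics

def make_vectors(n, k, seed=0):
    """k explicit random ±1 vectors of length n (assumption: small n)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def sketch(signal, vectors):
    """Atomic sketches <a, r^j> for every random vector."""
    return [sum(x * r for x, r in zip(signal, row)) for row in vectors]

def point_estimate(atoms, vectors, i):
    """Estimate a(i): each term atom_j * r^j_i has expectation a(i)."""
    return statistics.mean(s * row[i] for s, row in zip(atoms, vectors))

vectors = make_vectors(n=16, k=32, seed=3)
signal = [0] * 16
signal[2] = 5
atoms = sketch(signal, vectors)
estimate = point_estimate(atoms, vectors, 2)
```

For a signal with a single nonzero entry the estimate at that position is exact; in general the other entries contribute noise whose size depends on the remaining energy of the signal and on the number of atomic sketches.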
Using sketches to estimate dot product
• The following parameters characterize how well the sketch does:
  • ε – distortion parameter
  • δ – failure probability
  • η – failure threshold
• A sketch of a signal consists of independent atomic sketches, each with a different random vector
• If the cosine between vectors a and b is greater than η, we estimate the dot product within (1±ε) with probability at least 1 − δ
Sketches and random vectors
• If element <i, a(i)> arrives, add a(i)*r^j_i to the sketch corresponding to random vector r^j
• In order to use the sketches we need to get the elements of r^j quickly.
• Because r^j is of size N, it cannot be stored explicitly
Generating random vectors
• The paper shows that r^j_i can be generated by a pseudorandom number generator using a seed s_j of size log^O(1) N
• Generator G is based on second-order Reed-Muller codes
• The generator G takes s_j and i and outputs r^j_i = G(s_j, i) quickly
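As a rough illustration of generating entries on demand from a short seed, the sketch below evaluates a random polynomial of degree at most two over GF(2) on the bits of i; the evaluations of such polynomials are exactly the codewords of a second-order Reed-Muller code. This is one plausible reading of the construction and may differ in detail from the generator in the paper. The names `make_seed`, `parity` and `G` are hypothetical, and the seed uses O(log^2 N) bits.

```python
import random

def make_seed(m, rng):
    """Random degree-<=2 polynomial over GF(2) in m variables (m = log2 N)."""
    const = rng.getrandbits(1)
    linear = rng.getrandbits(m)                      # bit t set: term x_t is present
    # Row t holds the partners u > t of the quadratic terms x_t * x_u.
    quad = [rng.getrandbits(m) & ~((1 << (t + 1)) - 1) for t in range(m)]
    return const, linear, quad

def parity(x):
    return bin(x).count("1") & 1

def G(seed, i):
    """Return r_i in {-1, +1} by evaluating the seed polynomial at the bits of i."""
    const, linear, quad = seed
    bit = const ^ parity(linear & i)
    for t, row in enumerate(quad):
        if (i >> t) & 1:
            bit ^= parity(row & i)
    return 1 - 2 * bit                               # map bit 0 -> +1, bit 1 -> -1

seed = make_seed(10, random.Random(7))               # N = 1024
entries = [G(seed, i) for i in range(1024)]
```

Each call touches only the seed and the index i, so no vector of length N is ever stored.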
Estimation of dot products using sketches
Lemma: Let X be an O(log N / δ)-wise median of O(1/ε^2)-wise means of independent copies of […]; then we have, with probability 1 − δ, […]
Note: use b = a to estimate the energy of a using this lemma
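The median-of-means structure named in the lemma can be sketched as follows. The basic estimator used here, the product of the two atomic sketches <a, r><b, r>, is the standard choice for dot products with ±1 vectors and is an assumption on our part, since the expression did not survive in the slide text. The group sizes are left as plain parameters rather than derived from ε and δ, and the function name `estimate_dot` is hypothetical.

```python
import random
import statistics

def estimate_dot(a, b, groups, per_group, seed=0):
    """Median over `groups` of the mean of `per_group` products <a, r><b, r>."""
    rng = random.Random(seed)
    n = len(a)
    means = []
    for _ in range(groups):
        total = 0.0
        for _ in range(per_group):
            r = [rng.choice((-1, 1)) for _ in range(n)]   # fresh random ±1 vector
            sa = sum(x * s for x, s in zip(a, r))         # atomic sketch of a
            sb = sum(y * s for y, s in zip(b, r))         # atomic sketch of b
            total += sa * sb                              # expectation is <a, b>
        means.append(total / per_group)                   # averaging reduces variance
    return statistics.median(means)                       # median boosts confidence

a = [0, 3, 0, 0]
energy = estimate_dot(a, a, groups=5, per_group=4)        # b = a estimates the energy of a
```

Averaging controls the distortion, and taking the median of several averages drives down the failure probability, which is why the two counts scale with the accuracy and confidence parameters respectively.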
Example:
We want to estimate the dot product of vectors a and b with no more than 30% error with probability of 80%, assuming the cosine between these two vectors is greater than 0.25
That is, ε = 0.3, η = 0.25 and δ = 0.2, and for a signal of size N = 1024 we would need about 30 atomic sketches
Theorem
There is a streaming algorithm, A, such that, given a signal a[1, …, N] with energy $\|a\|_2^2$, if there is a B-term representation with energy at least $\eta\|a\|_2^2$, then, with probability at least 1 − δ, A finds a representation of at most B terms with pseudoenergy at least $(1-\varepsilon)\eta\|a\|_2^2$. If there is no such B-term representation with energy $\eta\|a\|_2^2$, A reports "no good representation". In any case A uses […] space and per-item time while processing the stream. This holds for both the aggregate and cash-register models.
Example: take η = 0.3, δ = 0.2, ε = 0.3 and B = 10. Then, if there exists a 10-term representation of the signal that captures at least 30% of the signal's energy, the algorithm will, with 80% probability, output a 10-term representation with pseudoenergy at least (1 − ε)η = 0.7 × 0.3 = 21% of the signal's energy
Strengths and weaknesses
• Good example of how to work with cash-register models
• Shows several ways to estimate the signal using a sketch
• Time requirements seem higher than the paper claims
• On-line algorithms do not seem as promising as batch algorithms
References
1. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "One-pass wavelet decomposition of data streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003.
2. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "Surfing wavelets on streams: one-pass summaries for approximate aggregate queries," Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
3. A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "Fast, small-space algorithms for approximate histogram maintenance," STOC '02, May 19-21, 2002, Montreal, Quebec, Canada.
Answering queries on-line
Comparison of SSE/energy of top-B wavelets against direct estimates
Table 1. from Gilbert et al. 2003.
Table 2. from Gilbert et al. 2003.
Direct estimates for the top 10 heavy hitters
Figure 6. from Gilbert et al. 2003.
Direct estimates for the top 10 heavy hitters using the greedy algorithm
Figure 7. from Gilbert et al. 2003.
Adaptive greedy pursuit for heavy hitters
• Obtain a very accurate estimate for the first heavy hitter
• Get a new sketch by subtracting this value from the original sketch. This can be done because sketches are linear
• The new sketch is a good estimate of the residual distribution, in which the second heavy hitter is the peak value
• Use the new sketch to estimate the second heavy hitter
• Repeat the procedure for more heavy hitters
• Each estimate introduces an error, and after many iterations the errors tend to overwhelm the benefits
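The greedy procedure above can be sketched as follows, relying on the linearity of sketches to subtract each estimated heavy hitter. The averaging point estimator, the explicitly stored random vectors, and the function names are assumptions made for illustration; the "very accurate estimate" step of the slide is approximated here by the same estimator.

```python
import random

def make_vectors(n, k, seed=0):
    """k explicit random ±1 vectors of length n (assumption: small n)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def sketch(signal, vectors):
    return [float(sum(x * r for x, r in zip(signal, row))) for row in vectors]

def greedy_heavy_hitters(atoms, vectors, count):
    """Repeatedly estimate the peak, record it, and subtract it from the sketch."""
    atoms = list(atoms)                       # work on a copy of the sketch
    n = len(vectors[0])
    hitters = []
    for _ in range(count):
        estimates = [sum(s * row[i] for s, row in zip(atoms, vectors)) / len(atoms)
                     for i in range(n)]
        i = max(range(n), key=lambda idx: estimates[idx])   # current peak
        hitters.append((i, estimates[i]))
        for j, row in enumerate(vectors):     # linearity: sketch(a - v*e_i) = sketch(a) - v*r^j_i
            atoms[j] -= estimates[i] * row[i]
    return hitters

vectors = make_vectors(n=8, k=16, seed=5)
signal = [7, 0, 0, 0, 0, 0, 0, 0]
hitters = greedy_heavy_hitters(sketch(signal, vectors), vectors, 1)
```

Each subtraction uses an estimated value rather than the true one, so the residual sketch carries that estimation error forward; this is the accumulation of errors that the last bullet warns about.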