Preference Queries in Data Streams : a Survey M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos Data Engineering Lab Department of Informatics Aristotle University of Thessaloniki
Continuous Processing of Preference Queries in
Data Streams : a Survey
M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos
Data Engineering Lab, Department of Informatics
Aristotle University of Thessaloniki
Presentation Layout
• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary
Data Streams
A data stream is an infinite sequence of objects.
Each object can be one-dimensional or multi-dimensional.
Streaming time series are finite sequences of objects.
A streaming time series changes over time.
The arrival rate of objects usually varies.
Sliding Window Model (1)
[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked "expired" or "active"]
Count-based window: the sliding window contains the W most recent tuples (“active”).
Older tuples expire.
Sliding Window Model (2)
[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked "expired" or "active"]
Time-based window: the sliding window contains the tuples (“active”) of the W most recent timestamps.
Older records expire.
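The two window models can be sketched as follows. This is a minimal illustration, not code from the surveyed systems; the class names `CountWindow` and `TimeWindow` and their `insert` methods are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the W most recent tuples (hypothetical helper)."""

    def __init__(self, size):
        self.active = deque(maxlen=size)

    def insert(self, item):
        """Add a tuple and return the list of tuples that expired because of it."""
        expired = []
        if len(self.active) == self.active.maxlen:
            expired.append(self.active[0])
        self.active.append(item)  # a full deque drops its oldest element here
        return expired


class TimeWindow:
    """Time-based window: keeps the tuples of the W most recent timestamps."""

    def __init__(self, span):
        self.span = span
        self.active = deque()  # (timestamp, item) pairs in arrival order

    def insert(self, timestamp, item):
        """Add a tuple and return the tuples whose timestamp fell out of the window."""
        self.active.append((timestamp, item))
        expired = []
        while self.active and self.active[0][0] <= timestamp - self.span:
            expired.append(self.active.popleft()[1])
        return expired


cw = CountWindow(5)
for i in range(1, 9):
    cw.insert(f"t{i}")
print(list(cw.active))  # the 5 most recent tuples
```

With a count-based window exactly one tuple expires per arrival once the window is full, whereas a time-based window may expire any number of tuples when the clock advances.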
Continuous Evaluation in a Data Stream System
[Figure: two diagrams. Database System: the User / Application supplies Input and a Query and receives a Result. Data stream system: the User / Application supplies a Query to the Query processor and receives a Result.]
Motivation (1)
Numerous data stream contexts:
• Financial data analysis
• Network management
• Astronomical data analysis
• Sensor networks
• Telecommunication data management
Motivation (2)
Preference queries
• Useful decision support tool
• Many applications in data streams
Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls.
Continuous skyline query
Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers.
Continuous top-k dominating query
Skyline Query
[Figure: hotels T1 to T6 plotted on price and distance axes]
Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5
Dominant tuple: A tuple t dominates another tuple t’ if
• t is not worse than t’ in all dimensions, and
• t is better than t’ in at least one dimension.
Skyline: contains all the tuples not dominated by any other tuple.
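The dominance test and the skyline definition can be checked on the hotel data with a short sketch. It assumes that smaller values are better in both dimensions (cheap and close hotels), which the slide does not state explicitly; the function names are illustrative only.

```python
def dominates(a, b):
    """True if a dominates b, assuming smaller is better in every dimension."""
    not_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse and better


def skyline(points):
    """Names of all tuples not dominated by any other tuple (naive O(n^2) scan)."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# (price, distance) per hotel, as listed in the table
hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}
print(sorted(skyline(hotels)))  # ['T1', 'T2', 'T3']
```

Under this assumption T3 dominates T4, T5 and T6, while T1, T2 and T3 are mutually incomparable.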
Continuous Skyline Query
Problem definition: We have to continuously evaluate a skyline query in multidimensional streaming time series.
Application example: network data
• Computers with suspicious behavior.
• Network traffic, number of connections, number of destinations.
Basic Idea
Skyline changes due to:
• The insertion of a new skyline tuple.
• The expiration of a skyline tuple.
LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]
• Use of a spatial index
• Advantage: simple implementation
• Disadvantage: the expiration of a skyline tuple is not handled efficiently
Event Approach (1)
Existing skyline tuple expires:
• How can we find new skyline tuples?
• Very costly operation
Skyline influence time (SIT)
• Minimum time in which a tuple may become a skyline tuple.
• Generate events based on SIT
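One way to read the SIT definition is sketched below. It assumes that a tuple can enter the skyline only after its youngest older dominator has expired, that a tuple arriving at time a expires at time a + W, and that smaller values are better; the function name and the sample points are hypothetical and are not taken from the slides.

```python
def dominates(a, b):
    """True if a dominates b, assuming smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_influence_time(new, window_tuples, w):
    """Earliest time at which `new` may join the skyline.

    `new` and every entry of `window_tuples` are (arrival_time, point) pairs.
    Assumption: SIT is the expiration time of the youngest older dominator,
    or the arrival time itself when no older tuple dominates `new`.
    """
    arrival, point = new
    dominator_arrivals = [
        a for a, p in window_tuples if a < arrival and dominates(p, point)
    ]
    if not dominator_arrivals:
        return arrival
    return max(dominator_arrivals) + w


# Hypothetical 2-d points with arrival times 8, 9 and 10
window = [(8, (1, 5)), (9, (2, 2)), (10, (6, 1))]
print(skyline_influence_time((11, (3, 3)), window, w=10))  # 19
```

An event scheduled at the returned time lets the system re-examine the tuple only when it can actually become part of the skyline.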
Event Approach (2)
[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11) and L(12) plotted for a window of W=10]
K.SIT=19
Tuple K can be discarded due to tuple L (younger and better)
Eager [Tao, TKDE06]
• Advantage: handles skyline expiration
• Disadvantage: processing time per tuple
n-of-N Skyline Queries (1)
[Figure (source: icde05): n-of-N definition, with S6 = {a,c} and S4 = {c,g}]
n-of-N Skyline Queries (2)
[Figure (source: icde05): n-of-N definition, with S6 = {c,h} and S4 = {e,h}]
Method cnN (1)
Method cnN [Lin, ICDE05] is also based on events
• Tuple K is redundant because tuple L is better and younger than K
• Tuple L is dominated by D and E.
• The dominance relation between L and E is critical because E is the youngest tuple which dominates L
[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11) and L(12) plotted for a window of W=10]
Method cnN (2)
Generate intervals
• For the skyline tuples, e.g. C = (0,3]
• For the critical dominance relations, C -> G = (3,7]
• Use an interval tree to store them
Dominance graph contains all the critical dominance relations
[Figure: dominance graph over tuples A(1), B(2), C(3), D(4), E(5), F(6) and G(7), marking redundant tuples and critical dominance relations]
Method cnN (3)
A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M–n+1, where M is the total number of elements seen so far.
To answer an n-of-N query, apply an (M–n+1) stabbing query.
Intervals (M = 7):
• C = (0,3]
• D = (0,4]
• C -> G = (3,7]
• D -> E = (4,5]
• D -> F = (4,6]
For n = 6, M–n+1 = 2: S6 = {C, D}
For n = 4, M–n+1 = 4: S4 = {D, G}
[Figure: dominance graph over tuples A(1) to G(7) with the stabbing query]
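The stabbing-query step can be reproduced on the intervals of this example. The sketch below uses a linear scan instead of the interval tree named on the slide, and it assumes that the interval of a critical dominance relation a -> b is attributed to the dominated tuple b; the function name is illustrative.

```python
# (tuple reported, open lower bound, closed upper bound), i.e. the interval (lo, hi]
intervals = [
    ("C", 0, 3),  # skyline tuple C = (0,3]
    ("D", 0, 4),  # skyline tuple D = (0,4]
    ("G", 3, 7),  # critical dominance C -> G = (3,7]
    ("E", 4, 5),  # critical dominance D -> E = (4,5]
    ("F", 4, 6),  # critical dominance D -> F = (4,6]
]


def n_of_N_skyline(intervals, M, n):
    """Tuples whose interval contains the stabbing value M - n + 1."""
    stab = M - n + 1
    return sorted(t for t, lo, hi in intervals if lo < stab <= hi)


M = 7  # number of elements seen so far
print(n_of_N_skyline(intervals, M, 6))  # ['C', 'D']
print(n_of_N_skyline(intervals, M, 4))  # ['D', 'G']
```

Because every value of n maps to a single stabbing point over the same interval set, one structure serves all n-of-N queries at once.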
Method cnN (4)
Advantages
• Good use of skyline properties
• Multiple query processing
Disadvantages
• Processing time per tuple
• Increased memory requirements
Frequent Skyline - Motivation
Highly dynamic environment
• The skyline results are meaningful only if the skyline tuples appear consistently
Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]
Streaming Model
Client/Server architecture
• Server receives object updates from the clients.
• Each object can be represented as a d-dimensional point.
• Object update (point movement in the d-dimensional space): the value changes in at least one dimension
• Object insertion or deletion: point movement from/to a nonexistent position
Minimization of communication cost
Filter
Safe region technique
• Skyline remains unchanged if each object stays in a safe region
• Communication happens only when the safe region is violated
• Safe region approach leads to communication optimization
[Figure (source: sigmod09): an object as a point and its filter (safe region)]
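The client side of the safe-region idea can be sketched as below. This is only an illustration of "report on violation": the class `FilteredClient` is hypothetical, and the fixed-size axis-aligned box is an assumption made for brevity, whereas the actual method derives safe regions so that the skyline is guaranteed to stay unchanged.

```python
class FilteredClient:
    """Hypothetical client that reports a position only when it leaves its safe region."""

    def __init__(self, position, half_width):
        self.half_width = half_width
        self.messages = 0  # number of updates sent to the server
        self._recenter(position)

    def _recenter(self, position):
        # Assumption: the safe region is a box centered on the last reported position.
        self.low = tuple(x - self.half_width for x in position)
        self.high = tuple(x + self.half_width for x in position)

    def update(self, position):
        """Return True if the new position had to be communicated to the server."""
        inside = all(
            lo <= x <= hi for lo, x, hi in zip(self.low, position, self.high)
        )
        if inside:
            return False  # safe region respected: no communication
        self.messages += 1
        self._recenter(position)
        return True


client = FilteredClient(position=(5.0, 5.0), half_width=1.0)
for pos in [(5.5, 5.2), (7.0, 5.0), (7.5, 5.5)]:
    client.update(pos)
print(client.messages)  # only the second move left the safe region
```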
Sampling
All clients report their skyline at the same sampled time
The clients are synchronized with the same random seed
Guaranteed quality if sampling rate is high enough
Hybrid
Hybrid solution
• Combines Filter and Sampling
• Small changes: apply Filter
• Larger changes: apply Sampling
Disadvantage of all three methods: energy consumption is not uniform (critical in sensor networks)
k-dominant Skyline Query - Motivation
Skyline: contains tuples not dominated by any other tuple.
Disadvantage: High dimensionality problem.
Solution: Relax the notion of dominance.
k-dominant tuple: A tuple t k-dominates another tuple t’ if
• t is not worse than t’ in at least k dimensions, and
• t is better than t’ in at least one of them.
k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]
k-dominant Skyline Query - Example
      D1  D2  D3  D4  D5  D6
T1    6   5   4   3   2   1
T2    5   4   3   5   4   3
T3    6   6   2   2   6   5
T4    6   6   6   1   6   6
T5    6   6   6   5   5   5
Conventional skyline: {T1, T2, T3, T4}
5-dominant skyline: {T1, T2, T3}
4-dominant skyline: {T1, T2}
Smaller k, fewer tuples in the k-dominant skyline
T1 dominates T5
T1 5-dominates T4
T1 4-dominates T3
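The three skylines of this example can be recomputed directly from the k-dominance definition. The sketch assumes that smaller values are better, which is the reading under which the stated relations (for instance, T1 dominates T5) hold; the function names are illustrative.

```python
def k_dominates(a, b, k):
    """True if a is not worse than b in at least k dimensions and better in one of them.

    Assumes smaller is better. Any dimension where a is strictly better is also
    a dimension where a is not worse, so counting both is sufficient.
    """
    not_worse = sum(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse >= k and better


def k_dominant_skyline(points, k):
    """Names of all tuples not k-dominated by any other tuple."""
    return {
        name
        for name, p in points.items()
        if not any(k_dominates(q, p, k) for other, q in points.items() if other != name)
    }


data = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}
for k in (6, 5, 4):
    print(k, sorted(k_dominant_skyline(data, k)))
```

With k equal to the number of dimensions the function reduces to the conventional skyline.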
Observations
Traditional or streaming skyline methods are inappropriate
• Skyline properties do not hold, e.g. the transitive property
• k-dominance can be cyclic
Existence of multiple users and multiple queries.
Method CoSMuQ (1)
• A query on D dimensions arrives.
• Given a parameter value k, split the query into subqueries of d=k dimensions.
• Compute the conventional skyline of each subquery.
• The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
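The intersection property can be illustrated with a naive, non-streaming sketch. It is not the CoSMuQ implementation: it simply enumerates every k-dimensional subspace, computes a conventional skyline in each, and intersects the results. Smaller values are assumed to be better, and the data are the five tuples of the k-dominant skyline example.

```python
from itertools import combinations


def dominates(a, b):
    """Conventional dominance, assuming smaller is better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def subspace_skyline(points, dims):
    """Conventional skyline of the points projected onto the dimensions in `dims`."""
    proj = {name: tuple(p[d] for d in dims) for name, p in points.items()}
    return {
        name
        for name, p in proj.items()
        if not any(dominates(q, p) for other, q in proj.items() if other != name)
    }


def k_dominant_via_subspaces(points, k):
    """Intersect the skylines of all subqueries on k of the D dimensions."""
    num_dims = len(next(iter(points.values())))
    result = set(points)
    for dims in combinations(range(num_dims), k):
        result &= subspace_skyline(points, dims)
    return result


data = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}
print(sorted(k_dominant_via_subspaces(data, 4)))  # ['T1', 'T2']
```

The number of subspaces grows combinatorially with D, which is consistent with the memory concern raised for high dimensionality.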
Method CoSMuQ (2)
Advantages
• Based on conventional skyline (simple domination checks)
• Properties of conventional skylines can be used
• Exploits the overlap between different queries.
Disadvantages
• Memory requirements increase in high dimensionality.
Continuous Skyline Methods - Summary
Method               Query Type          Window Type  Multiple Queries
LookOut              skyline             time         no
Lazy and Eager       skyline             both         no
n-of-N               skyline             count        yes
Filter and Sampling  frequent skyline    time         no
CoSMuQ               k-dominant skyline  both         yes
Top-k query - Example
[Figure: hotels T1 to T6 plotted on price and distance axes, with the results for k=1 and k=2 under F=price+distance]
Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5
Given a preference function, a top-k query returns the k tuples with the best scores.
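A top-k query over the hotel table can be sketched in a few lines. The sketch assumes that a lower value of F=price+distance is a better score, which the slide does not state; `top_k` is an illustrative name.

```python
import heapq


def top_k(points, k, score):
    """Names of the k tuples with the lowest score (assumption: lower is better)."""
    return heapq.nsmallest(k, points, key=lambda name: score(points[name]))


# (price, distance) per hotel, as listed in the table
hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}


def price_plus_distance(p):
    return p[0] + p[1]


print(top_k(hotels, 1, price_plus_distance))  # ['T3']
```

T1 and T2 both score 5 under this function, so a top-2 answer needs a tie-breaking rule; the sketch leaves that to the order in which the tuples are listed.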
Continuous Top-k Query
Problem definition: Continuous evaluation of top-k query in multidimensional streaming time series.
Application example: network data
• Top-100 flows with the largest individual throughput
• Common destination
• DDoS attack
Basic Idea
A new tuple changes the top-k
• It should belong to the influence region of the query
Top-k tuple expiration
• Query computation from scratch
TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]
• Advantage: simple implementation
• Disadvantage: no efficient handling of an expired top-k tuple
[Figure (source: sigmod06): influence region in the (x1, x2) plane, bounded by the line defined by F = score(tk) = x1 + x2]
Skyband - Example
k-skyband: contains all the tuples which are dominated by at most k–1 other tuples.
• 1-skyband (tuples not dominated by other tuples): the 1-skyband is the skyline
• 2-skyband (tuples dominated by at most 1 other tuple)
• 3-skyband: also includes tuples dominated by 2 other tuples
[Figure: tuples A, B, C, D and E with their skyband membership]
Skyband Approach (1)
Transform tuples into the (score, expiration_time) space
Dominance counter (DC): number of tuples that are younger and better
Rule: Keep tuples with DC < k
Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space.
Hotels  score (F=price+distance)
T1      5
T2      5
T3      3.5
T4      7
T5      5.5
T6      8.5
[Figure: tuples T1 to T6 in the original space (price, distance) and in the transformed space (score, exp_time), each annotated with DC=0 or DC=1, and the top-1 tuple marked]
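The dominance counter and the DC < k rule can be sketched on the scores above. Two assumptions are made that the slide does not state: the tuples arrive (and therefore expire) in the order T1 to T6, and a lower score is better. The function names are illustrative, and the quadratic scan stands in for the incremental maintenance a monitoring algorithm would use.

```python
def dominance_counters(stream):
    """DC per tuple: number of younger tuples with a strictly better (lower) score.

    `stream` is a list of (name, score) pairs in arrival order, so every later
    entry is younger and expires later.
    """
    counters = {}
    for i, (name, score) in enumerate(stream):
        counters[name] = sum(1 for _, later in stream[i + 1:] if later < score)
    return counters


def k_skyband(stream, k):
    """Tuples kept by the rule DC < k, in arrival order."""
    counters = dominance_counters(stream)
    return [name for name, _ in stream if counters[name] < k]


# Scores F=price+distance from the table; arrival order T1..T6 is an assumption
stream = [("T1", 5), ("T2", 5), ("T3", 3.5), ("T4", 7), ("T5", 5.5), ("T6", 8.5)]
print(dominance_counters(stream))
print(k_skyband(stream, 1))  # ['T3', 'T5', 'T6']
```

A tuple with DC >= k can be dropped because at least k better tuples will outlive it, so it can never enter a top-k result.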
Skyband Approach (2)
SMA (Skyband Monitoring Algorithm) proposed in [Mouratidis, SIGMOD06]
Advantage: independent of the dimensionality
• 2-dimensional space (score-exp_time)
Disadvantage: the k-skyband may contain fewer than k tuples
• In this case, a top-k tuple expiration will cause query computation from scratch
Distributed Top-k
Continuously report the k largest values obtained from distributed data streams.
Objective is to minimize communication cost
Proposed by [Babcock, SIGMOD03]
Streaming Model
• Nodes: N1, N2, …, Nm; coordinator node: N0
• Set of n data objects O1, O2, …, On associated with real values V1, V2, …, Vn
• Value updates are represented as <Oi, Nj, Δ> tuples: Nj detects a change Δ in the value Vi of Oi. The change is not seen by other nodes Nk (k ≠ j)
• The value Vi for an object Oi: $V_i = \sum_{j=1}^{m} V_{i,j}$, where $V_{i,j}$ is the value of the i-th object in the j-th node
Method (1)
Initialize a top-k set at the coordinator node
Set arithmetic constraints at monitor nodes
• Depend on current top-k set
Constraints valid
• No communication
Constraints invalidated
• Client communicates with server
• Possibly new top-k set
• Recomputation of constraints
Method (2) - Adjustment Factors
          Object 1   Object 2
Node 1    V1,1 = 1   V2,1 = 9
Node 2    V1,2 = 3   V2,2 = 1
Top-1 = {O1}
Node 1, Local Top-1 = {O1}
Node 2, Local Top-1 = {O2}
Local top-k sets differ from the global top-k => unnecessary constraint violations => increased communication cost
Adjustment Factors (AF)
          Object 1    Object 2
Node 1    AF1,1 = 0   AF2,1 = -3
Node 2    AF1,2 = 0   AF2,2 = 3
Node 2: V1,2 = 3+0 = 3
Node 2: V2,2 = 1+3 = 4
Local top-k similar to global => low communication cost
To keep the results valid, the AFs of each object sum to zero
Disadvantage: energy consumption is not uniform
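The effect of the adjustment factors can be checked numerically on the values of this example. The stated results (global Top-1 = {O1}, local top-1 of O1 at Node 1 and O2 at Node 2) follow from the listed numbers only when the smaller value ranks first, so the sketch ranks by minimum; that ordering is inferred from the numbers and is an assumption, as are all function names.

```python
# values[node][object] = V_{object,node}; factors use the same layout
values = {1: {"O1": 1, "O2": 9}, 2: {"O1": 3, "O2": 1}}
factors = {1: {"O1": 0, "O2": -3}, 2: {"O1": 0, "O2": 3}}


def adjusted(values, factors):
    """Per-node values after adding each node's adjustment factors."""
    return {
        node: {obj: v + factors[node][obj] for obj, v in objs.items()}
        for node, objs in values.items()
    }


def global_values(per_node):
    """V_i as the sum of V_{i,j} over all nodes j."""
    totals = {}
    for objs in per_node.values():
        for obj, v in objs.items():
            totals[obj] = totals.get(obj, 0) + v
    return totals


def local_top1(per_node):
    """Best object per node (assumption: the smaller value ranks first)."""
    return {node: min(objs, key=objs.get) for node, objs in per_node.items()}


after = adjusted(values, factors)
print(local_top1(values))   # local answers disagree across nodes
print(local_top1(after))    # local answers agree after adjustment
print(global_values(values) == global_values(after))  # sums are unchanged
```

Because the factors of each object sum to zero, the global values are untouched while the local rankings line up with the global one.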
Uncertain Data
Score  Prob.
6      0.8
5      0.5
2      0.4
8      0.4
16 possible worlds:
Tuples      Pr.
2, 5, 6, 8  .064
2, 5, 6     .096
2, 6, 8     .064
2, 6        .096
2, 5, 8     .016
2, 5        .024
2, 8        .016
2           .024
5, 6, 8     .096
5, 6        .144
5, 8        .024
5           .036
6, 8        .096
6           .144
8           .024
Empty       .036
Pk-topk query: returns the k tuples with the highest probability of being in the top-k.
Top-2: {6,5} with prob. {0.64, 0.5}
Compute the probability of 6: sum the world probabilities
source: pvldb08
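The numbers on this slide can be reproduced by brute-force enumeration of the possible worlds. The stated probabilities (0.64 for tuple 6 and 0.5 for tuple 5) come out when the top-2 of a world is taken to be its two smallest scores, so the sketch makes that assumption; it also assumes independent tuples with distinct scores. This is an exponential-time check, not the compact-set method.

```python
from itertools import product

# (score, existence probability) per uncertain tuple
tuples = [(6, 0.8), (5, 0.5), (2, 0.4), (8, 0.4)]


def topk_probabilities(tuples, k):
    """Probability of each tuple being among the k best of a possible world.

    Assumption: smaller scores rank first; tuples are independent.
    """
    prob = {score: 0.0 for score, _ in tuples}
    for present in product([True, False], repeat=len(tuples)):
        p_world = 1.0
        world = []
        for (score, p), here in zip(tuples, present):
            p_world *= p if here else 1 - p
            if here:
                world.append(score)
        for score in sorted(world)[:k]:
            prob[score] += p_world
    return prob


def pk_topk(tuples, k):
    """The k tuples with the highest probability of being in the top-k."""
    prob = topk_probabilities(tuples, k)
    return sorted(prob, key=prob.get, reverse=True)[:k]


probs = topk_probabilities(tuples, 2)
print({s: round(p, 3) for s, p in probs.items()})
print(pk_topk(tuples, 2))  # [6, 5]
```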
Pk-topk Query
Solution proposed by [Jin, PVLDB08]
• Compact set based
Space-efficient solution
• Discard unnecessary tuples
• Apply several compression schemes to compress data
Disadvantages
• Model assumption: tuple probabilities are assumed to be random and independent of each other.
Continuous Top-k Methods - Summary
Method             Query Type         Window Type  Multiple Queries
TMA and SMA        top-k              both         yes
Distributed top-k  Distributed top-k  time         no
Compact set based  Pk-topk            both         no
Top-k Dominating Query - Example
[Figure: hotels T1 to T6 plotted on price and distance axes, with the results for k=1 and k=2 under F=price+distance]
Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5
Skyline: contains all the tuples not dominated by any other tuple.
Disadvantage: High dimensionality problem.
Top-k: Given a preference function, a top-k query returns the k tuples with the best scores.
Disadvantage: user-defined preference function.
Top-k dominating: the answer contains the k tuples with the highest domination power.
Combines the advantages of skyline and top-k queries and avoids their disadvantages.
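Domination power can be computed on the hotel table by counting, for each tuple, how many other tuples it dominates. The sketch assumes that smaller values are better in both dimensions and uses a naive quadratic scan over a static set; it does not reflect any particular streaming algorithm, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a dominates b, assuming smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def top_k_dominating(points, k):
    """The k tuples that dominate the most other tuples, with their scores."""
    score = {
        name: sum(dominates(p, q) for other, q in points.items() if other != name)
        for name, p in points.items()
    }
    ranked = sorted(points, key=lambda name: score[name], reverse=True)
    return [(name, score[name]) for name in ranked[:k]]


# (price, distance) per hotel, as listed in the table
hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}
print(top_k_dominating(hotels, 2))  # [('T3', 3), ('T5', 2)]
```

No preference function is needed, and the answer size is fixed at k regardless of dimensionality, which is the combination of advantages the slide refers to.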
Continuous Top-k Dominating Query
Problem definition: Continuous evaluation of top-k dominating query in multidimensional streaming time series.
Application example: sensor network
• Areas with high probability of fire outbreak
• Temperature, humidity and wind speed
EVA
Objective: reduce domination checks
Safe interval of a tuple
• Ignore the tuple for this interval
• It depends on its score and the k-th score
• End of safe interval -> event
Event
• Try to compute a new safe interval, else
• Compute score from scratch
New tuple
• Find another tuple that dominates the new one
• Estimate a lower bound of the safe interval
ADA
Advanced computation of safe interval
• Depends on the number of tuples that dominate this tuple and expire later
Candidate tuples
• Tuples with scores close to the k-th score are updated in each time instance
EVA and ADA proposed by [Kontaki 2009]
Summary
Preference queries are very useful in data streams
Presented state-of-the-art methods
• For continuous skyline queries
• For continuous top-k queries
• For continuous top-k dominating queries
Examined advantages and disadvantages of the proposed methods
Research Directions
• Continuous subspace skyline queries
• Solutions appropriate for distributed environments: uniform energy consumption
• Approximate algorithms
• Existence of multiple queries
Thank you