Continuous Processing of Preference Queries in Data Streams : a Survey M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos Data Engineering Lab Department of.

Continuous Processing of Preference Queries in Data Streams: a Survey

M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos

Data Engineering Lab, Department of Informatics

Aristotle University of Thessaloniki

Presentation Layout

• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary


Data Streams

• A data stream is an infinite sequence of objects.
• Each object can be one-dimensional or multi-dimensional.
• Streaming time series are finite sequences of objects.
• A streaming time series changes over time.
• The arrival rate of objects usually varies.

Sliding Window Model (1)

[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked "expired" or "active".]

Count-based window: the sliding window contains the W most recent tuples ("active"). Older tuples expire.

Sliding Window Model (2)

[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked "expired" or "active".]

Time-based window: the sliding window contains the tuples ("active") of the W most recent timestamps. Older records expire.
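The two window models can be sketched as follows. This is a minimal illustration, not code from the surveyed papers: the class names `CountBasedWindow` and `TimeBasedWindow` are hypothetical, and timestamps are assumed to be non-decreasing integers.

```python
from collections import deque


class CountBasedWindow:
    """Keeps the W most recent tuples; older tuples expire."""

    def __init__(self, w):
        self.active = deque(maxlen=w)

    def insert(self, item):
        # When the window is full, the oldest tuple expires on insertion.
        expired = None
        if len(self.active) == self.active.maxlen:
            expired = self.active[0]
        self.active.append(item)
        return expired


class TimeBasedWindow:
    """Keeps the tuples of the W most recent timestamps."""

    def __init__(self, w):
        self.w = w
        self.active = deque()  # (timestamp, item) pairs, oldest first

    def insert(self, timestamp, item):
        self.active.append((timestamp, item))
        expired = []
        # Timestamps timestamp-W+1 .. timestamp stay active; the rest expire.
        while self.active[0][0] <= timestamp - self.w:
            expired.append(self.active.popleft()[1])
        return expired


count_window = CountBasedWindow(5)
count_expired = [count_window.insert(t) for t in range(1, 9)]

time_window = TimeBasedWindow(5)
time_window.insert(1, "a")
time_window.insert(1, "b")
time_window.insert(3, "c")
time_expired = time_window.insert(6, "d")
```

A count-based window always holds at most W tuples, while a time-based window may hold any number of tuples, since several tuples can share a timestamp.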

[Figure: a database system, with the labels User / Application, Input, Query and Result.]

Continuous Evaluation in a Data Stream System

[Figure: a data stream system, with the labels User / Application, Query, Query processor and Result.]

Motivation (1)

Numerous data stream contexts:
• Financial data analysis
• Network management
• Astronomical data analysis
• Sensor networks
• Telecommunication data management

Motivation (2)

Preference queries
• Useful decision support tool
• Many applications in data streams

Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls.
• Continuous skyline query

Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers.
• Continuous top-k dominating query


Skyline Query

[Figure: hotels T1 to T6 plotted in the (price, distance) space.]

Hotels   price   distance
T1       4       1
T2       3       2
T3       0.5     3
T4       2.5     4.5
T5       1.5     4
T6       3.5     5

Dominant tuple: A tuple t dominates another tuple t’ if
• t is not worse than t’ in all dimensions, and
• t is better than t’ in at least one dimension.

Skyline: contains all the tuples not dominated by any other tuple.
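The dominance test and the skyline definition can be sketched in a few lines. This is a quadratic illustration over the hotel table, not one of the streaming methods surveyed here, and it assumes that smaller values (cheaper, closer) are better.

```python
def dominates(a, b):
    """True if a dominates b, assuming smaller values are better."""
    not_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse and better


def skyline(points):
    """Names of the tuples that no other tuple dominates."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# (price, distance) pairs from the hotel table.
hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}
hotel_skyline = skyline(hotels)
print(sorted(hotel_skyline))  # ['T1', 'T2', 'T3']
```

Under this assumption T4, T5 and T6 are all dominated by T3, so they drop out of the skyline.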

Continuous Skyline Query

Problem definition: We have to continuously evaluate a skyline query in multidimensional streaming time series.

Application example: network data
• Computers with suspicious behavior.
• Attributes: network traffic, number of connections, number of destinations.

Basic Idea

The skyline changes due to:
• The insertion of a new skyline tuple.
• The expiration of a skyline tuple.

LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]
• Use of a spatial index
• Advantage: simple implementation
• Disadvantage: the expiration of a skyline tuple is not handled efficiently

Event Approach (1)

An existing skyline tuple expires:
• How can we find the new skyline tuples?
• Very costly operation

Skyline influence time (SIT)
• The minimum time at which a tuple may become a skyline tuple.
• Generate events based on SIT

Event Approach (2)

[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11) and L(12), window W=10. Annotations: K.SIT=19; tuple K can be discarded due to tuple L (younger and better).]

Eager [Tao, TKDE06]
• Advantage: handles skyline expiration
• Disadvantage: processing time per tuple

n-of-N Skyline Queries (1)

[Figure (source: icde05): n-of-N definition, with S6 = {a,c} and S4 = {c,g}.]

n-of-N Skyline Queries (2)

[Figure (source: icde05): n-of-N definition, with S6 = {c,h} and S4 = {e,h}.]

Method cnN(1)

Method cnN [Lin, ICDE05] is also based on events.

[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11) and L(12), window W=10.]

• Tuple K is redundant because tuple L is better and younger than K.
• Tuple L is dominated by D and E.
• The dominance relation between L and E is critical because E is the youngest tuple that dominates L.

Method cnN (2)

Generate intervals
• For the skyline tuples, e.g. C = (0,3]
• For the critical dominance relations, e.g. C -> G = (3,7]
• Use an interval tree to store them

The dominance graph contains all the critical dominance relations.

[Figure: dominance graph over the tuples A(1), B(2), C(3), D(4), E(5), F(6) and G(7), marking the redundant tuples and the critical dominance relations.]

Method cnN (3)

A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M - n + 1, where M is the total number of elements seen so far.

To answer an n-of-N query, apply a stabbing query at M - n + 1.

Example with M = 7:
• Intervals: C = (0,3], D = (0,4], C -> G = (3,7], D -> F = (4,6], D -> E = (4,5]
• For n = 6, M - n + 1 = 2 and S6 = {C, D}
• For n = 4, M - n + 1 = 4 and S4 = {D, G}

[Figure: dominance graph over the tuples A(1), B(2), C(3), D(4), E(5), F(6) and G(7), with the stabbing query.]

Method cnN (4)

Advantages
• Good use of skyline properties
• Multiple query processing

Disadvantages
• Processing time per tuple
• Increased memory requirements

Frequent Skyline - Motivation

• Highly dynamic environment
• The skyline results are meaningful only if the skyline tuples appear consistently

Frequent skyline: tuples on the skyline for a minimum user-defined interval. [Zhang, SIGMOD09]

Streaming Model

Client/Server architecture
• The server receives object updates from the clients.
• Each object can be represented as a d-dimensional point.
• Object update: a point movement in the d-dimensional space, where the value in at least one dimension changes.
• Object insertion or deletion: a point movement from/to a nonexistent position.

Objective: minimization of the communication cost

Filter

Safe region technique
• The skyline remains unchanged if each object stays in its safe region
• Communication happens only when the safe region is violated
• The safe region approach leads to communication optimization

[Figure (source: sigmod09): an object as a point and its filter (safe region).]
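The client side of the safe region idea can be sketched as follows. This is only an illustration under strong assumptions: the safe region is taken to be an axis-aligned rectangle, the class name `FilteredClient` and the `send` callback are hypothetical, and the way the server computes and reassigns safe regions is not shown.

```python
class FilteredClient:
    """Client that reports an update only when it leaves its safe region."""

    def __init__(self, low, high, send):
        self.low = low    # lower corner of the (assumed rectangular) safe region
        self.high = high  # upper corner of the safe region
        self.send = send  # callback standing in for a message to the server

    def update(self, position):
        inside = all(l <= x <= h for l, x, h in zip(self.low, position, self.high))
        if not inside:
            self.send(position)  # safe region violated: notify the server
        return not inside


messages = []
client = FilteredClient(low=(1.0, 1.0), high=(3.0, 3.0), send=messages.append)
client.update((2.0, 2.5))  # inside the safe region: no message
client.update((2.9, 1.1))  # inside the safe region: no message
client.update((3.5, 2.0))  # outside: reported to the server
print(messages)  # [(3.5, 2.0)]
```

Only one of the three updates reaches the server, which is the communication saving the technique aims for.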

Sampling

All clients report their skyline at the same sampled time

The clients are synchronized with the same random seed

Guaranteed quality if sampling rate is high enough

Hybrid

Hybrid solution
• Combines Filter and Sampling
• Small changes: apply Filter
• Larger changes: apply Sampling

Disadvantage of all three methods
• Energy consumption is not uniform (critical in sensor networks)

k-dominant Skyline Query - Motivation

Skyline: contains tuples not dominated by any other tuple.

Disadvantage: High dimensionality problem.

Solution: Relax the notion of dominance.

k-dominant tuple: A tuple t k-dominates another tuple t’ if
• t is not worse than t’ in at least k dimensions, and
• t is better than t’ in at least one of them.

k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]

k-dominant Skyline Query - Example

      D1   D2   D3   D4   D5   D6
T1    6    5    4    3    2    1
T2    5    4    3    5    4    3
T3    6    6    2    2    6    5
T4    6    6    6    1    6    6
T5    6    6    6    5    5    5

• Conventional skyline: {T1, T2, T3, T4} (T1 dominates T5)
• 5-dominant skyline: {T1, T2, T3} (T1 5-dominates T4)
• 4-dominant skyline: {T1, T2} (T1 4-dominates T3)

The smaller the k, the fewer tuples in the k-dominant skyline.
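The example can be checked with a direct implementation of the k-dominance definition. The sketch below is a quadratic, non-streaming illustration and assumes that smaller values are better, which is the reading under which T1 dominates T5 in the table.

```python
def k_dominates(a, b, k):
    """True if a is not worse than b in at least k dimensions and
    strictly better in at least one of them (smaller is better)."""
    not_worse = sum(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return not_worse >= k and better


def k_dominant_skyline(points, k):
    """Names of the tuples that no other tuple k-dominates."""
    return {
        name
        for name, p in points.items()
        if not any(k_dominates(q, p, k) for other, q in points.items() if other != name)
    }


table = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}
for k in (6, 5, 4):
    print(k, sorted(k_dominant_skyline(table, k)))
# 6 ['T1', 'T2', 'T3', 'T4']
# 5 ['T1', 'T2', 'T3']
# 4 ['T1', 'T2']
```

With k equal to the number of dimensions, k-dominance reduces to conventional dominance, so k = 6 returns the conventional skyline.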

Observations

• Traditional or streaming skyline methods are inappropriate
• Skyline properties do not hold, e.g. the transitive property
• k-dominance can be cyclic
• Existence of multiple users and multiple queries

Method CoSMuQ (1)

• A query on D dimensions arrives.
• Given a parameter value k, split the query into subqueries of d=k dimensions.
• Compute the conventional skyline of each subquery.
• The k-dominant skyline is the intersection of the skylines of the subqueries of a query.

Method CoSMuQ (2)

Advantages
• Based on conventional skyline (simple domination checks)
• Properties of conventional skylines can be used
• Exploits the overlap between different queries

Disadvantages
• Memory requirements increase in high dimensionality

Continuous Skyline methods - Summary

Method                Query Type           Window Type   Multiple Queries
LookOut               skyline              time          no
Lazy and Eager        skyline              both          no
n-of-N                skyline              count         yes
Filter and Sampling   frequent skyline     time          no
CoSMuQ                k-dominant skyline   both          yes

Presentation Layout

• Data streams - Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary

Top-k query - Example

[Figure: hotels T1 to T6 plotted in the (price, distance) space, with the results for k=1 and k=2 under F = price + distance.]

Hotels   price   distance
T1       4       1
T2       3       2
T3       0.5     3
T4       2.5     4.5
T5       1.5     4
T6       3.5     5

Given a preference function, a top-k query returns the k tuples with the best scores.

Continuous Top-k Query

Problem definition: Continuous evaluation of top-k query in multidimensional streaming time series.

Application example: network data
• top-100 flows with the largest individual throughput
• Common destination DDoS attack

Basic Idea

[Figure (source: sigmod06): the influence region in the (x1, x2) space, bounded by the line defined by F = score(tk) = x1 + x2.]

• A new tuple changes the top-k: it should belong to the influence region of the query
• Top-k tuple expiration: query computation from scratch

TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]
• Advantage: simple implementation
• Disadvantage: no efficient handling of an expired top-k tuple

Skyband - Example

k-skyband: contains all the tuples which are dominated by at most k-1 other tuples.

• 1-skyband: tuples not dominated by other tuples (the 1-skyband is the skyline)
• 2-skyband: tuples dominated by at most 1 other tuple
• 3-skyband: includes the tuples dominated by 2 other tuples

[Figure: tuples A, B, C, D and E with their skyband membership.]

Skyband Approach (1)

Transform the tuples into the (score, expiration_time) space.

Dominance counter (DC): number of tuples that are younger and better.

Rule: Keep tuples with DC < k.

Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space.

Scores for F = price + distance:

Tuple   score
T1      5
T2      5
T3      3.5
T4      7
T5      5.5
T6      8.5

[Figure: tuples T1 to T6 in the original (price, distance) space and in the transformed (exp_time, score) space; each tuple is annotated with DC=0 or DC=1 and the top-1 tuple is marked.]
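The dominance counter rule can be sketched for a single query over a count-based window. This is an illustration of the rule only, not the SMA algorithm of the cited paper: the class name `SkybandTopK` is hypothetical, lower scores are assumed to be better, and the stream of scores is made up.

```python
class SkybandTopK:
    """Keeps only tuples with DC < k, i.e. the k-skyband in the
    (score, expiration time) space, for a count-based window."""

    def __init__(self, k, window):
        self.k = k
        self.window = window
        self.clock = 0
        self.band = []  # entries (arrival, score, dc), oldest first

    def insert(self, score):
        self.clock += 1
        kept = []
        for arrival, old_score, dc in self.band:
            if arrival <= self.clock - self.window:
                continue  # the tuple has expired
            if score < old_score:
                dc += 1   # the new tuple is younger and better
            if dc < self.k:
                kept.append((arrival, old_score, dc))
        kept.append((self.clock, score, 0))
        self.band = kept

    def top_k(self):
        """The k best (lowest) scores among the retained tuples."""
        return sorted(score for _, score, _ in self.band)[: self.k]


monitor = SkybandTopK(k=2, window=4)
for s in [5, 5, 3.5, 7, 5.5, 8.5]:
    monitor.insert(s)
print(monitor.top_k())  # [3.5, 5.5]
```

A tuple with k younger and better tuples can never enter the top-k, because all of them outlive it; discarding it is therefore safe, and the retained set is usually much smaller than the window.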

Skyband Approach (2)

SMA (Skyband Monitoring Algorithm) proposed in [Mouratidis, SIGMOD06]

Advantage: independent of the dimensionality
• 2-dimensional space (score, exp_time)

Disadvantage: the k-skyband may contain fewer than k tuples
• In this case, a top-k tuple expiration will cause query computation from scratch

Distributed Top-k

Continuously report the k largest values obtained from distributed data streams.

Objective is to minimize communication cost

Proposed by [Babcock, SIGMOD03]

Streaming Model
• Nodes: $N_1, N_2, \dots, N_m$; coordinator node: $N_0$
• Set of n data objects $O_1, O_2, \dots, O_n$ associated with real values $V_1, V_2, \dots, V_n$
• Value updates are represented as $\langle O_i, N_j, \Delta \rangle$ tuples: $N_j$ detects a change in the value $V_i$ of $O_i$. The change is not seen by the other nodes $N_k$ ($k \neq j$).
• The value $V_i$ of an object $O_i$ is $V_i = \sum_j V_{i,j}$, where $V_{i,j}$ is the value of the i-th object in the j-th node.

Method (1)

• Initialize a top-k set at the coordinator node
• Set arithmetic constraints at the monitor nodes; they depend on the current top-k set
• Constraints valid: no communication
• Constraints invalidated:
  • The client communicates with the server
  • Possibly a new top-k set
  • Recomputation of the constraints

Method(2) - Adjustment Factors

[Figure: Node 1 and Node 2 holding values for Object 1 and Object 2: $V_{1,1} = 1$, $V_{2,1} = 9$, $V_{1,2} = 3$, $V_{2,2} = 1$, with adjustment factors 0, -3, 0 and 3. Top-1 = {O1}; Node 1, local top-1 = {O1}; Node 2, local top-1 = {O2}.]

• Local top-k sets differ from the global top-k => unnecessary constraint violations => increased communication cost

Adjustment Factors (AF)
• Node 2: V1,2 = 3+0 = 3
• Node 2: V2,1 = 1+3 = 4
• Local top-k similar to the global one => low communication cost
• To keep the results valid, the AFs of each object sum to zero

Disadvantage: energy consumption is not uniform
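The role of the adjustment factors can be illustrated with a small sketch. The numbers below are hypothetical and are not the ones in the figure. The form of the local constraint (every adjusted top-k value at a node is at least as large as every adjusted non-top-k value) is an assumption made for this illustration, and the coordinator's own adjustment factor is ignored.

```python
def global_values(local):
    """Sum the per-node values into one value per object."""
    totals = {}
    for values in local.values():
        for obj, v in values.items():
            totals[obj] = totals.get(obj, 0) + v
    return totals


def constraint_holds(values, factors, topk):
    """Assumed local constraint: after adding the adjustment factors,
    no object outside the top-k set exceeds an object inside it."""
    adjusted = {o: v + factors.get(o, 0) for o, v in values.items()}
    inside = [adjusted[o] for o in topk]
    outside = [v for o, v in adjusted.items() if o not in topk]
    return not outside or min(inside) >= max(outside)


# Hypothetical values: local[node][object].
local = {"N1": {"O1": 4, "O2": 6}, "N2": {"O1": 8, "O2": 1}}
totals = global_values(local)
topk = {max(totals, key=totals.get)}  # global top-1

# Adjustment factors for O1 sum to zero over the nodes (+3 and -3).
factors = {"N1": {"O1": 3}, "N2": {"O1": -3}}
before = {n: constraint_holds(local[n], {}, topk) for n in local}
after = {n: constraint_holds(local[n], factors[n], topk) for n in local}
print(topk, before, after)
```

Without adjustment factors the constraint at N1 is violated although the global top-1 is correct; with them, both nodes satisfy their constraints and stay silent. Since the factors of each object sum to zero, adding up the local constraints over all nodes still implies the correct global order.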

Uncertain Data

Tuples:

Score   Prob.
6       0.8
5       0.5
2       0.4
8       0.4

16 possible worlds:

Tuples       Pr.
2, 5, 6, 8   .064
2, 5, 6      .096
2, 6, 8      .064
2, 6         .096
2, 5, 8      .016
2, 5         .024
2, 8         .016
2            .024
5, 6, 8      .096
5, 6         .144
5, 8         .024
5            .036
6, 8         .096
6            .144
8            .024
Empty        .036

Pk-topk query: returns the k tuples with the highest probability of being in the top-k.

Top-2: {6, 5} with prob. {0.64, 0.5}

To compute the probability of tuple 6, sum the probabilities of the worlds in which it belongs to the top-2.

source: pvldb08
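The probabilities of this example can be reproduced by enumerating the possible worlds. The sketch below is a brute-force check, not the compact set method of the cited paper. It assumes independent tuples with distinct scores and ranks lower scores first, which is the reading that reproduces the probabilities 0.64 and 0.5 given above.

```python
from itertools import product


def topk_probabilities(tuples, k):
    """For each score, the total probability of the possible worlds in
    which that tuple is among the k best (lowest) existing scores."""
    probs = {score: 0.0 for score, _ in tuples}
    for world in product([False, True], repeat=len(tuples)):
        p = 1.0
        present = []
        for exists, (score, pr) in zip(world, tuples):
            p *= pr if exists else 1.0 - pr
            if exists:
                present.append(score)
        for score in sorted(present)[:k]:
            probs[score] += p
    return probs


uncertain = [(6, 0.8), (5, 0.5), (2, 0.4), (8, 0.4)]  # (score, probability)
probs = topk_probabilities(uncertain, k=2)
answer = sorted(probs, key=probs.get, reverse=True)[:2]
print(answer)  # [6, 5]
```

Enumeration takes time exponential in the number of tuples, so it is only suitable for checking small examples like this one.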

Pk-topk Query

Solution proposed by [Jin, PVLDB08]
• Compact set based
• Space-efficient solution
  • Discard unnecessary tuples
  • Apply several compression schemes to compress the data

Disadvantages
• Model assumption: the probabilities of the tuples are assumed to be independent of each other.

Continuous Top-k Methods -Summary

Method              Query Type          Window Type   Multiple Queries
TMA and SMA         top-k               both          yes
Distributed top-k   Distributed top-k   time          no
Compact set based   Pk-topk             both          no


Top-k Dominating Query - Example

[Figure: hotels T1 to T6 plotted in the (price, distance) space, with the top-k results for k=1 and k=2 under F = price + distance.]

Hotels   price   distance
T1       4       1
T2       3       2
T3       0.5     3
T4       2.5     4.5
T5       1.5     4
T6       3.5     5

Skyline: contains all the tuples not dominated by any other tuple.
• Disadvantage: High dimensionality problem.

Top-k: Given a preference function, a top-k query returns the k tuples with the best scores.
• Disadvantage: user-defined preference function.

Top-k dominating: the answer contains the k tuples with the highest domination power.
• Combines the advantages of skyline and top-k queries and avoids their disadvantages.
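Domination power can be illustrated on the hotel table by scoring each tuple with the number of tuples it dominates. The sketch below is a quadratic, non-streaming illustration that assumes smaller values are better; the function names are hypothetical.

```python
def dominates(a, b):
    """True if a dominates b, assuming smaller values are better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def domination_scores(points):
    """For each tuple, the number of other tuples it dominates."""
    return {
        name: sum(dominates(p, q) for other, q in points.items() if other != name)
        for name, p in points.items()
    }


def top_k_dominating(points, k):
    scores = domination_scores(points)
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# (price, distance) pairs from the hotel table.
hotels = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}
dom_scores = domination_scores(hotels)
print(top_k_dominating(hotels, 2))  # ['T3', 'T5']
```

Note that T5 is in the top-2 dominating answer although it is not a skyline tuple (T3 dominates it), and that no preference function had to be supplied.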

Continuous Top-k Dominating Query

Problem definition: Continuous evaluation of top-k dominating query in multidimensional streaming time series.

Application example: sensor network
• Areas with a high probability of fire outbreak
• Attributes: temperature, humidity and wind speed

EVA

• Objective: reduce domination checks
• Safe interval of a tuple
  • Ignore the tuple for this interval
  • It depends on the tuple's score and the k-th score
• End of the safe interval -> event
• Event: try to compute a new safe interval, else compute the score from scratch
• New tuple: find another tuple that dominates the new one and estimate a lower bound of the safe interval

ADA

• Advanced computation of the safe interval: depends on the number of tuples that dominate this tuple and expire later
• Candidate tuples: tuples with scores close to the k-th score are updated at each time instance

EVA and ADA were proposed by [Kontaki 2009].


Summary

Preference queries are very useful in data streams

Presented state-of-the-art methods
• For continuous skyline queries
• For continuous top-k queries
• For continuous top-k dominating queries

Examined the advantages and disadvantages of the proposed methods

Research Directions
• Continuous subspace skyline queries
• Solutions appropriate for distributed environments: uniform energy consumption
• Approximate algorithms
• Existence of multiple queries

Thank you
