The University of Hong Kong
Department of Computer Science
Continuous Monitoring of Top-k Queries over Sliding Windows
Authors: Kyriakos Mouratidis, Spiridon Bakiras, Dimitris Papadias
Presenter: Kamiru
Outline
Motivation
Problem Setting
Related Works
• Top-k Queries
• Skyband
Solutions
• Top-k Computation
• Maintenance Module
• Skyband Monitoring Algorithm
Experimental Evaluation
Conclusion
Future Works
Motivation
First, the definition of a top-k query: given a dataset P and a preference function f, a top-k query retrieves the k tuples in P with the highest scores according to f.
A real-life application: find the top 5 hotels under the preference function
f(hotel) = -hotel.price + hotel.quality
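The definition above can be sketched in a few lines. This is only an illustration: the hotel records, their field values and the helper name top_k are made up for the example, and heapq.nlargest stands in for whatever retrieval method a real system would use.

```python
import heapq

def top_k(points, f, k):
    """Return the k tuples of `points` with the highest score under f."""
    return heapq.nlargest(k, points, key=f)

# Hypothetical hotel records (names, prices and quality values are invented).
hotels = [
    {"name": "A", "price": 3, "quality": 9},
    {"name": "B", "price": 8, "quality": 9},
    {"name": "C", "price": 2, "quality": 4},
    {"name": "D", "price": 5, "quality": 10},
    {"name": "E", "price": 1, "quality": 1},
    {"name": "F", "price": 6, "quality": 3},
    {"name": "G", "price": 4, "quality": 8},
]

def preference(hotel):
    # The preference function from the slide: cheaper and better is preferred.
    return -hotel["price"] + hotel["quality"]

best = top_k(hotels, preference, 5)
print([h["name"] for h in best])
```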
Motivation
Existing methods are not applicable to streaming environments.
Internet traffic flow monitoring is one real-life streaming application. Internet data arrive at a very high rate, and each tuple may include
• Source IP address, destination IP address, start time, end time, MTU, TTL, etc.
Motivation
The availability of such records supports
• traffic estimation
• network security
• troubleshooting
For instance, a top-k query helps the system to prevent a DDoS (Distributed Denial of Service) attack by monitoring, in real time, the top-k flows with the largest individual throughput.
Motivation
On this network, the server 155.223.2.4 is more likely to be the target of a DDoS attack than 155.223.2.3.
NoPackets   destination ip
11          155.223.2.4
22          155.11.5.6
2           155.223.2.1

NoPackets   destination ip
32          155.213.2.4
2           155.11.5.6

NoPackets   destination ip
12          155.213.2.3
2           155.11.5.2
50          155.223.2.4

[Figure: servers 155.223.2.4 and 155.223.2.3]
Problem Setting
A function f is increasingly monotone on dimension xi if, for any pair of tuples (points) p1, p2 with
p1.xi ≥ p2.xi and p1.xj = p2.xj for all j ≠ i,
we have
score(p1) ≥ score(p2),
where score(p) = f(p.x1, …, p.xn).
Decreasing monotonicity is defined in the same way with the reverse inequality (≤).
Problem Setting
Notice that a function may be increasingly monotone on some dimensions and decreasingly monotone on the remaining ones.
For instance, with
f(p) = p.x1 - p.x2,
f is increasingly monotone on x1 and decreasingly monotone on x2.
[Figure: the x1-x2 plane with the line defined by f = x1 - x2, the side where f has higher value, the side where f has lower value, and example points a and b]
Problem Setting
Problem definition:
Given a set of queries Q and a set of points P, the top-k result Rq of a query q ∈ Q is a set Rq ⊆ P with
|Rq| = k and f(ri) > f(rj)
for every ri ∈ Rq and rj ∉ Rq.
At each timestamp,
• insert the newly arrived objects Pins
• remove the expired objects Pdel
• output the top-k result of each query q ∈ Q over the remaining P
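As a point of reference for the problem definition, the loop below is a naive baseline that recomputes every result from scratch at each timestamp. It is not the method of the paper. The class name NaiveWindowMonitor is hypothetical, and a count-based window (keep the most recent N points) is assumed here for simplicity.

```python
import heapq
from collections import deque

class NaiveWindowMonitor:
    """Count-based sliding window that recomputes every top-k result from scratch."""

    def __init__(self, window_size, queries):
        self.window_size = window_size
        self.queries = queries          # {query name: (preference function f, k)}
        self.window = deque()           # oldest point on the left

    def tick(self, p_ins):
        """Process one timestamp: insert Pins, expire Pdel, report all results."""
        self.window.extend(p_ins)
        p_del = []
        while len(self.window) > self.window_size:
            p_del.append(self.window.popleft())     # expired objects
        results = {
            name: heapq.nlargest(k, self.window, key=f)
            for name, (f, k) in self.queries.items()
        }
        return p_del, results

monitor = NaiveWindowMonitor(3, {"q": (lambda p: p[0] + 2 * p[1], 1)})
print(monitor.tick([(1, 1), (4, 0), (0, 3)]))
print(monitor.tick([(2, 2), (5, 5)]))
```

Every timestamp costs a full scan of the window per query, which is the cost that the monitoring algorithms on the later slides try to avoid.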
Related Works – Top-k query computation
Several existing methods compute top-k results in various scenarios.
They focus on computing the top-k results from multiple data repositories.
Fagin et al. introduce two efficient methods for processing ranked queries:
• Threshold Algorithm (TA)
• No Random Access algorithm (NRA)
TA and NRA
Both methods perform sorted access in parallel to each of the m sorted lists Si,
where m is the number of inputs (attributes) and Si stores the data of domain i.
Every Si is scanned in descending order.
TA and NRA
When an object o is seen in input Si:
• TA performs random accesses to the other lists to find the grade xi of object o in every list Si, and then computes the value of function f.
• NRA does not access the other lists. Instead of computing the value of function f, it only updates two bounds (an upper and a lower bound) on the score of o.
Both algorithms stop when the score of the k-th result is larger than the threshold T.
Example of TA and NRA
Assume that we have 3 ranked inputs and 5 records (a to e) in our database. Answer the top-1 query with the preference function f = SUM using TA and NRA.
Example of TA and NRA
TA
First loop
• Get object c from S1, compute f(c) = 0.9 + 0.2 + 0.9 = 2
  – Update result R = {(c, 2)}
  – Threshold value T = 0.9 + ∞ + ∞ = ∞ > Rk.value, continue
• Get object a from S2, compute f(a) = 0.1 + 0.9 + 0.8 = 1.8
  – Do not update the result since Rk.value > 1.8
  – Threshold value T = 0.9 + 0.9 + ∞ = ∞ > Rk.value, continue
• Get object c from S3, do not compute f again
  – Threshold value T = 0.9 + 0.9 + 0.9 = 2.7 > Rk.value, continue
Second loop, …
Until T < Rk.value
S1       S2       S3
c 0.9    a 0.9    c 0.9
d 0.8    b 0.8    a 0.8
b 0.6    e 0.6    b 0.6
e 0.3    d 0.4    d 0.6
a 0.1    c 0.2    e 0.5
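The TA walk-through can be replayed with a small script. This is a sketch under stated assumptions, not Fagin et al.'s reference implementation: the function name threshold_algorithm is hypothetical, the threshold is evaluated once per round (after one sorted access to every list) rather than after every single access as on the slide, the loop stops as soon as the k-th score reaches the threshold (T ≤ Rk.value), and scores are rounded to six decimals to hide floating-point noise.

```python
S1 = [("c", 0.9), ("d", 0.8), ("b", 0.6), ("e", 0.3), ("a", 0.1)]
S2 = [("a", 0.9), ("b", 0.8), ("e", 0.6), ("d", 0.4), ("c", 0.2)]
S3 = [("c", 0.9), ("a", 0.8), ("b", 0.6), ("d", 0.6), ("e", 0.5)]

def threshold_algorithm(lists, k, f=sum):
    """Return (top-k as [(object, score)], number of rounds of sorted access)."""
    grades = [dict(lst) for lst in lists]    # random-access index per list
    scores = {}                              # object -> score, in discovery order
    top = []
    for depth in range(len(lists[0])):
        last_seen = []
        for lst in lists:                    # one sorted access per list
            obj, grade = lst[depth]
            last_seen.append(grade)
            if obj not in scores:            # random access to all lists
                scores[obj] = round(f(g[obj] for g in grades), 6)
        threshold = round(f(last_seen), 6)   # best score any unseen object can have
        # sorted() is stable, so ties keep the object that was discovered first
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return top, len(lists[0])

result, rounds = threshold_algorithm([S1, S2, S3], k=1)
print(result, rounds)
```

On the three lists above the thresholds per round are 2.7, 2.4 and 1.8, so the scan stops in the third round. Objects b and c both score 2.0; this sketch reports c because it was discovered first.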
Example of TA and NRA
NRA maintains, for each object seen so far, an upper bound rub and a lower bound rlb on its aggregate score.
Initial setting, if the range of values is [0,1]: rlb = {0,0,0,0,0}, rub = {∞,∞,∞,∞,∞}
Example of TA and NRA
NRA
Get object c (0.9), a (0.9), and c (0.9) from S1, S2, and S3
• rlb = {0.9, 0, 1.8, 0, 0} (objects a to e)
  – Update the newly accessed objects
  – e.g. update rlb(a) = 0.9 + rlb(a) = 0.9
• rub = {2.7, 0, 2.7, 0, 0}
  – Update the objects which have been seen so far
  – e.g. update rub(a) = 0.9 + 0.9 + 0.9 = 2.7
• R = {(c, 1.8)}
• t = min{rlb(x) : x ∈ R} = 1.8
• u = max{rub(x) : x ∉ R} = 2.7
• If t < u then repeat; otherwise stop
Get object d (0.8), b (0.8), and a (0.8) from S1, S2, and S3
• …
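The bound bookkeeping of NRA can be sketched as follows. The function name nra and the returned trace are hypothetical and serve only this example. The sketch assumes grades in [0,1], treats an unknown grade as 0 for the lower bound and as the last value seen in that list for the upper bound, and also considers the bound of objects that have not been seen at all. Values are rounded to six decimals to hide floating-point noise.

```python
S1 = [("c", 0.9), ("d", 0.8), ("b", 0.6), ("e", 0.3), ("a", 0.1)]
S2 = [("a", 0.9), ("b", 0.8), ("e", 0.6), ("d", 0.4), ("c", 0.2)]
S3 = [("c", 0.9), ("a", 0.8), ("b", 0.6), ("d", 0.6), ("e", 0.5)]

def nra(lists, k):
    """Return (top-k as [(object, lower bound, upper bound)], [(t, u) per round])."""
    m = len(lists)
    known = {}          # object -> {list index: grade seen by sorted access}
    trace = []
    result = []
    lb = ub = {}
    for depth in range(len(lists[0])):
        last = [lst[depth][1] for lst in lists]
        for i, lst in enumerate(lists):               # sorted access only
            obj, grade = lst[depth]
            known.setdefault(obj, {})[i] = grade
        lb = {o: round(sum(g.values()), 6) for o, g in known.items()}
        ub = {o: round(sum(g.get(i, last[i]) for i in range(m)), 6)
              for o, g in known.items()}
        result = sorted(lb, key=lambda o: -lb[o])[:k]  # stable: ties keep first seen
        t = min(lb[o] for o in result)
        others = [ub[o] for o in known if o not in result]
        others.append(round(sum(last), 6))             # bound for unseen objects
        u = max(others)
        trace.append((t, u))
        if len(result) == k and t >= u:
            break
    return [(o, lb[o], ub[o]) for o in result], trace

result, trace = nra([S1, S2, S3], k=1)
print(result)
print(trace)
```

The first entry of the trace is (1.8, 2.7), matching the values of t and u on the slide. On this particular data the bounds only close after all five rows have been read, since b and c tie at 2.0.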
LARA
Mamoulis proposed the LARA (Lattice-based Rank Aggregation) algorithm, which is an optimized NRA method.
LARA separates the algorithm into two phases:
Growing phase
• While t = min{rlb(x) : x ∈ R} < T, no pruning can be attempted.
• T is the sum of the largest values still possible in each input. In the above example, T = 2.7 after the first loop.
Shrinking phase
• If an object o has not been seen in the growing phase, then o is not a result of the query.
• The rub values are stored only in the lattice nodes instead of in the objects themselves.
• This avoids many updates to the objects which have been seen so far.
[Figure: lattice over the inputs with nodes S1S2S3; S1S2, S1S3, S2S3; S3, S2, S1]
Conclusion of Top-k query computation
NRA should perform better than TA in a conventional database, since it avoids many random accesses.
According to the experiments of its authors, LARA performs much better than NRA.
Related Works – Skyband
• The skyline consists of the points which are not dominated by any other point.
• A record pi is said to dominate another record pj if and only if pi is preferable to pj on every attribute.
• The skyline of a dataset contains all tuples that belong to the result of any top-1 query with a monotone function.
• The k-skyband contains the tuples that are dominated by at most k-1 other points.
[Figure: points p1 to p7 with the skyline and the 2-skyband marked]
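The two definitions can be checked on a toy dataset with a brute-force sketch. The points below are made up, larger values are assumed to be preferable on every attribute, and dominance is taken in the usual sense (at least as good everywhere and strictly better somewhere), which is slightly more precise than the wording on the slide. The quadratic scan is for illustration only.

```python
def dominates(p, q):
    """True if p is at least as good as q on every attribute and better on one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def k_skyband(points, k):
    """Points dominated by at most k-1 other points (k=1 gives the skyline)."""
    return [p for p in points
            if sum(dominates(q, p) for q in points) <= k - 1]

# Hypothetical 2-dimensional points.
points = [(1, 5), (3, 4), (4, 2), (2, 3), (1, 1), (3, 1)]
print(k_skyband(points, 1))   # skyline
print(k_skyband(points, 2))   # 2-skyband
```

Here (2, 3) is dominated only by (3, 4), so it is outside the skyline but inside the 2-skyband.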
Related Works – Skyband
The skyband is used to monitor the top-k results in score-time space.
Assume that we want to monitor the top-2 results in the following example:
[Figure: two plots of points p1 to p5 with axes score and expiration time; result labels {p1,p2}, {p1,p4}, {-}, {p1,p3}, {p4}]
Top-k computation
A grid-based indexing method is used.
For each cell c in grid G, maxscore(c) is the maximum possible score in cell c.
For each query q:
Start:
• The algorithm starts from the cell c which has the highest maxscore(c)
Termination condition:
• The search terminates when the cell c under consideration has maxscore(c) ≤ Rk.value
Top-k computation
An example explains how the top-k computation works.
Assume that we have two inputs (x1 and x2) and a function f = x1 + 2x2
• The cell with the highest maxscore(c) is c4,4, with maxscore(c4,4) = f(P). Scan c4,4
• The next cell to scan is c3,4, since maxscore(P') > maxscore(P'')
• …
• Continue until maxscore(c) ≤ Rk.value
[Figure: grid over x1 and x2 with cells c1,1, c3,4 and c4,4, corner points P, P', P'', P''', P'''' and data points p1, p2, p3]
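The cell-by-cell search can be sketched as below. This is an illustration under several assumptions rather than the authors' code: the data live in the unit square, f is a weighted sum with non-negative weights (so maxscore of a cell is f at its upper corner), cells are indexed from 0 instead of 1, and the function name grid_top_k and the sample points are invented.

```python
import heapq

def grid_top_k(points, k, weights=(1.0, 2.0), cells_per_dim=4):
    """Return (top-k as [(score, point)], list of visited cells in scan order)."""
    f = lambda p: sum(w * x for w, x in zip(weights, p))
    delta = 1.0 / cells_per_dim
    grid = {}
    for p in points:                                    # assign points to cells
        idx = tuple(min(int(x / delta), cells_per_dim - 1) for x in p)
        grid.setdefault(idx, []).append(p)

    def maxscore(idx):                                  # f at the cell's upper corner
        return f(tuple((i + 1) * delta for i in idx))

    start = tuple(cells_per_dim - 1 for _ in weights)   # cell with highest maxscore
    heap = [(-maxscore(start), start)]                  # max-heap via negated keys
    pushed = {start}
    result = []                                         # min-heap of (score, point)
    visited = []
    while heap:
        neg, idx = heapq.heappop(heap)
        if len(result) == k and -neg <= result[0][0]:
            break                                       # maxscore(c) <= Rk.value
        visited.append(idx)
        for p in grid.get(idx, []):
            heapq.heappush(result, (f(p), p))
            if len(result) > k:
                heapq.heappop(result)                   # keep only the k best
        for d in range(len(idx)):                       # enqueue lower neighbours
            if idx[d] > 0:
                nb = idx[:d] + (idx[d] - 1,) + idx[d + 1:]
                if nb not in pushed:
                    pushed.add(nb)
                    heapq.heappush(heap, (-maxscore(nb), nb))
    return sorted(result, reverse=True), visited

pts = [(0.9, 0.6), (0.3, 0.8), (0.6, 0.3)]              # hypothetical p1, p2, p3
top, visited = grid_top_k(pts, k=1)
print(top, visited)
```

With f = x1 + 2x2 the scan starts at the top-right cell and touches 6 of the 16 cells before the termination condition fires; the cell holding the third point is never visited.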
The maintenance module
Given two datasets: Pins and Pdel
For all p ∈ Pins
  Insert p into the corresponding cell c
  For all q who visited c,
  • Insert p into q.R if f(p) ≥ q.Rk.value
For all p ∈ Pdel
  Delete p from the corresponding cell c
  For all q who visited c,
  • If p ∈ q.R, mark q as affected
The maintenance module
For each affected query q,
  Invoke Top-k Computation(q)
  For all c which are not scanned by Top-k Computation(q)
  • Delete q from c.visitedquery
Example of maintenance module
q: f = x1 + 2x2, find the top-1 result
Timestamp 1
Pins = {p3, p4}, Pdel = {p1, p2}
Timestamp 2
Pins = {p5}, Pdel = {p3}
[Figure: points p1 to p5]
Summary of the maintenance module
Insertion does not invoke any top-k re-computation
Deletion has a higher cost than insertion
An affected query needs to
• run the top-k computation
• update the cells which are not scanned by the top-k computation; in the worst case this is |cell| cells
Skyband Monitoring Algorithm
A previous slide demonstrated how to use the k-skyband to monitor the results in score-time space.
The dominance counter (DC) can be used to obtain the k-skyband
• p.DC is the number of records with a higher score that expire after p
[Figure: monitoring a top-2 query; points p1 to p6 plotted by score and expiration time, annotated with their dominance counters]
Skyband Monitoring Algorithm
The dominance counters can be computed with a balanced tree (BT)
• The expiration time of every processed element of q.skyband is stored in a balanced tree BT, sorted in descending order
• Elements are inserted in descending score order
• p.DC is simply the number of tuples that precede p in BT
[Figure: points p1 to p5 plotted by score and expiration time, next to the balanced tree that stores them]
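The counting scheme can be sketched as follows. Python has no built-in balanced tree, so a sorted list maintained with bisect stands in for BT here; it gives the same counts, but insertion into a list is linear rather than logarithmic. The list is kept in ascending order, so "precedes in descending order" becomes "lies to the right". The scores and expiration times are made up, and distinct values are assumed.

```python
import bisect

def dominance_counters(points):
    """points: [(name, score, expiration_time)] -> {name: dominance counter}."""
    expirations = []        # ascending sorted list standing in for the balanced tree
    dc = {}
    for name, score, exp in sorted(points, key=lambda p: -p[1]):   # descending score
        pos = bisect.bisect_right(expirations, exp)
        # Everything already stored has a higher score; entries to the right of
        # `pos` also expire later, so they dominate this point.
        dc[name] = len(expirations) - pos
        expirations.insert(pos, exp)
    return dc

# Hypothetical points: (name, score, expiration time).
pts = [("p1", 9, 5), ("p2", 8, 2), ("p3", 6, 7), ("p4", 5, 4), ("p5", 3, 1)]
dc = dominance_counters(pts)
print(dc)
print([name for name, c in dc.items() if c < 2])    # members of the 2-skyband
```

For a top-2 query only the points with DC < 2 need to be kept, here p1, p2 and p3.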
Skyband Monitoring Algorithm
Given two datasets: Pins and Pdel
For all p ∈ Pins
  Insert p into the corresponding cell c
  For all q who visited c,
  • If f(p) ≥ q.Rk.value
    – Insert p into q.skyband and set p.DC = 0
    – For each p' in q.skyband with f(p') ≤ f(p)
      » Update p'.DC = p'.DC + 1
      » If p'.DC = k, evict p' from q.skyband
Skyband Monitoring Algorithm
For all p ∈ Pdel
  Delete p from the corresponding cell c
  For all q who visited c,
  • If p ∈ q.R, delete p from q.skyband
For all q whose skyband has changed
  If q.skyband has at least k points
  • q.R = top-k(q.skyband)
  Else
  • Invoke Top-k Computation(q)
  • Compute dominance counters
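The insertion and deletion steps of the skyband monitoring algorithm can be combined into a small single-query sketch. It simplifies the slides in several ways that should be read as assumptions: there is no grid, so the fallback "Top-k Computation" is a plain scan of the window; points expire in arrival order; and the class name SkybandMonitor with its methods is hypothetical.

```python
from collections import deque

class SkybandMonitor:
    """Maintains a partial k-skyband (score, arrival order) for one query."""

    def __init__(self, f, k):
        self.f, self.k = f, k
        self.window = deque()    # (arrival seq, point); the oldest expires first
        self.skyband = []        # entries [score, arrival seq, point, DC]
        self.seq = 0

    def _kth_score(self):
        scores = sorted((e[0] for e in self.skyband), reverse=True)
        return scores[self.k - 1] if len(scores) >= self.k else float("-inf")

    def insert(self, p):
        self.seq += 1
        self.window.append((self.seq, p))
        score = self.f(p)
        if score >= self._kth_score():
            for e in self.skyband:
                if e[0] <= score:
                    e[3] += 1    # p scores at least as high and expires later
            self.skyband = [e for e in self.skyband if e[3] < self.k]
            self.skyband.append([score, self.seq, p, 0])

    def expire(self):
        seq, _ = self.window.popleft()
        before = len(self.skyband)
        self.skyband = [e for e in self.skyband if e[1] != seq]
        if len(self.skyband) != before and len(self.skyband) < self.k:
            self._recompute()

    def _recompute(self):
        # Stand-in for Top-k Computation(q): scan the window, keep the k best,
        # then compute their dominance counters among themselves.
        top = sorted(((self.f(p), s, p) for s, p in self.window), reverse=True)[: self.k]
        self.skyband = [
            [score, s, p, sum(1 for s2, q2, _ in top if s2 >= score and q2 > s)]
            for score, s, p in top
        ]

    def result(self):
        best = sorted(self.skyband, key=lambda e: -e[0])[: self.k]
        return [e[2] for e in best]

m = SkybandMonitor(lambda p: p[0] + 2 * p[1], k=2)
for p in [(1, 1), (4, 0), (0, 3), (1, 0)]:
    m.insert(p)
print(m.result())
m.expire()
m.expire()
print(m.result())
```

In this run the first expiration touches a point that had already been evicted from the skyband, so nothing happens; the second leaves fewer than k points and triggers the from-scratch computation.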
Experimental Evaluation
The authors evaluate the proposed methods using streams of both independent (IND) and anti-correlated (ANT) datasets.
[Figure: sample IND (d=2) and ANT (d=2) datasets]
Experimental Evaluation
Default experimental setting
• Data dimensionality (d): 4
• Data cardinality (N): 1M
• Arrival rate (r): 10K
• Query cardinality (Q): 1K
• Result cardinality (k): 20
Conclusions
• The top-k computation module processes the minimum number of cells
• Two monitoring algorithms are proposed: TMA and SMA
  – TMA re-computes the result from scratch
  – SMA maintains a superset of the current answer in the form of a k-skyband
• In the experimental evaluation, SMA outperforms the other proposed solutions
Future works
• Non-monotone preference functions
• Queries over various dimensionalities
• Cluster the queries into a super query SQ, and monitor the results for this superset of queries
Thank you for your attention!
P.S. I hope I can reach this page in time!