Post on 14-Dec-2015
1
Best-Effort Top-k Query Processing Under Budgetary Constraints
Michal Shmueli-Scheuer
(IBM Haifa Research Lab and UCI)
Yosi Mass, Haggai Roitman, Chen Li, Ralf Schenkel, Gerhard Weikum
2
Motivating Example
[Figure: queries, Engine, top-k results]
• Mobile Applications
  – Highly impatient users, need fast results.
• Mediation Systems
  – Achieve high query throughput.
• Online Analytics (e.g. logs)
  – Achieve high query throughput.
3
Traditional top-k query
• Pre-computed lists over multiple attributes.
• Combine scores by some monotonic aggregation function.
• Two access modes:
  – sorted access (Cs)
  – random access (Cr)
• Objective: Compute k objects with highest scores.
[Figure: m lists R1 … Rm over n objects, each sorted by score]
R1: a 0.9, b 0.6, c 0.5, …, d 0.4
R2: d 0.87, a 0.85, f 0.5, …, c 0.2
Rm: c 0.9, b 0.6, g 0.5, …, a 0.4
4
NRA algorithm (Fagin et al.)
f = SUM
R1: a 0.9, b 0.6, c 0.5, …, d 0.4
R2: d 0.87, a 0.85, f 0.5, …, c 0.2
Top-2 [worst score, best score]: a [0.9, 1.77], d [0.87, 1.77]
Candidates: none
[Figure markers: highi on each list, mink on the top-2 queue]
Stopping condition: mink > best-score of candidates
5
NRA algorithm (Fagin et al.)
R1: a 0.9, b 0.6, c 0.5, …, d 0.4
R2: d 0.87, a 0.85, f 0.25, …, c 0.2
Top-2 [worst score, best score]: a [1.75, 1.75], d [0.87, 1.47]
Candidates: b [0.6, 1.45]
[Figure markers: highi on each list, mink on the top-2 queue]
Stopping condition: mink > best-score of candidates
6
NRA algorithm (Fagin et al.)
R1: a 0.9, b 0.6, c 0.5, …, d 0.4
R2: d 0.87, a 0.85, f 0.25, …, c 0.2
Top-2 [worst score, best score]: a [1.75, 1.75], d [0.87, 1.37]
Candidates: b [0.6, 0.85], c [0.5, 0.75], f [0.25, 0.75]
[Figure markers: highi on each list, mink on the top-2 queue]
Stopping condition: mink > best-score of candidates
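The three NRA slides above can be replayed with a short script. This is a minimal sketch, not the authors' implementation: it assumes f = SUM, round-robin sorted accesses, only the first three visible entries of R1 and R2 (with f = 0.25 in R2, as on the later two slides), and scores rounded to two decimals. The function name `nra_trace` and the layout of the recorded state are hypothetical.

```python
def nra_trace(lists, k):
    """Run NRA with f = SUM over `lists` (each a list of (object, score)
    pairs in descending score order) and record the state after each depth."""
    m = len(lists)
    seen = {}    # object -> {list index: score seen in that list}
    steps = []
    for depth in range(max(len(lst) for lst in lists)):
        # One sorted access per list; high[i] is the last score read from list i.
        high = []
        for i, lst in enumerate(lists):
            if depth < len(lst):
                obj, score = lst[depth]
                seen.setdefault(obj, {})[i] = score
                high.append(score)
            else:
                high.append(0.0)  # an exhausted list cannot add any score
        # worst score: sum of seen scores; best score: worst + high_i of unseen lists
        bounds = {}
        for obj, parts in seen.items():
            worst = sum(parts.values())
            best = worst + sum(high[i] for i in range(m) if i not in parts)
            bounds[obj] = (round(worst, 2), round(best, 2))
        ranked = sorted(bounds, key=lambda o: bounds[o][0], reverse=True)
        topk, candidates = ranked[:k], ranked[k:]
        mink = bounds[topk[-1]][0] if len(topk) == k else 0.0
        # A completely unseen object can score at most sum(high).
        rival_best = max([bounds[o][1] for o in candidates] + [round(sum(high), 2)])
        steps.append({"depth": depth + 1, "topk": topk,
                      "candidates": candidates, "bounds": bounds, "mink": mink})
        if len(topk) == k and mink > rival_best:
            break  # stopping condition: mink > best-score of candidates
    return steps


R1 = [("a", 0.9), ("b", 0.6), ("c", 0.5)]
R2 = [("d", 0.87), ("a", 0.85), ("f", 0.25)]
steps = nra_trace([R1, R2], k=2)
for step in steps:
    print(step["depth"], step["topk"], step["bounds"])
```

Under these assumptions the run stops at depth 3, when mink (the worst score of d) exceeds the best score of every candidate and of any unseen object.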
7
Top-k with Budget Constraints
R1: s 0.95, u 0.93, t 0.92, d 0.9, x 0.5, y 0.4, z 0.2, …
R2: a 1.0, b 0.9, c 0.85, d 0.8, e 0.7, t 0.6, f 0.4, …
f = SUM; Top-2: d 1.7, t 1.52
Access Costs
• Sorted access cost: Cs (here Cs = 1)
• Random access cost: Cr (here Cr = 3)
Cost without a budget:
• NRA: 12Cs = 12
• TA: 7Cs + 7Cr = 28
Budget = 10?
• NRA: precision = 0.5
• TA: precision = 0
Given budget B, maximize result quality
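The numbers on this slide can be checked with a small script. This is a hedged sketch: the cost figures follow directly from Cs = 1 and Cr = 3, while `budgeted_sorted_topk` is a hypothetical helper that assumes a budget-limited NRA simply performs round-robin sorted accesses until the budget is spent and then reports the k objects with the highest worst scores.

```python
C_S, C_R = 1, 3  # access costs from the slide

R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]
R_EXACT = {"d", "t"}  # exact top-2 stated on the slide (f = SUM)

nra_cost = 12 * C_S           # 12 sorted accesses
ta_cost = 7 * C_S + 7 * C_R   # 7 sorted plus 7 random accesses


def budgeted_sorted_topk(lists, k, budget, c_s=C_S):
    """Round-robin sorted accesses while the budget allows, then return the
    k objects with the highest worst score (sum of the scores seen so far)."""
    worst, spent, depth = {}, 0, 0
    while True:
        progressed = False
        for lst in lists:
            if depth < len(lst) and spent + c_s <= budget:
                obj, score = lst[depth]
                worst[obj] = worst.get(obj, 0.0) + score
                spent += c_s
                progressed = True
        if not progressed:
            break
        depth += 1
    return sorted(worst, key=worst.get, reverse=True)[:k]


def precision(result, r_exact):
    """Fraction of the exact top-k that the result contains."""
    return len(set(result) & set(r_exact)) / len(r_exact)


p10 = precision(budgeted_sorted_topk([R1, R2], 2, budget=10), R_EXACT)
p12 = precision(budgeted_sorted_topk([R1, R2], 2, budget=12), R_EXACT)
print(nra_cost, ta_cost, p10, p12)
```

With a budget of 10 the scan stops at depth 5 in each list, so t is seen only in R1 and loses its place to a. Two more sorted accesses reach t in R2 and recover the exact top-2.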
8
Contributions
• Sorted Accesses
  – Efficient Plan
  – Solution with Adaptive
• Sorted and Random Accesses
  – Efficient Plan
  – Solution with Adaptive
• Experiments
9
Results Under Limited Budget
[Figure: two result sets, the k results for unlimited budget and the results for limited budget]
10
Efficient Plan- Sorted Accesses
• Assume that we know the k results for unlimited budget (REXACT).
• Interesting positions: where the k objects appear in the lists.
• Plan – {L1,4} {L2,2}
Top-2: o5, o1
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1); interesting positions P1, P2
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2); interesting positions Q1, Q2
11
Efficient Plan- Sorted Accesses
• Goal: find plan t, such that:
  $\arg\max_{t \in T,\ |t| \le B} |R_t \cap R_{exact}|$
  Denoted as ROPT
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1); interesting positions P1, P2
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2); interesting positions Q1, Q2
Plans for B=5
Plan: {L1,2} {L2,3}
12
Sorted Accesses
• Observations:
  – Prefer high scores
[Figure: lists L1, L2, L3; O2 appears in all three lists (SL1, SL2, SL3), O1 in L1 and L2 (SL1, SL2)]
13
Observations – contd.
• Prefer large score reductions
[Figure: score curves (y-axis: scores, 0 to 1) of the <title> and <description> lists for the query title=“war” description=“weapon”]
14
Score Utilities
List (sorted by score): o2 1, o4 0.9, o5 0.8, o3 0.7, o1 0.6
For y = 3:
• Score gain: $(1 + 0.9 + 0.8) / 3 = 0.9$
• Score reduction: $1 - 0.8 = 0.2$
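The two utilities can be illustrated on the five scores of this slide. This is a minimal sketch under an assumption: score gain at position y is taken to be the mean of the first y scores, and score reduction the drop from the top score to the y-th score. These definitions are a reading of the slide, chosen because they give 0.9 and 0.2 at y = 3; the function names are hypothetical.

```python
def score_gain(scores, y):
    """Assumed definition: mean of the first y scores of a sorted list."""
    return sum(scores[:y]) / y


def score_reduction(scores, y):
    """Assumed definition: drop from the top score to the score at position y."""
    return scores[0] - scores[y - 1]


# Scores of o2, o4, o5, o3, o1 in descending order.
scores = [1.0, 0.9, 0.8, 0.7, 0.6]

gain_at_3 = round(score_gain(scores, 3), 2)
reduction_at_3 = round(score_reduction(scores, 3), 2)
print(gain_at_3, reduction_at_3)
```

Reading deeper lowers the gain and raises the reduction, which is the tension the combined utility has to balance.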
15
Optimization Problem
maximize $\sum_{i=1}^{m} util(L_i, x)$ s.t. $\sum_{i=1}^{m} b_i \le b$
Where m is the number of lists
• Bi-objective optimization problem:
  $util(L_i, x) = \alpha \cdot gain + (1 - \alpha) \cdot reduction$
Heuristics:
• Fair Heuristic
• Rank Heuristic
17
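Adaptive
Queue entries [worst score, best score]: o1 [ws, bs], o2 [ws, bs], o3 [0.8, bs], o4 [0.6, bs], o6 [ws, bs]
[Figure: top-k and candidates queues over lists L1, L2, L3, with O1 seen in all three lists (SL1, SL2, SL3) and scan positions hight1, hight2 marked]
$p(c) = P\left[\sum_{i \notin E(c)} S_i > \delta(c) \;\middle|\; S_i \le high_i\right]$ (Theobald et al. VLDB04)
$\delta(o4) = 0.8 - 0.6 = 0.2$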
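The gap a candidate still has to close can be computed directly from the queue entries: for o4 it is mink minus its worst score, 0.8 - 0.6 = 0.2. The sketch below also turns that gap into a probability, but only under strong assumptions that are not taken from the slides: the candidate is missing exactly one list, and its score there is uniformly distributed on [0, high]. The value high = 0.5 and both function names are hypothetical.

```python
def delta(mink, worst_score):
    """Score a candidate still needs in order to beat the current mink."""
    return mink - worst_score


def p_qualify_uniform(delta_c, high):
    """P[S > delta_c | S <= high] for one unseen list, assuming S is
    uniform on [0, high] (an illustrative assumption only)."""
    if delta_c <= 0:
        return 1.0   # already at or above mink
    if delta_c >= high:
        return 0.0   # cannot reach mink any more
    return (high - delta_c) / high


d_o4 = delta(mink=0.8, worst_score=0.6)
p_o4 = p_qualify_uniform(d_o4, high=0.5)
print(round(d_o4, 2), round(p_o4, 2))
```

A candidate whose probability falls below a chosen threshold could be dropped early; a real score distribution (for example a histogram per list) would replace the uniform assumption.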
19
Efficient Plan- Random Accesses
• Observations:
  – Random accesses can always be scheduled after all sorted accesses have finished.
schedule 1: {SA……RA……SA….}
schedule 2: {SA……SA……RA….}
precision(schedule1) = precision(schedule2)
20
Observations- contd.
• Random accesses are only useful for objects in REXACT.
[Figure: L2 holds (o1, SL2), (o2, SL2), (o5, SL2); o5 is not in REXACT. The top-k and candidates queues (o1, o2, o3, o4, o5, each with [ws, bs]) are shown for two cases, labelled "Precision remains the same" and "Precision reduced".]
21
Random Accesses
• Gathering with Sorted, Probing with Random
• When to switch from SA to RA?
  – Not enough RAs to prune the candidates
  – Not enough good candidates, RA is wasted
[Figure: time axis split between the SA phase and the RA phase]
22
Random Accesses
• Switch from Sorted to Random: S + R > B, with R = (1 − α) * S
  – S: total cost of sorted accesses.
  – R: total cost for random accesses.
• Which items to access?
  – Maximize expected score.
23
Experimental Data
• TREC Terabyte
  – 25M webpages
  – 50 queries with average length of 3 words.
• IMDB
  – 375,000 movies
  – 20 queries, each with 4 attributes: {Title, Genre, Actors, Description}
• Synthetic data
  – Zipf, #lists = [2,6], #objects = [10000,1000000]
• Aggregate Function: Sum
24
Evaluation Methods
• Percentage of optimal precision: $\frac{precision_{alg}}{precision_{opt}}$
• SME
RalgRopt RoptRexact
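The first measure can be sketched as follows. This assumes precision is the fraction of REXACT that a result set contains, which matches the precision values quoted earlier in the deck; the example sets and function names are hypothetical.

```python
def precision(result, r_exact):
    """Fraction of the exact top-k (R_exact) that the result contains."""
    return len(set(result) & set(r_exact)) / len(r_exact)


def pct_of_optimal_precision(r_alg, r_opt, r_exact):
    """precision_alg / precision_opt; undefined when the optimal plan finds nothing."""
    p_opt = precision(r_opt, r_exact)
    if p_opt == 0:
        raise ValueError("optimal precision is 0, ratio undefined")
    return precision(r_alg, r_exact) / p_opt


# Hypothetical result sets for k = 4.
r_exact = {"a", "b", "c", "d"}
r_opt = {"a", "b", "c", "x"}   # best achievable under the budget
r_alg = {"a", "b", "y", "z"}   # what a heuristic returned

ratio = pct_of_optimal_precision(r_alg, r_opt, r_exact)
print(f"{ratio:.0%}")
```

Normalizing by the optimal plan separates what the budget makes impossible from what the heuristic fails to find.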
25
Results- Sorted Accesses
TREC, k=100
[Figure: percentage of optimal precision (y-axis, 50% to 90%) vs. budget in #SA (x-axis, 500 to 5000) for NRA, KBA, Fair, Ranking]
• Less budget, more improvement
26
Varied k
IMDB, B=400
[Figure: percentage of optimal precision (y-axis, 20% to 90%) vs. k (x-axis: 20, 50, 100) for NRA, KBA, Fair, Ranking]
• Lower K, more improvement.
27
Number of Lists
Zipf, K=100, B=4000
[Figure: percentage of optimal precision (y-axis, 40% to 100%) vs. number of lists (x-axis, 2 to 6) for NRA, KBA, Fair, Ranking]
• More lists, more improvement.
29
Related Works
• Minimize budget for optimal results:
  – The algorithm computes the exact results with minimum cost. (Bast et al. VLDB06, Bruno et al. ICDE02, Chang et al. SIGMOD02)
  – Dual problem.
• Anytime top-k:
  – The algorithm collects statistics during processing, which can be used to provide probabilistic guarantees at any time during processing. (Arai et al. VLDB07)
  – Does not perform any optimization.
• Approximate top-k:
  – Approximate results with probabilistic guarantees. (Theobald et al. VLDB04, Fagin et al. 2001)
30
Conclusions
• First attempt to deal with budget constraints.
• For SA only, average precision around 70%.
• Tradeoff between RAs and SAs: for a relatively low cost of RA, RA schedules are improved.
33
Top-k query
• Given a set of n objects and m scoring lists sorted in decreasing order, find the top-k objects according to a scoring function f.
• top-k: a set T of k objects such that f(rj1,…,rjm) ≤ f(ri1,…,rim) for every object Xi in T and every object Xj not in T.
• Assumption: The scoring function f is monotone
  – f(r1,…,rm) ≤ f(r1’,…,rm’) if ri ≤ ri’ for all i
• Two access modes:
  – sorted access: Cs
  – random access: Cr
• Objective: Compute top-k with the minimum cost
34
Sorted Accesses
• Observations:
  – Objects with high scores have a higher potential to be part of the top-k.
  – Objects with “mediocre” scores do not help.
Prefer high scores
L1 L2 L3
O1, SL1 O1, SL2 O1, SL3
36
Applications
• Mobile Applications
  – Highly impatient users, need fast results.
• Mediation Systems
  – Achieve high query throughput.
• Online analytics (e.g. logs)
  – Achieve high query throughput.
37
Motivating Example
Query throughput
[Figure labels: User query, Mediator, Engine, Servers]
Given #queries per time unit, allocate time for each query.
38
Terminology
1. Sorted Access
2. Random Access
3. highi
4. Top-k queue
5. Candidates queue
6. mink
7. worstScore(d)
8. bestScore(d)
39
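Efficient Offline Solution- Sorted
• Goal: find trace t that maximizes $|R_t \cap R_{exact}|$, such that:
  $\arg\max_{t \in T,\ |t| \le B} |R_t \cap R_{exact}|$
  Denoted as ROPT
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1); interesting positions P1, P2
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2); interesting positions P1, P2
Traces for B=5:
Trace  L1  L2
t1     0   5
t2     1   4
t3     2   3
t4     3   2
t5     4   1
t6     5   0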
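The trace table for B = 5 can be generated and searched by brute force. This is a sketch under stated assumptions, not the efficient offline solution itself: a trace is a tuple of scan depths that spends the whole budget, the result of a trace is the top-k by summed seen scores, and the numeric scores below are hypothetical stand-ins for the symbolic SL1 and SL2 values. REXACT = {o1, o5} follows the top-2 shown earlier in the deck.

```python
def traces(m, budget):
    """Yield every tuple of m scan depths whose sum equals the budget."""
    if m == 1:
        yield (budget,)
        return
    for x in range(budget + 1):
        for rest in traces(m - 1, budget - x):
            yield (x,) + rest


def result_of(trace, lists, k):
    """Top-k objects by summed seen scores after scanning each list to its depth."""
    worst = {}
    for depth, lst in zip(trace, lists):
        for obj, score in lst[:depth]:
            worst[obj] = worst.get(obj, 0.0) + score
    return set(sorted(worst, key=worst.get, reverse=True)[:k])


def best_trace(lists, k, budget, r_exact):
    """Exhaustive argmax of |R_t ∩ R_exact| over all traces (first one wins ties)."""
    return max(traces(len(lists), budget),
               key=lambda t: len(result_of(t, lists, k) & r_exact))


# Object order follows the slide; the scores are hypothetical.
L1 = [("o1", 0.9), ("o5", 0.8), ("o8", 0.3), ("o6", 0.2)]
L2 = [("o1", 0.9), ("o2", 0.6), ("o5", 0.5), ("o4", 0.4), ("o3", 0.3)]
R_EXACT = {"o1", "o5"}

all_traces = list(traces(2, 5))
best = best_trace([L1, L2], k=2, budget=5, r_exact=R_EXACT)
print(all_traces, best)
```

Exhaustive enumeration grows quickly with the number of lists and the budget, which is why restricting the search to the interesting positions matters.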
40
Efficient Offline Solution- Sorted
• Goal: find trace t, such that:
  $\arg\max_{t \in T,\ |t| \le B} |R_t \cap R_{exact}|$
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1); interesting positions P1, P2
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2); interesting positions P1, P2
• Feasible for K up to 100, and m up to 10.
Traces for B=5:
Trace  L1  L2
t1     0   5
t2     1   4
t3     2   3
t4     3   2
t5     4   1
t6     5   0
41
Efficient Offline Solution- Sorted
• Proof (by contradiction):
  – Assume that t does not exist, and choose a trace s that is within the budget and has optimal precision. Let s` be the trace whose depths s`i are the largest positions of Pi that are less than or equal to si.
  – By construction, the score of any object in S is the same as in S`.
42
Fair Heuristic
• Assume budget = b
  $SA_{L_i} = b \cdot \frac{util(L_i, x)}{\sum_{j=1}^{m} util(L_j, x)}$
  $util(L_i, x) = \alpha \cdot util_{as}(L_i, x) + (1 - \alpha) \cdot util_{sr}(L_i, x)$
Runs in batches
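One batch of the fair heuristic can be sketched as a proportional split of the batch budget b across the lists. This is a minimal illustration, assuming each list's utility is a weighted sum of two components with weight alpha; the component values, alpha = 0.5, b = 27, the equal-split fallback, and the function names are all hypothetical.

```python
def combined_util(first, second, alpha):
    """Weighted sum of two utility components (weight alpha on the first)."""
    return alpha * first + (1 - alpha) * second


def fair_allocation(utils, b):
    """Give list i the share b * util_i / sum_j util_j of the batch budget."""
    total = sum(utils)
    if total == 0:
        # Assumption: with no utility signal at all, split the budget equally.
        return [b / len(utils)] * len(utils)
    return [b * u / total for u in utils]


# Hypothetical (first, second) utility components for three lists.
components = [(0.9, 0.2), (0.6, 0.4), (0.5, 0.1)]
utils = [combined_util(f, s, alpha=0.5) for f, s in components]

shares = fair_allocation(utils, b=27)
print([round(s) for s in shares])
```

Because the heuristic runs in batches, the utilities would be recomputed from the new scan positions before the next batch is split. Shares are fractional here; a real scheduler has to round them to whole accesses.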
43
Efficient Offline Solution- Random
• Budget for RAs = (B − |t| * Cs)
[Figure: top-k queue (o1, o4, o2, o3) and candidates (o9, o5, o7, o8, …, o10, o14, …), each with score S and marked against Rexact; candidates are annotated with best(o) − mink, where best(o) = worst(o) + RA]
44
Motivation
• Many applications work in budget-constrained environments. Still, they wish to perform top-k queries.
[Figure labels: User query, Mediator, Engine, Servers; budget-aware query processing]