1 Best-Effort Top-k Query Processing Under Budgetary Constraints Michal Shmueli-Scheuer (IBM Haifa Research Lab and UCI) Yosi Mass, Haggai Roitman Chen.

Post on 14-Dec-2015


Best-Effort Top-k Query Processing Under Budgetary Constraints

Michal Shmueli-Scheuer

(IBM Haifa Research Lab and UCI)

Yosi Mass, Haggai Roitman, Chen Li, Ralf Schenkel, Gerhard Weikum

2

Motivating Example

[Figure: queries are sent to a top-k engine, which returns the top-k results]

• Mobile Applications

– Highly impatient users, need fast results.

• Mediation Systems

– Achieve high query throughput.

• Online Analytics (e.g. logs)

– Achieve high query throughput.

3

Traditional top-k query

• Pre-computed lists over multiple attributes.

• Combine scores by some monotonic aggregation function.

• Two access modes:

– sorted access (Cs)

– random access (Cr)

• Objective: Compute the k objects with the highest scores.

[Figure: m lists over n objects, each list sorted by score]

R1: a 0.9, b 0.6, c 0.5, ….., d 0.4

R2: d 0.87, a 0.85, f 0.5, ….., c 0.2

Rm: c 0.9, b 0.6, g 0.5, ….., a 0.4

4

NRA algorithm (Fagin et al.)

f = SUM

R1: a 0.9, b 0.6, c 0.5, ….., d 0.4

R2: d 0.87, a 0.85, f 0.25, ….., c 0.2

highi (last score seen in each list): 0.9 in R1, 0.87 in R2

Top-2 [worst score, best score]: a[0.9,1.77], d[0.87,1.77]

mink = 0.87

candidates: (empty)

Stop when mink > best-score of candidates.

5

NRA algorithm (Fagin et al.)

R1: a 0.9, b 0.6, c 0.5, ….., d 0.4

R2: d 0.87, a 0.85, f 0.25, ….., c 0.2

highi (last score seen in each list): 0.6 in R1, 0.85 in R2

Top-2 [worst score, best score]: a[1.75,1.75], d[0.87,1.47]

mink = 0.87

candidates: b[0.6,1.45]

Stop when mink > best-score of candidates.

6

NRA algorithm (Fagin et al.)

R1: a 0.9, b 0.6, c 0.5, ….., d 0.4

R2: d 0.87, a 0.85, f 0.25, ….., c 0.2

highi (last score seen in each list): 0.5 in R1, 0.25 in R2

Top-2 [worst score, best score]: a[1.75,1.75], d[0.87,1.37]

mink = 0.87

candidates: b[0.6,0.85], c[0.5,0.75], f[0.25,0.75]

mink > best-score of candidates (0.87 > 0.85), so the algorithm stops.
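The three NRA steps above can be replayed with a short sketch. This is a minimal illustration and not the authors' implementation: the function name `nra`, the round-robin access order, and the tie-breaking by object name are assumptions, and the elided list entries ("…..") are simply left out, which does not matter here because the run stops at depth 3.

```python
def nra(lists, k):
    """NRA with f = SUM over lists of (object, score) sorted by descending score.

    Returns (top_k, bounds, depth), where bounds maps every seen object to its
    (worst score, best score) interval.
    """
    seen = {}  # object -> {list index: score}
    max_depth = max(len(lst) for lst in lists)
    for depth in range(1, max_depth + 1):
        # One round of sorted accesses: read the next entry of every list.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                obj, score = lst[depth - 1]
                seen.setdefault(obj, {})[i] = score
        # high[i]: last score seen in list i (0 once list i is exhausted).
        high = [lst[depth - 1][1] if depth <= len(lst) else 0.0 for lst in lists]
        bounds = {}
        for obj, known in seen.items():
            worst = sum(known.values())
            best = worst + sum(h for i, h in enumerate(high) if i not in known)
            bounds[obj] = (worst, best)
        ranked = sorted(bounds, key=lambda o: (-bounds[o][0], o))
        top_k, candidates = ranked[:k], ranked[k:]
        if len(top_k) == k:
            min_k = bounds[top_k[-1]][0]
            # An object not seen in any list yet can score at most sum(high).
            best_rest = max([bounds[c][1] for c in candidates] + [sum(high)])
            if min_k > best_rest:
                break
    return top_k, bounds, depth


# Lists from the NRA slides (elided entries omitted).
R1 = [("a", 0.9), ("b", 0.6), ("c", 0.5), ("d", 0.4)]
R2 = [("d", 0.87), ("a", 0.85), ("f", 0.25), ("c", 0.2)]

top_k, bounds, depth = nra([R1, R2], k=2)
print(top_k, depth)
for obj in sorted(bounds):
    print(obj, [round(x, 2) for x in bounds[obj]])
```

The sketch also compares mink against the best possible score of a completely unseen object (the sum of the highi values), which is the usual NRA safeguard; on this data that bound is 0.75 at depth 3 and does not change the outcome.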

7

Top-k with Budget Constraints

Given budget B, maximize result quality.

f = SUM

R1: s 0.95, u 0.93, t 0.92, d 0.9, x 0.5, y 0.4, z 0.2

R2: a 1.0, b 0.9, c 0.85, d 0.8, e 0.7, t 0.6, f 0.4, ..

Top-2: d 1.7, t 1.52

Access Costs

Sorted access cost- Cs

Random access cost- Cr

Cs=1, Cr=3

NRA: 12Cs = 12

TA: 7Cs + 7Cr = 28

Budget = 10 ?

NRA: precision = 0.5

TA: precision = 0
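The numbers in this example can be checked with a small sketch. It assumes that a budget-limited NRA run reads the lists round-robin until the budget is spent and then returns the k objects with the highest worst scores; the helper names `topk_by_worst` and `precision` are hypothetical, and the TA run is only costed, not simulated.

```python
def topk_by_worst(lists, depths, k):
    """Top-k by worst score (sum of the scores seen so far) after reading
    the first depths[i] entries of list i with sorted accesses only."""
    worst = {}
    for lst, depth in zip(lists, depths):
        for obj, score in lst[:depth]:
            worst[obj] = worst.get(obj, 0.0) + score
    return sorted(worst, key=lambda o: (-worst[o], o))[:k]


def precision(result, exact):
    """Fraction of the returned objects that belong to the exact top-k."""
    return len(set(result) & set(exact)) / len(result) if result else 0.0


R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]
CS, CR, BUDGET = 1, 3, 10
EXACT = ["d", "t"]  # top-2 stated on the slide

nra_cost = 12 * CS          # 12, above the budget
ta_cost = 7 * CS + 7 * CR   # 28, above the budget

# Round-robin sorted accesses under the budget: 10 accesses, depth 5 per list.
depth = BUDGET // (2 * CS)
result = topk_by_worst([R1, R2], [depth, depth], k=2)
print(nra_cost, ta_cost, result, precision(result, EXACT))
```

With depth 5 in both lists, d is fully scored (1.7) but t has only been seen in R1 (0.92), so a (1.0) takes the second slot and precision is 0.5, matching the slide.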

8

Contributions

• Sorted Accesses

– Efficient Plan

– Solution with Adaptive

• Sorted and Random Accesses

– Efficient Plan

– Solution with Adaptive

• Experiments

9

Results Under Limited Budget

[Figure: k results for unlimited budget compared with results for limited budget]

10

Efficient Plan- Sorted Accesses

• Assume that we know the k results for unlimited budget (REXACT).

• Interesting positions- where the k objects appear in the lists.

• Plan – {L1,4} {L2,2}

Top-2: o5, o1

[Figure: lists L1 and L2 with the interesting positions P1, P2, Q1, Q2 marked]

L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)

L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)

11

Efficient Plan- Sorted Accesses

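• Goal: find plan t, such that:

$\arg\max_{t \in T,\ |t| \le B} \frac{|R_t \cap R_{exact}|}{|R_t|}$

Its result is denoted as ROPT.

[Figure: lists L1 and L2 with the interesting positions P1, P2, Q1, Q2 marked]

L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)

L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)

Plans for B=5

Plan: {L1,2} {L2,3}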
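As a hedged illustration of the plan search, the sketch below enumerates only plans whose depths are interesting positions (or 0) and keeps the one with the best precision within the budget. Because the lists on this slide carry no numeric scores, it reuses R1 and R2 from the budget example (Cs=1, B=10, REXACT = {d, t}); all function names are hypothetical, and a plan's result is assumed to be the top-k by worst score.

```python
from itertools import product


def topk_by_worst(lists, depths, k):
    """Top-k by worst score after reading depths[i] entries of list i."""
    worst = {}
    for lst, depth in zip(lists, depths):
        for obj, score in lst[:depth]:
            worst[obj] = worst.get(obj, 0.0) + score
    return sorted(worst, key=lambda o: (-worst[o], o))[:k]


def precision(result, exact):
    return len(set(result) & set(exact)) / len(result) if result else 0.0


def interesting_depths(lst, exact):
    """0 (skip the list) plus every 1-based position holding an object of R_exact."""
    return [0] + [pos for pos, (obj, _) in enumerate(lst, start=1) if obj in exact]


def best_plan(lists, exact, k, budget, cs=1):
    """Best (precision, depths) among plans over interesting positions within budget."""
    best_precision, best_depths = -1.0, None
    options = [interesting_depths(lst, exact) for lst in lists]
    for depths in product(*options):
        if sum(depths) * cs > budget:
            continue
        p = precision(topk_by_worst(lists, depths, k), exact)
        if p > best_precision:
            best_precision, best_depths = p, depths
    return best_precision, best_depths


R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]

print(best_plan([R1, R2], exact={"d", "t"}, k=2, budget=10))
```

On this data the plan {R1,4} {R2,6} spends exactly the budget of 10 and reaches precision 1.0, whereas reading both lists to depth 5 gives only 0.5.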

12

Sorted Accesses

• Observations: Prefer high scores

[Figure: lists L1, L2, L3 with the positions of O2 and O1]

L1: (O2, SL1), (O1, SL1)

L2: (O2, SL2), (O1, SL2)

L3: (O2, SL3)

13

Observations – contd.

• Prefer large score reductions

[Figure: scores (0 to 1) of the <title> and <description> lists for the query title=“war” description=“weapon”]

14

Score Utilities

List (sorted by score): o2 1, o4 0.9, o5 0.8, o3 0.7, o1 0.6

Score gain: $\frac{1 + 0.9 + 0.8}{3} = 0.9$

Score reduction: $1 - 0.8 = 0.2$

15

Optimization Problem

$\text{maximize} \sum_{i=1}^{m} util(L_i, x) \quad \text{s.t.} \quad \sum_{i=1}^{m} b_i \le b$

Where m is the number of lists

• Bi-objective optimization problem:

util(Li,x) = α * gain + (1-α) * reduction

Heuristics:

• Fair Heuristic

• Rank Heuristic
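The utility of a list can be sketched as follows. This is an assumption-laden illustration: based on the numbers on the Score Utilities slide (1, 0.9, 0.8), gain is taken to be the average score of the next batch of entries and reduction the drop from the first to the last score of that batch. The names `score_gain`, `score_reduction`, `util`, and the weight `alpha` are hypothetical.

```python
def score_gain(scores):
    """Average score of the next batch of entries in a list (assumed definition)."""
    return sum(scores) / len(scores)


def score_reduction(scores):
    """Drop from the first to the last score of the batch (assumed definition)."""
    return scores[0] - scores[-1]


def util(scores, alpha):
    """Weighted combination of gain and reduction; alpha in [0, 1]."""
    return alpha * score_gain(scores) + (1 - alpha) * score_reduction(scores)


batch = [1.0, 0.9, 0.8]  # the three highest scores on the Score Utilities slide
print(round(score_gain(batch), 2), round(score_reduction(batch), 2))
print(round(util(batch, alpha=0.5), 2))
```

With alpha = 1 the utility considers only high scores, and with alpha = 0 only large score reductions, which mirrors the two observations on the preceding slides.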

16

Adaptive

[Figure: the weight α of gain and the weight (1-α) of reduction over time]

17

Adaptive

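[Figure: lists L1, L2, L3 with (O1, SL1), (O1, SL2), (O1, SL3); hight1 and hight2 mark the high values at two points in time]

top-k: o1 [ws,bs], o2 [ws,bs], o3 [0.8,bs]

candidates: o4 [0.6,bs], o6 [ws,bs]

$p_k(c) = P\left[\sum_{i \in E(c)} S_i > \delta(c) \mid S_i \le high_i\right]$ (Theobald et al. VLDB04)

δ(o4) = 0.8-0.6 = 0.2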

18

Adaptive

$\hat{p}_k = \frac{\sum_{c \in cand.set} p_k(c)}{|cand.set|}$

[Figure: TREC query, k=100]

19

Efficient Plan- Random Accesses

• Observations:

– random accesses always occur after the sorted accesses have finished.

schedule 1: {SA……RA……SA….}

schedule 2: {SA……SA……RA….}

precision(schedule1) = precision(schedule2)

20

Observations- contd.

• Random accesses are only useful for objects in REXACT.

[Figure: L2 holds (o1, SL2), (o2, SL2), (o5, SL2), and o5 is not in REXACT. Two outcomes are shown for the top-k and candidate queues over o1 to o5, each with [ws,bs]: precision remains the same, or precision is reduced.]

21

Random Accesses

Gathering with Sorted

Probing with Random

• When to switch from SA to RA?

[Figure: timeline split into α and (1-α). Switching too late: not enough RAs to prune the candidates. Switching too early: not enough good candidates, RA is wasted.]

22

Random Accesses

• Switch from Sorted to Random:

R = (1-α)*S

S – total cost of sorted accesses.

R – total cost for random accesses.

S + R > B

• Which items to access?

– maximize expected score.

23

Experimental Data

• TREC Terabyte

– 25M webpages

– 50 queries with average length of 3 words.

• IMDB

– 375,000 movies

– 20 queries, each with 4 attributes: {Title, Genre, Actors, Description}

• Synthetic data

– Zipf, #lists = [2,6], #objects = [10000,1000000]

• Aggregate Function: Sum

24

Evaluation Methods

• percentage of optimal precision: $\frac{precision_{alg}}{precision_{opt}}$

• SME: Ralg, Ropt; Ropt, Rexact

25

Results- Sorted Accesses

[Figure: percentage of optimal precision (y-axis, 50% to 90%) vs. budget in #SA (x-axis, 500 to 5000) for NRA, KBA, Fair, and Ranking; TREC, k=100]

• Less budget, more improvement

26

Varied k

[Figure: percentage of optimal precision (y-axis, 20% to 90%) vs. k (x-axis: 20, 50, 100) for NRA, KBA, Fair, and Ranking; IMDB, B=400]

• Lower k, more improvement.

27

Number of Lists

[Figure: percentage of optimal precision (y-axis, 40% to 100%) vs. number of lists (x-axis, 2 to 6) for NRA, KBA, Fair, and Ranking; Zipf, k=100, B=4000]

• More lists, more improvement.

28

Results- Random Accesses

[Figure: TREC, k=100, Cr=10]

[Figure: TREC, k=100, Cr=100]

29

Related Works

• Minimize budget for optimal results:

– The algorithm computes the exact results with minimum cost. (Bast et al. VLDB06, Bruno et al. ICDE02, Chang et al. SIGMOD02)

– Dual problem.

• Anytime top-k:

– The algorithm collects statistics during processing, which can be used to provide probabilistic guarantees at any time during processing. (Arai et al. VLDB07)

– Does not do any optimization.

• Approximate top-k:

– Approximate results with probabilistic guarantees. (Theobald et al. VLDB04, Fagin et al. 2001)

30

Conclusions

• First attempt to deal with budget constraints.

• For SA only, average precision around 70%.

• Tradeoff between RAs and SAs: when the cost of an RA is relatively low, schedules that use RAs give better results.

31

Thank You !

32

33

Top-k query

• Given a set of n objects and m scoring lists sorted in decreasing order, find the top-k objects according to a scoring function f

• top-k: a set T of k objects such that f(rj1,…,rjm) ≤ f(ri1,…,rim) for every object Xi in T and every object Xj not in T

• Assumption: The scoring function f is monotone

– f(r1,…,rm) ≤ f(r1’,…,rm’) if ri ≤ ri’ for all i

• Two access modes:

– sorted access - Cs

– random access - Cr

• Objective: Compute top-k with the minimum cost
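The definition can be made concrete with a brute-force sketch that scores every object and keeps the k best, ignoring access costs altogether. The function name `exact_topk` is hypothetical, and treating a score that is missing from a list as 0 is an assumption the slides do not state; the data are R1 and R2 from the budget-constraints example.

```python
def exact_topk(lists, k, f=sum):
    """Brute-force top-k: aggregate every object's scores with a monotone f.

    Assumption: an object that does not appear in a list scores 0 there.
    """
    lookups = [dict(lst) for lst in lists]
    objects = {obj for lookup in lookups for obj in lookup}
    total = {obj: f([lookup.get(obj, 0.0) for lookup in lookups]) for obj in objects}
    # Highest aggregated score first; ties broken by object name.
    return sorted(total, key=lambda o: (-total[o], o))[:k]


R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]

print(exact_topk([R1, R2], k=2))          # f = SUM
print(exact_topk([R1, R2], k=2, f=min))   # another monotone aggregate
```

This reads every entry of every list, which is exactly the cost that sorted-access algorithms such as NRA try to avoid; it is useful only as a reference for REXACT.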

34

Sorted Accesses

• Observations:

– an object with high scores has a higher potential to be part of the top-k.

– an object with “mediocre” scores does not help.

Prefer high scores

[Figure: lists L1, L2, L3 with (O1, SL1), (O1, SL2), (O1, SL3)]

35

Example

uselessQ

Wireless zone

36

Applications

• Mobile Applications

– Highly impatient users, need fast results.

• Mediation Systems

– Achieve high query throughput.

• Online analytics (e.g. logs)

– Achieve high query throughput.

37

Motivating Example

Query throughput

[Figure labels: User query, Mediator, Servers, Engine]

Given #queries per time unit

Allocate time for each query

38

Terminology

1. Sorted Access

2. Random Access

3. highi

4. Top-k queue

5. Candidates queue

6. mink

7. worstScore(d)

8. bestScore(d)

39

Efficient Offline Solution- Sorted

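• Goal: find trace t, such that:

Precision of a trace: $\frac{|R_t \cap R_{exact}|}{|R_t|}$

$\arg\max_{t \in T,\ |t| \le B} \frac{|R_t \cap R_{exact}|}{|R_t|}$

Its result is denoted as ROPT.

[Figure: lists L1 and L2, each with positions P1 and P2 marked]

L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)

L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)

B=5

Traces (depth in L1, depth in L2):

t1: 0, 5

t2: 1, 4

t3: 2, 3

t4: 3, 2

t5: 4, 1

t6: 5, 0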

40

Efficient Offline Solution- Sorted

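• Goal: find trace t, such that:

$\arg\max_{t \in T,\ |t| \le B} \frac{|R_t \cap R_{exact}|}{|R_t|}$

• Feasible for K up to 100, and m up to 10.

[Figure: lists L1 and L2, each with positions P1 and P2 marked]

L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)

L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)

B=5

Traces (depth in L1, depth in L2):

t1: 0, 5

t2: 1, 4

t3: 2, 3

t4: 3, 2

t5: 4, 1

t6: 5, 0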

41

Efficient Offline Solution- Sorted

• Proof (by contradiction):

– Assume that t does not exist, and choose a trace s that is within the budget and has optimal precision. Let s` be the trace whose depths s`i are the largest position of Pi less than or equal to si.

– By construction, the score of any object in s is the same as in s`.

42

Fair Heuristic

• Assume budget = b

$SA_{L_i} = b \cdot \frac{util(L_i, x)}{\sum_{j=1}^{m} util(L_j, x)}$

$util(L_i, x) = \alpha \cdot util_{sa}(L_i, x) + (1-\alpha) \cdot util_{sr}(L_i, x)$

Runs in batches
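A minimal sketch of the fair split for one batch follows, assuming the budget share of each list is proportional to its utility. The function name `fair_allocation` is hypothetical, and rounding the shares to whole accesses by largest remainder is an added assumption, since the slide does not say how fractional shares are handled.

```python
def fair_allocation(utils, batch_budget):
    """Split batch_budget sorted accesses across lists in proportion to utility.

    utils[i] is util(L_i, x) for the current batch. Fractional shares are
    rounded down and the leftover accesses go to the largest remainders.
    """
    total = sum(utils)
    if total <= 0:
        raise ValueError("at least one list needs a positive utility")
    shares = [batch_budget * u / total for u in utils]
    alloc = [int(share) for share in shares]
    leftover = batch_budget - sum(alloc)
    by_remainder = sorted(range(len(utils)),
                          key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


print(fair_allocation([3, 1], batch_budget=8))
print(fair_allocation([1, 1, 1], batch_budget=10))
```

Because the heuristic runs in batches, the utilities would be recomputed after each batch from the scores seen so far and the function called again with the next slice of the budget.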

43

Efficient Offline Solution- Random

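• Budget for RAs = (B-|t|*Cs)

[Figure: the top-k queue (o1, o4, o2, o3) and the remaining objects (o9, o5, o7, o8, …., o10, o14, ….), each with its score S; objects in Rexact are marked. Objects are compared by best(o)-mink, where best(o) = worst(o)+RA.]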

44

Motivation

• Many applications work in budget-constrained environments. Still, they wish to perform top-k queries.

[Figure labels: User query, Mediator, Servers, Engine]

Budget-aware query processing

45

Future work

• Different access costs for different lists

• Time-aware top-k

• Top-k with budget constraints for P2P
