Tuning the top-k view update process Eftychia Baikousi Panos Vassiliadis University of Ioannina Dept. of Computer Science.

Tuning the top-k view update process

Eftychia Baikousi, Panos Vassiliadis

University of Ioannina

Dept. of Computer Science

M-Pref 2007, Vienna 23/9/2007 2

Forecast

Problem of maintaining materialized top-k views when updates occur in the base relation

Extra difficulty: address the problem in the presence of high deletion rates

The crux of the approach is to materialize an appropriate number of extra tuples, kcomp, to sustain deletion rates that are drastically higher than average

The correct estimation and fine tuning of kcomp is not obvious

We use appropriate probabilistic methods


Contents

Motivation & Problem Definition

Overview of our Method

Computation of rates affecting the view

Computation of kcomp

Fine tuning kcomp

Experiments

Conclusions


Top-k query

Given a relation R (id, x1, x2, x3) and a query Q = sum(x1, x2, x3), find the k tuples with the highest grades according to Q.

R
id   x1    x2    x3    sum
a    0.3   0.6   0.7   1.6
b    0.2   0.3   0.4   0.9
c    0.4   0.5   0.9   1.8
d    0.7   0.6   0.1   1.4

Top-2 tuples: c (1.8) and a (1.6)
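The top-k query above can be sketched in a few lines of Python. This is only an illustration, assuming the relation fits in memory; the function name `top_k` and the dict layout are hypothetical and not part of the slides.

```python
import heapq

# Relation R from the slide: id -> (x1, x2, x3)
R = {
    "a": (0.3, 0.6, 0.7),
    "b": (0.2, 0.3, 0.4),
    "c": (0.4, 0.5, 0.9),
    "d": (0.7, 0.6, 0.1),
}

def top_k(relation, k, score=sum):
    """Return the k (id, score) pairs with the highest score, best first."""
    scored = ((tid, score(attrs)) for tid, attrs in relation.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

top2_ids = [tid for tid, _ in top_k(R, 2)]
print(top2_ids)  # ['c', 'a']
```

`heapq.nlargest` avoids sorting the whole relation when k is much smaller than the number of tuples.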


Motivating Example

Shopping Center

Customers sign in with a palmtop (PDA)

Need for advertisements: special offers to customers

Given the relation Customers (id, name, age, salary, …), a materialized view V holds the top-2 customers (younger and highly paid) according to the query Q: -age + 2*salary

Maintain the view V as customers sign in and out (e.g., train departures, working hours)

Customers
id   name    age   salary   Q
1    John    18    20       22
2    Mary    42    25       8
3    Bill    26    35       44
4    Peter   57    37       17

V
name   Q
Bill   44
John   22


Problem definition

Given

a base relation R (ID, X, Y) that originally contains N tuples,

a materialized view V that contains the top-k tuples of the form (id, val), where val is the score according to a function Q(x,y) = ax + by and a, b are constant parameters,

the update ratios ins, del and upd for insertions, deletions and updates, respectively, over the base relation R,

Compute kcomp, of the form kcomp = k + Δk, such that the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period T

(Figure: the view V (id, Q), holding kcomp tuples: the top k plus Δk extra tuples.)


Related Work

Ke Yi, Hai Yu, Jun Yang, Gangqiang Xia, Yuguo Chen: “Efficient Maintenance of Materialized Top-k Views”, ICDE ’03

Maintains a materialized top-k view when updates occur in the base table

Computes a kmax (instead of the necessary k), adjusted at runtime so that a refill query is rarely needed

Formulates the problem through a random walk model

The method is theoretically guaranteed to work well only when

insertions and deletions are equally probable, pins = pdel, or

insertions are more probable than deletions, pins > pdel

There is no quality-of-service guarantee when deletions are more probable than insertions, pins < pdel


Motivating Example

Customers sign in and out, due to train departures, working hours

At certain time periods, deletions are more probable than insertions, pins < pdel

The view will not contain at least k tuples

Customers
id   name    age   salary   Q
1    John    18    20       22
2    Mary    42    25       8
3    Bill    26    35       44
4    Peter   57    37       17

V
name   Q
Bill   44
John   22
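The failure mode on this slide can be reproduced with a small sketch: a view that materializes exactly k tuples drops below k after a single deletion of a view member. The helper names (`score`, `build_view`, `delete`) and the in-memory representation are assumptions made for illustration only.

```python
import heapq

def score(age, salary):
    # Q from the slides: younger and highly paid customers rank first
    return -age + 2 * salary

# Customers relation from the slide: id -> (name, age, salary)
customers = {
    1: ("John", 18, 20),
    2: ("Mary", 42, 25),
    3: ("Bill", 26, 35),
    4: ("Peter", 57, 37),
}

def build_view(base, size):
    """Materialize the `size` best (name, Q) rows of the base relation."""
    rows = [(name, score(age, salary)) for name, age, salary in base.values()]
    return heapq.nlargest(size, rows, key=lambda row: row[1])

def delete(base, view, cid):
    """Delete customer `cid` from the base relation and, if present, from the view."""
    name, age, salary = base.pop(cid)
    row = (name, score(age, salary))
    if row in view:
        view.remove(row)

k = 2
view = build_view(customers, k)   # Bill 44, John 22
delete(customers, view, 3)        # Bill signs out
needs_refill = len(view) < k      # True: only John is left in the view
```

Materializing more than k tuples up front is what lets the view absorb such deletions without an immediate refill query against the base relation.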


Overview of the method

1. Compute the ratios of the incoming source updates that affect the view

2. Compute kcomp

3. Fine tune kcomp


Empirical Cumulative Distribution Function (ECDF)

ECDF is a non-parametric cumulative distribution function that adapts itself to the data

Definition

$$F_n(x) = \frac{\text{number of values} \le x}{n}$$

Fn(x)

represents the proportion of observations in a sample less than or equal to x

assigns the probability 1/n to each of n observations in the sample

estimates the true population proportion F(x)
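The definition above translates directly into code. The following is a minimal sketch, evaluated on the four Q scores of the running Customers example; the name `make_ecdf` is hypothetical.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return F_n, the empirical CDF of `sample`: F_n(x) = (#values <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("ECDF is undefined for an empty sample")

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return ecdf

scores = [22, 8, 44, 17]      # Q values of John, Mary, Bill, Peter
F = make_ecdf(scores)
print(F(22))                  # 0.75, since 8, 17 and 22 are <= 22
```

Each observation contributes a step of height 1/n, which is the "probability 1/n per observation" property stated on the slide.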


Computation of update rates that affect V

Given

a relation Customers (id, name, age, salary, …) having N=4 tuples,

a materialized view V containing the top-2 tuples (k=2) of the form (id, Q), where Q = -age + 2*salary is the score,

update ratios ins=1, del=2, upd=0,

Find ins_aff and del_aff (insertions & deletions affecting the view)

Customers
id   name    age   salary   Q
1    John    18    20       22
2    Mary    42    25       8
3    Bill    26    35       44
4    Peter   57    37       17

V
name   Q
Bill   44
John   22


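Computation of update rates that affect V

Given N=4, ins=1, del=2, upd=0, we compute the following:

Updates are treated as a combination of deletions and insertions

$$\mathit{ins} = \mathit{ins} + \mathit{upd} = 1, \qquad \mathit{del} = \mathit{del} + \mathit{upd} = 2$$

$$p_{\mathit{ins}} = \frac{\mathit{ins}}{\mathit{ins} + \mathit{del}} = \frac{1}{3}, \qquad p_{\mathit{del}} = \frac{\mathit{del}}{\mathit{ins} + \mathit{del}} = \frac{2}{3}$$

From the ECDF, the probability of a new tuple affecting the view

$$p(z > \mathit{val}_k) = \frac{\mathit{kcomp}}{N} = \frac{\mathit{kcomp}}{4}$$

$$p_{\mathit{ins\_aff}} = p_{\mathit{ins}} \cdot p(z > \mathit{val}_k) = \frac{\mathit{kcomp}}{12}, \qquad p_{\mathit{del\_aff}} = p_{\mathit{del}} \cdot p(z > \mathit{val}_k) = \frac{2\,\mathit{kcomp}}{12}$$

Ratios affecting the view

$$\mathit{ins\_aff} = p_{\mathit{ins\_aff}} \cdot (\mathit{ins} + \mathit{del}) = \frac{\mathit{kcomp}}{3}, \qquad \mathit{del\_aff} = p_{\mathit{del\_aff}} \cdot (\mathit{ins} + \mathit{del}) = \frac{2\,\mathit{kcomp}}{3}$$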
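The probability part of this computation can be sketched as follows. This is only an illustration under stated assumptions: the function name `affect_probabilities` is hypothetical, exact fractions are used to avoid rounding, and kcomp = 3 is plugged in as a trial value.

```python
from fractions import Fraction

def affect_probabilities(n, kcomp, ins, delete, upd=0):
    """Return (p_ins_aff, p_del_aff) for a view of kcomp tuples over n base tuples."""
    # Updates are treated as a combination of a deletion and an insertion
    ins_total = ins + upd
    del_total = delete + upd
    total = ins_total + del_total

    p_ins = Fraction(ins_total, total)   # probability that an operation is an insertion
    p_del = Fraction(del_total, total)   # probability that an operation is a deletion
    p_z = Fraction(kcomp, n)             # p(z > val_k) = kcomp / N

    return p_ins * p_z, p_del * p_z

p_ins_aff, p_del_aff = affect_probabilities(n=4, kcomp=3, ins=1, delete=2)
print(p_ins_aff, p_del_aff)  # 1/4 1/2, i.e. kcomp/12 and 2*kcomp/12 for kcomp = 3
```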



Compute kcomp, of the form kcomp = k + Δk, such that it guarantees that the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period of operation T

$$\mathit{kcomp} = k + (\mathit{del\_aff} - \mathit{ins\_aff})$$

$$\mathit{kcomp} = 2 + \frac{2\,\mathit{kcomp}}{3} - \frac{\mathit{kcomp}}{3} \;\Rightarrow\; \mathit{kcomp} = 3$$

(Figure: the view V (id, Q), holding kcomp tuples: the top k plus Δk extra tuples.)

Customers
id   name    age   salary   Q
1    John    18    20       22
2    Mary    42    25       8
3    Bill    26    35       44
4    Peter   57    37       17

V
name    Q
Bill    44
John    22
Peter   17



Customers
id   name    age   salary   Q
1    John    18    20       22
2    Mary    42    25       8
3    Bill    26    35       44
4    Peter   57    37       17
5    Kate    25    30       25

V
name    Q
Bill    44
Kate    25
John    22
Peter   17

There is 1 insertion and 2 deletions affecting the view

Tuple (5, Kate, 25, 30) is inserted, and tuples (3, Bill, 26, 35) and (4, Peter, 57, 37) are deleted from the view

The view will contain 2 tuples, as initially needed


Fine tune kcomp

kcomp is expressed as a formula depending on ins_aff and del_aff, the ratios of insertions and deletions affecting the view

The probability of a tuple affecting the view may vary according to probabilistic properties

Fine tune kcomp by adding the appropriate variance


Fine tune kcomp

The probability of a new tuple z affecting the view is p(z > valk)

Bernoulli experiment with 2 possible events

New tuple z affecting the view, with probability p(z)

New tuple z not affecting the view, with probability 1 - p(z)

ins insertions in the base relation correspond to ins Bernoulli experiments

The number of successes of ins Bernoulli experiments follows a Binomial distribution with variance:

$$\mathit{VAR}_{\mathit{ins}} = \mathit{ins} \cdot p(z > \mathit{val}_k) \cdot \bigl(1 - p(z > \mathit{val}_k)\bigr)$$
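The variance formula can be evaluated with a short sketch. Two points are assumptions of this example rather than statements from the slides: the deletions are given the same binomial treatment as the insertions, and p(z > valk) is set to kcomp/N with the trial value kcomp = 3 for N = 4.

```python
from fractions import Fraction

def binomial_variance(n_trials, p_success):
    """Variance of the number of successes in n_trials Bernoulli experiments."""
    return n_trials * p_success * (1 - p_success)

p_affect = Fraction(3, 4)                   # assumed p(z > val_k) = kcomp / N = 3/4

var_ins = binomial_variance(1, p_affect)    # ins = 1 insertion
var_del = binomial_variance(2, p_affect)    # del = 2 deletions (assumed analogous)
print(var_ins, var_del)                     # 3/16 3/8
```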


Fine tune kcomp

In the worst case, in order to guarantee that the view will contain at least k tuples with 95% confidence, kcomp is computed as:

$$\mathit{kcomp} = k + (\mathit{del\_aff} - \mathit{ins\_aff}) + 2 \cdot \mathit{VAR}_{\mathit{ins}} + 2 \cdot \mathit{VAR}_{\mathit{del}}$$

VARins denotes the variance of the insertions

VARdel denotes the variance of the deletions


Experimental methodology

Test the following methods

kcomp without fine tuning

kcomp with fine tuning

Yi et al. @ ICDE03

For the following measures

Number of tuples (# tuples) deleted from the view that fall below the threshold value of k

Memory overhead for kcomp with & without fine tuning, as the number of extra tuples that need to be kept in the view

Number of extra tuples for kcomp with & without fine tuning, compared to the number of extra tuples of the related work


Experimental methodology

Synthetic data sets:

Gaussian distribution with mean μ=50 and variance σ=10

Negative exponential distribution with parameters a=1.0 for X and a=2.0 for Y

Zipf distribution with parameter a=2.1

Experimental parameters:

Parameter                                   Symbol   Values
Size of source table R (tuples)             |R|      1x10^5, 5x10^5, 1x10^6, 2x10^6
Size of mat. view (tuples)                  k        5, 10, 100, 1000
Size of update stream (pct over |R|)        λ        1/1000, 1/100
Deletion rate over insertion rate (ratio)   D/I      1.0, 1.5, 2.0
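A setup along these lines could be generated as in the sketch below, which is not the authors' generator. It makes several assumptions: σ=10 is read as the standard deviation, the parameter a of the negative exponential is read as its rate, the Zipf data set is omitted because the Python standard library has no Zipf sampler, and all function names are hypothetical.

```python
import random

def gaussian_relation(n, mu=50.0, sigma=10.0, seed=0):
    """Generate n tuples (id, x, y) with Gaussian attributes (sigma assumed to be std dev)."""
    rng = random.Random(seed)
    return [(i, rng.gauss(mu, sigma), rng.gauss(mu, sigma)) for i in range(n)]

def exponential_relation(n, rate_x=1.0, rate_y=2.0, seed=0):
    """Generate n tuples (id, x, y) with exponential attributes (a assumed to be the rate)."""
    rng = random.Random(seed)
    return [(i, rng.expovariate(rate_x), rng.expovariate(rate_y)) for i in range(n)]

def update_stream_sizes(n, stream_fraction, d_over_i):
    """Split an update stream of n * stream_fraction operations into (insertions, deletions)."""
    total = round(n * stream_fraction)
    insertions = round(total / (1 + d_over_i))
    return insertions, total - insertions

R = gaussian_relation(1000)
# |R| = 1x10^5, stream size 1/100, D/I = 1.5 -> 400 insertions and 600 deletions
print(update_stream_sizes(100_000, 0.01, 1.5))
```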


Max & average misses

kcomp without fine tuning, Gaussian distribution

(Figure, first panel: # misses as a function of R, for D/I=1.5 and k=1000; series: max misses and avg misses for λ=0.1% and λ=1%.)

(Figure, second panel: # max misses as a function of k and D/I, for R=1M and λ=1%; series: D/I = 1.0, D/I = 1.5, D/I = 2.0.)


Memory overhead

Number of extra tuples as a function of R and D/I

(Figure: # tuples as a function of R for D/I=1, D/I=1.5 and D/I=2, with k=1000 and λ=1%; series: K, KCOMP, KCOMP tuning.)


Comparison with related work

Number of extra tuples of kcomp with fine tuning compared with kmax of the related work, as a function of R

(Figure: # tuples as a function of R, from 100000 to 2000000, with k=100 and λ=1%; series: KCOMP tuning, [10].)


Comparison with related work

Number of extra tuples of kcomp with fine tuning compared with kmax of the related work, as a function of k

(Figure: # tuples as a function of k, for k = 5, 10, 100, 1000, with R=2M and λ=1%; series: KCOMP tuning, [10].)


Conclusions

We handled the problem of maintaining materialized top-k views in the presence of high deletion rates

The method comprises the following steps:

a computation of the rate that actually affects the materialized view,

a computation of the necessary extension to k in order to handle the augmented number of deletions that occur, and

a fine tuning part that adjusts this value to take the fluctuation of the statistical properties of this value into consideration


Thank you for your attention!

… many thanks to our hosts!

This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).


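Auxiliary slides

Formulas for kcomp

$$\mathit{ins} = \mathit{ins} + \mathit{upd}, \qquad \mathit{del} = \mathit{del} + \mathit{upd}$$

$$p_{\mathit{ins}} = \frac{\mathit{ins}}{\mathit{ins} + \mathit{del}}, \qquad p_{\mathit{del}} = \frac{\mathit{del}}{\mathit{ins} + \mathit{del}}$$

$$p(z > \mathit{val}_k) = \frac{\mathit{kcomp}}{N}$$

$$p_{\mathit{ins\_aff}} = p_{\mathit{ins}} \cdot p(z), \qquad p_{\mathit{del\_aff}} = p_{\mathit{del}} \cdot p(z)$$

$$\mathit{ins\_aff} = p_{\mathit{ins\_aff}} \cdot (\mathit{ins} + \mathit{del}), \qquad \mathit{del\_aff} = p_{\mathit{del\_aff}} \cdot (\mathit{ins} + \mathit{del})$$

$$\mathit{kcomp} = k + (\mathit{del\_aff} - \mathit{ins\_aff}) \;\Rightarrow\; \mathit{kcomp} = \frac{k \cdot N}{N - \mathit{del} + \mathit{ins}}$$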
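Substituting the general formulas into kcomp = k + (del_aff - ins_aff) and solving for kcomp gives a closed form that is easy to evaluate. The sketch below follows that algebra; the function name, the exact-fraction arithmetic and the final rounding up to a whole number of tuples are assumptions of this example, not something the slides prescribe.

```python
import math
from fractions import Fraction

def kcomp_closed_form(k, n, ins, delete, upd=0):
    """Solve kcomp = k + (del_aff - ins_aff) for kcomp, with x_aff = x * kcomp / n."""
    # Updates are treated as a combination of a deletion and an insertion
    ins_total = ins + upd
    del_total = delete + upd
    denominator = n - del_total + ins_total
    if denominator <= 0:
        raise ValueError("deletions outnumber the base relation; no finite kcomp")
    return Fraction(k * n, denominator)

exact = kcomp_closed_form(k=2, n=4, ins=1, delete=2)
tuples_to_materialize = math.ceil(exact)   # assumed: round up to whole tuples
print(exact, tuples_to_materialize)        # 8/3 3
```

When insertions and deletions balance out (ins = del), the closed form reduces to kcomp = k, so no extra tuples are materialized before fine tuning.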


Time to build top-k view in microseconds

N      K      Gauss     Negative exponential   Zipf
100K   5      328000    348500                 242000
100K   10     333000    345667                 239667
100K   100    335500    343000                 239667
100K   1000   395333    406000                 299500
500K   5      1650667   1715500                1216333
500K   10     1650667   1713000                1208333
500K   100    1653167   1710500                1205667
500K   1000   1736667   1796167                1291833
1M     5      3298667   3429000                2427167
1M     10     3301333   3426667                2429667
1M     100    3304000   3439500                2422167
1M     1000   3403167   3520500                2606667
2M     5      6650667   6900500                5406333
2M     10     6653167   6900833                4909000
2M     100    6747167   6906000                4906500
2M     1000   6895500   7082833                4992167
