Tuning the top-k view update process Eftychia Baikousi Panos Vassiliadis University of Ioannina Dept. of Computer Science
Mar 30, 2015
Tuning the top-k view update process
Eftychia Baikousi, Panos Vassiliadis
University of Ioannina
Dept. of Computer Science
M-Pref 2007, Vienna 23/9/2007 2
Forecast
Problem: maintaining materialized top-k views when updates occur in the base relation
Extra difficulty: address the problem in the presence of high deletion rates
The crux of the approach is to materialize kcomp tuples, i.e., k plus an appropriate number of extra tuples, to sustain deletion rates that are drastically higher than average
The correct estimation & fine tuning of kcomp is not obvious
We use appropriate probabilistic methods
Contents
Motivation & Problem Definition
Overview of our Method
  Computation of rates affecting the view
  Computation of kcomp
  Fine tuning kcomp
Experiments
Conclusions

Top-k query
Given a relation R (id, x1, x2, x3) and a query Q = sum(x1, x2, x3)
Find the k tuples with the highest scores according to Q

R
id  x1   x2   x3   sum
a   0.3  0.6  0.7  1.6
b   0.2  0.3  0.4  0.9
c   0.4  0.5  0.9  1.8
d   0.7  0.6  0.1  1.4

Top-2 tuples: c (1.8) and a (1.6)
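
The top-k query above can be sketched in a few lines of Python. This is only an illustration on the slide's relation R. The helper name `top_k` and the use of `heapq.nlargest` are choices made for this sketch, not part of the presented method.

```python
import heapq

# Relation R from the slide: id -> (x1, x2, x3)
R = {
    "a": (0.3, 0.6, 0.7),
    "b": (0.2, 0.3, 0.4),
    "c": (0.4, 0.5, 0.9),
    "d": (0.7, 0.6, 0.1),
}

def top_k(relation, k, score=sum):
    """Return the k (id, score) pairs with the highest scores, best first.

    `top_k` is a hypothetical helper; `score` plays the role of the query Q.
    """
    scored = ((tid, score(attrs)) for tid, attrs in relation.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

top2_ids = [tid for tid, _ in top_k(R, 2)]
print(top2_ids)  # ['c', 'a']
```

`heapq.nlargest` avoids sorting the whole relation when k is much smaller than the number of tuples.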

Motivating Example
Shopping Center
Customers sign in with a palmtop (PDA)
Need for advertisements: special offers to customers
Given
  a relation Customers (id, name, age, salary, …)
  a materialized view V of the top-2 customers (younger and highly paid customers) according to the query Q: -age + 2*salary
Maintain the view V
  Customers sign in and out (e.g., train departures, working hours)

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name  Q
Bill  44
John  22

Problem definition
Given
  a base relation R (ID, X, Y) that originally contains N tuples,
  a materialized view V that contains the top-k tuples in the form (id, val), where val is the score according to a function Q(x, y) = ax + by and a, b are constant parameters,
  the update ratios ins, del and upd for insertions, deletions and updates, respectively, over the base relation R,
Compute kcomp
  that is of the form kcomp = k + Δk
  such that the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period T
[Figure: view V (id, Q) holding kcomp tuples, i.e., the top k tuples followed by Δk extra tuples]

Related Work
Ke Yi, Hai Yu, Jun Yang, Gangqiang Xia, Yuguo Chen: “Efficient Maintenance of Materialized Top-k Views”, ICDE ’03
Maintain a materialized top-k view when updates occur in the base table
Compute a kmax (instead of the necessary k)
  adjusted at runtime so that a refill query is rarely needed
  the problem is formulated through a random walk model
The method is theoretically guaranteed to work well only when
  the probabilities of insertions and deletions are equal, pins = pdel
  insertions are more frequent than deletions, pins > pdel
There is no quality-of-service guarantee when deletions are more probable than insertions, pins < pdel

Motivating Example
Customers sign in and out
  due to train departures, working hours
At certain time periods, deletions are more probable than insertions, pins < pdel
  The view may then contain fewer than k tuples

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name  Q
Bill  44
John  22
Overview of the method
1. Compute the ratios of the incoming source updates that affect the view
2. Compute kcomp
3. Fine tune kcomp

Empirical Cumulative Distribution Function (ECDF)
The ECDF is a non-parametric cumulative distribution function that adapts itself to the data
Definition
\[ F_n(x) = \frac{\text{number of values} \le x}{n} \]
F_n(x)
  represents the proportion of observations in a sample that are less than or equal to x
  assigns the probability 1/n to each of the n observations in the sample
  estimates the true population proportion F(x)
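
As a minimal sketch of the definition above, the ECDF of a sample can be built by sorting the sample once and counting, for each x, how many values are less than or equal to x. The helper name `make_ecdf` is hypothetical. The sample is the set of Q scores of the Customers example.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Build F_n for a non-empty sample: F_n(x) = (# values <= x) / n."""
    data = sorted(sample)
    n = len(data)

    def F_n(x):
        # bisect_right returns the number of sorted values that are <= x
        return bisect_right(data, x) / n

    return F_n

scores = [22, 8, 44, 17]   # Q values of the four customers
F = make_ecdf(scores)
print(F(17))       # 0.5  (8 and 17 are <= 17)
print(1 - F(22))   # 0.25 (only 44 is strictly above 22)
```

The complement 1 - F_n(x) estimates the probability that a new tuple scores above x, which is the quantity the method needs for the lowest score kept in the view.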

Computation of update rates that affect V
Given
  a relation Customers (id, name, age, salary, …) having N=4 tuples
  a materialized view V containing the top-2 tuples (k=2) of the form (id, Q), where Q = -age + 2*salary is the score
  update ratios ins=1, del=2, upd=0
Find ins_aff and del_aff (insertions & deletions affecting the view)

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name  Q
Bill  44
John  22

Computation of update rates that affect V
Given N=4, ins=1, del=2, upd=0
We compute the following:
Updates are treated as a combination of deletions and insertions
\[ ins = ins + upd = 1, \qquad del = del + upd = 2 \]
\[ p_{ins} = \frac{ins}{ins + del} = 1/3, \qquad p_{del} = \frac{del}{ins + del} = 2/3 \]
From the ECDF, the probability of a new tuple z affecting the view
\[ p(z > val_k) = kcomp / N = kcomp / 4 \]
\[ p_{ins\_aff} = p_{ins} * p(z > val_k) = kcomp / 12, \qquad p_{del\_aff} = p_{del} * p(z > val_k) = 2\,kcomp / 12 \]
Ratios affecting the view
\[ ins_{aff} = p_{ins\_aff} * (ins + del) = kcomp / 4 \]
\[ del_{aff} = p_{del\_aff} * (ins + del) = 2\,kcomp / 4 \]
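
The chain of formulas on this slide can be traced with a short Python sketch. The function name `affected_rates` and its parameter names are hypothetical. Exact fractions are used only to keep the arithmetic readable, and p(z > valk) is taken to be kcomp/N as stated on the slide.

```python
from fractions import Fraction

def affected_rates(n, ins, delete, upd, kcomp):
    """Return (ins_aff, del_aff) for a view of kcomp tuples over n base tuples."""
    # Updates count as one deletion plus one insertion.
    ins = ins + upd
    delete = delete + upd
    total = ins + delete

    p_ins = Fraction(ins, total)      # probability that an update is an insertion
    p_del = Fraction(delete, total)   # probability that an update is a deletion
    p_affects = Fraction(kcomp, n)    # p(z > val_k) = kcomp / N

    ins_aff = p_ins * p_affects * total
    del_aff = p_del * p_affects * total
    return ins_aff, del_aff

rates = affected_rates(n=4, ins=1, delete=2, upd=0, kcomp=3)
print(rates)  # (Fraction(3, 4), Fraction(3, 2))
```

Note that the factor (ins + del) cancels the denominators of p_ins and p_del, so the rates reduce to ins * kcomp / N and del * kcomp / N.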

Computation of kcomp
Compute kcomp
  that is of the form kcomp = k + Δk
  such that it guarantees that the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period of operation T
\[ kcomp = k + (del_{aff} - ins_{aff}) \]
\[ kcomp = 2 + 2\,kcomp/4 - kcomp/4 \;\Rightarrow\; kcomp = 8/3 \approx 2.67, \text{ rounded up to } kcomp = 3 \]
[Figure: view V (id, Q) holding kcomp tuples, i.e., the top k tuples followed by Δk extra tuples]

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name   Q
Bill   44
John   22
Peter  17
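
Since del_aff and ins_aff are themselves proportional to kcomp, the equation kcomp = k + (del_aff - ins_aff) can be solved for kcomp. The sketch below does this in closed form, kcomp = k*N / (N - del + ins), under the definitions of the previous slides. The function name `kcomp_untuned` is hypothetical, and rounding up to the next integer is an assumption of this sketch, made because the view holds whole tuples.

```python
import math
from fractions import Fraction

def kcomp_untuned(k, n, ins, delete, upd=0):
    """Solve kcomp = k + (del_aff - ins_aff) for kcomp.

    With ins_aff = ins * kcomp / n and del_aff = delete * kcomp / n this gives
    kcomp = k * n / (n - delete + ins). Returns (exact value, value rounded up).
    """
    ins += upd
    delete += upd
    denominator = n - delete + ins
    if denominator <= 0:
        raise ValueError("net deletions must stay below the table size")
    exact = Fraction(k * n, denominator)
    return exact, math.ceil(exact)   # assumption: round up to whole tuples

print(kcomp_untuned(k=2, n=4, ins=1, delete=2))  # (Fraction(8, 3), 3)
```

For the running example this yields three materialized tuples, which matches the view shown on the slide (Bill, John, Peter).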

Computation of kcomp

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17
5   Kate   25   30      35

V
name   Q
Bill   44
Kate   35
John   22
Peter  17

There is 1 insertion and there are 2 deletions affecting the view
  Tuple (5, Kate, 25, 30) is inserted, and tuples (3, Bill, 26, 35) and (4, Peter, 57, 37) are deleted from the view
  The view will contain 2 tuples, as initially needed
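
The scenario of this slide can be replayed with a small Python sketch. It is a deliberately naive illustration and not the maintenance algorithm of the talk: the helper names `q` and `top_rows` are hypothetical, and the view is kept as a plain sorted list.

```python
def q(age, salary):
    """Scoring function of the running example: Q = -age + 2*salary."""
    return -age + 2 * salary

customers = {
    1: ("John", 18, 20),
    2: ("Mary", 42, 25),
    3: ("Bill", 26, 35),
    4: ("Peter", 57, 37),
}

K, KCOMP = 2, 3   # k tuples are needed, kcomp tuples are materialized

def top_rows(rows, size):
    """Return the `size` best (id, name, score) rows, best first."""
    scored = [(cid, name, q(age, salary)) for cid, (name, age, salary) in rows.items()]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:size]

view = top_rows(customers, KCOMP)   # Bill, John, Peter

# One insertion that affects the view: Kate scores above the lowest view tuple.
customers[5] = ("Kate", 25, 30)
view.append((5, "Kate", q(25, 30)))
view.sort(key=lambda row: row[2], reverse=True)

# Two deletions that affect the view: Bill (id 3) and Peter (id 4).
for cid in (3, 4):
    del customers[cid]
    view = [row for row in view if row[0] != cid]

names = [name for _, name, _ in view]
print(names)  # ['Kate', 'John']
```

After the three updates the view still holds K = 2 tuples, so no refill query against the base relation is needed.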

Fine tune kcomp
kcomp is expressed as a formula depending on ins_aff and del_aff, the ratios of insertions and deletions affecting the view
The probability of a tuple affecting the view may vary according to its probabilistic properties
Fine tune kcomp by adding the appropriate variance

Fine tune kcomp
The probability of a new tuple z affecting the view is p(z > valk)
Bernoulli experiment with 2 possible events
  New tuple z affects the view, with probability p(z > valk)
  New tuple z does not affect the view, with probability 1 - p(z > valk)
ins insertions in the base relation correspond to ins Bernoulli experiments
The number of successes of ins Bernoulli experiments follows a Binomial distribution with variance:
\[ VAR_{ins} = ins * p(z > val_k) * (1 - p(z > val_k)) \]
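
The variance formula above is the standard Binomial variance n*p*(1-p). As a sanity check, the sketch below compares it with the variance computed directly from the Binomial probability mass function. The function names and the sample values (40 insertions, probability 0.25) are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_variance(n, p):
    """Variance of the number of successes in n Bernoulli trials: n * p * (1 - p)."""
    return n * p * (1 - p)

def variance_from_pmf(n, p):
    """Same variance, computed from the Binomial probability mass function."""
    pmf = [comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(n + 1)]
    mean = sum(s * prob for s, prob in enumerate(pmf))
    return sum((s - mean) ** 2 * prob for s, prob in enumerate(pmf))

ins, p_affects = 40, 0.25        # hypothetical: 40 insertions, p(z > val_k) = 0.25
var_ins = binomial_variance(ins, p_affects)
print(var_ins)                   # 7.5
```

The same function applies to the deletions by passing the number of deletions instead of `ins`.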

Fine tune kcomp
In the worst case, in order to guarantee that the view will contain at least k tuples with confidence 95%, kcomp is computed as:
\[ kcomp = k + (del_{aff} - ins_{aff}) + 2 * VAR_{ins} + 2 * VAR_{del} \]
VAR_ins denotes the variance of the insertions
VAR_del denotes the variance of the deletions

Experimental methodology
Test the following methods
  kcomp without fine tuning
  kcomp with fine tuning
  Yi et al. @ ICDE ’03
For the following measures
  Number of tuples (# tuples) deleted from the view that fall below the threshold value of k
  Memory overhead for kcomp with & without fine tuning, as the number of extra tuples that need to be kept in the view
  Number of extra tuples for kcomp with & without fine tuning, compared to the number of extra tuples of the related work

Experimental methodology
Synthetic data sets:
  Gaussian distribution with mean μ=50 and variance σ=10
  Negative exponential distribution with parameters a=1.0 for X and a=2.0 for Y
  Zipf distribution with parameter a=2.1
Experimental parameters:
  Size of source table R (tuples), |R|: 1x10^5, 5x10^5, 1x10^6, 2x10^6
  Size of mat. view (tuples), k: 5, 10, 100, 1000
  Size of update stream (pct over |R|), λ: 1/1000, 1/100
  Deletion rate over insertion rate (ratio), D/I: 1.0, 1.5, 2.0
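
To make the parameters concrete, the sketch below turns a table size, an update-stream percentage and a D/I ratio into numbers of insertions and deletions. It assumes that the stream consists only of insertions and deletions and that it is split exactly according to D/I. The slides do not spell this out, so both the split and the function name `update_stream` are assumptions of this sketch.

```python
def update_stream(table_size, lam, d_over_i):
    """Split an update stream of lam * table_size operations by the D/I ratio.

    Assumption: no in-place updates, so total = insertions + deletions
    and deletions / insertions = d_over_i.
    """
    total = round(table_size * lam)
    insertions = round(total / (1 + d_over_i))
    deletions = total - insertions
    return insertions, deletions

print(update_stream(1_000_000, 0.01, 1.5))   # (4000, 6000)
```

With |R| = 1x10^6, a 1% stream and D/I = 1.5, this gives 10,000 operations, 6,000 of which are deletions.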

Max & average misses
kcomp without fine tuning, Gaussian distribution
[Figure: max and avg #misses as a function of R (D/I=1.5, k=1000), for λ=0.1% and λ=1%]
[Figure: #max misses as a function of k (5, 10, 100, 1000) and D/I (1.0, 1.5, 2.0), for R=1M, λ=1%]

Memory overhead
Number of extra tuples as a function of R and D/I
[Figure: #tuples as a function of R (k=1000, λ=1%), for D/I=1, D/I=1.5 and D/I=2; series: K, KCOMP, KCOMP tuning]

Comparison with related work
Number of extra tuples of kcomp with fine tuning compared with kmax of the related work, as a function of R
[Figure: #tuples as a function of R (100000, 500000, 1000000, 2000000), for k=100, λ=1%; series: KCOMP tuning, [10]]

Comparison with related work
Number of extra tuples of kcomp with fine tuning compared with kmax of the related work, as a function of k
[Figure: #tuples as a function of k (5, 10, 100, 1000), for R=2M, λ=1%; series: KCOMP tuning, [10]]

Conclusions
We handled the problem of maintaining materialized top-k views in the presence of high deletion rates
The method comprises the following steps:
  a computation of the rate of updates that actually affects the materialized view,
  a computation of the necessary extension to k in order to handle the augmented number of deletions that occur, and
  a fine tuning part that adjusts this value to take the fluctuation of its statistical properties into consideration
Thank you for your attention!
… many thanks to our hosts!
This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).
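
Auxiliary slides
Formulas for kcomp
\[ ins = ins + upd, \qquad del = del + upd \]
\[ p_{ins} = \frac{ins}{ins + del}, \qquad p_{del} = \frac{del}{ins + del} \]
\[ p(z > val_k) = \frac{kcomp}{N} \]
\[ p_{ins\_aff} = p_{ins} * p(z > val_k), \qquad p_{del\_aff} = p_{del} * p(z > val_k) \]
\[ ins_{aff} = p_{ins\_aff} * (ins + del), \qquad del_{aff} = p_{del\_aff} * (ins + del) \]
\[ kcomp = k + (del_{aff} - ins_{aff}) \;\Rightarrow\; kcomp = \frac{k * N}{N - del + ins} \]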

Time to build top-k view in microseconds
N K Gauss Negative-exponential Zipf
100K 5 328000 348500 242000
100K 10 333000 345667 239667
100K 100 335500 343000 239667
100K 1000 395333 406000 299500
500K 5 1650667 1715500 1216333
500K 10 1650667 1713000 1208333
500K 100 1653167 1710500 1205667
500K 1000 1736667 1796167 1291833
1M 5 3298667 3429000 2427167
1M 10 3301333 3426667 2429667
1M 100 3304000 3439500 2422167
1M 1000 3403167 3520500 2606667
2M 5 6650667 6900500 5406333
2M 10 6653167 6900833 4909000
2M 100 6747167 6906000 4906500
2M 1000 6895500 7082833 4992167