Online Bigtable merge compaction
(work in progress)

Claire Mathieu, CNRS Paris
Carl Staelin, Google Haifa
Neal E. Young, UC Riverside¹
Arman Yousefi, UCLA

[Slide annotations: "instigator", "me", "my student"; "you are here"; "this is now".]

Northeastern University, September 17, 2015
¹ funded by faculty re$earch award
BIGTABLE — data storage at Google
Google Maps, Search/Crawl, Gmail, ... use BIGTABLE to store data.
▸ 24,500 Bigtable servers
▸ 1.2 million requests per second
▸ 16 GB/s of outgoing RPC traffic
▸ over a petabyte of data just for Google Crawl and Analytics
▸ these figures are from 2006

Similar to other "NoSQL" databases:
Accumulo, AsterixDB, Cassandra, HBase, Hypertable, Spanner, ...
Used by Adobe, Ebay, Facebook, GitHub, Meetup, Netflix, Twitter, ...
"Log-structured merge tree" architecture, for high-volume, highly reliable, distributed, real-time data storage.
BIGTABLE — implements dictionary data type
operations supported by a Bigtable instance:
▸ write(key, value)
▸ read(key), which returns the most recent value written for key
▸ ... there's more, but not today ...
BIGTABLE — writes and flushes
write(key, value):
1. Store key/value pair in cache (e.g. hash table in RAM).
Environment periodically forces flush of cache to a new immutable disk file.

Example
write(1, a); write(2, b); write(3, c); write(4, d); flush();
write(5, e); write(6, f); write(7, g); flush();
write(8, h); write(9, i); flush();

cache: –empty–
file sequence: [(1, a) (2, b) (3, c) (4, d)] from 1st flush, [(5, e) (6, f) (7, g)] from 2nd flush, [(8, h) (9, i)] from 3rd flush

Environment forces flushes at arbitrary times.
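The write/flush path above can be sketched as a toy model (class and attribute names are mine for illustration, not Bigtable's actual implementation):

```python
class ToyBigtable:
    """Toy model of the write/flush path: a RAM cache plus a
    sequence of immutable files (a sketch, not real Bigtable)."""

    def __init__(self):
        self.cache = {}   # in-RAM write buffer (hash table)
        self.files = []   # immutable on-disk files, oldest first

    def write(self, key, value):
        # 1. Store the key/value pair in the cache.
        self.cache[key] = value

    def flush(self):
        # Environment-forced: dump the cache to a new immutable file.
        if self.cache:
            self.files.append(dict(self.cache))
            self.cache = {}
```

Running the slide's example (writes 1–4, flush, writes 5–7, flush) leaves two files and an empty cache.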
BIGTABLE — reads and compactions
cache: –empty–
file sequence: [(1, a) (2, b) (3, c) (4, d)] from 1st flush, [(5, e) (6, f) (7, g) (8, h) (9, i)] merge of 2nd and 3rd

read(key):
1. Check cache for key.
2. If not found, check files (most recent first). cost = O(#files)

compaction(): asynchronous background process, to reduce read costs.
Periodically select files to merge. cost = O(SIZE of merged files)!!

goals: (i) keep read costs low; (ii) keep compaction costs low
constraint: each merge must merge a contiguous subsequence of files
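The read path and a contiguous merge can be sketched as follows (a minimal model of my own; I assume "newest value wins" when merging, and count merge cost as entries read and rewritten):

```python
def read(cache, files, key):
    """Check cache, then files newest-first.
    Returns (value, number of files scanned)."""
    if key in cache:
        return cache[key], 0
    for scanned, f in enumerate(reversed(files), start=1):
        if key in f:
            return f[key], scanned
    return None, len(files)

def merge(files, i, j):
    """Merge the contiguous run files[i:j] into one file.
    Cost = total number of entries read and rewritten."""
    merged = {}
    for f in files[i:j]:          # oldest first, so newer values overwrite
        merged.update(f)
    cost = sum(len(f) for f in files[i:j])
    return files[:i] + [merged] + files[j:], cost
```

On the slide's example, reading key 9 scans one file, reading key 1 scans three; merging the 2nd and 3rd files costs 5 (their combined size).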
Bigtable Merge Compaction (bmc) — formal definition
given: Sequence x_1, x_2, ..., x_n, where x_t is the size of the file resulting from flush t.
Integer k > 0, tuned to workload; typically 3–40.
choose: Compactions. Ensure the number of files never exceeds k.
objective: Minimize total compaction cost.

If k = ∞, the problem is easy — never merge. Total compaction cost = 0.

If k = 1, the problem is easy — must merge everything each time:
after flush 2: compaction cost x_1 + x_2
after flush 3: compaction cost x_1 + x_2 + x_3
...
after flush n: compaction cost x_1 + ... + x_n
Total compaction cost = Σ_{i=2}^{n} (x_1 + x_2 + ... + x_i) ≈ Σ_{i=1}^{n} (n − i + 1) x_i.
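The k = 1 cost can be computed directly and checked against the Σ (n − i + 1)·x_i approximation. The two differ only in the weight given to x_1 (flush 1 triggers no merge, so x_1 is counted n − 1 times exactly but n times in the approximation):

```python
def total_cost_k1(xs):
    """With k = 1, every flush after the first forces a merge of
    everything; that merge costs the current prefix sum."""
    total, prefix = 0, 0
    for t, x in enumerate(xs, start=1):
        prefix += x
        if t >= 2:
            total += prefix
    return total

xs = [3, 1, 4, 1, 5]
n = len(xs)
exact = total_cost_k1(xs)
# sum_i (n - i + 1) x_i with 1-based i, i.e. weights n, n-1, ..., 1:
approx = sum((n - i) * x for i, x in enumerate(xs))
# exact differs from approx by exactly x_1:
# approx - xs[0] == exact
```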
Google’s default compaction algorithm:
Merge the minimal suffix so as to maintain (i) #files ≤ k and (ii) each file’s size exceeds the total size of the files to its right.

Example: k = 2, on uniform input x = 1, 1, 1, ...:
[Figure: steps 1–5 of the resulting merge schedule.]

Total compaction cost = Θ(n²). [For general k, the cost is Θ(n² / 3^(k−1)).]
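The default rule can be simulated directly (this is my reading of the slide's rule; "merge the minimal suffix" is implemented by trying suffix lengths in increasing order until both invariants hold):

```python
def default_step(files, x, k):
    """Append the new flush x, then merge the minimal suffix that
    restores (i) #files <= k and (ii) each file larger than the
    total size of the files to its right. Returns (files, merge cost)."""
    def ok(fs):
        if len(fs) > k:
            return False
        right = 0
        for s in reversed(fs):
            if right and s <= right:   # rightmost file exempt (right == 0)
                return False
            right += s
        return True

    files = files + [x]
    for m in range(1, len(files) + 1):     # m = 1 means "no merge"
        cand = files[:-m] + [sum(files[-m:])] if m > 1 else list(files)
        if ok(cand):
            return cand, (sum(files[-m:]) if m > 1 else 0)

# k = 2, uniform input 1, 1, 1, ...
files, total = [], 0
for _ in range(8):
    files, c = default_step(files, 1, 2)
    total += c
```

On uniform input the per-step merge cost grows linearly, which is what drives the Θ(n²) total.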
OPTIMAL solution for k = 2, uniform x = 1, 1, 1, ...:
[Figure: steps 1–4 of the schedule.]
"big" merges: O(√n) of them, of size O(n)
"small" merges: O(n) of them, of size O(√n)

Total compaction cost = O(n^(3/2)). [For general k, OPT's cost is Θ(k · n^(1 + 1/k)).]
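The O(n^(3/2)) bound can be checked with a simple fixed-period schedule for k = 2 on uniform input: one big merge every ~√n flushes, small merges in between (a sketch of my own; the actual optimal schedule is computed differently):

```python
from math import isqrt

def sqrt_schedule_cost(n):
    """k = 2, uniform x_i = 1: do a big merge every ~sqrt(n) flushes,
    small merges otherwise. Returns total compaction cost."""
    period = max(2, isqrt(n))
    big = small = total = 0          # sizes of the two piles
    for t in range(1, n + 1):        # flush t: a new file of size 1
        if big == 0:
            big = 1                  # first file: no merge needed
        elif small == 0:
            small = 1                # second pile free: no merge needed
        elif t % period == 0:        # big merge: everything into one pile
            big = big + small + 1
            small = 0
            total += big
        else:                        # small merge: fold new file into small pile
            small += 1
            total += small
    return total
```

With ~√n big merges of size O(n) and ~n small merges of size O(√n), the total stays well below the default policy's quadratic cost.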
Definition: c-competitive online algorithm
A compaction algorithm is c-competitive if, on any input (k, x), its solution costs at most c times the optimal cost.
A compaction algorithm is online if its choice of merge after flush t depends only on k and x_1, x_2, ..., x_t (the files flushed so far).

▸ Default's cost can be n times OPT's cost (for any k).
▸ So default is no better than n-competitive.
→ May have high compaction cost even for "easy" inputs.

Theorem 1. There is a k-competitive online algorithm for bmc. (today)
Theorem 2. No deterministic online algorithm is less than k-competitive.
Idea behind 2-competitive online algorithm (for k = 2)...
Q: At each step, do "big" merge or small merge?
A: Do big merge when cost C of previous big merge ≈ total cost of small merges since then.

[Timeline: previous big merge at time s, cost C; current time t. The algorithm's cost during the interval is 2C.]

Why 2-competitive? Focus on a time interval between two big merges.
case 1 (during this interval, Opt does a big merge): Opt's cost for that big merge is at least C.
case 2 (during this interval, Opt does no big merge): Opt's cost for small merges during the interval is at least C.
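The rent-or-buy rule for k = 2 can be sketched as follows (my own rendering of the slide's idea; the exact threshold and initialization here are illustrative guesses, not the paper's brb):

```python
def brb_k2(xs):
    """k = 2 rent-or-buy sketch: do a big merge once the small-merge
    cost accumulated since the last big merge reaches that merge's cost C."""
    files = []    # at most two piles: [big, small]
    C = 0         # cost of the previous big merge ("buy" price)
    acc = 0       # small-merge cost paid since then ("rent")
    total = 0
    for x in xs:
        files.append(x)
        if len(files) <= 2:
            continue                   # flush fits without merging
        small = files[1] + files[2]
        if acc + small >= C:           # rent caught up with buy: big merge
            big = sum(files)
            files, C, acc = [big], big, 0
            total += big
        else:                          # keep renting: small merge
            files = [files[0], small]
            acc += small
            total += small
    return total, files
```

Each interval between big merges then costs about 2C while Opt pays at least C in it, which is the source of the factor 2.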
Idea behind k-competitive online algorithm for general k
Idea: Do big merge, then recurse with k − 1.
Q: When to do next big merge?
A: When cost of previous big merge ≈ (cost of the recursion)/(k − 1).
Recurse with k − 1 to handle this part.
"Balanced rent-or-buy algorithm (brb)"
Recap of analyses in worst-case model
Bigtable default is no better than n-competitive...
Theorem 1. Brb is a k-competitive online algorithm for bmc. (today)
Theorem 2. No deterministic online algorithm is less than k-competitive.
What about "typical" inputs?
Preliminary benchmarks (one example with k = 5)
[Plots: cost per step vs. n, comparing Default, BRB, and Optimal (left), and Default vs. BRB at larger n (right).]
x_t's are i.i.d. from a log-normal distribution.

Conjectures
1. Brb and Opt cost per time step ~ x · k · n^(1/k) / e.
2. Default cost per time step ~ x · n / (2 · 3^(k−1)).
Lots of work in progress
theoretical:
▸ average-case analyses: absolute and relative costs on i.i.d. inputs
▸ randomized online algorithms (o(k)-competitive?)
▸ optimal compaction schedules ≡ optimal binary search trees
practical:
▸ realistic testing... on AsterixDB, then at Google
problem variants:
▸ allow expiration/deletion of key/value pairs (done)
▸ allowing k to vary — bmc w/ read costs... (open!)

Working paper available on arxiv.org
(Search web for "bigtable merge compaction".)
Bmc with read costs (geometric interpretation)
given: Staircase step-lengths and step-heights (x_1, y_1), (x_2, y_2), ....
do: Partition the region below the staircase into axis-parallel rectangles.
objective: Minimize the sum of the widths and heights of the rectangles.
[Figure: a staircase with steps (x_1, y_1), ..., (x_7, y_7).]

open problem: is there an O(1)-competitive online algorithm?
Thank you
A geometric interpretation of bmc
given: Uneven staircase with step-lengths x_1, x_2, ..., x_n. Integer k > 0.
do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.
objective: Minimize the sum of the widths of the rectangles.
[Figures: an uneven staircase with 10 steps, k = 2; a valid solution; and a cheaper partition that is not valid for k = 2.]