Database Replication in Tashkent
CSEP 545 Transaction Processing
Sameh Elnikety
Replication for Performance
[Figure: a single standalone DBMS (expensive, limited scalability) versus a cluster of replicated machines]
DB Replication is Challenging
• Single database system
  – Large, persistent state
  – Transactions
  – Complex software
• Replication challenges
  – Maintain consistency
  – Middleware replication
Background
[Figure: Replica 1 is a standalone DBMS]
Background
[Figure: a load balancer in front of Replicas 1, 2, and 3]
Read Tx
[Figure: the load balancer sends read transaction T to a single replica]
• A read tx does not change DB state, so executing it at one replica suffices
Update Tx 1/2
[Figure: the load balancer sends update transaction T to one replica; T's writeset (ws) is then propagated to the other replicas]
• An update tx changes DB state
• Apply (or commit) T everywhere: every replica receives T's writeset
• Example: T1 { set x = 1 }
Update Tx 2/2: Ordering
[Figure: two update transactions execute at different replicas and their writesets cross in the network]
• Commit updates in the same order at every replica
• Example: T1 { set x = 1 }, T2 { set x = 7 }; all replicas must agree whether x ends as 1 or as 7
Sub-linear Scalability Wall
[Figure: adding Replica 4 to the cluster; every replica must still apply every writeset in order, so capacity grows sub-linearly]
This Talk
• General scaling techniques
  – Address fundamental bottlenecks
  – Synergistic, implemented in middleware
  – Evaluated experimentally
Super-linear Scalability
[Chart: throughput (TPS) for Single (1x), Base (7x), United (12x), MALB (25x), UF (37x)]
Big Picture: Let's Oversimplify
• Standalone DBMS: reading (R) + update execution (U) + logging
• Replica 1/N (traditional): reading (R) + update execution (U) + applying (N-1) writesets (ws) + logging; the system as a whole serves N.R reads and N.U updates
• Replica 1/N (optimized): reading (R*) + update execution (U*) + applying (N-1) writesets (ws*) + logging, with each term reduced by one of this talk's techniques: MALB, Update Filtering, and Uniting O & D
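A rough way to see the wall in this oversimplified model (a hedged back-of-envelope, not the talk's numbers: the capacity C, update work U, writeset-apply cost ws, and logging cost L below are all made up):

```python
# Each replica has capacity C and spends U executing its update txs,
# (N-1)*ws applying remote writesets, and L on logging; whatever is left
# over serves reads. All numbers are illustrative.

def system_throughput(N, C=100.0, U=20.0, ws=5.0, L=10.0):
    R = C - U - (N - 1) * ws - L   # read work left per replica
    return N * R, N * U            # system-wide reads and updates

for N in (1, 4, 8, 12):
    reads, updates = system_throughput(N)
    print(f"N={N:2d}  reads={reads:6.0f}  updates={updates:4.0f}")
# N=1: 70 reads; N=4: 220; N=8: 280; N=12: 180.
# Read capacity grows sub-linearly and then falls: the (N-1)*ws term eats
# each replica. Shrinking ws (update filtering), R and U (MALB's in-memory
# execution), and L (uniting durability with ordering) attacks exactly
# those terms.
```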
Key Points
1. Commit updates in order
   – Traditional: serial synchronous disk writes
   – Tashkent: unite ordering and durability
2. Load balancing
   – Traditional: optimize for equal load, causing memory contention
   – MALB: optimize for in-memory execution
3. Update propagation
   – Traditional: propagate updates everywhere
   – Update filtering: propagate only to where needed
Roadmap
[Figure: load balancer and Replicas 1, 2, 3]
1. Ordering: commit updates in order
2. Load balancing
3. Update propagation
Key Idea
• Traditionally: commit ordering and durability are separated
• Key idea: unite commit ordering and durability
All Replicas Must Agree
• All replicas agree on
  – which update txs commit
  – their commit order
• Total order
  – Determined by middleware
  – Followed by each replica
[Figure: Tx A and Tx B reach durability at Replicas 1, 2, and 3]
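A minimal sketch of that division of labor (queue-based delivery and all names are invented for illustration, not Tashkent's actual code): the middleware assigns each committing update a position in one total order, and every replica applies writesets strictly in that order.

```python
import queue

class Sequencer:
    """Middleware side: fix one total order for committing update txs."""
    def __init__(self, replica_queues):
        self.seq = 0
        self.queues = replica_queues          # one delivery queue per replica

    def commit(self, tx_id, writeset):
        pos = self.seq                        # position in the total order
        self.seq += 1
        for q in self.queues:                 # same order sent everywhere
            q.put((pos, tx_id, writeset))
        return pos

def replica_apply_loop(q, db_state):
    """Replica side: follow the middleware's order exactly."""
    expected = 0
    while True:
        pos, tx_id, ws = q.get()
        assert pos == expected                # commit strictly in order
        db_state.update(ws)                   # ws is e.g. {"x": 1}
        expected += 1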
Order Outside DBMS
[Figure: the middleware orders Tx A before Tx B and ships the order (A, B) to Replicas 1, 2, and 3; each replica commits A and then B, handling durability locally]
Enforce External Commit Order
[Figure: at Replica 3, a proxy sits in front of the DBMS's SQL interface; Tasks A and B execute transactions A and B inside the DBMS, which handles durability]
• The external order says A before B
• Cannot commit A & B concurrently: the DBMS might make them durable in the order B, A
Enforce Order = Serial Commit
[Figure: the proxy commits A, waits for it to complete, and only then commits B]
Commit Serialization is Slow
[Timeline: the proxy issues Commit A; the DBMS performs a synchronous disk write for durability and returns Ack A; only then Commit B, then Commit C, each paying its own disk write while the CPU sits idle]
• Problem: durability & ordering are separated, forcing serial disk writes
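To make the cost concrete, here is a hedged sketch of the traditional path (the log file and writesets are stand-ins; a real DBMS commit is more involved): each ordered commit pays its own fsync, and the next commit cannot start until the previous one is acknowledged.

```python
import os

def serial_ordered_commits(log_path, writesets):
    """Commit writesets in the middleware's order, one fsync per tx."""
    acks = []
    with open(log_path, "ab") as log:
        for seq, ws in enumerate(writesets):   # total order: A, B, C, ...
            log.write(len(ws).to_bytes(4, "big") + ws)
            log.flush()
            os.fsync(log.fileno())             # synchronous disk write
            acks.append(seq)                   # ack only after durability
    return acks

# Three ordered commits cost three disk round-trips, back to back.
serial_ordered_commits("/tmp/serial.log", [b"A", b"B", b"C"])
```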
Unite D. & O. in Middleware
[Figure: durability is turned OFF at the DBMS; the middleware handles durability alongside ordering, so Commits A, B, and C are acked from one batched disk write]
• Solution: move durability to the middleware; durability & ordering united in middleware enable group commit
Implementation: Uniting D & O in MW
• Middleware logs tx effects
  – Durability of update txs guaranteed in middleware
  – Turn durability off at the database
• Middleware performs durability & ordering
  – United → group commit → fast
• Database commits update txs serially
  – Commit = quick main-memory operation
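A minimal group-commit sketch under these assumptions (one log file, a background flusher thread; the class and field names are invented for illustration): commits are appended in the total order, and a single fsync makes a whole batch durable.

```python
import os
import threading

class GroupCommitLog:
    """Durability and ordering united in middleware (illustrative only)."""

    def __init__(self, path):
        self._f = open(path, "ab")
        self._cv = threading.Condition()
        self._next_seq = 0       # total commit order, assigned here
        self._durable_seq = -1   # highest sequence already on disk
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, writeset: bytes) -> int:
        """Append in order, block until durable, return the order position."""
        with self._cv:
            seq = self._next_seq
            self._next_seq += 1
            self._f.write(len(writeset).to_bytes(4, "big") + writeset)
            self._cv.notify_all()
            while self._durable_seq < seq:   # wait for a covering fsync
                self._cv.wait()
            return seq

    def _flusher(self):
        while True:
            with self._cv:
                while self._next_seq - 1 <= self._durable_seq:
                    self._cv.wait()          # nothing new to flush
                last = self._next_seq - 1
                self._f.flush()
            os.fsync(self._f.fileno())       # one disk write for the batch
            with self._cv:
                self._durable_seq = max(self._durable_seq, last)
                self._cv.notify_all()
```

While one fsync is in flight, later commits queue up and share the next one, which is what turns N ordered synchronous writes into a few batched writes. The DBMS itself then commits with durability off: a quick main-memory operation.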
Uniting Improves Throughput
• Metric: throughput (TPS)
• Workload: TPC-W Ordering (50% updates)
• System: Linux cluster, PostgreSQL, 16 replicas, serializable execution
[Chart: Single (1x), Base (7x), United (12x)]
Roadmap
1. Ordering: commit updates in order (done)
2. Load balancing
3. Update propagation
Key Idea
[Figure: load balancer in front of Replica 1 and Replica 2, each with memory and disk]
• Traditional: equal load on replicas
• MALB (Memory-Aware Load Balancing): optimize for in-memory execution
How Does MALB Work?
• Database holds tables 1, 2, and 3; a replica's memory fits only two of them
• Workload: tx A reads tables 1 and 2; tx B reads tables 2 and 3
Read Data From Disk
• Least-loaded balancing sends the stream A, B, A, B to both replicas
• Each replica then touches tables 1, 2, and 3: the working set exceeds memory, so both replicas are slow, reading data from disk
Data Fits in Memory
• MALB sends A, A, A, A to Replica 1 (tables 1 and 2 stay in memory) and B, B, B, B to Replica 2 (tables 2 and 3 stay in memory)
• Both replicas run fast, entirely from memory
• Open questions: where does memory info come from, and what about many tx types and replicas? (addressed next; see the routing sketch below)
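The routing decision itself is tiny once the grouping exists. A hedged sketch (the table and names are invented; real MALB also balances load across groups):

```python
# Which replica hosts each tx type's group (output of MALB's grouping).
GROUP_REPLICA = {"A": "replica1", "B": "replica2"}

def route(tx_type, fallback="replica1"):
    """Send a tx to its group's replica so working sets stay in memory."""
    return GROUP_REPLICA.get(tx_type, fallback)

for tx in ["A", "B", "A", "B"]:
    print(tx, "->", route(tx))   # A->replica1, B->replica2, A->..., B->...
```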
Estimate Tx Memory Needs
• Exploit the tx execution plan
  – Which tables & indices are accessed
  – Their access pattern: linear scan vs. direct access
• Metadata from database
  – Sizes of tables and indices
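As a hedged sketch of the idea (the plan representation, the 10% direct-access fraction, and all sizes are assumptions, not Tashkent's actual rules):

```python
# Catalog metadata: object -> size in MB (hypothetical numbers).
SIZES_MB = {"orders": 400, "orders_pk": 60, "items": 900, "items_pk": 120}

def tx_memory_needs(plan):
    """plan: (object, access_pattern) pairs from an EXPLAIN-style plan.

    A linear scan touches the whole table or index; direct (index)
    access is assumed to touch only 10% of it.
    """
    needed = 0.0
    for obj, pattern in plan:
        size = SIZES_MB[obj]
        needed += size if pattern == "linear_scan" else 0.1 * size
    return needed

plan_a = [("orders_pk", "direct"), ("orders", "direct")]        # tx type A
plan_b = [("items", "linear_scan"), ("items_pk", "direct")]     # tx type B
print(tx_memory_needs(plan_a), tx_memory_needs(plan_b))         # 46.0 912.0
```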
Grouping Transactions
• Objective: construct tx groups that fit together in memory
• Bin packing (see the sketch below)
  – Item: a tx's memory needs
  – Bin: the memory of a replica
  – Heuristic: Best Fit Decreasing
• Allocate replicas to tx groups, adjusting for group loads
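A hedged sketch of Best Fit Decreasing over the memory estimates (the bin capacity and needs dictionary are invented; real MALB also weighs group load):

```python
def best_fit_decreasing(tx_needs, capacity):
    """tx_needs: {tx_type: memory_mb}; returns groups that fit in memory."""
    bins = []                                  # each: {"free": mb, "txs": []}
    for tx, need in sorted(tx_needs.items(), key=lambda kv: -kv[1]):
        fitting = [b for b in bins if b["free"] >= need]
        if fitting:                            # best fit: the tightest bin
            best = min(fitting, key=lambda b: b["free"])
        else:                                  # open a new bin (new group)
            best = {"free": capacity, "txs": []}
            bins.append(best)
        best["free"] -= need
        best["txs"].append(tx)
    return bins

needs = {"A": 46, "B": 912, "C": 300, "D": 500}   # MB, from the estimator
for g in best_fit_decreasing(needs, capacity=1024):
    print(g["txs"], 1024 - g["free"], "MB")       # ['B','A'] 958, ['D','C'] 800
```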
MALB in Action
• Input: tx types A, B, C, D, E, F and their estimated memory needs
• MALB constructs groups whose data fits together in memory: {A}, {B, C}, {D, E, F}
• Each group is assigned its own replica, so every replica serves its group from memory
MALB Summary
• Objective: optimize for in-memory execution
• Method
  – Estimate tx memory needs
  – Construct tx groups
  – Allocate replicas to tx groups
Experimental Evaluation
• Implementation: no change in consistency; still middleware
• Compare
  – United: efficient baseline system
  – MALB: exploits working-set information
• Same environment: Linux cluster running PostgreSQL; workload TPC-W Ordering (50% update txs)
MALB Doubles Throughput
• TPC-W Ordering, 16 replicas
[Chart: TPS for Single (1x), Base (7x), United (12x), MALB (25x); MALB beats United by 105%]
[Chart: read I/O normalized to United: MALB issues far less read I/O than United]
Big Gains with MALB
Throughput gain of MALB over United, varying memory size and database size (a 3x3 grid, big to small along each axis):

   4%     0%    29%
   48%   105%   45%
   182%   75%   12%

• Gains are small at the extremes: where the workload already runs from memory, and where it must run from disk regardless
• Gains are largest in between
Roadmap
1. Ordering: commit updates in order (done)
2. Load balancing (done)
3. Update propagation
Key Idea
• Traditional: propagate updates everywhere
• Update filtering: propagate updates to where they are needed
Update Filtering Example
[Figure: MALB+UF with Replica 1 (Group A: tables 1 and 2 in memory) and Replica 2 (Group B: tables 2 and 3 in memory)]
• An update to table 1 is applied at Replica 1 only; Replica 2 stops maintaining table 1, which its txs never read
• An update to table 3 is applied at Replica 2 only
• Each replica applies only the writesets that touch the tables its tx groups use
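A hedged sketch of the forwarding rule (the replica-to-table assignments come from MALB's grouping; all names are invented):

```python
# Tables each replica must keep current, derived from its tx groups.
REPLICA_TABLES = {
    "replica1": {"table1", "table2"},   # serves Group A
    "replica2": {"table2", "table3"},   # serves Group B
}

def targets(writeset_tables):
    """Replicas that must apply a writeset touching the given tables."""
    return [r for r, served in REPLICA_TABLES.items()
            if served & writeset_tables]

print(targets({"table1"}))   # ['replica1']
print(targets({"table3"}))   # ['replica2']
print(targets({"table2"}))   # both: table 2 is in both working sets
```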
Update Filtering in Action
[Figure: an update to the red table flows only to the replicas that serve it; an update to the green table flows only to its replicas]
MALB+UF Triples Throughput
• TPC-W Ordering, 16 replicas
[Chart: TPS for Single (1x), Base (7x), United (12x), MALB (25x), UF (37x); UF beats MALB by 49%]
[Chart: propagated updates drop from roughly 15 with MALB to roughly 7 with UF]
Filtering Opportunities
[Chart: MALB+UF / MALB ratio: 1.49 under the 50%-update Ordering mix, 1.02 under the 5%-update Browsing mix; filtering pays off when there are many updates to filter]
Conclusions
1. Commit updates in order
   – Traditional: serial synchronous disk writes
   – Tashkent: unite ordering and durability
2. Load balancing
   – Traditional: optimize for equal load, causing memory contention
   – MALB: optimize for in-memory execution
3. Update propagation
   – Traditional: propagate updates everywhere
   – Update filtering: propagate only to where needed