Spanner Google’s scalable, multi-version, globally-distributed and synchronously-replicated database Presented By Alon Adler – Based on OSDI ’12 (USENIX Association)

Dec 17, 2015

Page 1:

Spanner: Google’s scalable, multi-version, globally-distributed and synchronously-replicated database

Presented By Alon Adler – Based on OSDI ’12 (USENIX Association)

Page 2:

Why was Spanner born?

• Google had BigTable and MegaStore.

• Why not BigTable?

• Can’t handle complex, evolving schemas.

• Only eventual consistency across datacenters.

• Transactional scope limited to single row.

• Why not MegaStore?

• Low performance (relatively poor write throughput).

Page 3:

So, What is Spanner?

• At a high level of abstraction, it is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world.

• Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows.

• Spanner maintains multiple replicas of each piece of data.

• Replication is used for global availability.

• Applications can use Spanner for high availability, even in the face of wide-area natural disasters.

Page 4:

So, What is Spanner?

• Spanner supports general-purpose transactions (ACID).

• Atomicity, Consistency, Isolation, Durability. Sometimes the eventual consistency of BigTable isn’t good enough.

• Spanner provides a SQL-based query language.

• This gives applications the ability to handle complex schemas.

Page 5:

A Spanner deployment is called a “universe”.

universemaster – shows the status of all zones

placement driver – transfers data between zones

zonemaster – allocates data to spanservers

location proxies – used by clients to locate the spanservers that hold the data they need

Between one hundred and several thousand spanservers per zone

Page 6:

Spanserver Software Stack

Tables are sharded by rows into tablets (similar to BigTable’s tablets).

A tablet maps (key:string, timestamp:int64) -> string.

Each spanserver is responsible for 100-1000 tablets.

A Paxos state machine on top of each tablet supports synchronous replication.
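
The tablet mapping above can be pictured as a multi-version map. The following is only a minimal in-memory sketch under stated assumptions: the class name `Tablet` and its methods are hypothetical and are not Spanner's actual storage format. It illustrates how a timestamped key lets a read ask for the value as of a given time.

```python
import bisect
from collections import defaultdict


class Tablet:
    """Toy multi-version map: (key, timestamp) -> value. Hypothetical, for illustration."""

    def __init__(self):
        # Per key: a sorted list of timestamps and a parallel list of values.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, timestamp, value):
        # Insert the new version so that timestamps stay sorted.
        i = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, or None."""
        ts = self._timestamps.get(key, [])
        i = bisect.bisect_right(ts, timestamp)
        return self._values[key][i - 1] if i > 0 else None
```

Because old versions are kept, a read at an earlier timestamp still returns the value that was current at that time.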

Page 7:

Paxos State Machine

Paxos state machines are used to implement a consistently replicated bag of mappings.

The key-value mapping state of each replica is stored in its corresponding tablet.

Writes must initiate the Paxos protocol at the leader.

The set of replicas is collectively a Paxos Group.

Each replica can be located in a different datacenter.

Page 8:

Spanner’s Features

• As a globally-distributed database, Spanner provides several interesting features.

• Applications can specify constraints to control which datacenters contain which data:

• How far data is from its users (to control read latency).

• How far replicas are from each other (to control write latency).

• How many replicas are maintained (to control durability, availability and read performance).

• In addition, Spanner has two features that are difficult to implement in a distributed database:

• Externally-consistent reads and writes.

• Globally-consistent reads across the database at a timestamp.

Page 9:

Spanner’s Features

• Why are externally-consistent reads and writes, and globally-consistent reads across the database at a timestamp, difficult to implement in a distributed database?

• Because we don't have a global “Wall Clock”.

Page 10:

So, what can we do?

• Global “wall-clock” time == external consistency: commit order respects global wall-time order.

• So, we will transform the problem into:

• Timestamp order respects global wall-time order.

• Timestamp order == commit order.

Page 11:

Assigning timestamps to RW transactions

• Transactions that write use two-phase locking (2PL).

• Each transaction T is assigned a timestamp s.

• Data written by T is timestamped with s.

• Assign timestamp while locks are held.

[Figure: timeline of transaction T. Locks are acquired, s = now() is picked while the locks are held, and the locks are then released.]

Page 12:

TIMESTAMP INVARIANTS

Timestamp order == commit order

Timestamp order respects global wall-time order

[Figure: transactions T1, T2, T3 and T4 placed on a timeline.]

Page 13:

TrueTime API

• The key enabler of these properties (previous slide) is a new TrueTime API and its implementation.

• The API exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides.

• The implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks).

Page 14:

TrueTime

• “Global wall-clock time” with bounded uncertainty.

[Figure: on the time axis, TT.now() returns an interval from earliest to latest; the width of the interval is 2*ε.]
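
The interval idea can be sketched in a few lines. This is a toy stand-in, not Google's implementation: the class names, the fixed `epsilon` and the injectable `clock` are assumptions made for illustration. Only the interval shape (earliest, latest) and the `after`/`before` predicates follow the API described in the OSDI ’12 paper.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus a fixed, assumed uncertainty bound."""

    def __init__(self, epsilon=0.005, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty in seconds (hypothetical value)
        self._clock = clock     # injectable so the sketch can be tested

    def now(self):
        # The true time is assumed to lie somewhere inside this interval.
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True if t has definitely not arrived yet."""
        return self.now().latest < t
```

A real deployment derives ε from its clock references rather than fixing it; the constant here only keeps the sketch short.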

Page 15:

TIMESTAMPS AND TRUETIME

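[Figure: timeline of transaction T. After acquiring locks, pick s = TT.now().latest. Then perform the commit wait: wait until TT.now().earliest > s before releasing locks. The intervals before and after s are each labeled “average ε”.]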
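
The commit-wait rule can be sketched as follows. This is a minimal illustration under assumptions: `tt_now`, `commit_with_wait`, the fixed `epsilon` and the injectable `clock`/`sleep` parameters are hypothetical names, and locking and replication are left out entirely.

```python
import time


def tt_now(epsilon, clock=time.time):
    """Hypothetical TT.now(): returns (earliest, latest) around the local clock."""
    t = clock()
    return t - epsilon, t + epsilon


def commit_with_wait(apply_writes, epsilon=0.005, clock=time.time, sleep=time.sleep):
    """Pick s = TT.now().latest, apply the writes, then wait until s has passed."""
    _, s = tt_now(epsilon, clock)           # s is picked while locks are held
    apply_writes(s)                         # data written by T is timestamped with s
    while tt_now(epsilon, clock)[0] <= s:   # commit wait: until TT.now().earliest > s
        sleep(epsilon / 10)
    return s                                # only now would the locks be released
```

With a fixed ε, the loop ends roughly 2*ε after s was picked, which is why a small uncertainty bound matters for write latency.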

Page 16:

Operations

• Spanner supports:

• Read-write transactions.

• Read-only transactions.

• Snapshot reads.

• A read-only transaction must be pre-declared as not having any writes.

• Reads in read-only transactions execute at a system-chosen timestamp without locking, so that incoming writes are not blocked.

• A snapshot read is a read in the past that executes without locking.

• The client can either specify a timestamp or provide an upper bound on the timestamp’s staleness.

Page 17:

Reads within read-write transactions

• Writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction’s own writes.

• The client issues reads to the leader replica of the appropriate group.

• The leader acquires read locks and then reads the most recent data.

• While a client transaction remains open, it sends “keep-alive” messages so that the transaction is not timed out.

• When a client has completed all reads and buffered all writes, the write (commit) protocol begins.
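
The client-side buffering described above can be sketched like this. It is a toy model under assumptions: `ClientTransaction` is a hypothetical name, a plain dict stands in for the leader replica's data, and read locks, keep-alive messages and the commit protocol itself are omitted.

```python
class ClientTransaction:
    """Toy read-write transaction: reads hit the store, writes are buffered."""

    def __init__(self, store):
        self._store = store     # dict standing in for the leader replica's data
        self._buffered = {}     # writes held at the client until commit

    def read(self, key):
        # Reads ignore the buffer, so they do not see this transaction's own writes.
        return self._store.get(key)

    def write(self, key, value):
        self._buffered[key] = value

    def commit(self):
        # In Spanner this is where the write protocol would start.
        self._store.update(self._buffered)
        self._buffered.clear()
```

For example, after `txn.write("x", 2)` on a store holding `{"x": 1}`, `txn.read("x")` still returns 1 until `txn.commit()` is called.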

Page 18:

RW transactions that involve one Paxos group

[Figure: timeline of transaction T within one Paxos group, with the events: acquired locks, pick s, start consensus, achieve consensus, commit wait done, release locks, notify slaves.]

Page 19:

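RW transactions that involve more than one Paxos group: 2PC protocol

[Figure: two-phase commit timeline for the coordinator TC and the participants TP1 and TP2. Each of the three acquires locks and later releases them. Each participant computes its own s, logs it (start logging, done logging), and sends s to the coordinator once prepared. The coordinator computes the overall s, completes the commit wait, marks the transaction committed, and notifies the participants of s.]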

Page 20:

EXAMPLE

TC: Remove X from my friend list (sC=6).

TP: Remove myself from X’s friend list (sP=8).

The two-phase commit picks the overall timestamp s=8.

T2: Risky post P (s=15).

Time          <8      8      15
My friends    [X]     []
My posts                     [P]
X’s friends   [me]    []
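
The overall timestamp in this example (s=8 from sC=6 and sP=8) follows the rule in the OSDI ’12 paper: the commit timestamp must be at least every prepare timestamp, greater than TT.now().latest at the moment the coordinator received the commit request, and greater than any timestamp the leader assigned earlier. The sketch below assumes integer timestamps and picks the smallest value that satisfies all three constraints; the function and parameter names are hypothetical.

```python
def choose_commit_timestamp(prepare_timestamps, tt_latest_at_commit_request, last_assigned):
    """Smallest integer timestamp that satisfies the coordinator's three constraints."""
    return max(
        max(prepare_timestamps),           # s >= every prepare timestamp
        tt_latest_at_commit_request + 1,   # s > TT.now().latest when the request arrived
        last_assigned + 1,                 # s > anything this leader assigned before
    )
```

With prepare timestamps 6 and 8, and the other two bounds below 8, the function returns 8, matching the slide. Choosing the minimum is a simplification made here; the paper only requires that the constraints hold.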

Page 21:

Serving Reads at a Timestamp

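• Each replica maintains a safe time t_safe.

• A replica can satisfy a read at a timestamp t if t <= t_safe.

• t_safe = min(t_safe^Paxos, t_safe^TM).

• t_safe^Paxos is the timestamp of the highest-applied Paxos write.

• t_safe^TM (the transaction manager’s safe time) is much harder:

• t_safe^TM = ∞ if there is no pending 2PC transaction.

• t_safe^TM = min_i(s_prepare(i,g)) - 1 over all transactions i prepared in group g.

• Thus, t_safe is the maximum timestamp at which reads are safe.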
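
The safe-time rule can be written out as a short sketch. Function and parameter names are hypothetical, and timestamps are assumed to be integers. Following the OSDI ’12 paper, the sketch subtracts 1 from the smallest prepare timestamp, because a prepared transaction may still commit at exactly its prepare timestamp.

```python
import math


def safe_time(highest_applied_paxos_ts, prepare_timestamps):
    """t_safe = min(t_safe^Paxos, t_safe^TM) for one replica."""
    t_paxos = highest_applied_paxos_ts
    # With no prepared 2PC transactions, the transaction manager imposes no bound.
    t_tm = min(prepare_timestamps) - 1 if prepare_timestamps else math.inf
    return min(t_paxos, t_tm)


def can_serve_read(t, highest_applied_paxos_ts, prepare_timestamps):
    """A replica can serve a read at timestamp t only if t <= t_safe."""
    return t <= safe_time(highest_applied_paxos_ts, prepare_timestamps)
```

For instance, a replica that has applied writes up to timestamp 20 but holds a transaction prepared at 15 can serve reads only up to timestamp 14.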

Page 22:

Read-Only transactions

• Executes in two phases:

• Assign a timestamp s_read.

• Execute the reads as snapshot reads at s_read.

• The snapshot reads can execute at any replicas that are up-to-date.

• The simple assignment s_read = TT.now().latest preserves external consistency.

• Such a timestamp may require the execution of data reads at s_read to block if t_safe has not advanced sufficiently.

• To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency.

Page 23:

Read-Only transactions

• Assigning a timestamp requires a negotiation phase between all of the Paxos groups that are involved in the read.

• As a result, Spanner requires a scope expression that summarizes the keys that will be read.

• If the scope’s values are served by a single Paxos group:

• The client issues the read-only transaction to the group leader.

• The leader assigns s_read = LastTS(), the timestamp of the last committed write at the Paxos group.

• The read then executes at any up-to-date replica.

• If the scope’s values are served by multiple Paxos groups:

• s_read = TT.now().latest (which may wait for safe time to advance).
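
The two cases above can be summarized in a short sketch. The names are hypothetical: `last_ts` stands in for each group's LastTS() value and `tt_now_latest` for a call to TT.now().latest. The sketch ignores the refinement that pending prepared transactions at a single group affect the choice.

```python
def choose_read_timestamp(groups_in_scope, last_ts, tt_now_latest):
    """Pick s_read for a read-only transaction.

    groups_in_scope: ids of the Paxos groups named by the scope expression
    last_ts:         dict mapping group id -> timestamp of its last committed write
    tt_now_latest:   callable returning TT.now().latest
    """
    groups = set(groups_in_scope)
    if len(groups) == 1:
        # Single group: the leader can use LastTS(), which avoids waiting.
        (group,) = groups
        return last_ts[group]
    # Multiple groups: skip negotiation and read at TT.now().latest,
    # which may have to wait for safe time to advance.
    return tt_now_latest()
```

The single-group branch returns an older timestamp than TT.now().latest would, which is what reduces the chance that the read blocks.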

Page 24:

Benchmarks

Setup: 50 Paxos groups, 2500 buckets, 4KB reads or writes, datacenters 1ms apart.

Latency remains mostly constant as the number of replicas increases, because Paxos executes in parallel at a group’s replicas.

Page 25:

Benchmarks

All leaders are explicitly placed in zone Z1.

Red line – killing a non-leader: no effect on read throughput.

Green line – killing the leader, soft: the leaders are given time to hand off leadership.

Blue line – killing the leader, hard: no warning for the leaders.

Page 26:

Questions?

Thanks!