Jan 07, 2016
MySQL Group Replication:
'Synchronous',
multi-master,
auto-everything
Ulf Wendel, MySQL/Oracle
The speaker says...
MySQL 5.7 introduces a new kind of replication: MySQL Group
Replication. At the time of writing (10/2014)
MySQL Group Replication is available as a preview release on
labs.mysql.com. In common user terms it features (virtually)
synchronous, multi-master, auto-everything replication.
Proper wording...
An eager update everywhere system based on the database state
machine approach, built atop a group communication system offering
virtual synchrony and reliable total-order messaging. MySQL
Group Replication offers generalized snapshot isolation.
The speaker says...
And here is a more technical description....
WHAT ?!
Hmm, how does it compare?
The speaker says...
The technical description given for MySQL Group Replication may
sound confusing because it mixes elements from distributed
systems theory and database systems theory. Between roughly 1996 and
2006 the two research communities jointly formulated the replication
method implemented by MySQL Group Replication.
As a web developer or MySQL DBA you are not expected to know
distributed systems theory inside out. Yet to understand the
properties of MySQL Group Replication and to get the most out of it,
we'll have to touch some of the concepts. Let's see first how the new
stuff compares to the existing.
Availability: cluster as a whole unaffected by loss of nodes
Scalability: geographic distribution; scale size in terms of users and data; database specific: read and/or write load
Distribution transparency: access, location, migration, relocation (while in use), replication, concurrency, failure
Goals of distributed databases
The speaker says...
MySQL Group Replication is about building a distributed
database. To catalog it and compare it with the existing MySQL
solutions in this area, we can ask what the goals of distributed
databases are. The goals lead to some criteria that are used to give
a first, brief overview.
Goal: a distributed database cluster strives for maximum
availability and scalability while maintaining distribution
transparency.
Criteria: availability, scalability, distribution transparency.
MySQL clustering cheat sheet
(Comparing MySQL Replication / MySQL Cluster / MySQL Fabric)
Availability: primary = SPoF, no auto failover / shared nothing, auto failover / SPoF monitored, auto failover
Scalability: reads / partial replication, node limit / partial replication, no node limit
Scale on WAN: asynchronous / synchronous (WAN option) / asynchronous (depends)
Distribution transparency: R/W splitting / SQL: yes (low level: no) / special clients, no distributed queries
The speaker says...
Already today MySQL has three solutions to build a distributed MySQL cluster: MySQL Replication, MySQL Cluster and MySQL Fabric. Each system has different optimizations, none can achieve all the goals of a distributed cluster at once. Some goals are orthogonal. Take MySQL Cluster. MySQL Cluster is a shared nothing system. Data storage is redundant, nodes fail independently. Transparent sharding (partial replication) ensures read and write scalability until the maximum number of nodes is reached. Great for clients: any SQL node runs any SQL, synchronous updates become visible immediately everywhere. But, it won't scale on slow WAN connections.
How Group Replication fits in
(Comparing MySQL Cluster / Group Replication)
Availability: shared nothing, auto failover / shared nothing, auto failover and join
Scalability: partial replication, node limit / full replication, read and some write scalability
Scale on WAN: synchronous (WAN option) / (virtually) synchronous
Distribution transparency: SQL: yes (low level: no) / all nodes run all SQL
The speaker says...
MySQL Group Replication has many of the desirable properties of
MySQL Cluster. It's strong on availability and client friendly due
to the distribution transparency. No complex client or application
logic is required to use the cluster. So, how do the two
differ?
Unlike MySQL Cluster, MySQL Group Replication supports the InnoDB
storage engine. InnoDB is the dominant storage engine for web
applications. This makes MySQL Group Replication a very attractive
choice for small clusters (3-7 nodes) running Drupal or WordPress in
LAN settings! Also, Group Replication is not synchronous in a strict
technical sense. For practical purposes it is.
Availability: nodes fail independently; the cluster continues operation in case of node failures
Scalability: geographic distribution: n/a, needs fast messaging; all nodes accept writes, mild write scalability; all nodes accept reads, full read scalability
Distribution transparency: full replication: all nodes have all the data; fail-stop model: the developer is freed from worrying about consistency
Group Replication (vs. Cluster)
The speaker says...
Another major difference between MySQL Cluster and MySQL Group Replication is the use of partial replication versus full replication. MySQL Cluster has transparent sharding (partial replication) built-in. On the inside, on the level of so-called MySQL Cluster data nodes, not every node has all the data. Writes don't add work to all nodes of the cluster but only a subset of them. Partial replication is the only known solution to write scalability. With MySQL Group Replication all nodes have all the data. Writes can be executed concurrently on different nodes but each write must be coordinated with every other node. Time to dig deeper >:).
Eager update everywhere... ?!
Where are transactions run? Primary copy or update everywhere. When does synchronization happen? Eagerly or lazily.
Eager + primary copy: MySQL semi-sync Replication
Eager + update everywhere: MySQL Cluster, MySQL Group Replication, 3rd party: Galera
Lazy + primary copy: MySQL Replication/Fabric, 3rd party: Tungsten
Lazy + update everywhere: MySQL Cluster Replication
A developer's categorization...
The speaker says...
I've described MySQL Group Replication as an eager update
everywhere system. The term comes from a categorization of
database replication systems by two questions:
- where can transactions be run?
- when are transactions synchronized between nodes?
The answers to the questions tell a developer which challenges to expect. The answers determine which additional tasks an application must handle when it's run on a cluster instead of a single server.
Lazy causes work...
(Diagram: four nodes after 'Set price = 1.23' - some still report the stale values 1.00 and 0.98.)
The speaker says...
When you try to scale an application by running it on a lazy (asynchronous) replication cluster instead of a single server, you will soon have users complaining about outdated and incorrect data. Depending on which node the application connects to after a write, a user may or may not see his own updates. This can neither happen on a single server system nor on an eager (synchronous) replication cluster. Lazy replication causes extra work for the developer.
BTW, have a look at PECL/mysqlnd_ms. It abstracts the problem of consistency for you. Things like read-your-writes boil down to a single function call.
Primary Copy causes work...
(Diagram: all writes go to the primary; reads are served by the copies.)
The speaker says...
Judging from the developer perspective only, primary copy is an undesired replication solution. In a primary copy system only one node accepts writes. The other nodes copy the updates performed on the primary. Because of the read-write splitting, the replication system does not need to coordinate conflicting operations. Great for the replication system author, bad for the developer. As a developer you must ensure that all write operations are directed to the primary node... Again, have a look at PECL/mysqlnd_ms.
MySQL Replication follows this approach. Worse, MySQL Replication is a lazy primary copy system.
Love: Eager Update Everywhere
(Diagram: reads and writes go to any node; all nodes agree on price = 1.23.)
The speaker says...
From a developer perspective an eager update everywhere system, like MySQL Group Replication, is indistinguishable from a single node. The only extra work it brings you is load balancing, but that is the case with any cluster. An eager update everywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility are improved because any transaction can be directed to any replica. (Synchronization happens as part of the commit, thus strong consistency can be achieved.) Fault tolerance is better than with primary copy. There is no single point of failure - a single primary - that can cause a total outage of the cluster. Nodes may fail individually without bringing the cluster down immediately.
HOW? Distributed + DB?
Database state machine?
The speaker says...
In the mid-1990s two observations made the database and
distributed systems theory communities wonder if they could
develop a joint replication approach.
First, Gray et al. (database community) showed that common
two-phase locking has an expected deadlock rate that grows with the
third power of the number of replicas.
Second, Schiper and Raynal noted that transactions have common
properties with group communication principles (distributed
systems) such as ordering, agreement/'all-or-nothing' and even
durability.
State machine replication - trivial to understand
Atomic Broadcast - database meets distributed systems community; OMG, how easy state machine replication is to implement!
Deferred Update Database Replication - how we gain high availability and high performance; what those MySQL Replication team blogs talk about ;-)
Three building blocks
The speaker says...
Finally, in 1999 Pedone, Guerraoui and Schiper published the
paper The Database State Machine Approach. The paper combines two
well known building blocks for replication with a messaging
primitive common in the distributed systems world: atomic
broadcast.
MySQL Group Replication is slightly different from this 1999 version, following a later refinement from 2005 plus a bit of additional ease-of-use. However, by the end of this chapter you will have learned how MySQL Cluster and MySQL Group Replication differ beyond InnoDB support and built-in sharding.
State machine replication
(Diagram: three replicas each receive the input 'Set A = 1' and each produce the same output: A = 1.)
The speaker says...
The first building block is trivial: a state machine. A state machine takes some input and produces some output. Assume your state machines are deterministic. Then, if you have a set of replicas all running the same state machine and they all get the same input, they all will produce the same output. On an aside: state machine replication is also known as active replication. Active means that every replica executes all the operations; active adds compute load to every replica. With passive replication, also called primary-backup replication, one replica (the primary) executes the operations and forwards the results to the others. Passive replication's availability depends on the primary, and possibly on network bandwidth.
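The idea can be sketched in a few lines of Python: deterministic replicas fed the same ordered input always reach the same state. This is purely illustrative - the class and operation names are made up, not MySQL internals.

```python
class Replica:
    """A deterministic state machine: state is a dict of variables."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        # op is ("set", name, value)
        _, name, value = op
        self.state[name] = value

replicas = [Replica() for _ in range(3)]
input_log = [("set", "A", 1), ("set", "B", 2)]

# Every replica receives every input in the same order (agreement + order).
for op in input_log:
    for r in replicas:
        r.apply(op)

assert all(r.state == {"A": 1, "B": 2} for r in replicas)
```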
Requirement: Agreement
(Diagram: one replica misses the input 'Set A = 1' and ends with A = NULL while the others output A = 1.)
The speaker says...
Here's another trivial aspect of the state machine replication approach. There are two requirements for it to work. Quite obviously, every replica has to receive all input to come to the same output. And the precondition for receiving input is that the replica is still alive.
In academic words the requirement is: agreement. Every non-faulty replica receives every request. Non-faulty replicas must agree on the input.
Requirement: Order
(Diagram: replicas receive the inputs 1) Set A = 1, 2) Set B = 1, 3) Set B = A * 2 in the orders 1,2,3 vs. 1,3,2 vs. 3,1,2 and end up with different values for B.)
The speaker says...
The second trivial requirement for state machine replication is
ordering. To produce the same output any two state machines must
execute the very same input including the ordering of input
operations. The academic wording goes: if a replica processes
request r1 before r2, then no replica processes request r2 before
r1. Note that if operations commute, some reordering may still lead
to correct output. The sequence A = 1, B = 1, B = A * 2 and the
sequence B = 1, A = 1, B = A * 2 produce the same output.
(Unrelated here: the database scaling talk touches the fancy
commutative replicated data types Riak offers... hot!)
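The ordering requirement and the commutativity exception can be checked with the slide's own example - again an illustrative Python sketch, not anything MySQL-specific:

```python
def run(ops):
    """Execute a sequence of operations against a fresh state."""
    state = {}
    for f in ops:
        f(state)
    return state

set_a = lambda s: s.__setitem__("A", 1)            # Set A = 1
set_b = lambda s: s.__setitem__("B", 1)            # Set B = 1
mul_b = lambda s: s.__setitem__("B", s["A"] * 2)   # Set B = A * 2

# Swapping the two commuting assignments yields the same output ...
assert run([set_a, set_b, mul_b]) == run([set_b, set_a, mul_b])
# ... but reordering a non-commuting pair changes the result.
assert run([set_a, set_b, mul_b]) != run([set_a, mul_b, set_b])
```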
A distributed systems messaging abstraction that meets all replicated state machine requirements:
Agreement: if a site delivers a message m, then every site delivers m
Order: no two sites deliver any two messages in different orders
Termination: if a site broadcasts message m and does not fail, then every site eventually delivers m
We need this in asynchronous environments
Atomic Broadcast
The speaker says...
State machine replication is the first building block for
understanding the database state machine approach. The second
building block is a messaging abstraction from the distributed
systems world called atomic broadcast. Atomic broadcast provides
all the properties required for state machine replication:
agreement and ordering. It adds a property needed for communication
in an asynchronous system, such as a system communicating via
network messages: termination.
All in all, this greatly simplifies state machine replication and contributes to a simple, layered design.
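One simple way to picture total-order delivery is a toy fixed-sequencer broadcast: senders hand messages to one sequencer, which assigns global sequence numbers, and receivers deliver gap-free in sequence-number order regardless of arrival order. Real group communication systems (Corosync's Totem ring, for instance) are far more involved; this sketch only shows the ordering idea, and all names are invented for illustration.

```python
import heapq

class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def stamp(self, msg):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return (seq, msg)

class Receiver:
    """Buffers out-of-order messages; delivers gap-free in sequence order."""

    def __init__(self):
        self.expected = 0
        self.buffer = []      # min-heap keyed on sequence number
        self.delivered = []

    def receive(self, stamped):
        heapq.heappush(self.buffer, stamped)
        while self.buffer and self.buffer[0][0] == self.expected:
            _, msg = heapq.heappop(self.buffer)
            self.delivered.append(msg)
            self.expected += 1

seq = Sequencer()
stamped = [seq.stamp(m) for m in ["set A=1", "set B=1", "set B=A*2"]]

r1, r2 = Receiver(), Receiver()
for s in stamped:              # r1 sees messages in sending order
    r1.receive(s)
for s in reversed(stamped):    # r2 sees them in reverse arrival order
    r2.receive(s)

assert r1.delivered == r2.delivered  # both deliver in the same total order
```

Note how delivery (appending to `delivered`) is decoupled from reception - exactly the distinction the next slide makes.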
Delivery, durability, group
(Diagram: a client, Mr. X, broadcasts a message to a group of replicas; the message is sent first and possibly delivered later.)
The speaker says...
The atomic broadcast properties given are literally copied from the original paper describing the database state machine replication approach. There are two things in it not explained yet. First, atomic broadcast defines properties in terms of message delivery. The delivery property not only ensures total ordering despite slow transport but also covers message loss (MySQL desires uniform agreement here, something better than Corosync) and even the crash and recovery of processors (durability)! A recovering processor must first deliver outstanding messages before it continues. Second, note that atomic broadcast introduces the notion of a group. Only (correct) members of a group can exchange messages.
Deferred Update: the best?
(Diagram: clients and replicas in the generic functional model of replication:)
Client Request
Server Coordination
Execution
Agreement
Client Response
The speaker says...
We are almost there. The third building block of database state machine replication is deferred update database replication. The slide shows a generic functional model used by Pedone and Schiper in 2010 to illustrate their choice of deferred update. The argument goes that deferred update combines the best of the two most prominent replication techniques: active and passive replication. Only the combination of the best of the two will give both high availability and high performance. Translation: MySQL Group Replication can - in theory - have higher overall throughput than MySQL Replication. Do you love the theory ;-) ? As a DBA you should.
Active Replication (SM)
(Diagram: clients send operations to all replicas:)
Client sends op to all
Requests get ordered
Execution
All reply to client
The speaker says...
In an active replication system, a pure state machine replication system, client operations are forwarded to all replicas and each replica individually executes the operation. The two challenges are to ensure that all replicas execute requests in the same order and that all replicas decide the same. Recall that we are talking about multi-threaded database servers here.
A downside is that every replica has to execute the operation. If the operation is expensive in terms of CPU, this can be a waste of CPU time.
Passive Replication
(Diagram: clients talk only to the primary; backups receive the changes:)
Client sends op to primary
Only primary executes
Primary forwards changes
Primary replies to client
The speaker says...
The alternative is passive replication, or primary-backup replication. Here, the client talks to only one server, the primary. Only the primary server executes client operations. After computation of the result, the primary forwards the changes to the backups, which apply them.
The problem here is that the primary determines the system's throughput. None of the backups can contribute its computing power to the overall system throughput.
What we want...
- for performance: more than one primary
- for scalability: no distributed locking
- ... and of course: transactions
- a two-staged transaction protocol
Multi-primary (passive) replication
(Diagram: a client talks to one of several primaries:)
Transaction processing
Transaction termination
The speaker says...
Multi-primary (passive) replication has all the ingredients desired. Transaction processing is two-staged. First, a client picks any replica to execute a transaction. This replica becomes the primary of the transaction. The transaction executes locally; this stage is called transaction processing. In the second stage, during transaction termination, the primaries jointly decide whether the transaction can commit or must abort. Because updates are not immediately applied, database folks call this deferred update - our last building block.
Deterministic certification: reads execute locally, updates get certified
Certification ensures transaction serializability
Replicas decide independently about the certification result
Deferred Update DB Replication
(Diagram: a read is handled by one primary; a write's readset, writeset and updates - Rs/Ws/U - are broadcast to all primaries.)
The speaker says...
One property of transactions is isolation. Isolation is also
known as serializability: the concurrent execution of transactions
should be equivalent to a serial execution of the same
transactions. In a deferred update system, read transactions are
processed and terminated on one replica and serialized
locally.
Updates must be certified. After transaction processing, the
readset, writeset and updates are sent to all other replicas. The
servers then decide in a deterministic procedure whether (one-copy)
serializability holds, i.e. whether the transaction commits. Because
it's a deterministic procedure, the servers can certify transactions
independently!
Atomic Broadcast based - this is what is used by MySQL and by DBSM
Optimization: reordering (atop Atomic Broadcast) - in theory it means fewer transaction aborts
Optimization limit: Generic Broadcast based - this has issues which make it nasty
Atomic Commit based - more transaction aborts than atomic broadcast
Options for termination
The speaker says...
There are several ways of implementing the termination protocol and the certification. There are two truly distinct choices: atomic broadcast and atomic commit. Atomic commit causes more transaction aborts than atomic broadcast. So, it's out and atomic broadcast remains.
Atomic broadcast can in theory be further optimized towards fewer transaction aborts using reordering. For practical matters, this is about where the optimizations end. A weaker (and possibly faster) generic broadcast causes problems in the transactional model. For databases, it could be an over-optimization.
Transactions have a state: Executing, Committing, Committed, Aborted
Reads are handled locally
Updates are sent to all replicas: readset and writeset are forwarded
On each replica: search for 'conflicting' transactions. Can it be serialized with all previous transactions? Commit!
Commit? Abort local transactions that overlap with the update
Generic certification test
The speaker says...
No matter what termination procedure is used, the basic procedure for certification in the deferred update model is always the same. Updates/writes need certification. The data read and the data written by a transaction are forwarded to all other replicas.
Every replica searches for potentially 'conflicting' transactions; the details depend on the termination procedure. A transaction is decided to commit if it does not violate serializability with all previous transactions. Any local transaction currently running and conflicting with the update is aborted.
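The generic certification idea can be sketched as a set-intersection test: a committing transaction's readset is checked against the writesets of concurrent, already-committed transactions, and any overlap (a write-read conflict) forces an abort. This is an illustrative sketch of the paper's generic test; MySQL's actual certifier works differently (the version-based check shown later).

```python
def certify(readset, committed_writesets):
    """Return True (commit) if no committed concurrent writeset
    overlaps the committing transaction's readset."""
    return all(not (readset & ws) for ws in committed_writesets)

# T reads {x, y}; a concurrent committed transaction wrote {z}: no conflict.
assert certify({"x", "y"}, [{"z"}]) is True
# Another concurrent committed transaction wrote {y}: conflict, abort.
assert certify({"x", "y"}, [{"z"}, {"y"}]) is False
```

Because the test is a deterministic function of data every replica has, all replicas reach the same commit/abort decision without further coordination.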
Deferred Update Database Replication as a state machine; Atomic Broadcast based termination
Database State Machine
(Diagram: architecture stack - MySQL with transaction hooks and plugin services; the MySQL Group Replication plugin with Capture, Apply and Recover components; the replication protocol including the termination protocol/certifier; and the Group Communication System underneath.)
The speaker says...
The Database State Machine Approach combines all the bits and pieces. Let's do a bottom-up summary. Atomic broadcast not only frees the database developer from bothering with networking APIs, it also solves the nasty bits of communicating over an asynchronous network. It provides properties that meet the requirements of state machine replication. A deterministic state machine is what one needs to implement the termination protocol within deferred update replication. Deferred update replication does not use distributed locking, which Gray proved problematic, and it combines the best of active and passive replication. Side effects: simple replication protocol, layered code.
Updates are sent to all replicas: readset and writeset are forwarded
Step 1 - On each replica: certify. Is there any committed transaction that conflicts? (In the original paper: check for write-read conflicts between the committing transaction and committed transactions. Does the committing transaction's readset overlap with any committed transaction's writeset? Works slightly differently in MySQL.)
Step 2 - On each replica: commitment. Apply transactions decided to commit. Handle concurrent local transactions: remote wins.
The termination algorithm
The speaker says...
The termination process has two logical steps, just like the general one presented earlier. The very details of how exactly two transactions are checked for conflicts in the first step don't matter here. MySQL Group Replication is using a refinement of the algorithm tailored to its own needs. As a developer all you need to know is: a remote transaction always wins, no matter how expensive local transactions are. And, keep conflicting writes on one replica. It's faster.
The puzzling bit on the slide is the rule to check a committing transaction against any committed transaction for conflicts. Any!? Not any... only concurrent.
What's concurrent?
(Diagram: transactions 1 and 2 are broadcast from different replicas; the total order assigns them sequence numbers; delivery happens some time after broadcast.)
Concurrent: any other transaction that does not precede the current one. Recall: total ordering.
Recall: asynchronous, delay between broadcast and delivery.
The speaker says...
The definition of what concurrent means is a bit tricky. It's
defined through a negation, and that's confusing at first look
but hopefully becomes clear on the next slide. Concurrent to a
transaction is any other transaction that does not precede it. If we
know the order of all transactions in the entire cluster, then we
can tell which transactions precede one another.
Atomic broadcast ensures total order on delivery. Some
implementations decide on ordering when sending, and that number
(a logical clock) could be used. Any logical clock works.
Certify against all previous?
(Diagram: transaction 4 is broadcast from a replica whose latest committed transaction has total order number 2.)
Broadcast: transaction 4 is based on all previous up to 2
Certification when 4 is delivered: check conflicts with trx > 2 and trx < 4
The speaker says...
The slide has an example of how to find the transactions concurrent to a given one. When a transaction enters the committing state and is broadcast, the broadcast includes the logical time (= total order number on the slide) of the latest transaction committed on the replica. Eventually the transaction is delivered on all sites. Upon delivery, certification considers all transactions that happened after the logical time included with the to-be-certified transaction. None of those transactions precede the one to be certified; they executed concurrently at different replicas. We don't have to look further into the past. Further in the past is stuff that's been decided on already.
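The "which transactions must I certify against?" window from the slide can be expressed as a one-line filter. An illustrative sketch with invented names: a transaction broadcast with snapshot order 2 and delivered at order 4 is certified only against transactions strictly between those two points - everything at or before 2 already preceded it.

```python
def concurrent_window(snapshot_order, delivery_order, committed_orders):
    """Total-order numbers of committed transactions that did NOT
    precede this one, i.e. the ones it must be certified against."""
    return [o for o in committed_orders
            if snapshot_order < o < delivery_order]

# Transaction 4 was based on everything up to 2; only trx 3 is concurrent.
assert concurrent_window(2, 4, [1, 2, 3]) == [3]
```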
TIME TO BREATHE
MySQL is different anyway...
The speaker says...
Good news! The algorithm used by MySQL Group Replication is different and simpler. For correctness, the precedes relation is still relevant. But it comes for free...
A developer's view on commit
ReplicaReplicaReplicaReplicaReplica
t(3)Certify
44
CertifyApplyClientClient
Execute
BEGINCOMMIT
Result
The speaker says...
We are not done with the theory yet but let's do some slides
that take the developer's perspective. Assuming you have to scale a
PHP application, assuming a small cluster of a handful of MySQL
servers is enough and assuming these servers are co-located on
racks, then MySQL Group Replication is your best possible
choice.
Did you get this from the theory? Replication is 'synchronous'. On
commit you wait only for the server you are connected to. Once your
transaction is broadcast, you are done. You don't wait for the
other servers to execute the transaction. With uniform atomic
broadcast, once your transaction is broadcast, it cannot get
lost. (That's why I torture you with theory.)
MySQL Replication
(Diagram: the client commits on the master and gets an OK immediately; slaves asynchronously fetch the binary log and apply it.)
The speaker says...
If your network is slow, or mother earth - the speed of light and
network message round-trip times - adds too much to your transaction
execution time, then asynchronous MySQL Replication is a better
choice.
In MySQL Replication the master (primary) never waits for the network, not even to broadcast updates. Slaves asynchronously pull changes. Besides pushing work onto the developer, this approach has the downside that a hardware crash on the master can cause transaction loss. Slaves may or may not have pulled the latest data.
MySQL Semi-sync Replication
(Diagram: the master waits for the first slave to acknowledge that it has fetched the binary log before returning OK; other slaves fetch and apply asynchronously.)
The speaker says...
In the times of MySQL 5.0 the MySQL community suggested that, to avoid transaction loss, the master should wait for one slave to acknowledge it has fetched the update from the master. The fact that it's been fetched does not mean that it's been applied. The update may not be visible to clients yet.
There is a back and forth about whether database replication should be asynchronous or not. It depends on your needs.
Back to theory after this break.
Back to theory!
Virtual Synchrony?
Groups and views - a turbo-charged version of Atomic Broadcast
Virtual Synchrony
(Diagram: processes P1-P3 exchange messages M1 and M2 in group G1 = {P1, P2, P3}; a view change (VC) installs G2 = {P1, P2, P3, P4}; messages M3 and M4 follow in the new view.)
The speaker says...
Good news! Virtual Synchrony and Atomic Broadcast are much the same. Our Atomic Broadcast definition assumes a static group: adding group members, removing members or detecting failed ones is not covered.
Virtual Synchrony handles all these membership changes. Whenever an existing group agrees on changes, a new view is installed through a view change (VC) event. (The term 'virtual': it's not truly synchronous - there are short message delays we don't want to wait for. Yet the system appears synchronous to most real-life observers.)
View changes act as a message barrier - that's a case causing trouble in Two-Phase Commit
Virtual Synchrony
(Diagram: messages M5 and M6 are delivered in view G2 = {P1, P2, P3, P4}; a view change excludes P4 and installs G3 = {P1, P2, P3}; M7 and M8 are delivered only within the new view.)
The speaker says...
View changes are message barriers. If the group members suspect a member to have failed, they install a new view.
Maybe the former member was not dead but just too slow to respond, or disconnected for a brief period: false alarm. The former member then tries to broadcast some updates. Virtual Synchrony ensures that these updates will not be seen by the remaining members. Furthermore, the former member will realize that it was excluded. Some GCS implementations of virtual synchrony even provide abstractions that ensure a joining member learns all updates it missed (state transfer) before it rejoins.
Auto-everything: failover
(Diagram: a group of MySQL servers; one fails and the cluster continues.)
MySQL Group Replication has a pluggable GCS API
Split brain handling? Depends on the GCS and/or GCS config
Default GCS is Corosync
The speaker says...
Good news! The Virtual Synchrony group membership advantages are fully exposed to the user level: node failures are detected and handled automatically. PECL/mysqlnd_ms can help you on the client side. It's a minor tweak to have it automatically learn about the remaining MySQL servers. Expect an update release soon.
MySQL Group Replication works with any group communication system that can be accessed from C and implements Virtual Synchrony. The default choice is Corosync. Split brain handling is GCS dependent. MySQL follows the view change notifications of the GCS.
Auto-everything: joining
Elastic: the cluster grows and shrinks on demand
State transfer is done via an asynchronous replication channel
(Diagram: a joiner picks a donor among the existing MySQL servers and receives a state transfer.)
The speaker says...
Good news! When adding a server you don't fiddle with the very details. You start the server, tell it to join the cluster and wait for it to catch up. The server picks a donor, begins fetching updates using much of the existing MySQL Replication code infrastructure and that's it.
Back to theory!
Generalized Snapshot Isolation
The transaction readset does not need to be broadcast: the readset is hard to extract and can be huge
A weaker serializability level than 1SR
Sufficient for InnoDB's default isolation
Deferred Update tweak
(Diagram: reads are handled by one primary; a write's version information, writeset and updates - V/Ws/U - are broadcast to all primaries.)
The speaker says...
Good news! This is the last bit of theory. The original Database State Machine proposal was followed by a simpler-to-implement proposal in 2005. If the cluster's serialization level is marginally lowered to snapshot isolation, certification becomes easier. Generalized snapshot isolation can be achieved without having to broadcast the readsets of transactions. Recording the readset of a transaction is difficult in most existing databases. Also, readsets can be huge. Snapshot isolation is an isolation level for multi-version concurrency control. MVCC? InnoDB! Somehow... Whatever - this is the base algorithm of MySQL Group Replication's termination.
Conflict (both change x)
Concurrent and write conflict? First committer wins!
Reads use the snapshot from the beginning of the transaction
Snapshot Isolation
(Diagram: T1: BEGIN(v1), W(v1, x=1), COMMIT! -> x:v2=1. T2: BEGIN(v1), W(v1, x=2), ..., COMMIT? Both write version 1 of x concurrently; the first committer wins.)
The speaker says...
In Snapshot Isolation, transactions take a snapshot when they begin. All reads return data from this snapshot. Although any other concurrent transaction may update the underlying data while the transaction still runs, the change is invisible; the transaction runs in isolation. If two concurrent transactions change the same data item, they conflict. In case of conflicts, the first committer wins.
MVCC requires that, as part of an update of a data item, its version is incremented. Future transactions will base their snapshot on the new version.
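A toy first-committer-wins check makes the rule concrete: each data item carries a version, a transaction records the version it snapshotted, and a commit succeeds only if nobody committed a newer version in between. The names are illustrative, not InnoDB internals.

```python
class Store:
    def __init__(self):
        self.versions = {}     # item -> latest committed version

    def begin(self, item):
        """Snapshot: remember the version the transaction is based on."""
        return self.versions.get(item, 0)

    def commit_write(self, item, base_version):
        if self.versions.get(item, 0) != base_version:
            return False       # someone committed first -> abort
        self.versions[item] = base_version + 1  # MVCC: bump the version
        return True

store = Store()
t1 = store.begin("x")          # both T1 and T2 snapshot version 0 of x
t2 = store.begin("x")
assert store.commit_write("x", t1) is True    # T1 commits: x is now v1
assert store.commit_write("x", t2) is False   # T2 conflicts: first committer wins
```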
The actual termination protocol
(Diagram: a write Write(v2, x=1) is certified against a certification index that maps objects to their latest versions - e.g. x -> 1, y -> 13 - and is accepted: OK.)
The speaker says...
Every replica checks the version of a write during certification. It compares the write's data item version number with the latest it knows of. If the version is higher than or equal to the one found in the replica's certification index, the write is accepted. A lower number indicates that someone has already updated the data item before. Because the first committer must win, a write showing a lower version number than the one in the certification index must abort.
(The certification index fills over time and is truncated periodically by MySQL. MySQL reports its size through Performance Schema tables.)
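The version-based check described above can be sketched in a few lines - an illustrative model with invented names; MySQL's actual certification data structures differ:

```python
def certify_write(cert_index, item, write_version):
    """Accept a write iff its version is at least as new as the
    certification index entry; advance the index on acceptance."""
    latest = cert_index.get(item, 0)
    if write_version >= latest:
        cert_index[item] = write_version
        return True
    return False            # stale version: a first committer already won

index = {"x": 1, "y": 13}
assert certify_write(index, "x", 2) is True    # v2 >= v1: accepted
assert certify_write(index, "x", 1) is False   # v1 < v2 now: aborted
```

Since every replica runs this deterministic check against the same totally ordered stream of writes, all replicas reach the same verdict independently.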
Hmm...
Does it work?
It's a preview - there are limits
General: InnoDB only; Corosync lacks uniform agreement; no rules to prevent split-brain (it's a preview, you're allowed to fool yourself if you misconfigure the GCS!)
Isolation level: primary key based; foreign keys and unique keys not supported yet; no concurrent DDL
That's it, folks!
Questions?
The speaker says...
(Oh, a question. Flips slide)
Network messages pffft!
@markcallaghan Sep 30: For MySQL sync replication, when all commits originate from 1 master is there 1 network round trip or 2? http://mysqlhighavailability.com/mysql-group-replication-hello-world
@Ulf_Wendel: @markcallaghan AFAIK, on the logical level, there should be one. Some of your questions might depend on the GCS used. The GCS is pluggable.
@markcallaghan: @Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I remain confused
(MySQL super hero at Facebook)
GCS != MySQL Semi-sync
(Diagram: MySQL servers stacked on Corosync instances.)
It's many round trips; how many depends on the GCS
Default GCS is Corosync; Corosync is Totem Ring
Corosync uses a privilege-based approach to total ordering
Many options exist: fixed sequencer, moving sequencer, ...
Where you run your updates only impacts the collision rate
The speaker says...
No Mark, MySQL Group Replication cannot be understood as a replacement for MySQL Semi-sync Replication. The question about network round trips is hard to answer. Atomic Broadcast and Virtual Synchrony stack many subprotocols together. Let's consider a stable group, no network failure, and Totem. Totem orders messages using a token that circulates along a virtual ring of all members. Whoever has the token has the privilege to broadcast. Others wait for the token to appear. Atomic Broadcast gives us all-or-nothing messaging. It takes at least another full round on the ring to be sure the broadcast has been received by all. How many round trips are that? Welcome to distributed systems...
THE END
Contact: [email protected]
The speaker says...
Thank you for your attendance!
Upcoming shows:
Talk&Show! - YourPlace, any time