
MySQL Group Replication:
'Synchronous',
multi-master,
auto-everything

Ulf Wendel, MySQL/Oracle

The speaker says...

MySQL 5.7 introduces a new kind of replication: MySQL Group Replication. At the time of writing (10/2014), MySQL Group Replication is available as a preview release on labs.mysql.com. In common user terms it features (virtually) synchronous, multi-master, auto-everything replication.

Proper wording...
An eager update everywhere system based on the database state machine approach atop a group communication system offering virtual synchrony and reliable, totally ordered messaging. MySQL Group Replication offers generalized snapshot isolation.

The speaker says...

And here is a more technical description....

WHAT ?!
Hmm, how does it compare?

The speaker says...

The technical description given for MySQL Group Replication may sound confusing because it has elements from both distributed systems and database systems theory. Between around 1996 and 2006 the two research communities jointly formulated the replication method implemented by MySQL Group Replication.

As a web developer or MySQL DBA you are not expected to know distributed systems theory inside out. Yet to understand the properties of MySQL Group Replication and to get the most out of it, we'll have to touch some of the concepts. Let's see first how the new stuff compares to the existing.

Availability
- Cluster as a whole unaffected by loss of nodes

Scalability
- Geographic distribution
- Scale size in terms of users and data
- Database specific: read and/or write load

Distribution Transparency
- Access, Location, Migration, Relocation (while in use)
- Replication
- Concurrency, Failure

Goals of distributed databases

The speaker says...

MySQL Group Replication is about building a distributed database. To catalog it and compare it with the existing MySQL solutions in this area, we can ask what the goals of distributed databases are. The goals lead to some criteria that are used to give a first, brief overview.

Goal: a distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency.

Criteria: availability, scalability, distribution transparency.

MySQL clustering cheat sheet

Criterion | MySQL Replication | MySQL Cluster | MySQL Fabric
Availability | Primary = SpoF, no auto failover | Shared nothing, auto failover | SpoF monitored, auto failover
Scalability | Reads | Partial replication, node limit | Partial replication, no node limit
Scale on WAN | Asynchronous | Synchronous (WAN option) | Asynchronous (depends)
Distribution Transparency | R/W splitting | SQL: yes (low level: no) | Special clients, no distributed queries

The speaker says...

Already today MySQL has three solutions to build a distributed MySQL cluster: MySQL Replication, MySQL Cluster and MySQL Fabric. Each system has different optimizations, none can achieve all the goals of a distributed cluster at once. Some goals are orthogonal. Take MySQL Cluster. MySQL Cluster is a shared nothing system. Data storage is redundant, nodes fail independently. Transparent sharding (partial replication) ensures read and write scalability until the maximum number of nodes is reached. Great for clients: any SQL node runs any SQL, synchronous updates become visible immediately everywhere. But it won't scale on slow WAN connections.

How Group Replication fits in

(Columns on the slide: Repl., Cluster, Group Repl., Fabric. Only the Cluster and Group Repl. columns are filled in.)

Criterion | Cluster | Group Repl.
Availability | Shared nothing, auto failover | Shared nothing, auto failover/join
Scalability | Partial replication, node limit | Full replication, read and some write scalability
Scale on WAN | Synchronous (WAN option) | (Virtually) Synchronous
Distribution Transparency | SQL: yes (low level: no) | All nodes run all SQL

The speaker says...

MySQL Group Replication has many of the desirable properties of MySQL Cluster. It's strong on availability and client friendly due to the distribution transparency. No complex client or application logic is required to use the cluster. So, how do the two differ?

Unlike MySQL Cluster, MySQL Group Replication supports the InnoDB storage engine. InnoDB is the dominant storage engine for web applications. This makes MySQL Group Replication a very attractive choice for small clusters (3-7 nodes) running Drupal, WordPress and the like in LAN settings! Also, Group Replication is not synchronous in the strict technical sense. For practical matters it is.

Availability
- Nodes fail independently
- Cluster continues operation in case of node failures

Scalability
- Geographic distribution: n/a, needs fast messaging
- All nodes accept writes, mild write scalability
- All nodes accept reads, full read scalability

Distribution Transparency
- Full replication: all nodes have all the data
- Fail stop model: developer freed from worrying about consistency

Group Replication (vs. Cluster)

The speaker says...

Another major difference between MySQL Cluster and MySQL Group Replication is the use of partial replication versus full replication. MySQL Cluster has transparent sharding (partial replication) built in. On the inside, on the level of so-called MySQL Cluster data nodes, not every node has all the data. Writes don't add work to all nodes of the cluster but only a subset of them. Partial replication is the only known solution to write scalability. With MySQL Group Replication all nodes have all the data. Writes can be executed concurrently on different nodes but each write must be coordinated with every other node. Time to dig deeper >:).

Eager update everywhere... ?!

Where are transactions run?

When does synchronization happen? | Primary Copy | Update Everywhere
Eager | (MySQL semi-synch Replication) | MySQL Cluster, MySQL Group Replication, 3rd party: Galera
Lazy | MySQL Replication/Fabric, 3rd party: Tungsten | MySQL Cluster Replication

A developer's categorization...

The speaker says...

I've described MySQL Group Replication as an eager update everywhere system. The term comes from a categorization of different database replication systems by the two questions:

- where can transactions be run?
- when are transactions synchronized between nodes?

The answers to these questions tell a developer which challenges to expect. The answers determine which additional tasks an application must handle when it's run on a cluster instead of a single server.

Lazy causes work...

[Figure: a client sets price = 1.23 on a cluster of four nodes; the nodes hold price = 1.23, price = 1.00, price = 1.23 and price = 0.98]

The speaker says...

When you try to scale an application by running it on a lazy (asynchronous) replication cluster instead of a single server, you will soon have users complaining about outdated and incorrect data. Depending on which node the application connects to after a write, a user may or may not see his own updates. This can happen neither on a single server system nor on an eager (synchronous) replication cluster. Lazy replication causes extra work for the developer.

BTW, have a look at PECL/mysqlnd_ms. It abstracts the problem of consistency for you. Things like read-your-writes boil down to a single function call.
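The stale read described above can be sketched in a few lines. This is a toy model only: the class `LazyCluster` and its methods are hypothetical names made up for illustration and have nothing to do with MySQL's or PECL/mysqlnd_ms's actual interfaces.

```python
class LazyCluster:
    """Toy model of lazy (asynchronous) primary copy replication."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]  # node 0 is the primary
        self.backlog = []  # committed on the primary, not yet on the copies

    def write(self, key, value):
        self.nodes[0][key] = value
        self.backlog.append((key, value))  # shipped later, not at commit time

    def read(self, node, key):
        return self.nodes[node].get(key)

    def replicate(self):
        # Runs "some time later": the copies catch up with the primary.
        for key, value in self.backlog:
            for copy in self.nodes[1:]:
                copy[key] = value
        self.backlog.clear()


cluster = LazyCluster(node_count=3)
cluster.write("price", 1.00)
cluster.replicate()

cluster.write("price", 1.23)
fresh = cluster.read(0, "price")   # read on the primary
stale = cluster.read(1, "price")   # read on a copy that has not caught up
cluster.replicate()
caught_up = cluster.read(1, "price")
print(fresh, stale, caught_up)     # 1.23 1.0 1.23
```

Whether the user sees 1.23 or 1.0 depends only on which node serves the read, which is exactly the extra work lazy replication pushes onto the application.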

Primary Copy causes work...

[Figure: one Primary node and three Copy nodes; the Write goes to the Primary, the Reads are spread across the nodes]

The speaker says...

Judging from the developer perspective only, primary copy is an undesired replication solution. In a primary copy system only one node accepts writes. The other nodes copy the updates performed on the primary. Because of the read-write splitting, the replication system does not need to coordinate conflicting operations. Great for the replication system author, bad for the developer. As a developer you must ensure that all write operations are directed to the primary node... Again, have a look at PECL/mysqlnd_ms.

MySQL Replication follows this approach. Worse, MySQL Replication is a lazy primary copy system.

Love: Eager Update Everywhere

[Figure: three nodes, each accepting writes and reads; all three hold price = 1.23]

The speaker says...

From a developer perspective an eager update everywhere system, like MySQL Group Replication, is indistinguishable from a single node. The only extra work it brings you is load balancing, but that is the case with any cluster. An eager update everywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility are improved because any transaction can be directed to any replica. (Sometimes synchronization happens as part of the commit, thus strong consistency can be achieved.) Fault tolerance is better than with Primary Copy. There is no single point of failure - a single primary - that can cause a total outage of the cluster. Nodes may fail individually without bringing the cluster down immediately.

HOW? Distributed + DB?
Database state machine?

The speaker says...

In the mid-1990s two observations made the database and distributed systems theory communities wonder if they could develop a joint replication approach.

First, Gray et al. (database community) showed that the common two-phase locking has an expected deadlock rate that grows with the third power of the number of replicas.

Second, Schiper and Raynal noted that transactions have common properties with group communication principles (distributed systems) such as ordering, agreement/'all-or-nothing' and even durability.

State machine replication
- trivial to understand

Atomic Broadcast
- database meets distributed systems community
- OMG, how easy state machine replication is to implement!

Deferred Update Database Replication
- database meets distributed systems community
- how we gain high availability and high performance
- what those MySQL Replication team blogs talk about ;-)

Three building blocks

The speaker says...

Finally, in 1999 Pedone, Guerraoui and Schiper published the paper The Database State Machine Approach. The paper combines two well known building blocks for replication with a messaging primitive common in the distributed systems world: atomic broadcast.

MySQL Group Replication is slightly different from this 1999 version. It more closely follows a later refinement from 2005 plus a bit of additional ease-of-use. However, by the end of this chapter you will have learned how MySQL Cluster and MySQL Group Replication differ beyond InnoDB support and built-in sharding.

State machine replication

[Figure: the input "Set A = 1" is sent to three replicas; each replica produces the output A = 1]

The speaker says...

The first building block is trivial: a state machine. A state machine takes some input and produces some output. Assume your state machines are deterministic. Then, if you have a set of replicas all running the same state machine and they all get the same input, they all will produce the same output. As an aside: state machine replication is also known as active replication. Active means that every replica executes all the operations, so active adds compute load to every replica. With passive replication, also called primary-backup replication, one replica (primary) executes the operations and forwards the results to the others. Passive suffers from primary availability and possibly network bandwidth.
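A minimal sketch of the idea, assuming a toy state machine whose only operation is "set variable to value" (the class `Replica` is a made-up illustration, not anything from MySQL):

```python
class Replica:
    """A deterministic state machine: the state is a dict of variables."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        # An operation is a (variable, value) pair, e.g. ("A", 1) for "Set A = 1".
        variable, value = op
        self.state[variable] = value
        return dict(self.state)  # output: a copy of the new state


replicas = [Replica() for _ in range(3)]
inputs = [("A", 1), ("B", 2)]

# Active replication: every replica executes every operation itself.
for op in inputs:
    for replica in replicas:
        replica.apply(op)

states = [replica.state for replica in replicas]
print(states)  # [{'A': 1, 'B': 2}, {'A': 1, 'B': 2}, {'A': 1, 'B': 2}]
```

Because `apply` is deterministic and all replicas see the same input, their states cannot diverge. The following slides deal with what "the same input" requires.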

Requirement: Agreement

[Figure: the input "Set A = 1" is sent to the replicas; one replica outputs A = 1 while another, which did not receive the input, still has A = NULL]

The speaker says...

Here's more trivia about the state machine replication approach. There are two requirements for it to work. Quite obviously, every replica has to receive all input to come to the same output. And the precondition for receiving input is that the replica is still alive.

In academic words the requirement is: agreement. Every non-faulty replica receives every request. Non-faulty replicas must agree on the input.

Requirement: Order

[Figure: three replicas receive the operations 1) Set A = 1, 2) Set B = 1, 3) Set B = A * 2 in different orders. Input 1, 2, 3 yields A = 1, B = 2; input 1, 3, 2 yields A = 1, B = 1; input 3, 1, 2 yields A = 1, B = 1]

The speaker says...

The second trivial requirement for state machine replication is ordering. To produce the same output any two state machines must execute the very same input including the ordering of input operations. The academic wording goes: if a replica processes request r1 before r2, then no replica processes request r2 before r1. Note that if operations commute, some reordering may still lead to correct output. The sequence A = 1, B = 1, B = A * 2 and the sequence B = 1, A = 1, B = A * 2 produce the same output.
(Unrelated here: the database scaling talk touches the fancy commutative replicated data types Riak offers... hot!)
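The effect of ordering can be replayed with the three operations from the slide. This is a small sketch: the function `run` is a made-up helper, and `None` is assumed to stand in for SQL NULL.

```python
def run(order):
    """Apply the three operations from the slide in the given order."""
    state = {"A": None, "B": None}  # None models SQL NULL
    ops = {
        1: lambda s: s.update(A=1),                                       # Set A = 1
        2: lambda s: s.update(B=1),                                       # Set B = 1
        3: lambda s: s.update(B=None if s["A"] is None else s["A"] * 2),  # Set B = A * 2
    }
    for number in order:
        ops[number](state)
    return state


print(run([1, 2, 3]))  # {'A': 1, 'B': 2}
print(run([1, 3, 2]))  # {'A': 1, 'B': 1}
print(run([3, 1, 2]))  # {'A': 1, 'B': 1}
print(run([2, 1, 3]))  # {'A': 1, 'B': 2}
```

Orders 1, 2, 3 and 2, 1, 3 agree because operations 1 and 2 commute. Moving operation 3 changes the result.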

Distributed systems messaging abstraction
- Meets all replicated state machine requirements

Agreement
- If a site delivers a message m then every site delivers m

Order
- No two sites deliver any two messages in different orders

Termination
- If a site broadcasts message m and does not fail, then every site eventually delivers m
- We need this in asynchronous environments

Atomic Broadcast

The speaker says...

State machine replication is the first building block for understanding the database state machine approach. The second building block is a messaging abstraction from the distributed systems world called atomic broadcast. Atomic broadcast provides all the properties required for state machine replication: agreement and ordering. It adds a property needed for communication in an asynchronous system, such as a system communicating via network messages: termination.

All in all, this greatly simplifies state machine replication and contributes to a simple, layered design.
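To make the order property concrete, here is a sketch of one classic way to get total order: a fixed sequencer plus a hold-back queue at every site. The names `Sequencer` and `Site` are hypothetical. The sketch ignores failures entirely, so it illustrates ordering only, not agreement or termination under crashes.

```python
import random


class Sequencer:
    """Hypothetical fixed sequencer: stamps each message with the next number."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        return (seq, payload)


class Site:
    def __init__(self):
        self.pending = {}       # received, but not yet deliverable
        self.next_expected = 0  # next position in the total order
        self.delivered = []     # messages handed to the application, in order

    def receive(self, message):
        seq, payload = message
        self.pending[seq] = payload
        # Deliver while the next message of the total order is available.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1


sequencer = Sequencer()
payloads = ["Set A = 1", "Set B = 1", "Set B = A * 2"]
messages = [sequencer.stamp(p) for p in payloads]

sites = [Site() for _ in range(3)]
rng = random.Random(42)
for site in sites:
    arrival = messages[:]
    rng.shuffle(arrival)  # the network may reorder messages differently per site
    for message in arrival:
        site.receive(message)

for site in sites:
    print(site.delivered)
```

Note the distinction between receiving and delivering: a site may receive a message early but hold it back until all predecessors in the total order have been delivered.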

Delivery, durability, group

[Figure: a client and an outsider "Mr. X" next to a group of replicas; only members of the group exchange messages. Caption: Send first, possibly delivered second]

The speaker says...

The atomic broadcast properties given are literally copied from the original paper describing the database state machine replication approach. There are two things in them not explained yet. First, atomic broadcast defines properties in terms of message delivery. The delivery property not only ensures total ordering despite slow transport but also covers message loss (MySQL desires uniform agreement here, something better than Corosync) and even the crash and recovery of processors (durability)! A recovering processor must first deliver outstanding messages before it continues. Second, note that atomic broadcast introduces the notion of a group. Only (correct) members of a group can exchange messages.

Deferred Update: the best?

[Figure: generic functional model with clients and replicas, showing five phases]
- Client Request
- Server Coordination
- Execution
- Agreement
- Client Response

The speaker says...

We are almost there. The third building block of the database state machine approach is deferred update database replication. The slide shows a generic functional model used by Pedone and Schiper in 2010 to illustrate their choice of deferred update. The argument goes that deferred update combines the best of the two most prominent object replication techniques: active and passive replication. Only the combination of the best from the two will give both high availability and high performance. Translation: MySQL Group Replication can - in theory - have higher overall throughput than MySQL Replication. Do you love the theory ;-) ? As a DBA you should.

Active Replication (SM)

[Figure: clients and replicas in an active replication setup]
- Client sends op to all
- Requests get ordered
- Execution
- All reply to client

The speaker says...

In an active replication system, a pure state machine replication system, the client operations are forwarded to all replicas and each replica individually executes the operation. The two challenges are to ensure that all replicas execute requests in the same order and that all replicas come to the same decision. Recall that we are talking about multi-threaded database servers here.

A downside is that every replica has to execute the operation. If the operation is expensive in terms of CPU, this can be a waste of CPU time.

Passive Replication

[Figure: clients, one primary and its backups in a passive replication setup]
- Client sends op to primary
- Only primary executes
- Primary forwards changes
- Primary replies to client

The speaker says...

The alternative is passive replication or primary-backup replication. Here, the client talks to only one server, the primary. Only the primary server executes client operations. After computation of the result, the primary forwards the changes to the backups which apply them.

The problem here is that the primary determines the system's throughput. None of the backups can contribute its computing power to the overall system throughput.

What we want...
- for performance: more than one primary
- for scalability: no distributed locking
- ... and of course: transactions

Two-staged transaction protocol

Multi-primary (pass.) replication

[Figure: a client and three primaries; the transaction passes through two stages]
- Transaction processing
- Transaction termination

The speaker says...

Multi-primary (passive) replication has all the ingredients desired. Transaction processing is two staged. First, a client picks any replica to execute a transaction. This replica becomes the primary of the transaction. The transaction executes locally, the stage is called transaction processing. In the second stage, during transaction termination, the primaries jointly decide whether the transaction can commit or must abort. Because updates are not immediately applied, database folks call this deferred update - our last building block.

Deterministic certification
- Reads execute locally, updates get certified
- Certification ensures transaction serializability
- Replicas decide independently about certification result

Deferred Update DB Replication

[Figure: a read is handled by a single primary; a write is executed on one primary, which forwards Rs/Ws/U (readset, writeset, updates) to the other primaries]

The speaker says...

One property of transactions is isolation. Isolation is also known as serializability: the concurrent execution of transactions should be equivalent to a serial execution of the same transactions. In a deferred update system, read transactions are processed and terminated on one replica and serialized locally.

Updates must be certified. After the transaction processing the readset, writeset and updates are sent to all other replicas. The servers then decide in a deterministic procedure whether (one-copy) serializability holds, that is, whether the transaction commits. Because it's a deterministic procedure, the servers can certify transactions independently!

Atomic Broadcast based
- this is what is used, by MySQL, by DBSM
- Optimization: Reordering (atop of Atomic Broadcast). In theory it means fewer transaction aborts
- Optimization limit: Generic Broadcast based. This has issues, which make it nasty

Atomic Commit based
- more transaction aborts than atomic broadcast

Options for termination

The speaker says...

There are several ways of implementing the termination protocol and the certification. There are two truly distinct choices: atomic broadcast and atomic commit. Atomic commit causes more transaction aborts than atomic broadcast. So, it's out and atomic broadcast remains.

Atomic broadcast can in theory be further optimized towards fewer transaction aborts using reordering. For practical matters, this is about where the optimizations end. A weaker (and possibly faster) generic broadcast causes problems in the transactional model. For databases, it could be an over-optimization.

Transactions have a state
- Executing, Committing, Committed, Aborted

Reads are handled locally

Updates are sent to all replicas
- Readset and writeset are forwarded

On each replica: search for 'conflicting' transactions
- Can be serialized with all previous transactions? Commit!
- Commit? Abort local transactions that overlap with the update

Generic certification test

The speaker says...

No matter what termination procedure is used, the basic procedure for certification in the deferred update model is always the same. Updates/writes need certification. The data read and the data written by a transaction are forwarded to all other replicas.

Every replica searches for potentially 'conflicting' transactions, the details depend on the termination procedure. A transaction is decided to commit if it does not violate serializability with all previous transactions. Any local transaction currently running and conflicting with the update is aborted.

Deferred Update Database Replication as a state machine
- Atomic Broadcast based termination

Database State Machine

[Figure: architecture stack. MySQL with plugin services and transaction hooks; on top of it the MySQL Group Replication plugin with Capture, Apply, Recover and the Replication Protocol incl. termination protocol/certifier; below that the Group Communication System]

The speaker says...

The Database State Machine Approach combines all the bits and pieces. Let's do a bottom-up summary. Atomic broadcast not only frees the database developer from bothering with networking APIs, it also solves the nasty bits of communicating in an asynchronous network. It provides properties that meet the requirements of state machine replication. A deterministic state machine is what one needs to implement the termination protocol within deferred update replication. Deferred update replication does not use distributed locking, which Gray proved problematic, and it combines the best of active and passive replication. Side effects: simple replication protocol, layered code.

Updates are sent to all replicas
- Readset and writeset are forwarded

Step 1 - On each replica: certify
- Is there any committed transaction that conflicts?
(In the original paper: check for write-read conflicts between the committing transaction and committed transactions. Does the committing transaction's readset overlap with any committed transaction's writeset? Works slightly differently in MySQL.)

Step 2 - On each replica: commitment
- Apply transactions decided to commit
- Handle concurrent local transactions: remote wins

The termination algorithm

The speaker says...

The termination process has two logical steps, just like the general one presented earlier. The very details of how exactly two transactions are checked for conflicts in the first step don't matter here. MySQL Group Replication is using a refinement of the algorithm tailored to its own needs. As a developer all you need to know is: a remote transaction always wins no matter how expensive local transactions are. And, keep conflicting writes on one replica. It's faster.

The puzzling bit on the slide is the rule to check a committing transaction against any committed transaction for conflicts. Any!? Not any... only concurrent.

What's concurrent?

[Figure: five replicas; transactions 1 and 2 are broadcast and later delivered in the same total order (1, 2) on every replica]

Any other transaction that precedes the current one
- Recall: total ordering
- Recall: asynchronous, delay between broadcast and delivery

The speaker says...

The definition of what concurrent means is a bit tricky. It's defined through a negation and that's confusing at first look but hopefully becomes clear on the next slide. Concurrent to a transaction is any other transaction that does precede it. If we know the order of all transactions - in the entire cluster - then we can tell which transactions precede one another.
Atomic broadcast ensures total order on delivery. Some implementations decide on ordering when sending and that number (logical clock) could be used. Any logical clock works.

Certify against all previous?

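[Figure: five replicas and the total order 2, 3, 4; transaction 4 is broadcast together with the order number (2) of the latest transaction committed on its replica and is certified upon delivery]

Broadcast: Transaction 4 is based on all previous up to 2

Certification when 4 is delivered: Check conflicts with trx > 2 and trx < 4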

The speaker says...

The slide has an example of how to find any other transaction that precedes one. When a transaction enters the committing state and is broadcast, the broadcast includes the logical time (= total order number on the slide) of the latest transaction committed on the replica. Eventually the transaction is delivered on all sites. Upon delivery the certification considers all transactions that happened after the logical time carried by the transaction to be certified. All those transactions precede the one to be certified, they executed concurrently at different replicas. We don't have to look further in the past. Further in the past is stuff that's been decided on already.
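The window check can be sketched as follows. This is an illustration of the readset/writeset test of the original paper under simplifying assumptions, not MySQL's algorithm: the names `Transaction`, `Certifier` and `based_on` are made up, every delivered transaction consumes one total order number, and only committed writesets are remembered.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    name: str
    readset: set
    writeset: set
    based_on: int  # order number of the latest transaction committed on the
                   # executing replica when this transaction was broadcast


@dataclass
class Certifier:
    committed: list = field(default_factory=list)  # (order number, writeset)
    last_order: int = 0

    def certify(self, trx):
        """Return True (commit) or False (abort) for a delivered transaction."""
        self.last_order += 1  # position of trx in the total order
        for order, writeset in self.committed:
            # Look only at the window after trx.based_on: older transactions
            # were already visible to trx when it executed.
            if order > trx.based_on and writeset & trx.readset:
                return False
        self.committed.append((self.last_order, trx.writeset))
        return True


def deliver_all(certifier):
    return [
        certifier.certify(Transaction("t1", {"x"}, {"x"}, based_on=0)),
        certifier.certify(Transaction("t2", {"y"}, {"y"}, based_on=1)),
        certifier.certify(Transaction("t3", {"x"}, {"z"}, based_on=2)),
        certifier.certify(Transaction("t4", {"z"}, {"z"}, based_on=2)),
    ]


# Two replicas certify the same delivery sequence independently.
print(deliver_all(Certifier()))  # [True, True, True, False]
print(deliver_all(Certifier()))  # [True, True, True, False]
```

As on the slide, transaction 4 is based on everything up to 2, so it is only checked against transaction 3. It read z, transaction 3 wrote z, so it aborts on every replica without any extra coordination.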

TIME TO BREATHE
MySQL is different anyway...

The speaker says...

Good news! The algorithm used by MySQL Group Replication is different and simpler. For correctness, the precedes relation is still relevant. But it comes for free...

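A developer's view on commit

[Figure: a client sends BEGIN ... COMMIT to one of five replicas. That replica executes the transaction, broadcasts it (t(3), total order number 4), certifies it and returns the result to the client; the other replicas certify and apply it]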

The speaker says...

We are not done with the theory yet but let's do some slides that take the developer's perspective. Assuming you have to scale a PHP application, assuming a small cluster of a handful of MySQL servers is enough and assuming these servers are co-located on racks, then MySQL Group Replication is your best possible choice.
Did you get this from the theory? Replication is 'synchronous'. On commit you wait only for the server you are connected to. Once your transaction is broadcast, you are done. You don't wait for the other servers to execute the transaction. With uniform atomic broadcast, once your transaction is broadcast, it cannot get lost. (That's why I torture you with theory.)

MySQL Replication

[Figure: a client sends BEGIN ... COMMIT to the master, which executes the transaction and replies OK; the slaves fetch the changes from the bin log etc. and apply them]

The speaker says...

If your network is slow, or if mother earth, the speed of light and the network message round-trip time add too much to your transaction execution time, then asynchronous MySQL Replication is a better choice.

In MySQL Replication the master (primary) never waits for the network. Not even to broadcast updates. Slaves asynchronously pull changes. Besides pushing work onto the developer, this approach has the downside that a hardware crash on the master can cause transaction loss. Slaves may or may not have pulled the latest data.

MySQL Semi-sync Replication

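[Figure: a client sends BEGIN ... COMMIT to the master, which executes the transaction; the slaves fetch the changes from the bin log and apply them. The master waits for the first ACK from a slave before it replies OK to the client]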

The speaker says...

In the times of MySQL 5.0 the MySQL Community suggested that to avoid transaction loss the master should wait for one slave to acknowledge it has fetched the update from the master. The fact that it's fetched does not mean that it's been applied. The update may not be visible to clients yet.

It is a back and forth whether database replication should be asynchronous or not. It depends on your needs.

Back to theory after this break.

Back to theory!
Virtual Synchrony?

Groups and views
- A turbo-charged version of Atomic Broadcast

Virtual Synchrony

[Figure: processes P1 to P4 over time, with messages M1 to M4 and a view change (VC) from G1 = {P1, P2, P3} to G2 = {P1, P2, P3, P4}]

The speaker says...

Good news! Virtual Synchrony and Atomic Broadcast are almost the same. Our Atomic Broadcast definition assumes a static group. Adding group members, removing members or detecting failed ones is not covered.

Virtual Synchrony handles all these membership changes. Whenever an existing group agrees on changes, a new view is installed through a view change (VC) event. (The term 'virtual': it's not synchronous. There is a delay - we don't want to wait for short message delays. Yet, the system appears to be synchronous to most real life observers.)

View changes act as a message barrier
- That's a case causing troubles in Two-Phase Commit

Virtual Synchrony

[Figure: processes P1 to P4 over time, with messages M5 to M8 and a view change (VC) from G2 = {P1, P2, P3, P4} to G3 = {P1, P2, P3}]

The speaker says...

View changes are message barriers. If the group members suspect a member to have failed they install a new view.

Maybe the former member was not dead but just too slow to respond, or disconnected for a brief period. False alarm. The former member then tries to broadcast some updates. Virtual Synchrony ensures that the updates will not be seen by the remaining members. Furthermore the former member will realize that it was excluded. Some GCS implementing virtual synchrony even provide abstractions that ensure a joining member learns all updates it missed (state transfer) before it rejoins.
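The barrier can be pictured with a toy model in which every message is tagged with the view it was sent in. The class `Member` and its methods are hypothetical, and the message names are illustrative. A real group communication system does much more (agreeing on the view, flushing in-flight messages before installing it); none of that is modeled here.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.view_id = 0
        self.members = set()
        self.delivered = []

    def install_view(self, view_id, members):
        self.view_id = view_id
        self.members = set(members)

    def receive(self, sender, sender_view_id, payload):
        # Deliver only messages sent in the current view by a current member.
        if sender_view_id == self.view_id and sender in self.members:
            self.delivered.append(payload)
            return True
        return False


names = ["P1", "P2", "P3", "P4"]
group = {name: Member(name) for name in names}
for member in group.values():
    member.install_view(2, names)                  # G2 = {P1, P2, P3, P4}

group["P1"].receive("P4", 2, "M5")                 # fine: P4 is part of G2

for name in ["P1", "P2", "P3"]:
    group[name].install_view(3, ["P1", "P2", "P3"])  # G3 excludes P4

late = group["P1"].receive("P4", 2, "M6")          # P4 still believes in G2
in_view = group["P2"].receive("P1", 3, "M7")       # regular traffic within G3
seen_by_p4 = group["P4"].receive("P1", 3, "M7")    # P4 is not part of G3
print(late, in_view, seen_by_p4)                   # False True False
```

In this sketch the excluded member notices the problem because traffic arrives tagged with a view it never installed.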

Auto-everything: failover

[Figure: a group of MySQL servers]

MySQL Group Replication has a pluggable GCS API
- Split brain handling? Depends on GCS and/or GCS config
- Default GCS is Corosync

The speaker says...

Good news! The Virtual Synchrony group membership advantages are fully exposed to the user level: node failures are detected and handled automatically. PECL/mysqlnd_ms can help you with the client side. It's a minor tweak to have it automatically learn about the remaining MySQL servers. Expect an update release soon.

MySQL Group Replication works with any Group Communication system that can be accessed from C and implements Virtual Synchrony. The default choice is Corosync. Split brain handling is GCS dependent. MySQL follows view change notifications of the GCS.

Auto-everything: joining

[Figure: a group of MySQL servers; a Donor performs a state transfer to a Joiner]

Elastic cluster grows and shrinks on demand
- State transfer done via asynch replication channel

The speaker says...

Good news! When adding a server you don't fiddle with the very details. You start the server, tell it to join the cluster and wait for it to catch up. The server picks a donor, begins fetching updates using much of the existing MySQL Replication code infrastructure and that's it.

Back to theory!
Generalized Snapshot Isolation

Transaction read set does not need to be broadcast
- Readset is hard to extract and can be huge
- Weaker serializability level than 1SR
- Sufficient for InnoDB default isolation

Deferred Update tweak

[Figure: a read is handled by a single primary; a write is executed on one primary, which forwards V/Ws/U to the other primaries]

The speaker says...

Good news! This is the last bit of theory. The original Database State Machine proposal was followed by a simpler-to-implement proposal in 2005. If the cluster's serialization level is marginally lowered to snapshot isolation, certification becomes easier. Generalized snapshot isolation can be achieved without having to broadcast the readset of transactions. Recording the readset of a transaction is difficult in most existing databases. Also, readsets can be huge. Snapshot isolation is an isolation level for multi-version concurrency control. MVCC? InnoDB! Somehow... Whatever - this is the base algorithm for termination in MySQL Group Replication.

Concurrent and write conflict? First committer wins!
- Reads use snapshot from the beginning of the transaction

Snapshot Isolation

[Figure: timeline of two concurrent transactions T1 and T2 that conflict because both change x]
T1: BEGIN(v1), W(v1, x=1), COMMIT!, x:v2=1 (first committer)
T2: BEGIN(v1), W(v1, x=2), , , COMMIT? (concurrent write, version 1)

The speaker says...

In Snapshot Isolation, transactions take a snapshot when they begin. All reads return data from this snapshot. Although any other concurrent transaction may update the underlying data while the transaction still runs, the change is invisible; the transaction runs in isolation. If two concurrent transactions change the same data item they conflict. In case of conflicts, the first committer wins.

MVCC requires that as part of the update of a data item its version is incremented. Future transactions will base their snapshot on the new version.
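A single-node sketch of these rules, assuming one global version counter and a full copy of the data as the snapshot (`Store` and `Txn` are made-up names; real MVCC engines such as InnoDB work very differently internally):

```python
class Store:
    def __init__(self, data):
        self.version = 1
        self.data = dict(data)   # latest committed values
        self.last_written = {}   # key -> version created by its last commit

    def begin(self):
        return Txn(self, self.version, dict(self.data))


class Txn:
    def __init__(self, store, start_version, snapshot):
        self.store = store
        self.start_version = start_version
        self.snapshot = snapshot
        self.writes = {}

    def read(self, key):
        # Own writes first, then the snapshot taken at BEGIN.
        return self.writes.get(key, self.snapshot.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key in self.writes:
            if self.store.last_written.get(key, 0) > self.start_version:
                return False  # a concurrent transaction committed this key first
        self.store.version += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.last_written[key] = self.store.version
        return True


store = Store({"x": 0})
t1 = store.begin()                # BEGIN(v1)
t2 = store.begin()                # BEGIN(v1)
t1.write("x", 1)
t1_ok = t1.commit()               # first committer, creates version 2
t2_sees = t2.read("x")            # still the snapshot value
t2.write("x", 2)
t2_ok = t2.commit()               # concurrent write conflict: must abort
print(t1_ok, t2_sees, t2_ok, store.data)  # True 0 False {'x': 1}
```

T2 never sees T1's change, and it loses at commit time because x was committed by a transaction that started from the same version.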

The actual termination protocol

[Figure: five replicas; Write(v2, x=1) is certified against the certification index below and the result is OK]

Object | Latest version
x | 1
y | 13

The speaker says...

Every replica checks the version of a write during certification. It compares the version number of the write's data item with the latest it knows of. If the version is higher than or equal to the one found in the replica's certification index, the write is accepted. A lower number indicates that someone has already updated the data item before. Because the first committer must win, a write showing a lower version number than is in the certification index must abort.

(The certification index fills over time and is truncated periodically by MySQL. MySQL reports the size through Performance Schema tables.)
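The version check can be sketched with the numbers from the slide. The function `certify` and its parameters are hypothetical. In particular, passing in a `commit_version` for the new version of the data item is an assumption of this sketch; how MySQL actually derives and stores versions is not described here.

```python
def certify(index, key, write_version, commit_version):
    """Accept a write if it is based on the latest known version of `key`.

    index          -- certification index: object -> latest version
    write_version  -- version the write is based on, e.g. 2 for Write(v2, x=1)
    commit_version -- version recorded for `key` if the write is accepted
                      (an assumption made for this sketch)
    """
    latest = index.get(key, 0)
    if write_version >= latest:
        index[key] = commit_version
        return True
    return False  # someone updated `key` first: first committer wins


def replay(index):
    return [
        certify(index, "x", 2, commit_version=14),   # 2 >= 1: accepted
        certify(index, "y", 2, commit_version=15),   # 2 < 13: aborted
        certify(index, "x", 13, commit_version=15),  # 13 < 14: aborted
        certify(index, "x", 14, commit_version=15),  # 14 >= 14: accepted
    ]


# Every replica starts from the same index and sees the same total order,
# so every replica reaches the same decisions on its own.
replica_a = {"x": 1, "y": 13}
replica_b = {"x": 1, "y": 13}
print(replay(replica_a))  # [True, False, False, True]
print(replay(replica_b))  # [True, False, False, True]
```

No readset appears anywhere in this check, which is the simplification the snapshot isolation variant buys.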

Hmm...
Does it work?

It's a preview: there are limits

General
- InnoDB only
- Corosync lacks uniform agreement
- No rules to prevent split-brain (it's a preview, you're allowed to fool yourself if you misconfigure the GCS!)

Isolation level
- Primary Key based
- Foreign Keys and Unique Keys not supported yet
- No concurrent DDL

That's it, folks!
Questions?

The speaker says...

(Oh, a question. Flips slide)

Network messages pffft!

@markcallaghan (Sep 30): For MySQL sync replication, when all commits originate from 1 master is there 1 network round trip or 2? http://mysqlhighavailability.com/mysql-group-replication-hello-world

@Ulf_Wendel: @markcallaghan AFAIK, on the logical level, there should be one. Some of your questions might depend on the GCS used. The GCS is pluggable

@markcallaghan: @Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I remain confused

(@markcallaghan: MySQL super hero at Facebook)

GCS != MySQL Semi-sync

[Figure: MySQL servers, each connected to a Corosync instance]

It's many round trips, how many depends on GCS
- Default GCS is Corosync, Corosync is Totem Ring
- Corosync uses a privilege-based approach for total ordering
- Many options: fixed sequencer, moving sequencer, ...

Where you run your updates only impacts collision rate

The speaker says...

No Mark, MySQL Group Replication cannot be understood as a replacement for MySQL Semi-sync Replication. The question about network round trips is hard to answer. Atomic Broadcast and Virtual Synchrony stack many subprotocols together. Let's consider a stable group, no network failure, Totem. Totem orders messages using a token that circulates along a virtual ring of all members. Whoever has the token has the privilege to broadcast. Others wait for the token to appear. Atomic Broadcast gives us all-or-nothing messaging. It takes at least another full round on the ring to be sure the broadcast has been received by all. How many round trips is that? Welcome to distributed systems...
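The privilege-based idea can be sketched as a circulating token that carries the next sequence number. This is a heavily simplified illustration and not Corosync's or Totem's actual protocol: the function `token_ring_order` and the `max_per_visit` limit are made up, and acknowledgements, retransmission and failures are left out.

```python
from collections import deque


def token_ring_order(queues, max_per_visit=1):
    """Order messages by passing a token around a ring of members.

    queues -- dict: member name -> messages waiting to be broadcast
    Returns the totally ordered messages and the number of token rotations.
    """
    ring = list(queues)
    pending = {member: deque(msgs) for member, msgs in queues.items()}
    seq = 0
    ordered = []
    rotations = 0
    while any(pending.values()):
        rotations += 1
        for member in ring:                 # the token visits members in ring order
            for _ in range(max_per_visit):  # only the token holder may broadcast
                if not pending[member]:
                    break
                ordered.append((seq, member, pending[member].popleft()))
                seq += 1
    return ordered, rotations


ordered, rotations = token_ring_order({"P1": ["a"], "P2": [], "P3": ["b", "c"]})
print(ordered)    # [(0, 'P1', 'a'), (1, 'P3', 'b'), (2, 'P3', 'c')]
print(rotations)  # 2
```

Even in this toy version the latency of a broadcast depends on where the token currently is, which is why a simple round-trip count is hard to give.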

THE END

Contact: [email protected]

The speaker says...

Thank you for your attendance!
Upcoming shows:

Talk&Show! - YourPlace, any time