Testing for DR Failover Testing Zehua Liu Zendesk Singapore SRECon Asia / Australia 2017 23 May 2017
Page 1:

Testing for DR Failover Testing

Zehua Liu

Zendesk Singapore

SRECon Asia / Australia 2017

23 May 2017

Page 2:

About Me

● Zehua Liu
  ○ With Zendesk Singapore since 2015
  ○ Worked at startups at various stages (Atlassian, mig33, Circos Brand Karma)
  ○ Leads the tooling team at Zendesk SG

Page 3:

Disaster Recovery Failover Testing

The Parent Problem

Failing over from the production data centre to the DR data centre

Page 4:

The Parent Problem

● A type of DiRT (Disaster Recovery Testing)
● Part of the BCDR project
  ○ Business Continuity and Disaster Recovery
● Our focus here
  ○ Testing loss of the data centre
  ○ Testing only customer-facing features
    ■ Internal tools are excluded

Page 5:

Why conduct DR failover testing

● Compliance - SOC2 Testing twice a year
● Customer Agreements: Advanced Security Add-On
  ○ Recovery Time Objective - 8 hours
  ○ Recovery Point Objective - 0 hours
● Test and verify the procedures and documentation
● Identify gaps
● Improve the overall DR process
● Training for Responding Parties

Page 6:

Past attempts at DR failover testing

● Two DR failover testing exercises
  ○ Four DR failover tests
● Encountered various issues
  ○ Infrastructure, e.g., database, network
  ○ Configuration
  ○ Application: couldn’t handle failures in infrastructure
● Examples of issues
  ○ Double billing customers
  ○ iOS app did not work
  ○ DB replication back to original production was too slow

Page 7:

The Problem

Can we increase our confidence in DR Failover Testing?

Page 8:

Test the DR environment before failing over

The Answer

Page 9:

Testing the DR environment

● Ideal: automated testing while DR is still in standby mode
  ○ Run the exact same tests that we run for production
  ○ Automatically triggered after a change to DR
● Issues:
  ○ Most tests inevitably write data about the test accounts to the DBs in DR
  ○ Run just the read-only tests?
● The big question:
  ○ Should we allow direct write into data stores in DR??

Page 10:

Should we allow direct write into data stores in DR?

The Big Question

Page 11:

Testing the DR environment

● The big question:
  ○ Should we allow direct write into data stores in DR??
● A trade-off between risk of production failure and risk of failed DR failover
  ○ writing to DR DB => risk of production failure
  ○ test coverage => risk of failed DR failover

Page 12:

Zendesk Chat Technical Architecture

[Diagram. Labels shown: Widget, Dashboard, Mobile Apps, Mobile SDK, Cloudflare, Mediators (US, DE, SG), Static Assets, Web Servers, Data Centre, Data API Service, Live Chat Service, Account Service, Consul, ElasticSearch, Memcached, MySQL, Redis, Riak Cluster.]

Page 13:

DR Failover

[Diagram. Labels shown: Widget, Dashboard, Mobile Apps, Mobile SDK, Cloudflare, Mediators (US, DE, SG), Static Assets, and two sites, Production Data Centre and DR Data Centre, each with Web, API, LC, Acct, ES, MC, MySQL, Redis and Riak.]

Page 14:

Zendesk Chat Technical Architecture

[Diagram. Labels shown: Core Services, Web Servers, Data API Service, Live Chat Service, Account Service, Consul, ElasticSearch, Memcached, MySQL, Redis, Riak Cluster.]

Page 15:

Data Stores

● MySQL
  ○ master ⇒ slave replication (DR DB as read-only slave)
  ○ Least confident: writes might cause data corruption, stop replication, etc.
● Riak
  ○ Commercial license with multi-DC sync support
● ElasticSearch
  ○ Could be rebuilt from source of truth
● Redis: ephemeral data
● Memcached: cold start?

Page 16:

The Approach

● Good news
  ○ The applications mostly partition data by accounts!
  ○ We could use a dedicated set of test accounts that would never get used on prod
    ■ In theory, this test data is isolated from other customer account data in data stores
    ■ Good to replicate back and forth between DR and production MySQL DBs
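A minimal sketch of how a test harness might enforce the dedicated-account rule before any write reaches a DR data store. The talk does not say how the rule is enforced, so the account IDs, `TEST_ACCOUNT_IDS` and `guard_write` below are hypothetical names chosen for illustration.

```python
# Hypothetical allow-list of account IDs reserved for DR testing.
# These IDs are assumptions for illustration, not real accounts.
TEST_ACCOUNT_IDS = frozenset({900001, 900002, 900003})


class NonTestAccountError(Exception):
    """Raised when a test tries to write data for a non-test account."""


def guard_write(account_id: int) -> int:
    """Return account_id if it is a dedicated test account, else raise."""
    if account_id not in TEST_ACCOUNT_IDS:
        raise NonTestAccountError(
            f"account {account_id} is not a dedicated DR test account"
        )
    return account_id


if __name__ == "__main__":
    print(guard_write(900001))  # allowed: a dedicated test account
    try:
        guard_write(42)  # rejected: not on the allow-list
    except NonTestAccountError as exc:
        print("blocked:", exc)
```

Because the applications partition data by account, a check like this keeps test writes inside rows that no customer account shares.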

Page 17:

Alternatives

● Avoid writing to the real DR DBs?
● Allow writing to only less risky DBs?
● Allow writing to all DBs

Page 18:

Alternatives - Avoid writing to the real DR DBs?

● Set up a different set of test data store servers
  ○ Configure the apps to use them only during test
  ○ Switch back before the actual failover
  ○ Does not test the physical connection

[Diagram. Labels shown: Core Services, Data API Service, Live Chat Service, Account Service, Consul, and two sets of data stores (ElasticSearch, Memcached, MySQL, Redis, Riak Cluster) marked Test DBs and DR DBs.]

Page 19:

Alternatives - Avoid writing to the real DR DBs?

● Set up the different set of DBs on the same physical servers as the real ones
  ○ Naming tricks:
    ■ test_account_db to mirror account database
    ■ test_chat_history for ES indices, etc.
  ○ Covers the physical connection
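The naming trick could be sketched as a single prefixing helper, so that every test database or index name is derived from the real one rather than typed by hand. The real names `account_db` and `chat_history` are assumptions inferred from the `test_` names on the slide, and `to_test_name` is a hypothetical helper.

```python
TEST_PREFIX = "test_"


def to_test_name(real_name: str) -> str:
    """Derive the test-only database or index name for a real one."""
    if real_name.startswith(TEST_PREFIX):
        # Refuse to double-prefix, which would hide a configuration mistake.
        raise ValueError(f"{real_name!r} already looks like a test name")
    return TEST_PREFIX + real_name


if __name__ == "__main__":
    print(to_test_name("account_db"))    # test_account_db
    print(to_test_name("chat_history"))  # test_chat_history
```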

Page 20:

Alternatives - Avoid writing to the real DR DBs?

● Set up the different set of DBs on the same physical servers as the real ones

[Diagram. Labels shown: Core Services, Data API Service, Live Chat Service, Account Service, Consul, and two sets of data stores (ElasticSearch, Memcached, MySQL, Redis, Riak Cluster) marked Test DBs and DR DBs.]

Page 21:

Alternatives - Allow partial writes

● Use the real ones for all DBs, except MySQL
  ○ Use a test DB for MySQL
    ■ MySQL is the riskiest one to allow writes to
  ○ Set up the test DB as a writable slave of the DR DB?

[Diagram. Labels shown: Core Services, Data API Service, Live Chat Service, Account Service, Consul, the DR DBs (ElasticSearch, Memcached, MySQL, Redis, Riak Cluster) and a separate MySQL Test DB.]

Page 22:

Alternatives - Writing to real DR DBs!

● Use all real ones!
  ○ Data in DR DB will eventually have to be replicated back to production DB
  ○ Risk of test data in DR causing conflicts when replicated back to production DB

[Diagram. Labels shown: Core Services, Data API Service, Live Chat Service, Account Service, Consul, and the DR DBs (ElasticSearch, Memcached, MySQL, Redis, Riak Cluster).]

Page 23:

Testing the DR environment

● The big question:
  ○ Should we allow direct write into data stores in DR??
● A trade-off between risk of production failure and risk of failed DR failover
  ○ writing to DR DB => risk of production failure
    ■ Yes, let’s do it!
  ○ test strategy/coverage => risk of failed DR failover
    ■ ?

Page 24:

The Final Proposal

● More issues:
  ○ Some tables use an auto-increment column as primary key
  ○ Insertion into those tables in DR ⇒ replication conflicts
● Solutions:
  ○ Play with auto_increment_increment and auto_increment_offset
  ○ Avoid insertion into those tables
    ■ Identify those tables and avoid running tests that create new data in them
    ■ Luckily there are only a few non-critical ones
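To see why the first solution avoids conflicts, the sketch below models only the arithmetic behind MySQL's `auto_increment_increment` and `auto_increment_offset`: generated keys follow offset, offset + increment, offset + 2 × increment, and so on. Giving production offset 1 and DR offset 2 with a shared increment of 2 is an assumption for illustration; the talk does not state which values, if any, were tried.

```python
from itertools import count, islice
from typing import Iterator, List


def auto_increment_ids(increment: int, offset: int) -> Iterator[int]:
    """Return an iterator over offset, offset + increment, offset + 2*increment, ...

    Simplification: this rejects offset > increment, whereas MySQL
    ignores the offset in that case.
    """
    if not 1 <= offset <= increment:
        raise ValueError("expected 1 <= offset <= increment")
    return count(start=offset, step=increment)


def first_ids(increment: int, offset: int, n: int) -> List[int]:
    """Collect the first n generated IDs for one server."""
    return list(islice(auto_increment_ids(increment, offset), n))


if __name__ == "__main__":
    prod = first_ids(increment=2, offset=1, n=5)  # [1, 3, 5, 7, 9]
    dr = first_ids(increment=2, offset=2, n=5)    # [2, 4, 6, 8, 10]
    print(prod, dr, set(prod) & set(dr))          # the intersection is empty
```

With interleaved sequences, an insert on either side can never pick a key the other side will generate, which is what removes the replication conflict.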

Page 25:

The Final Proposal

● More issues:
  ○ Someone might run the excluded tests and create new rows in the auto-increment tables in DR!
● Solution:
  ○ Use a different user with restricted permissions
  ○ Switch back to a full-access user before failover
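One way the restricted user could be provisioned is with per-table grants that leave out INSERT on the auto-increment tables, since MySQL privileges are additive and cannot subtract a privilege from a broader grant. The sketch below only generates the `GRANT` statements; the database, user, host and table names are hypothetical, and the identifiers are assumed to be trusted (no escaping is done).

```python
from typing import Dict, List

# Hypothetical schema: table name -> True if its primary key is auto-increment.
TABLES: Dict[str, bool] = {
    "accounts": False,
    "chat_sessions": False,
    "audit_log": True,
}


def restricted_grants(
    database: str, user: str, host: str, tables: Dict[str, bool]
) -> List[str]:
    """Build one GRANT per table, omitting INSERT on auto-increment tables."""
    statements = []
    for table, is_auto_increment in sorted(tables.items()):
        privileges = ["SELECT", "UPDATE", "DELETE"]
        if not is_auto_increment:
            privileges.insert(1, "INSERT")
        statements.append(
            f"GRANT {', '.join(privileges)} "
            f"ON `{database}`.`{table}` TO '{user}'@'{host}';"
        )
    return statements


if __name__ == "__main__":
    for statement in restricted_grants("chat", "dr_test", "%", TABLES):
        print(statement)
```

With grants like these, an excluded test that is run by mistake fails with a permission error instead of creating a row that later conflicts in replication.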

Page 26:

The Final Proposal

● DR apps use real DR DBs
  ○ No test DBs in DR
  ○ Same configuration as production
● MySQL master-master replication between prod and DR
● Avoid doing insertion in tables with auto-increment pkey
  ○ Exclude integration tests that do such insertions
  ○ Set up a MySQL user with restricted access
● We could run end-to-end browser tests against DR while it’s in standby mode!
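Excluding the risky integration tests could be sketched as a filter over test metadata. This assumes each test declares which tables it inserts into; that metadata, the `IntegrationTest` type and the table names are hypothetical, since the talk does not describe how the excluded tests were identified.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

# Hypothetical set of tables whose primary key is auto-increment.
AUTO_INCREMENT_TABLES: FrozenSet[str] = frozenset({"audit_log", "invoices"})


@dataclass(frozen=True)
class IntegrationTest:
    name: str
    inserts_into: FrozenSet[str] = frozenset()


def safe_for_standby_dr(
    tests: Iterable[IntegrationTest],
    auto_increment_tables: FrozenSet[str] = AUTO_INCREMENT_TABLES,
) -> List[IntegrationTest]:
    """Keep only tests that never insert into an auto-increment table."""
    return [t for t in tests if not (t.inserts_into & auto_increment_tables)]


if __name__ == "__main__":
    suite = [
        IntegrationTest("send_chat_message", frozenset({"chat_sessions"})),
        IntegrationTest("create_invoice", frozenset({"invoices"})),
        IntegrationTest("load_dashboard"),
    ]
    print([t.name for t in safe_for_standby_dr(suite)])
    # ['send_chat_message', 'load_dashboard']
```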

Page 27:

The Final Proposal

● The trade-off between risk of production failure and risk of failed DR failover
  ○ writing to DR DB => low risk of production failure
    ■ Replication might fail, but we would know it early
  ○ test strategy/coverage => low risk of failed DR failover
    ■ Application on DR might fail in the excluded test cases, but not critical
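"We would know it early" implies some monitoring of replication. A minimal sketch, assuming the monitor already has one row of `SHOW SLAVE STATUS` output as a dictionary of strings: the field names are the ones MySQL reports, while the 60-second lag threshold and the `replication_problems` helper are assumptions for illustration.

```python
from typing import Dict, List, Optional


def replication_problems(
    status: Dict[str, Optional[str]], max_lag_seconds: int = 60
) -> List[str]:
    """Return human-readable problems found in one SHOW SLAVE STATUS row."""
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when replication is not running.
        problems.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replication lag is {lag}s")
    if status.get("Last_Error"):
        problems.append(f"last error: {status['Last_Error']}")
    return problems


if __name__ == "__main__":
    broken = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "No",
        "Seconds_Behind_Master": None,
        "Last_Error": "Duplicate entry",
    }
    for problem in replication_problems(broken):
        print(problem)
```

Running a check like this in both directions after each DR test run would surface a conflict caused by test data well before an actual failover.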

Page 28:

The Final Proposal - Caveats

● Does not cover all aspects of DR failover readiness
  ○ Only functional tests
  ○ A bit of network link testing via MySQL replication
● Adds to the complexity of DR failover
  ○ More steps to be performed during the failover

Page 29:

Conclusion

● It is possible to test the DR env in standby mode
● It is a trade-off between risk of production failure and risk of failed DR failover
● Avoid using auto-increment keys if multi-DC support is needed

Page 30:

Questions?