Zero Downtime Postgres Upgrades

Post on 11-Apr-2017

Category:

Engineering

Transcript

Zero-downtime Postgres upgrades

Restarting databases without the apps noticing

@ChrisSinjo

GOCARDLESS

POST /cash/monies HTTP/1.1

{ amount: 100 }

💰💰💰

High 💵 per-request

Uptime is 🔑

Good durability guarantees

Feature-cautious

Transactions are cool

“Speak to this one node.”

–Postgres

[Diagram: Client, Postgres]

[Diagram: Client, Postgres, Postgres, Replication]

Wake a human up

[Diagram sequence: Client, Postgres, Postgres; the Replication label is present, absent for two frames, then present again]

Awful time-to-recovery

Error-prone

You gotta perform:

- Many steps
- In the right order
- Perfectly

Don’t make a tired SRE think

Add automation

Pacemaker

A clustering tool

[Diagram: Client, Postgres, Postgres, Replication]

How do we know a node has failed?

Jepsen: https://aphyr.com/tags/jepsen

https://aphyr.com/posts/317-jepsen-elasticsearch
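The slides ask how a failed node is detected and then move from two Postgres nodes to three. One common reason for an odd-sized cluster is majority quorum: a node may only act on a suspected failure if it can still see more than half of the cluster. The slides do not spell this out, so the sketch below is an assumption about the reasoning, and `has_quorum` is a hypothetical helper, not a Pacemaker API.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True when this side of a partition holds a strict majority.

    reachable_nodes counts the local node plus every peer it can still see.
    """
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes > cluster_size // 2


# With two nodes, a network split leaves each side seeing only itself:
# neither side has a majority, so neither can safely decide the other is dead.
two_node_split = has_quorum(1, 2)

# With three nodes, the side that still sees two nodes may proceed,
# while the isolated node must stand down.
three_node_majority = has_quorum(2, 3)
three_node_isolated = has_quorum(1, 3)
```

Under this rule, at most one side of any split can hold a majority, which is what prevents two nodes from both acting as primary.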

[Diagram sequence: Client, Postgres, Postgres, Replication; then Client and three Postgres nodes with two Repl links; then three Pacemaker instances are added; then a VIP. In later frames both Repl links are absent, then one is present, then both are present again.]

$💯

Seems hard, right?

It kinda is

You gotta know:

- Postgres
- Distributed systems
- Pacemaker

Get someone else to run it for you

[Diagram sequence: Client, VIP, three Postgres nodes, three Pacemaker instances; two Repl links in the first frame, none in the following three]

Every move means a connection reset

Every move means dropped requests

POST /cash/monies HTTP/1.1

{ amount: 100 }

💰💰💰

POST /cash/monies HTTP/1.1

{ amount: 100 }

500 Internal Server Error

What does this mean for upgrades?

[Diagram sequence: Client, VIP, three Postgres nodes, three Pacemaker instances. Versions shown: 9.4.9, 9.4.9, 9.4.9 with two Repl links; then 9.4.10, 9.4.9, 9.4.10.]

Every upgrade means a connection reset

Every upgrade means dropped requests

POST /cash/monies HTTP/1.1

{ amount: 100 }

500 Internal Server Error

Solution: never upgrade

🙄

Not upgrading is never an option

Solution: ???

1 thing missing

[Diagram sequence: Client, VIP, three Postgres nodes, three Pacemaker instances; then three PgBouncer instances are added; the last frame adds a second VIP]

PgBouncer has This One Weird Trick™

PAUSE;
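As a minimal sketch of how the trick can be driven from code: PgBouncer's admin console accepts `PAUSE;` and `RESUME;`. While paused, PgBouncer waits for in-flight work to finish, then holds new client queries instead of failing them. The helper below is hypothetical (`paused` and `admin_conn` are names invented here). It assumes `admin_conn` is a DB-API style connection to the admin console (the special `pgbouncer` database) opened in autocommit mode, for example via a driver such as psycopg2.

```python
from contextlib import contextmanager


@contextmanager
def paused(admin_conn):
    """Keep PgBouncer paused for the duration of the with-block.

    Assumption: admin_conn is connected to PgBouncer's admin console,
    not to a regular application database.
    """
    cur = admin_conn.cursor()
    cur.execute("PAUSE;")  # returns once in-flight work has drained
    try:
        yield
    finally:
        # Always resume, even if the work inside the block failed;
        # otherwise clients would stay queued indefinitely.
        cur.execute("RESUME;")
```

Usage would look like `with paused(conn): move_primary()`, where `move_primary` stands in for whatever moves the database behind PgBouncer. The `finally` clause is the design choice worth keeping: a failed switch should still release the waiting clients.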

[Diagram sequence: Client, two VIPs, three PgBouncer instances, three Postgres nodes, three Pacemaker instances; PAUSE; appears from the second frame onward]

So what does this mean for upgrades?

[Diagram sequence: Client, two VIPs, three PgBouncer instances, three Postgres nodes. Versions shown: 9.4.10, 9.4.9, 9.4.10; then PAUSE;; then RESUME;; then 9.4.10, 9.4.10, 9.4.10.]
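The order of operations in this sequence can be sketched as a plan: upgrade the standbys first, pause traffic, switch the primary, resume, then upgrade the old primary. The function and node names below are hypothetical, and the exact order of "promote" versus "move VIP" inside the pause is an assumption; the slides only show that both happen between `PAUSE;` and `RESUME;`.

```python
def rolling_upgrade_plan(nodes, primary, new_primary):
    """Return the ordered steps for a rolling minor-version upgrade.

    nodes:       all Postgres node names in the cluster
    primary:     the node currently serving writes
    new_primary: the already-upgraded standby that will take over
    """
    if primary not in nodes or new_primary not in nodes:
        raise ValueError("primary and new_primary must both be cluster nodes")
    if new_primary == primary:
        raise ValueError("new_primary must be a standby")

    # Standbys can be restarted freely: no client is connected to them.
    steps = [f"upgrade {node}" for node in nodes if node != primary]
    steps += [
        "PAUSE;",                        # PgBouncer holds client queries
        f"promote {new_primary}",
        f"move VIP to {new_primary}",
        "RESUME;",                       # held queries continue on the new primary
        f"upgrade {primary}",            # the old primary is now just a standby
    ]
    return steps


plan = rolling_upgrade_plan(["pg1", "pg2", "pg3"], primary="pg2", new_primary="pg1")
```

The only window in which clients wait is between `PAUSE;` and `RESUME;`; every restart of Postgres itself happens on a node that is not serving traffic.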

$💯

Caveats

Minor versions

9.4.9 → 9.4.10

pglogical

Long-running transactions

while running_queries():
    if now() > timeout:
        abandon_migration()  # stop here; do not promote
    else:
        sleep(0.1)

promote_new_primary()
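A runnable version of the loop above might look like the following sketch. The names `wait_for_queries` and `count_running` are invented for illustration, as are the default timeout and the injectable `clock` and `sleep` parameters (which make the loop testable without waiting in real time). How running queries are actually counted is left to the caller.

```python
import time


def wait_for_queries(count_running, timeout_s=5.0, poll_s=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until no queries are running, or give up after timeout_s.

    count_running: callable returning the number of queries still in flight.
    Returns True if it is safe to promote, False if the migration
    should be abandoned.
    """
    deadline = clock() + timeout_s
    while count_running() > 0:
        if clock() > deadline:
            return False  # a long-running transaction is blocking the switch
        sleep(poll_s)
    return True
```

A caller would promote the new primary only when this returns True, and otherwise resume traffic on the old primary and retry later.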

Pause length

7-10s total

$💯

One more thing… (#sorrynotsorry)

github.com/gocardless/our-postgresql-setup

We’re hiring✌❤

@ChrisSinjo @GoCardlessEng

Thank you✌❤

Questions?✌❤
