HA for OpenStack, from the control plane to instances Theory Adam Spiers Senior Software Engineer [email protected] Charles Wang Software Engineer [email protected]

Source: charleswang.us/susecon2017/Files/HO128394.pdf

Aug 11, 2020


HA for OpenStack, from the control plane to instances

Theory

Adam Spiers

Senior Software Engineer

[email protected]

Charles Wang

Software Engineer

[email protected]


Agenda

● HA in a Typical OpenStack Cloud Today
● When do we need HA for Compute Nodes?
● Architectural Challenges
● Solution in SUSE OpenStack Cloud


HA in OpenStack Today


Typical HA Control Plane

● Automatic restart of controller services
● Increases uptime of cloud

[Diagram: an HA control plane. A three-node services cluster (Node 1 to Node 3) runs Orchestration, Keystone, Glance, Nova, Dashboard, Cinder, Telemetry and Neutron; a two-node database cluster runs the database and message queue on DRBD or shared storage. A variant shows a Pacemaker cluster of control nodes running PostgreSQL on DRBD, RabbitMQ, Keystone, Glance, Nova, Dashboard, Cinder and Neutron.]


Under the Covers

● Recommended by official HA guide

[Diagram: services cluster of Node 1, Node 2 and Node 3 running Corosync, Pacemaker and HAProxy, stamped "SOLVED (mostly)".]


If Only the Control Plane is HA...

[Diagram: an HA cluster of control nodes running the message queue, database, Identity, Image, Block storage, Networking, Dashboard and Compute services on the OS; outside the cluster, three compute nodes each running nova-compute and libvirt on the OS.]


When is Compute HA important?


Addressing the White Elephant in the Room

Pets in the cloud?!


Pets vs Cattle

Pets:
● Pets are given names like mittens.mycompany.com
● Each one is unique, lovingly hand-raised and cared for
● When they get ill, you nurse them back to health

Cattle:
● Cattle are given names like vm0213.cloud.mycompany.com
● They are almost identical to other cattle
● When one gets ill, you get another one


What does that mean in practice?

Pets:
● VM instances often stateful, with mission-critical data
● Needs automated recovery with data protection
● Service downtime when a pet dies

Cattle:
● Stateless, or ephemeral (disposable) storage
● Already ideal for cloud … but automated recovery still needed!
● Service resilient to instances dying


If compute node is hosting cattle …

… to handle failures at scale, we need to automatically restart VMs somehow.


If compute node is hosting pets …

… we have to resurrect very carefully in order to avoid any zombie pets!
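One way to read the "zombie pets" warning: if a pet is resurrected on another host while the original copy is still running (for example because the node was only unreachable, not dead), two copies of the same instance may end up writing to the same data. A minimal sketch of the ordering this implies, assuming hypothetical `fence` and `resurrect` callables that stand in for whatever the real recovery workflow uses:

```python
from typing import Callable, List


def recover_host(
    host: str,
    fence: Callable[[str], bool],
    resurrect: Callable[[str], List[str]],
) -> List[str]:
    """Resurrect a failed host's VMs only after the host is confirmed off.

    `fence` and `resurrect` are hypothetical callables for illustration.
    `fence` must return True only once the host is known to be powered off.
    """
    if not fence(host):
        # The old VMs may still be running: resurrecting now risks zombies.
        raise RuntimeError(f"fencing of {host} not confirmed; refusing to recover")
    return resurrect(host)
```

The point of the sketch is only the ordering: recovery is refused outright unless fencing has been confirmed first.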


Do we really need compute HA in OpenStack?

Yes / No

Why?
● Compute HA needed for cattle as well as pets
● Valid reasons for running pets in OpenStack
– Manageability benefits
– Want to avoid multiple virtual estates
– Too expensive to cloudify legacy workloads


Architectural Challenges


Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.
● Protection for pets:
– per availability zone?
– per project?
– per pet?
● If nova-compute fails, VMs are still perfectly healthy but unmanageable
– Should they be automatically killed? Depends on the workload.
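To make the configurability question concrete, here is a minimal sketch of how an operator-defined protection policy might be looked up at the three granularities listed above. The `ProtectionPolicy` class, its field names, and the "most specific setting wins" precedence are all assumptions made for illustration; they do not describe any existing OpenStack feature.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProtectionPolicy:
    """Hypothetical operator policy; every name here is illustrative."""

    by_zone: Dict[str, bool] = field(default_factory=dict)
    by_project: Dict[str, bool] = field(default_factory=dict)
    by_instance: Dict[str, bool] = field(default_factory=dict)
    default: bool = False

    def is_protected(self, instance: str, project: str, zone: str) -> bool:
        # Assumed precedence: per pet, then per project, then per zone.
        for table, key in (
            (self.by_instance, instance),
            (self.by_project, project),
            (self.by_zone, zone),
        ):
            if key in table:
                return table[key]
        return self.default


policy = ProtectionPolicy(
    by_zone={"az-pets": True},
    by_instance={"scratch-vm": False},
)
print(policy.is_protected("mittens", "finance", "az-pets"))
print(policy.is_protected("scratch-vm", "finance", "az-pets"))
```

A different operator could just as reasonably invert the precedence or drop a level entirely, which is the challenge the slide is pointing at.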


Compute Plane Needs to Scale

CERN datacenter © Torkild Retvedt CC-BY-SA 2.0


Full Mesh Clusters Don't Scale
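A rough way to see why: if every node in a cluster has to keep track of every other node, the number of node pairs grows quadratically with cluster size. The short sketch below only does that arithmetic; it is not a model of how Corosync or Pacemaker actually communicate.

```python
def full_mesh_links(nodes: int) -> int:
    """Number of distinct node pairs in a full mesh of `nodes` members."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


for n in (3, 16, 100, 1000):
    print(f"{n:>5} nodes -> {full_mesh_links(n):>7} pairs")
```

Three controllers need 3 pairs, while 1000 compute nodes would need 499500.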


Addressing Scalability

The obvious workarounds are ugly!
● Multiple compute clusters introduce unwanted artificial boundaries
● Clusters inside / between guest VM instances are not OS-agnostic, and require cloud users to modify guest images (installing & configuring cluster software)
● Cloud is supposed to make things easier not harder!


Common Architecture

[Diagram: the control plane runs pacemaker, a recovery workflow controller, nova-api and a recovery state database. The compute plane consists of three compute nodes, each running pacemaker_remote, nova-compute, libvirt and VM 1 to VM 3 on the OS.]


Reliability Challenges

[Diagram: the same control plane / compute plane architecture (pacemaker, recovery workflow controller, nova-api and recovery state database; compute nodes running pacemaker_remote, nova-compute, libvirt and VMs), shown under the heading "Reliability Challenges".]


Compute HA in SUSE OpenStack Cloud


NovaCompute / NovaEvacuate OCF Agents

[Diagram: the generic architecture (pacemaker, recovery workflow controller, nova-api, recovery state database; compute nodes running pacemaker_remote, nova-compute, libvirt and VMs) followed by the same layout with concrete components: NovaEvacuate in place of the recovery workflow controller, the Pacemaker CIB in place of the recovery state database, a NovaCompute agent alongside nova-compute on each compute node, and fence_compute. NovaEvacuate and NovaCompute are labelled "OCF Resource Agents".]


NovaCompute / NovaEvacuate OCF Agents

Pros
● Ready for production now
● Commercially supported by SUSE
● RAs upstream in openstack-resource-agents repo

Cons
● Known limitations (known bugs):
– Only handles failure of compute node, not of VMs, or nova-compute
– Some corner cases still problematic, e.g. if nova fails during recovery


Brief Interlude: nova evacuate


nova’s Recovery API

[Diagram: the same control plane / compute plane architecture: pacemaker, recovery workflow controller, nova-api and recovery state database on the control plane; three compute nodes each running pacemaker_remote, nova-compute, libvirt and VM 1 to VM 3 on the OS.]


Public Health Warning

nova evacuate does not really mean evacuation!


Think About Natural Disasters

Not too late to evacuate

Too late to evacuate


nova Terminology

nova live-migration

nova evacuate ?!


Public Health Warning

● In Vancouver, nova developers considered a rename
● Has not happened yet
● Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”
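The distinction between the two nova operations can be summarised in a few lines. This is a sketch of the decision only: `nova_action` is a hypothetical helper, and the returned strings are the operation names from the slides, not calls into any client library.

```python
def nova_action(source_host_is_up: bool) -> str:
    """Pick the nova operation that matches the state of the source host."""
    if source_host_is_up:
        # Not too late to evacuate: the VM moves while it keeps running.
        return "live-migration"
    # Too late to evacuate: the host is already down, so the VM is
    # brought back on another host, i.e. "resurrected".
    return "evacuate"


print(nova_action(source_host_is_up=True))
print(nova_action(source_host_is_up=False))
```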


Shared Storage


Where can we have Shared Storage?

Two key areas:
● /var/lib/glance/images on controller nodes
● /var/lib/nova/instances on compute nodes


When do we need Shared Storage?

● If /var/lib/nova/instances is shared:
– VM's ephemeral disk will be preserved during recovery
● Otherwise:
– VM disk will be lost
– recovery will need to rebuild VM from image
● Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph)
– otherwise nova might fail to retrieve image from glance
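The rules above can be restated as a small decision function. This is a sketch for illustration: `recovery_plan` and its result keys are hypothetical names, and the `glance_images_shared` flag assumes glance stores images on local disk rather than in Swift or Ceph.

```python
from typing import Dict, List, Union


def recovery_plan(
    instances_dir_shared: bool,
    glance_images_shared: bool,
) -> Dict[str, Union[bool, List[str]]]:
    """Summarise what recovery implies for one VM (illustrative only)."""
    risks: List[str] = []
    if not glance_images_shared:
        # Assumes file-based image storage; Swift / Ceph avoid this risk.
        risks.append("nova might fail to retrieve image from glance")
    return {
        # Shared /var/lib/nova/instances keeps the ephemeral disk.
        "ephemeral_disk_preserved": instances_dir_shared,
        # Without it, the VM has to be rebuilt from its image.
        "rebuild_from_image": not instances_dir_shared,
        "risks": risks,
    }


print(recovery_plan(instances_dir_shared=False, glance_images_shared=False))
```

The worst case is the one printed here: the disk is lost, a rebuild from image is required, and the image itself may not be retrievable.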


Questions?
