CloudStack Scalability Testing, Development, Results, and Futures Anthony Xu Apache CloudStack contributor

Dec 14, 2015

Page 1

CloudStack Scalability Testing, Development, Results, and Futures

Anthony Xu

Apache CloudStack contributor

Page 2

Apache CloudStack: a project in incubation

•Secure, multi-tenant cloud orchestration platform
– Turnkey platform for delivering IaaS clouds
– Hypervisor agnostic
– Highly scalable, secure and open
– Complete self-service portal
– Open source, open standards
– Deploys on premise

Page 3

[Architecture diagram. Labels: Internet, Admin, Load Balancer, Management Server Cluster, Primary MySQL, Backup MySQL, Router, L3 Core Switch, Top of Rack Switch, Servers, Object Storage, Availability Zone 1, Pod 1, Pod 2, Pod 3 … Pod N]

Manage hosts, create VMs, virtual disks, virtual networks, meter usage, …

Page 4

Thinking about cloud orchestration at scale

•Host management

•Capacity management

•What host to use to deploy a new VM

•Failure handling

•Security group propagation

•Set a goal

Page 5

We can’t afford this as our QA lab

Page 6

Simulator enables scale testing

[Diagram. Labels: User API, Admin API, Load Balancer, four Mgmt. Servers, two MySQL instances, Zone Simulator]

Page 7

Environment

[Diagram. Labels: User API, Admin API, Load Balancer, four Mgmt. Servers, two MySQL instances, Zone Simulator]

Mgmt. Server: 2 cores, 4 with Hyper-Threading. 2.2 GHz Xeon. 16 GB RAM. 12 GB JVM heap.

MySQL: single spinning disk, later single SSD. 32 GB RAM. MySQL 5.5.

Page 8

Page 9

Allocator performance is awful with 1000 hosts

•Two minutes to decide which host to use for a new VM!

•Computing capacity for every pod repeatedly

•Fixed that, but still 12 seconds to decide

•Use host tags, down to 2 seconds

•Major changes required to improve further

•In 2.2.0, store capacity info in DB, skip pod altogether

•Harness the power of SQL select and all is well
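
The last two bullets describe moving the capacity check into the database so that a single SELECT picks the host. A minimal sketch of that idea follows, using an in-memory SQLite database purely for illustration. The table and column names (`host_capacity`, `cpu_free_mhz`, `ram_free_mb`) and the "most free RAM first" ordering are assumptions made for this example, not the CloudStack schema or its allocator policy.

```python
import sqlite3

def make_demo_db():
    # Hypothetical capacity table; not the real CloudStack schema.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE host_capacity ("
        "host_id INTEGER PRIMARY KEY, cpu_free_mhz INTEGER, ram_free_mb INTEGER)"
    )
    conn.executemany(
        "INSERT INTO host_capacity VALUES (?, ?, ?)",
        [(1, 1000, 2048), (2, 4000, 8192), (3, 4000, 4096)],
    )
    conn.commit()
    return conn

def pick_host(conn, cpu_mhz, ram_mb):
    """Let the database filter and rank hosts; return a host id or None."""
    row = conn.execute(
        "SELECT host_id FROM host_capacity "
        "WHERE cpu_free_mhz >= ? AND ram_free_mb >= ? "
        "ORDER BY ram_free_mb DESC, host_id ASC LIMIT 1",
        (cpu_mhz, ram_mb),
    ).fetchone()
    return row[0] if row else None

def reserve(conn, host_id, cpu_mhz, ram_mb):
    """Subtract the reserved capacity so the next SELECT sees current numbers."""
    with conn:
        conn.execute(
            "UPDATE host_capacity SET cpu_free_mhz = cpu_free_mhz - ?, "
            "ram_free_mb = ram_free_mb - ? WHERE host_id = ?",
            (cpu_mhz, ram_mb, host_id),
        )
```

The point of the sketch is that the allocator no longer walks every pod and recomputes capacity in application code; an indexed query over stored capacity does the filtering.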

Page 10

Polling doesn’t scale

TRUE? FALSE?

Sometimes, it is good enough

Page 11

Host management

•Check host state via TCP connection

•Check every minute

•30,000 checks per minute, 500 per second

•But they take 10 seconds, so 5000 in parallel

•Not using async I/O so 5000 threads required…

•Single JVM can support 5000+ threads so this is concerning but may not be the limiting factor
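
The numbers in the bullets above follow from simple rate arithmetic: checks in flight equal the arrival rate multiplied by how long each check takes. A small sketch of that calculation, using only the figures stated on the slide (the function name is made up for this example):

```python
def concurrent_checks(hosts, interval_s, check_duration_s):
    """Return (checks started per second, checks in flight at any moment)."""
    rate_per_s = hosts / interval_s
    in_flight = rate_per_s * check_duration_s
    return rate_per_s, in_flight

rate, in_flight = concurrent_checks(hosts=30_000, interval_s=60, check_duration_s=10)
print(rate, in_flight)  # 500.0 5000.0
```

With blocking I/O, each in-flight check occupies a thread, which is where the figure of 5000 threads comes from.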

Page 12

Host management

•What is the maximum feasible JVM heap size?

•Some people use heaps with hundreds of GB

•Commercial tools can help, but at a cost

•We decided to stay below 20 GB (GC concerns)

•How much CPU is required for background processing?

Page 13

CPU utilization while deploying 30,000 VMs on 30,000 hosts

[Chart. Y axis: CPU utilization (400% is maximum). X axis: time. Annotations: 20,000; 5000; 5000; Idle]

Page 14

Deploy time from 25,000 to 30,000 VMs

[Chart. Y axis: seconds to deploy. X axis: VM number (25,000 plus X)]

Page 15

Problem: agent load balancing

•Management servers start/stop/fail/crash

•How do newly started Management Servers get agents / work?

•When a Management Server exits, how do others pick up its load?

•When new hosts are added how is the load distributed?

[Diagram: Mgmt Server 1 and Mgmt Server 2, with Agent 1 through Agent 6]

Page 16

Common use case timings at scale

•30,000 hosts and 4 Management Servers

•4 Management Servers running, 1 fails: 10 minutes to redistribute 7500 agents

•3 Management Servers running, add a fourth: 40 minutes to redistribute load evenly

•0 Management Servers running, start all 4 simultaneously: 16 minutes to connect to all 30,000 hosts

IMPORTANT
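
The agent counts behind these timings can be reproduced with a little bookkeeping. The sketch below only counts how many agents have to change owner to reach an even split; it is an illustration under that assumption and does not model how CloudStack actually hands agents over. The server names are made up.

```python
def even_targets(total_agents, servers):
    """Split agents as evenly as possible across the live servers."""
    base, extra = divmod(total_agents, len(servers))
    return {s: base + (1 if i < extra else 0) for i, s in enumerate(sorted(servers))}

def agents_to_move(current, live_servers):
    """current maps server -> agents it owns now (dead servers included).

    Returns the even target per live server and how many agents must move.
    """
    targets = even_targets(sum(current.values()), live_servers)
    moved = sum(max(0, n - targets.get(s, 0)) for s, n in current.items())
    return targets, moved

# 4 servers with 7500 agents each, then ms4 fails.
before = {"ms1": 7500, "ms2": 7500, "ms3": 7500, "ms4": 7500}
targets, moved = agents_to_move(before, ["ms1", "ms2", "ms3"])
print(targets, moved)

# 3 servers with 10,000 agents each, then a fourth is added.
before = {"ms1": 10_000, "ms2": 10_000, "ms3": 10_000}
targets, moved = agents_to_move(before, ["ms1", "ms2", "ms3", "ms4"])
print(targets, moved)
```

Both cases move 7500 agents, which matches the figure on the slide for the failure case.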

Page 17

Understanding security groups

[Diagram: a Web Security Group containing Web VMs and a DB Security Group containing DB VMs]

Ingress Rule: Allow VMs in Web Security Group access to VMs in DB Security Group on Port 3306
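
A group-to-group rule like the one above can be modeled as data, which also shows why membership changes have to be propagated: when a VM joins the Web group, every VM in the DB group needs its firewall updated. This is a minimal sketch with hypothetical names (`IngressRule`, `is_allowed`, `vms_needing_update`) and made-up addresses; it is not CloudStack code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IngressRule:
    src_group: str   # group whose members may connect
    dst_group: str   # group whose members accept the traffic
    port: int

def is_allowed(rules, membership, src_ip, dst_ip, port):
    """True if some rule admits src_ip to dst_ip on the given port."""
    return any(
        r.port == port
        and src_ip in membership.get(r.src_group, set())
        and dst_ip in membership.get(r.dst_group, set())
        for r in rules
    )

def vms_needing_update(rules, membership, changed_group):
    """VMs whose firewalls reference changed_group as a traffic source."""
    affected = set()
    for r in rules:
        if r.src_group == changed_group:
            affected |= membership.get(r.dst_group, set())
    return affected

# Hypothetical membership: two web VMs, one DB VM.
membership = {"web": {"10.1.0.2", "10.1.16.47"}, "db": {"10.1.0.4"}}
rules = [IngressRule(src_group="web", dst_group="db", port=3306)]
```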

Page 18

L3 isolation with distributed firewalls

[Network diagram. Labels: Public Internet; public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36, 65.37.141.80; Load Balancer; L3 Core; Pod 1 L2 Switch, Pod 2 L2 Switch, Pod 3 L2 Switch; addresses 10.1.0.1, 10.1.8.1, 10.1.16.1. VMs: Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3), Tenant 1 VM 2 (10.1.0.4)]

Page 19

L3 isolation with distributed firewalls

[Network diagram. Labels: Public Internet; public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36, 65.37.141.80; Load Balancer; L3 Core; Pod 1 L2 Switch, Pod 2 L2 Switch, Pod 3 L2 Switch; addresses 10.1.0.1, 10.1.8.1, 10.1.16.1. VMs: Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3), Tenant 1 VM 2 (10.1.0.4), Tenant 1 VM 3 (10.1.16.47), Tenant 1 VM 4 (10.1.16.85)]

Page 20

L3 isolation with distributed firewalls

[Network diagram. Labels: Public Internet; public IP addresses 65.37.141.11, 65.37.141.24, 65.37.141.36, 65.37.141.80; Load Balancer; L3 Core; Pod 1 L2 Switch, Pod 2 L2 Switch, Pod 3 L2 Switch; addresses 10.1.0.1, 10.1.8.1, 10.1.16.1. VMs: Tenant 1 VM 1 (10.1.0.2), Tenant 2 VM 1 (10.1.0.3), Tenant 1 VM 2 (10.1.0.4), Tenant 2 VM 2 (10.1.16.12), Tenant 2 VM 3 (10.1.16.21), Tenant 1 VM 3 (10.1.16.47), Tenant 1 VM 4 (10.1.16.85)]

Page 21

1 Firewall per Virtual Machine

Page 22

[Diagram: a large grid of VMs]

One million firewalls?

Page 23

Orchestrating hundreds of thousands of firewalls

Well-known software scaling techniques
•Message queues
•Consistency tradeoffs
•Idempotent configuration & retries

CloudStack uses
•Special purpose queues
•Optimized for large security groups
•Eventual consistency for rule updates
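
One way to combine the techniques named on this slide is a queue that keeps only the newest pending ruleset per VM, paired with an agent that ignores anything it has already applied. The sketch below illustrates that general pattern under stated assumptions: the class names, the per-VM sequence number, and the coalescing policy are all hypothetical, and the slide does not say that CloudStack's special purpose queues work this way.

```python
class RuleUpdateQueue:
    """Coalescing queue: at most one pending ruleset per VM, the newest one."""

    def __init__(self):
        self._pending = {}  # vm_id -> (seq, frozenset of allowed source IPs)

    def submit(self, vm_id, seq, allowed_ips):
        current = self._pending.get(vm_id)
        if current is None or seq > current[0]:
            self._pending[vm_id] = (seq, frozenset(allowed_ips))

    def drain(self):
        """Hand out all pending updates and empty the queue."""
        items = [(vm, seq, ips) for vm, (seq, ips) in self._pending.items()]
        self._pending.clear()
        return items


class HostAgent:
    """Applies full rulesets idempotently, so retries and duplicates are harmless."""

    def __init__(self):
        self.applied = {}  # vm_id -> (seq, frozenset of allowed source IPs)

    def apply(self, vm_id, seq, allowed_ips):
        current = self.applied.get(vm_id)
        if current is not None and seq <= current[0]:
            return False  # stale or duplicate update, nothing to do
        self.applied[vm_id] = (seq, frozenset(allowed_ips))
        return True
```

Because each message carries the whole ruleset rather than a delta, a host that misses an intermediate update still converges once it receives a later one, which is the eventual-consistency tradeoff the slide mentions.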

Page 24

Problem: firewall rules explosion in dom0

Allow Security Group {Web} on TCP port 3060

-A FORWARD -p tcp -m tcp --dport 3060 --src 10.1.16.31 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.1.45.112 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.1.189.5 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 3060 --src 10.21.9.77 -j ACCEPT
…

Performance suffers for large security groups

Page 25

Problem: firewall rules explosion in dom0

Fix with ipsets:

ipset -N web_sg iptreemap
ipset -A web_sg 10.1.16.31
ipset -A web_sg 10.1.16.112
ipset -A web_sg 10.1.189.5
ipset -A web_sg 10.21.9.77

-A FORWARD -p tcp -m tcp --dport 3060 -m set --match-set web_sg src -j ACCEPT
…

See also http://daemonkeeper.net/781/mass-blocking-ip-addresses-with-ipset/
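
The difference between the two approaches is easiest to see by counting rules. The sketch below only builds command strings in the shape shown on the slides; it does not run `iptables` or `ipset`. The helper names are made up for this example, and the `iptreemap` set type is copied from the slide, so whether it is available depends on the ipset version installed.

```python
def per_ip_rules(ips, port):
    """One FORWARD rule per source address: the chain grows with the group."""
    return [
        f"-A FORWARD -p tcp -m tcp --dport {port} --src {ip} -j ACCEPT"
        for ip in ips
    ]

def ipset_commands(set_name, ips, port):
    """Set membership commands plus a single FORWARD rule matching the set."""
    commands = [f"ipset -N {set_name} iptreemap"]
    commands += [f"ipset -A {set_name} {ip}" for ip in ips]
    rule = (
        f"-A FORWARD -p tcp -m tcp --dport {port} "
        f"-m set --match-set {set_name} src -j ACCEPT"
    )
    return commands, rule

web_ips = ["10.1.16.31", "10.1.16.112", "10.1.189.5", "10.21.9.77"]
print(len(per_ip_rules(web_ips, 3060)))      # 4 rules in the chain
commands, rule = ipset_commands("web_sg", web_ips, 3060)
print(len(commands), rule)
```

With the set, adding a VM to the Web group adds one set member and leaves the FORWARD chain at a single rule, so the packet path no longer walks a list whose length equals the group size.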

Page 26

Security group propagation time

[Chart. Y axis: seconds to fully synced. X axis: number of VMs in security group]

Page 27

Problem: database connection management

•Scale testing resulted in several “too many open connections” errors from MySQL

•Common problem: holding open connections while doing long-running operations

•Took some code clean up and refactoring

•No longer an issue

•10,000 connections are OK

•CloudStack is far below that
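
The "common problem" bullet above can be shown with a toy connection pool. In the sketch below, `TinyPool`, `deploy_holding`, and `deploy_releasing` are hypothetical names, and SQLite stands in for MySQL; the only point is where the slow step runs relative to the connection checkout.

```python
import sqlite3
from contextlib import contextmanager

class TinyPool:
    """Toy pool that refuses checkouts beyond a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    @contextmanager
    def connection(self):
        if self.in_use >= self.limit:
            raise RuntimeError("too many open connections")
        self.in_use += 1
        conn = sqlite3.connect(":memory:")
        try:
            yield conn
        finally:
            conn.close()
            self.in_use -= 1

def deploy_holding(pool, long_running_op):
    """Anti-pattern: the connection stays checked out during the slow step."""
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
        return long_running_op()

def deploy_releasing(pool, long_running_op):
    """Read, release, do the slow step, then reacquire to record the result."""
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
    result = long_running_op()  # no connection is held here
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
    return result
```

With many operations in flight, the first version pins one connection per operation for its whole duration, while the second holds a connection only for the short database steps.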

Page 28

DB connections per MS while deploying 30,000 VMs

[Chart. Y axis: number of DB connections. X axis: time. Annotations: 20,000; 5,000; 5,000]

Page 29

Other considerations (beyond control plane)

•Network design and devices

•Object store scalability

•Per-host and cluster scalability

•Storage

•Understand your workload

Page 30

Future work

•Improve simulator accuracy

•Publish results of advanced network (VLAN) testing

•Verify assumption of VM density not impacting scale

Page 31

More information and joining the project

Project web site: http://incubator.apache.org/projects/cloudstack.html

Mailing lists:

[email protected]

[email protected]

Scalability study: http://wiki.cloudstack.org/pages/viewpage.action?pageId=14320020

Page 32

Q&A