Deploying Ceph in the wild
Ceph Day London 2014
Wido den Hollander, 42on.com
Transcript
Page 1: Ceph Day London 2014 - Deploying ceph in the wild

Deploying Ceph in the wild

Page 2: Ceph Day London 2014 - Deploying ceph in the wild

Who am I?

● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009

– Wrote the Apache CloudStack integration

– libvirt RBD storage pool support

– PHP and Java bindings for librados

Page 3: Ceph Day London 2014 - Deploying ceph in the wild

What is 42on?

● Consultancy company focused on Ceph and its ecosystem

● Founded in 2012
● Based in the Netherlands
● I'm the only employee

– My consultancy company

Page 4: Ceph Day London 2014 - Deploying ceph in the wild

Deploying Ceph

● As a consultant I see a lot of different organizations
– From small companies to large governments

– I see Ceph being used in all kinds of deployments

● It starts with gathering information about the use case
– Deployment application: RBD? Objects?

– Storage requirements: TBs or PBs?

– I/O requirements

Page 5: Ceph Day London 2014 - Deploying ceph in the wild

I/O is EXPENSIVE

● Everybody talks about storage capacity, almost nobody talks about IOps

● Think about IOps first and then about terabytes

Storage type   € per I/O   Remark
HDD            € 1,60      Seagate 3TB drive for €150 with 90 IOps
SSD            € 0,01      Intel S3500 480GB with 25k IOps for €410
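
A quick sanity check of those cost-per-I/O figures, simply reproducing the slide's numbers with the stated prices (exact results depend on the prices you actually pay):

# Cost per IOps = drive price / IOps the drive delivers
echo "scale=3; 150/90" | bc      # HDD: ~1.67 EUR per IOps
echo "scale=3; 410/25000" | bc   # SSD: ~0.016 EUR per IOps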

Page 6: Ceph Day London 2014 - Deploying ceph in the wild

Design for I/O

● Use more, but smaller disks
– More spindles means more I/O
– Can go for consumer drives, cheaper

● Maybe deploy SSD-only
– Intel S3500 or S3700 SSDs are reliable and fast

● You really want I/O during recovery operations
– OSDs replay PGLogs and scan directories
– Recovery operations require a lot of I/O

Page 7: Ceph Day London 2014 - Deploying ceph in the wild

Deployments

● I've done numerous Ceph deployments
– From tiny to large

● I want to showcase two of those deployments
– Use cases
– Design principles

Page 8: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Location: Belgium
● Organization: Government
● Use case:

– RBD for CloudStack

– S3 compatible storage

● Requirements:
– Storage for ~1000 Virtual Machines

● Including PostgreSQL databases

– TBs of S3 storage
● The actual amount of data is unknown to me

Page 9: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Cluster:
– 16 nodes with 24 drives
● 19x 1TB 7200RPM 2.5" drives
● 2x Intel S3700 200GB SSDs for journaling
● 2x Intel S3700 480GB SSDs for SSD-only storage
● 64GB of memory
● Xeon E5-2609 2.5GHz CPU

– 3x replication and an 80% fill ratio provides:
● 81TB of HDD storage
● 8TB of SSD storage

– 3 small nodes as monitors
● SSD for the Operating System and monitor data
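
The 81TB HDD figure follows directly from the drive counts above; a quick check of the slide's arithmetic:

# Usable capacity = raw capacity / replication factor * fill ratio
# 16 nodes * 19 drives * 1TB = 304TB raw
echo "scale=1; 16 * 19 / 3 * 0.8" | bc   # ~81TB of usable HDD storage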

Page 10: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

Page 11: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● If we detect that the OSD is running on an SSD, it goes into a different 'host' in the CRUSH map
– The rack is encoded in the hostname (dc2-rk01)

ROTATIONAL=$(cat /sys/block/$DEV/queue/rotational)

if [ $ROTATIONAL -eq 1 ]; then
    echo "root=hdd rack=${RACK}-hdd host=${HOST}-hdd"
else
    echo "root=ssd rack=${RACK}-ssd host=${HOST}-ssd"
fi

-48   2.88   rack rk01-ssd
-33   0.72     host dc2-rk01-osd01-ssd
252   0.36       osd.252   up   1
253   0.36       osd.253   up   1

-41  69.16   rack rk01-hdd
-10  17.29     host dc2-rk01-osd01-hdd
 20   0.91       osd.20    up   1
 19   0.91       osd.19    up   1
 17   0.91       osd.17    up   1
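
The fragment above assumes $DEV, $RACK and $HOST are already set. Below is a minimal sketch of how a complete hook might derive them; the real script is the one in the gist mentioned on the next slide, and the device-resolution part here is purely illustrative:

#!/bin/bash
# Sketch of a CRUSH location hook. Ceph invokes the hook with arguments
# like: --cluster ceph --id <osd-id> --type osd
# Assumption: hostnames look like dc2-rk01-osd01, so field 2 is the rack.

HOST=$(hostname -s)
RACK=$(echo "$HOST" | cut -d- -f2)                    # e.g. "rk01"

# Hypothetical device lookup: find the partition backing this OSD's data
# directory and strip the partition number (sdb1 -> sdb; assumes sdX names).
ID=$(echo "$@" | sed -n 's/.*--id \([0-9]*\).*/\1/p')
PART=$(basename "$(findmnt -n -o SOURCE /var/lib/ceph/osd/ceph-$ID)")
DEV=$(echo "$PART" | sed 's/[0-9]*$//')

ROTATIONAL=$(cat /sys/block/$DEV/queue/rotational)

if [ "$ROTATIONAL" -eq 1 ]; then
    echo "root=hdd rack=${RACK}-hdd host=${HOST}-hdd"
else
    echo "root=ssd rack=${RACK}-ssd host=${HOST}-ssd"
fi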

Page 12: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Download the script from my GitHub page:

– Url: https://gist.github.com/wido

– Place it in /usr/local/bin

● Configure it in your ceph.conf
– Push the config to your nodes using Puppet, Chef, Ansible, ceph-deploy, etc.

[osd]
osd_crush_location_hook = /usr/local/bin/crush-location-lookup
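
After restarting an OSD, the standard commands below can confirm the hook placed it under the expected -ssd or -hdd buckets (bucket names as on the previous slide):

ceph osd tree          # shows the -hdd / -ssd racks and hosts
ceph osd find 252      # shows where a single OSD lives (IP and CRUSH location)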

Page 13: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Highlights:

– Automatic assignment of OSDs to right type

– Designed for IOps: more, smaller drives
● SSDs for the really high-I/O applications

– RADOS Gateway for object storage
● Trying to push developers towards objects instead of shared filesystems. A challenge!

● Future:
– Double the cluster size within 6 months

Page 14: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● Location: Netherlands
● Organization: ISP
● Use case:

– RBD for OCFS2

● Requirements:
– Shared filesystem between webservers

● Until CephFS is stable

Page 15: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● Cluster:
– 9 nodes with 8 drives

● 1 SSD for the Operating System
● 7 Samsung 840 Pro 512GB SSDs
● 10Gbit network (20Gbit LACP)

– At 3x replication and 80% filling it provides 8.6TB of storage

– 3 small nodes as monitors

Page 16: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

Page 17: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● “OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both high performance and high availability.”
– RBD disks are shared

– ext4 or XFS can't be mounted on multiple locations at the same time
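
A minimal sketch of what sharing an RBD image between webservers with OCFS2 looks like (image name, size and mount point are illustrative; it assumes the OCFS2 cluster stack, o2cb, is already configured on every node):

# On one node: create, map and format the shared image
rbd create webdata --size 204800         # 200GB, size given in MB
rbd map webdata                          # appears as e.g. /dev/rbd0
mkfs.ocfs2 -L webdata -N 8 /dev/rbd0     # -N: number of node slots

# On every webserver: map the same image and mount it
rbd map webdata
mount -t ocfs2 /dev/rbd0 /var/www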

Page 18: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● All the challenges were in OCFS2, not in Ceph or RBD
– Running a 3.14.17 kernel due to OCFS2 issues

– Limited OCFS2 volumes to 200GB to minimize impact in case of volume corruption

– Performed multiple hardware upgrades without any service interruption

● Runs smoothly while waiting for CephFS to mature

Page 19: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● 10Gbit network for lower latency:
– Lower network latency provides more performance

– Lower latency means more IOps
● Design for I/O!

● 16k packet round-trip times:
– 1GbE: 0.8 ~ 1.1ms

– 10GbE: 0.3 ~ 0.4ms

● It's not about the bandwidth, it's about latency!
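
Those round-trip numbers are easy to reproduce yourself; a ping with a 16KB payload is a rough but serviceable approximation (osd-node-01 is a placeholder hostname):

# 100 pings with a 16KB payload; check the avg round-trip time in the summary
ping -q -c 100 -s 16384 osd-node-01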

Page 20: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● Highlights:– Full SSD cluster

– 10GbE network for lower latency

– Replaced all hardware since cluster was build● From 8 to 16 bays machines

● Future:– Expand when required. No concrete planning

Page 21: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

● DO
– Design for I/O, not raw terabytes

– Think about network latency
● 1GbE vs 10GbE

– Use small(er) machines

– Test recovery situations
● Pull the plug out of those machines!

– Reboot your machines regularly to verify it all works
● So do update those machines!

– Use dedicated hardware for your monitors
● With an SSD for storage

Page 22: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

Page 23: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

● DON'T
– Create too many Placement Groups (see the PG-count sketch after this list)

● It might overload your CPUs during recovery situations

– Fill your cluster over 80%

– Try to be smarter than Ceph
● It's auto-healing. Give it some time.

– Buy the most expensive machines
● Better to have two cheap(er) ones

– Use RAID-1 for journaling SSDs
● Spread your OSDs over them
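
As a rough guide for "how many is too many": the usual rule of thumb (from the Ceph documentation, not from this talk) is about 100 PGs per OSD divided by the replica count, rounded up to a power of two. A sketch with an illustrative OSD count:

# Rule of thumb: pg_num ~ (OSDs * 100) / replicas, rounded up to a power of two
OSDS=100        # illustrative OSD count
REPLICAS=3
TARGET=$(( OSDS * 100 / REPLICAS ))
PGS=1
while [ $PGS -lt $TARGET ]; do PGS=$(( PGS * 2 )); done
echo "pg_num ~ $PGS"    # 4096 for this example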

Page 24: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

Page 25: Ceph Day London 2014 - Deploying ceph in the wild

REMEMBER

● Hardware failure is the rule, not the exception!
● Consistency takes precedence over availability
● Ceph is designed to run on commodity hardware
● There is no more need for RAID
– Forget it ever existed

Page 26: Ceph Day London 2014 - Deploying ceph in the wild

Questions?

● Twitter: @widodh
● Skype: @widodh
● E-Mail: [email protected]
● GitHub: github.com/wido
● Blog: http://blog.widodh.nl/