Deploying Ceph in the wild
Ceph Day London 2014
Wido den Hollander, 42on.com
Transcript
Page 1: Ceph Day London 2014 - Deploying ceph in the wild

Deploying Ceph in the wild

Page 2: Ceph Day London 2014 - Deploying ceph in the wild

Who am I?

● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009

– Wrote the Apache CloudStack integration

– libvirt RBD storage pool support

– PHP and Java bindings for librados

Page 3: Ceph Day London 2014 - Deploying ceph in the wild

What is 42on?

● Consultancy company focused on Ceph and its ecosystem

● Founded in 2012
● Based in the Netherlands
● I'm the only employee

– My consultancy company

Page 4: Ceph Day London 2014 - Deploying ceph in the wild

Deploying Ceph

● As a consultant I see a lot of different organizations
– From small companies to large governments

– I see Ceph being used in all kinds of deployments

● It starts with gathering information about the use case
– Deployment application: RBD? Objects?

– Storage requirements: TBs or PBs?

– I/O requirements

Page 5: Ceph Day London 2014 - Deploying ceph in the wild

I/O is EXPENSIVE

● Everybody talks about storage capacity, almost nobody talks about IOps

● Think about IOps first and then about terabytes

Storage type   € per I/O   Remark
HDD            € 1,60      Seagate 3TB drive for €150 with 90 IOps
SSD            € 0,01      Intel S3500 480GB with 25k IOps for €410
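
A quick sanity check of those cost-per-I/O figures, simply reproducing the slide's numbers with the stated prices (exact results depend on the prices you actually pay):

# Cost per IOps = drive price / IOps the drive delivers
echo "scale=3; 150/90" | bc      # HDD: ~1.67 EUR per IOps
echo "scale=3; 410/25000" | bc   # SSD: ~0.016 EUR per IOps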

Page 6: Ceph Day London 2014 - Deploying ceph in the wild

Design for I/O

● Use more, but smaller disks
– More spindles means more I/O
– Can go for consumer drives, cheaper

● Maybe deploy SSD-only
– Intel S3500 or S3700 SSDs are reliable and fast

● You really want I/O during recovery operations
– OSDs replay PGLogs and scan directories
– Recovery operations require a lot of I/O

Page 7: Ceph Day London 2014 - Deploying ceph in the wild

Deployments

● I've done numerous Ceph deployments
– From tiny to large

● I want to showcase two of those deployments
– Use cases
– Design principles

Page 8: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Location: Belgium
● Organization: Government
● Use case:

– RBD for CloudStack

– S3 compatible storage

● Requirements:
– Storage for ~1000 Virtual Machines

● Including PostgreSQL databases

– TBs of S3 storage
● The actual amount of data is unknown to me

Page 9: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Cluster:
– 16 nodes with 24 drives
● 19x 1TB 7200RPM 2.5" drives
● 2x Intel S3700 200GB SSDs for journaling
● 2x Intel S3700 480GB SSDs for SSD-only storage
● 64GB of memory
● Xeon E5-2609 2.5GHz CPU

– 3x replication and an 80% fill ratio provides:
● 81TB of HDD storage
● 8TB of SSD storage

– 3 small nodes as monitors
● SSD for the Operating System and monitor data
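
The 81TB HDD figure follows directly from the drive counts above; a quick check of the slide's arithmetic:

# Usable capacity = raw capacity / replication factor * fill ratio
# 16 nodes * 19 drives * 1TB = 304TB raw
echo "scale=1; 16 * 19 / 3 * 0.8" | bc   # ~81TB of usable HDD storage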

Page 10: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

Page 11: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● If we detect that the OSD is running on an SSD, it goes into a different 'host' in the CRUSH map
– The rack is encoded in the hostname (dc2-rk01)

ROTATIONAL=$(cat /sys/block/$DEV/queue/rotational)

if [ $ROTATIONAL -eq 1 ]; then
    echo "root=hdd rack=${RACK}-hdd host=${HOST}-hdd"
else
    echo "root=ssd rack=${RACK}-ssd host=${HOST}-ssd"
fi

-48   2.88   rack rk01-ssd
-33   0.72     host dc2-rk01-osd01-ssd
252   0.36       osd.252   up   1
253   0.36       osd.253   up   1

-41  69.16   rack rk01-hdd
-10  17.29     host dc2-rk01-osd01-hdd
 20   0.91       osd.20    up   1
 19   0.91       osd.19    up   1
 17   0.91       osd.17    up   1
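
The fragment above assumes $DEV, $RACK and $HOST are already set. Below is a minimal sketch of how a complete hook might derive them; the real script is the one in the gist mentioned on the next slide, and the device-resolution part here is purely illustrative:

#!/bin/bash
# Sketch of a CRUSH location hook. Ceph invokes the hook with arguments
# like: --cluster ceph --id <osd-id> --type osd
# Assumption: hostnames look like dc2-rk01-osd01, so field 2 is the rack.

HOST=$(hostname -s)
RACK=$(echo "$HOST" | cut -d- -f2)                    # e.g. "rk01"

# Hypothetical device lookup: find the partition backing this OSD's data
# directory and strip the partition number (sdb1 -> sdb; assumes sdX names).
ID=$(echo "$@" | sed -n 's/.*--id \([0-9]*\).*/\1/p')
PART=$(basename "$(findmnt -n -o SOURCE /var/lib/ceph/osd/ceph-$ID)")
DEV=$(echo "$PART" | sed 's/[0-9]*$//')

ROTATIONAL=$(cat /sys/block/$DEV/queue/rotational)

if [ "$ROTATIONAL" -eq 1 ]; then
    echo "root=hdd rack=${RACK}-hdd host=${HOST}-hdd"
else
    echo "root=ssd rack=${RACK}-ssd host=${HOST}-ssd"
fi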

Page 12: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Download the script from my GitHub page:

– Url: https://gist.github.com/wido

– Place it in /usr/local/bin

● Configure it in your ceph.conf
– Push the config to your nodes using Puppet, Chef, Ansible, ceph-deploy, etc.

[osd]
osd_crush_location_hook = /usr/local/bin/crush-location-lookup
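
After restarting an OSD, the standard commands below can confirm the hook placed it under the expected -ssd or -hdd buckets (bucket names as on the previous slide):

ceph osd tree          # shows the -hdd / -ssd racks and hosts
ceph osd find 252      # shows where a single OSD lives (IP and CRUSH location)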

Page 13: Ceph Day London 2014 - Deploying ceph in the wild

Ceph with CloudStack

● Highlights:

– Automatic assignment of OSDs to right type

– Designed for IOps: more, smaller drives
● SSDs for the really high-I/O applications

– RADOS Gateway for object storage
● Trying to push developers towards objects instead of shared filesystems. A challenge!

● Future:
– Double the cluster size within 6 months

Page 14: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● Location: Netherlands
● Organization: ISP
● Use case:

– RBD for OCFS2

● Requirements:
– Shared filesystem between webservers

● Until CephFS is stable

Page 15: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● Cluster:
– 9 nodes with 8 drives

● 1 SSD for the Operating System
● 7 Samsung 840 Pro 512GB SSDs
● 10Gbit network (20Gbit LACP)

– At 3x replication and 80% filling it provides 8.6TB of storage

– 3 small nodes as monitors

Page 16: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

Page 17: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● “OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both high performance and high availability.”
– RBD disks are shared

– ext4 or XFS can't be mounted on multiple locations at the same time
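
A minimal sketch of what sharing an RBD image between webservers with OCFS2 looks like (image name, size and mount point are illustrative; it assumes the OCFS2 cluster stack, o2cb, is already configured on every node):

# On one node: create, map and format the shared image
rbd create webdata --size 204800         # 200GB, size given in MB
rbd map webdata                          # appears as e.g. /dev/rbd0
mkfs.ocfs2 -L webdata -N 8 /dev/rbd0     # -N: number of node slots

# On every webserver: map the same image and mount it
rbd map webdata
mount -t ocfs2 /dev/rbd0 /var/www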

Page 18: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● All the challenges were in OCFS2, not in Ceph or RBD
– Running a 3.14.17 kernel due to OCFS2 issues

– Limited OCFS2 volumes to 200GB to minimize impact in case of volume corruption

– Performed multiple hardware upgrades without any service interruption

● Runs smoothly while waiting for CephFS to mature

Page 19: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● 10Gbit network for lower latency:
– Lower network latency provides more performance

– Lower latency means more IOps
● Design for I/O!

● 16k packet round-trip times:
– 1GbE: 0.8 ~ 1.1ms

– 10GbE: 0.3 ~ 0.4ms

● It's not about the bandwidth, it's about latency!
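
Those round-trip numbers are easy to reproduce yourself; a ping with a 16KB payload is a rough but serviceable approximation (osd-node-01 is a placeholder hostname):

# 100 pings with a 16KB payload; check the avg round-trip time in the summary
ping -q -c 100 -s 16384 osd-node-01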

Page 20: Ceph Day London 2014 - Deploying ceph in the wild

RBD with OCFS2

● Highlights:– Full SSD cluster

– 10GbE network for lower latency

– Replaced all hardware since cluster was build● From 8 to 16 bays machines

● Future:– Expand when required. No concrete planning

Page 21: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

● DO
– Design for I/O, not raw terabytes

– Think about network latency
● 1GbE vs 10GbE

– Use small(er) machines

– Test recovery situations
● Pull the plug out of those machines!

– Reboot your machines regularly to verify it all works
● So do update those machines!

– Use dedicated hardware for your monitors
● With an SSD for storage

Page 22: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

Page 23: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

● DON'T
– Create too many Placement Groups (see the PG-count sketch after this list)

● It might overload your CPUs during recovery situations

– Fill your cluster over 80%

– Try to be smarter than Ceph
● It's auto-healing. Give it some time.

– Buy the most expensive machines
● Better to have two cheap(er) ones

– Use RAID-1 for journaling SSDs
● Spread your OSDs over them
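
As a rough guide for "how many is too many": the usual rule of thumb (from the Ceph documentation, not from this talk) is about 100 PGs per OSD divided by the replica count, rounded up to a power of two. A sketch with an illustrative OSD count:

# Rule of thumb: pg_num ~ (OSDs * 100) / replicas, rounded up to a power of two
OSDS=100        # illustrative OSD count
REPLICAS=3
TARGET=$(( OSDS * 100 / REPLICAS ))
PGS=1
while [ $PGS -lt $TARGET ]; do PGS=$(( PGS * 2 )); done
echo "pg_num ~ $PGS"    # 4096 for this example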

Page 24: Ceph Day London 2014 - Deploying ceph in the wild

DO and DON'T

Page 25: Ceph Day London 2014 - Deploying ceph in the wild

REMEMBER

● Hardware failure is the rule, not the exception!
● Consistency takes precedence over availability
● Ceph is designed to run on commodity hardware
● There is no more need for RAID
– Forget it ever existed

Page 26: Ceph Day London 2014 - Deploying ceph in the wild

Questions?

● Twitter: @widodh
● Skype: @widodh
● E-Mail: [email protected]
● GitHub: github.com/wido
● Blog: http://blog.widodh.nl/