
Datacenter Storage with Ceph

John Spray

john.spray@redhat.com
jcsp on #ceph-devel

OSDC.de 2015

Agenda

● What is Ceph?

● How does Ceph store your data?

● Interfaces to Ceph: RBD, RGW, CephFS

● Latest development updates


What is Ceph?

● Highly available, resilient data store

● Free Software (LGPL)

● 10 years since inception

● Flexible object, block and filesystem interfaces

● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service.


A general-purpose storage system

● You feed it commodity disks and ethernet

● In return, it gives your apps a storage service

● It doesn't lose your data

● It doesn't need babysitting

● It's portable


Interfaces to storage

● FILE SYSTEM (CephFS): POSIX, Linux kernel, CIFS/NFS, HDFS, distributed metadata
● BLOCK STORAGE (RBD): OpenStack, Linux kernel, iSCSI, clones, snapshots
● OBJECT STORAGE (RGW): S3 & Swift, multi-tenant, native API, Keystone, geo-replication


Ceph Architecture (how your data is stored)

Components

● RGW: a web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
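The RADOS layer can also be exercised directly, either through the librados bindings listed above or with the rados command-line tool. A minimal sketch with the CLI, assuming a pool named mypool already exists (the object and file names are made up):

# store a local file as an object, read it back, and list the pool's contents
rados -p mypool put greeting ./hello.txt
rados -p mypool get greeting ./hello-copy.txt
rados -p mypool ls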


RADOS

Reliable, Autonomous, Distributed Object Store

RADOS Components

OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
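A hedged sketch of how OSDs and monitors show up on a running cluster, using the standard status commands:

ceph -s              # overall health, monitor quorum, OSD up/in counts
ceph osd tree        # OSDs laid out in the CRUSH hierarchy, with up/down state
ceph mon stat        # monitor membership and current quorum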

Object Storage Daemons

(Diagram: each OSD daemon sits on top of a local filesystem (xfs, ext4 or btrfs) on its own disk; a small set of monitors (M) runs alongside the OSDs.)

RADOS Cluster

(Diagram: an application talks directly to the RADOS cluster, which is made up of OSDs plus a handful of monitors (M).)

Where do objects live?

(Diagram: an application holds an object and must decide where among the cluster's OSDs it should be stored: ??)

A Metadata Server?

(Diagram: one option is a central metadata server: the application first asks the server where the object lives (1), then accesses it there (2).)

Calculated placement

(Diagram: in the alternative approach, the application computes placement itself with a function F mapping each object name to a group of OSDs, e.g. A-G, H-N, O-T, U-Z.)

CRUSH: Dynamic data placement

● Pseudo-random placement algorithm
  – Fast calculation, no lookup (example below)
  – Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  – Limited data migration on change
● Rule-based configuration
  – Infrastructure topology aware
  – Adjustable replication
  – Weighting
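Because placement is computed rather than looked up, any client can ask where an object would land. For illustration (the pool and object names are made up):

# print the placement group and the set of OSDs CRUSH chooses for this object
ceph osd map mypool myobject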

CRUSH: Replication

(Diagram: incoming DATA is placed by CRUSH, and replicas of each object are written to several different OSDs across the RADOS cluster.)

CRUSH: Topology-aware placement

(Diagram: CRUSH spreads the replicas of each object across failure domains; here, copies land in both RACK A and RACK B.)

CRUSH is a quick calculation

(Diagram: the client computes where DATA lives in the RADOS cluster on the fly and talks to the right OSDs directly; no lookup table is consulted.)

CRUSH rules

● Simple language
● Specify which copies go where: across racks, servers, datacenters, disk types (CLI example below)

rule <rule name> {
  ruleset <ruleset name>
  type <replicated | erasure>
  step take <bucket-type>
  step [choose|chooseleaf] [firstn|indep] <N> <type>
  step emit
}
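Simple rules can also be created from the CLI instead of editing the CRUSH map text by hand. A sketch, assuming the default CRUSH root and a 'rack' level in the hierarchy (the rule name, pool name and ruleset id are illustrative):

# create a replicated rule whose leaves are chosen from separate racks
ceph osd crush rule create-simple per-rack-rule default rack
# list rules and their ruleset ids, then point a pool at the new one
ceph osd crush rule dump
ceph osd pool set mypool crush_ruleset 1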

Pools and Placement Groups

● Trick: apply CRUSH placement to a fixed number of placement groups instead of to N objects.
● Manage recovery/backfill at PG granularity: less per-object metadata.
● Typically a few hundred PGs per OSD (example below)
● A pool is a logical collection of PGs, using a particular CRUSH rule
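For illustration, creating a pool with an explicit PG count (the pool name and the figure of 128 are assumptions; size pg_num to your OSD count):

# create a replicated pool with 128 placement groups
ceph osd pool create mypool 128
# inspect the PG count, and raise it later if the cluster grows
ceph osd pool get mypool pg_num
ceph osd pool set mypool pg_num 256
ceph osd pool set mypool pgp_num 256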

Recovering from failures

● OSDs notice when their peers stop responding and report this to the monitors
● The monitors decide that an OSD is now “down”
● Peers continue to serve data, but it is in a degraded state
● After some time, the monitors mark the OSD “out”
● New peers are selected by CRUSH and the data is re-replicated across the whole cluster
● Faster than a RAID rebuild, because the load is shared
● Does not require administrator intervention (though see the maintenance sketch below)
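Because all of this is automatic, the main time an administrator steps in is planned maintenance. A small sketch:

# stop the monitors marking OSDs "out" while a host is down for maintenance
ceph osd set noout
# ...reboot or service the host, then restore normal behaviour
ceph osd unset noout
# watch the cluster recover and the degraded object count fall
ceph -w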

RADOS advanced features

● Not just puts and gets!
● More feature-rich than typical object stores
● Partial object updates & appends (examples below)
● Key-value stores (OMAPs) within objects
● Copy-on-write snapshots
● Watch/notify for pushing events
● Extensible with Object Classes: perform arbitrary transactions on OSDs.
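Some of these features are visible from the rados tool as well as through librados. A hedged sketch against a hypothetical pool and object:

# append to an existing object instead of rewriting it
rados -p mypool append myobject ./more-data.bin
# attach key/value pairs (OMAP) to the same object, then read them back
rados -p mypool setomapval myobject colour blue
rados -p mypool listomapvals myobject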

Choosing hardware

● Cheap hardware mitigates the cost of replication
● OSD data journalling: a separate SSD is useful, approx. 1 SSD for every 4 OSD disks
● OSDs are more CPU/RAM intensive than legacy storage: approx. 8 or so per host
● Many cheaper servers are better than a few expensive ones: distribute the load of rebalancing
● Consider your bandwidth/capacity ratio and your read/write ratio


Interfaces to applications: RGW, RBD, and CephFS


RBD: Virtual disks in Ceph

RADOS Block Device:
● Storage of disk images in RADOS
● Decouples VMs from hosts
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones (example below)
● Support in:
  – Mainline Linux kernel (2.6.39+)
  – Qemu/KVM
  – OpenStack, CloudStack, OpenNebula, Proxmox
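A minimal sketch of images, snapshots and clones from the rbd command line (the pool, image and snapshot names are made up; sizes are in MB):

# create a 10 GB format-2 image (format 2 is required for cloning)
rbd create mypool/vm-disk --size 10240 --image-format 2
# snapshot it, protect the snapshot, and take a copy-on-write clone
rbd snap create mypool/vm-disk@golden
rbd snap protect mypool/vm-disk@golden
rbd clone mypool/vm-disk@golden mypool/vm-disk-clone
# expose the image as a block device via the mainline kernel driver
rbd map mypool/vm-disk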

Storing virtual disks

(Diagram: a VM running on a hypervisor does its disk I/O through LIBRBD, which stores the image in the RADOS cluster.)


RGW: HTTP object store

RADOSGW:
● REST-based object storage proxy
● Compatible with S3 and Swift applications
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing (example below)
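Users for the S3/Swift APIs are managed with radosgw-admin. A sketch (the uid and display name are made up):

# create a user; the output includes the S3 access and secret keys
radosgw-admin user create --uid=demo --display-name="Demo User"
# inspect the user, and show per-user usage (if usage logging is enabled)
radosgw-admin user info --uid=demo
radosgw-admin usage show --uid=demo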

RADOS Gateway

(Diagram: applications speak REST to RADOSGW instances, which use LIBRADOS to store their objects in the RADOS cluster.)


Ceph Filesystem (CephFS)

POSIX-compliant shared filesystem

Client:
● Userspace (FUSE) or kernel
● Looks like a local filesystem
● Sends data directly to RADOS

Metadata server:
● Filesystem metadata:
  – Directory hierarchy
  – Inode metadata (owner, timestamps, mode)
● Stores metadata in RADOS
● Does not serve file data to clients

Storing Data and Metadata

(Diagram: a Linux host using the kernel module stores both file data and filesystem metadata in the RADOS cluster.)

CephFS

● Advanced features:

– Subdirectory snapshots

– Recursive statistics

– Multiple metadata servers

● Coming soon:

– Online consistency checking

– Scalable repair/recovery tools


CephFS in practice

ceph-deploy mds create myserver

ceph osd pool create fs_data 64

ceph osd pool create fs_metadata 64

ceph fs new myfs fs_metadata fs_data

mount -t ceph x.x.x.x:6789:/ /mnt/ceph
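Alternatively, the userspace (FUSE) client can mount the same filesystem, assuming ceph.conf and a keyring are present on the client:

ceph-fuse -m x.x.x.x:6789 /mnt/ceph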


Beyond Replication: Erasure Coding and Cache Tiering

Cache Tiering

● Use one Ceph pool as a cache for another:
  – e.g. a flash pool acting as a cache in front of a spinning-disk pool
● Configurable policies for eviction based on capacity, object count, lifetime (settings below)
● Configurable mode:
  – writeback: all client I/O goes to the cache
  – readonly: client writes go to the backing pool, reads are served from the cache
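The eviction policies above correspond to per-pool settings on the cache pool. A hedged sketch for a cache pool named 'cache' (the figures are illustrative):

# cap the cache by size and object count
ceph osd pool set cache target_max_bytes 100000000000
ceph osd pool set cache target_max_objects 1000000
# start flushing dirty objects at 40% of the target, evict clean ones at 80%
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8
# keep objects at least 10 minutes before they can be flushed or evicted
ceph osd pool set cache cache_min_flush_age 600
ceph osd pool set cache cache_min_evict_age 600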

Erasure Coding

● Split objects into M data chunks and K parity chunks, with configurable M and K
● An alternative to replication, providing a different set of tradeoffs:
  – Consume less storage capacity
  – Consume less write bandwidth
  – Reads scattered across OSDs
  – Modifications are expensive
● Plugin interface for encoding schemes

Cache Tiering + EC

(Diagram: client I/O goes to a 3-way replicated pool ‘cache’ (200% overhead) running as a writeback cache in front of an erasure-coded pool ‘cold’ with chunks M1 M2 M3 K1 K2 (66% overhead).)

Example: Cache Tiering & EC

# ceph osd erasure-code-profile set ecdemo k=3 m=2

# ceph osd pool create cold 384 384 erasure ecdemo

# ceph osd pool create cache 384 384

# ceph osd tier add cold cache

# ceph osd tier cache-mode cache writeback

# ceph osd tier set-overlay cold cache


What's new?


Ceph 0.94 (Hammer)

Release timeline: Emperor → Firefly → Giant → Hammer → Infernalis → Jewel


RADOS

● Performance:

– more IOPS

– exploit flash backends

– exploit many-cored machines

● CRUSH straw2 algorithm:

– reduced data migration on changes

● Cache tiering:

– read performance, reduce unnecessary promotions


RBD

● Object maps:

– per-image metadata, identifies which extents are allocated.

– optimisation for clone/export/delete.

● Mandatory locking:

– prevent multiple clients writing to same image

● Copy-on-read:

– improve performance for some workloads


RGW

● S3 object versioning API

– when enabled, all objects maintain history

– GET ID+version to see an old version

● Bucket sharding

– spread bucket index across multiple RADOS objects

– avoid oversized OMAPs

– avoid hotspots

CephFS

● Diagnostics & health checks

● Journal recovery tools

● Initial online metadata scrub

● Refined ENOSPC handling

● Soft client quotas

● General hardening and resilience


Finally...


Get involved

Evaluate the latest releases:

http://ceph.com/resources/downloads/

Mailing list, IRC:

http://ceph.com/resources/mailing-list-irc/

Bugs:

http://tracker.ceph.com/projects/ceph/issues

Online developer summits:

https://wiki.ceph.com/Planning/CDS

Ceph Days

Ceph Day Berlin is next Tuesday!

http://ceph.com/cephdays/ceph-day-berlin/

Axica Convention Center

April 28, 2015
