
Datacenter Storage with Ceph

John Spray

john.spray@redhat.com
jcsp on #ceph-devel

OSDC.de 2015

Agenda

● What is Ceph?

● How does Ceph store your data?

● Interfaces to Ceph: RBD, RGW, CephFS

● Latest development updates


What is Ceph?

● Highly available, resilient data store

● Free Software (LGPL)

● 10 years since inception

● Flexible object, block and filesystem interfaces

● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service.


A general-purpose storage system

● You feed it commodity disks and ethernet

● In return, it gives your apps a storage service

● It doesn't lose your data

● It doesn't need babysitting

● It's portable


Interfaces to storage

● FILE SYSTEM (CephFS): POSIX, Linux kernel, CIFS/NFS, HDFS, distributed metadata
● BLOCK STORAGE (RBD): OpenStack, Linux kernel, iSCSI, clones, snapshots
● OBJECT STORAGE (RGW): S3 & Swift, multi-tenant, native API, Keystone, geo-replication


Ceph Architecture (how your data is stored)

Components

● RGW: a web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
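The RADOS layer can also be exercised directly, either through the librados bindings listed above or with the rados command-line tool. A minimal sketch with the CLI, assuming a pool named mypool already exists (the object and file names are made up):

# store a local file as an object, read it back, and list the pool's contents
rados -p mypool put greeting ./hello.txt
rados -p mypool get greeting ./hello-copy.txt
rados -p mypool ls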


RADOS

Reliable, Autonomous, Distributed Object Store

RADOS Components

OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
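A hedged sketch of how OSDs and monitors show up on a running cluster, using the standard status commands:

ceph -s              # overall health, monitor quorum, OSD up/in counts
ceph osd tree        # OSDs laid out in the CRUSH hierarchy, with up/down state
ceph mon stat        # monitor membership and current quorum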

Object Storage Daemons

(Diagram: each OSD daemon sits on top of a local filesystem (xfs, ext4 or btrfs) on its own disk; a small set of monitors (M) runs alongside the OSDs.)

RADOS Cluster

(Diagram: an application talks directly to the RADOS cluster, which is made up of OSDs plus a handful of monitors (M).)

Where do objects live?

(Diagram: an application holds an object and must decide where among the cluster's OSDs it should be stored: ??)

A Metadata Server?

(Diagram: one option is a central metadata server: the application first asks the server where the object lives (1), then accesses it there (2).)

Calculated placement

(Diagram: in the alternative approach, the application computes placement itself with a function F mapping each object name to a group of OSDs, e.g. A-G, H-N, O-T, U-Z.)

CRUSH: Dynamic data placement

● Pseudo-random placement algorithm
  – Fast calculation, no lookup (example below)
  – Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  – Limited data migration on change
● Rule-based configuration
  – Infrastructure topology aware
  – Adjustable replication
  – Weighting
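Because placement is computed rather than looked up, any client can ask where an object would land. For illustration (the pool and object names are made up):

# print the placement group and the set of OSDs CRUSH chooses for this object
ceph osd map mypool myobject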

CRUSH: Replication

(Diagram: incoming DATA is placed by CRUSH, and replicas of each object are written to several different OSDs across the RADOS cluster.)

CRUSH: Topology-aware placement

(Diagram: CRUSH spreads the replicas of each object across failure domains; here, copies land in both RACK A and RACK B.)

CRUSH is a quick calculation

(Diagram: the client computes where DATA lives in the RADOS cluster on the fly and talks to the right OSDs directly; no lookup table is consulted.)

CRUSH rules

● Simple language
● Specify which copies go where: across racks, servers, datacenters, disk types (CLI example below)

rule <rule name> {
  ruleset <ruleset name>
  type <replicated | erasure>
  step take <bucket-type>
  step [choose|chooseleaf] [firstn|indep] <N> <type>
  step emit
}
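Simple rules can also be created from the CLI instead of editing the CRUSH map text by hand. A sketch, assuming the default CRUSH root and a 'rack' level in the hierarchy (the rule name, pool name and ruleset id are illustrative):

# create a replicated rule whose leaves are chosen from separate racks
ceph osd crush rule create-simple per-rack-rule default rack
# list rules and their ruleset ids, then point a pool at the new one
ceph osd crush rule dump
ceph osd pool set mypool crush_ruleset 1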

Pools and Placement Groups

● Trick: apply CRUSH placement to a fixed number of placement groups instead of to N objects.
● Manage recovery/backfill at PG granularity: less per-object metadata.
● Typically a few hundred PGs per OSD (example below)
● A pool is a logical collection of PGs, using a particular CRUSH rule
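For illustration, creating a pool with an explicit PG count (the pool name and the figure of 128 are assumptions; size pg_num to your OSD count):

# create a replicated pool with 128 placement groups
ceph osd pool create mypool 128
# inspect the PG count, and raise it later if the cluster grows
ceph osd pool get mypool pg_num
ceph osd pool set mypool pg_num 256
ceph osd pool set mypool pgp_num 256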

Recovering from failures

● OSDs notice when their peers stop responding and report this to the monitors
● The monitors decide that an OSD is now “down”
● Peers continue to serve data, but it is in a degraded state
● After some time, the monitors mark the OSD “out”
● New peers are selected by CRUSH and the data is re-replicated across the whole cluster
● Faster than a RAID rebuild, because the load is shared
● Does not require administrator intervention (though see the maintenance sketch below)
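Because all of this is automatic, the main time an administrator steps in is planned maintenance. A small sketch:

# stop the monitors marking OSDs "out" while a host is down for maintenance
ceph osd set noout
# ...reboot or service the host, then restore normal behaviour
ceph osd unset noout
# watch the cluster recover and the degraded object count fall
ceph -w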

RADOS advanced features

● Not just puts and gets!
● More feature-rich than typical object stores
● Partial object updates & appends (examples below)
● Key-value stores (OMAPs) within objects
● Copy-on-write snapshots
● Watch/notify for pushing events
● Extensible with Object Classes: perform arbitrary transactions on OSDs.
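Some of these features are visible from the rados tool as well as through librados. A hedged sketch against a hypothetical pool and object:

# append to an existing object instead of rewriting it
rados -p mypool append myobject ./more-data.bin
# attach key/value pairs (OMAP) to the same object, then read them back
rados -p mypool setomapval myobject colour blue
rados -p mypool listomapvals myobject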

Choosing hardware

● Cheap hardware mitigates the cost of replication
● OSD data journalling: a separate SSD is useful, approx. 1 SSD for every 4 OSD disks
● OSDs are more CPU/RAM intensive than legacy storage: approx. 8 or so per host
● Many cheaper servers are better than a few expensive ones: distribute the load of rebalancing
● Consider your bandwidth/capacity ratio and your read/write ratio


Interfaces to applications: RGW, RBD, and CephFS


RBD: Virtual disks in Ceph

RADOS Block Device:
● Storage of disk images in RADOS
● Decouples VMs from hosts
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones (example below)
● Support in:
  – Mainline Linux kernel (2.6.39+)
  – Qemu/KVM
  – OpenStack, CloudStack, OpenNebula, Proxmox
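A minimal sketch of images, snapshots and clones from the rbd command line (the pool, image and snapshot names are made up; sizes are in MB):

# create a 10 GB format-2 image (format 2 is required for cloning)
rbd create mypool/vm-disk --size 10240 --image-format 2
# snapshot it, protect the snapshot, and take a copy-on-write clone
rbd snap create mypool/vm-disk@golden
rbd snap protect mypool/vm-disk@golden
rbd clone mypool/vm-disk@golden mypool/vm-disk-clone
# expose the image as a block device via the mainline kernel driver
rbd map mypool/vm-disk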

Storing virtual disks

(Diagram: a VM running on a hypervisor does its disk I/O through LIBRBD, which stores the image in the RADOS cluster.)


RGW: HTTP object store

RADOSGW:
● REST-based object storage proxy
● Compatible with S3 and Swift applications
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing (example below)
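Users for the S3/Swift APIs are managed with radosgw-admin. A sketch (the uid and display name are made up):

# create a user; the output includes the S3 access and secret keys
radosgw-admin user create --uid=demo --display-name="Demo User"
# inspect the user, and show per-user usage (if usage logging is enabled)
radosgw-admin user info --uid=demo
radosgw-admin usage show --uid=demo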

RADOS Gateway

(Diagram: applications speak REST to RADOSGW instances, which use LIBRADOS to store their objects in the RADOS cluster.)


Ceph Filesystem (CephFS)

POSIX-compliant shared filesystem

Client:
● Userspace (FUSE) or kernel
● Looks like a local filesystem
● Sends data directly to RADOS

Metadata server:
● Filesystem metadata:
  – Directory hierarchy
  – Inode metadata (owner, timestamps, mode)
● Stores metadata in RADOS
● Does not serve file data to clients

Storing Data and Metadata

(Diagram: a Linux host using the kernel module stores both file data and filesystem metadata in the RADOS cluster.)

CephFS

● Advanced features:

– Subdirectory snapshots

– Recursive statistics

– Multiple metadata servers

● Coming soon:

– Online consistency checking

– Scalable repair/recovery tools


CephFS in practice

ceph-deploy mds create myserver

ceph osd pool create fs_data 64

ceph osd pool create fs_metadata 64

ceph fs new myfs fs_metadata fs_data

mount -t ceph x.x.x.x:6789:/ /mnt/ceph
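Alternatively, the userspace (FUSE) client can mount the same filesystem, assuming ceph.conf and a keyring are present on the client:

ceph-fuse -m x.x.x.x:6789 /mnt/ceph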


Beyond Replication: Erasure Coding and Cache Tiering

Cache Tiering

● Use one Ceph pool as a cache for another:
  – e.g. a flash pool acting as a cache in front of a spinning-disk pool
● Configurable policies for eviction based on capacity, object count, lifetime (settings below)
● Configurable mode:
  – writeback: all client I/O goes to the cache
  – readonly: client writes go to the backing pool, reads are served from the cache
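The eviction policies above correspond to per-pool settings on the cache pool. A hedged sketch for a cache pool named 'cache' (the figures are illustrative):

# cap the cache by size and object count
ceph osd pool set cache target_max_bytes 100000000000
ceph osd pool set cache target_max_objects 1000000
# start flushing dirty objects at 40% of the target, evict clean ones at 80%
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8
# keep objects at least 10 minutes before they can be flushed or evicted
ceph osd pool set cache cache_min_flush_age 600
ceph osd pool set cache cache_min_evict_age 600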

Erasure Coding

● Split objects into M data chunks and K parity chunks, with configurable M and K
● An alternative to replication, providing a different set of tradeoffs:
  – Consume less storage capacity
  – Consume less write bandwidth
  – Reads scattered across OSDs
  – Modifications are expensive
● Plugin interface for encoding schemes

Cache Tiering + EC

(Diagram: client I/O goes to a 3-way replicated pool ‘cache’ (200% overhead) running as a writeback cache in front of an erasure-coded pool ‘cold’ with chunks M1 M2 M3 K1 K2 (66% overhead).)

Example: Cache Tiering & EC

# ceph osd erasure-code-profile set ecdemo k=3 m=2

# ceph osd pool create cold 384 384 erasure ecdemo

# ceph osd pool create cache 384 384

# ceph osd tier add cold cache

# ceph osd tier cache-mode cache writeback

# ceph osd tier set-overlay cold cache


What's new?


Ceph 0.94 (Hammer)

Release timeline: Emperor → Firefly → Giant → Hammer → Infernalis → Jewel


RADOS

● Performance:

– more IOPS

– exploit flash backends

– exploit many-cored machines

● CRUSH straw2 algorithm:

– reduced data migration on changes

● Cache tiering:

– read performance, reduce unnecessary promotions


RBD

● Object maps:

– per-image metadata, identifies which extents are allocated.

– optimisation for clone/export/delete.

● Mandatory locking:

– prevent multiple clients writing to same image

● Copy-on-read:

– improve performance for some workloads


RGW

● S3 object versioning API

– when enabled, all objects maintain history

– GET ID+version to see an old version

● Bucket sharding

– spread bucket index across multiple RADOS objects

– avoid oversized OMAPs

– avoid hotspots

CephFS

● Diagnostics & health checks

● Journal recovery tools

● Initial online metadata scrub

● Refined ENOSPC handling

● Soft client quotas

● General hardening and resilience


Finally...


Get involved

Evaluate the latest releases:

http://ceph.com/resources/downloads/

Mailing list, IRC:

http://ceph.com/resources/mailing-list-irc/

Bugs:

http://tracker.ceph.com/projects/ceph/issues

Online developer summits:

https://wiki.ceph.com/Planning/CDS

Ceph Days

Ceph Day Berlin is next Tuesday!

http://ceph.com/cephdays/ceph-day-berlin/

Axica Convention Center

April 28, 2015
