Page 1: OSDC 2015: John Spray | The Ceph Storage System

Datacenter Storage with Ceph

John Spray

[email protected]
jcsp on #ceph-devel

Page 2: OSDC 2015: John Spray | The Ceph Storage System


Agenda

● What is Ceph?

● How does Ceph store your data?

● Interfaces to Ceph: RBD, RGW, CephFS

● Latest development updates

Page 3: OSDC 2015: John Spray | The Ceph Storage System


What is Ceph?

● Highly available, resilient data store

● Free Software (LGPL)

● 10 years since inception

● Flexible object, block and filesystem interfaces

● Especially popular in private clouds as a VM image service and an S3-compatible object storage service.

Page 4: OSDC 2015: John Spray | The Ceph Storage System


A general purpose storage system

● You feed it commodity disks and Ethernet

● In return, it gives your apps a storage service

● It doesn't lose your data

● It doesn't need babysitting

● It's portable

Page 5: OSDC 2015: John Spray | The Ceph Storage System


Interfaces to storage

OBJECT STORAGE (RGW): Native API, Multi-tenant, S3 & Swift, Keystone, Geo-Replication

BLOCK STORAGE (RBD): OpenStack, Linux Kernel, iSCSI, Clones, Snapshots

FILE SYSTEM (CephFS): POSIX, Linux Kernel, CIFS/NFS, HDFS, Distributed Metadata

Page 6: OSDC 2015: John Spray | The Ceph Storage System


Ceph Architecture (how your data is stored)

Page 7: OSDC 2015: John Spray | The Ceph Storage System

Components

● RGW: A web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram: applications, hosts/VMs, and clients sit above these components.)

Page 8: OSDC 2015: John Spray | The Ceph Storage System

(Architectural components diagram, as on Page 7.)

Page 9: OSDC 2015: John Spray | The Ceph Storage System

RADOS

Reliable, Autonomous, Distributed Object Store

Page 10: OSDC 2015: John Spray | The Ceph Storage System

RADOS Components

OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● These do not serve stored objects to clients

Page 11: OSDC 2015: John Spray | The Ceph Storage System

Object Storage Daemons

(Diagram: each OSD daemon manages one disk through a local filesystem (xfs, ext4, or btrfs); a small, odd number of monitors run alongside.)

Page 12: OSDC 2015: John Spray | The Ceph Storage System

RADOS Cluster

(Diagram: an application communicating with a RADOS cluster of OSDs and monitors.)

Page 13: OSDC 2015: John Spray | The Ceph Storage System

Where do objects live?

(Diagram: an application holds an object; which OSDs in the cluster should store it?)

Page 14: OSDC 2015: John Spray | The Ceph Storage System

A Metadata Server?

(Diagram: one option is a two-step lookup, asking a central metadata server (1) before reaching the storage nodes (2).)

Page 15: OSDC 2015: John Spray | The Ceph Storage System

Calculated placement

(Diagram: instead, the application applies a function F to calculate placement itself, e.g. mapping object names into the ranges A-G, H-N, O-T, U-Z.)

Page 16: OSDC 2015: John Spray | The Ceph Storage System

CRUSH: Dynamic data placement

● Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting

Page 17: OSDC 2015: John Spray | The Ceph Storage System

CRUSH: Replication

(Diagram: incoming DATA is written to the RADOS cluster and replicated onto several OSDs.)

Page 18: OSDC 2015: John Spray | The Ceph Storage System

CRUSH: Topology-aware placement

(Diagram: CRUSH places the replicas in separate failure domains, here RACK A and RACK B.)

Page 19: OSDC 2015: John Spray | The Ceph Storage System

CRUSH is a quick calculation

(Diagram: the client computes where DATA belongs in the RADOS cluster directly; no lookup service is involved.)

Page 20: OSDC 2015: John Spray | The Ceph Storage System

CRUSH rules

● Simple language
● Specify which copies go where (across racks, servers, datacenters, disk types)

rule <rule name> {
  ruleset <ruleset name>
  type <replicated | erasure>
  step take <bucket-type>
  step [choose|chooseleaf] [firstn|indep] <N> <type>
  step emit
}

Page 21: OSDC 2015: John Spray | The Ceph Storage System

Pools and Placement Groups

● Trick: apply CRUSH placement to a fixed number of placement groups (PGs) instead of to N objects (see the sketch below).
● Manage recovery/backfill at PG granularity: less per-object metadata.
● Typically a few hundred PGs per OSD
● A pool is a logical collection of PGs, using a particular CRUSH rule
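To make that indirection concrete, here is a tiny conceptual sketch in Python. It is not Ceph's real implementation (Ceph hashes with rjenkins and uses a "stable mod" on pg_num); the CRC32 hash and the function name are illustrative assumptions only.

import zlib

def pg_for(object_name, pg_num):
    # Map an object name onto one of a fixed number of placement groups.
    # Conceptual stand-in for Ceph's own hashing.
    return zlib.crc32(object_name.encode("utf-8")) % pg_num

# CRUSH then maps the PG id (not the individual object) to an ordered set
# of OSDs, so recovery and rebalancing are tracked per PG, not per object.
print(pg_for("vm-disk-0001", 2048))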

Page 22: OSDC 2015: John Spray | The Ceph Storage System

Recovering from failures

● OSDs notice when their peers stop responding, and report this to the monitors
● The monitors decide that an OSD is now "down"
● Peers continue to serve data, but it is in a degraded state
● After some time, the monitors mark the OSD "out"
● New peers are selected by CRUSH, and data is re-replicated across the whole cluster
● Faster than a RAID rebuild, because the load is shared
● Does not require administrator intervention
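Recovery needs no operator action, but it can be watched. A minimal sketch, assuming a reachable cluster, the python-rados bindings, and the usual /etc/ceph/ceph.conf, that asks the monitors for health state (the same information the ceph health command reports):

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# JSON-formatted monitor command, equivalent to 'ceph health --format json'.
ret, outbuf, errs = cluster.mon_command(
    json.dumps({'prefix': 'health', 'format': 'json'}), b'')
print(json.loads(outbuf))

cluster.shutdown()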

Page 23: OSDC 2015: John Spray | The Ceph Storage System

RADOS advanced features

● Not just puts and gets!

● More feature-rich than typical object stores
● Partial object updates & appends (sketched below via librados)
● Key-value stores (OMAPs) within objects
● Copy-on-write snapshots
● Watch/notify for pushing events
● Extensible with Object Classes: perform arbitrary transactions on OSDs.
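A minimal librados sketch of those richer write semantics, using the Python bindings listed on the components slide; the pool name 'demo', the object name, and the config path are assumptions:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('demo')          # assumes a pool named 'demo' exists

ioctx.write_full('greeting', b'hello')      # create/overwrite the whole object
ioctx.append('greeting', b', world')        # append without rewriting
ioctx.write('greeting', b'H', 0)            # partial update at offset 0
print(ioctx.read('greeting'))               # b'Hello, world'
ioctx.set_xattr('greeting', 'lang', b'en')  # per-object attribute

ioctx.close()
cluster.shutdown()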

Page 24: OSDC 2015: John Spray | The Ceph Storage System

Choosing hardware

● Cheap hardware mitigates the cost of replication
● OSD data journalling: a separate SSD is useful. Approximately 1 SSD for every 4 OSD disks.
● OSDs are more CPU/RAM intensive than legacy storage: approximately 8 or so per host.
● Many cheaper servers are better than a few expensive ones: distribute the load of rebalancing.
● Consider your bandwidth/capacity ratio and your read/write ratio

Page 25: OSDC 2015: John Spray | The Ceph Storage System


Interfaces to applications: RGW, RBD, and CephFS

Page 26: OSDC 2015: John Spray | The Ceph Storage System

(Architectural components diagram, as on Page 7.)

Page 27: OSDC 2015: John Spray | The Ceph Storage System

RBD: Virtual disks in Ceph

RADOS Block Device:
● Storage of disk images in RADOS
● Decouples VMs from hosts
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in:
– Mainline Linux kernel (2.6.39+)
– Qemu/KVM
– OpenStack, CloudStack, OpenNebula, Proxmox
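As an illustration of the librbd API via its Python bindings (a sketch only: the pool name 'rbd', the image name, and the size are assumptions):

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')               # assumes the default 'rbd' pool

rbd.RBD().create(ioctx, 'vm-disk-0001', 10 * 1024**3)   # 10 GiB image
image = rbd.Image(ioctx, 'vm-disk-0001')
image.write(b'\0' * 4096, 0)                    # write the first 4 KiB
image.create_snap('clean-install')              # point-in-time snapshot
image.close()

ioctx.close()
cluster.shutdown()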

Page 28: OSDC 2015: John Spray | The Ceph Storage System

Storing virtual disks

(Diagram: a VM on a hypervisor accesses its disk image through LIBRBD, which talks to the RADOS cluster.)

Page 29: OSDC 2015: John Spray | The Ceph Storage System

(Architectural components diagram, as on Page 7.)

Page 30: OSDC 2015: John Spray | The Ceph Storage System

RGW: HTTP object store

RADOSGW:
● REST-based object storage proxy
● Compatible with S3 and Swift applications
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
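Because RGW speaks the S3 protocol, ordinary S3 clients work against it. A minimal sketch with the Python boto library; the endpoint host, credentials, and bucket name below are placeholders, not values from the talk:

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',            # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com',                    # placeholder RGW endpoint
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('stored via RGW, kept in RADOS')
print([k.name for k in bucket.list()])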

Page 31: OSDC 2015: John Spray | The Ceph Storage System

RADOS Gateway

(Diagram: applications speak REST to one or more RADOSGW instances, each of which uses LIBRADOS to talk to the RADOS cluster.)

Page 32: OSDC 2015: John Spray | The Ceph Storage System

(Architectural components diagram, as on Page 7.)

Page 33: OSDC 2015: John Spray | The Ceph Storage System

Ceph Filesystem (CephFS)

POSIX-compliant shared filesystem

Client:
● Userspace (FUSE) or kernel
● Looks like a local filesystem
● Sends data directly to RADOS

Metadata server (MDS):
● Manages filesystem metadata:
– Directory hierarchy
– Inode metadata (owner, timestamps, mode)
● Stores metadata in RADOS
● Does not serve file data to clients

Page 34: OSDC 2015: John Spray | The Ceph Storage System

Storing Data and Metadata

(Diagram: a Linux host using the CephFS kernel module, with file data and metadata ending up in the RADOS cluster.)

Page 35: OSDC 2015: John Spray | The Ceph Storage System

CephFS

● Advanced features:

– Subdirectory snapshots

– Recursive statistics

– Multiple metadata servers

● Coming soon:

– Online consistency checking

– Scalable repair/recovery tools

Page 36: OSDC 2015: John Spray | The Ceph Storage System


CephFS in practice

ceph-deploy mds create myserver

ceph osd pool create fs_data <pg_num>

ceph osd pool create fs_metadata <pg_num>

ceph fs new myfs fs_metadata fs_data

mount -t ceph x.x.x.x:6789:/ /mnt/ceph

Page 37: OSDC 2015: John Spray | The Ceph Storage System


Beyond Replication: Erasure Coding and Cache Tiering

Page 38: OSDC 2015: John Spray | The Ceph Storage System

Cache Tiering

● Use one Ceph pool as a cache for another:
– e.g. a flash pool in front of a spinning-disk pool
● Configurable policies for eviction based on capacity, object count, lifetime.
● Configurable mode:
– writeback: all client I/O goes to the cache
– readonly: client writes go to the backing pool, reads come from the cache

Page 39: OSDC 2015: John Spray | The Ceph Storage System

Erasure Coding

● Split objects into K data chunks and M parity chunks, with configurable K and M
● An alternative to replication, providing a different set of tradeoffs:
– Consumes less storage capacity (overhead arithmetic sketched below)
– Consumes less write bandwidth
– Reads are scattered across OSDs
– Modifications are expensive
● Plugin interface for encoding schemes
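The capacity tradeoff is easy to quantify. A small arithmetic sketch in Python, using the same k=3, m=2 profile as the example a few slides on:

def replication_overhead(replicas):
    # Extra raw capacity beyond the user data: 3 replicas -> 2.0 (200%).
    return float(replicas - 1)

def ec_overhead(k, m):
    # k data chunks plus m parity chunks: k=3, m=2 -> ~0.67 (~66%).
    return float(m) / k

print(replication_overhead(3))   # 2.0
print(ec_overhead(3, 2))         # 0.666...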

Page 40: OSDC 2015: John Spray | The Ceph Storage System

Cache Tiering + EC

(Diagram: client I/O goes to a replicated pool 'cache' (three replicas, 200% overhead) configured as a writeback cache in front of an erasure-coded pool 'cold' (three data chunks plus two parity chunks, 66% overhead).)

Page 41: OSDC 2015: John Spray | The Ceph Storage System

Example: Cache Tiering & EC

# ceph osd erasure-code-profile set ecdemo k=3 m=2

# ceph osd pool create cold 384 384 erasure ecdemo

# ceph osd pool create cache 384 384

# ceph osd tier add cold cache

# ceph osd tier cache-mode cache writeback

# ceph osd tier set-overlay cold cache

Page 42: OSDC 2015: John Spray | The Ceph Storage System


What's new?

Page 43: OSDC 2015: John Spray | The Ceph Storage System


Ceph 0.94 "Hammer"

Release timeline: Emperor, Firefly, Giant, Hammer, Infernalis, Jewel

Page 44: OSDC 2015: John Spray | The Ceph Storage System


RADOS

● Performance:

– more IOPs

– exploit flash backends

– exploit many-cored machines

● CRUSH straw2 algorithm:

– reduced data migration on changes

● Cache tiering:

– read performance, reduce unnecessary promotions

Page 45: OSDC 2015: John Spray | The Ceph Storage System


RBD

● Object maps:

– per-image metadata, identifies which extents are allocated.

– optimisation for clone/export/delete.

● Mandatory locking:

– prevent multiple clients writing to same image

● Copy-on-read:

– improve performance for some workloads

Page 46: OSDC 2015: John Spray | The Ceph Storage System


RGW

● S3 object versioning API

– when enabled, all objects maintain history

– GET ID+version to see an old version

● Bucket sharding

– spread bucket index across multiple RADOS objects

– avoid oversized OMAPs

– avoid hotspots

Page 47: OSDC 2015: John Spray | The Ceph Storage System

CephFS

● Diagnostics & health checks

● Journal recovery tools

● Initial online metadata scrub

● Refined ENOSPC handling

● Soft client quotas

● General hardening and resilience

Page 48: OSDC 2015: John Spray | The Ceph Storage System


Finally...

Page 49: OSDC 2015: John Spray | The Ceph Storage System


Get involved

Evaluate the latest releases:

http://ceph.com/resources/downloads/

Mailing list, IRC:

http://ceph.com/resources/mailing-list-irc/

Bugs:

http://tracker.ceph.com/projects/ceph/issues

Online developer summits:

https://wiki.ceph.com/Planning/CDS

Page 50: OSDC 2015: John Spray | The Ceph Storage System

Ceph Days

Ceph Day Berlin is next Tuesday!

http://ceph.com/cephdays/ceph-day-berlin/

Axica Convention Center

April 28 2015

Page 51: OSDC 2015: John Spray | The Ceph Storage System

Datacenter Storage with Ceph OSDC.de 2015