Jul 15, 2015
Datacenter Storage with Ceph OSDC.de 2015
Agenda
● What is Ceph?
● How does Ceph store your data?
● Interfaces to Ceph: RBD, RGW, CephFS
● Latest development updates
What is Ceph?
● Highly available, resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service
A general purpose storage system
● You feed it commodity disks and Ethernet
● In return, it gives your apps a storage service
● It doesn't lose your data
● It doesn't need babysitting
● It's portable
Interfaces to storage
OBJECT STORAGE (RGW)
● Native API
● Multi-tenant
● S3 & Swift
● Keystone
● Geo-Replication

BLOCK STORAGE (RBD)
● OpenStack
● Linux Kernel
● iSCSI
● Clones
● Snapshots

FILE SYSTEM (CephFS)
● Linux Kernel
● CIFS/NFS
● HDFS
● Distributed Metadata
● POSIX
Ceph Architecture (how your data is stored)
Components
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
RADOS
Reliable, Autonomous, Distributed Object Store
RADOS Components
OSDs:
● 10s to 10,000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
Object Storage Daemons
[diagram: each OSD runs on top of a local filesystem (xfs, ext4, or btrfs) on its own disk; monitors (M) run alongside]
RADOS Cluster
[diagram: an application talks directly to a RADOS cluster of OSDs and monitors (M)]
Where do objects live?
[diagram: an application holds an object — which OSD should receive it?]

A Metadata Server?
[diagram: option 1 — ask a central metadata server first, then write (an extra round trip and a bottleneck)]

Calculated placement
[diagram: option 2 — the client computes the location itself, e.g. from static ranges A-G, H-N, O-T, U-Z; no lookup needed]
CRUSH: Dynamic data placement
Pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
● Limited data migration on change
● Rule-based configuration
● Infrastructure topology aware
● Adjustable replication
● Weighting
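The lookup-free idea can be illustrated with a toy sketch (this is rendezvous-style hashing, not the real CRUSH algorithm; all names are made up): every client hashes the object name against the OSD list and deterministically arrives at the same replica set, with no central table to consult.

```python
import hashlib

def place(obj_name: str, osds: list, replicas: int = 3) -> list:
    """Toy stand-in for CRUSH (not the real algorithm): rank OSDs by a
    per-object hash and take the first `replicas`. Deterministic, so any
    client computes the same answer with no lookup."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"{obj_name}/{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(8)]
# Every client computes the same placement for the same object name.
assert place("vm-image-42", osds) == place("vm-image-42", osds)
```

Real CRUSH adds topology awareness, weighting, and rules on top of this basic "calculate, don't look up" principle.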
CRUSH: Replication
[diagram: an object written to the cluster is replicated by CRUSH onto several OSDs]
CRUSH: Topology-aware placement
[diagram: replicas are placed in different failure domains, e.g. RACK A and RACK B]
CRUSH is a quick calculation
[diagram: the client computes the placement itself and sends data straight to the right OSDs]
CRUSH rules
● Simple language
● Specify which copies go where (across racks, servers, datacenters, disk types)

rule <rule name> {
    ruleset <ruleset name>
    type <replicated | erasure>
    step take <bucket type>
    step [choose|chooseleaf] [firstn|indep] <N> <type>
    step emit
}
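As an illustration of that syntax, a replicated rule that spreads each copy across distinct racks could look like the following (a sketch; the rule name, ruleset number, and bucket names are made up, and `firstn 0` means "as many copies as the pool's size"):

```
rule replicated_racks {
    ruleset 1
    type replicated
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```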
Pools and Placement Groups
● Trick: apply CRUSH placement to a fixed number of placement groups (PGs) instead of to N objects
● Manage recovery/backfill at PG granularity: less per-object metadata
● Typically a few hundred PGs per OSD
● A pool is a logical collection of PGs, using a particular CRUSH rule
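The PG trick itself is just hashing into a fixed number of buckets. A minimal sketch (not Ceph's actual hash; `NUM_PGS` is an illustrative value):

```python
import hashlib

NUM_PGS = 128  # fixed per pool, chosen at pool creation

def pg_of(obj_name: str) -> int:
    """Bucket an object into one of a fixed number of placement groups."""
    h = int(hashlib.sha256(obj_name.encode()).hexdigest(), 16)
    return h % NUM_PGS

# Millions of objects collapse onto NUM_PGS placement targets, so the
# cluster tracks placement per PG rather than per object.
pgs = {pg_of(f"object-{i}") for i in range(10_000)}
assert pgs <= set(range(NUM_PGS))
```

CRUSH then maps each PG (not each object) to a set of OSDs, which is why recovery and backfill can be managed at PG granularity.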
Recovering from failures
● OSDs notice when their peers stop responding, and report this to the monitors
● The monitors decide that an OSD is now "down"
● Peers continue to serve the data, but it is in a degraded state
● After some time, the monitors mark the OSD "out"
● New peers are selected by CRUSH, and the data is re-replicated across the whole cluster
● Faster than a RAID rebuild, because the load is shared
● Does not require administrator intervention
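The "limited data migration" property from the CRUSH slide is what makes this recovery cheap: marking one OSD out only remaps the PGs that had a copy on it. A toy sketch under the same simplified hashing assumption as before (not real CRUSH; names are made up):

```python
import hashlib

def place(pg: int, osds: list, replicas: int = 3) -> list:
    """Toy deterministic placement (not real CRUSH): rank OSDs by a
    per-PG hash and take the first `replicas`."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"pg{pg}/{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(10)]
before = {pg: place(pg, osds) for pg in range(64)}

# Mark osd.3 "out": recompute placement without it.
survivors = [o for o in osds if o != "osd.3"]
after = {pg: place(pg, survivors) for pg in range(64)}

# Only PGs that had a copy on osd.3 get new peers; the rest are untouched.
moved = [pg for pg in range(64) if before[pg] != after[pg]]
assert all("osd.3" in before[pg] for pg in moved)
```

Because the moved PGs land on many different surviving OSDs, the re-replication load is spread across the cluster rather than hitting one spare disk, which is why this beats a RAID rebuild.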
RADOS advanced features
● Not just puts and gets!
● More feature-rich than typical object stores
● Partial object updates & appends
● Key-value stores (OMAPs) within objects
● Copy-on-write snapshots
● Watch/notify for pushing events
● Extensible with Object Classes: perform arbitrary transactions on OSDs.
Choosing hardware
● Cheap hardware mitigates the cost of replication
● OSD data journalling: a separate SSD is useful; approx. 1 SSD for every 4 OSD disks
● OSDs are more CPU/RAM intensive than legacy storage: approx. 8 per host
● Many cheaper servers are better than a few expensive ones: distribute the load of rebalancing
● Consider your bandwidth/capacity ratio and your read/write ratio
Interfaces to applications: RGW, RBD, and CephFS
RBD: Virtual disks in Ceph
RADOS Block Device:
● Storage of disk images in RADOS
● Decouples VMs from hosts
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in:
– Mainline Linux Kernel (2.6.39+)
– Qemu/KVM
– OpenStack, CloudStack, OpenNebula, Proxmox
Storing virtual disks
[diagram: a VM on a hypervisor uses LIBRBD to store its disk image in the RADOS cluster]
RGW: HTTP object store
RADOSGW:
● REST-based object storage proxy
● Compatible with S3 and Swift applications
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
RADOS Gateway
[diagram: applications speak REST to RADOSGW instances, which use LIBRADOS to reach the RADOS cluster; native applications can use LIBRADOS directly]
Ceph Filesystem (CephFS)
POSIX-compliant shared filesystem
Client:
● Userspace (FUSE) or kernel
● Looks like a local filesystem
● Sends data directly to RADOS

Metadata server:
● Filesystem metadata:
– Directory hierarchy
– Inode metadata (owner, timestamps, mode)
● Stores metadata in RADOS
● Does not serve file data to clients
Storing Data and Metadata
[diagram: a Linux host's kernel module sends file data directly to the RADOS cluster, and metadata operations to the metadata server]
CephFS
● Advanced features:
– Subdirectory snapshots
– Recursive statistics
– Multiple metadata servers
● Coming soon:
– Online consistency checking
– Scalable repair/recovery tools
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 64
ceph osd pool create fs_metadata 64
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
Beyond Replication: Erasure Coding and Cache Tiering
Cache Tiering
● Use one Ceph pool as a cache for another:
– e.g. a flash cache in front of a spinning-disk pool
● Configurable eviction policies based on capacity, object count, lifetime
● Configurable modes:
– writeback: all client I/O goes to the cache
– readonly: clients write to the backing pool, but read from the cache
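The writeback mode can be sketched in a few lines (a toy model, not Ceph's implementation; the capacity-based eviction stands in for the configurable policies above):

```python
from collections import OrderedDict

class WritebackCache:
    """Toy sketch of a writeback cache tier (not Ceph's implementation):
    all client I/O hits the cache; the coldest objects are flushed to
    the backing pool when the cache exceeds its capacity."""

    def __init__(self, backing: dict, capacity: int):
        self.backing = backing        # stands in for the 'cold' pool
        self.cache = OrderedDict()    # stands in for the 'cache' pool
        self.capacity = capacity

    def write(self, name: str, data: bytes):
        self.cache[name] = data
        self.cache.move_to_end(name)  # mark most recently used
        while len(self.cache) > self.capacity:
            victim, blob = self.cache.popitem(last=False)
            self.backing[victim] = blob  # flush the coldest object down

    def read(self, name: str) -> bytes:
        if name not in self.cache:    # promote on miss
            self.write(name, self.backing.pop(name))
        self.cache.move_to_end(name)
        return self.cache[name]

cold = {}
tier = WritebackCache(cold, capacity=2)
tier.write("a", b"1"); tier.write("b", b"2"); tier.write("c", b"3")
assert "a" in cold             # evicted to the backing pool
assert tier.read("a") == b"1"  # promoted back on read
```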
Erasure Coding
● Split objects into K data chunks and M coding (parity) chunks, with configurable K and M
● An alternative to replication, providing a different set of tradeoffs:
– Consumes less storage capacity
– Consumes less write bandwidth
– Reads are scattered across OSDs
– Modifications are expensive
● Plugin interface for encoding schemes
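The capacity tradeoff is simple arithmetic: n-way replication costs (n-1)x extra space, while a K+M erasure code costs only M/K extra. This matches the overheads in the next diagram:

```python
def replication_overhead(n: int) -> float:
    """Extra capacity consumed beyond the data itself, as a fraction."""
    return float(n - 1)

def ec_overhead(k: int, m: int) -> float:
    """Erasure coding stores k data + m coding chunks for k chunks of data."""
    return m / k

assert replication_overhead(3) == 2.0       # 3 replicas -> 200% overhead
assert round(ec_overhead(3, 2), 2) == 0.67  # k=3, m=2 -> ~66% overhead
```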
Cache Tiering + EC
[diagram: client I/O goes to a 3-replica pool 'cache' (writeback cache policy, 200% overhead), backed by an erasure-coded pool 'cold' with 3 data + 2 coding chunks (66% overhead)]
Example: Cache Tiering & EC
# ceph osd erasure-code-profile set ecdemo k=3 m=2
# ceph osd pool create cold 384 384 erasure ecdemo
# ceph osd pool create cache 384 384
# ceph osd tier add cold cache
# ceph osd tier cache-mode cache writeback
# ceph osd tier set-overlay cold cache
What's new?
Ceph 0.94 "Hammer"
Emperor → Firefly → Giant → Hammer (0.94) → Infernalis → Jewel
RADOS
● Performance:
– more IOPs
– exploit flash backends
– exploit many-cored machines
● CRUSH straw2 algorithm:
– reduced data migration on changes
● Cache tiering:
– read performance, reduce unnecessary promotions
RBD
● Object maps:
– per-image metadata, identifies which extents are allocated.
– optimisation for clone/export/delete.
● Mandatory locking:
– prevent multiple clients writing to same image
● Copy-on-read:
– improve performance for some workloads
RGW
● S3 object versioning API
– when enabled, all objects maintain history
– GET ID+version to see an old version
● Bucket sharding
– spread bucket index across multiple RADOS objects
– avoid oversized OMAPs
– avoid hotspots
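The sharding idea can be sketched as follows (a toy model, not RGW's actual scheme; `NUM_SHARDS` is an illustrative value): hashing each object key onto one of N index shards bounds the size of any single index object and spreads index writes across OSDs.

```python
import hashlib

NUM_SHARDS = 8  # illustrative; RGW makes this configurable per bucket

def index_shard(bucket: str, key: str) -> int:
    """Toy sketch (not RGW's actual scheme): hash the object key to pick
    which index shard records it, so no single RADOS object holds the
    whole bucket listing."""
    h = int(hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest(), 16)
    return h % NUM_SHARDS

# A bucket listing is the merge of all shards; writes spread across them.
shards = [index_shard("photos", f"img-{i}.jpg") for i in range(1000)]
assert set(shards) <= set(range(NUM_SHARDS))
```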
CephFS
● Diagnostics & health checks
● Journal recovery tools
● Initial online metadata scrub
● Refined ENOSPC handling
● Soft client quotas
● General hardening and resilience
Finally...
Get involved
Evaluate the latest releases:
http://ceph.com/resources/downloads/
Mailing list, IRC:
http://ceph.com/resources/mailing-list-irc/
Bugs:
http://tracker.ceph.com/projects/ceph/issues
Online developer summits:
https://wiki.ceph.com/Planning/CDS
Ceph Days
Ceph Day Berlin is next Tuesday!
http://ceph.com/cephdays/ceph-day-berlin/
Axica Convention Center
April 28 2015