Ceph Intro & Architectural Overview Ross Turk VP Community, Inktank
Jan 15, 2015
2
ME ME ME ME ME ME. I made a slide today. It’s all about me.
Ross Turk, VP Community, Inktank
[email protected] | @rossturk
inktank.com | ceph.com
3
CLOUD SERVICES
COMPUTE NETWORK STORAGE
the future of storage™
4
HUMAN ROCK
HUMAN INK PAPER
HUMAN COMPUTER TAPE
5
HUMAN
COMPUTER TAPE
6
YOU TECHNOLOGY YOUR DATA
7
How Much Store Things All Human History?!
writing
paper
computers
distributed storage
cloud computing
gaaaaaaaaahhhh!!!!!!
carving
8
[Diagram: three HUMANs, one COMPUTER, seven DISKs]
9
[Diagram: many HUMANs, one COMPUTER, twelve DISKs]
10
[Diagram: many HUMANs, one GIANT SPENDY COMPUTER, twelve DISKs]
11
[Diagram: three HUMANs in front of a cluster of COMPUTER + DISK nodes]
12
[Diagram: three HUMANs in front of a cluster of COMPUTER + DISK nodes]
13
[Diagram: several COMPUTER + DISK nodes grouped together]
“STORAGE APPLIANCE”
14
Storage Appliance
Michael Moll, Wikipedia / CC BY-SA 2.0
15
SUPPORT AND MAINTENANCE
PROPRIETARY SOFTWARE
PROPRIETARY HARDWARE
[Diagram: COMPUTER + DISK nodes]
34% of 2012 revenue (5.2 billion dollars)
1.1 billion in R&D spent in 2012
1.6 million square feet of manufacturing space
16
1010100110
1010110011
1001100101
1001101011
1001100111
1001010011
THE CLOUD
17
SUPPORT AND MAINTENANCE
PROPRIETARY SOFTWARE
PROPRIETARY HARDWARE
[Diagram: COMPUTER + DISK nodes]
STANDARD HARDWARE
[Diagram: COMPUTER + DISK nodes]
OPEN SOURCE SOFTWARE
ENTERPRISE SUBSCRIPTION (optional)
18
19
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
SELF-MANAGING
philosophy design
20
8 years & 20,000 commits later…
21
22
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
23
24
[Diagram: five OSDs, each on its own file system (btrfs, xfs, or ext4) on its own DISK, plus three monitors (M)]
25
M
M
M
HUMAN
26
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients

OSDs:
• 10s to 10000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer to perform replication and recovery tasks
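
The "small, odd number" advice for monitors follows from majority voting: decisions need more than half of the monitors to agree, so an even count adds a machine without adding failure tolerance. A minimal sketch of that arithmetic, assuming a simple strict-majority quorum (the function names are illustrative and not part of Ceph):

```python
def quorum_size(monitors: int) -> int:
    """Smallest group that is a strict majority of the monitors."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority can still form."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Columns: monitor count, quorum size, tolerated failures.
    for n in range(1, 8):
        print(n, quorum_size(n), tolerated_failures(n))
```

Three monitors and four monitors both survive the loss of one; it takes five to survive the loss of two.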
27
28
[Diagram: APP linked with LIBRADOS, talking over a socket to the cluster and its three monitors (M)]

LIBRADOS:
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
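
As an illustration of direct access from an application, here is a minimal sketch using the Python `rados` binding. The config path `/etc/ceph/ceph.conf`, the pool name `data`, and the helper name `put_and_get` are assumptions for this example; it needs a reachable cluster and the binding installed to run `main()`.

```python
def put_and_get(ioctx, name: str, payload: bytes) -> bytes:
    """Write one object and read it back through an I/O context."""
    ioctx.write_full(name, payload)        # replace the whole object
    return ioctx.read(name, len(payload))  # read it back from offset 0


def main() -> None:
    # Imported lazily so the helper above can be exercised without a cluster.
    import rados

    # Assumed config path and pool name; adjust both for a real cluster.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data")
        try:
            print(put_and_get(ioctx, "hello", b"world"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Calling `main()` writes and reads a single object named `hello`; there is no gateway or HTTP layer between the application and the storage nodes.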
30
31
M
M
M
LIBRADOS
RADOSGW
APP
socket
REST
32
RADOS Gateway:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
33
34
M
M
M
VM
LIBRADOS
LIBRBD
VIRTUALIZATION CONTAINER
35
LIBRADOS
M
M
M
LIBRBD
CONTAINER
LIBRADOS
LIBRBD
CONTAINER
VM
36
LIBRADOS
M
M
M
KRBD (KERNEL MODULE)
HOST
37
RADOS Block Device:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux Kernel (2.6.39+)
  • Qemu/KVM, native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
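
To make "images are striped across the cluster" concrete, the sketch below cuts an image into fixed-size objects and maps a byte range onto them. It assumes a 4 MiB object size and plain one-object-at-a-time layout, and the function names are hypothetical; it is not RBD's actual code.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def locate_byte(offset: int, object_size: int = OBJECT_SIZE):
    """Return (object index, offset inside that object) for an image offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_for_range(offset: int, length: int, object_size: int = OBJECT_SIZE):
    """List (object index, offset in object, byte count) for a read or write."""
    extents = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, object_size)
        # Stop at the object boundary or at the end of the request.
        count = min(object_size - within, end - offset)
        extents.append((index, within, count))
        offset += count
    return extents
```

Because each object is placed independently, a large sequential read or write touches many objects and therefore many storage nodes.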
38
39
M
M
M
CLIENT
0110
data metadata
40
Metadata Server:
• Manages metadata for a POSIX-compliant shared filesystem
• Directory hierarchy
• File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
41
What Makes Ceph Unique?
Part one: CRUSH
42
APP ??
[Diagram: twelve storage nodes, each labeled DC]
43
How Long Did It Take You To Find Your Keys This Morning?
azmeen, Flickr / CC BY 2.0
44
APP
[Diagram: twelve storage nodes, each labeled DC]
45
Dear Diary: Today I Put My Keys on the Kitchen Counter
Barnaby, Flickr / CC BY 2.0
46
APP
[Diagram: twelve storage nodes, each labeled DC, grouped by name range: A-G, H-N, O-T, U-Z; the APP is placing an object named F*]
47
I Always Put My Keys on the Hook By the Door
vitamindave, Flickr / CC BY 2.0
48
HOW DO YOU FIND YOUR KEYS
WHEN YOUR HOUSE IS
INFINITELY BIG AND
ALWAYS CHANGING?
49
The Answer: CRUSH!!!!!
pasukaru76, Flickr / CC SA 2.0
50
10 10 01 01 10 10 01 11 01 10
10 10 01 01 10 10 01 11 01 10
hash(object name) % num pg
CRUSH(pg, cluster state, rule set)
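
The two steps on this slide can be sketched in a few lines. This is not the CRUSH algorithm: the second step below uses rendezvous (highest-random-weight) hashing as a stand-in, SHA-256 stands in for Ceph's own hash, and there are no rules or topology. All function names are hypothetical. What it does show is the key property: any client can compute the location from the object name and the cluster membership, with no lookup table.

```python
import hashlib


def stable_hash(*parts) -> int:
    """Deterministic 64-bit hash of the given parts (stand-in hash)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def object_to_pg(object_name: str, num_pgs: int) -> int:
    """Step 1: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pgs


def pg_to_osds(pg: int, osds: list, replicas: int) -> list:
    """Step 2 (stand-in for CRUSH): rank OSDs by a per-PG pseudo-random score."""
    ranked = sorted(osds, key=lambda osd: stable_hash(pg, osd), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, num_pgs: int, osds: list, replicas: int = 3):
    """Return (placement group, ordered list of OSDs) for an object."""
    pg = object_to_pg(object_name, num_pgs)
    return pg, pg_to_osds(pg, osds, replicas)
```

For example, `locate("my-object", 128, [f"osd.{i}" for i in range(12)])` returns the same placement group and the same three OSDs every time it is called, on any machine.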
51
10 10 01 01 10 10 01 11 01 10
10 10 01 01 10 10 01 11 01 10
52
CRUSH:
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
• Limited data migration on change
• Rule-based configuration
• Infrastructure topology aware
• Adjustable replication
• Weighting
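
"Stable mapping" and "limited data migration on change" can be demonstrated with a toy placement function. The sketch below again uses rendezvous hashing as a stand-in for CRUSH (an assumption, with hypothetical names, no weights and no topology) and counts how many placement groups change owner when one OSD is added to ten.

```python
import hashlib


def score(pg: int, osd: str) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg}/{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def primary_for(pg: int, osds: list) -> str:
    """The OSD with the highest score owns the PG."""
    return max(osds, key=lambda osd: score(pg, osd))


def moved_pgs(num_pgs: int, before: list, after: list) -> dict:
    """Return {pg: (old owner, new owner)} for PGs whose owner changed."""
    moves = {}
    for pg in range(num_pgs):
        old, new = primary_for(pg, before), primary_for(pg, after)
        if old != new:
            moves[pg] = (old, new)
    return moves


if __name__ == "__main__":
    before = [f"osd.{i}" for i in range(10)]
    after = before + ["osd.10"]
    moves = moved_pgs(1024, before, after)
    print(f"{len(moves)} of 1024 PGs moved, all to the new OSD")
```

Only the placement groups that the new OSD wins are moved, roughly one in eleven here; everything else stays where it was. A naive `hash % number_of_osds` scheme would reshuffle most of the data instead.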
53
CLIENT
??
54
55
56
57
CLIENT
??
58
What Makes Ceph Unique?
Part two: thin provisioning
59
LIBRADOS
M
M
M
VM
LIBRBD
VIRTUALIZATION CONTAINER
60
HOW DO YOU SPIN UP
THOUSANDS OF VMs INSTANTLY
AND EFFICIENTLY?
61
144 0 0 0 0
instant copy
= 144
62
4 144
CLIENT
write
write
write
= 148
write
63
4 144
CLIENT
read
read
read
= 148
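
The arithmetic on these three slides (an instant copy that stores nothing new, then 144 + 4 = 148 after a few writes) is what copy-on-write cloning gives you. A minimal sketch, reading the slide's 144 and 4 as counts of stored blocks (an assumption); the `Image` class is hypothetical and skips details of RBD itself, where clones are made from a snapshot of the parent image.

```python
class Image:
    """A block image that stores only the blocks written to it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}

    def clone(self):
        """Instant copy: the child starts empty and points at its parent."""
        return Image(parent=self)

    def write(self, index: int, data: bytes) -> None:
        """Writes land in this image only; the parent is never modified."""
        self.blocks[index] = data

    def read(self, index: int) -> bytes:
        """Reads fall through to the parent for blocks not written here."""
        image = self
        while image is not None:
            if index in image.blocks:
                return image.blocks[index]
            image = image.parent
        return b""  # never written anywhere: reads as empty


def stored_blocks(*images) -> int:
    """Total number of blocks actually stored across the given images."""
    return sum(len(image.blocks) for image in images)
```

With a 144-block golden image, a fresh clone costs nothing; after four writes to the clone the total is 148, and reads from the clone return the new data for those four blocks and the parent's data for the rest.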
64
What Makes Ceph Unique?
Part three: clustered metadata
65
POSIX Filesystem Metadata
Barnaby, Flickr / CC BY 2.0
66
M
M
M
CLIENT
0110
67
M
M
M
68
one tree
three metadata servers
??
69
70
71
72
73
DYNAMIC SUBTREE PARTITIONING
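
The idea behind the phrase, one directory tree shared by several metadata servers with subtrees handed over as load shifts, can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ceph's balancer: the authority table, the per-subtree load numbers, and both function names are hypothetical, and each load figure counts only the traffic handled at that subtree root itself, not at subtrees delegated below it.

```python
def authority(path: str, subtree_owner: dict) -> int:
    """Return the server owning `path`: the longest matching subtree root wins.

    `subtree_owner` must contain an entry for "/".
    """
    best = "/"
    for root in subtree_owner:
        prefix = root.rstrip("/") + "/"
        if (path == root or path.startswith(prefix)) and len(root) > len(best):
            best = root
    return subtree_owner[best]


def rebalance_once(subtree_owner: dict, load: dict, servers: list):
    """Move the hottest subtree of the busiest server to the idlest server.

    Returns the subtree root that moved, or None if nothing moved.
    """
    totals = {server: 0 for server in servers}
    for root, server in subtree_owner.items():
        totals[server] += load.get(root, 0)
    busiest = max(servers, key=totals.get)
    idlest = min(servers, key=totals.get)
    candidates = [r for r, s in subtree_owner.items() if s == busiest and r != "/"]
    if busiest == idlest or not candidates:
        return None
    hottest = max(candidates, key=lambda r: load.get(r, 0))
    subtree_owner[hottest] = idlest
    return hottest
```

Starting from `{"/": 0, "/home": 0, "/home/alice": 0, "/var": 1}` with most of the load on `/home/alice`, one call to `rebalance_once` with servers `[0, 1, 2]` hands `/home/alice` to the idle server 2, while `/home/bob` stays with server 0.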
74
Getting Started With Ceph
Read about the latest version of Ceph.
• The latest stuff is always at http://ceph.com/get
Deploy a test cluster using ceph-deploy.
• Read the quick-start guide at http://ceph.com/qsg
Deploy a test cluster on the AWS free-tier using Juju.
• Read the guide at http://ceph.com/juju
Read the rest of the docs!
• Find docs for the latest release at http://ceph.com/docs
Have a working cluster up quickly.
75
Getting Involved With Ceph
Most project discussion happens on the mailing list.
• Join or view archives at http://ceph.com/list
IRC is a great place to get help (or help others!)
• Find details and historical logs at http://ceph.com/irc
The tracker manages our bugs and feature requests.
• Register and start looking around at http://ceph.com/tracker
Doc updates and suggestions are always welcome.
• Learn how to contribute docs at http://ceph.com/docwriting
Help build the best storage system around!
76
Ceph Cuttlefish (v0.61.x)
1. New ceph-deploy provisioning tool
2. New Chef cookbooks
3. Fully-tested packages for RHEL (in EPEL)
4. RGW authentication management API
5. RADOS pool quotas
6. New ceph df
7. RBD incremental snapshots
Best Ceph ever.