Software Defined Storage: What Makes Ceph Unique
Federico Lucifredi
Product Management Director, Ceph Storage
Boston/Guadalajara, December 14th, 2015
2
CLOUD SERVICES
COMPUTE NETWORK STORAGE
the future of storage™
3
[Diagram: HUMAN, ROCK / HUMAN, INK, PAPER / HUMAN, COMPUTER, TAPE]
4
HUMAN COMPUTER TAPE
5
YOU TECHNOLOGY YOUR DATA
6
How Much Store Things All Human History?!
[Chart labels: carving, writing, paper, computers, distributed storage, cloud computing, gaaaaaaaaahhhh!!!!!!]
7
[Diagram: three HUMANs, one COMPUTER, seven DISKs]
8
[Diagram: many HUMANs, one COMPUTER, twelve DISKs]
9
[Diagram: many HUMANs, one GIANT SPENDY COMPUTER, twelve DISKs]
10
[Diagram: three HUMANs and twelve COMPUTER+DISK nodes]
11
[Diagram: three HUMANs and twelve COMPUTER+DISK nodes]
12
[Diagram: eight COMPUTER+DISK nodes labeled “STORAGE APPLIANCE”]
Storage Appliance
Photo: Michael Moll, Wikipedia / CC BY-SA 2.0
13
14
SUPPORT AND MAINTENANCE
PROPRIETARY SOFTWARE
PROPRIETARY HARDWARE
[Diagram: four COMPUTER+DISK nodes]
34% of revenue (5.2 billion dollars)
1.1 billion in R&D spent in a year
1.6 million square feet of manufacturing space
15
THE CLOUD
16
SUPPORT AND MAINTENANCE
PROPRIETARY SOFTWARE
PROPRIETARY HARDWARE
[Diagram: four COMPUTER+DISK nodes]
ENTERPRISE SUBSCRIPTION (optional)
OPEN SOURCE SOFTWARE
STANDARD HARDWARE
[Diagram: four COMPUTER+DISK nodes]
17
18
philosophy: OPEN SOURCE, COMMUNITY-FOCUSED
design: SCALABLE, NO SINGLE POINT OF FAILURE, SOFTWARE BASED, SELF-MANAGING
19
8 years & 20,000 commits later…
20
21
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP  APP  HOST/VM  CLIENT
22
23
[Diagram: five OSDs, each on an FS (btrfs, xfs, ext4) on its own DISK, plus M M M]
24
[Diagram: M M M and a HUMAN]
25
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients

OSDs:
• 10s to 10000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer to perform replication and recovery tasks
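
The "small, odd number" advice for monitors can be illustrated with a little arithmetic. This is a minimal sketch that assumes a plain strict-majority quorum rule; the function names are hypothetical helpers for this example, not part of any Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority among `monitors` voters."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can be lost while a majority can still form."""
    return monitors - quorum_size(monitors)


# Print: monitor count, quorum size, failures tolerated.
for n in range(1, 8):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, going from 3 to 4 monitors raises the quorum size without tolerating any extra failure, which is one way to read the preference for odd counts.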
26
27
[Diagram: APP with LIBRADOS, connected by a socket to M M M]
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
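
As a rough illustration of direct access from an application, here is a minimal sketch using the Python `rados` binding. The pool name `data`, the object name `hello`, and the configuration path are assumptions for this example; `roundtrip` and `main` are hypothetical helpers, and `main` only works on a machine that can reach a running cluster.

```python
def roundtrip(ioctx, name: str, payload: bytes) -> bytes:
    """Write one object through an I/O context, then read it back."""
    ioctx.write_full(name, payload)         # replace the whole object
    return ioctx.read(name, len(payload))   # read the same number of bytes


def main(pool: str = "data", conffile: str = "/etc/ceph/ceph.conf") -> None:
    # Imported here so roundtrip() can be used without the binding installed.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)    # assumed pool name
        try:
            print(roundtrip(ioctx, "hello", b"hello rados"))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Calling `main()` on a cluster node would perform the write and read over the native protocol, with no HTTP layer in between.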
29
30
[Diagram: APP speaks REST to RADOSGW; RADOSGW with LIBRADOS connects by a socket to M M M]
31
RADOS Gateway:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
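
Because the gateway is S3-compatible, an existing S3 client library can in principle be pointed at it. The sketch below assumes the `boto3` library and a gateway endpoint URL and credentials that you supply; `put_and_get` and `make_client` are hypothetical helper names, not part of Ceph.

```python
def put_and_get(s3, bucket: str, key: str, body: bytes) -> bytes:
    """Create a bucket, store one object, and fetch it back."""
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


def make_client(endpoint_url: str, access_key: str, secret_key: str):
    """Build an S3 client aimed at a gateway endpoint (all values assumed)."""
    import boto3  # imported lazily so put_and_get() works with any S3-like client

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

The application code in `put_and_get` does not mention Ceph at all; only the endpoint passed to `make_client` changes.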
32
33
[Diagram: a VM in a VIRTUALIZATION CONTAINER with LIBRBD and LIBRADOS, connected to M M M]
34
[Diagram: two CONTAINERs, each with LIBRBD and LIBRADOS, one VM, and M M M]
35
[Diagram: a HOST with KRBD (KERNEL MODULE), connected to M M M]
36
RADOS Block Device:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux Kernel (2.6.39+)
  • QEMU/KVM
  • OpenStack, CloudStack
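
To make "striped across the cluster" concrete, here is a minimal sketch of how a byte offset in a disk image could map onto fixed-size objects. It assumes a 4 MiB object size and simple one-object-at-a-time striping; the function names are hypothetical and this is not the RBD implementation.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def locate(offset: int, object_size: int = OBJECT_SIZE):
    """Return (object index, offset inside that object) for an image offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int, object_size: int = OBJECT_SIZE):
    """Indices of every object that an I/O of `length` bytes overlaps."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))
```

Since each object can be placed independently, a large image ends up spread over many storage nodes rather than sitting on a single disk.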
37
38
[Diagram: CLIENT with separate data (01100110) and metadata paths to M M M]
39
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem

What Makes Ceph Unique?
Part one: CRUSH
40
41
[Diagram: APP, a question mark, and twelve disk+computer nodes (D C)]
How Long Did It Take You To Find Your Keys This Morning?
Photo: azmeen, Flickr / CC BY 2.0
42
43
[Diagram: APP and twelve disk+computer nodes (D C)]
Dear Diary: Today I Put My Keys on the Kitchen Counter
Photo: Barnaby, Flickr / CC BY 2.0
44
45
[Diagram: APP, object F*, and twelve disk+computer nodes (D C) divided into ranges A-G, H-N, O-T, U-Z]
I Always Put My Keys on the Hook By the Door
Photo: vitamindave, Flickr / CC BY 2.0
46
HOW DO YOU FIND YOUR KEYS
WHEN YOUR HOUSE IS
INFINITELY BIG AND
ALWAYS CHANGING?
47
The Answer: CRUSH!!!!!
Photo: pasukaru76, Flickr / CC SA 2.0
48
49
[Diagram: objects shown as groups of bits, sorted into placement groups]
hash(object name) % num pg
CRUSH(pg, cluster state, rule set)
50
51
CRUSH
• Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
  • Limited data migration on change
• Rule-based configuration
  • Infrastructure topology aware
  • Adjustable replication
  • Weighting
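
The properties listed above (calculated rather than looked up, deterministic, limited migration on change) can be demonstrated with a much simpler stand-in. The sketch below uses rendezvous (highest-random-weight) hashing, which is not CRUSH and has no rules, topology, or weights; the OSD numbers, the count of 256 placement groups, and the function names are assumptions for illustration only.

```python
import hashlib


def score(pg: int, osd: int) -> int:
    """Deterministic pseudo-random score for a (placement group, OSD) pair."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg: int, osds, replicas: int = 3):
    """Pick the `replicas` highest-scoring OSDs; no lookup table involved."""
    ranked = sorted(osds, key=lambda osd: score(pg, osd), reverse=True)
    return ranked[:replicas]


osds = list(range(12))
before = {pg: place(pg, osds) for pg in range(256)}

# Remove OSD 7 and recompute every mapping from scratch.
after = {pg: place(pg, [o for o in osds if o != 7]) for pg in range(256)}

moved = [pg for pg in before if before[pg] != after[pg]]
print(f"{len(moved)} of {len(before)} placement groups changed")
```

In this stand-in, only placement groups that had a replica on the removed OSD get a new mapping; every other group computes exactly the same answer as before.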
52
[Diagram: CLIENT and a question mark]
53
OBJECT: NAME: "foo", POOL: "bar"
"bar" = 3
hash("foo") % 256 = 0x23
PLACEMENT GROUP: 3.23
CRUSH: PLACEMENT GROUP 3.23, TARGET OSDs 24 3 12
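
The naming step in this example can be sketched in a few lines. The code follows the slide's arithmetic (pool id, then the object hash modulo the number of placement groups, written in hex). The real object hash is not given here, so `stand_in_hash` uses CRC32 purely as a placeholder and will not reproduce 0x23 for "foo"; both function names are hypothetical.

```python
import zlib


def placement_group(pool_id: int, object_hash: int, pg_num: int) -> str:
    """Format a placement group id as <pool id>.<hash modulo pg_num, in hex>."""
    return f"{pool_id}.{object_hash % pg_num:x}"


def stand_in_hash(name: str) -> int:
    """Placeholder hash (CRC32), an assumption standing in for the real one."""
    return zlib.crc32(name.encode())


# The slide's values: pool "bar" has id 3 and the hash reduces to 0x23.
print(placement_group(3, 0x23, 256))  # prints 3.23

# Same formula with the placeholder hash; the result will differ.
print(placement_group(3, stand_in_hash("foo"), 256))
```

Any client that knows the pool id and the number of placement groups can compute this name on its own, which is what allows placement without a central lookup.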
54
55
56
[Diagram: CLIENT and a question mark]
What Makes Ceph Unique?
Part two: thin provisioning
57
58
[Diagram: a VM in a VIRTUALIZATION CONTAINER with LIBRBD and LIBRADOS, connected to M M M]
HOW DO YOU SPIN UP
THOUSANDS OF VMs
INSTANTLY AND EFFICIENTLY?
59
144
60
[Diagram: instant copy. Labels: 0 0 0 0, 4, 144, total = 144]
61
[Diagram: CLIENT issues four writes. Labels: 4, 144, total = 148]
62
[Diagram: CLIENT issues three reads. Total = 148]
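
One reading of the 144 and 148 figures is a copy-on-write clone: the copy stores nothing at first, and only blocks written afterwards consume new space. The sketch below models that idea with a hypothetical `Image` class; block counts stand in for whatever unit the slides use, and real clones involve details (such as snapshots of the parent) that are left out here.

```python
class Image:
    """Toy block image that can be cloned without copying any data."""

    def __init__(self, blocks=None, parent=None):
        self.blocks = dict(blocks or {})  # only blocks stored by this image
        self.parent = parent              # where unmodified blocks are read from

    def clone(self):
        """Instant copy: an empty child that points at this image."""
        return Image(parent=self)

    def write(self, index, data):
        """Writes land in the clone only; the parent is left untouched."""
        self.blocks[index] = data

    def read(self, index):
        """Read our own block if present, otherwise fall through to the parent."""
        if index in self.blocks:
            return self.blocks[index]
        if self.parent is not None:
            return self.parent.read(index)
        return None

    def stored(self):
        return len(self.blocks)


base = Image({i: f"base-{i}" for i in range(144)})
copy = base.clone()
total_after_clone = base.stored() + copy.stored()   # 144: the copy is free

for i in range(4):
    copy.write(i, f"new-{i}")
total_after_writes = base.stored() + copy.stored()  # 148: four new blocks

print(total_after_clone, total_after_writes)
```

Reads of untouched blocks are served from the parent, so many clones of one image can share almost all of their storage.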
What Makes Ceph Unique?
Part three: clustered metadata
63
POSIX Filesystem Metadata
Photo: Barnaby, Flickr / CC BY 2.0
64
65
[Diagram: CLIENT, data (01100110), and M M M]
66
[Diagram: M M M]
67
one tree
three metadata servers
?
68
69
70
71
72
DYNAMIC SUBTREE PARTITIONING
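
The idea of splitting one tree among three metadata servers can be sketched as a map from subtree roots to server numbers, where the deepest delegated ancestor of a path decides who is responsible. This is only an illustration under that assumption: `SubtreeMap`, the server numbers 0 to 2, and the example paths are hypothetical, and the load measurement that would trigger re-delegation is not modeled.

```python
import posixpath


class SubtreeMap:
    """Maps directory subtrees to the metadata server responsible for them."""

    def __init__(self, root_rank: int = 0):
        self.authority = {"/": root_rank}

    def delegate(self, subtree: str, rank: int) -> None:
        """Hand a whole subtree to another server (the 'dynamic' step)."""
        self.authority[posixpath.normpath(subtree)] = rank

    def rank_for(self, path: str) -> int:
        """Walk up from `path` until a delegated ancestor is found."""
        if not path.startswith("/"):
            raise ValueError("expected an absolute path")
        path = posixpath.normpath(path)
        while path not in self.authority:
            parent = posixpath.dirname(path)
            if parent == path:  # reached the top without a match
                return self.authority["/"]
            path = parent
        return self.authority[path]


tree = SubtreeMap()                # server 0 starts with the whole tree
tree.delegate("/home", 1)          # a busy subtree moves to server 1
tree.delegate("/home/alice", 2)    # a hot directory inside it moves to server 2

print(tree.rank_for("/etc/hosts"))
print(tree.rank_for("/home/bob/todo.txt"))
print(tree.rank_for("/home/alice/notes.txt"))
```

Re-running `delegate` with a different server number is all it takes to move a subtree in this model, which is the sense in which the partitioning is dynamic.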
Getting Started With Ceph
Read about the latest version of Ceph.
• The latest stuff is always at http://ceph.com/get
Deploy a test cluster using ceph-deploy.
• Read the quick-start guide at http://ceph.com/qsg
Deploy a test cluster on the AWS free-tier using Juju.
• Read the guide at http://ceph.com/juju
Read the rest of the docs!
• Find docs for the latest release at http://ceph.com/docs
73
Have a working cluster up quickly.
Getting Involved With Ceph
Most project discussion happens on the mailing list.
• Join or view archives at http://ceph.com/list
IRC is a great place to get help (or help others!)
• Find details and historical logs at http://ceph.com/irc
The tracker manages our bugs and feature requests.
• Register and start looking around at http://ceph.com/tracker
Doc updates and suggestions are always welcome.
• Learn how to contribute docs at http://ceph.com/docwriting
74
Help build the best storage system around!
Ceph Hammer (v0.94.x)
1. RADOS performance enhancements: all-flash environments
2. Simplified RGW deployment
3. RGW Object Versioning and Bucket Sharding
4. RBD Mandatory Locking, Object Maps, Copy on Read
5. CephFS Snapshot improvements
and many more. See https://ceph.com/releases/v0-94-hammer-released/
75
Best Ceph ever.