Inktank Delivering the Future of Storage
Nov 28, 2014
ME ME ME ME ME ME.
2
I made a slide today. It’s all about me.
Ross Turk VP Community, Inktank [email protected] @rossturk inktank.com | ceph.com
4
5
RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
Let’s Start With a Good, Old-Fashioned Origin Story JD Hancock, Flickr / CC BY 2.0 6
The Evolution of Storage A brief history of information storage technology
7
Cave Paintings: The Earliest Form (maybe) of Information Storage Chico.Ferreira, Flickr / CC BY 2.0 8
Technology Review: Cave Painting
The good
• Low cost per smudge
• Multitouch
The bad
• Limited storage capacity
• 10 caveman ideas per wall
• No support for CIFS
9
10
x1000
== x1
= + HUMAN WRITING
Technology Review: Books and Libraries
The good
• Cost per scroll is high
• Can be eased w/slave labor
The bad
• No automatic replication
• Must complete backups before Caesar’s invasion of Egypt!
11
Books (Strahov, Prague Library) Moyan_Brenn, Flickr / CC BY-ND 2.0 12
Printing Press FateDenied, Flickr / CC BY 2.0 13
14
x1000
== x1
+ magnet = tape magnetic tape
IBM System 360 Tape Drives Erik Pitti, Wikipedia / CC BY-ND 2.0
15
16
[Diagram: HUMAN and ROCK; HUMAN, INK, and PAPER; HUMAN, COMPUTER, and TAPE]
17
11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011
==
Tape Is Stupid Mrs. Gemstone, Flickr / CC BY-SA 2.0 18
Computers Need Programmers (and Operators) USDAgov, Flickr / CC BY 2.0 19
20
HUMAN COMPUTER TAPE
Throughput Becomes Important rfduck, Flickr / CC BY-ND 2.0 21
Hard Drive Jeff Kubina, Flickr / CC-BY-SA 2.0 22
Hard Drives Are Totally Better
23
amazing spinny hard drives sucky stupid tape
24
[Diagram: a directory tree (aa, ab, ac, ba, bb, bc, da, db, dc) with blocks of binary data at its leaves]
25
file
owner: rturk
created: aug12
last viewed: aug17
size: 42025
perms: 644
data: 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010
26
[Diagram: the directory tree again, with a file (01 10) placed in it]
Humanity Outgrows the Hard Drive Mr. T in DC, Flickr / CC BY 2.0 27
28
[Diagram: one HUMAN, one COMPUTER, seven DISKs]
What Happens When Two HUMANs Need Access to the Same Resource? wFourier, Flickr / CC BY 2.0 29
30
[Diagram: three HUMANs, one COMPUTER, seven DISKs]
31
[Diagram: (actually more like this…) a crowd of HUMANs, one (COMPUTER), twelve DISKs]
32
[Diagram: three HUMANs and twelve DISK/COMPUTER pairs]
33
object
texture: crunch
flavor: smoke, salt
nutrition: none
color: bacon
data: 11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010
34
[Diagram: the directory tree again, marked with an X]
35
[Diagram: an APP and twelve DISK/COMPUTER pairs]
36
[Diagram: the cluster of DISK/COMPUTER pairs, with two of the pairs drawn apart from the rest]
37
[Diagram: the cluster of DISK/COMPUTER pairs and three VMs]
The Current State of Storage How people store information today, and why it’s still not perfect yet
38
How Much Store Things All Human History!! Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is. 39
[Chart labels: Painting, Writing, Computers, Shared storage, Distributed storage, Cloud computing, Ceph]
40
[Diagram: three HUMANs and twelve DISK/COMPUTER pairs]
41
[Diagram: twelve DISK/COMPUTER pairs]
42
[Diagram: the same twelve pairs, abbreviated to D C]
43
[Diagram: three HUMANs and twelve D C pairs]
Storage Hardware Michael Moll, Wikipedia / CC BY-SA 2.0 44
6.4 Million Square Feet of Expensive Factory Buildings Dude94111, Flickr / CC BY 2.0 45
Storage Hardware Vendors Have Bills to Pay CarbonNYC, Flickr / CC BY 2.0 46
…Which Means That Customers Do Too 401K 2012, Flickr / CC BY-SA 2.0 47
Technology Is Becoming a Commodity RaeAllen, Flickr / CC-BY 2.0 48
Commodity Prices Fluctuate
49
[Chart: x-axis from May-07 to May-12]
Growing With Hardware Appliances
First PB
• Proprietary storage hardware
• Well-known storage vendor
$14 b’zillion
Second PB
• Proprietary storage hardware
• Same storage vendor
Another $14 b’zillion
50
[Diagram: twenty-four DC pairs]
Dedicated Hardware Appliances Are OLD TECHNOLOGY Paul Keller, Flickr / CC BY 2.0 51
52 Source: http://www.cpubenchmark.net/high_end_cpus.html
53
FLAGSHIP PRODUCT
“I'm sick of paying for hardware with a three-year-old proc in it!” Mel B., Flickr / CC BY 2.0 54
Hardware Appliances are Mysterious Black Boxes Abode of Chaos, Flickr / CC BY 2.0 55
56
[Diagram: a cluster of DC pairs, one of them split into its D and C, labeled C++]
57
[Diagram: a cluster of DC pairs, one of them split into its D and C, labeled C++ and marked X]
58
[Diagram: twelve DC pairs and a HUMAN [DEVELOPER]: !!]
Give More Money To The Big Proprietary Vendors It will make them very, very happy. 59
Storage Should Be Better
People need storage solutions that…
• …are open
• …are easy to manage
• …satisfy their requirements
  • performance
  • functional
  • financial
60
The Birth of a New Storage Solution We think our roots are showing
61
DreamHost 62
Sage Weil Co-founder of DreamHost Inventor of Ceph CEO of Inktank
63
DreamHost DreamHost is staffed by extraordinarily hip people 64
65
+
New Monthly Code Commits
66
[Chart: new monthly code commits, y-axis 0 to 700, x-axis 2004-06 to 2011-07]
Ceph Starts Popping Up
67
68
OPEN SOURCE
philosophy design
Open Source is the Best Way to Spread Ideas orchidgalore, Flickr / CC BY 2.0 69
70
OPEN SOURCE
COMMUNITY-FOCUSED
philosophy design
All of Us Are Smarter Than Some of Us rturk, LinkedIn InMap 71
72
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
philosophy design
Ceph is Built to Scale Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is. 73
[Chart labels: Too much for a cave, Too much for a book, Too much for a drive, Too much for a computer, Too much for a room, Ceph]
74
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
philosophy design
Ariolimax Californicus aroid, Flickr / CC BY 2.0 75
The Octopus (A Metaphor) I love speaking in metaphors. 76
single point of failure
replicated replicated
The Beehive (A Better Metaphor) blumenbiene, Flickr / CC BY 2.0 77
78
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
philosophy design
79
[Diagram: a cluster of DC pairs, one of them split into its D and C, labeled C++]
80
[Diagram: a cluster of DC pairs, one of them split into its D and C, labeled C++ and marked ✔]
81
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
SELF-MANAGING
philosophy design
Hard Drives Are Tiny Record Players and They Fail Often jon_a_ross, Flickr / CC BY 2.0 82
83
[Diagram: disks (D) x 1 MILLION = 55 times / day]
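The "55 times / day" figure for a million disks can be reproduced with back-of-the-envelope arithmetic. The sketch below is only that: it assumes a 2% annualized failure rate per drive, a number the slide does not state, and it assumes failures are spread evenly across the year.

```python
# Back-of-the-envelope: expected drive failures per day in a large fleet.
# ASSUMPTION: a 2% annualized failure rate (AFR); the slide does not give one.


def failures_per_day(num_drives: int, annual_failure_rate: float) -> float:
    """Expected failures per day if failures are spread evenly over a year."""
    return num_drives * annual_failure_rate / 365


if __name__ == "__main__":
    # One million drives at the assumed 2% AFR.
    print(round(failures_per_day(1_000_000, 0.02)))  # prints 55
```

With a different assumed failure rate the daily count changes proportionally, but the point of the slide survives: at this scale, failure is routine rather than exceptional.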
Enter: Ceph An architectural and functional overview of the Ceph system
84
85
RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
86
87
[Diagram: five DISKs, each with a filesystem (FS: btrfs, xfs, or ext4) and an OSD on top, plus three monitors (M)]
88
[Diagram: three monitors (M) and a HUMAN]
89
Monitors:
• Maintain cluster map
• Provide consensus for distributed decision-making
• Must have an odd number
• Do not serve stored objects to clients

OSDs:
• One per disk (recommended)
• At least three in a cluster
• Serve stored objects to clients
• Intelligently peer to perform replication tasks
• Support object classes
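The odd-number rule for monitors is easier to see with a little arithmetic. The sketch below assumes decisions need a strict majority of monitors; the slide only says "consensus", so treat the majority rule and both function names as illustrative assumptions.

```python
# Majority-quorum arithmetic for a set of monitors (illustrative only).
# ASSUMPTION: a decision needs a strict majority of monitors to agree.


def quorum_size(num_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can be lost while a majority still remains."""
    return num_monitors - quorum_size(num_monitors)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under that assumption, four monitors tolerate exactly as many failures as three do, so the fourth adds cost without adding resilience.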
90
91
[Diagram: an APP linked with LIBRADOS (L) talks over a socket to the cluster and its three monitors (M)]
92
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
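As a rough idea of what "direct access" looks like from Python, here is a minimal sketch using the `rados` binding. It assumes the binding is installed, that a cluster is reachable through the given config file, and that the named pool already exists; the function name and all four arguments are hypothetical.

```python
# Sketch: store and fetch one object through the librados Python binding.
# ASSUMPTIONS: the `rados` module is installed, conf_path points at a working
# cluster config, and pool_name already exists. All names are hypothetical.


def store_and_fetch(conf_path: str, pool_name: str, key: str, data: bytes) -> bytes:
    import rados  # imported lazily so this file loads without Ceph installed

    cluster = rados.Rados(conffile=conf_path)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool_name)
        try:
            ioctx.write_full(key, data)  # replace the whole object
            return ioctx.read(key)       # read back (up to the default length)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call might look like `store_and_fetch("/etc/ceph/ceph.conf", "data", "greeting", b"hello")`, where the path and pool name are placeholders for whatever your cluster uses.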
93
94
[Diagram: an APP speaks REST to RADOSGW, which uses LIBRADOS to reach the cluster and its three monitors (M) over a socket]
95
RADOS Gateway:
• REST-based interface to RADOS
• Supports buckets, accounting
• Compatible with S3 and Swift applications
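S3 compatibility means an ordinary S3 client can, in principle, be pointed at the gateway instead of at Amazon. The sketch below uses `boto3` as one such client; the endpoint URL, credentials, bucket, and key are all hypothetical inputs, and the region is a placeholder chosen only so the client can be constructed.

```python
# Sketch: use a stock S3 client against an S3-compatible gateway.
# ASSUMPTIONS: boto3 is installed; endpoint_url and the credentials belong to
# a gateway you run. The region value is a placeholder, not a real setting.


def put_and_get(endpoint_url: str, access_key: str, secret_key: str,
                bucket: str, key: str, body: bytes) -> bytes:
    import boto3  # imported lazily so this file loads without boto3 installed

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="us-east-1",
    )
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

Nothing in the function is specific to the gateway apart from the endpoint URL, which is the practical meaning of "compatible".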
96
97
[Diagram: a VM inside a VIRTUALIZATION CONTAINER, which uses LIBRBD and LIBRADOS to reach the cluster and its three monitors (M)]
98
[Diagram: two CONTAINERs, each with LIBRBD and LIBRADOS, one VM, and the cluster with its three monitors (M)]
99
[Diagram: a HOST using KRBD (KERNEL MODULE) to reach the cluster and its three monitors (M)]
100
RADOS Block Device:
• Storage of virtual disks in RADOS
• Allows decoupling of VMs and containers
• Live migration!
• Images are striped across the cluster
• Boot support in QEMU, KVM, and OpenStack Nova
• Mount support in the Linux kernel
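"Striped across the cluster" can be pictured as cutting the virtual disk into fixed-size objects, each of which may land on a different node. The sketch below assumes the simplest layout, 4 MiB objects placed one after another; the slide does not give an object size or layout, so both are assumptions, as are the function names.

```python
# Sketch: which backing object holds a given byte of a striped block image?
# ASSUMPTION: fixed-size 4 MiB objects laid out back to back. The slide only
# says images are striped, not how.

OBJECT_SIZE = 4 * 1024 * 1024  # bytes per backing object (assumed)


def locate(offset: int, object_size: int = OBJECT_SIZE) -> tuple[int, int]:
    """Map an image byte offset to (object index, offset inside that object)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int, object_size: int = OBJECT_SIZE) -> range:
    """Indices of every backing object a read or write of `length` bytes touches."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)
```

A large sequential read touches many objects, and therefore potentially many disks at once, which is where the throughput benefit of striping comes from.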
101
102
[Diagram: a CLIENT exchanging data (01 10) and metadata with the cluster and its three monitors (M)]
103
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
What Makes Ceph Unique? Part one: CRUSH
104
105
[Diagram: an APP (??) facing twelve D C pairs]
106
[Diagram: the APP and the twelve D C pairs]
How Long Did It Take You To Find Your Keys This Morning? azmeen, Flickr / CC BY 2.0 107
Dear Diary: Today I Put My Keys on the Kitchen Counter Barnaby, Flickr / CC BY 2.0 108
109
[Diagram: the APP looks for F* among twelve D C pairs divided by name range: A-G, H-N, O-T, U-Z]
I Always Put My Keys on the Hook By the Door vitamindave, Flickr / CC BY 2.0 110
HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?
111
The Answer: CRUSH!!!!! pasukaru76, Flickr / CC SA 2.0 112
113
10 10 01 01 10 10 01 11 01 10
10 10 01 01 10 10 01 11 01 10
hash(object name) % num pg
CRUSH(pg, cluster state, rule set)
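The two lines above describe a two-step lookup: hash the object name into a placement group, then compute which nodes hold that group. The sketch below follows that shape but is not CRUSH: step two uses plain rendezvous hashing as a stand-in, and the node names, replica count, and function names are all made up for illustration.

```python
# Sketch of the two-step lookup. Step 2 is a STAND-IN for CRUSH (rendezvous
# hashing), not the real algorithm; node names and defaults are hypothetical.
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name: str, num_pgs: int) -> int:
    """Step 1, as on the slide: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pgs


def pg_to_nodes(pg: int, nodes: list[str], replicas: int) -> list[str]:
    """Step 2 stand-in: rank nodes by a per-(pg, node) hash, keep the top few."""
    ranked = sorted(nodes, key=lambda node: stable_hash(f"{pg}:{node}"), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, num_pgs: int, nodes: list[str], replicas: int = 3) -> list[str]:
    """Compute where an object lives without asking any lookup service."""
    return pg_to_nodes(object_to_pg(object_name, num_pgs), nodes, replicas)
```

The property worth noticing is that `locate` is pure computation: any client with the same node list gets the same answer, so nobody has to keep a diary of where the keys went.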
114
10 10 01 01 10 10 01 11 01 10
10 10 01 01 10 10 01 11 01 10
115
CRUSH
• Pseudo-random placement algorithm
• Ensures even distribution
• Repeatable, deterministic
• Rule-based configuration
  • Replica count
  • Infrastructure topology
  • Weighting
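To make "pseudo-random, deterministic, weighted" concrete, here is a small sketch that places placement groups on nodes in proportion to a weight. It uses weighted rendezvous hashing, which is a different and much simpler technique than CRUSH and ignores topology entirely; the node names and weights are invented.

```python
# Sketch: deterministic, weight-proportional placement via weighted
# rendezvous hashing. This is NOT CRUSH; names and weights are hypothetical.
import hashlib
import math
from collections import Counter


def unit_hash(text: str) -> float:
    """Deterministic pseudo-random number strictly between 0 and 1."""
    value = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big") >> 12
    return (value + 0.5) / 2**52  # 52 bits keeps the result exactly inside (0, 1)


def pick_node(pg: int, weights: dict[str, float]) -> str:
    """Each node gets a score scaled by its weight; the highest score wins."""
    return max(weights, key=lambda node: -weights[node] / math.log(unit_hash(f"{pg}:{node}")))


if __name__ == "__main__":
    weights = {"node-a": 1.0, "node-b": 1.0, "node-c": 2.0}
    counts = Counter(pick_node(pg, weights) for pg in range(10_000))
    print(counts)  # node-c should receive roughly half of the groups
```

Running it twice gives the same counts, which is the "repeatable" bullet, and doubling a node's weight roughly doubles its share, which is the "weighting" bullet.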
116
CLIENT
??
117
118
119
CLIENT
??
What Makes Ceph Unique Part two: thin provisioning
120
121
[Diagram: a VM in a VIRTUALIZATION CONTAINER, using LIBRBD and LIBRADOS to reach the cluster and its three monitors (M)]
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
122
144
123
0 0 0 0
instant copy
= 144
4 144
124
CLIENT
write
write
write
= 148
write
4 144
125
CLIENT read
read
read
= 148
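One way to read the numbers on these slides (144 after the instant copy, 148 after four writes) is as copy-on-write cloning: the copy starts empty and shares everything with its parent until it is written to. The toy model below is a sketch of that reading, not of how RBD is implemented; the class and function names are invented.

```python
# Toy copy-on-write model of the "instant copy" slides. A sketch of one
# reading of the numbers, not of the real implementation.


class Image:
    """A block image as a dict of object index -> bytes, with an optional parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.objects = {}

    def clone(self):
        """Instant copy: the clone stores nothing and points at its parent."""
        return Image(parent=self)

    def write(self, index, data):
        """Writes land in this image only; the parent is never modified."""
        self.objects[index] = data

    def read(self, index):
        """Reads fall through to the parent for anything not yet written here."""
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return None


def stored_objects(*images):
    """Total number of objects actually stored across the given images."""
    return sum(len(image.objects) for image in images)


if __name__ == "__main__":
    parent = Image()
    for i in range(144):
        parent.write(i, b"base")
    child = parent.clone()
    print(stored_objects(parent, child))  # prints 144
    for i in range(4):
        child.write(i, b"changed")
    print(stored_objects(parent, child))  # prints 148
```

Spinning up a thousand such clones costs a thousand empty dictionaries rather than a thousand full copies, which is the "instantly and efficiently" part.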
What Makes Ceph Unique? Part three: clustered metadata
126
Metadata for a POSIX-Compliant Filesystem Barnaby, Flickr / CC BY 2.0 127
128
[Diagram: a CLIENT (01 10) and the cluster with its three monitors (M)]
129
[Diagram: the cluster with its three monitors (M)]
130
one tree, three metadata servers: ??
131
132
133
134
135
DYNAMIC SUBTREE PARTITIONING
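The slides only name the idea, so what follows is a deliberately naive illustration of splitting one directory tree across three metadata servers by load. The greedy rule, the load numbers, and the function name are all assumptions made for this sketch; the real mechanism is more involved.

```python
# Toy illustration: share subtrees of one directory tree among metadata
# servers by load. ASSUMPTION: a simple greedy rule; loads are made up.
import heapq


def partition_subtrees(subtree_load: dict[str, float], num_servers: int) -> dict[str, int]:
    """Give the hottest subtree to whichever server is currently least loaded."""
    heap = [(0.0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment = {}
    for path, load in sorted(subtree_load.items(), key=lambda item: item[1], reverse=True):
        total, server = heapq.heappop(heap)
        assignment[path] = server
        heapq.heappush(heap, (total + load, server))
    return assignment


if __name__ == "__main__":
    loads = {"/home": 50, "/var": 30, "/usr": 20, "/tmp": 10}
    print(partition_subtrees(loads, 3))
```

The "dynamic" part would correspond to re-running the split as the load figures change, so a subtree that becomes hot can move to a quieter server.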
And Now: Backpedaling
136
137
ALMOST EVERYTHING WORKS
138
RADOS: AWESOME
LIBRADOS: AWESOME
RBD: AWESOME
RADOSGW: AWESOME
CEPH FS: NEARLY AWESOME
139
LAN SCALE!! *
* OR REALLY REALLY SCARY FAST WAN
What is Inktank? I really like your polo shirt, please tell me what it means!
140
141
Who?
• Ceph’s inventor and (most) developers
142
Why?
• To ensure the long-term success of Ceph
• To help companies adopt Ceph through services, support, training, and consulting
143
When?
• Founded: December 28, 2011
• Brand Launched: April 2012
144
What do we want from you??
• Try Ceph! Tell us what you think. Ask if you need help. Help others if you can!
• Are you a company? Consider dedicating dev resources to the project.
145