
End of RAID as we know it with Ceph Replication


Our VP of Engineering Mark Kampe explores the technical and financial advantages of Ceph replication over RAID.
Transcript
Page 1: End of RAID as we know it with Ceph Replication

Inktank

Delivering the Future of Storage

The End of RAID as you know it with Ceph Replication

March 28, 2013

Page 2: End of RAID as we know it with Ceph Replication

Agenda

•  Inktank and Ceph Introduction
•  Ceph Technology
•  Challenges of RAID
•  Ceph Advantages
•  Q&A
•  Resources and Moving Forward

Page 3: End of RAID as we know it with Ceph Replication

Ceph
•  Distributed unified object, block and file storage platform
•  Created by storage experts
•  Open source
•  In the Linux kernel
•  Integrated into cloud platforms

Inktank
•  Company that provides professional services and support for Ceph
•  Founded in 2011
•  Funded by DreamHost
•  Mark Shuttleworth invested $1M
•  Sage Weil, CTO and creator of Ceph

Page 4: End of RAID as we know it with Ceph Replication

Ceph Technology Overview

Page 5: End of RAID as we know it with Ceph Replication

Ceph Technological Foundations

Ceph was built with the following goals:

•  Every component must scale
•  There can be no single points of failure
•  The solution must be software-based, not an appliance
•  Must be open source
•  Should run on readily-available, commodity hardware
•  Everything must self-manage wherever possible


Page 6: End of RAID as we know it with Ceph Replication

Ceph Innovations

CRUSH data placement algorithm
•  Infrastructure-aware algorithm that quickly adjusts to failures
•  Data location is computed rather than looked up
•  Enables clients to communicate directly with the servers that store their data
•  Enables clients to perform parallel I/O for greatly enhanced throughput

Reliable Autonomic Distributed Object Store (RADOS)
•  Storage devices assume complete responsibility for data integrity
•  They operate independently, in parallel, without central choreography
•  Very efficient. Very fast. Very scalable.

CephFS Distributed Metadata Server
•  Highly scalable to large numbers of active/active metadata servers and high throughput
•  Highly reliable and available, with full POSIX semantics and consistency guarantees
•  Has both a FUSE client and a client fully integrated into the Linux kernel

Advanced Virtual Block Device
•  Enterprise storage capabilities from utility server hardware
•  Thin provisioned, allocate-on-write snapshots, LUN cloning
•  In the Linux kernel and integrated with OpenStack components


Page 7: End of RAID as we know it with Ceph Replication

Unified Storage Platform

Object
•  Archival and backup storage
•  Primary data storage
•  S3-like storage
•  Web services and platforms
•  Application development

Block
•  SAN replacement
•  Virtual block device, VM images

File
•  HPC
•  POSIX-compatible applications

Page 8: End of RAID as we know it with Ceph Replication

Ceph Unified Storage Platform

OBJECTS: CEPH GATEWAY
A powerful S3- and Swift-compatible gateway that brings the power of the Ceph Object Store to modern applications

VIRTUAL DISKS: CEPH BLOCK DEVICE
A distributed virtual block device that delivers high-performance, cost-effective storage for virtual machines and legacy applications

FILES & DIRECTORIES: CEPH FILE SYSTEM
A distributed, scale-out filesystem with POSIX semantics that provides storage for legacy and modern applications

CEPH OBJECT STORE
A reliable, easy-to-manage, next-generation distributed object store that provides storage of unstructured data for applications; the gateway, block device, and file system are all built on it
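
To make the layering concrete, here is a minimal sketch of talking to the Ceph Object Store directly through the librados Python bindings. It is an illustration only; the pool name 'data' and the configuration file path are assumptions, not part of the original slides.

    import rados  # librados Python bindings shipped with Ceph

    # Connect using the cluster configuration file (path is an assumption).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool ('data' is a placeholder pool name).
    ioctx = cluster.open_ioctx('data')

    # Store and fetch an object by name -- no LUNs or filesystems involved.
    ioctx.write_full('hello-object', b'Hello, RADOS')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()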

Page 9: End of RAID as we know it with Ceph Replication

RADOS Cluster Makeup

[Diagram: a RADOS cluster is composed of RADOS nodes. Each node hosts several OSDs, one per disk, with each OSD running on top of a local filesystem (btrfs, xfs, or ext4). A small number of monitor daemons (M) round out the cluster.]
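
As a rough illustration of the pieces in that diagram, a minimal ceph.conf for such a cluster might look like the sketch below. All identifiers, addresses, and values are placeholders, and option spellings should be checked against the documentation linked near the end of this deck.

    [global]
        # fsid and monitor addresses below are placeholders
        fsid = 11111111-2222-3333-4444-555555555555
        mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3
        # keep three copies of every object by default
        osd pool default size = 3

    [osd]
        # on-disk journal size in MB
        osd journal size = 1024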

Page 10: End of RAID as we know it with Ceph Replication

RADOS Object Storage Daemons: Intelligent Storage Servers

•  Serve stored objects to clients
•  OSD is primary for some objects
   •  Responsible for replication
   •  Responsible for coherency
   •  Responsible for re-balancing
   •  Responsible for recovery
•  OSD is secondary for some objects
   •  Under control of primary
   •  Capable of becoming primary
•  Supports extended object classes
   •  Atomic transactions
   •  Synchronization and notifications
   •  Send computation to the data
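
A toy sketch of the primary/secondary split described above. This illustrates primary-copy replication in general, not Ceph's actual OSD implementation; all names are invented for the example.

    class OSD:
        """Toy storage daemon: primary for some objects, secondary for others."""
        def __init__(self, name):
            self.name = name
            self.store = {}

        def write_as_primary(self, obj, data, secondaries):
            # The primary applies the write locally ...
            self.store[obj] = data
            # ... and is responsible for pushing it to every secondary before
            # acknowledging, which is what keeps the replicas coherent.
            for osd in secondaries:
                osd.write_as_secondary(obj, data)
            return "ack"

        def write_as_secondary(self, obj, data):
            # Secondaries apply writes only under control of the primary,
            # but hold full copies and can be promoted if the primary fails.
            self.store[obj] = data

    primary, s1, s2 = OSD("osd.0"), OSD("osd.1"), OSD("osd.2")
    primary.write_as_primary("obj-A", b"payload", [s1, s2])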

Page 11: End of RAID as we know it with Ceph Replication

CRUSH

Pseudo-random placement algorithm
•  Deterministic function of inputs
•  Clients can compute data location

Rule-based configuration
•  Desired/required replica count
•  Affinity/distribution rules
•  Infrastructure topology
•  Weighting

Excellent data distribution
•  Declustered placement
•  Excellent data re-distribution
•  Migration proportional to change
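
For intuition only, the sketch below shows what purely computed placement means: every client ranks the OSDs with a deterministic hash and keeps the top entries, so no lookup table is needed. Real CRUSH additionally honors the topology, weights, and rules listed above; this simplified rendezvous-style hash does not.

    import hashlib

    def place(object_name, osds, replicas=3):
        # Rank every OSD by a deterministic hash of (object, osd) and keep the
        # first `replicas` entries. Any client holding the same OSD list computes
        # exactly the same answer: the location is calculated, never looked up.
        rank = lambda osd: hashlib.sha1(f"{object_name}:{osd}".encode()).hexdigest()
        return sorted(osds, key=rank)[:replicas]

    osds = ["osd.%d" % i for i in range(8)]
    print(place("my-object", osds))   # identical output on every client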

Page 12: End of RAID as we know it with Ceph Replication

[Diagram: a CLIENT determining where its objects live and communicating directly with the OSDs that store them]

Page 13: End of RAID as we know it with Ceph Replication

RADOS Monitors: Stewards of the Cluster

•  Distributed consensus (Paxos)
•  Odd number required (quorum)
•  Maintain/distribute cluster map
•  Authentication/key servers
•  Monitors are not in the data path
•  Clients talk directly to OSDs
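
The "odd number required" point is just majority arithmetic; a small sketch:

    def failures_tolerated(n_monitors):
        # A Paxos quorum is a strict majority of the configured monitors.
        quorum = n_monitors // 2 + 1
        return n_monitors - quorum

    for n in (1, 2, 3, 4, 5):
        print(n, "monitors ->", failures_tolerated(n), "failure(s) tolerated")
    # 2 monitors tolerate no more failures than 1, and 4 no more than 3,
    # which is why an odd count is preferred.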

Page 14: End of RAID as we know it with Ceph Replication

RAID and its Challenges


Page 15: End of RAID as we know it with Ceph Replication

Redundant Array of Inexpensive Disks

Enhanced Reliability
•  RAID-1 mirroring
•  RAID-5/6 parity (reduced overhead)
•  Automated recovery

Enhanced Performance
•  RAID-0 striping
•  SAN interconnects
•  Enterprise SAS drives
•  Proprietary H/W RAID controllers

Economical Storage Solutions
•  Software RAID implementations
•  iSCSI and JBODs

Enhanced Capacity
•  Logical volume concatenation

Page 16: End of RAID as we know it with Ceph Replication

RAID Challenges: Capacity/Speed

•  Storage economies in disks come from more GB per spindle
•  NRE rates are flat (typically estimated at 10^-15 per bit)
   •  4% chance of an NRE while recovering a 4+1 RAID-5 set, and it goes up with the number of volumes in the set
   •  Many RAID controllers fail the recovery after an NRE
•  Access speed has not kept up with density increases
   •  27 hours to rebuild a 4+1 RAID-5 set at 20MB/s, during which time a second drive can fail (a back-of-the-envelope check appears below)
•  Managing the risk of second failures requires hot-spares
   •  Defeating some of the savings from parity redundancy
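
As a rough check on those figures (the 27-hour number implies volumes of roughly 2 TB; the other inputs come from the slide):

$$ t_{\text{rebuild}} \approx \frac{\text{capacity}}{\text{rebuild rate}} = \frac{2\times10^{12}\ \text{B}}{2\times10^{7}\ \text{B/s}} = 10^{5}\ \text{s} \approx 28\ \text{hours} $$

$$ P(\text{NRE during recovery}) \approx 1-(1-r)^{b} \approx r\,b, \qquad r = 10^{-15}\ \text{per bit},\quad b = \text{bits read from the surviving drives} $$

Because b grows with both drive capacity and the number of surviving volumes, larger drives and wider sets push the NRE risk up roughly linearly.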

Page 17: End of RAID as we know it with Ceph Replication

RAID Challenges: Expansion

•  The next generation of disks will be larger and cost less per GB; we would like to use these as we expand
•  Most RAID replication schemes require identical disks
   •  meaning new disks cannot be added to an old set
   •  meaning failed disks must be replaced with identical units
•  Proprietary appliances may require replacements from the manufacturer (at much higher than commodity prices)
•  Many storage systems reach a limit beyond which they cannot be further expanded (e.g. fork-lift upgrade)
•  Re-balancing existing data over new volumes is non-trivial

Page 18: End of RAID as we know it with Ceph Replication

RAID Challenges: Reliability/Availability

•  RAID-5 can only survive a single disk failure
   •  The odds of an NRE during recovery are significant
   •  Odds of a second failure during recovery are non-negligible (a rough estimate appears below)
   •  Annual petabyte durability for RAID-5 is only 3 nines
•  RAID-6 redundancy protects against two disk failures
   •  Odds of an NRE during recovery are still significant
   •  Client data access will be starved out during recovery
   •  Throttling recovery increases the risk of data loss
•  Even RAID-6 can't protect against:
   •  Server failures
   •  NIC failures
   •  Switch failures
   •  OS crashes
   •  Facility or regional disasters
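
To put a rough number on the second-failure risk, assume (purely for illustration) a 3% annual failure rate per drive and the 27-hour rebuild window from the previous slide; with four surviving drives in a 4+1 set:

$$ P(\text{2nd failure during rebuild}) \approx 4 \times 0.03 \times \frac{27}{8760} \approx 3.7\times10^{-4} $$

Small for any single rebuild, but it compounds across many arrays and many rebuilds per year, and it sits on top of the NRE exposure estimated earlier.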

Page 19: End of RAID as we know it with Ceph Replication

RAID Challenges: Expense

Capital Expenses … good RAID costs
•  Significant mark-up for enterprise hardware
•  High-performance RAID controllers can add $50-100/disk
•  SANs further increase costs
•  Expensive equipment, much of which is often poorly used
•  Software RAID is much less expensive, and much slower

Operating Expenses … RAID doesn't manage itself
•  RAID group, LUN and pool management
•  Lots of application-specific tunable parameters
•  Difficult expansion and migration
•  When a recovery goes bad, it goes very bad
•  Don't even think about putting off replacing a failed drive

Page 20: End of RAID as we know it with Ceph Replication

Ceph Advantages

Page 21: End of RAID as we know it with Ceph Replication

Ceph Value Proposition

SAVES TIME
•  Self-managing
•  OK to batch drive replacements
•  Emerging platform integration

INCREASES FLEXIBILITY
•  Object, block, & filesystem storage
•  Highly adaptable software solution
•  Easier deployment of new services

LOWERS RISK
•  No vendor lock-in
•  Rule-configurable failure zones
•  Improved reliability and availability

SAVES MONEY
•  Open source
•  Runs on commodity hardware
•  Runs in heterogeneous environments

Page 22: End of RAID as we know it with Ceph Replication

Ceph Advantage: Declustered Placement

•  Consider a failed 2TB RAID mirror
   •  We must copy 2TB from the survivor to the successor
   •  Survivor and successor are likely in the same failure zone

•  Consider 2TB of RADOS objects clustered on the same (failed) primary
   •  Surviving copies are declustered (on different secondaries)
   •  New copies will be declustered (on different successors)
   •  Copy 10GB from each of 200 survivors to 200 successors
   •  Survivors and successors are in different failure zones

•  Benefits
   •  Recovery is parallel and 200x faster (see the arithmetic below)
   •  Service can continue during the recovery process
   •  Exposure to 2nd failures is reduced by 200x
   •  Zone-aware placement protects against higher-level failures
   •  Recovery is automatic and does not await new drives
   •  No idle hot-spares are required
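
The 200x figure is the parallelism itself. Assuming (for illustration only) that each spindle can devote about 100 MB/s to recovery:

$$ t_{\text{mirror}} \approx \frac{2\ \text{TB}}{100\ \text{MB/s}} = 2\times10^{4}\ \text{s} \approx 5.6\ \text{h} \qquad\text{vs.}\qquad t_{\text{declustered}} \approx \frac{10\ \text{GB}}{100\ \text{MB/s}} = 100\ \text{s} $$

because all 200 survivor/successor pairs copy their 10 GB shares at the same time; the window during which a second failure or an NRE can do harm shrinks by the same factor.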

Page 23: End of RAID as we know it with Ceph Replication

[Diagram: client access continuing while declustered recovery proceeds in parallel across the cluster]

Page 24: End of RAID as we know it with Ceph Replication

Ceph Advantage: Object Granularity

•  Consider a failed 2TB RAID mirror
   •  To recover it we must read and write (at least) 2TB
   •  Successor must be the same size as the failed volume
   •  An error in recovery will probably lose the file system

•  Consider a failed RADOS OSD
   •  To recover it we must read and write thousands of objects
   •  Successor OSDs must each have some free space
   •  An error in recovery will probably lose one object

•  Benefits
   •  Heterogeneous commodity disks are easily supported
   •  Better and more uniform space utilization
   •  Per-object updates always preserve causality ordering
   •  Object updates are more easily replicated over WAN links
   •  Greatly reduced data loss if errors do occur

Page 25: End of RAID as we know it with Ceph Replication

Ceph Advantage: Intelligent Storage

•  Intelligent OSDs automatically rebalance data
   •  When new nodes are added
   •  When old nodes fail or are decommissioned
   •  When placement policies are changed

•  The resulting rebalancing is very good:
   •  Even distribution of data across all OSDs
   •  Uniform mix of old and new data across all OSDs
   •  Moves only as much data as required

•  Intelligent OSDs continuously scrub their objects
   •  To detect and correct silent write errors before another failure

•  This architecture scales from petabytes to exabytes
   •  A single pool of thin-provisioned, self-managing storage
   •  Serving a wide range of block, object, and file clients

Page 26: End of RAID as we know it with Ceph Replication

Ceph Advantage: Price

•  Can leverage commodity hardware for lowest costs
•  Not locked in to a single vendor; get the best deal over time
•  RAID not required, leading to lower component costs

                       Enterprise RAID          Ceph Replication
Raw $/GB               $3                       $0.50
Protected $/GB         $4 (RAID6 6+2)           $1.50 (3 copies)
Usable $/GB (90%)      $4.44                    $1.67
Replicated $/GB        $8.88 (main + backup)    $1.67 (3 copies)
Relative Expense       533% storage cost        Baseline (100%)
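
The rows of the table follow from simple arithmetic on the raw $/GB figures:

$$ \text{RAID:}\quad \$3 \times \tfrac{8}{6} = \$4, \qquad \tfrac{\$4}{0.9} \approx \$4.44, \qquad \$4.44 \times 2\ (\text{main + backup}) = \$8.88\ \text{per usable GB} $$

$$ \text{Ceph:}\quad \$0.50 \times 3 = \$1.50, \qquad \tfrac{\$1.50}{0.9} \approx \$1.67\ \text{per usable GB (the three copies already provide the redundancy)} $$

$$ \frac{\$8.88}{\$1.67} \approx 5.3 \quad\Rightarrow\quad \text{RAID at roughly 533\% of the Ceph baseline} $$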

Page 27: End of RAID as we know it with Ceph Replication

Q&A

Page 28: End of RAID as we know it with Ceph Replication

Leverage great online resources

Documentation on the Ceph web site:
•  http://ceph.com/docs/master/

Blogs from Inktank and the Ceph community:
•  http://www.inktank.com/news-events/blog/
•  http://ceph.com/community/blog/

Developer resources:
•  http://ceph.com/resources/development/
•  http://ceph.com/resources/mailing-list-irc/
•  http://dir.gmane.org/gmane.comp.file-systems.ceph.devel

Page 29: End of RAID as we know it with Ceph Replication

Leverage Ceph Expert Support

Inktank will partner with you for complex deployments:
•  Solution design and proof-of-concept
•  Solution customization
•  Capacity planning
•  Performance optimization

Having access to expert support is a production best practice:
•  Troubleshooting
•  Debugging

A full description of our services can be found at the following:
•  Consulting Services: http://www.inktank.com/consulting-services/
•  Support Subscriptions: http://www.inktank.com/support-services/

Page 30: End of RAID as we know it with Ceph Replication

Check out our upcoming webinars

Ceph Unified Storage for OpenStack
•  April 4, 2013
•  10:00AM PT, 12:00PM CT, 1:00PM ET

Technical Deep Dive Into Ceph Object Storage
•  April 10, 2013
•  10:00AM PT, 12:00PM CT, 1:00PM ET

Register today at: http://www.inktank.com/news-events/webinars/

Page 31: End of RAID as we know it with Ceph Replication

Contact Us

•  Email: [email protected]
•  Phone: 1-855-INKTANK

Don't forget to follow us on:
•  Twitter: https://twitter.com/inktank
•  Facebook: http://www.facebook.com/inktank
•  YouTube: http://www.youtube.com/inktankstorage

Page 32: End of RAID as we know it with Ceph Replication

Thank you for joining!