
End of RAID as we know it with Ceph Replication

Jan 16, 2015



inktank

Our VP of Engineering Mark Kampe explores the technical and financial advantages of Ceph replication over RAID.

1. Inktank: Delivering the Future of Storage. "The End of RAID as You Know It with Ceph Replication." March 28, 2013

2. Agenda
   - Inktank and Ceph introduction
   - Ceph technology
   - Challenges of RAID
   - Ceph advantages
   - Q&A
   - Resources and moving forward

3. Inktank and Ceph
   - Inktank: provides professional services and support for the Ceph platform; founded in 2011; created by storage experts; funded by DreamHost; Mark Shuttleworth invested $1M; Sage Weil is CTO and the creator of Ceph
   - Ceph: distributed, unified object, block, and file storage; open source; in the Linux kernel; integrated into cloud platforms

4. Ceph Technology Overview

5. Ceph Technological Foundations
   Ceph was built with the following goals:
   - Every component must scale
   - There can be no single point of failure
   - The solution must be software-based, not an appliance
   - Must be open source
   - Should run on readily available, commodity hardware
   - Everything must self-manage wherever possible

6. Ceph Innovations
   CRUSH data placement algorithm
   - The algorithm is infrastructure-aware and quickly adjusts to failures
   - Data location is computed rather than looked up (a toy placement sketch illustrating this follows slide 10 below)
   - Enables clients to communicate directly with the servers that store their data
   - Enables clients to perform parallel I/O for greatly enhanced throughput
   RADOS (Reliable Autonomic Distributed Object Store)
   - Storage devices assume complete responsibility for data integrity
   - They operate independently, in parallel, without central choreography
   - Very efficient, very fast, very scalable
   CephFS distributed metadata server
   - Highly scalable to large numbers of active/active metadata servers and high throughput
   - Highly reliable and available, with full POSIX semantics and consistency guarantees
   - Has both a FUSE client and a client fully integrated into the Linux kernel
   Advanced virtual block device
   - Enterprise storage capabilities from utility server hardware
   - Thin provisioning, allocate-on-write snapshots, LUN cloning
   - In the Linux kernel and integrated with OpenStack components

7. Unified Storage Platform
   - Object: archival and backup storage; primary data storage; S3-like storage; web services and platforms; application development
   - Block: SAN replacement; virtual block device; VM images
   - File: HPC; POSIX-compatible applications

8. Ceph Unified Storage Platform
   - Ceph Gateway (objects): a powerful S3- and Swift-compatible gateway that brings the power of the Ceph Object Store to modern applications
   - Ceph Block Device (virtual disks): a distributed virtual block device that delivers high-performance, cost-effective storage for virtual machines and legacy applications
   - Ceph File System (files and directories): a distributed, scale-out filesystem with POSIX semantics that provides storage for legacy and modern applications
   - Ceph Object Store: a reliable, easy-to-manage, next-generation distributed object store that provides storage of unstructured data for applications

9. RADOS Cluster Makeup
   (Diagram) A RADOS node runs one OSD per disk, each OSD on a local filesystem (btrfs, xfs, or ext4); a RADOS cluster is made up of many such nodes plus a small number of monitors (M).

10. RADOS Object Storage Daemons: Intelligent Storage Servers
   - Serve stored objects to clients
   - An OSD is primary for some objects: responsible for replication, coherency, re-balancing, and recovery
   - An OSD is secondary for some objects: under control of the primary, capable of becoming primary
   - Supports extended object classes: atomic transactions, synchronization and notifications, sending computation to the data
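Slide 6's claim that data location is "computed rather than looked up" is the heart of CRUSH. The sketch below is a deliberately simplified illustration of that idea, not the real CRUSH algorithm (real CRUSH walks a weighted, failure-domain-aware hierarchy described by the cluster map); the `place_object` function, the OSD names, and the 12-OSD cluster are hypothetical, made up for the example.

```python
# Toy sketch of "computed, not looked-up" placement, in the spirit of CRUSH.
# NOT the real CRUSH algorithm: this only shows that any client with the same
# inputs (object name, cluster map, replica count) derives the same OSDs
# without consulting a central lookup table.
import hashlib

def stable_hash(*parts: str) -> int:
    """Deterministic integer hash of the concatenated inputs."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).hexdigest()
    return int(digest, 16)

def place_object(obj_name: str, osds: list, replicas: int) -> list:
    """Pick `replicas` distinct OSDs for an object, purely as a function of its inputs."""
    chosen = []
    attempt = 0
    while len(chosen) < replicas:
        osd = osds[stable_hash(obj_name, str(attempt)) % len(osds)]
        if osd not in chosen:          # skip collisions so replicas stay distinct
            chosen.append(osd)
        attempt += 1
    return chosen

cluster_map = [f"osd.{i}" for i in range(12)]      # hypothetical 12-OSD cluster
print(place_object("rbd_data.1001", cluster_map, replicas=3))
# Every client computing this gets the same answer, so it can talk to those
# OSDs directly and issue I/O against many placements in parallel.
```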
11. CRUSH
   Pseudo-random placement algorithm
   - Deterministic function of its inputs
   - Clients can compute data location
   Rule-based configuration
   - Desired/required replica count
   - Affinity/distribution rules
   - Infrastructure topology
   - Weighting
   Excellent data distribution
   - Declustered placement
   Excellent data re-distribution
   - Migration proportional to change

12. (Diagram slide: CLIENT)

13. RADOS Monitors: Stewards of the Cluster
   - Distributed consensus (Paxos); an odd number is required for quorum
   - Maintain and distribute the cluster map
   - Authentication/key servers
   - Monitors are not in the data path; clients talk directly to OSDs

14. RAID and Its Challenges

15. Redundant Array of Inexpensive Disks
   - Enhanced reliability: RAID-1 mirroring; RAID-5/6 parity (reduced overhead); automated recovery
   - Enhanced performance: RAID-0 striping; SAN interconnects; enterprise SAS drives; proprietary hardware RAID controllers
   - Economical storage solutions: software RAID implementations; iSCSI and JBODs
   - Enhanced capacity: logical volume concatenation

16. RAID Challenges: Capacity/Speed
   - Storage economies in disks come from more GB per spindle
   - NRE rates are flat (typically estimated at 10^-15 per bit): roughly a 4% chance of an NRE while recovering a 4+1 RAID-5 set, and it goes up with the number of volumes in the set; many RAID controllers fail the recovery after an NRE
   - Access speed has not kept up with density increases: about 27 hours to rebuild a 4+1 RAID-5 set at 20 MB/s, during which time a second drive can fail (a back-of-the-envelope check of these figures follows slide 19 below)
   - Managing the risk of second failures requires hot spares, defeating some of the savings from parity redundancy

17. RAID Challenges: Expansion
   - The next generation of disks will be larger and cost less per GB; we would like to use them as we expand
   - Most RAID replication schemes require identical disks, meaning new disks cannot be added to an old set and failed disks must be replaced with identical units
   - Proprietary appliances may require replacements from the manufacturer (at much higher than commodity prices)
   - Many storage systems reach a limit beyond which they cannot be further expanded (forcing a fork-lift upgrade)
   - Re-balancing existing data over new volumes is non-trivial

18. RAID Challenges: Reliability/Availability
   - RAID-5 can only survive a single disk failure: the odds of an NRE during recovery are significant, the odds of a second failure during recovery are non-negligible, and annual petabyte durability for RAID-5 is only three nines
   - RAID-6 redundancy protects against two disk failures, but the odds of an NRE during recovery are still significant
   - Client data access will be starved out during recovery, and throttling recovery increases the risk of data loss
   - Even RAID-6 can't protect against server failures, NIC failures, switch failures, OS crashes, or facility and regional disasters

19. RAID Challenges: Expense
   Capital expenses: good RAID costs
   - Significant mark-up for enterprise hardware
   - High-performance RAID controllers can add $50-100 per disk
   - SANs increase costs further: expensive equipment, much of which is often poorly utilized
   - Software RAID is much less expensive, and much slower
   Operating expenses: RAID doesn't manage itself
   - RAID group, LUN, and pool management
   - Lots of application-specific tunable parameters
   - Difficult expansion and migration
   - When a recovery goes bad, it goes very bad
   - Don't even think about putting off replacing a failed drive
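As a sanity check on the rebuild-time and NRE figures quoted on slide 16, here is a back-of-the-envelope calculation. It assumes 2 TB data drives (the slide does not state a drive size, but 2 TB is consistent with its 27-hour figure) and treats the quoted NRE rate as an independent per-bit probability; both assumptions are ours, not the slide's.

```python
# Back-of-the-envelope check of the RAID-5 figures on the Capacity/Speed slide.
# Assumed: 2 TB data drives; the rebuild is limited by the 20 MB/s quoted there.
import math

drive_bytes  = 2e12      # assumed 2 TB per drive (not stated on the slide)
rebuild_rate = 20e6      # 20 MB/s, from the slide
nre_per_bit  = 1e-15     # NRE rate, from the slide

# Rebuilding a 4+1 RAID-5 set means writing one full replacement drive.
rebuild_hours = drive_bytes / rebuild_rate / 3600
print(f"rebuild time ~ {rebuild_hours:.0f} hours")       # ~28 h; the slide says 27

# The rebuild must also read the 4 surviving drives end to end.
bits_read = 4 * drive_bytes * 8
p_nre = 1 - math.exp(-nre_per_bit * bits_read)           # chance of at least one NRE
print(f"chance of an NRE during rebuild ~ {p_nre:.1%}")  # ~6%; same order as the
                                                         # slide's ~4% (the slide may
                                                         # assume smaller drives)
```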
20. Ceph Advantages

21. Ceph Value Proposition
   - Saves money: open source; runs on commodity hardware; runs in heterogeneous environments
   - Saves time: self-managing; OK to batch drive replacements; emerging platform integration
   - Increases flexibility: object, block, and filesystem storage; highly adaptable software solution; easier deployment of new services
   - Lowers risk: no vendor lock-in; rule-configurable failure zones; improved reliability and availability

22. Ceph Advantage: Declustered Placement
   - Consider a failed 2TB RAID mirror: we must copy 2TB from the survivor to the successor, and the survivor and successor are likely in the same failure zone
   - Consider two RADOS objects clustered on the same primary: the surviving copies are declustered (on different secondaries) and the new copies will be declustered (on different successors), so we copy 10GB from each of 200 survivors to 200 successors, with survivors and successors in different failure zones
   - Benefits: recovery is parallel and 200x faster; service can continue during the recovery process; exposure to second failures is reduced by 200x; zone-aware placement protects against higher-level failures; recovery is automatic and does not await new drives; no idle hot spares are required

23. (Diagram slide: CLIENT)

24. Ceph Advantage: Object Granularity
   - Consider a failed 2TB RAID mirror: to recover it we must read and write (at least) 2TB, the successor must be the same size as the failed volume, and an error in recovery will probably lose the file system
   - Consider a failed RADOS OSD: to recover it we must read and write thousands of objects, successor OSDs must each have some free space, and an error in recovery will probably lose one object
   - Benefits: heterogeneous commodity disks are easily supported; better and more uniform space utilization; per-object updates always preserve causality ordering; object updates are more easily replicated over WAN links; greatly reduced data loss if errors do occur

25. Ceph Advantage: Intelligent Storage
   - Intelligent OSDs automatically rebalance data when new nodes are added, when old nodes fail or are decommissioned, and when placement policies are changed
   - The resulting rebalancing is very good: even distribution of data across all OSDs, a uniform mix of old and new data across all OSDs, and only as much data moved as required
   - Intelligent OSDs continuously scrub their objects to detect and correct silent write errors before another failure
   - This architecture scales from petabytes to exabytes: a single pool of thin-provisioned, self-managing storage serving a wide range of block, object, and file clients

26. Ceph Advantage: Price
   - Can leverage commodity hardware for the lowest costs
   - Not locked in to a single vendor; get the best deal over time
   - RAID not required, leading to lower component costs

                         Enterprise RAID           Ceph Replication
   Raw $/GB              $3.00                     $0.50
   Protected $/GB        $4.00 (RAID-6 6+2)        $1.50 (3 copies)
   Usable (90%) $/GB     $4.44                     $1.67
   Replicated $/GB       $8.88 (main + backup)     $1.67 (3 copies)
   Relative expense      533% storage cost         Baseline (100%)

   (The arithmetic behind these figures is reproduced in the sketch at the end of this transcript.)

27. Q&A

28. Leverage great online resources
   Documentation on the Ceph web site:
   - http://ceph.com/docs/master/
   Blogs from Inktank and the Ceph community:
   - http://www.inktank.com/news-events/blog/
   - http://ceph.com/community/blog/
   Developer resources:
   - http://ceph.com/resources/development/
   - http://ceph.com/resources/mailing-list-irc/
   - http://dir.gmane.org/g
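For reference, the cost-comparison table on slide 26 can be reproduced with a few lines of arithmetic. The derivation steps below (an 8/6 overhead for RAID-6 6+2, a factor of 3 for three replicas, dividing by 90% usable space, and doubling the RAID side for a main-plus-backup copy) are our reading of how the slide's numbers were produced; the slide itself only presents the results.

```python
# Reproducing the arithmetic implied by the cost table on the Price slide.
raid_raw, ceph_raw = 3.00, 0.50               # raw $/GB, from the slide

raid_protected = raid_raw * 8 / 6             # RAID-6 6+2: 8 drives hold 6 drives of data
ceph_protected = ceph_raw * 3                 # 3 replicated copies

raid_usable = raid_protected / 0.9            # 90% usable space
ceph_usable = ceph_protected / 0.9

raid_replicated = raid_usable * 2             # main + backup copy
ceph_replicated = ceph_usable                 # the 3 copies already serve as the backup

print(f"RAID: {raid_protected:.2f} -> {raid_usable:.2f} -> {raid_replicated:.2f} $/GB")
print(f"Ceph: {ceph_protected:.2f} -> {ceph_usable:.2f} -> {ceph_replicated:.2f} $/GB")
print(f"relative expense: {raid_replicated / ceph_replicated:.0%}")
# RAID: 4.00 -> 4.44 -> 8.89 $/GB   (the slide rounds the last figure to $8.88)
# Ceph: 1.50 -> 1.67 -> 1.67 $/GB
# relative expense: 533%
```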