Ceph: one decade in – Sage Weil

Ceph Day New York

May 25, 2015

Keynote by Sage Weil, Red Hat
Transcript
Page 1

Ceph: one decade in – Sage Weil

Page 2
Page 3
Page 4

RESEARCH

Page 5

RESEARCH → INCUBATION

Page 6

RESEARCH → INCUBATION → INKTANK

Page 7

Research beginnings


Page 8

RESEARCH

Page 9

UCSC research grant

● “Petascale object storage”
● DOE: LANL, LLNL, Sandia

● Scalability, reliability, performance

● HPC file system workloads

● Scalable metadata management

● First line of Ceph code
● Summer internship at LLNL

● High security national lab environment

● Could write anything, as long as it was OSS

Page 10

The rest of Ceph

● RADOS – distributed object storage cluster (2005)

● EBOFS – local object storage (2004/2006)

● CRUSH – hashing for the real world (2005)

● Paxos monitors – cluster consensus (2006)

→ emphasis on consistent, reliable storage

→ scale by pushing intelligence to the edges

→ a different but compelling architecture

Page 11
Page 12

Industry black hole

● Many large storage vendors
● Proprietary solutions that don't scale well

● Few open source alternatives (2006)
● Very limited scale, or

● Limited community and architecture (Lustre)

● No enterprise feature sets (snapshots, quotas)

● PhD grads all built interesting systems...
● ...and then went to work for NetApp, DDN, EMC, Veritas.

● They want you, not your project

Page 13

A different path?

● Change the storage world with open source
● Do what Linux did to Solaris, Irix, Ultrix, etc.

● License
● LGPL: share changes, okay to link to proprietary code

● Avoid unfriendly practices
● Dual licensing

● Copyright assignment

● Platform
● Remember sourceforge.net?

Page 14

Incubation


Page 15

RESEARCH → INCUBATION

Page 16

DreamHost!

● Move back to LA, continue hacking

● Hired a few developers

● Pure development

● No deliverables

Page 17

Ambitious feature set

● Native Linux kernel client (2007-)

● Per-directory snapshots (2008)

● Recursive accounting (2008)

● Object classes (2009)

● librados (2009) – see the sketch after this list

● radosgw (2009)

● strong authentication (2009)

● RBD: rados block device (2010)
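
librados in particular exposes the RADOS object store directly to applications. A minimal sketch using the python-rados bindings (assumes a reachable cluster with a readable ceph.conf and keyring; the pool name 'data' and the object name are placeholders):

import rados

# Connect using the local ceph.conf and default credentials.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    # I/O context bound to a pool; 'data' is a placeholder pool name.
    ioctx = cluster.open_ioctx('data')
    try:
        # Store an object in RADOS and read it back.
        ioctx.write_full('hello-object', b'one decade in')
        print(ioctx.read('hello-object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

The same library underpins radosgw and RBD, which are built as clients of RADOS rather than as separate storage systems.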

Page 18

The kernel client

● ceph-fuse was limited, not very fast

● Build native Linux kernel implementation

● Began attending Linux file system developer events (LSF)

● Early words of encouragement from ex-Lustre dev

● Engage Linux fs developer community as peer

● Initial merge attempts rejected by Linus
● Not sufficient evidence of user demand

● A few fans and would-be users chimed in...

● Eventually merged for v2.6.34 (early 2010)

Page 19

Part of a larger ecosystem

● Ceph need not solve all problems as monolithic stack

● Replaced ebofs object file system with btrfs
● Same design goals; avoid reinventing the wheel

● Robust, supported, well-optimized

● Kernel-level cache management

● Copy-on-write, checksumming, other goodness

● Contributed some early functionality
● Cloning files

● Async snapshots

Page 20

Budding community

● #ceph on irc.oftc.net, [email protected]

● Many interested users

● A few developers

● Many fans

● Too unstable for any real deployments

● Still mostly focused on right architecture and technical solutions

Page 21

Road to product

● DreamHost decides to build an S3-compatible object storage service with Ceph

● Stability
● Focus on core RADOS, RBD, radosgw

● Paying back some technical debt
● Build testing automation

● Code review!

● Expand engineering team

Page 22

The reality

● Growing incoming commercial interest
● Early attempts from organizations large and small

● Difficult to engage with a web hosting company

● No means to support commercial deployments

● Project needed a company to back it
● Fund the engineering effort

● Build and test a product

● Support users

● Orchestrated a spin out of DreamHost in 2012

Page 23

Inktank


Page 24

RESEARCH → INCUBATION → INKTANK

Page 25

Do it right

● How do we build a strong open source company?

● How do we build a strong open source community?

● Models?
● Red Hat, SUSE, Cloudera, MySQL, Canonical, …

● Initial funding from DreamHost, Mark Shuttleworth

Page 26

Goals

● A stable Ceph release for production deployment
● DreamObjects

● Lay foundation for widespread adoption
● Platform support (Ubuntu, Red Hat, SUSE)

● Documentation

● Build and test infrastructure

● Build a sales and support organization

● Expand engineering organization

Page 27

Branding

● Early decision to engage professional agency

● Terms like
● “Brand core”

● “Design system”

● Company vs. project
● Inktank != Ceph

● Establish a healthy relationship with the community

● Aspirational messaging: The Future of Storage

Page 28

Slick graphics
● Broken PowerPoint template


Page 29

Traction

● Too many production deployments to count
● We don't know about most of them!

● Too many customers (for me) to count

● Growing partner list

● Lots of buzz

● OpenStack

Page 30

Quality

● Increased adoption means increased demands on robust testing

● Across multiple platforms
● Include platforms we don't use

● Upgrades
● Rolling upgrades

● Inter-version compatibility

Page 31

Developer community

● Significant external contributors

● First-class feature contributions from external developers

● Non-Inktank participants in daily stand-ups

● External access to build/test lab infrastructure

● Common toolset
● GitHub

● Email (kernel.org)

● IRC (oftc.net)

● Linux distros

Page 32

CDS: Ceph Developer Summit

● Community process for building project roadmap

● 100% online
● Google Hangouts

● Wikis

● Etherpad

● First was in Spring 2013, sixth is coming up

● (Continuing to) indoctrinate our own developers to an open development model

Page 33

And then...

s/Red Hat of Storage/Storage of Red Hat/

Page 34

Calamari

● Inktank strategy was to package Ceph for the Enterprise

● Inktank Ceph Enterprise (ICE)
● Ceph: a hardened, tested, validated version

● Calamari: management layer and GUI (proprietary!)

● Enterprise integrations: SNMP, Hyper-V, VMware

● Support SLAs

● Red Hat model is pure open source
● Open sourced Calamari

Page 35

The Present


Page 36

Tiering

● Client side caches are great, but only buy so much.

● Can we separate hot and cold data onto different storage devices?

● Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool

● Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding; not yet implemented)

● Very cold pools (efficient erasure coding, compression, OSD spin-down to save power) or tape/public cloud

● How do you identify what is hot and cold? (see the sketch below)

● Common in enterprise solutions; not found in open source scale-out systems

→ cache pools new in Firefly, better in Giant
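
For a concrete picture, here is a rough sketch of wiring up a Firefly-era cache tier with the ceph CLI, driven from Python (pool names 'cold-pool' and 'hot-cache' are placeholders; assumes admin access to a running cluster):

import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI; needs client.admin credentials.
    subprocess.run(['ceph', *args], check=True)

# Create the fast pool that will sit in front of the existing backing pool.
ceph('osd', 'pool', 'create', 'hot-cache', '128')

# Attach the cache pool and route client I/O through it.
ceph('osd', 'tier', 'add', 'cold-pool', 'hot-cache')
ceph('osd', 'tier', 'cache-mode', 'hot-cache', 'writeback')
ceph('osd', 'tier', 'set-overlay', 'cold-pool', 'hot-cache')

# Hot/cold identification: bloom-filter hit sets record recent object access.
ceph('osd', 'pool', 'set', 'hot-cache', 'hit_set_type', 'bloom')
ceph('osd', 'pool', 'set', 'hot-cache', 'hit_set_count', '8')
ceph('osd', 'pool', 'set', 'hot-cache', 'hit_set_period', '3600')

In practice flush/eviction thresholds (e.g. target_max_bytes, cache_target_dirty_ratio) would also be set; they are omitted here to keep the sketch short.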

Page 37

Erasure coding

● Replication for redundancy is flexible and fast

● For larger clusters, it can be expensive

● We can trade recovery performance for storage

● Erasure coded data is hard to modify, but ideal for cold or read-only objects

● Cold storage tiering

● Will be used directly by radosgw

Scheme           Storage overhead   Repair traffic   MTTDL (days)
3x replication   3x                 1x               2.3e10
RS (10, 4)       1.4x               10x              3.3e13
LRC (10, 6, 5)   1.6x               5x               1.2e15
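
A back-of-the-envelope check of the first two columns (MTTDL requires a full reliability model and is not reproduced here): replication's overhead is the copy count and repair reads one surviving replica; RS(k, m) stores (k+m)/k and rebuilds a lost chunk from k survivors; LRC adds local parities, so a single-chunk repair reads roughly a locality group of l chunks instead of k.

def replication(copies):
    # (storage overhead, chunks read to repair one lost copy)
    return copies, 1

def reed_solomon(k, m):
    # k data chunks + m coding chunks; one lost chunk is rebuilt from k survivors.
    return (k + m) / k, k

def lrc(k, m, l):
    # Locally repairable code: same overhead formula, but a single lost chunk
    # is rebuilt from a local group of about l chunks.
    return (k + m) / k, l

for name, (overhead, repair) in [
    ('3x replication', replication(3)),
    ('RS (10, 4)', reed_solomon(10, 4)),
    ('LRC (10, 6, 5)', lrc(10, 6, 5)),
]:
    print(f'{name:15} overhead {overhead:.1f}x, repair traffic {repair}x')

This reproduces the 3x/1x, 1.4x/10x, and 1.6x/5x figures in the table above.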

Page 38

Erasure coding (cont'd)

● In Firefly (a profile-setup sketch follows this list)

● LRC in Giant

● Intel ISA-L (optimized library) in Giant, maybe backported to Firefly

● Talk of ARM-optimized (NEON) jerasure
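
A sketch of how profiles for these plugins are created and attached to a pool, again via the ceph CLI from Python (profile and pool names are placeholders; assumes a cluster new enough to ship the chosen plugin, e.g. Giant for LRC):

import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI; needs admin credentials.
    subprocess.run(['ceph', *args], check=True)

# RS(10, 4) with the default jerasure plugin; plugin=isa selects the ISA-L backend.
ceph('osd', 'erasure-code-profile', 'set', 'rs-10-4',
     'k=10', 'm=4', 'plugin=jerasure')

# LRC(10, 6, 5): locally repairable code with locality parameter l=5.
ceph('osd', 'erasure-code-profile', 'set', 'lrc-10-6-5',
     'k=10', 'm=6', 'l=5', 'plugin=lrc')

# Erasure-coded pool backed by one of the profiles.
ceph('osd', 'pool', 'create', 'ec-archive', '128', '128', 'erasure', 'rs-10-4')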

Page 39

Async Replication in RADOS

● Clinic project with Harvey Mudd College

● Group of students working on real world project

● Reason about bounds on clock drift so we can achieve point-in-time consistency across a distributed set of nodes

Page 40

CephFS

● Dogfooding for internal QA infrastructure

● Learning lots

● Many rough edges, but working quite well!

● We want to hear from you!

Page 41

The Future


Page 42

CephFS

→ This is where it all started – let's get there

● Today
● QA coverage and bug squashing continues

● NFS and CIFS now largely complete and robust

● Multi-MDS stability continues to improve

● Need
● QA investment

● Snapshot work

● Amazing community effort

Page 43

The larger ecosystem

Page 44

Storage backends

● Backends are pluggable

● Recent work to use rocksdb everywhere leveldb can be used (mon/osd); can easily plug in other key/value store libraries

● Other possibilities include LMDB or NVMKV (from Fusion-io)

● Prototype kinetic backend

● Alternative OSD backends
● KeyValueStore – put all data in a k/v db (Haomai @ UnitedStack)

● KeyFileStore initial plans (2nd gen?)

● Some partners looking at backends tuned to their hardware

Page 45

Governance

How do we strengthen the project community?

● Recognize project leads
● RBD, RGW, RADOS, CephFS, Calamari, etc.

● Formalize processes around CDS, community roadmap

● Formal foundation?
● Community build and test lab infrastructure

● Build and test for broad range of OSs, distros, hardware

Page 46

Technical roadmap

● How do we reach new use-cases and users?

● How do we better satisfy existing users?

● How do we ensure Ceph can succeed in enough markets for business investment to thrive?

● Enough breadth to expand and grow the community

● Enough focus to do well

Page 47

Performance

● Lots of work with partners to improve performance

● High-end flash backends: optimize hot paths to limit CPU usage, drive up IOPS

● Improve threading, fine-grained locks

● Low-power processors: run well on small ARM devices (including those new-fangled ethernet drives)

Page 48

Ethernet Drives

● Multiple vendors are building 'ethernet drives'

● Normal hard drives w/ small ARM host on board

● Could run OSD natively on the drive, completely remove the “host” from the deployment

● Many different implementations, some vendors need help w/ open architecture and ecosystem concepts

● Current devices are hard disks; no reason they couldn't also be flash-based, or hybrid

● This is exactly what we were thinking when Ceph was originally designed!

Page 49

Big data?

● Hadoop map/reduce built on a weak “file system” model

● s/HDFS/CephFS/

● It is easier to move computation than data

● RADOS “classes” allow computation to be moved to storage nodes

● New “methods” (beyond read/write) for your “objects”

● Simple sandbox, very extensible (see the sketch below)

● Still looking for killer use-case
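
As an illustration of pushing a method call down to the OSDs, here is a sketch that invokes Ceph's bundled 'hello' demonstration class through librados (assumes a python-rados recent enough to expose Ioctx.execute() and that the cls_hello class is available on the OSDs; pool and object names are placeholders):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')
    try:
        # Run the 'say_hello' method of the 'hello' object class on the OSD
        # holding 'greeting-obj'; the computation happens on the storage node.
        ret, out = ioctx.execute('greeting-obj', 'hello', 'say_hello', b'Ceph')
        print(out)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

A custom class registers its own methods the same way and can filter, transform, or index object data in place instead of shipping it to the client.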

Page 50

Archival storage

● Erasure coding makes storage cost compelling
● Especially when combined with ethernet drives

● Already do background scrubbing, repair

● Data integrity
● Yes: over the wire

● Yes: at rest (...for erasure coded content)

● Yes: background scrubbing

● Still need end-to-end verification

Page 51

The enterprise

How do we pay for all our toys?

● Support legacy and transitional interfaces
● iSCSI, NFS, pNFS, CIFS

● VMware, Hyper-V

● Identify the beachhead use-cases
● Only takes one use-case to get in the door

● Single platform – shared storage resource

● Bottom-up: earn respect of engineers and admins

● Top-down: strong brand and compelling product

Page 52

Why we can beat the old guard

● Strong foundational architecture

● It is hard to compete with free and open source software

● Unbeatable value proposition

● Ultimately a more efficient development model

● It is hard to manufacture community

● Native protocols, Linux kernel support
● Unencumbered by legacy protocols like NFS

● Move beyond traditional client/server model

● Ongoing paradigm shift
● Software-defined infrastructure, data center

Page 53

Thanks!