Ceph Software-Defined Storage
“The power to do more!”
Andrew Underwood – HPC Enterprise Technologist
ANZ Solutions Engineering Team
Melbourne, Australia – 2015 Open Storage Workshop
Agenda
- The Changing Storage Market
- Introduction to Ceph
- Ceph Architecture
- Key Benefits of Ceph
- Case Study – Some of Dell’s Australian Customers
- Next Steps to Building Your Scalable Storage Solution…
Storage Market Is Changing
• Storage needs are exploding
• Few highly scalable storage options are available
• Continued cost pressures and budget squeezes are limiting IT expenditure
• The cost of proprietary technologies is increasing with annual license fees
• Current storage technologies are limiting when development teams want to scale out their workloads
• New data sets are changing the way IT departments need to think about storage
The forces that drive Dell also drive our customers:
• HPC – Enabling exascale computing on massive data sets
• OpenStack – Helping enterprises build open, interoperable clouds
• Big Data – Turning customer data into value
Introduction to Ceph
File System
• POSIX
• Linux kernel client
• CIFS/NFS
• HDFS
• Distributed metadata
• Snapshots
• Dynamic rebalancing
Block Storage
• Thinly provisioned
• Snapshots
• Copy-on-write cloning
• In-memory caching
• Native Linux kernel support
• Support for KVM & Xen
• Up to 16 exabytes
Object Storage
• Multi-tenant
• User management
• OpenStack Keystone integration
• OpenStack Swift API
• Billing capable
• Disaster recovery
Introduction to Ceph (continued…)
CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management
RGW – A web services gateway for object storage, compatible with S3 and Swift
RBD – A reliable, fully-distributed block device with cloud platform integration
LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
[Diagram: Applications, Virtual Machines, and Clients sit on top of this stack.]
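To make the LIBRADOS layer concrete, here is a minimal Python sketch using the python-rados bindings; the pool name 'data' and the object name are illustrative assumptions, and it presumes a reachable cluster with a readable /etc/ceph/ceph.conf.

    import rados

    # connect to the cluster described by the local ceph.conf
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')  # I/O context on the 'data' pool (assumed)
        try:
            ioctx.write_full('hello-object', b'Hello, RADOS!')  # store an object
            print(ioctx.read('hello-object'))                   # read it back
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()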
Ceph Architecture 101
Object Storage Daemons
Ceph OSDs store data; they handle data replication, recovery, backfilling, and rebalancing, and provide monitoring information to the Ceph Monitors by checking other Ceph OSD Daemons for a heartbeat.
Monitoring Nodes
Ceph Monitors manage the health of the cluster and maintain the cluster map.
Metadata Server
The MDS stores metadata on behalf of CephFS – it is not a dedicated server and can be distributed.
Ceph Architecture 101
The Ceph Storage Cluster - RADOS
CephFS, Ceph Object Storage and Ceph Block Devices read data from and write data to the Ceph Storage Cluster.
Based on RADOS, the Ceph Storage Cluster consists of two types of daemons: Ceph OSD Daemons (OSDs), which store data as objects on storage nodes, and Ceph Monitors (MONs), which maintain a master copy of the cluster map.
A Ceph Storage Cluster presents a single logical object store to clients and is responsible for data migration, replication, failure detection, and failure recovery.
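From any admin node, the stock Ceph CLI shows this single logical cluster at a glance (a sketch; it assumes an admin keyring is present):

    ceph -s        # overall status: monitor quorum, OSD states, capacity in use
    ceph osd tree  # the hierarchy of hosts and OSDs known to the cluster map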
Ceph Architecture 101
Metadata Server
The MDS cluster is diskless: MDS daemons serve as an index into the OSD cluster, facilitating reads and writes. All metadata, like the data itself, is stored in the OSD cluster.
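For example, a client can mount CephFS with the native Linux kernel driver; in this sketch the monitor address, mount point, user name, and secret are all placeholders:

    sudo mkdir -p /mnt/cephfs
    sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secret=<key>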
Ceph Architecture 101
Monitoring Nodes
Ceph’s monitors maintain a master copy of the cluster map. Ceph daemons and clients contact a monitor periodically to ensure they have the most recent copy. The monitors seek agreement about the state of the cluster; to reach consensus, Ceph uses an odd number of monitors (3, 5, 7…) together with the Paxos algorithm.
Monitor nodes are critical, but commodity hardware is key to a lower TCO: high-reliability x86 rack-mount servers are sufficient.
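A minimal sketch of how such a monitor quorum might be declared in /etc/ceph/ceph.conf, assuming three hypothetical hosts mon1, mon2, and mon3:

    [global]
    fsid = <cluster-uuid>                    # placeholder for the cluster's UUID
    mon initial members = mon1, mon2, mon3   # odd number of monitors for quorum
    mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3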
Ceph Architecture 101
Ceph Object Storage Daemon
The Ceph Object Storage Daemon is the main component of the cluster: it serves up the storage objects on the physical disks. x86 servers such as the Dell R730XD, or an R630 connected to an MD1200 JBOD, can house the OSDs (physical disks for storage), with a balanced ratio of CPU, memory, and network bandwidth connecting them as a distributed cluster running the OSD daemons.
As a best practice, Dell recommends 1GHz of x86 CPU per OSD (i.e. 1GHz per HDD/SSD) and 2GB of RAM per OSD.
This means a Dell PowerEdge R730XD with 12 x 4TB HDDs (48TB) would require 24GB of RAM and 1 x 8-core E5-2630L v3 1.8GHz CPU (8 x 1.8GHz = 14.4GHz).
Object Storage Device (OSD) – not to be confused with a Ceph OSD, which is a daemon!
File system: btrfs, xfs, or ext4
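The rule of thumb above is easy to turn into a quick sanity check. A minimal Python sketch (the helper is illustrative, not a Dell sizing tool):

    def osd_host_sizing(num_osds, core_ghz, cores):
        """Return (CPU GHz needed, RAM GB needed, CPU GHz available)."""
        cpu_needed_ghz = num_osds * 1.0   # ~1GHz of x86 CPU per OSD
        ram_needed_gb = num_osds * 2      # ~2GB of RAM per OSD
        return cpu_needed_ghz, ram_needed_gb, cores * core_ghz

    # R730XD example from the slide: 12 OSDs, one 8-core E5-2630L v3 @ 1.8GHz
    print(osd_host_sizing(12, 1.8, 8))    # -> (12.0, 24, 14.4): the host has headroom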
Ceph Architecture 101
RADOS Cluster Map
[Diagram: each RADOS node runs one OSD daemon per disk, with a local file system (btrfs, xfs, or ext4) between daemon and disk; many such nodes form a RADOS cluster.]
Ceph Architecture 101
RADOS Cluster Map
6 x 4TB OSDs (OSD → FS → DISK) = 24TB per node
42 nodes x 24TB per node ≈ 1PB cluster
What about replication?
3 replicas per object across 1PB ≈ 344TB usable
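The same arithmetic as a minimal Python sketch, so the node and replica counts can be varied; note that the slide's 344TB figure implies a slightly larger node count than the 42 shown, and real clusters lose further capacity to journals and full-ratio headroom:

    def usable_tb(nodes, osds_per_node, tb_per_osd, replicas=3):
        # raw capacity divided by the replica count approximates usable capacity
        return nodes * osds_per_node * tb_per_osd / replicas

    print(usable_tb(nodes=42, osds_per_node=6, tb_per_osd=4))  # -> 336.0 TB usable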
Ceph Architecture 101
Replication – Read Path
[Diagram: the application on Compute Node 1 accesses RADOS through RBD; Storage Nodes 1-3 each run OSD daemons over a file system over disks.]
1. The Compute Node / Application initiates a read request and RADOS sends the request to the primary OSD.
2. The primary OSD reads the data from disk and completes the read request.
Ceph Architecture 101
Replication – Write Path
[Diagram: the same topology as the read path; the application on Compute Node 1 writes through RBD to RADOS across Storage Nodes 1-3.]
1. The Compute Node / Application writes data and RADOS sends the request to the primary OSD.
2. The primary OSD identifies the replica OSDs and sends them the data, then writes the data to its own disk.
3. Each replica OSD writes the data to disk and informs the primary OSD when complete.
4. The primary OSD informs the Compute Node / Application that the write is complete.
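In practice a client drives this write path through RBD. A minimal sketch with the stock rbd CLI (the image name is an assumption, and the default pool is used):

    rbd create myimage --size 10240   # a 10GiB thinly provisioned image
    sudo rbd map myimage              # expose it as a local block device, e.g. /dev/rbd0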
Key Benefits
Enterprise Ready
Open Source
Massively Scalable
Extensible
Low TCO
No Single Point of Failure
Self-Managing
Rapid Provisioning
• Enterprise Ready – Dell and our numerous software partners certify and support this solution
• Open Source – Development timeframes are accelerated and there is no proprietary lock-in
• Enables companies to scale out storage cost-effectively compared with other OpenStack and VM storage solutions
• Provisions storage resources efficiently and is easy to manage at scale
• Commodity storage servers reduce cost per gigabyte compared to proprietary arrays
• Provides a scalable and resilient storage solution on commodity hardware
• Intelligent software algorithms manage data placement and replication (see the sketch after this list)
• APIs for C, C++, Java, Python, Ruby, PHP – all talking directly to RADOS
• Unified storage platform (Object + Block + File)
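As a loose illustration of the placement bullet above: RADOS hashes an object's name into a placement group, and CRUSH then maps that group onto OSDs. A greatly simplified Python sketch follows; real Ceph uses rjenkins hashing and the CRUSH algorithm, not md5:

    import hashlib

    def object_to_pg(object_name, pg_num):
        # fold a hash of the object name into one of pg_num placement groups
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        return h % pg_num

    print(object_to_pg('hello-object', 128))  # a stable PG id for this object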
Case study
High-impact research lifts off into a new type of cloud
Read the full Press Release >
Institute builds research infrastructure for the future using community-driven, open-source components
Enables researchers to respond rapidly to new developments by providing instant access to scalable computing resources and applications
Researchers can share computational results easily with collaboration partners around the world
NCI builds country’s first HPC research cloud on OpenStack at Australian National University
The new science cloud will give Australian researchers, for the first time, on-demand access to high-performance compute and storage resources, with increased technological capabilities compared to commercial and academic cloud offerings, according to Dr. Joseph Antony, NCI Cloud Services Manager.
Next Steps to Building Your Scalable Storage Solution…
Today: Contact your Dell Solution Consultant to review your enterprise environment
Tomorrow: Set up a pilot Ceph deployment
The Future: Scale out, not up
Thank you
Let’s grab a coffee and chat more!
Andrew Underwood – HPC Enterprise Technologist
• Dell.com/OpenStack
• Dell.com/HPC