Using OpenStack Swift for extreme data durability
Florent Flament, Cloudwatt
Christian Schwede, eNovance
OpenStack Summit Paris, November 2014
Intro - Cloudwatt
● Florent Flament
● Dev & Fireman @ Cloudwatt
● Fixing & tuning of OpenStack (Cinder, Keystone, Nova, Swift)
● Email: [email protected]
● IRC: florentflament on #openstack-dev (Freenode)
● Twitter: @florentflament_
● Blogs: http://dev.cloudwatt.com / http://www.florentflament.com
Intro - eNovance
● Christian Schwede
● Developer @ eNovance / Red Hat
● Mostly working on Swift, testing, automation and developer tools
● Swift Core
● IRC: cschwede in #openstack-swift
● [email protected] / [email protected]
● Twitter: @cschwede_de
[Diagram: a basic Swift cluster - two proxy nodes connected over the network to a pool of storage disks]
[Diagram: the same cluster with the disks grouped into three zones (Zone 0, Zone 1, Zone 2) behind the proxy nodes]
[Diagram: a two-region cluster - Region 0 (⅔ of the data) containing Zone 0 and Zone 1, and Region 1 (⅓ of the data) containing Zone 2 - behind the proxy nodes]
Ring: the Map of data
● One Ring file per type of data. Ring files map each copy of an object to a physical device through partitions.
● An object's partition number is computed from the hash of the object's name (see the sketch below).
● A Ring file contains: a (replica, partition) to device ID table, a devices table, and a number of hash bits (the partition power).
● Visualize a Ring: https://github.com/victorlin/swiftsense
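A minimal sketch of the name-to-partition mapping, in Python. Real Swift also mixes a per-cluster hash path prefix/suffix into the hashed path; that detail is omitted here, and the account/container/object names are made-up examples.

```python
# Simplified sketch: derive an object's partition number from the MD5 hash of
# its path. (Swift additionally hashes a per-cluster prefix/suffix - omitted.)
from hashlib import md5
import struct

def get_partition(account, container, obj, part_power):
    path = '/%s/%s/%s' % (account, container, obj)
    digest = md5(path.encode('utf-8')).digest()
    # Keep the top `part_power` bits of the first 4 bytes of the digest.
    return struct.unpack_from('>I', digest)[0] >> (32 - part_power)

# With a partition power of 3 there are 2**3 = 8 possible partitions (0-7).
print(get_partition('AUTH_demo', 'photos', 'cat.jpg', part_power=3))
```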
Concrete example of Ring

Replica & partition to device ID table:
                Partition number
                0  1  2  3  4  5  6  7
  Replica 0     0  1  2  3  0  1  2  3
  Replica 1     1  2  3  0  1  2  3  0
  Replica 2     2  3  0  1  2  3  0  1

Devices table:
  ID  Host          Port  Device
  0   192.168.0.10  6000  sdb1
  1   192.168.0.10  6000  sdc1
  2   192.168.0.11  6000  sdb1
  3   192.168.0.11  6000  sdc1

Bit count (partition power) = 3 → 2³ = 8 partitions
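To make the example concrete, here is a toy lookup against the tables above. The real lookup is done by Swift's Ring class (swift.common.ring); this sketch simply indexes the flattened tables directly.

```python
# Toy ring lookup using the example tables above.
replica2part2dev = [
    [0, 1, 2, 3, 0, 1, 2, 3],  # replica 0
    [1, 2, 3, 0, 1, 2, 3, 0],  # replica 1
    [2, 3, 0, 1, 2, 3, 0, 1],  # replica 2
]
devices = {
    0: ('192.168.0.10', 6000, 'sdb1'),
    1: ('192.168.0.10', 6000, 'sdc1'),
    2: ('192.168.0.11', 6000, 'sdb1'),
    3: ('192.168.0.11', 6000, 'sdc1'),
}

def get_nodes(partition):
    """Return the devices holding all replicas of the given partition."""
    return [devices[row[partition]] for row in replica2part2dev]

# Partition 5 is stored on device IDs 1, 2 and 3:
print(get_nodes(5))
```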
Storage policies
● Included in the Juno release (Swift ≥ 2.0.0)
● Applied on a per-container basis (example below)
● Flexibility to use multiple rings, for example:
○ Basic: 2 replicas on spinning disks, single datacenter
○ Strong: 3 replicas in three different datacenters around the globe
○ Fast: 3 replicas on SSDs and much more powerful proxies
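As an illustration, a client picks a policy at container creation time via the X-Storage-Policy header. The endpoint URL, token and policy name "fast" below are placeholders; the policy must exist in the operator's /etc/swift/swift.conf.

```python
# Sketch: create a container using a specific storage policy.
# storage_url and token are placeholders normally obtained from the auth service.
import requests

storage_url = 'https://swift.example.com/v1/AUTH_demo'   # placeholder
token = 'AUTH_tk...'                                      # placeholder

resp = requests.put(
    storage_url + '/ssd-backed-container',
    headers={
        'X-Auth-Token': token,
        'X-Storage-Policy': 'fast',   # policy chosen at container creation
    },
)
print(resp.status_code)  # 201 Created on success
```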
Object durability
● Disk failures: p_d ≈ 2-5% per year
● Unrecoverable bit read errors: p_b = 10⁻¹⁵ · 8 · object size (in bytes)
[Diagram: failure/replication state machine - 3 replicas → 2 replicas → 1 replica → data loss on successive failures, with replication restoring lost copies]
● Durability in the range of 10 to 11 nines with 3 replicas (99.99999999%); a toy model follows below
● http://enovance.github.io/swift-durability-calculator/
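For intuition only, a deliberately simplified model of replica loss. The linked calculator is far more detailed (it accounts for the number of disks, servers and partitions); the failure rate and recovery window below are assumptions.

```python
# Toy durability estimate: an object is lost only if all three disks holding
# its replicas fail within the same recovery window, before replication
# restores the missing copies. Both input values are assumptions.
import math

annual_disk_failure = 0.04      # assumed 4% annualized disk failure rate
recovery_window_hours = 24      # assumed time to re-replicate a failed disk

p_fail_in_window = annual_disk_failure * recovery_window_hours / (365 * 24)
p_object_loss = p_fail_in_window ** 3          # all 3 replicas lost in one window

windows_per_year = 365 * 24 / recovery_window_hours
p_loss_per_year = 1 - (1 - p_object_loss) ** windows_per_year

nines = -math.log10(p_loss_per_year)
print('yearly loss probability: %.2e (~%.1f nines)' % (p_loss_per_year, nines))
```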
Object availability & durability
[Diagram: the two-region layout again - Region 0 (⅔ of the data) with Zone 0 and Zone 1, Region 1 (⅓ of the data) with Zone 2 - illustrating availability and durability across regions]
Maintainability by Simplicity
● Standalone `swift-ring-builder` tool to manipulate the Ring (workflow sketched below)
○ Uses `builder` files to keep architectural information about the cluster
○ Smartly assigns partitions to devices
○ Generates Ring files that can easily be checked
● Processes on Swift nodes focus on ensuring that files are stored uncorrupted at the appropriate location
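A typical builder workflow looks like the following. The commands are the standard `swift-ring-builder` CLI, wrapped in Python here purely for illustration; device strings, weights and the partition power are example values.

```python
# Sketch of the usual `swift-ring-builder` workflow (operators normally run
# these commands directly in a shell).
import subprocess

def run(cmd):
    print('$', cmd)
    subprocess.check_call(cmd, shell=True)

# Create a builder file: partition power 10, 3 replicas, 1 hour min move interval.
run('swift-ring-builder object.builder create 10 3 1')

# Add two devices (region 1, zone 1) with a weight of 100 each.
run('swift-ring-builder object.builder add r1z1-192.168.0.10:6000/sdb1 100')
run('swift-ring-builder object.builder add r1z1-192.168.0.10:6000/sdc1 100')

# Assign partitions to devices and write out the ring file used by all nodes.
run('swift-ring-builder object.builder rebalance')
```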
Splitting a running Swift Cluster
● Ensuring no data is lost
○ Move only 1 replica at a time
○ Small steps to limit the impact
○ Check for data corruption
○ Check data location
○ Rollback in case of failure
● Limiting the impact on performance
○ Availability of cluster resources
○ Load incurred by cluster being split
○ Small steps to limit the impact
○ Control nodes accessed by users
Natively available in Swift
Small steps
New in Swift 2.2!
Example of process (sketched below):
1. Add devices to the new region with a very low weight
2. Increase the devices' weights to store 5% of the data in the new region
3. Progressively increase the amount of data in the new region in steps of 5%
More details: http://www.florentflament.com/blog/splitting-swift-cluster.html
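A sketch of such a stepwise migration, again driving the standard `swift-ring-builder` CLI from Python for illustration. The device strings, region number, step size and the "wait for replication" placeholder are assumptions, not a prescribed procedure.

```python
# Stepwise migration sketch: add new-region devices with a tiny weight, then
# raise their weights ~5% at a time, rebalancing between steps.
import subprocess

NEW_REGION_DEVICES = [
    'r2z1-192.168.1.10:6000/sdb1',
    'r2z1-192.168.1.11:6000/sdb1',
]
FINAL_WEIGHT = 100   # target weight per device in the new region
STEPS = 20           # 20 steps of ~5% each

def run(cmd):
    print('$', cmd)
    subprocess.check_call(cmd, shell=True)

# 1. Add the new devices with a very low initial weight.
for dev in NEW_REGION_DEVICES:
    run('swift-ring-builder object.builder add %s 1' % dev)
run('swift-ring-builder object.builder rebalance')

# 2./3. Progressively raise the weights towards the final value.
for step in range(1, STEPS + 1):
    weight = FINAL_WEIGHT * step / STEPS
    for dev in NEW_REGION_DEVICES:
        run('swift-ring-builder object.builder set_weight %s %s' % (dev, weight))
    run('swift-ring-builder object.builder rebalance')
    # ...push the new ring to all nodes and wait for replication to settle...
```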
Adding a new region
Add a new region smoothly by limiting the amount of data moved
● really possible since Swift 2.2
● Final weight in the new region should be at least ⅓ of the total cluster weight
Erasure coding
● Coming real soon now
● Instead of N copies of each object:
○ apply EC to the object and split it into multiple fragments, for example 14
○ store them on different disks/nodes
○ objects can be rebuilt from any 10 fragments
■ Tolerates the loss of 4 fragments
● higher durability
■ Only ~40% overhead compared to 200% (quick check below)
● much cheaper
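The overhead figures follow directly from the fragment counts; a quick check of the arithmetic:

```python
# With a 10+4 erasure coding scheme (14 fragments, any 10 enough to rebuild),
# the on-disk footprint is 14/10 of the object size; with 3 full replicas it is 3x.
def storage_overhead(stored_bytes, object_bytes):
    """Extra bytes stored beyond the object itself, as a fraction."""
    return stored_bytes / object_bytes - 1

object_size = 1.0
ec_stored = object_size * 14 / 10    # 10 data + 4 parity fragments
replica_stored = object_size * 3     # 3 full copies

print('EC 10+4 overhead:   %.0f%%' % (100 * storage_overhead(ec_stored, object_size)))       # 40%
print('3-replica overhead: %.0f%%' % (100 * storage_overhead(replica_stored, object_size)))  # 200%
```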
Durability calculation
● More detailed calculation
○ Number of disks, servers, partitions
● Add erasure coding
● Include in Swift documentation?
● Community effort
○ Discussion started last Swift hackathon
■ NTT, Swiftstack, IBM, Seagate, Red Hat / eNovance
○ Ad-Hoc session on Thursday/Friday - join us!
Summary
● High availability, even if large parts of the cluster are not accessible
● Automatic failure correction ensures high durability and, depending on your cluster configuration, exceeds known industry standards
● Swift 2.2 (Juno release)
○ Even smoother and more predictable cluster upgrades
○ Storage Policies allow fine-grained data placement control
● Erasure Coding increases durability even more while lowering costs