Spectrum Scale 4.1 System Administration
Spectrum Scale Elastic Storage Server
Spectrum Scale Native RAID (GNR) Hints & Tips
Apr 15, 2017
© Copyright IBM Corporation 2015
Unit objectives
After completing this unit, you should be able to:
• Describe all of the Elastic Storage Server options
• Explain their value to client business
• Understand Spectrum Scale Native RAID
• Speak to its value and limitations
• Describe the components of GNR and where it is supported
• Describe declustered RAID
• Apply key hints, tips, and best practices.
Introducing the Elastic Storage Server
• The IBM® Elastic Storage Server is a high-performance, GPFS™ network storage disk solution.
• The IBM Elastic Storage Server features multiple hardware platforms and architectures that create an enterprise-level solution consisting of the following main components:
1. Platform and storage management console: IBM Power® System S812L (8247-21L)
2. Two basic storage models: GS (small form factor) and GL (large form factor); each model has its own architectural and management requirements
3. Network switches:
   – IBM RackSwitch™ G7028 (7120-24L)
   – IBM RackSwitch G8052 (7120-48E)
   – IBM RackSwitch G8264 (7120-64C)
4. IBM 7042-CR8 rack-mounted Hardware Management Console (HMC)
5. IBM 7014 Rack Model T42 (enterprise rack)
Introducing the Elastic Storage Server
GS models (based on the IBM Power System S822L (8247-22L) with IBM 5887 EXP24S SFF Gen2-bay drawers):
1. IBM 5146 Model GS1 IBM Elastic Storage Server
2. IBM 5146 Model GS2 IBM Elastic Storage Server
3. IBM 5146 Model GS4 IBM Elastic Storage Server
4. IBM 5146 Model GS6 IBM Elastic Storage Server
GL models (based on the IBM Power System S822L (8247-22L) with IBM System Storage DCS3700 Expansion Unit 1818-80E):
5. IBM 5146 Model GL2 IBM Elastic Storage Server
6. IBM 5146 Model GL4 IBM Elastic Storage Server
7. IBM 5146 Model GL6 IBM Elastic Storage Server
Elastic Storage Server (what it is and what it isn't)
• Elastic Storage Server building blocks provide:
– Simplified bundles of hardware that are optimized for field use
– Either performance-optimized or capacity-optimized configurations
– Support for only two enclosure types:
   • EXP24S (2U, 24 x 2.5" SSD or SAS drives)
   • DCS3700 Expansion (1818-80E) (4U, 60 x 2.5"/3.5" NL-SAS drives)
– GNR RAID management only
– A finite set of supported drive types on the GS and GL models
– A pair of I/O servers with each building block
– The first building block requires an HMC and an EMS (management node)
– CLI and GUI support on each unit for solution management
• Each storage unit has 2 x SSDs for internal GNR use (not for client access)
• It is not a SONAS replacement, and it is not an all-inclusive appliance
Elastic Storage Server GS Models
Elastic Storage Server GL Models
A closer look at the GL6 components
• P822L GPFS Storage Servers: Power8, RH Linux, GPFS 4.1 + GNR RAID manager, 20 cores, 128 GB memory
• Fat networking
• SAS-connected storage: DCS3700 expansion trays, 60 drives (4U) each (1818-80E)
• IBM 7042-CR8 rack-mounted Hardware Management Console (HMC)
• IBM 7014 Rack Model T42 (enterprise rack)
• P821L EMS/xCAT server (Power8, RH Linux) and IBM HMC 7042-CR8 management console
• Derated (unofficial) figures: 1.4 PB raw, 1 PB usable, 16 MB block size, 13.6 GB/s sequential read, 13.4 GB/s sequential write, 30K x 8 KB read IOPS, 6K x 8 KB write IOPS
Sample Configurations & Reference Architecture
Installation of Elastic Storage Server (high level)
1. Confirm the private IP range for the HMC DHCP server.
2. Confirm the private service network with (6) IPs and the private xCAT management network with (6) IPs; separate the networks via switches or VLANs.
3. Confirm public network connections for the HMC and EMS; (2) IPs are needed.
4. Confirm host-to-IP mappings for the following (the ESS defaults can be used):
   – HMC
   – EMS
   – I/O server 1, I/O server 2, I/O server 3, I/O server 4
   – 10GigE/40GigE hostname-to-IP mappings
5. Set up domain names for the xCAT private network.
6. Set up domain names for the high-speed interconnect.
7. Set up partition and partition profile names.
8. Confirm server names.
9. Confirm that the 10GigE/40GigE/IB switches are in place and cabled.
10. Determine whether bonding is being used.
11. Set up the public network, in place and cabled to the xCAT EMS and HMC (at minimum).
12. Confirm that all building-block components are in the frame (4 I/O servers, EMS, HMC, HMC console, switches).
13. Set up / confirm dual-feed power to the frame components.
14. Set up the HMC console and/or terminal.
15. Prepare the Red Hat 7 ISO or DVD for installation.
16. The client should register the RH license for all ESS servers.
17. Define how many file systems, the block sizes, whether metadata is split, and replication (or take the defaults).
18. Confirm that all disks are in place (will be checked with scripts).
19. Confirm that all cabling is in place (will be double-checked by scripts).
20. Confirm WiFi access in the lab to set up a Sametime meeting room (for IBMer work).
21. Confirm that the client intends to use Standard Spectrum Scale for this ESS install.
Then follow the 76-page install guide.
A look at the Building Block Networking
The end cluster result is the sum of the parts
What is GNR, and how do I communicate the value?
• Spectrum Scale Native RAID is a software implementation of storage RAID technologies within Spectrum Scale.
• It requires special licensing.
• It is only approved for pre-certified architectures (such as GSS, Elastic Storage Server, and DDN GRIDScaler).
• Using conventional dual-ported disks in a JBOD configuration, Spectrum Scale Native RAID implements sophisticated data placement and error-correction algorithms to deliver high levels of storage reliability, availability, and performance.
• Standard Spectrum Scale file systems are created from the NSDs defined through Spectrum Scale Native RAID.
• No hardware-based controller.
Petascale argument for stronger RAID codes
• Disk rebuilding is a fact of life at the petascale level
– With 100,000 disks and an MTBF_disk of 600 Khrs, a rebuild is triggered about four times a day
– A 24-hour rebuild implies four concurrent, continuous rebuilds at all times.
• Traditional, 1-fault-tolerant RAID-5 is a non-starter
– A disk hard read error rate of 1-in-10^15 bits implies data loss every ~26th rebuild
   • 10^15 / (8 disks-per-RAID-group x 600-GB disks x 8 bits/byte) ≈ 26
– Or a data loss event every 26/4 = 6.5 days.
• 2-fault-tolerant declustered RAID (8+2P) may not be sufficient
– MTTDL ~ 7 years (simulated, MTTF_disk = 600 Khrs, Weibull, 100-PB usable).
• 3-fault-tolerant declustered RAID (8+3P) is 400,000x better
– MTTDL ~ 3x10^6 years (simulated, MTTF_disk = 600 Khrs, Weibull, 100-PB usable)
– Guards against unexpected correlated failures.
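The failure-rate and read-error arithmetic above can be sanity-checked in a few lines of Python. All figures are the slide's stated assumptions (fleet size, MTBF, error rate, disk capacity), not measurements:

```python
def rebuilds_per_day(num_disks, mtbf_hours):
    # Expected disk failures (rebuild triggers) per day across the fleet.
    return num_disks * 24 / mtbf_hours

def rebuilds_until_read_error(error_rate_bits, disks_per_group, disk_bytes):
    # Rebuilds expected before a hard read error corrupts a rebuild:
    # each rebuild reads the 8 surviving group disks end to end.
    bits_read_per_rebuild = disks_per_group * disk_bytes * 8
    return error_rate_bits / bits_read_per_rebuild

daily = rebuilds_per_day(100_000, 600_000)              # 4.0 rebuilds/day
until_loss = rebuilds_until_read_error(1e15, 8, 600e9)  # ~26 rebuilds
print(daily, until_loss, until_loss / daily)            # ~6.5 days to a loss event
```

This reproduces the slide's "four rebuilds a day", "~26th rebuild", and "6.5 days" figures.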
Features
• Auto rebalancing
• Only a 2% rebuild performance hit
• Reed-Solomon erasure code, "8 data + 3 parity"
• ~10^5-year MTTDL for a 100-PB file system
• End-to-end, disk-to-Spectrum-Scale-client data checksums
• No hardware storage controller: software RAID on the I/O servers
– SAS-attached JBOD
– Special JBOD storage drawer for very dense drive packing
– Solid-state drives (SSDs) for metadata storage
[Diagram: NSD servers with vdisks and SAS-attached JBODs on a local area network (LAN)]
Works within the Spectrum Scale (GPFS) Network Shared Disk (NSD) stack

[Diagram: two NSD stacks side by side. Traditional: a compute node (client application → GPFS NSD client, user space) connects over the LAN to an I/O node (GPFS NSD server in user space → GPFS kernel I/O layer → OS device driver → HBA device driver, kernel space), which drives a hardware disk array controller and its disks; control traffic travels via RPC, data via RDMA. GNR-based: the hardware disk array controller is removed and a GPFS software controller is added — the I/O node stack gains a GPFS vdisk layer (PERSEUS) between the GPFS kernel I/O layer and the OS device driver, with SAS-attached JBOD disks.]
RAID algorithm
• Two types of RAID:
– 3- or 4-way replication
– 8 + 2 or 8 + 3 parity
• 2-fault and 3-fault tolerant codes ("RAID-D2", "RAID-D3")
– 2-fault-tolerant codes: 3-way replication (1+2) and 8+2p Reed-Solomon
– 3-fault-tolerant codes: 4-way replication (1+3) and 8+3p Reed-Solomon
• Replication: 1 strip (GPFS block) plus 2 or 3 replicated strips
• Reed-Solomon: 8 strips (GPFS block) plus 2 or 3 redundancy strips
Declustered RAID
• Data, parity, and spare strips are uniformly and independently distributed across the disk array.
• Supports an arbitrary number of disks per array
– Not restricted to an integral number of RAID track widths.
[Diagram: conventional vs. declustered strip layout]
Lower disk rebuild overhead
• Improved file system performance during rebuild
– The throughput of all operational disks is used for rebuilding after a disk failure, reducing the load on clients.
– Why: since Spectrum Scale stripes data across all storage controllers, without declustering performance would be gated by the slowest rebuilding controller.
• In large systems, some array is likely always rebuilding
– 25,000 disks x 24 hours / (600,000-hour disk MTBF) = 1 rebuild/day
• Or, in a smaller storage array with out-of-spec failure rates
– 1,500 disks x 2% per month failure rate x 1/30 month = 1 rebuild/day
• With declustered GNR RAID
– Non-critical rebuild overhead typically remains < 3%.
– If risk increases with multiple failures, rebuild priority increases to reduce the exposure time.
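The two "one rebuild per day" scenarios above check out numerically; the figures are the slide's illustrative assumptions:

```python
def rebuilds_per_day_mtbf(num_disks, mtbf_hours):
    # Large system: expected failures per day from the per-disk MTBF.
    return num_disks * 24 / mtbf_hours

def rebuilds_per_day_monthly_rate(num_disks, monthly_failure_rate):
    # Smaller array with out-of-spec failure rates, per 30-day month.
    return num_disks * monthly_failure_rate / 30

print(rebuilds_per_day_mtbf(25_000, 600_000))        # 1.0
print(rebuilds_per_day_monthly_rate(1_500, 0.02))    # 1.0
```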
Declustered RAID example
[Diagram: traditional vs. GNR declustered layout on 7 disks. Traditional: 3 one-fault-tolerant groups of 2 disks each (7 tracks per group, 2 strips per track) plus a spare disk — 21 virtual tracks (42 strips). Declustered: the same 42 strips plus 7 spare strips (49 strips total) spread uniformly across all 7 disks.]
Declustered RAID rebuild
[Diagram: rebuild of a failed disk, traditional vs. declustered, with per-disk rd-wr time shown.]
• Traditional: RebuildTime = (7 rd + 7 wr) stripTimes / 2 disks = 7 stripTimes
• Declustered: RebuildTime = (6 rd + 6 wr) stripTimes / 6 disks = 2 stripTimes
• RebuildSpeedup = 7 / 2 = 3.5
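On one reading of this example (the traditional rebuild is bottlenecked on the two disks doing the 7 reads and 7 writes, while the 6 surviving declustered disks share the work), the 3.5x speedup falls out directly:

```python
def rebuild_strip_times(reads, writes, disks):
    # Total strip I/Os divided by the number of disks working in parallel.
    return (reads + writes) / disks

traditional = rebuild_strip_times(7, 7, 2)  # read 7 strips from the mirror,
                                            # write 7 to the spare: 7 stripTimes
declustered = rebuild_strip_times(6, 6, 6)  # 6 survivors each do ~1 rd + 1 wr: 2 stripTimes
print(traditional, declustered, traditional / declustered)  # 7.0 2.0 3.5
```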
High reliability
• Mean time to data loss with 50,000 disks:
– 3-fault tolerance (8+3P)
   • MTTDL: 200 million years
   • Annual failure rate (47-disk array): 4 x 10^-12
– 2-fault tolerance (8+2P)
   • MTTDL: 200 years
   • Annual failure rate (47-disk array): 5 x 10^-6
– 1-fault tolerance
   • MTTDL: 1 week (due to latent sector errors)
   • 10^15 bits / (8 disks x 600-GB disks x 8 bits/byte) = 26 rebuilds, at 4 rebuilds/day ≈ 1 week
Simulation assumptions: disk capacity = 600 GB, MTTF = 600 Khrs, hard error rate = 1-in-10^15 bits, 47-HDD declustered arrays, uncorrelated failures
Deferred disk maintenance
• With GNR, when disks fail and are restored before another failure, multiple disks can sequentially fail without data loss.
– For example, RAID-D3 with 2 disks' worth of spare space can handle up to 5 sequential disk failures.
• With RAID-D3, disk maintenance can be deferred with a policy that replaces disks after the second disk failure, meaning fewer maintenance calls with combined disk replacements.
– A maintenance interval of a month or longer is possible.
– No more evening panic calls for immediate maintenance on common FRU replacements.
• This reduces the probability of improper maintenance and/or unintended side effects.
Data integrity manager
• Highest priority: restore redundancy after disk failure(s)
– Rebuild data stripes in order of 3, 2, and 1 erasures
– Fraction of stripes affected when 3 disks have failed (assuming 8+3p, 47 disks):
   • 23% of stripes have 1 erasure (= 11/47)
   • 5% of stripes have 2 erasures (= 11/47 x 10/46)
   • 1% of stripes have 3 erasures (= 11/47 x 10/46 x 9/45)
• Medium priority: rebalance spare space after disk install
– Restores uniform declustering of data, parity, and spare strips.
• Low priority: scrub and repair media faults
– Verifies checksum/consistency of data and parity/mirror.
End-to-end checksum
• True end-to-end checksum from the disk surface to the client's Spectrum Scale interface
– Repairs soft/latent read errors
– Repairs lost/missing writes.
• Checksums are maintained on disk and in memory and are transmitted to/from the client.
• The checksum is stored in a 64-byte trailer of each 32-KiB buffer
– 8-byte checksum and 56 bytes of ID and version info
– A sequence number is used to detect lost/missing writes.
[Diagram: 8 data strips + 3 parity strips; 32-KiB buffer with a 64-B trailer; ¼- to 2-KiB terminus.]
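As a rough illustration of the trailer layout described above, the sketch below packs a 64-byte trailer as an 8-byte checksum plus 56 bytes of ID/version data. Only those two sizes come from the slide; the field split (48-byte ID, 8-byte sequence number) and all names are hypothetical:

```python
import struct

BUFFER_SIZE = 32 * 1024   # 32-KiB data buffer
TRAILER_SIZE = 64         # 8-byte checksum + 56 bytes of ID/version info

# Hypothetical packing of the 56-byte ID/version area: a 48-byte ID
# field followed by an 8-byte sequence number (detects lost writes).
TRAILER_FMT = "<Q48sQ"
assert struct.calcsize(TRAILER_FMT) == TRAILER_SIZE

def make_trailer(checksum, vdisk_id, seq):
    # Build the 64-byte trailer appended to each 32-KiB buffer.
    return struct.pack(TRAILER_FMT, checksum, vdisk_id.ljust(48, b"\0"), seq)

t = make_trailer(0xDEADBEEF, b"vdisk-example", 42)
print(len(t))  # 64
```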
IO Node Failover
Minimal configuration of two Spectrum Scale Native RAID servers and one storage JBOD. Spectrum Scale Native RAID server 1 is the primary controller for the first recovery group and backup for the second recovery group. Spectrum Scale Native RAID server 2 is the primary controller for the second recovery group and backup for the first recovery group. As shown, when server 1 fails, control of the first recovery group is taken over by its backup server 2. During the failure of server 1, the load on backup server 2 increases by 100% from one to two recovery groups.
Comprehensive disk and path diagnostics
• The asynchronous "disk hospital" design allows for careful problem determination of disk faults
– While a disk is in the disk hospital, reads are reconstructed from parity.
– For writes, strips are marked stale and repaired later when the disk leaves the hospital.
– I/Os are resumed in under 10 seconds.
• Thorough fault determination
– Power-cycling drives to reset them
– Neighbor checking
– Support for multi-disk carriers.
• Disk enclosure management
– Uses the SES interface for lights, latch locks, disk power, and so on.
• Manages topology and hardware configuration.
Disk hospital operations
• Before taking severe action against a disk, GNR checks neighboring disks to decide whether some systemic problem may be behind the failure.
• Tests paths using SCSI Test Unit Ready commands.
• Power-cycles disks to try to clear certain errors.
• Reads or writes the sectors where an I/O error occurred in order to test for media errors.
• Works with higher levels to rewrite bad sectors.
• Polls disabled paths.
Analysis with predictive actions to support best-practice healing (almost like a real hospital)
Storage component hierarchy (GNR + JBOD)
• A recovery group can have:
– A maximum of 512 disks
– Up to 16 declustered arrays
– At least 1 SSD log vdisk
– A maximum of 64 vdisks
• A declustered array:
– Can contain up to 128 pdisks
– Has a minimum size of 4 disks
– Must have one large array (>= 11 disks)
– Needs 1 or more pdisks' worth of spare space
• Vdisks
– Vdisks are volumes that become NSDs under Spectrum Scale control.
– Block sizes: 1 MiB, 2 MiB, 4 MiB, 8 MiB, and 16 MiB
[Diagram: pdisks grouped into "left" and "right" recovery groups, each containing declustered arrays (DA); the vdisks defined in the declustered arrays become NSDs.]
GNR commands: pdisks
• mmaddpdisk
– Adds a pdisk to a Spectrum Scale Native RAID recovery group.
• mmdelpdisk
– Deletes Spectrum Scale Native RAID pdisks.
• mmlspdisk
– Lists information for one or more Spectrum Scale Native RAID pdisks.
• mmchcarrier
– Allows Spectrum Scale Native RAID physical disks (pdisks) to be physically removed and replaced.
GNR commands: recovery groups
• mmlsrecoverygroup
– Lists information about Spectrum Scale Native RAID recovery groups.
• mmlsrecoverygroupevents
– Displays the Spectrum Scale Native RAID recovery group event log.
• mmchrecoverygroup
– Changes Spectrum Scale Native RAID recovery group and declustered array attributes.
• mmcrrecoverygroup
– Creates a Spectrum Scale Native RAID recovery group and its component declustered arrays and pdisks, and specifies the servers.
• mmdelrecoverygroup
– Deletes a Spectrum Scale Native RAID recovery group.
GNR commands: vdisks
• mmdelvdisk
– Deletes vdisks from a declustered array in a Spectrum Scale Native RAID recovery group.
• mmlsvdisk
– Lists information for one or more Spectrum Scale Native RAID vdisks.
• mmcrvdisk
– Creates a vdisk within a declustered array of a Spectrum Scale Native RAID recovery group.
Hints and tips
With Elastic Storage Server, the client must become a competent administrator of several technologies: IBM Power8, AIX, Red Hat Enterprise Linux 7, xCAT, Spectrum Scale 4.1, and Spectrum Scale Native RAID.
* You should always suggest adding services for knowledge transfer and ensure that your clients have links and document references to the support information required to effectively manage their Spectrum Scale or Elastic Storage Server systems.
With Elastic Storage Server and GNR, you probably don't want any 256K file systems, as GNR only supports data block sizes down to 512K. That means a non-vdisk file system using a 256K block size can never have a pool of vdisk-based storage.
Clients see better large-file sequential performance as they increase the file system block size, as expected. As they grow, they can update maxblocksize on all client clusters and test all the way up to 16M to find the best fit for their workloads. However, with a large proportion of small files, they will want to keep the block size low to prevent subblock waste, since the minimum capacity a file's data consumes is 1/32 of the file system block size. So a 5K file takes up 32K in a file system with a 1MB block size.
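The subblock-waste rule above (minimum allocation = block size / 32, rounded up to whole subblocks) can be sketched as:

```python
import math

def allocated_bytes(file_size, block_size, subblocks_per_block=32):
    # A file's data occupies whole subblocks; a subblock is 1/32 of the
    # file system block size, so that is the minimum on-disk footprint.
    subblock = block_size // subblocks_per_block
    return max(1, math.ceil(file_size / subblock)) * subblock

KiB, MiB = 1024, 1024 * 1024
print(allocated_bytes(5 * KiB, 1 * MiB) // KiB)    # 32 -> a 5K file occupies 32K
print(allocated_bytes(5 * KiB, 256 * KiB) // KiB)  # 8  -> only 8K at a 256K block size
```

This makes the trade-off concrete: the same 5K file wastes 27K at a 1MB block size but only 3K at 256K.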
Hints and tips
With Elastic Storage Server, make sure that power is redundantly connected so that power issues do not surprise your clients well into production.
Keep it simple (left to right), fully redundant.
Review
• Elastic Storage Server is specifically designed to provide simplified, optimized, and scalable building blocks for Spectrum Scale file system deployments, and to allow for the integration of GNR.
• Elastic Storage Server has 7 models: 4 GS models (small form factor for SSD & SAS drives) and 3 GL models (large form factor for NL-SAS drives).
• Elastic Storage Server ships with 1 week of Lab Services for installation, and installation is generally complicated enough to require that week of services. However, it is good to pencil in additional Lab Services for knowledge transfer for clients with a first-time install.
• Spectrum Scale Native RAID (GNR) removes the need for a RAID controller and optimizes RAID management for Spectrum Scale file system performance and reliability.
• Declustered RAID and Reed-Solomon algorithms allow non-critical rebuild overhead to typically remain below a 3% performance impact.
• A well-laid plan is cognizant of sizing the technology to the workloads and avoids too many baked-in assumptions.
Questions
Any questions on ESS, GNR, or these hints and tips?