Spectrum Scale 4.1 System Administration
Spectrum Scale Elastic Storage Server
Spectrum Scale Native RAID (GNR) Hints & Tips
Apr 15, 2017
© Copyright IBM Corporation 2015
Unit objectives
After completing this unit, you should be able to:
• Describe all of the Elastic Storage Server options
• Explain their value to client business
• Understand Spectrum Scale Native RAID
• Speak to its value and limitations
• Describe the components of GNR and where it is supported
• Describe declustered RAID
• Apply key hints, tips, and best practices.
Introducing the Elastic Storage Server
• The IBM® Elastic Storage Server is a high-performance, GPFS™ network storage disk solution.
• The IBM Elastic Storage Server features multiple hardware platforms and architectures that create an enterprise-level solution consisting of the following main components:
1. Platform and storage management console: IBM Power® System S812L (8247-21L)
2. Two basic storage models: GS (small form factor) and GL (large form factor); each model has its own architectural and management requirements
3. Network switches:
   – IBM RackSwitch™ G7028 (7120-24L)
   – IBM RackSwitch G8052 (7120-48E)
   – IBM RackSwitch G8264 (7120-64C)
4. IBM 7042-CR8 rack-mounted Hardware Management Console (HMC)
5. IBM 7014 Rack Model T42 (enterprise rack)
Introducing the Elastic Storage Server
GS models (based on the IBM Power System S822L (8247-22L) with IBM 5887 EXP24S SFF Gen2-bay drawers):
1. IBM 5146 Model GS1 IBM Elastic Storage Server
2. IBM 5146 Model GS2 IBM Elastic Storage Server
3. IBM 5146 Model GS4 IBM Elastic Storage Server
4. IBM 5146 Model GS6 IBM Elastic Storage Server
GL models (based on the IBM Power System S822L (8247-22L) with IBM System Storage DCS3700 Expansion Unit 1818-80E):
5. IBM 5146 Model GL2 IBM Elastic Storage Server
6. IBM 5146 Model GL4 IBM Elastic Storage Server
7. IBM 5146 Model GL6 IBM Elastic Storage Server
Elastic Storage Server (what it is and what it isn't)
• Elastic Storage Server building blocks provide:
– Simplified bundles of hardware that are optimized for field use
– Either performance-optimized or capacity-optimized configurations
– Support for only two enclosure types:
   • EXP24S (2U, 24 x 2.5" SSD or SAS drives)
   • DCS3700 Expansion (1818-80E) (4U, 60 x 2.5"/3.5" NL-SAS drives)
– GNR RAID management only
– A finite set of supported drive types on the GS and GL models
– A pair of I/O servers with each building block
– The first building block requires an HMC and an EMS (management node)
– CLI and GUI support on each unit for solution management
• Each storage unit has 2 x SSDs for internal GNR use (not for client access)
• It is not a SONAS replacement, and it is not an all-inclusive appliance
Elastic Storage Server GS Models
Elastic Storage Server GL Models
A closer look at the GL6 components
• P822L GPFS Storage Servers: Power8, RH Linux, GPFS 4.1 + GNR RAID manager, 20 cores, 128 GB memory
• Fat networking
• SAS-connected storage: DCS3700 expansion trays, 60 drives (4U) each (1818-80E)
• IBM 7042-CR8 rack-mounted Hardware Management Console (HMC)
• IBM 7014 Rack Model T42 (enterprise rack)
• P821L EMS/xCAT server (Power8, RH Linux) and IBM HMC 7042-CR8 management console
• Derated (unofficial) figures: 1.4 PB raw, 1 PB usable, 16 MB block size, 13.6 GB/s sequential read, 13.4 GB/s sequential write, 30K x 8 KB read IOPS, 6K x 8 KB write IOPS
Sample Configurations & Reference Architecture
Installation of Elastic Storage Server (high level)
1. Confirm the private IP range for the HMC DHCP server.
2. Confirm the private service network with (6) IPs and the private xCAT management network with (6) IPs; separate the networks via switches or VLANs.
3. Confirm public network connections for the HMC and EMS; (2) IPs are needed.
4. Confirm host-to-IP mappings for the following (the ESS defaults can be used):
   – HMC
   – EMS
   – I/O server 1, I/O server 2, I/O server 3, I/O server 4
   – 10GigE/40GigE hostname-to-IP mappings
5. Set up domain names for the xCAT private network.
6. Set up domain names for the high-speed interconnect.
7. Set up partition and partition profile names.
8. Confirm server names.
9. Confirm that the 10GigE/40GigE/IB switches are in place and cabled.
10. Determine whether bonding is being used.
11. Set up the public network, in place and cabled to the xCAT EMS and HMC (at minimum).
12. Confirm that all building-block components are in the frame (4 I/O servers, EMS, HMC, HMC console, switches).
13. Set up / confirm dual-feed power to the frame components.
14. Set up the HMC console and/or terminal.
15. Prepare the Red Hat 7 ISO or DVD for installation.
16. The client should register the RH license for all ESS servers.
17. Define how many file systems, the block sizes, whether metadata is split, and replication (or take the defaults).
18. Confirm that all disks are in place (will be checked with scripts).
19. Confirm that all cabling is in place (will be double-checked by scripts).
20. Confirm WiFi access in the lab to set up a Sametime meeting room (for IBMer work).
21. Confirm that the client intends to use Standard Spectrum Scale for this ESS install.
Then follow the 76-page install guide.
A look at the Building Block Networking
The end cluster result is the sum of the parts
What is GNR, and how do I communicate the value?
• Spectrum Scale Native RAID is a software implementation of storage RAID technologies within Spectrum Scale.
• It requires special licensing.
• It is only approved for pre-certified architectures (such as GSS, Elastic Storage Server, and DDN GRIDScaler).
• Using conventional dual-ported disks in a JBOD configuration, Spectrum Scale Native RAID implements sophisticated data placement and error-correction algorithms to deliver high levels of storage reliability, availability, and performance.
• Standard Spectrum Scale file systems are created from the NSDs defined through Spectrum Scale Native RAID.
• No hardware-based controller.
Petascale argument for stronger RAID codes
• Disk rebuilding is a fact of life at the petascale level
– With 100,000 disks and an MTBF_disk of 600 Khrs, a rebuild is triggered about four times a day
– A 24-hour rebuild implies four concurrent, continuous rebuilds at all times.
• Traditional, 1-fault-tolerant RAID-5 is a non-starter
– A disk hard read error rate of 1-in-10^15 bits implies data loss every ~26th rebuild
   • 10^15 / (8 disks-per-RAID-group x 600-GB disks x 8 bits/byte) ≈ 26
– Or a data loss event every 26/4 = 6.5 days.
• 2-fault-tolerant declustered RAID (8+2P) may not be sufficient
– MTTDL ~ 7 years (simulated, MTTF_disk = 600 Khrs, Weibull, 100-PB usable).
• 3-fault-tolerant declustered RAID (8+3P) is 400,000x better
– MTTDL ~ 3x10^6 years (simulated, MTTF_disk = 600 Khrs, Weibull, 100-PB usable)
– Guards against unexpected correlated failures.
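The failure-rate and read-error arithmetic above can be sanity-checked in a few lines of Python. All figures are the slide's stated assumptions (fleet size, MTBF, error rate, disk capacity), not measurements:

```python
def rebuilds_per_day(num_disks, mtbf_hours):
    # Expected disk failures (rebuild triggers) per day across the fleet.
    return num_disks * 24 / mtbf_hours

def rebuilds_until_read_error(error_rate_bits, disks_per_group, disk_bytes):
    # Rebuilds expected before a hard read error corrupts a rebuild:
    # each rebuild reads the 8 surviving group disks end to end.
    bits_read_per_rebuild = disks_per_group * disk_bytes * 8
    return error_rate_bits / bits_read_per_rebuild

daily = rebuilds_per_day(100_000, 600_000)              # 4.0 rebuilds/day
until_loss = rebuilds_until_read_error(1e15, 8, 600e9)  # ~26 rebuilds
print(daily, until_loss, until_loss / daily)            # ~6.5 days to a loss event
```

This reproduces the slide's "four rebuilds a day", "~26th rebuild", and "6.5 days" figures.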
Features
• Auto rebalancing
• Only a 2% rebuild performance hit
• Reed-Solomon erasure code, "8 data + 3 parity"
• ~10^5-year MTTDL for a 100-PB file system
• End-to-end, disk-to-Spectrum-Scale-client data checksums
• No hardware storage controller: software RAID on the I/O servers
– SAS-attached JBOD
– Special JBOD storage drawer for very dense drive packing
– Solid-state drives (SSDs) for metadata storage
[Diagram: NSD servers with vdisks and SAS-attached JBODs on a local area network (LAN)]
Works within the Spectrum Scale (GPFS) Network Shared Disk (NSD) stack

[Diagram: two NSD stacks side by side. Traditional: a compute node (client application → GPFS NSD client, user space) connects over the LAN to an I/O node (GPFS NSD server in user space → GPFS kernel I/O layer → OS device driver → HBA device driver, kernel space), which drives a hardware disk array controller and its disks; control traffic travels via RPC, data via RDMA. GNR-based: the hardware disk array controller is removed and a GPFS software controller is added — the I/O node stack gains a GPFS vdisk layer (PERSEUS) between the GPFS kernel I/O layer and the OS device driver, with SAS-attached JBOD disks.]
RAID algorithm
• Two types of RAID:
– 3- or 4-way replication
– 8 + 2 or 8 + 3 parity
• 2-fault and 3-fault tolerant codes ("RAID-D2", "RAID-D3")
– 2-fault-tolerant codes: 3-way replication (1+2) and 8+2p Reed-Solomon
– 3-fault-tolerant codes: 4-way replication (1+3) and 8+3p Reed-Solomon
• Replication: 1 strip (GPFS block) plus 2 or 3 replicated strips
• Reed-Solomon: 8 strips (GPFS block) plus 2 or 3 redundancy strips
Declustered RAID
• Data, parity, and spare strips are uniformly and independently distributed across the disk array.
• Supports an arbitrary number of disks per array
– Not restricted to an integral number of RAID track widths.
[Diagram: conventional vs. declustered strip layout]
Lower disk rebuild overhead
• Improved file system performance during rebuild
– The throughput of all operational disks is used for rebuilding after a disk failure, reducing the load on clients.
– Why: since Spectrum Scale stripes data across all storage controllers, without declustering performance would be gated by the slowest rebuilding controller.
• In large systems, some array is likely always rebuilding
– 25,000 disks x 24 hours / (600,000-hour disk MTBF) = 1 rebuild/day
• Or, in a smaller storage array with out-of-spec failure rates
– 1,500 disks x 2% per month failure rate x 1/30 month = 1 rebuild/day
• With declustered GNR RAID
– Non-critical rebuild overhead typically remains < 3%.
– If risk increases with multiple failures, rebuild priority increases to reduce the exposure time.
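The two "one rebuild per day" scenarios above check out numerically; the figures are the slide's illustrative assumptions:

```python
def rebuilds_per_day_mtbf(num_disks, mtbf_hours):
    # Large system: expected failures per day from the per-disk MTBF.
    return num_disks * 24 / mtbf_hours

def rebuilds_per_day_monthly_rate(num_disks, monthly_failure_rate):
    # Smaller array with out-of-spec failure rates, per 30-day month.
    return num_disks * monthly_failure_rate / 30

print(rebuilds_per_day_mtbf(25_000, 600_000))        # 1.0
print(rebuilds_per_day_monthly_rate(1_500, 0.02))    # 1.0
```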
Declustered RAID example
[Diagram: traditional vs. GNR declustered layout on 7 disks. Traditional: 3 one-fault-tolerant groups of 2 disks each (7 tracks per group, 2 strips per track) plus a spare disk — 21 virtual tracks (42 strips). Declustered: the same 42 strips plus 7 spare strips (49 strips total) spread uniformly across all 7 disks.]
Declustered RAID rebuild
[Diagram: rebuild of a failed disk, traditional vs. declustered, with per-disk rd-wr time shown.]
• Traditional: RebuildTime = (7 rd + 7 wr) stripTimes / 2 disks = 7 stripTimes
• Declustered: RebuildTime = (6 rd + 6 wr) stripTimes / 6 disks = 2 stripTimes
• RebuildSpeedup = 7 / 2 = 3.5
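On one reading of this example (the traditional rebuild is bottlenecked on the two disks doing the 7 reads and 7 writes, while the 6 surviving declustered disks share the work), the 3.5x speedup falls out directly:

```python
def rebuild_strip_times(reads, writes, disks):
    # Total strip I/Os divided by the number of disks working in parallel.
    return (reads + writes) / disks

traditional = rebuild_strip_times(7, 7, 2)  # read 7 strips from the mirror,
                                            # write 7 to the spare: 7 stripTimes
declustered = rebuild_strip_times(6, 6, 6)  # 6 survivors each do ~1 rd + 1 wr: 2 stripTimes
print(traditional, declustered, traditional / declustered)  # 7.0 2.0 3.5
```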
High reliability
• Mean time to data loss with 50,000 disks:
– 3-fault tolerance (8+3P)
   • MTTDL: 200 million years
   • Annual failure rate (47-disk array): 4 x 10^-12
– 2-fault tolerance (8+2P)
   • MTTDL: 200 years
   • Annual failure rate (47-disk array): 5 x 10^-6
– 1-fault tolerance
   • MTTDL: 1 week (due to latent sector errors)
   • 10^15 bits / (8 disks x 600-GB disks x 8 bits/byte) = 26 rebuilds, at 4 rebuilds/day ≈ 1 week
Simulation assumptions: disk capacity = 600 GB, MTTF = 600 Khrs, hard error rate = 1-in-10^15 bits, 47-HDD declustered arrays, uncorrelated failures
Deferred disk maintenance
• With GNR, when disks fail and are restored before another failure, multiple disks can sequentially fail without data loss.
– For example, RAID-D3 with 2 disks' worth of spare space can handle up to 5 sequential disk failures.
• With RAID-D3, disk maintenance can be deferred with a policy that replaces disks after the second disk failure, meaning fewer maintenance calls with combined disk replacements.
– A maintenance interval of a month or longer is possible.
– No more evening panic calls for immediate maintenance on common FRU replacements.
• This reduces the probability of improper maintenance and/or unintended side effects.
Data integrity manager
• Highest priority: restore redundancy after disk failure(s)
– Rebuild data stripes in order of 3, 2, and 1 erasures
– Fraction of stripes affected when 3 disks have failed (assuming 8+3p, 47 disks):
   • 23% of stripes have 1 erasure (= 11/47)
   • 5% of stripes have 2 erasures (= 11/47 x 10/46)
   • 1% of stripes have 3 erasures (= 11/47 x 10/46 x 9/45)
• Medium priority: rebalance spare space after disk install
– Restores uniform declustering of data, parity, and spare strips.
• Low priority: scrub and repair media faults
– Verifies checksum/consistency of data and parity/mirror.
End-to-end checksum
• True end-to-end checksum from the disk surface to the client's Spectrum Scale interface
– Repairs soft/latent read errors
– Repairs lost/missing writes.
• Checksums are maintained on disk and in memory and are transmitted to/from the client.
• The checksum is stored in a 64-byte trailer of each 32-KiB buffer
– 8-byte checksum and 56 bytes of ID and version info
– A sequence number is used to detect lost/missing writes.
[Diagram: 8 data strips + 3 parity strips; 32-KiB buffer with a 64-B trailer; ¼- to 2-KiB terminus.]
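As a rough illustration of the trailer layout described above, the sketch below packs a 64-byte trailer as an 8-byte checksum plus 56 bytes of ID/version data. Only those two sizes come from the slide; the field split (48-byte ID, 8-byte sequence number) and all names are hypothetical:

```python
import struct

BUFFER_SIZE = 32 * 1024   # 32-KiB data buffer
TRAILER_SIZE = 64         # 8-byte checksum + 56 bytes of ID/version info

# Hypothetical packing of the 56-byte ID/version area: a 48-byte ID
# field followed by an 8-byte sequence number (detects lost writes).
TRAILER_FMT = "<Q48sQ"
assert struct.calcsize(TRAILER_FMT) == TRAILER_SIZE

def make_trailer(checksum, vdisk_id, seq):
    # Build the 64-byte trailer appended to each 32-KiB buffer.
    return struct.pack(TRAILER_FMT, checksum, vdisk_id.ljust(48, b"\0"), seq)

t = make_trailer(0xDEADBEEF, b"vdisk-example", 42)
print(len(t))  # 64
```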
IO Node Failover
Minimal configuration of two Spectrum Scale Native RAID servers and one storage JBOD. Spectrum Scale Native RAID server 1 is the primary controller for the first recovery group and backup for the second recovery group. Spectrum Scale Native RAID server 2 is the primary controller for the second recovery group and backup for the first recovery group. As shown, when server 1 fails, control of the first recovery group is taken over by its backup server 2. During the failure of server 1, the load on backup server 2 increases by 100% from one to two recovery groups.
Comprehensive disk and path diagnostics
• The asynchronous "disk hospital" design allows for careful problem determination of disk faults
– While a disk is in the disk hospital, reads are reconstructed from parity.
– For writes, strips are marked stale and repaired later when the disk leaves the hospital.
– I/Os are resumed in under 10 seconds.
• Thorough fault determination
– Power-cycling drives to reset them
– Neighbor checking
– Support for multi-disk carriers.
• Disk enclosure management
– Uses the SES interface for lights, latch locks, disk power, and so on.
• Manages topology and hardware configuration.
Disk hospital operations
• Before taking severe action against a disk, GNR checks neighboring disks to decide whether some systemic problem may be behind the failure.
• Tests paths using SCSI Test Unit Ready commands.
• Power-cycles disks to try to clear certain errors.
• Reads or writes the sectors where an I/O error occurred in order to test for media errors.
• Works with higher levels to rewrite bad sectors.
• Polls disabled paths.
Analysis with predictive actions to support best-practice healing (almost like a real hospital)
Storage component hierarchy (GNR + JBOD)
• A recovery group can have:
– A maximum of 512 disks
– Up to 16 declustered arrays
– At least 1 SSD log vdisk
– A maximum of 64 vdisks
• A declustered array:
– Can contain up to 128 pdisks
– Has a minimum size of 4 disks
– Must have one large array (>= 11 disks)
– Needs 1 or more pdisks' worth of spare space
• Vdisks
– Vdisks are volumes that become NSDs under Spectrum Scale control.
– Block sizes: 1 MiB, 2 MiB, 4 MiB, 8 MiB, and 16 MiB
[Diagram: pdisks grouped into "left" and "right" recovery groups, each containing declustered arrays (DA); the vdisks defined in the declustered arrays become NSDs.]
GNR commands: pdisks
• mmaddpdisk
– Adds a pdisk to a Spectrum Scale Native RAID recovery group.
• mmdelpdisk
– Deletes Spectrum Scale Native RAID pdisks.
• mmlspdisk
– Lists information for one or more Spectrum Scale Native RAID pdisks.
• mmchcarrier
– Allows Spectrum Scale Native RAID physical disks (pdisks) to be physically removed and replaced.
GNR commands: recovery groups
• mmlsrecoverygroup
– Lists information about Spectrum Scale Native RAID recovery groups.
• mmlsrecoverygroupevents
– Displays the Spectrum Scale Native RAID recovery group event log.
• mmchrecoverygroup
– Changes Spectrum Scale Native RAID recovery group and declustered array attributes.
• mmcrrecoverygroup
– Creates a Spectrum Scale Native RAID recovery group and its component declustered arrays and pdisks, and specifies the servers.
• mmdelrecoverygroup
– Deletes a Spectrum Scale Native RAID recovery group.
GNR commands: vdisks
• mmdelvdisk
– Deletes vdisks from a declustered array in a Spectrum Scale Native RAID recovery group.
• mmlsvdisk
– Lists information for one or more Spectrum Scale Native RAID vdisks.
• mmcrvdisk
– Creates a vdisk within a declustered array of a Spectrum Scale Native RAID recovery group.
Hints and tips
With Elastic Storage Server, the client must become a competent administrator of several technologies: IBM Power8, AIX, Red Hat Enterprise Linux 7, xCAT, Spectrum Scale 4.1, and Spectrum Scale Native RAID.
* You should always suggest adding services for knowledge transfer and ensure that your clients have links and document references to the support information required to effectively manage their Spectrum Scale or Elastic Storage Server systems.
With Elastic Storage Server and GNR, you probably don't want any 256K file systems, as GNR only supports data block sizes down to 512K. That means a non-vdisk file system using a 256K block size can never have a pool of vdisk-based storage.
Clients see better large-file sequential performance as they increase the file system block size, as expected. As they grow, they can update maxblocksize on all client clusters and test all the way up to 16M to find the best fit for their workloads. However, with a large proportion of small files, they will want to keep the block size low to prevent subblock waste, since the minimum capacity a file's data consumes is 1/32 of the file system block size. So a 5K file takes up 32K in a file system with a 1MB block size.
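The subblock-waste rule above (minimum allocation = block size / 32, rounded up to whole subblocks) can be sketched as:

```python
import math

def allocated_bytes(file_size, block_size, subblocks_per_block=32):
    # A file's data occupies whole subblocks; a subblock is 1/32 of the
    # file system block size, so that is the minimum on-disk footprint.
    subblock = block_size // subblocks_per_block
    return max(1, math.ceil(file_size / subblock)) * subblock

KiB, MiB = 1024, 1024 * 1024
print(allocated_bytes(5 * KiB, 1 * MiB) // KiB)    # 32 -> a 5K file occupies 32K
print(allocated_bytes(5 * KiB, 256 * KiB) // KiB)  # 8  -> only 8K at a 256K block size
```

This makes the trade-off concrete: the same 5K file wastes 27K at a 1MB block size but only 3K at 256K.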
Hints and tips
With Elastic Storage Server, make sure that power is redundantly connected so that power issues do not surprise your clients well into production.
Keep it simple (left to right), fully redundant.
Review
• Elastic Storage Server is specifically designed to provide simplified, optimized, and scalable building blocks for Spectrum Scale file system deployments, and to allow for the integration of GNR.
• Elastic Storage Server has 7 models: 4 GS models (small form factor for SSD & SAS drives) and 3 GL models (large form factor for NL-SAS drives).
• Elastic Storage Server ships with 1 week of Lab Services for installation, and installation is generally complicated enough to require that week of services. However, it is good to pencil in additional Lab Services for knowledge transfer for clients with a first-time install.
• Spectrum Scale Native RAID (GNR) removes the need for a RAID controller and optimizes RAID management for Spectrum Scale file system performance and reliability.
• Declustered RAID and Reed-Solomon algorithms allow non-critical rebuild overhead to typically remain below a 3% performance impact.
• A well-laid plan is cognizant of sizing the technology to the workloads and avoids too many baked-in assumptions.
Questions
Any questions on ESS, GNR, or these hints and tips?