CDA3101 Fall 2013
Computer Storage:
Practical Aspects
6,13 November 2013
Copyright © 2011 Prabhat Mishra
Storage Systems
• Introduction
• Disk Storage
• Dependability and Reliability
• I/O Performance
• Server Computers
• Conclusion
Case for Storage
• Shift in focus from computation to communication and storage of information
  – "The Computing Revolution" (1960s to 1980s): IBM, Control Data Corp., Cray Research
  – "The Information Age" (1990 to today): Google, Yahoo, Amazon, …
• Storage emphasizes reliability and scalability as well as cost-performance
  – A program crash is frustrating, but data loss is unacceptable, so dependability is the key concern
• Which software determines hardware features?
  – The operating system for storage; the compiler for the processor
Cost vs Access time in DRAM/Disk
DRAM is 100,000 times faster, and costs 30-150 times more per gigabyte.
Flash Storage (§6.4)
• Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)
Hard Disk Drive
Seek Time is not Linear in Distance
• Rule of thumb: average seek time is the time to seek across 1/3 of the cylinders.
• Seek time is not linear in distance: the arm must accelerate, pause, decelerate, and then wait for settle time.
• The average does not work well in practice: real workloads exhibit locality, so most seeks are far shorter than 1/3 of the cylinders.
• Example of rotational scheduling: performing 4 reads (sectors 26, 100, 724, 9987) in arrival order requires 3 revolutions; with the accesses reordered to match the rotation, it requires just 3/4 of a revolution.
Dependability
• Fault: failure of a component
  – May or may not lead to system failure
• Service alternates between two states:
  – Service accomplishment: service delivered as specified
  – Service interruption: deviation from specified service
• A failure moves the system from accomplishment to interruption; a restoration moves it back
Dependability Measures
• Reliability: mean time to failure (MTTF)
• Service interruption: mean time to repair (MTTR)
• Mean time between failures: MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
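As a quick numeric illustration, a minimal C sketch of these formulas (the MTTF and MTTR values below are made up for the example):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical example values, in hours */
    double mttf = 50000.0;   /* mean time to failure */
    double mttr = 24.0;      /* mean time to repair  */

    double mtbf = mttf + mttr;          /* MTBF = MTTF + MTTR  */
    double availability = mttf / mtbf;  /* fraction of time up */

    printf("MTBF         = %.1f hours\n", mtbf);
    printf("Availability = %.5f (%.3f%%)\n",
           availability, availability * 100.0);
    return 0;
}
```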
Improving Availability
• Increase MTTF: fault avoidance, fault tolerance, fault forecasting
• Reduce MTTR: improved tools and processes for diagnosis and repair
Disk Access Example
• Given: 512B sector, 15,000 rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
• Average read time:
  4ms (seek time)
  + 0.5 / (15,000/60) = 2ms (rotational latency)
  + 512B / 100MB/s ≈ 0.005ms (transfer time)
  + 0.2ms (controller delay)
  ≈ 6.2ms
• If the actual average seek time is 1ms, the average read time is 3.2ms
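A minimal C sketch of this arithmetic, using the parameters given above:

```c
#include <stdio.h>

int main(void) {
    /* Parameters from the example above */
    double seek_ms       = 4.0;       /* average seek time   */
    double rpm           = 15000.0;   /* spindle speed       */
    double sector_bytes  = 512.0;     /* sector size         */
    double xfer_MB_s     = 100.0;     /* transfer rate       */
    double controller_ms = 0.2;       /* controller overhead */

    /* Half a rotation on average, converted to milliseconds */
    double rotational_ms = 0.5 / (rpm / 60.0) * 1000.0;
    /* Time to transfer one sector, in milliseconds */
    double transfer_ms   = sector_bytes / (xfer_MB_s * 1e6) * 1000.0;

    double total_ms = seek_ms + rotational_ms + transfer_ms + controller_ms;
    printf("Average read time = %.3f ms\n", total_ms);   /* ~6.205 ms */
    return 0;
}
```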
Use Arrays of Small Disks?
[Figure: conventional designs use four disk form factors (14", 10", 5.25", 3.5") spanning low end to high end; a disk array uses a single 3.5" disk design]
• Can smaller disks be used to close the gap in performance between disks and CPUs?
• An array improves throughput, but latency may not improve
Array Reliability
• Reliability of N disks = reliability of 1 disk ÷ N
  – 50,000 hours ÷ 70 disks = ~700 hours
  – Disk system MTTF drops from about 6 years to about 1 month!
• Arrays (without redundancy) are too unreliable to be used
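As a sanity check on this arithmetic, a minimal C sketch using the numbers above:

```c
#include <stdio.h>

int main(void) {
    double disk_mttf_hours = 50000.0;  /* MTTF of one disk   */
    int    ndisks          = 70;       /* disks in the array */

    /* With no redundancy, the array fails when any one disk fails */
    double array_mttf = disk_mttf_hours / ndisks;
    printf("Array MTTF = %.0f hours (~%.1f months)\n",
           array_mttf, array_mttf / 730.0);   /* ~714 hours, ~1 month */
    return 0;
}
```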
Redundant Arrays of (Inexpensive) Disks
• Files are "striped" across multiple disks
• Redundancy yields high data availability
  – Availability: service is still provided to the user even if some components fail
• Disks will still fail
  – Contents are reconstructed from data redundantly stored in the array
  – Capacity penalty to store redundant information
  – Bandwidth penalty to update redundant information
• Hot spares support reconstruction in parallel with access: very high media availability can be achieved
RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "mirror"; a disk and its mirror form a recovery group
  – Very high availability can be achieved
• Bandwidth sacrifice on write: one logical write = two physical writes
• Reads may be optimized (either copy can service a read)
• Most expensive solution: 100% capacity overhead
RAID 10 vs RAID 01
• RAID 1+0 (striped mirrors): e.g., four mirrored pairs of disks holding four disks' worth of data
• RAID 0+1 (mirrored stripes): e.g., a mirrored pair of four-disk stripes holding four disks' worth of data
RAID 2
• Memory-style error-correcting codes across disks
• Not used anymore; other RAID organizations are more attractive
RAID 3: Parity Disk
[Figure: a logical record is striped into physical records across the data disks, with a parity disk P per stripe]
• P contains the sum of the other disks per stripe, mod 2 ("parity"), i.e., the XOR of the data blocks
• If a disk fails, "subtract" P from the sum of the other disks to find the missing information (sketched in code below)
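A minimal C sketch of that reconstruction (the disk block contents are made-up bytes): because addition and subtraction mod 2 are both XOR, the lost block is the XOR of the parity with all surviving blocks.

```c
#include <stdio.h>
#include <stdint.h>

#define NDISKS 4  /* data disks per stripe */

int main(void) {
    /* Hypothetical one-byte blocks on four data disks */
    uint8_t data[NDISKS] = {0x93, 0xCD, 0xA3, 0xCD};

    /* Parity block: XOR (mod-2 sum) of all data blocks */
    uint8_t parity = 0;
    for (int i = 0; i < NDISKS; i++)
        parity ^= data[i];

    /* Suppose disk 2 fails: recover its block by XORing
       the parity with all surviving data blocks. */
    int failed = 2;
    uint8_t recovered = parity;
    for (int i = 0; i < NDISKS; i++)
        if (i != failed)
            recovered ^= data[i];

    printf("lost 0x%02X, recovered 0x%02X\n", data[failed], recovered);
    return 0;
}
```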
Inspiration for RAID 4
• RAID 3 relies on the parity disk to discover errors on a read
• But every sector already has its own error-detection field
• So, to catch errors on a read, rely on the per-sector error-detection field rather than the parity disk
• This allows independent reads to different disks to proceed simultaneously
RAID 4: High I/O Rate Parity
Inside of 5 disks (logical disk addresses increase left to right, top to bottom; P is a dedicated parity disk):

  D0   D1   D2   D3   P
  D4   D5   D6   D7   P
  D8   D9   D10  D11  P
  D12  D13  D14  D15  P
  D16  D17  D18  D19  P
  D20  D21  D22  D23  P
  ...

• Each row of four data blocks is a stripe; the stripe's parity block is kept on the P disk
• Example: small reads to D0 and D5 can be serviced by different disks simultaneously; a large write to D12–D15 rewrites a full stripe plus its parity block
Inspiration for RAID 5
• RAID 4 works well for small reads
• Small writes (a write to one disk) have two options:
  – Option 1: read the other data disks, compute the new parity, and write it to the parity disk
  – Option 2: since P holds the old sum, compare the old data to the new data and add the difference to P (see the sketch below)
• Either way, small writes are limited by the parity disk: writes to D0 and D5 must each also write to the P disk, so they serialize on it
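A minimal C sketch of Option 2 (block contents are made-up bytes): under mod-2 arithmetic the "difference" is XOR, so the new parity is the old parity XOR old data XOR new data, with no need to read the other data disks.

```c
#include <stdio.h>
#include <stdint.h>

/* Option 2: update parity from the old data and old parity alone,
   without reading the other data disks in the stripe. */
static uint8_t update_parity(uint8_t old_parity,
                             uint8_t old_data,
                             uint8_t new_data) {
    return old_parity ^ old_data ^ new_data;  /* mod-2 "difference" is XOR */
}

int main(void) {
    /* Hypothetical stripe: four data blocks and their parity */
    uint8_t d[4] = {0x11, 0x22, 0x33, 0x44};
    uint8_t p = d[0] ^ d[1] ^ d[2] ^ d[3];

    /* Small write: replace d[1] */
    uint8_t new_d1 = 0xAB;
    p = update_parity(p, d[1], new_d1);
    d[1] = new_d1;

    /* Check: parity still equals the XOR of all data blocks */
    printf("parity ok: %s\n",
           (p == (d[0] ^ d[1] ^ d[2] ^ d[3])) ? "yes" : "no");
    return 0;
}
```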
RAID 5: Distributed Parity
• N + 1 disks
• Like RAID 4, but the parity blocks are distributed across the disks:

  D0   D1   D2   D3   P
  D4   D5   D6   P    D7
  D8   D9   P    D10  D11
  D12  P    D13  D14  D15
  P    D16  D17  D18  D19
  D20  D21  D22  D23  P

• Avoids the parity disk being a bottleneck
• Widely used
RAID 6: Recovering from 2 Failures
• Why recover from more than one failure?
  – An operator may accidentally replace the wrong disk during a failure
  – Disk bandwidth is growing more slowly than disk capacity, so the MTTR of a disk is increasing; a longer repair raises the chance of a second failure during it (a 500 GB SATA disk could take 3 hours to read sequentially)
  – Reading much more data during reconstruction increases the chance of an uncorrectable media failure, which would result in data loss
  – Arrays have a growing number of disks, and use ATA disks (slower and larger than SCSI disks)
RAID 6: Row-Diagonal Parity (RAID-DP)
• Network Appliance's row-diagonal parity, or RAID-DP
• Like the standard RAID schemes, it uses redundant space based on a parity calculation per stripe
• Since it protects against a double failure, it adds two check blocks per stripe of data
  – With p + 1 disks in total, p − 1 disks hold data
• The row parity disk is just like in RAID 4: even parity across the other data blocks in its stripe
• Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
• Example with p = 5: row-diagonal parity starts by recovering one of the four blocks on a failed disk using diagonal parity
  – Since each diagonal misses one disk, and all diagonals miss a different disk, two diagonals are missing only one block each
  – Once the data for those blocks is recovered, the standard RAID recovery scheme can recover two more blocks in the standard RAID 4 stripes
  – The process continues, alternating diagonal and row recovery, until both failed disks are restored (a sketch of this peeling process follows)
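A minimal C sketch of that alternating ("peeling") recovery for p = 5, with made-up byte values and a simplified layout (four data disks, one row parity disk, one diagonal parity disk; diagonal p − 1 is left unstored). It illustrates the recovery idea rather than NetApp's exact on-disk format:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define P    5          /* prime p: 4 data disks + row parity + diag parity  */
#define ROWS (P - 1)    /* rows per stripe group                             */
#define COLS (P + 1)    /* cols 0..P-2 data, col P-1 row parity, col P diag  */

int main(void) {
    uint8_t a[ROWS][COLS];
    int missing[ROWS][COLS] = {{0}};

    /* Fill the data disks with arbitrary made-up bytes */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < P - 1; c++)
            a[r][c] = (uint8_t)(17 * r + 31 * c + 7);

    /* Row parity: even parity across the data blocks in each row */
    for (int r = 0; r < ROWS; r++) {
        a[r][P - 1] = 0;
        for (int c = 0; c < P - 1; c++)
            a[r][P - 1] ^= a[r][c];
    }

    /* Diagonal parity: block (r,c), c in 0..P-1, lies on diagonal
       (r + c) mod P; diagonal d's parity is stored at a[d][P].
       Diagonal P-1 is deliberately left unstored. */
    for (int d = 0; d < ROWS; d++) {
        a[d][P] = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < P; c++)
                if ((r + c) % P == d)
                    a[d][P] ^= a[r][c];
    }

    uint8_t saved[ROWS][COLS];
    memcpy(saved, a, sizeof a);

    /* Fail two disks: columns 0 and 2 */
    for (int r = 0; r < ROWS; r++) {
        a[r][0] = a[r][2] = 0;
        missing[r][0] = missing[r][2] = 1;
    }

    /* Peeling recovery: a diagonal with exactly one missing block yields
       that block; row parity then yields the other block in its row. */
    for (int done = 0; !done; ) {
        done = 1;
        for (int d = 0; d < ROWS; d++) {        /* diagonal step */
            int cnt = 0, mr = 0, mc = 0;
            uint8_t x = a[d][P];
            for (int r = 0; r < ROWS; r++)
                for (int c = 0; c < P; c++)
                    if ((r + c) % P == d) {
                        if (missing[r][c]) { cnt++; mr = r; mc = c; }
                        else x ^= a[r][c];
                    }
            if (cnt == 1) { a[mr][mc] = x; missing[mr][mc] = 0; done = 0; }
        }
        for (int r = 0; r < ROWS; r++) {        /* row step */
            int cnt = 0, mc = 0;
            uint8_t x = 0;
            for (int c = 0; c < P; c++) {
                if (missing[r][c]) { cnt++; mc = c; }
                else x ^= a[r][c];
            }
            if (cnt == 1) { a[r][mc] = x; missing[r][mc] = 0; done = 0; }
        }
    }

    printf("both disks recovered: %s\n",
           memcmp(a, saved, sizeof a) == 0 ? "yes" : "no");
    return 0;
}
```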
I/O - Introduction
• I/O devices can be characterized by:
  – Behavior: input, output, or storage
  – Partner: human or machine
  – Data rate: bytes/sec, transfers/sec
• I/O bus connections
I/O System Characteristics
• Dependability is important
  – Particularly for storage devices
• Performance measures
  – Latency (response time)
  – Throughput (bandwidth)
• Desktops & embedded systems: primary focus is response time & diversity of devices
• Servers: primary focus is throughput & expandability of devices
Typical x86 PC I/O System
I/O Register Mapping
• Memory-mapped I/O
  – Registers are addressed in the same space as memory (see the sketch after this list)
  – An address decoder distinguishes between them
  – The OS uses the address translation mechanism to make them accessible only to the kernel
• I/O instructions
  – Separate instructions to access I/O registers
  – Can only be executed in kernel mode
  – Example: x86
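A minimal C sketch of memory-mapped I/O (the device, its base address, and the register layout are all hypothetical): device registers are accessed with ordinary loads and stores through volatile pointers.

```c
#include <stdint.h>

/* Hypothetical memory-mapped device registers: on real hardware these
   addresses come from the platform's memory map, not from ordinary RAM. */
#define DEV_BASE    0x10000000u
#define DEV_STATUS  (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA    (*(volatile uint32_t *)(DEV_BASE + 0x4))

/* Loads and stores to these addresses are routed to the device by the
   address decoder; 'volatile' keeps the compiler from caching or
   reordering the accesses. */
uint32_t read_device_status(void)      { return DEV_STATUS; }
void     write_device_data(uint32_t w) { DEV_DATA = w; }
```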
Polling
• Periodically check the I/O status register
  – If the device is ready, do the operation
  – If there is an error, take action
• Common in small or low-performance real-time embedded systems
  – Predictable timing, low hardware cost
• In other systems, polling wastes CPU time (see the sketch below)
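A minimal C sketch of a polling loop, reusing the same hypothetical device registers as above (the status bits are also made up):

```c
#include <stdint.h>

#define DEV_BASE    0x10000000u                              /* hypothetical */
#define DEV_STATUS  (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA    (*(volatile uint32_t *)(DEV_BASE + 0x4))

#define ST_READY 0x1u   /* hypothetical "device ready" bit */
#define ST_ERROR 0x2u   /* hypothetical "error" bit        */

/* Busy-wait until the device is ready, then perform the operation.
   Simple and predictable, but the CPU does no useful work while waiting. */
int poll_and_read(uint32_t *out) {
    for (;;) {
        uint32_t status = DEV_STATUS;      /* periodically check status */
        if (status & ST_ERROR)
            return -1;                     /* error: take action        */
        if (status & ST_READY) {
            *out = DEV_DATA;               /* ready: do the operation   */
            return 0;
        }
    }
}
```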
Interrupts
• When a device is ready or an error occurs, the controller interrupts the CPU
• An interrupt is like an exception
  – But it is not synchronized to instruction execution
  – The handler can be invoked between instructions
  – Cause information often identifies the interrupting device (see the handler sketch below)
• Priority interrupts
  – Devices needing more urgent attention get higher priority
  – They can interrupt the handler for a lower-priority interrupt
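A minimal C sketch of an interrupt handler (the interrupt-controller registers and device identifiers are hypothetical): the handler reads the cause register to identify the device, dispatches to its service routine, and acknowledges the interrupt.

```c
#include <stdint.h>

/* Hypothetical interrupt controller registers */
#define IC_CAUSE (*(volatile uint32_t *)0x20000000u) /* interrupting device id */
#define IC_ACK   (*(volatile uint32_t *)0x20000004u) /* write to acknowledge   */

#define DEV_DISK 1u   /* hypothetical device ids */
#define DEV_NET  2u

/* Invoked by the CPU between instructions when a device raises an
   interrupt; the cause register identifies which device needs service. */
void interrupt_handler(void) {
    uint32_t cause = IC_CAUSE;
    switch (cause) {
    case DEV_DISK: /* service the disk   */ break;
    case DEV_NET:  /* service the network */ break;
    default:       break;
    }
    IC_ACK = cause;  /* acknowledge so the device can interrupt again */
}
```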
I/O Data Transfer
• With polling and interrupt-driven I/O, the CPU transfers data between memory and the I/O data registers
  – Time-consuming for high-speed devices
• Direct memory access (DMA)
  – The OS provides the starting address in memory
  – The I/O controller transfers data to/from memory autonomously
  – The controller interrupts on completion or error (see the sketch below)
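A minimal C sketch of programming a DMA transfer (the controller registers are hypothetical): the OS writes the starting address and length, starts the transfer, and then does other work until the completion interrupt arrives.

```c
#include <stdint.h>

/* Hypothetical DMA controller registers */
#define DMA_ADDR (*(volatile uint32_t *)0x30000000u) /* start address in memory */
#define DMA_LEN  (*(volatile uint32_t *)0x30000004u) /* transfer length (bytes) */
#define DMA_CTRL (*(volatile uint32_t *)0x30000008u) /* control: bit 0 = start  */

static uint8_t buffer[4096];

/* The OS programs the controller, then returns to useful work; the
   controller moves the data itself and interrupts when it is done. */
void start_dma_read(void) {
    /* In a real OS this would be the buffer's physical address */
    DMA_ADDR = (uint32_t)(uintptr_t)buffer;  /* starting address in memory */
    DMA_LEN  = sizeof buffer;                /* how much to transfer       */
    DMA_CTRL = 0x1u;                         /* start device->memory copy  */
}

/* Called from the interrupt handler on completion or error. */
void dma_done_handler(void) {
    /* check status, hand the filled buffer to the waiting process, ... */
}
```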
Server Computers
• Applications are increasingly run on servers
  – Web search, office apps, virtual worlds, …
• This requires large data-center servers
  – Multiple processors, network connections, massive storage
  – Space and power constraints
• Server equipment is built for 19" racks
  – Multiples of 1.75" (1U) high
Rack-Mounted Servers
• Sun Fire x4150 1U server
  – Processors with 4 cores each
  – 16 × 4GB = 64GB DRAM
Concluding Remarks
• I/O performance measures: throughput and response time
  – Dependability and cost are also important
• Buses are used to connect the CPU, memory, and I/O controllers
  – Data transfer via polling, interrupts, or DMA
• RAID improves both performance and dependability
• Please read Sections 6.1 – 6.10 of P&H, 4th Ed.
THINK: Weekend!!
The best way to predict the future is to create it. – Peter Drucker