CDA3101 Fall 2013
Computer Storage:
Practical Aspects
6,13 November 2013
Copyright © 2011 Prabhat Mishra
Storage Systems
• Introduction
• Disk Storage
• Dependability and Reliability
• I/O Performance
• Server Computers
• Conclusion
Case for Storage
• Shift in focus from computation to communication and storage of information
  – "The Computing Revolution" (1960s to 1980s): IBM, Control Data Corp., Cray Research
  – "The Information Age" (1990 to today): Google, Yahoo, Amazon, …
• Storage emphasizes reliability and scalability as well as cost-performance
  – A program crash is frustrating, but data loss is unacceptable, so dependability is the key concern
• Which software determines hardware features?
  – The operating system for storage; the compiler for the processor
Cost vs Access time in DRAM/Disk
DRAM is 100,000 times faster, and costs 30-150 times more per gigabyte.
Flash Storage (§6.4)
• Nonvolatile semiconductor storage
• 100× – 1000× faster than disk
• Smaller, lower power, more robust
• But more $/GB (between disk and DRAM)
Hard Disk Drive
Seek Time is not Linear in Distance
• Rule of thumb: average seek time is the time to seek across 1/3 of the cylinders.
• Seek time is not linear in distance: the arm must accelerate, pause, decelerate, and then wait for settle time.
• The average does not work well in practice: real workloads exhibit locality, so most seeks are far shorter than 1/3 of the cylinders.
• Example of rotational scheduling: performing 4 reads (sectors 26, 100, 724, 9987) in arrival order requires 3 revolutions; with the accesses reordered to match the rotation, it requires just 3/4 of a revolution.
Dependability
• Fault: failure of a component
  – May or may not lead to system failure
• Service alternates between two states:
  – Service accomplishment: service delivered as specified
  – Service interruption: deviation from specified service
• A failure moves the system from accomplishment to interruption; a restoration moves it back
Dependability Measures
• Reliability: mean time to failure (MTTF)
• Service interruption: mean time to repair (MTTR)
• Mean time between failures: MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
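As a quick numeric illustration, a minimal C sketch of these formulas (the MTTF and MTTR values below are made up for the example):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical example values, in hours */
    double mttf = 50000.0;   /* mean time to failure */
    double mttr = 24.0;      /* mean time to repair  */

    double mtbf = mttf + mttr;          /* MTBF = MTTF + MTTR  */
    double availability = mttf / mtbf;  /* fraction of time up */

    printf("MTBF         = %.1f hours\n", mtbf);
    printf("Availability = %.5f (%.3f%%)\n",
           availability, availability * 100.0);
    return 0;
}
```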
Improving Availability
• Increase MTTF: fault avoidance, fault tolerance, fault forecasting
• Reduce MTTR: improved tools and processes for diagnosis and repair
Disk Access Example
• Given: 512B sector, 15,000 rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
• Average read time:
  4ms (seek time)
  + 0.5 / (15,000/60) = 2ms (rotational latency)
  + 512B / 100MB/s ≈ 0.005ms (transfer time)
  + 0.2ms (controller delay)
  ≈ 6.2ms
• If the actual average seek time is 1ms, the average read time is 3.2ms
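A minimal C sketch of this arithmetic, using the parameters given above:

```c
#include <stdio.h>

int main(void) {
    /* Parameters from the example above */
    double seek_ms       = 4.0;       /* average seek time   */
    double rpm           = 15000.0;   /* spindle speed       */
    double sector_bytes  = 512.0;     /* sector size         */
    double xfer_MB_s     = 100.0;     /* transfer rate       */
    double controller_ms = 0.2;       /* controller overhead */

    /* Half a rotation on average, converted to milliseconds */
    double rotational_ms = 0.5 / (rpm / 60.0) * 1000.0;
    /* Time to transfer one sector, in milliseconds */
    double transfer_ms   = sector_bytes / (xfer_MB_s * 1e6) * 1000.0;

    double total_ms = seek_ms + rotational_ms + transfer_ms + controller_ms;
    printf("Average read time = %.3f ms\n", total_ms);   /* ~6.205 ms */
    return 0;
}
```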
Use Arrays of Small Disks?
[Figure: conventional designs use four disk form factors (14", 10", 5.25", 3.5") spanning low end to high end; a disk array uses a single 3.5" disk design]
• Can smaller disks be used to close the gap in performance between disks and CPUs?
• An array improves throughput, but latency may not improve
Array Reliability
• Reliability of N disks = reliability of 1 disk ÷ N
  – 50,000 hours ÷ 70 disks = ~700 hours
  – Disk system MTTF drops from about 6 years to about 1 month!
• Arrays (without redundancy) are too unreliable to be used
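As a sanity check on this arithmetic, a minimal C sketch using the numbers above:

```c
#include <stdio.h>

int main(void) {
    double disk_mttf_hours = 50000.0;  /* MTTF of one disk   */
    int    ndisks          = 70;       /* disks in the array */

    /* With no redundancy, the array fails when any one disk fails */
    double array_mttf = disk_mttf_hours / ndisks;
    printf("Array MTTF = %.0f hours (~%.1f months)\n",
           array_mttf, array_mttf / 730.0);   /* ~714 hours, ~1 month */
    return 0;
}
```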
Redundant Arrays of (Inexpensive) Disks
• Files are "striped" across multiple disks
• Redundancy yields high data availability
  – Availability: service is still provided to the user even if some components fail
• Disks will still fail
  – Contents are reconstructed from data redundantly stored in the array
  – Capacity penalty to store redundant information
  – Bandwidth penalty to update redundant information
• Hot spares support reconstruction in parallel with access: very high media availability can be achieved
RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "mirror"; a disk and its mirror form a recovery group
  – Very high availability can be achieved
• Bandwidth sacrifice on write: one logical write = two physical writes
• Reads may be optimized (either copy can service a read)
• Most expensive solution: 100% capacity overhead
RAID 10 vs RAID 01
• RAID 1+0 (striped mirrors): e.g., four mirrored pairs of disks holding four disks' worth of data
• RAID 0+1 (mirrored stripes): e.g., a mirrored pair of four-disk stripes holding four disks' worth of data
RAID 2
• Memory-style error-correcting codes across disks
• Not used anymore; other RAID organizations are more attractive
RAID 3: Parity Disk
[Figure: a logical record is striped into physical records across the data disks, with a parity disk P per stripe]
• P contains the sum of the other disks per stripe, mod 2 ("parity"), i.e., the XOR of the data blocks
• If a disk fails, "subtract" P from the sum of the other disks to find the missing information (sketched in code below)
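A minimal C sketch of that reconstruction (the disk block contents are made-up bytes): because addition and subtraction mod 2 are both XOR, the lost block is the XOR of the parity with all surviving blocks.

```c
#include <stdio.h>
#include <stdint.h>

#define NDISKS 4  /* data disks per stripe */

int main(void) {
    /* Hypothetical one-byte blocks on four data disks */
    uint8_t data[NDISKS] = {0x93, 0xCD, 0xA3, 0xCD};

    /* Parity block: XOR (mod-2 sum) of all data blocks */
    uint8_t parity = 0;
    for (int i = 0; i < NDISKS; i++)
        parity ^= data[i];

    /* Suppose disk 2 fails: recover its block by XORing
       the parity with all surviving data blocks. */
    int failed = 2;
    uint8_t recovered = parity;
    for (int i = 0; i < NDISKS; i++)
        if (i != failed)
            recovered ^= data[i];

    printf("lost 0x%02X, recovered 0x%02X\n", data[failed], recovered);
    return 0;
}
```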
Inspiration for RAID 4
• RAID 3 relies on the parity disk to discover errors on a read
• But every sector already has its own error-detection field
• So, to catch errors on a read, rely on the per-sector error-detection field rather than the parity disk
• This allows independent reads to different disks to proceed simultaneously
RAID 4: High I/O Rate Parity
Inside of 5 disks (logical disk addresses increase left to right, top to bottom; P is a dedicated parity disk):

  D0   D1   D2   D3   P
  D4   D5   D6   D7   P
  D8   D9   D10  D11  P
  D12  D13  D14  D15  P
  D16  D17  D18  D19  P
  D20  D21  D22  D23  P
  ...

• Each row of four data blocks is a stripe; the stripe's parity block is kept on the P disk
• Example: small reads to D0 and D5 can be serviced by different disks simultaneously; a large write to D12–D15 rewrites a full stripe plus its parity block
Inspiration for RAID 5
• RAID 4 works well for small reads
• Small writes (a write to one disk) have two options:
  – Option 1: read the other data disks, compute the new parity, and write it to the parity disk
  – Option 2: since P holds the old sum, compare the old data to the new data and add the difference to P (see the sketch below)
• Either way, small writes are limited by the parity disk: writes to D0 and D5 must each also write to the P disk, so they serialize on it
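A minimal C sketch of Option 2 (block contents are made-up bytes): under mod-2 arithmetic the "difference" is XOR, so the new parity is the old parity XOR old data XOR new data, with no need to read the other data disks.

```c
#include <stdio.h>
#include <stdint.h>

/* Option 2: update parity from the old data and old parity alone,
   without reading the other data disks in the stripe. */
static uint8_t update_parity(uint8_t old_parity,
                             uint8_t old_data,
                             uint8_t new_data) {
    return old_parity ^ old_data ^ new_data;  /* mod-2 "difference" is XOR */
}

int main(void) {
    /* Hypothetical stripe: four data blocks and their parity */
    uint8_t d[4] = {0x11, 0x22, 0x33, 0x44};
    uint8_t p = d[0] ^ d[1] ^ d[2] ^ d[3];

    /* Small write: replace d[1] */
    uint8_t new_d1 = 0xAB;
    p = update_parity(p, d[1], new_d1);
    d[1] = new_d1;

    /* Check: parity still equals the XOR of all data blocks */
    printf("parity ok: %s\n",
           (p == (d[0] ^ d[1] ^ d[2] ^ d[3])) ? "yes" : "no");
    return 0;
}
```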
RAID 5: Distributed Parity
• N + 1 disks
• Like RAID 4, but the parity blocks are distributed across the disks:

  D0   D1   D2   D3   P
  D4   D5   D6   P    D7
  D8   D9   P    D10  D11
  D12  P    D13  D14  D15
  P    D16  D17  D18  D19
  D20  D21  D22  D23  P

• Avoids the parity disk being a bottleneck
• Widely used
RAID 6: Recovering from 2 Failures
• Why recover from more than one failure?
  – An operator may accidentally replace the wrong disk during a failure
  – Disk bandwidth is growing more slowly than disk capacity, so the MTTR of a disk is increasing; a longer repair raises the chance of a second failure during it (a 500 GB SATA disk could take 3 hours to read sequentially)
  – Reading much more data during reconstruction increases the chance of an uncorrectable media failure, which would result in data loss
  – Arrays have a growing number of disks, and use ATA disks (slower and larger than SCSI disks)
RAID 6: Row-Diagonal Parity (RAID-DP)
• Network Appliance's row-diagonal parity, or RAID-DP
• Like the standard RAID schemes, it uses redundant space based on a parity calculation per stripe
• Since it protects against a double failure, it adds two check blocks per stripe of data
  – With p + 1 disks in total, p − 1 disks hold data
• The row parity disk is just like in RAID 4: even parity across the other data blocks in its stripe
• Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
• Example with p = 5: row-diagonal parity starts by recovering one of the four blocks on a failed disk using diagonal parity
  – Since each diagonal misses one disk, and all diagonals miss a different disk, two diagonals are missing only one block each
  – Once the data for those blocks is recovered, the standard RAID recovery scheme can recover two more blocks in the standard RAID 4 stripes
  – The process continues, alternating diagonal and row recovery, until both failed disks are restored (a sketch of this peeling process follows)
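A minimal C sketch of that alternating ("peeling") recovery for p = 5, with made-up byte values and a simplified layout (four data disks, one row parity disk, one diagonal parity disk; diagonal p − 1 is left unstored). It illustrates the recovery idea rather than NetApp's exact on-disk format:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define P    5          /* prime p: 4 data disks + row parity + diag parity  */
#define ROWS (P - 1)    /* rows per stripe group                             */
#define COLS (P + 1)    /* cols 0..P-2 data, col P-1 row parity, col P diag  */

int main(void) {
    uint8_t a[ROWS][COLS];
    int missing[ROWS][COLS] = {{0}};

    /* Fill the data disks with arbitrary made-up bytes */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < P - 1; c++)
            a[r][c] = (uint8_t)(17 * r + 31 * c + 7);

    /* Row parity: even parity across the data blocks in each row */
    for (int r = 0; r < ROWS; r++) {
        a[r][P - 1] = 0;
        for (int c = 0; c < P - 1; c++)
            a[r][P - 1] ^= a[r][c];
    }

    /* Diagonal parity: block (r,c), c in 0..P-1, lies on diagonal
       (r + c) mod P; diagonal d's parity is stored at a[d][P].
       Diagonal P-1 is deliberately left unstored. */
    for (int d = 0; d < ROWS; d++) {
        a[d][P] = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < P; c++)
                if ((r + c) % P == d)
                    a[d][P] ^= a[r][c];
    }

    uint8_t saved[ROWS][COLS];
    memcpy(saved, a, sizeof a);

    /* Fail two disks: columns 0 and 2 */
    for (int r = 0; r < ROWS; r++) {
        a[r][0] = a[r][2] = 0;
        missing[r][0] = missing[r][2] = 1;
    }

    /* Peeling recovery: a diagonal with exactly one missing block yields
       that block; row parity then yields the other block in its row. */
    for (int done = 0; !done; ) {
        done = 1;
        for (int d = 0; d < ROWS; d++) {        /* diagonal step */
            int cnt = 0, mr = 0, mc = 0;
            uint8_t x = a[d][P];
            for (int r = 0; r < ROWS; r++)
                for (int c = 0; c < P; c++)
                    if ((r + c) % P == d) {
                        if (missing[r][c]) { cnt++; mr = r; mc = c; }
                        else x ^= a[r][c];
                    }
            if (cnt == 1) { a[mr][mc] = x; missing[mr][mc] = 0; done = 0; }
        }
        for (int r = 0; r < ROWS; r++) {        /* row step */
            int cnt = 0, mc = 0;
            uint8_t x = 0;
            for (int c = 0; c < P; c++) {
                if (missing[r][c]) { cnt++; mc = c; }
                else x ^= a[r][c];
            }
            if (cnt == 1) { a[r][mc] = x; missing[r][mc] = 0; done = 0; }
        }
    }

    printf("both disks recovered: %s\n",
           memcmp(a, saved, sizeof a) == 0 ? "yes" : "no");
    return 0;
}
```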
I/O - Introduction
• I/O devices can be characterized by:
  – Behavior: input, output, or storage
  – Partner: human or machine
  – Data rate: bytes/sec, transfers/sec
• I/O bus connections
I/O System Characteristics
• Dependability is important
  – Particularly for storage devices
• Performance measures
  – Latency (response time)
  – Throughput (bandwidth)
• Desktops & embedded systems: primary focus is response time & diversity of devices
• Servers: primary focus is throughput & expandability of devices
Typical x86 PC I/O System
I/O Register Mapping
• Memory-mapped I/O
  – Registers are addressed in the same space as memory (see the sketch after this list)
  – An address decoder distinguishes between them
  – The OS uses the address translation mechanism to make them accessible only to the kernel
• I/O instructions
  – Separate instructions to access I/O registers
  – Can only be executed in kernel mode
  – Example: x86
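A minimal C sketch of memory-mapped I/O (the device, its base address, and the register layout are all hypothetical): device registers are accessed with ordinary loads and stores through volatile pointers.

```c
#include <stdint.h>

/* Hypothetical memory-mapped device registers: on real hardware these
   addresses come from the platform's memory map, not from ordinary RAM. */
#define DEV_BASE    0x10000000u
#define DEV_STATUS  (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA    (*(volatile uint32_t *)(DEV_BASE + 0x4))

/* Loads and stores to these addresses are routed to the device by the
   address decoder; 'volatile' keeps the compiler from caching or
   reordering the accesses. */
uint32_t read_device_status(void)      { return DEV_STATUS; }
void     write_device_data(uint32_t w) { DEV_DATA = w; }
```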
Polling
• Periodically check the I/O status register
  – If the device is ready, do the operation
  – If there is an error, take action
• Common in small or low-performance real-time embedded systems
  – Predictable timing, low hardware cost
• In other systems, polling wastes CPU time (see the sketch below)
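A minimal C sketch of a polling loop, reusing the same hypothetical device registers as above (the status bits are also made up):

```c
#include <stdint.h>

#define DEV_BASE    0x10000000u                              /* hypothetical */
#define DEV_STATUS  (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA    (*(volatile uint32_t *)(DEV_BASE + 0x4))

#define ST_READY 0x1u   /* hypothetical "device ready" bit */
#define ST_ERROR 0x2u   /* hypothetical "error" bit        */

/* Busy-wait until the device is ready, then perform the operation.
   Simple and predictable, but the CPU does no useful work while waiting. */
int poll_and_read(uint32_t *out) {
    for (;;) {
        uint32_t status = DEV_STATUS;      /* periodically check status */
        if (status & ST_ERROR)
            return -1;                     /* error: take action        */
        if (status & ST_READY) {
            *out = DEV_DATA;               /* ready: do the operation   */
            return 0;
        }
    }
}
```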
Interrupts
• When a device is ready or an error occurs, the controller interrupts the CPU
• An interrupt is like an exception
  – But it is not synchronized to instruction execution
  – The handler can be invoked between instructions
  – Cause information often identifies the interrupting device (see the handler sketch below)
• Priority interrupts
  – Devices needing more urgent attention get higher priority
  – They can interrupt the handler for a lower-priority interrupt
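A minimal C sketch of an interrupt handler (the interrupt-controller registers and device identifiers are hypothetical): the handler reads the cause register to identify the device, dispatches to its service routine, and acknowledges the interrupt.

```c
#include <stdint.h>

/* Hypothetical interrupt controller registers */
#define IC_CAUSE (*(volatile uint32_t *)0x20000000u) /* interrupting device id */
#define IC_ACK   (*(volatile uint32_t *)0x20000004u) /* write to acknowledge   */

#define DEV_DISK 1u   /* hypothetical device ids */
#define DEV_NET  2u

/* Invoked by the CPU between instructions when a device raises an
   interrupt; the cause register identifies which device needs service. */
void interrupt_handler(void) {
    uint32_t cause = IC_CAUSE;
    switch (cause) {
    case DEV_DISK: /* service the disk   */ break;
    case DEV_NET:  /* service the network */ break;
    default:       break;
    }
    IC_ACK = cause;  /* acknowledge so the device can interrupt again */
}
```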
I/O Data Transfer
• With polling and interrupt-driven I/O, the CPU transfers data between memory and the I/O data registers
  – Time-consuming for high-speed devices
• Direct memory access (DMA)
  – The OS provides the starting address in memory
  – The I/O controller transfers data to/from memory autonomously
  – The controller interrupts on completion or error (see the sketch below)
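A minimal C sketch of programming a DMA transfer (the controller registers are hypothetical): the OS writes the starting address and length, starts the transfer, and then does other work until the completion interrupt arrives.

```c
#include <stdint.h>

/* Hypothetical DMA controller registers */
#define DMA_ADDR (*(volatile uint32_t *)0x30000000u) /* start address in memory */
#define DMA_LEN  (*(volatile uint32_t *)0x30000004u) /* transfer length (bytes) */
#define DMA_CTRL (*(volatile uint32_t *)0x30000008u) /* control: bit 0 = start  */

static uint8_t buffer[4096];

/* The OS programs the controller, then returns to useful work; the
   controller moves the data itself and interrupts when it is done. */
void start_dma_read(void) {
    /* In a real OS this would be the buffer's physical address */
    DMA_ADDR = (uint32_t)(uintptr_t)buffer;  /* starting address in memory */
    DMA_LEN  = sizeof buffer;                /* how much to transfer       */
    DMA_CTRL = 0x1u;                         /* start device->memory copy  */
}

/* Called from the interrupt handler on completion or error. */
void dma_done_handler(void) {
    /* check status, hand the filled buffer to the waiting process, ... */
}
```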
Server Computers
• Applications are increasingly run on servers
  – Web search, office apps, virtual worlds, …
• This requires large data-center servers
  – Multiple processors, network connections, massive storage
  – Space and power constraints
• Server equipment is built for 19" racks
  – Multiples of 1.75" (1U) high
Rack-Mounted Servers
• Sun Fire x4150 1U server
  – Processors with 4 cores each
  – 16 × 4GB = 64GB DRAM
Concluding Remarks
• I/O performance measures: throughput and response time
  – Dependability and cost are also important
• Buses are used to connect the CPU, memory, and I/O controllers
  – Data transfer via polling, interrupts, or DMA
• RAID improves both performance and dependability
• Please read Sections 6.1 – 6.10 of P&H, 4th Ed.
THINK: Weekend!!
The best way to predict the future is to create it. – Peter Drucker