Storage Systems CSE 598d, Spring 2007 Lecture 11: Disk scheduling Feb 27, 2007 (ACK: Several slides borrowed from Shiva Chaitanya)


Disk Access Time: Components

• CPU time to issue and process I/O
• contention for controller
• contention for bus
• contention for memory
• verifying block correctness with checksums (retransmissions)
• waiting in scheduling queues
• ...

Disk Scheduling

Seek time is a dominant factor of total disk I/O time

Let operating system or disk controller choose which request to serve next depending on the head’s current position and requested block’s position on disk

Disk scheduling is much more difficult than CPU scheduling

– a mechanical device – hard to determine (accurate) access times

– disk accesses cannot be preempted – runs until it finishes

– disk I/O often the main performance bottleneck

Scheduling at Multiple Locations!

S/W, H/W Components between an application and the disk:

- File system
- Device driver
- SCSI bus
- RAID controller (if employing RAID)
- Some bus
- Disk controller

Why?
- Why not do it only at FS/DD level?
- Why not do it only within the disk?

Scheduling locations

Scheduling at Multiple Locations!

Why? Key ideas that disk scheduling employs:

- Request re-ordering for seek/positioning minimization
- Exploit temporal locality
  - Anticipation for sequential streams
  - Introduce non-work conserving behavior!
- Exploit spatial locality
  - Coalesce consecutively placed requests
  - Free-block scheduling

Different optimizations are best done at different locations

Furthermore, the best location to do an optimization depends on the workload!

Goals

– Short response time
– High overall throughput
– Fairness (equal probability for all blocks to be accessed in the same time)

Tradeoff: Throughput vs. Fairness Socialism vs. Capitalism?

Disk Scheduling

Several traditional algorithms
– First-Come-First-Serve (FCFS)
– Shortest Seek Time First (SSTF)
  • Shortest Positioning Time First (SPTF)
– SCAN
– C-SCAN
– LOOK
– C-LOOK
– …

First–Come–First–Serve (FCFS)

FCFS serves the first arriving request first:
– long seeks
– short average response time

[Figure: head position (cylinder number, 1 to 25) over time under FCFS]

incoming requests (in order of arrival): 12 14 2 7 21 8 24
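As a rough illustration of why FCFS produces long seeks, the sketch below totals the head movement for the request sequence on the slide. The starting cylinder (11) is an assumption made for this example only; the slide does not state where the head begins.

```python
def fcfs_head_travel(head, requests):
    """Serve requests strictly in arrival order; return cylinders traveled."""
    travel = 0
    for cyl in requests:
        travel += abs(cyl - head)  # seek distance to this request
        head = cyl
    return travel

requests = [12, 14, 2, 7, 21, 8, 24]  # arrival order from the slide
start = 11                            # assumed starting cylinder (hypothetical)
print(fcfs_head_travel(start, requests))  # 63
```

The head crosses most of the 25 cylinders several times because arrival order ignores position.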

Shortest Seek Time First (SSTF)

SSTF serves closest request first:
– short seek times
– longer maximum seek times; may lead to starvation

[Figure: head position (cylinder number) over time under SSTF]

incoming requests (in order of arrival): 12 14 2 7 21 8 24
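A minimal sketch of SSTF on the same request sequence, assuming all requests are already queued and assuming a starting cylinder of 11 (not given on the slide). Ties, which do not occur in this example, would be broken by arrival order.

```python
def sstf_order(head, requests):
    """Repeatedly serve the pending request closest to the current head position."""
    pending = list(requests)
    order = []
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

requests = [12, 14, 2, 7, 21, 8, 24]  # arrival order from the slide
print(sstf_order(11, requests))       # [12, 14, 8, 7, 2, 21, 24]
```

Requests 21 and 24 are served last even though 21 arrived before 8; with a steady stream of new requests near the head they could wait indefinitely, which is the starvation risk noted above.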

SCAN

SCAN moves head edge to edge and serves requests on the way:
– bi-directional
– compromise between response time and seek time optimizations

[Figure: head position (cylinder number) over time under SCAN]

incoming requests (in order of arrival): 12 14 2 7 21 8 24
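The sketch below shows one way to derive a SCAN schedule for the slide's requests. It assumes a starting cylinder of 11, an initial upward sweep, and an upper edge at cylinder 25 (taken from the figure axis); none of these are stated on the slide. Head travel is counted until the last request is served.

```python
def scan(head, requests, hi=25):
    """SCAN with an initial upward sweep; returns (service order, head travel)."""
    up = sorted(c for c in requests if c >= head)
    down = sorted((c for c in requests if c < head), reverse=True)
    if down:
        # sweep up to the edge, reverse, and stop at the lowest pending request
        travel = (hi - head) + (hi - down[-1])
    elif up:
        # nothing behind the head: stop counting at the last request served
        travel = up[-1] - head
    else:
        travel = 0
    return up + down, travel

order, travel = scan(11, [12, 14, 2, 7, 21, 8, 24])
print(order, travel)  # [12, 14, 21, 24, 8, 7, 2] 37
```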

C–SCAN

Circular-SCAN moves head from edge to edge:
– serves requests on one way, uni-directional
– improves response time (fairness)

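[Figure: head position (cylinder number) over time under C-SCAN]

incoming requests (in order of arrival): 12 14 2 7 21 8 24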
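A comparable sketch for C-SCAN, under the same assumptions as before (starting cylinder 11, upward service direction, edges at cylinders 1 and 25, all hypothetical choices for this example). The return sweep serves nothing but still costs head travel.

```python
def cscan(head, requests, lo=1, hi=25):
    """C-SCAN serving only on the upward sweep; returns (service order, head travel)."""
    up = sorted(c for c in requests if c >= head)
    wrapped = sorted(c for c in requests if c < head)
    if wrapped:
        # up to the edge, idle return to the other edge, then up to the last request
        travel = (hi - head) + (hi - lo) + (wrapped[-1] - lo)
    elif up:
        travel = up[-1] - head
    else:
        travel = 0
    return up + wrapped, travel

order, travel = cscan(11, [12, 14, 2, 7, 21, 8, 24])
print(order, travel)  # [12, 14, 21, 24, 2, 7, 8] 45
```

Compared with a bi-directional sweep, the head travels further, but requests near the edges no longer wait nearly two full sweeps.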


LOOK and C–LOOK

LOOK (C-LOOK) is a variation of SCAN (C-SCAN):
– same schedule as SCAN
– does not run to the edges
– stops and returns at outer- and innermost request
– increased efficiency

[Figure: head position (cylinder number, 1 to 25) over time under LOOK and C-LOOK]

incoming requests (in order of arrival): 12 14 2 7 21 8 24
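To make the efficiency gain concrete, here is a small sketch of LOOK and C-LOOK that reverses (or wraps) at the outermost pending request instead of the disk edge. The starting cylinder (11) and the initial upward direction are assumptions for this example.

```python
def look_order(head, requests):
    """LOOK: serve upward, then reverse at the highest pending request."""
    up = sorted(c for c in requests if c >= head)
    down = sorted((c for c in requests if c < head), reverse=True)
    return up + down

def clook_order(head, requests):
    """C-LOOK: serve upward, then jump to the lowest pending request and serve upward again."""
    up = sorted(c for c in requests if c >= head)
    low = sorted(c for c in requests if c < head)
    return up + low

def head_travel(head, order):
    """Total cylinders traveled when serving `order` starting from `head`."""
    travel = 0
    for cyl in order:
        travel += abs(cyl - head)
        head = cyl
    return travel

requests = [12, 14, 2, 7, 21, 8, 24]
print(head_travel(11, look_order(11, requests)))   # 35
print(head_travel(11, clook_order(11, requests)))  # 41
```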


V–SCAN(R)

V-SCAN(R) combines SCAN (or LOOK) and SSTF

– define an R-sized unidirectional SCAN window, i.e., C-SCAN, and use SSTF outside the window

– Example: V-SCAN(0.6)
  • makes a C-SCAN (C-LOOK) window over 60 % of the cylinders
  • uses SSTF for requests outside the window

– V-SCAN(0.0) equivalent with SSTF
– V-SCAN(1.0) equivalent with SCAN
– V-SCAN(0.2) is supposed to be an appropriate configuration
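One common way to realize the SSTF-to-SCAN continuum is to add a penalty of R times the number of cylinders to any request that would reverse the head's direction. The sketch below uses that effective-distance formulation, which is an assumption here and may differ in detail from the window described above; the starting cylinder (11), the initial upward direction, and the 25-cylinder disk are also hypothetical.

```python
def vscan_order(head, requests, r, num_cylinders=25, direction=1):
    """Pick the request with the smallest effective distance, where a request
    behind the head (relative to `direction`) is penalized by r * num_cylinders."""
    pending = list(requests)
    order = []
    while pending:
        def effective_distance(cyl):
            delta = cyl - head
            reversing = delta * direction < 0
            return abs(delta) + (r * num_cylinders if reversing else 0)
        nxt = min(pending, key=effective_distance)
        pending.remove(nxt)
        if nxt != head:
            direction = 1 if nxt > head else -1
        order.append(nxt)
        head = nxt
    return order

requests = [12, 14, 2, 7, 21, 8, 24]
print(vscan_order(11, requests, r=0.0))  # [12, 14, 8, 7, 2, 21, 24], same as SSTF
print(vscan_order(11, requests, r=1.0))  # [12, 14, 21, 24, 8, 7, 2], same as a LOOK sweep
```

With r=0.0 the penalty vanishes and the choice is purely by distance; with r=1.0 the penalty exceeds any possible seek, so the head never reverses while requests remain ahead of it.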

Shortest Positioning Time First (SPTF)

Given the complete knowledge of the actual mapping of data blocks onto the media, the scheduler can choose the request with the minimum positioning delay (combined seek and rotational latency)

SPTF, like SSTF, suffers from poor starvation resistance. To reduce response time variance, priority can be given to requests that have been in the pending queue for excessive periods of time

Aged Shortest Positioning Time First (ASPTF)

ASPTF(W) adjusts each positioning delay (Tpos) by subtracting a weighted value corresponding to the amount of time the request has been waiting for service (Twait)

Teff = Tpos – (W*Twait)

For large values of W, ASPTF behaves like FCFS
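The aging formula can be sketched as a selection function over pending requests. The tuple layout, request names, and millisecond values below are hypothetical and only serve to show how W shifts the choice from the nearest request to the oldest one.

```python
def asptf_pick(pending, now, w):
    """Return the name of the request minimizing Teff = Tpos - W * Twait.

    pending: list of (name, t_pos_ms, arrival_ms) tuples (hypothetical layout).
    """
    def t_eff(req):
        _, t_pos, arrival = req
        return t_pos - w * (now - arrival)
    return min(pending, key=t_eff)[0]

pending = [("old_far", 12.0, 0.0), ("new_near", 2.0, 90.0)]
print(asptf_pick(pending, now=100.0, w=0.0))  # new_near (plain SPTF)
print(asptf_pick(pending, now=100.0, w=0.5))  # old_far (waiting time dominates)
```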

Scheduling in Modern Disk Drives

Features of current disk drives that affect traditional scheduling algorithms:

– Host interface
– Data layout
– On-board cache

Ref: B.L. Worthington, Greg Ganger, Y.N. Patt: “Scheduling Algorithms for Modern Disk Drives”, ACM SIGMETRICS 1994

Host interface

Controller presents a request to the disk drive in terms of the starting logical block number and request size

Subsequent media access hidden from the host

Scheduling entities outside of the drive have little knowledge of overhead delays

Data Layout

Many systems assume sequentiality of LBN-to-PBN mappings in seek-reducing algorithms

Aggressive algorithms require highly accurate knowledge of the data layout which is typically hidden

Complexity of mappings increased by zoned recording, track/cylinder skew and defect management

On-Board Cache

Memory within disk drives has progressed from small speed-matching buffers to megabytes of cache memory

Disk logic typically prefetches data into cache to satisfy sequential read requests. This affects scheduling in two ways:
– Position of the head cannot be determined easily
– Requests that can be satisfied by cache could be given higher priority

Scheduling by Logical Block Number

• As expected, FCFS quickly saturates as workload increases

• SSTF provides lower mean response time

Scheduling by Logical Block Number

FCFS has the lowest coefficient of variation of response time for lighter workloads

As FCFS begins to saturate and its response time variance increases, C-LOOK emerges as a better algorithm for response time variance

Scheduling with Full knowledge

As W increases, the average response time slowly grows, though variance drops

Scheduling with Full Knowledge

Modern Disk Scheduling

In modern drives, C-LOOK best exploits the prefetching cache for workloads with significant read sequentiality

SSTF and LOOK perform better for random workloads

Powerful disk controllers use variants of Shortest Positioning Time First (SPTF).

Freeblock Scheduling

An approach to utilizing more of a disk’s potential media bandwidth

Fill rotational latency periods with useful media transfers for background applications

It has been observed that 20-50% of a never-idle disk’s bandwidth can often be provided to background applications without affecting foreground response times

Ref: Christopher R. Lumb, Jiri Schindler, Greg Ganger : “Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives”, OSDI , 2000

Disk-intensive background tasks

– Disk reorganization
– File system cleaning
– Backup
– Prefetching
– Write-back
– Integrity checking
– RAID scrubbing
– Virus detection
– Index reorganization
– …

Free Bandwidth

Time required for a disk media access

Taccess = Tseek + Trotate + Ttransfer

Freeblock scheduling uses the Trotate component of disk access to transfer additional data

Instead of just waiting for desired sector to arrive, this technique transfers the intermediate sectors

Steps in Freeblock Scheduling

Predict how much rotational latency will occur before the next foreground media transfer
– Requires detailed knowledge of disk attributes, including layout algorithms and time-dependent mechanical positioning overheads

Squeeze additional media transfers into that time

Get to the destination track in time for the foreground transfer

Anticipatory Disk Scheduling

Reorder available disk requests for performance by seek optimization, proportional resource allocation, etc.

Any policy needs multiple outstanding requests to make good decisions!

Ref: Sitaram Iyer, Peter Druschel : “Anticipatory scheduling : A disk scheduling framework to overcome deceptive idleness in synchronous I/O”, SOSP 2001

With enough requests…

[Figure: request locations on disk over time, with seeks between them; requests issued by process A and by process B]

E.g., Throughput = 21 MB/s (IBM Deskstar disk)

With synchronous I/O…

[Figure: schedule of requests issued by process A and by process B; annotations: “Next”, “forced!”, “too late!”]

E.g., Throughput = 5 MB/s

Deceptive idleness

Process A is about to issue its next request, but the scheduler hastily assumes that process A has no further requests!

Proportional scheduler

Allocate disk service in say 1:2 ratio:

Deceptive idleness causes 1:1 allocation:

[Figure: resulting service order alternates between the two processes: B A B A]

Anticipatory scheduling

Key idea: Sometimes wait for process whose request was last serviced.

Keeps disk idle for short intervals.

But with informed decisions, this:
– Improves throughput
– Achieves desired proportions

Cost-benefit analysis

Balance expected benefits of waiting against cost of keeping disk idle.

Tradeoffs sensitive to scheduling policy, e.g.,

1. seek optimizing scheduler

2. proportional scheduler

Statistics

For each process, measure:

1. Expected median and 95th-percentile thinktime

2. Expected positioning time

[Figure: distribution of thinktime (number of requests vs. thinktime), with the median and 95th percentile marked; labels “last” and “next”]

Cost-benefit analysis for seek optimizing scheduler

best := best available request chosen by scheduler

next := expected forthcoming request from process whose request was last serviced

Benefit = best.positioning_time - next.positioning_time

Cost = next.median_thinktime

Waiting_duration = (Benefit > Cost) ? next.95percentile_thinktime : 0
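The cost-benefit rule for the seek optimizing scheduler can be written as a small function. Parameter names and the millisecond values in the usage lines are hypothetical; only the comparison itself comes from the slide.

```python
def waiting_duration(best_positioning, next_positioning,
                     next_median_thinktime, next_p95_thinktime):
    """How long to keep the disk idle waiting for the last-serviced process.

    Benefit is the positioning time saved by serving the anticipated request
    instead of the best available one; cost is the expected (median) thinktime.
    """
    benefit = best_positioning - next_positioning
    cost = next_median_thinktime
    return next_p95_thinktime if benefit > cost else 0.0

# Far-away alternative, quick process: worth waiting up to the 95th percentile.
print(waiting_duration(8.0, 1.0, 2.0, 4.0))  # 4.0
# Nearby alternative: waiting costs more than it saves.
print(waiting_duration(3.0, 2.0, 2.0, 4.0))  # 0.0
```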

Proportional scheduler

Costs and benefits are different.

e.g., proportional scheduler:

Wait for process whose request was last serviced,

1. if it has received less than its allocation, and
2. if it has think time below a threshold (e.g., 3ms)

Waiting_duration = next.95percentile_thinktime
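A minimal sketch of the proportional scheduler's waiting rule follows. It assumes the thinktime compared against the threshold is the median, which the slide does not specify; the share values and parameter names are likewise hypothetical.

```python
def proportional_wait(received_share, allocated_share,
                      median_thinktime_ms, p95_thinktime_ms, threshold_ms=3.0):
    """Wait for the last-serviced process only if it is below its allocation
    and its thinktime (assumed: median) is under the threshold."""
    under_allocation = received_share < allocated_share
    thinks_quickly = median_thinktime_ms < threshold_ms
    return p95_thinktime_ms if (under_allocation and thinks_quickly) else 0.0

print(proportional_wait(0.50, 0.67, 1.0, 2.5))  # 2.5 (owed service, short thinktime)
print(proportional_wait(0.70, 0.67, 1.0, 2.5))  # 0.0 (already at or above allocation)
```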

Prefetch

Overlaps computation with I/O.

Side-effect: avoids deceptive idleness!

– Application-driven
– Kernel-driven

Conclusion

Anticipatory scheduling:

– overcomes deceptive idleness
– achieves significant performance improvement on real applications
– achieves desired proportions
– and is easy to implement!

Fairness: Evaluating disk scheduling algorithms

Storage system designers prefer to keep the queue length at disks small regardless of the load

When queuing threshold is reached at the disk, the controller or the device driver queues the requests until disk queue is processed

Low queuing threshold minimizes request starvation at the disk level when unfair scheduling algorithms are deployed

Ref: Alma Riska, Erik Riedel : “It’s not fair – evaluating efficient disk scheduling” , MASCOTS 2003

Results

Queuing more requests at the disk gives the scheduling algorithms more information, which they use for better disk resource utilization

Percentage of requests starved remains small even if longer queues build up at the disk

Overall request starvation is independent of the queuing threshold at the disk

Storage subsystem architecture

Queues at various levels

Outstanding requests queued at disk and at device driver in a single disk system

And, at the disks and the controller(s) in a multiple disk system

Impact of queuing thresholds

Average load of 16 outstanding requests in system

Average load of 64 outstanding requests in system

Response time distribution

The higher the load, the larger the gap between the performance of different scheduling algorithms

Fair and simple FCFS yields longest average request response time

Best performance obtained when increasing the queue threshold under SPTF

How about request starvation and variability in the request response time ?

Response time distribution

Tail of response time distribution with average load of 16 outstanding requests and threshold of 8

Tail of response time distribution with average load of 16 outstanding requests and threshold of 16

Observations

The majority of requests under FCFS exhibit long response times, while seek-reducing algorithms result in a majority of short response times

More than 90% of requests under SPTF have shorter response times than under FCFS, and only 1% exhibit up to double the response times seen under FCFS

Amount of starvation in position-based scheduling algorithms for both queuing thresholds is the same relative to FCFS

Hence, queuing more requests improves disk performance without introducing more request starvation

Scheduling at Device driver level

– Depends on workload and filesystem layout
  • E.g., with SCAN, seek times to sectors in the middle of the disk are shorter
– OS could choose between algorithms based on current queue
  • Likely to be expensive in CPU cycles
  • Queue changes as new requests arrive
– SSTF or SCAN are reasonable defaults
– Allow algorithm selection as part of OS tuning
  • FreeBSD: C-SCAN
  • Linux 2.2: SCAN
  • Linux 2.6: four different versions of elevator algorithm

Discussion: Scheduling at Multiple Locations

– Positioning-based optimizations best done within the disk
– Seek-based optimizations best done at device driver
– Why do scheduling within FS?
  • Device and DD independent
  • Aware of buffer cache
  • Application isolation
– Disk queue length crucial
  • Short queue results in degraded throughput: locally good but globally bad schedules
  • Long queue results in unfairness
– Non-work conservation can improve fairness and throughput!
  • Anticipatory scheduling
– Achieving proportional fairness non-trivial
  • Solutions based on hierarchy of queues, anticipatory scheduling can help
– Request coalescing can result in great improvement in throughput
  • FS and device driver are good places
  • Improve the sequentiality of the request stream seen by the disk
– Free-block scheduling can improve throughput
  • Can view this as a “corrector” for the non work conserving nature of disk
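Request coalescing, mentioned in the discussion above, can be illustrated with a short sketch that merges requests whose block ranges touch or overlap. The (start LBN, length) representation and the sample values are assumptions for this example, not an interface from the lecture.

```python
def coalesce(requests):
    """Merge (start_lbn, length) requests that are adjacent or overlapping."""
    merged = []
    for start, length in sorted(requests):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            prev_start, prev_len = merged[-1]
            end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, end - prev_start)  # extend the previous request
        else:
            merged.append((start, length))
    return merged

queue = [(100, 8), (108, 8), (300, 4), (116, 8)]
print(coalesce(queue))  # [(100, 24), (300, 4)]
```

Three small requests become one sequential transfer, so the disk pays positioning cost once instead of three times.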

Additional slides on free-block scheduling

Illustration of two freeblock scheduling possibilities

Desired Characteristics of tasks

Low priority
– Freeblock requests will only be served opportunistically
– Not appropriate for a set of equally important requests

Large sets of desired blocks
– The larger the set of disk locations desired, the higher the probability of finding a free bandwidth opportunity

Desired Characteristics …

No particular order of access
– Ordering requirements restrict the set of requests that can be considered by scheduler
– Effectiveness of freeblock scheduling directly related to number of outstanding requests

Small working memory footprints
– Need to buffer multiple blocks before processing creates artificial ordering requirements due to memory limitations

A Simple Interface

No call into the freeblock scheduler subsystem waits for a disk access. Calls return immediately

Freeblock read requests do not specify memory locations for read data. Completion callbacks provide pointers to buffers owned by freeblock scheduling subsystem

Applications

Scanning applications: tasks that scan large portions of disk contents, like report generation, RAID scrubbing, virus detection, tamper detection and backup

Internal storage optimization: reorganizing stored data to improve performance, e.g., placing related data contiguously, placing hot data near the center of the disk, replicating data for subsequent reads, etc.

Prefetching and prewriting: prewriting is early writing out of dirty blocks under the assumption they will not be overwritten or deleted before writeback is necessary

Availability of free bandwidth

Availability of potential free bandwidth = Total bandwidth * Fraction of time spent on rotational latency
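As a worked example of this formula, the sketch below derives the rotational-latency fraction from the three components of an access (seek, rotate, transfer) and scales the total bandwidth by it. All numbers are hypothetical and chosen so that rotational latency is one third of the access time.

```python
import math

def potential_free_bandwidth(total_bw_mb_s, t_seek_ms, t_rotate_ms, t_transfer_ms):
    """Total bandwidth times the fraction of access time spent on rotational latency."""
    t_access = t_seek_ms + t_rotate_ms + t_transfer_ms
    return total_bw_mb_s * (t_rotate_ms / t_access)

# Hypothetical disk: 30 MB/s media bandwidth, equal seek/rotate/transfer times.
free_bw = potential_free_bandwidth(30.0, 3.0, 3.0, 3.0)
print(round(free_bw, 2))  # 10.0
```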

Results in the next few slides obtained using Disksim

Default disk drive used is Quantum Atlas 10k

Default workload consists of 10,000 foreground requests issued one at a time with uniform distribution of starting locations

Impact of disk characteristics

Overall, about one third of each disk’s head usage is on rotational latency

Characteristics of simulated disk drives

Impact of workload characteristics

Impact of scheduling algorithm

• C-LOOK and SSTF reduce seek times without affecting transfer times and rotational latencies

• SPTF tends to decrease both overhead components. Figure shows rotational latency decreases to 22%

Feasibility

Freeblock scheduling relies heavily on the ability to accurately predict positioning delays

Firmware of most disk drives now supports SPTF which requires similar predictions

Freeblock scheduling resembles advanced disk schedulers for environments with a mixed workload of real-time and non-real-time activities

Additional slides on anticipatory scheduling - experimental evaluation

Experiments

• FreeBSD-4.3 patch + kernel module (1500 lines of C code)

• 7200 rpm IDE disk (IBM Deskstar)

• Also in the paper: 15000 rpm SCSI disk (Seagate Cheetah)

Microbenchmark

[Figure: throughput (MB/s, 0 to 25) for Sequential, Alternate, and Random-within-file access patterns, each with and without prefetch; Original vs. Anticipatory]

Real workloads

What’s the impact on real applications and benchmarks?

– Andrew benchmark
– Apache web server (large working set)
– Database benchmark

• Disk-intensive
• Prefetching enabled

Andrew filesystem benchmark

Overall 8% performance improvement

[Figure: execution time (minutes, 0 to 30) per Andrew benchmark phase, Original vs. Anticipatory. Per-phase labels: mkdir -16%, cp -5%, stat -5%, scan -54%, gcc +1.7%]

62 (or more) concurrent clients

Apache web server

[Figure: throughput (MB/s, 0 to 4), no prefetch: read +29%, mmap +71%]

• CS.Berkeley trace

• Large working set

• 48 web clients
