Parallel disk head emulation

Andy Twigg, Computer Lab

Outline

● Outline
● Parallel disk models
● Emulations
● Open problems
● Bibliography
– [Sanders et al, soda00, spaa00, soda02] and related work on balanced allocations [Czumaj, Berenbrink,]

Parallel disk models

● Ideally: want a large disk that can access D arbitrary blocks in one I/O (parallel disk head model)
● Reality: have kD disks that can each access 1 block per I/O
– But can access them in parallel (parallel disk model)
● Can we emulate the parallel head model on pdm?
– Quality of emulation: throughput and delay of requests, space overhead, ...

Assumptions

● One global buffer of size m
– Shared among all disks
● Can access exactly one block per disk per I/O (no rotational latency, seeking, ...)
● Redundancy: each block will be stored on two disks
– More generally, r out of (r+1)

Emulation: queued writing

● Assume pairwise-independent hash functions f, g: [n] → [n]. Consider D queues Q_1, ..., Q_D
● Each block i will be stored at f(i), g(i)
● Write((1−ε)D blocks): append blocks to queues, keep writing from queues until ∑_i |Q_i| < O(D/ε)
● Theorem [Sanders]:
– E[time to write (1−ε)D blocks] < 1 + exp(−Ω(D))
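A small simulation can make the queued-writing loop concrete. The sketch below is illustrative only and rests on several assumptions that are not in the slides: random draws stand in for the hash values f(i), g(i); each block is appended to the shorter of its two candidate queues; and the O(D/ε) bound is taken with constant 1. The name queued_write_steps is hypothetical.

```python
import random
from collections import deque


def queued_write_steps(num_disks, eps, rounds, seed=0):
    """Return the average number of I/O steps used per batch of (1-eps)*D blocks."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_disks)]
    batch = int((1 - eps) * num_disks)
    threshold = num_disks / eps  # assumption: O(D/eps) with constant 1
    steps = 0
    next_block = 0
    for _ in range(rounds):
        for _ in range(batch):
            # Random draws stand in for the hash values f(i), g(i).
            a, b = rng.randrange(num_disks), rng.randrange(num_disks)
            # Assumption: the block joins the shorter candidate queue.
            target = a if len(queues[a]) <= len(queues[b]) else b
            queues[target].append(next_block)
            next_block += 1
        while True:
            # One I/O step: every non-empty queue writes one block.
            for q in queues:
                if q:
                    q.popleft()
            steps += 1
            if sum(len(q) for q in queues) < threshold:
                break
    return steps / rounds
```

Calling queued_write_steps(50, 0.25, 200) reports how many I/O steps a batch costs on average in this toy model; a value close to 1 is what the theorem would lead one to expect.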

Aside: allocation processes

● E.g. [Azar, Broder, Karlin, Upfal STOC94], [Mitzenmacher 96], [Czumaj, Berenbrink, ..]
● m bins, n balls; ball i can go to 2 bins f(i), g(i) chosen independently and uniformly at random
● Balls arrive online, each thrown into the least-loaded of f(i), g(i)
● Interested in max_j load(j)
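The two-choice process is easy to try out. The following is a minimal sketch, assuming fresh random draws in place of the hash functions f and g; the function name max_load and its choices parameter (1 for plain random placement, 2 for the process above) are illustrative, not from the cited papers.

```python
import random


def max_load(num_bins, num_balls, choices, seed=0):
    """Throw balls online; each goes to the least-loaded of `choices` random bins."""
    rng = random.Random(seed)
    load = [0] * num_bins
    for _ in range(num_balls):
        # Random draws stand in for f(i), g(i).
        candidates = [rng.randrange(num_bins) for _ in range(choices)]
        best = min(candidates, key=lambda j: load[j])
        load[best] += 1
    return max(load)
```

Comparing max_load(1000, 1000, 1) with max_load(1000, 1000, 2) shows the gap between one and two choices on a single run.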

Allocation graphs and schedules

● Allocation graph G_A: nodes are disks (bins), edges are blocks (balls). Undirected edge e = {i,j} means that block e is stored on disks i, j.
● Schedule: given a set of requested edges S, G_S is an orientation of G_A[S].
● Load(disk j) = indegree(j) in G_S
● #I/O steps = load(schedule) = max_j indegree(j)
→ maintain online an orientation of low indegree
– If blocks stored at several disks, G_A is a hypergraph
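As a concrete reading of these definitions, a schedule can be represented by listing, for each requested block, the disk chosen to serve it (the head of the oriented edge). The helper below is a hypothetical sketch under that representation.

```python
from collections import Counter


def schedule_load(heads):
    """Number of I/O steps for a schedule: the maximum indegree over all disks.

    `heads` lists, for each requested block, the disk that serves it.
    """
    indegree = Counter(heads)
    return max(indegree.values(), default=0)
```

For example, serving blocks {0,1}, {1,2}, {0,2} from disks 1, 1, 2 gives load 2, while serving them from disks 0, 1, 2 gives load 1.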

Warm up

● Fact: Every connected component of G ~ G(D, (1/2−ε)D) is either a tree or a tree with one cycle whp
● → Max load of a schedule with D/2 requests?
● Orienting a tree: pick a root r, orient edges away from r
● Orienting a tree plus one cycle-closing edge {u,v}: orient that edge as (v,u), choose u as root and orient the remaining tree away from u
● Strategy: Divide requests into subsequences of length D/2 and schedule each as above
– Max load 1 for each D/2 requests → load 2N/D for N requests
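The orientation rule for trees and single-cycle components can be sketched as follows. This is one possible implementation, not taken from the slides: it uses union-find to spot the cycle-closing edge of a component, points that edge at one endpoint u, and then orients the remaining tree away from u by breadth-first search. The function name and the ValueError for components with more than one cycle are assumptions.

```python
from collections import defaultdict, deque


def orient_pseudoforest(num_nodes, edges):
    """Map each edge index to its head so that every indegree is at most 1.

    Assumes each connected component is a tree or a tree plus one extra edge.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    adj = defaultdict(list)
    cycle_edges = []
    for idx, (u, v) in enumerate(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            cycle_edges.append((idx, u))  # this edge closes a cycle
        else:
            parent[ru] = rv
            adj[u].append((v, idx))
            adj[v].append((u, idx))

    head, forced_root = {}, {}
    for idx, u in cycle_edges:
        rep = find(u)
        if rep in forced_root:
            raise ValueError("component has more than one cycle")
        forced_root[rep] = u
        head[idx] = u  # orient the cycle edge as (v, u); u becomes the root

    seen = set()
    # Visit forced roots first so their components are rooted at u.
    for root in list(forced_root.values()) + list(range(num_nodes)):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            x = queue.popleft()
            for y, idx in adj[x]:
                if y not in seen:
                    seen.add(y)
                    head[idx] = y  # tree edge points away from the root
                    queue.append(y)
    return head
```

Every non-root node receives exactly its parent edge, and a root receives at most the cycle edge, so each disk serves at most one request.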


Max load 1.2*N/D

● Lemma [Pittel, Spencer, Wormald]: G ~ G(D, 1.67D) has no 3-core whp
● Strategy: Repeatedly pick the node with smallest remaining degree (at most 2 when there is no 3-core), orient its edges toward it and remove it
● Max load 2 for each 1.67D requests
● BUT: all these must buffer requests before scheduling them
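A peeling orientation of this kind can be sketched in a few lines. The version below always removes a node of smallest remaining degree, which is at most 2 whenever the graph has no 3-core, so no disk receives more than 2 requests in that case. It is a simple quadratic-time sketch with a hypothetical function name, not an optimized implementation.

```python
from collections import defaultdict


def peel_orientation(num_nodes, edges):
    """Orient edges by repeatedly removing a node of smallest remaining degree.

    Returns (head, max_load), where head maps edge index to the serving node.
    """
    incident = defaultdict(set)
    for idx, (u, v) in enumerate(edges):
        incident[u].add(idx)
        incident[v].add(idx)
    alive = set(range(num_nodes))
    indegree = [0] * num_nodes
    head = {}
    while alive:
        node = min(alive, key=lambda x: len(incident[x]))
        for idx in list(incident[node]):
            head[idx] = node  # all remaining edges point at the removed node
            indegree[node] += 1
            u, v = edges[idx]
            other = v if u == node else u
            incident[other].discard(idx)
        incident[node].clear()
        alive.remove(node)
    return head, max(indegree, default=0)
```

On a triangle with a pendant edge the result has load 2, while on the complete graph K4 (which is its own 3-core) the first removed node already receives 3 edges.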

Asynchronous reading: Shortest-queue first

● Write(block i): buffer i, write i to both f(i), g(i) when each becomes free
● Read(block i): buffer the request at the least-loaded of f(i), g(i)
– each disk serves its queue in FIFO order
● Requests are scheduled online
● Conjecture [Sanders]: Delay O(log 1/ε) is achievable for average arrival rate (1−ε)D
– If 2 copies of each block allowed (Θ(1/ε) for 1 copy)
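The read path of shortest-queue-first can be explored empirically. The sketch below is a toy model under stated assumptions: exactly (1−ε)D read requests arrive per time step, random draws replace f(i), g(i), writes are ignored, and delay is counted in I/O steps including the step that serves the request. The name simulate_sqf is hypothetical, and the output says nothing about the conjectured bound beyond one simulated run.

```python
import random
from collections import deque


def simulate_sqf(num_disks, eps, steps, seed=0):
    """Return (mean delay, max delay) of served read requests."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_disks)]
    per_step = int((1 - eps) * num_disks)
    delays = []
    for t in range(steps):
        for _ in range(per_step):
            # Random draws stand in for the two disks f(i), g(i).
            a, b = rng.randrange(num_disks), rng.randrange(num_disks)
            q = queues[a] if len(queues[a]) <= len(queues[b]) else queues[b]
            q.append(t)  # remember the arrival time
        for q in queues:
            if q:
                # FIFO service: one request per disk per I/O step.
                delays.append(t - q.popleft() + 1)
    return sum(delays) / len(delays), max(delays)
```

Running simulate_sqf(100, 0.2, 500) and then repeating with smaller ε gives a feel for how the delay grows as the arrival rate approaches D.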

Max load O(log log n)

● Easier proof for lightly loaded case (n<d)
● Let G ~ G(n, n/8) and consider the following:
    while there exists a node of degree ≤ 13
        for each such node
            orient its edges towards it & remove
● Thm: max load = O(log log n)
– Claim 1: balls added at step i have height ≤ 13i
– Claim 2: largest connected component in G has size O(log n)
– Claim 3: procedure terminates in O(log log n) steps
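The round-based procedure above translates fairly directly into code. The sketch below only mirrors the loop structure: it reports the orientation, the number of rounds, and how many edges were left unoriented, and makes no attempt to reproduce the analysis in the claims. The function name and the tie-breaking (an edge between two nodes removed in the same round goes to whichever is processed first) are assumptions.

```python
def peel_in_rounds(num_nodes, edges, threshold=13):
    """Remove all nodes of degree <= threshold in rounds, orienting edges toward them.

    Returns (head, rounds, leftover_edges).
    """
    incident = [set() for _ in range(num_nodes)]
    for idx, (u, v) in enumerate(edges):
        incident[u].add(idx)
        incident[v].add(idx)
    alive = set(range(num_nodes))
    head = {}
    rounds = 0
    while True:
        batch = [x for x in alive if len(incident[x]) <= threshold]
        if not batch:
            break
        rounds += 1
        for node in batch:
            for idx in list(incident[node]):
                head[idx] = node  # edge is oriented toward the removed node
                u, v = edges[idx]
                incident[u].discard(idx)
                incident[v].discard(idx)
            alive.remove(node)
    return head, rounds, len(edges) - len(head)
```

With threshold=1 on a path of five nodes the procedure needs three rounds (endpoints, then their neighbours, then the middle node), which illustrates how the round count tracks the depth of the peeling.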

Neat: majority method

● Use 3 (3-wise independent) hash functions f, g, h
● Writing: Write block i to the least-loaded two of f(i), g(i), h(i) along with a timestamp
● Reading: Read i from the least-loaded two of f(i), g(i), h(i) and return the latest version
● Max load O(log log n / log n) for writing and reading
● + writes and reads can be scheduled together
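The reason the majority method returns the latest version is that any two 2-element subsets of {f(i), g(i), h(i)} share a disk. The sketch below illustrates this with an in-memory model. Everything about it is an assumption for illustration: the class name MajorityStore, the seeded random sample standing in for the three hash functions (it is not 3-wise independent), and the use of a cumulative access counter as the "load" of a disk.

```python
import itertools
import random


class MajorityStore:
    """Toy model: write to 2 of 3 candidate disks, read from 2 of 3."""

    def __init__(self, num_disks):
        self.disks = [dict() for _ in range(num_disks)]
        self.load = [0] * num_disks  # assumption: cumulative accesses as load
        self.clock = itertools.count(1)

    def _candidates(self, block):
        # Stand-in for f(i), g(i), h(i): three distinct disks per block.
        return random.Random(block).sample(range(len(self.disks)), 3)

    def _two_least_loaded(self, block):
        return sorted(self._candidates(block), key=lambda d: self.load[d])[:2]

    def write(self, block, value):
        stamp = next(self.clock)
        for d in self._two_least_loaded(block):
            self.disks[d][block] = (stamp, value)
            self.load[d] += 1

    def read(self, block):
        copies = []
        for d in self._two_least_loaded(block):
            self.load[d] += 1
            if block in self.disks[d]:
                copies.append(self.disks[d][block])
        # The newest timestamp wins; stale copies are ignored.
        return max(copies, key=lambda c: c[0])[1] if copies else None
```

Because the loads shift between operations, a read may hit a different pair of disks than the last write did, and the timestamp is what lets it discard a stale copy.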

Virtual disk model

● Want: A set of virtual disks V_1...V_m, each with specified bandwidth b(V_i) and capacity c(V_i)
● Have: a collection of physical disks D_1...D_n, each with bandwidth 1 and capacity c
● Efficient emulation of virtual disk model?
– Admission control + (1−ε)-bandwidth emulation for pdhm would imply ∑_i b(V_i) < (1−ε)n and ∑_i c(V_i) < cn/2 are sufficient conditions for vdm emulation
● Adding/removing virtual disks, changing capacities, ...
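The two sufficient conditions amount to a simple admission test. The helper below is a hypothetical sketch: it takes virtual disks as (bandwidth, capacity) pairs and checks the bandwidth sum against (1−ε)n and the capacity sum against cn/2, where the factor 1/2 reflects storing two copies of each block.

```python
def admits(virtual_disks, num_physical, capacity, eps):
    """Check the sufficient conditions for emulating the given virtual disks.

    `virtual_disks` is a list of (bandwidth, capacity) pairs.
    """
    total_bandwidth = sum(b for b, _ in virtual_disks)
    total_capacity = sum(c for _, c in virtual_disks)
    return (total_bandwidth < (1 - eps) * num_physical
            and total_capacity < capacity * num_physical / 2)
```

For instance, with 5 physical disks of capacity 100 and ε = 0.1, virtual disks with bandwidths 1.5 and 2.0 and capacities 10 and 20 pass the test, whereas a total bandwidth of 5 does not.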

Extensions

Open:
● Prove good delay bounds for asynchronous reading
● Deterministic guarantees (expanders?)
● Emulation of virtual disk model
● Handling rotational latencies, seek times
