Page 1

RADIO: Managing the Performance of Large, Distributed Storage Systems

Scott A. Brandt and

Carlos Maltzahn, Anna Povzner, Roberto Pineiro, Andrew Shewmaker, and Tim Kaldewey

Computer Science Department, University of California, Santa Cruz

and Richard Golding and Ted Wong, IBM Almaden Research Center

UPC—July 7, 2009

Page 2

Who am I?

• Professor, Computer Science Department, UC Santa Cruz

• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)

• Director, UCSC Systems Research Laboratory (SRL)

• Background

• 1999 Ph.D. CS, Colorado

• 1987/1993 B. Math/MS CS, Minnesota

• 1982-1994 Programmer/Research Scientist/VP, CPT, B-Tree, Honeywell SRC, Theseus Research, Alliant TechSystems RTS, Secure Computing

• My Research

• High-performance petascale storage

• Real-time systems

• Performance management and virtualization

• Active object-based storage

• Other Research

• Secure operating systems

• Asynchronous circuits

• Real-time image processing

Page 3

Distributed systems need performance guarantees

• Many distributed systems and applications need (or want) I/O performance guarantees

• Multimedia, high-performance simulation, transaction processing, virtual machines, service level agreements, real-time data capture, sensor networks, ...

• Systems tasks like backup and recovery

• Even so-called best-effort applications

• Providing such guarantees is difficult because it involves:

• Multiple interacting resources

• Dynamic workloads

• Interference among workloads

• Non-commensurable metrics: CPU utilization, network throughput, cache space, disk bandwidth

Page 4

In a nutshell

• Big distributed systems

• Serve many users/jobs

• Process petabytes of data

• Data center design

• Use rules of thumb

• Over-provision

• Isolate

• Ad hoc performance management approaches create marginal storage systems that cost more than necessary

• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network

Page 5

Outline

1. Problem: Managing the performance of large, distributed storage systems

2. Approach: End-to-end performance management

3. Model: RAD

4. Instances:

• Disk

• Network

• Buffer cache

5. Application: Data Center Performance Management and Monitoring

Page 6

End-to-end I/O performance guarantees

• Goal: Improve end-to-end performance management in large distributed systems

• Manage performance

• Isolate traffic

• Provide high performance

• Targets: High-performance storage (LLNL), data centers (LANL), satellite communications (IBM), virtual machines (VMware), sensor networks, ...

• Approach:

1. Develop a uniform model for managing performance

2. Apply it to each resource

3. Integrate the solutions

Page 7

Our current target

• High-performance I/O

• From client, across network, through server, to disk

• Up to hundreds of thousands of processing nodes

• Up to tens of thousands of I/O nodes

• Big, fat, network interconnect

• Up to thousands of storage nodes with cache and disk

• Challenges

• Interference between I/O streams, variability of workloads, variety of resources, variety of applications, legacy code, system management tasks, scale

Page 8

Stages in the I/O path

[Figure: the I/O path from applications through the client's I/O scheduler and cache, across the network transport, into the storage server's cache, and down to the disk. Annotations: flow control with one client; connection management between clients; I/O selection and head scheduling; prefetch and writeback based on utilization and QoS; integration between client and server caches.]

1. Disk I/O

2. Server cache

3. Flow control across network

• Within one client’s session and between clients

4. Client cache

Page 9

System architecture

[Figure: clients send reservation requests to a QoS broker, which places utilization reservations at the network, server caches, and disks of the storage servers; admitted I/O then flows along the same path.]

• Client: task, host, distributed application, VM, file, ...

• Reservations made via broker

• Specify workload: throughput, read/write ratio, burstiness, etc.

• Broker does admission control

• Requirements + workload are translated to utilization

• Utilizations are summed to see if they are feasible

• Once admitted, I/O streams are guaranteed (subject to workload adherence)

• Disk, caches, network controllers maintain guarantees
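A minimal sketch of the broker's admission test described above; the translate step and all names are illustrative assumptions, not the RADIO implementation:

    def admit(admitted_utilizations, request, translate):
        """Admit the I/O stream described by `request` if the system stays
        feasible; once admitted, its guarantee holds subject to workload
        adherence."""
        u = translate(request)  # requirements + workload -> utilization
        if sum(admitted_utilizations) + u <= 1.0:  # feasibility: sum <= 100%
            admitted_utilizations.append(u)
            return True
        return False  # infeasible: reject or renegotiate the reservation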


Page 10

Achieving robust guaranteeable resources

• Goal: Unified resource management algorithms capable of providing

• Good performance

• Arbitrarily hard or soft performance guarantees with

• Arbitrary resource allocations

• Arbitrary timing / granularity

• Complete isolation between workloads

• All resources: CPU, disk, network, server cache, client cache

➡ Virtual resources indistinguishable from “real” resources with fractional performance

Page 11

Isolation is key

• CPU

• 20% of a 3 GHz CPU should be indistinguishable from a 600 MHz CPU

• Running: compiler, editor, audio, video

• Disk

• 20% of a disk with 100 MB/second bandwidth should be indistinguishable from a disk with 20 MB/second bandwidth

• Serving: 1 stream, n streams, sequential, random

Page 12

Scott’s epistemology of virtualization

• Virtual Machines and LUNs provide good HW virtualization

• Question: Given perfect HW virtualization, how can a process tell the difference between a virtual resource and a real resource?

• Answer: By not getting its share of the resource when it needs it

Page 13

Observation

• Resource management consists of two distinct decisions

• Resource allocation: how many resources to allocate?

• Dispatching: when to provide the allocated resources?

• Most resource managers conflate them

• Best-effort, proportional-share, real-time

Page 14

Separating them is powerful!

• Separately managing resource allocation and dispatching gives direct control over the delivery of resources to tasks

• Enables direct, integrated support of all types of timeliness needs

[Figure: timeliness classes arranged by whether resource allocation and dispatching are constrained or unconstrained: hard real-time, soft real-time (SRT, where deadlines may be missed), rate-based, and best-effort (CPU-bound and I/O-bound).]

Page 15

The resource allocation/dispatching (RAD) scheduling model

[Figure: in the RAD model, a process is described by a rate (its share of resources) and deadlines (times at which the allocation must equal the share); these define a series of jobs with budgets and deadlines, which are handed to the dispatcher.]

Page 16

Supporting different timeliness requirements with RAD

[Figure: a scheduling policy maps each timeliness class onto RAD's rates and deadlines: hard real-time from period and WCET, soft real-time from period and ACET, rate-based from rate and bounds, best-effort from priority. The resulting set of jobs with budgets and deadlines feeds the dispatcher, the scheduling mechanism in the runtime system.]

Page 17

Rate-Based Earliest Deadline (RBED) CPU scheduler

[Figure: RBED instantiates RAD with EDF plus timers as the scheduling mechanism: the scheduling policy produces rates and deadlines, yielding a set of jobs with budgets and deadlines for the runtime system.]

• Processes have rate & period

• ∑ rates ≤ 100%

• Periods based on processing characteristics, latency needs, etc.

• Jobs have budget & deadline

• budget = rate * period

• Deadlines based on period or other characteristics

• Jobs dispatched via Earliest Deadline First (EDF)

• Budgets enforced with timers

• Guaranteeing all budgets & deadlines guarantees all rates & periods
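As a minimal sketch of this dispatching loop (the dict layout and the tick-based budget timer are illustrative assumptions, not the RBED kernel code):

    def dispatch_edf(jobs, tick):
        """Run the earliest-deadline job for up to one tick.
        jobs: list of dicts with 'deadline' and 'budget' (budget = rate * period).
        The tick bound stands in for the timers that enforce budgets."""
        job = min(jobs, key=lambda j: j['deadline'])   # EDF choice
        ran = min(tick, job['budget'])                 # never exceed the budget
        job['budget'] -= ran
        if job['budget'] <= 0:
            jobs.remove(job)                           # done until next release
        return job, ran

    # Usage: 30% and 20% rates with 100 ms periods give jobs with
    # budgets of 30 and 20 (ms) and deadlines one period out.
    jobs = [{'deadline': 100, 'budget': 30}, {'deadline': 100, 'budget': 20}]
    while jobs:
        dispatch_edf(jobs, tick=10)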

Page 18

Adapting RAD to disk, network, and buffer cache

• Fahrrad—Guaranteed disk request scheduling (Anna Povzner, UCSC)

• RADoN—Guaranteeing storage network performance (Andrew Shewmaker, UCSC and LANL)

• Radium—Buffer management for I/O guarantees (Roberto Pineiro, UCSC)


Page 19

Guaranteed disk request scheduling

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Good I/O performance

• Challenging because disk I/O is:

• Stateful

• Non-deterministic

• Non-preemptable, and

• Best- and worst-case times vary by 3–4 orders of magnitude

Page 20

Fahrrad

• Manages disk time instead of disk throughput

• Adapts RAD/RBED to disk I/O

• Reorders aggressively to provide good performance, without violating guarantees

[Figure: I/O streams A, B, C, and a best-effort (BE) stream enter Fahrrad, which schedules their requests onto the disk.]

Page 21

A bit more detail

• Reservations in terms of disk time utilization and period (granularity)

• All I/Os feasible before the earliest deadline in the system are moved to a Disk Scheduling Set (DSS)

• I/Os in the DSS are issued in the most efficient order

• The I/O charging model is critical

• Overhead reservation ensures exact utilization

• 2 worst-case request times (WCRTs) per period for “context switches”

• 1 WCRT per period to ensure the last I/Os complete

• 1 WCRT for the process with the shortest period, due to non-preemptability
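A minimal sketch of DSS selection under these rules; the per-I/O WCRT costing and all names are assumptions for illustration, not Fahrrad's actual code:

    def disk_scheduling_set(streams, wcrt, now):
        """Collect the I/Os that can feasibly complete before the earliest
        deadline in the system; within the DSS they may be reordered
        aggressively (e.g., by disk location) without violating guarantees.
        streams: objects with .deadline, .budget (seconds), .queue (I/Os)."""
        horizon = min(s.deadline for s in streams)   # earliest deadline
        slack = horizon - now                        # disk time until then
        dss = []
        for s in sorted(streams, key=lambda s: s.deadline):
            # each I/O is costed at up to one WCRT against the budgets
            n = max(0, min(len(s.queue), int(min(s.budget, slack) // wcrt)))
            dss.extend(s.queue[:n])
            slack -= n * wcrt
        return dss  # issue these in the most seek-efficient order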

Page 22

Fahrrad outperforms Linux

• Workload

• Media 1: 400 sequential I/Os per second (20%)

• Media 2: 800 sequential I/Os per second (40%)

• Transaction: short bursts of random I/Os at random times (30%)

• Background: random (10%)

• Result: Better isolation AND better throughput

[Figure: per-stream throughput under standard Linux vs. Linux with Fahrrad.]

Page 23

New work: virtual disks

• Provide workload-independent performance guarantees

• Isolate from other workloads concurrently accessing the device

• LUNs virtualize storage capacity

• Fahrrad virtualizes storage performance

Page 24

Fahrrad virtual disks

• Implemented with the Fahrrad real-time I/O scheduler

• Guarantee a reserved and isolated share of the time on the storage device

• Hard guarantees on performance isolation

• Virtual disk throughput same as equivalent standalone throughput

• Amount of data transferred:

• ∀i: Di(x%, t) = Di(100%, x%·t) (e.g., a virtual disk with a 20% share transfers as much data in 150 s as the standalone disk does in 30 s)

[Figure: share of disk time over time for the virtual disk vs. the standalone disk.]

Page 25

Guaranteeing performance isolation

• Virtual disk reservation: disk share (utilization) and time granularity (period)

• Account for all extra (inter-stream) seeks

• Reserve overhead utilization to do them

• Charge each I/O stream for all of the time it uses, including inter- and intra-stream seeks

• Reservation = Disk Share + Overhead utilization

[Figure: disk time partitioned among reservations of 25% at 1 s, 30% at 250 ms, and 19% at 1 min granularity.]

Page 26

Extra seeks

• Intra-stream seeks caused by workload

• Inter-stream seeks caused by reservations

• Low time granularity causes more frequent seeks

• At most two extra seeks per stream per period

• To and away from the stream, to meet deadlines

➡ Extra seeks caused by un-queued requests

Page 27

Charging model and reservations

• Charge streams responsible for inter-stream seeking

• From overhead: for seeks caused by reservations

• From reservation: for seeks caused by bursty behavior

• Overhead utilization needed for hard guarantees

• Overhead utilization = WCRT/p + 2*WCRT/p + WCRT/p = 4*WCRT/p

• We can trade off hard guarantees for lower overhead by assuming less than worst-case request time

[Figure: the overhead terms guarantee the reserved utilization, account for inter-stream seeks, and maintain two outstanding requests.]
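As a rough worked example (the WCRT value is an assumption for illustration, not from the slides): with WCRT = 25 ms and period p = 1 s,

    overhead utilization = 0.025 + 2 * 0.025 + 0.025 = 0.10

so 10% of the disk's time is reserved as overhead on top of the stream's disk share; halving the period doubles this cost.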

Page 28

Performance: guaranteeing throughput

• Throughput is determined by reservation and workload

• Each virtual disk reserves 20% with 1-second granularity

[Figure: amount of data transferred (MB) vs. run length of the semi-sequential stream (MB) for sequential, semi-sequential, and random virtual disks (each 20% share, 150 s).]

Page 29

Performance: guaranteeing throughput

• Throughput is determined by reservation and workload

• Each virtual disk reserves 20% with 1-second granularity

[Figure: amount of data transferred (MB) vs. run length of the semi-sequential stream (MB); each virtual disk (20% share, 150 s) tracks its standalone equivalent (100%, 30 s) for sequential, semi-sequential, and random workloads.]

Page 30

Performance: Controlling throughput

• Each virtual disk is isolated from the other

• Performance is fully determined by the reservation and workload

[Figure: data transferred (MB) and disk time reservation (%) vs. the reserved share for the sequential stream.]

Page 31

Performance: Controlling latency

• Reservation granularity bounds latency: period = latency/2 (e.g., a 250 ms period bounds latency at 500 ms)

• A virtual device serves a periodic semi-sequential stream and shares storage with a random background stream; four experiments with different period reservations.

[Figure: fraction of I/Os vs. latency for the four period reservations, with upper bounds marked; utilization vs. the period of the virtual disk.]

Page 32

Performance: Isolation guarantees

• Hard guarantees require high overhead (proportional to reservation granularity)

• Three virtual disks each serving one sequential stream with many outstanding I/Os share a storage system with a random background stream.

[Figure: disk time reservation (%) and data transferred (MB) vs. the period of virtual disk 3.]

Page 33

Performance: Soft guarantees w/isolation

• Overhead based on less than worst-case I/O time

• Increased short-term throughput variation

• A virtual disk (10%, 1 s) runs one sequential stream with a 400 I/O/sec arrival rate and shares the system with 5 virtual disks, each running one random stream.

[Figure: disk time reservation (%) and data transferred (%) vs. the percentile of observed service times used in place of the worst case.]

Page 34

Performance: Soft guarantees w/isolation

• Linux fails to support Cello99 (variation up to 30% from standalone)

• Fahrrad Virtual Disks provide Cello99 and OpenMail performance close to standalone

• Cello99 and OpenMail virtual disks share the system with a random background stream.

[Figure: throughput (I/Os per second) over time for Cello99 and OpenMail under Linux vs. Fahrrad Virtual Disks.]

Page 35

Fahrrad Virtual Disks

1. Guarantee throughput by accounting for overhead and guaranteeing utilization

2. Guarantee isolation between workloads by accurately accounting for all disk time

3. Provide high throughput (w/guarantees) by minimizing interference between workloads

4. Result: performance of virtual disk depends only on reservation, workload, and performance of device

Page 36

Guaranteeing storage network performance

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Good I/O performance

• Challenging because network I/O is:

• Distributed

• Non-deterministic (due to collisions or switch queue overflows)

• Non-preemptable

• Assumption: closed network

Page 37

What we want

[Figure: three clients reach three servers across a shared network, with guaranteed shares of 30%, 50%, and 20%.]

Page 38

What we have

• Switched fat tree w/full bisection bandwidth

• Issue 1: Capacity of shared links

• Issue 2: Switch queue contention

Page 39

Congestion in a simple switch model

• Each transmit port on the switch is a collision domain

[Figure: an 8-port switch with tx/rx ports, a shared FIFO per transmit port, and the switch fabric connecting them; each transmit port is a separate collision domain.]

Page 40

Congestion in a simple switch model

• One of the packets arriving at the same switch transmit port is delayed on the queue

[Figure: ports 1 and 2 both send to port 5; 1 and 2 congest at that transmit port's shared queue.]

Page 41

Congestion in a simple switch model

• Delayed packets from unrelated streams affect each other on the queue

[Figure: ports 1 and 2 send to port 5 while ports 3 and 4 send to port 8; 1 and 2 congest, 3 and 4 congest, and the delayed packets of 2 and 4, though from unrelated streams, also affect each other in the queue.]

Page 42

TCP

• “Those who do not understand TCP are destined to reimplement it” (Jon Postel)

• Ack-clocked flow control

• Packet loss based congestion control

• Sawtooth throughput

• Incast throughput collapse

Page 43

Network resource usage measurements

• Round trip time RTTi = Ci - Si

• Combines queueing effects on forward and reverse path + response time

[Figure: the sender stamps send time Si and completion time Ci on its own clock (clock 1); the receiver's clock (clock 2) is not involved.]

Page 44

Network resource usage measurements

• One-way delay OWDi = Ri - Si

• Isolates queueing effects on the forward path, but

• Requires synchronized clocks

[Figure: the send time Si is stamped by the sender's clock (clock 1) and the receive time Ri by the receiver's clock (clock 2), so the two clocks must be synchronized.]

Page 45

Network resource usage measurements

• Relative forward delay RFDi,j = (Rj - Ri) - (Sj - Si)

• Isolates queueing effects on the forward path, and

• Does not require synchronized clocks

• But they must be relatively stable

[Figure: send times Si and Sj are stamped by the sender's clock (clock 1) and receive times Ri and Rj by the receiver's clock (clock 2); only the difference of differences matters.]
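A minimal sketch of the RFD computation from the definition above (sample values are illustrative):

    def rfd(s_i, r_i, s_j, r_j):
        """Relative forward delay between packets i and j: the growth in
        forward-path queueing delay, computable even though send times are
        on the sender's clock and receive times on the receiver's clock."""
        return (r_j - r_i) - (s_j - s_i)

    # Usage: compare each packet against a reference packet i seen when
    # the path was idle; rising values suggest a queue is building.
    s0, r0 = 0.000, 0.010                             # reference (clock 1, clock 2)
    samples = [(0.001, 0.0115), (0.002, 0.0135)]
    deltas = [rfd(s0, r0, s, r) for s, r in samples]  # [0.0005, 0.0015]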

Page 46

RADoN

• A reservation has a network share (utilization) and a time granularity (period)

• Two real-time scheduling algorithms

• Earliest Deadline First (EDF) - absolute deadlines

• Least Laxity First (LLF) - relative laxities

[Figure: a job's timeline from release to deadline; laxity is the slack remaining between now and the deadline after accounting for remaining work.]

Page 47

Approximating optimal scheduling

• Flow control - throttling senders

• Execution time (per period): e = utilization * period

• Budget in packets: m = e * packets_per_second

• Congestion control - avoiding switch contention (adjust wait time between packets)

• Percent budget: %budget = (1 - %laxity) = e/(d-t)

• Packet wait time: w = wmin / %budget

• Step size: w∆ = -|wi - wmin|/2

• New wait time: wi+1 = min(wmax, max(wmin, wi + w∆))
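A minimal sketch of these pacing rules; the halve-the-gap update and all names are reconstructions/assumptions for illustration, not the authors' implementation:

    def packet_budget(utilization, period, packets_per_second):
        """Flow control: per-period execution time and packet budget."""
        e = utilization * period           # e = utilization * period
        m = e * packets_per_second         # budget in packets
        return e, m

    def next_wait(w_i, w_min, w_max, e_remaining, deadline, now):
        """Congestion control: adapt the inter-packet wait time."""
        pct_budget = e_remaining / (deadline - now)   # %budget = e/(d - t)
        w_target = w_min / pct_budget                 # tight budget -> short wait
        w_delta = (w_target - w_i) / 2                # assumed: halve the gap
        return min(w_max, max(w_min, w_i + w_delta))  # clamp to [w_min, w_max]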

Page 48

Queue modeling: single network stream

• No contention: 765 Mbps w/no lost packets

[Figure: estimated queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models.]

Page 49

Queue modeling: punctuated stream

• Contention: 5 bursts of 250 Mbps

[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]

The median filter detects congestion before packets are lost, and detects the decreasing queue size after congestion.
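A minimal sketch of median-filtered queue-depth estimation (converting delay samples to packets via an assumed bottleneck rate; this illustrates the idea, not the exact models plotted above):

    from statistics import median

    def queue_depth(rfd_samples, link_bps, bits_per_packet, k=5):
        """Convert RFD samples (seconds of extra forward delay) into an
        estimated queue depth in packets, then smooth with a k-sample
        median filter so congestion onset is visible before drops."""
        raw = [max(0.0, d) * link_bps / bits_per_packet for d in rfd_samples]
        return [median(raw[max(0, i - k + 1):i + 1]) for i in range(len(raw))]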

Page 50

Queue modeling: punctuated adaptive stream

• Contention: 5 bursts of 250 Mbps

Adapting to median-filter model decreases packet loss

[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]

Page 51

Userspace RADoN prototype

• Detects congestion using Relative Forward Delay

• Responds to congestion using RAD real-time theory

• Decreases packet loss significantly

• Improves goodput

• Requires no global knowledge or synchronization

• Ongoing: RADoN kernel implementation

Page 52

Buffer management for I/O guarantees

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Improved I/O performance

• Challenging because:

• Buffer is space-shared rather than time-shared

• Space limits time guarantees

• Best- and worst-case times are the opposite of the disk's

• Buffering affects performance in non-obvious ways


Page 53

Guarantees in the buffer cache

• Role

• Improve performance

• Preserve & enhance guarantees

• App-specific guarantees:

• Hard at core

• Soft when possible

• Predictable

• Hard isolation

• Device time utilization

[Figure: apps 1..n issue I/O through the buffer cache and disk scheduler to the disk; a resource broker coordinates their reservations.]

Page 54

Buffering roles in storage servers

• Staging and de-staging data

• Decouples sender and receiver

• Speed matching

• Allows slower and faster devices to communicate

• Traffic shaping

• Shapes traffic to optimize the performance of interfacing devices

• Assumption: reuse primarily occurs at the client


Page 55

Radium

• I/O into and out of buffer have rates and time granularities (periods)

• Period transformation: period into cache may be shorter than from cache to disk

• Rate transformation: rate into cache may be higher than disk can support

• Partition cache based on I/O characteristics and performance requirements

• Cache policies enhance performance within constraints determined by I/O requirements

• Use slack to prefetch reads and delay writes (see the sketch below)

[Figure: apps 1-3 issue I/O through the buffer cache and disk scheduler to the disk.]
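A minimal sketch of the rate-transformation arithmetic (names and numbers are assumptions for illustration):

    def staging_buffer_bytes(in_rate, out_rate, burst_seconds):
        """Buffer space needed so a burst arriving at in_rate can be
        accepted while the disk drains at its reserved out_rate; this is
        what lets the rate into the cache exceed what the disk sustains."""
        return max(0.0, (in_rate - out_rate) * burst_seconds)

    # Example: accepting a 100 MB/s burst for 2 s against a 40 MB/s
    # disk share needs (100 - 40) * 2 = 120 MB of reserved buffer.
    space = staging_buffer_bytes(100e6, 40e6, 2.0)   # 1.2e8 bytes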

Page 56

Enhancing guarantees in the buffer cache

• Reclaim unused resources (e.g., unused overhead)

• Use slack to prefetch reads and delay writes

• Allow more unguaranteed services

• Resource redistribution (buffer swapping) accommodates burstiness

• Period transformation: period into cache may be shorter than from cache to disk

• Rate transformation: rate into cache may be higher than disk can support


Page 57

Managing a sequential workload

[Figure: throughput (thousands of I/Os per second) of a sequential workload, executed in isolation (left) and combined with a random workload with reservations (right), under no-cache, monolithic, and Radium configurations with fifo, noop, deadline, anticipatory, cfq (50%/50%), quanta (50%/50%, 2 s), and RAD (50%/50%, 2 s) schedulers; target performance and reservation are marked.]

Page 58

Managing a random workload

[Figure: throughput (tens of I/Os per second) of a random workload, executed in isolation (left) and combined with a sequential workload with reservations (right), under no-cache, monolithic, and Radium configurations with the same schedulers; target performance and reservation are marked.]

Page 59

Managing combined workloads

[Figure: combined throughput (thousands of I/Os per second) of the random (top) and sequential (bottom) workloads under no cache, monolithic, and Radium, for fifo, noop, deadline, anticipatory, cfq, quanta, and RAD schedulers; target performance is marked.]

Page 60

Controlling throughput w/mixed workloads

[Figure: actual vs. target throughput (I/Os per second) for streams 1 and 2 (2 s periods) and their ideals, under Radium+FIFO, Radium+NOOP, Radium+deadline, Radium+Anticipatory, Radium+CFQ, and Radium+RAD.]

Consistent and predictable throughput for arbitrary reservations

Page 61

Controlling latency w/mixed workloads

[Figure: cumulative distribution (%) of latency (ms) under Radium+CFQ and Radium+RAD for reservation periods of 1 s, 750 ms, 500 ms, and 250 ms, with upper bounds marked.]

Precise control over the service times of each stream

Page 62

Results w/complex workloads

[Figure: average throughput (I/Os per second) over time under Radium+CFQ and Radium+RAD for soft streams 1 and 2 (500 ms periods), hard stream 3 (500 ms), and greedy streams 4 and 5 (1 s).]

Reasonable control with complex workloads

Page 63

Data center performance management

• Big distributed systems

• Serve many users/jobs

• Process petabytes of data

• Data center design

• Use rules of thumb

• Over-provision

• Isolate

• Ad hoc performance management creates marginal storage systems that cost more than necessary

• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network

Page 64

Data center performance mgmt. goals

1. A first-principles model for data center perf. mgmt.

2. Full-system performance metrics for client processing nodes, buffer cache, network, server buffer cache, and disk

3. Performance visualization by application, client node, reservation, or device

4. Application workload profiling and modeling

5. Full system performance provisioning and management based on all of the above

6. Online machine-learning based performance monitoring for real-time diagnostics

Page 65

RADIX

• $1 million from UC Lab Fee program

• Based on schedulers and workload-independent utilization metrics from our E2E QoS research

• Plan

1. Performance model and metrics

2. Tools for profiling, prediction, and planning

3. Operating systems components

4. Performance monitors and visualization tools

• Case study: LANL data centers

Page 66

Conclusion

• Distributed I/O performance management requires management of many separate components

• An integrated approach is needed

• RAD provides the basis for a solution

• It has been successfully applied to several resources: CPU, disk, network, and buffer cache

• We are on our way to an integrated solution

• There are many useful applications: Data center performance management, full storage virtualization, ...