Page 1

RADIO: Managing the Performance of Large, Distributed Storage Systems

Scott A. Brandt and

Carlos Maltzahn, Anna Povzner, Roberto Pineiro, Andrew Shewmaker, and Tim Kaldewey

Computer Science Department, University of California, Santa Cruz

and Richard Golding and Ted Wong, IBM Almaden Research Center

UPC—July 7, 2009

Page 2

Who am I?

• Professor, Computer Science Department, UC Santa Cruz

• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)

• Director, UCSC Systems Research Laboratory (SRL)

• Background

• 1999 Ph.D. CS, Colorado

• 1987/1993 B. Math/MS CS, Minnesota

• 1982-1994 Programmer/Research Scientist/VP, CPT, B-Tree, Honeywell SRC, Theseus Research, Alliant TechSystems RTS, Secure Computing

• My Research

• High-performance petascale storage

• Real-time systems

• Performance management and virtualization

• Active object-based storage

• Other Research

• Secure operating systems

• Asynchronous circuits

• Real-time image processing

Page 3

Distributed systems need performance guarantees

• Many distributed systems and applications need (or want) I/O performance guarantees

• Multimedia, high-performance simulation, transaction processing, virtual machines, service level agreements, real-time data capture, sensor networks, ...

• Systems tasks like backup and recovery

• Even so-called best-effort applications

• Providing such guarantees is difficult because it involves:

• Multiple interacting resources

• Dynamic workloads

• Interference among workloads

• Non-commensurable metrics: CPU utilization, network throughput, cache space, disk bandwidth

Page 4

In a nutshell

• Big distributed systems

• Serve many users/jobs

• Process petabytes of data

• Data center design

• Use rules of thumb

• Over-provision

• Isolate

• Ad hoc performance management approaches create marginal storage systems that cost more than necessary

• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network

Page 5

Outline

1. Problem: Managing the performance of large, distributed storage systems

2. Approach: End-to-end performance management

3. Model: RAD

4. Instances:

• Disk

• Network

• Buffer cache

5. Application: Data Center Performance Management and Monitoring

Page 6

End-to-end I/O performance guarantees

• Goal: Improve end-to-end performance management in large distributed systems

• Manage performance

• Isolate traffic

• Provide high performance

• Targets: High-performance storage (LLNL), data centers (LANL), satellite communications (IBM), virtual machines (VMware), sensor networks, ...

• Approach:

1. Develop a uniform model for managing performance

2. Apply it to each resource

3. Integrate the solutions

Page 7

Our current target

• High-performance I/O

• From client, across network, through server, to disk

• Up to hundreds of thousands of processing nodes

• Up to tens of thousands of I/O nodes

• Big, fat, network interconnect

• Up to thousands of storage nodes with cache and disk

• Challenges

• Interference between I/O streams, variability of workloads, variety of resources, variety of applications, legacy code, system management tasks, scale

Page 8

Stages in the I/O path

[Figure: the I/O path from applications through the client's I/O scheduler and cache, across the network transport, into the storage server's cache, and down to the disk. Annotations: flow control with one client; connection management between clients; I/O selection and head scheduling; prefetch and writeback based on utilization and QoS; integration between client and server caches.]

1. Disk I/O

2. Server cache

3. Flow control across network

• Within one client’s session and between clients

4. Client cache

Page 9

System architecture

[Figure: clients send reservation requests to a QoS broker, which places utilization reservations at the network, server caches, and disks of the storage servers; admitted I/O then flows along the same path.]

• Client: task, host, distributed application, VM, file, ...

• Reservations made via broker

• Specify workload: throughput, read/write ratio, burstiness, etc.

• Broker does admission control

• Requirements + workload are translated to utilization

• Utilizations are summed to see if they are feasible

• Once admitted, I/O streams are guaranteed (subject to workload adherence)

• Disk, caches, network controllers maintain guarantees
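A minimal sketch of the broker's admission test described above; the translate step and all names are illustrative assumptions, not the RADIO implementation:

    def admit(admitted_utilizations, request, translate):
        """Admit the I/O stream described by `request` if the system stays
        feasible; once admitted, its guarantee holds subject to workload
        adherence."""
        u = translate(request)  # requirements + workload -> utilization
        if sum(admitted_utilizations) + u <= 1.0:  # feasibility: sum <= 100%
            admitted_utilizations.append(u)
            return True
        return False  # infeasible: reject or renegotiate the reservation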


Page 10

Achieving robust guaranteeable resources

• Goal: Unified resource management algorithms capable of providing

• Good performance

• Arbitrarily hard or soft performance guarantees with

• Arbitrary resource allocations

• Arbitrary timing / granularity

• Complete isolation between workloads

• All resources: CPU, disk, network, server cache, client cache

➡ Virtual resources indistinguishable from “real” resources with fractional performance

Page 11

Isolation is key

• CPU

• 20% of a 3 GHz CPU should be indistinguishable from a 600 MHz CPU

• Running: compiler, editor, audio, video

• Disk

• 20% of a disk with 100 MB/second bandwidth should be indistinguishable from a disk with 20 MB/second bandwidth

• Serving: 1 stream, n streams, sequential, random

Page 12

Scott’s epistemology of virtualization

• Virtual Machines and LUNs provide good HW virtualization

• Question: Given perfect HW virtualization, how can a process tell the difference between a virtual resource and a real resource?

• Answer: By not getting its share of the resource when it needs it

Page 13

Observation

• Resource management consists of two distinct decisions

• Resource allocation: how many resources to allocate?

• Dispatching: when to provide the allocated resources?

• Most resource managers conflate them

• Best-effort, proportional-share, real-time

Page 14

Separating them is powerful!

• Separately managing resource allocation and dispatching gives direct control over the delivery of resources to tasks

• Enables direct, integrated support of all types of timeliness needs

[Figure: timeliness classes arranged by whether resource allocation and dispatching are constrained or unconstrained: hard real-time, soft real-time (SRT, where deadlines may be missed), rate-based, and best-effort (CPU-bound and I/O-bound).]

Page 15

The resource allocation/dispatching (RAD) scheduling model

[Figure: in the RAD model, a process is described by a rate (its share of resources) and deadlines (times at which the allocation must equal the share); these define a series of jobs with budgets and deadlines, which are handed to the dispatcher.]

Page 16

Supporting different timeliness requirements with RAD

[Figure: a scheduling policy maps each timeliness class onto RAD's rates and deadlines: hard real-time from period and WCET, soft real-time from period and ACET, rate-based from rate and bounds, best-effort from priority. The resulting set of jobs with budgets and deadlines feeds the dispatcher, the scheduling mechanism in the runtime system.]

Page 17

Rate-Based Earliest Deadline (RBED) CPU scheduler

[Figure: RBED instantiates RAD with EDF plus timers as the scheduling mechanism: the scheduling policy produces rates and deadlines, yielding a set of jobs with budgets and deadlines for the runtime system.]

• Processes have rate & period

• ∑ rates ≤ 100%

• Periods based on processing characteristics, latency needs, etc.

• Jobs have budget & deadline

• budget = rate * period

• Deadlines based on period or other characteristics

• Jobs dispatched via Earliest Deadline First (EDF)

• Budgets enforced with timers

• Guaranteeing all budgets & deadlines guarantees all rates & periods
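As a minimal sketch of this dispatching loop (the dict layout and the tick-based budget timer are illustrative assumptions, not the RBED kernel code):

    def dispatch_edf(jobs, tick):
        """Run the earliest-deadline job for up to one tick.
        jobs: list of dicts with 'deadline' and 'budget' (budget = rate * period).
        The tick bound stands in for the timers that enforce budgets."""
        job = min(jobs, key=lambda j: j['deadline'])   # EDF choice
        ran = min(tick, job['budget'])                 # never exceed the budget
        job['budget'] -= ran
        if job['budget'] <= 0:
            jobs.remove(job)                           # done until next release
        return job, ran

    # Usage: 30% and 20% rates with 100 ms periods give jobs with
    # budgets of 30 and 20 (ms) and deadlines one period out.
    jobs = [{'deadline': 100, 'budget': 30}, {'deadline': 100, 'budget': 20}]
    while jobs:
        dispatch_edf(jobs, tick=10)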

Page 18

Adapting RAD to disk, network, and buffer cache

• Fahrrad—Guaranteed disk request scheduling (Anna Povzner, UCSC)

• RADoN—Guaranteeing storage network performance (Andrew Shewmaker, UCSC and LANL)

• Radium—Buffer management for I/O guarantees (Roberto Pineiro, UCSC)


Page 19

Guaranteed disk request scheduling

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Good I/O performance

• Challenging because disk I/O is:

• Stateful

• Non-deterministic

• Non-preemptable, and

• Best- and worst-case times vary by 3–4 orders of magnitude

Page 20

Fahrrad

• Manages disk time instead of disk throughput

• Adapts RAD/RBED to disk I/O

• Reorders aggressively to provide good performance, without violating guarantees

[Figure: I/O streams A, B, C, and a best-effort (BE) stream enter Fahrrad, which schedules their requests onto the disk.]

Page 21

A bit more detail

• Reservations in terms of disk time utilization and period (granularity)

• All I/Os feasible before the earliest deadline in the system are moved to a Disk Scheduling Set (DSS)

• I/Os in the DSS are issued in the most efficient order

• The I/O charging model is critical

• Overhead reservation ensures exact utilization

• 2 worst-case request times (WCRTs) per period for “context switches”

• 1 WCRT per period to ensure the last I/Os complete

• 1 WCRT for the process with the shortest period, due to non-preemptability
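A minimal sketch of DSS selection under these rules; the per-I/O WCRT costing and all names are assumptions for illustration, not Fahrrad's actual code:

    def disk_scheduling_set(streams, wcrt, now):
        """Collect the I/Os that can feasibly complete before the earliest
        deadline in the system; within the DSS they may be reordered
        aggressively (e.g., by disk location) without violating guarantees.
        streams: objects with .deadline, .budget (seconds), .queue (I/Os)."""
        horizon = min(s.deadline for s in streams)   # earliest deadline
        slack = horizon - now                        # disk time until then
        dss = []
        for s in sorted(streams, key=lambda s: s.deadline):
            # each I/O is costed at up to one WCRT against the budgets
            n = max(0, min(len(s.queue), int(min(s.budget, slack) // wcrt)))
            dss.extend(s.queue[:n])
            slack -= n * wcrt
        return dss  # issue these in the most seek-efficient order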

Page 22

Fahrrad outperforms Linux

• Workload

• Media 1: 400 sequential I/Os per second (20%)

• Media 2: 800 sequential I/Os per second (40%)

• Transaction: short bursts of random I/Os at random times (30%)

• Background: random (10%)

• Result: Better isolation AND better throughput

[Figure: per-stream throughput under standard Linux vs. Linux with Fahrrad.]

Page 23

New work: virtual disks

• Provide workload-independent performance guarantees

• Isolate from other workloads concurrently accessing the device

• LUNs virtualize storage capacity

• Fahrrad virtualizes storage performance

Page 24

Fahrrad virtual disks

• Implemented with the Fahrrad real-time I/O scheduler

• Guarantee a reserved and isolated share of the time on the storage device

• Hard guarantees on performance isolation

• Virtual disk throughput same as equivalent standalone throughput

• Amount of data transferred:

• ∀i: Di(x%, t) = Di(100%, x%·t) (e.g., a virtual disk with a 20% share transfers as much data in 150 s as the standalone disk does in 30 s)

[Figure: share of disk time over time for the virtual disk vs. the standalone disk.]

Page 25

Guaranteeing performance isolation

• Virtual disk reservation: disk share (utilization) and time granularity (period)

• Account for all extra (inter-stream) seeks

• Reserve overhead utilization to do them

• Charge each I/O stream for all of the time it uses, including inter- and intra-stream seeks

• Reservation = Disk Share + Overhead utilization

[Figure: disk time partitioned among reservations of 25% at 1 s, 30% at 250 ms, and 19% at 1 min granularity.]

Page 26

Extra seeks

• Intra-stream seeks caused by workload

• Inter-stream seeks caused by reservations

• Low time granularity causes more frequent seeks

• At most two extra seeks per stream per period

• To and away from the stream, to meet deadlines

➡ Extra seeks caused by un-queued requests

Page 27

Charging model and reservations

• Charge streams responsible for inter-stream seeking

• From overhead: for seeks caused by reservations

• From reservation: for seeks caused by bursty behavior

• Overhead utilization needed for hard guarantees

• Overhead utilization = WCRT/p + 2*WCRT/p + WCRT/p = 4*WCRT/p

• We can trade off hard guarantees for lower overhead by assuming less than worst-case request time

[Figure: the overhead terms guarantee the reserved utilization, account for inter-stream seeks, and maintain two outstanding requests.]
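As a rough worked example (the WCRT value is an assumption for illustration, not from the slides): with WCRT = 25 ms and period p = 1 s,

    overhead utilization = 0.025 + 2 * 0.025 + 0.025 = 0.10

so 10% of the disk's time is reserved as overhead on top of the stream's disk share; halving the period doubles this cost.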

Page 28

Performance: guaranteeing throughput

• Throughput is determined by reservation and workload

• Each virtual disk reserves 20% with 1-second granularity

[Figure: amount of data transferred (MB) vs. run length of the semi-sequential stream (MB) for sequential, semi-sequential, and random virtual disks (each 20% share, 150 s).]

Page 29

Performance: guaranteeing throughput

• Throughput is determined by reservation and workload

• Each virtual disk reserves 20% with 1-second granularity

[Figure: amount of data transferred (MB) vs. run length of the semi-sequential stream (MB); each virtual disk (20% share, 150 s) tracks its standalone equivalent (100%, 30 s) for sequential, semi-sequential, and random workloads.]

Page 30

Performance: Controlling throughput

• Each virtual disk is isolated from the other

• Performance is fully determined by the reservation and workload

[Figure: data transferred (MB) and disk time reservation (%) vs. the reserved share for the sequential stream.]

Page 31

Performance: Controlling latency

• Reservation granularity bounds latency: period = latency/2 (e.g., a 250 ms period bounds latency at 500 ms)

• A virtual device serves a periodic semi-sequential stream and shares storage with a random background stream; four experiments with different period reservations.

[Figure: fraction of I/Os vs. latency for the four period reservations, with upper bounds marked; utilization vs. the period of the virtual disk.]

Page 32

Performance: Isolation guarantees

• Hard guarantees require high overhead (proportional to reservation granularity)

• Three virtual disks each serving one sequential stream with many outstanding I/Os share a storage system with a random background stream.

[Figure: disk time reservation (%) and data transferred (MB) vs. the period of virtual disk 3.]

Page 33

Performance: Soft guarantees w/isolation

• Overhead based on less than worst-case I/O time

• Increased short-term throughput variation

• A virtual disk (10%, 1 s) runs one sequential stream with a 400 I/O/sec arrival rate and shares the system with 5 virtual disks, each running one random stream.

[Figure: disk time reservation (%) and data transferred (%) vs. the percentile of observed service times used in place of the worst case.]

Page 34

Performance: Soft guarantees w/isolation

• Linux fails to support Cello99 (variation up to 30% from standalone)

• Fahrrad Virtual Disks provide Cello99 and OpenMail performance close to standalone

• Cello99 and OpenMail virtual disks share the system with a random background stream.

[Figure: throughput (I/Os per second) over time for Cello99 and OpenMail under Linux vs. Fahrrad Virtual Disks.]

Page 35

Fahrrad Virtual Disks

1. Guarantee throughput by accounting for overhead and guaranteeing utilization

2. Guarantee isolation between workloads by accurately accounting for all disk time

3. Provide high throughput (w/guarantees) by minimizing interference between workloads

4. Result: performance of virtual disk depends only on reservation, workload, and performance of device

Page 36

Guaranteeing storage network performance

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Good I/O performance

• Challenging because network I/O is:

• Distributed

• Non-deterministic (due to collisions or switch queue overflows)

• Non-preemptable

• Assumption: closed network

Page 37

What we want

[Figure: three clients reach three servers across a shared network, with guaranteed shares of 30%, 50%, and 20%.]

Page 38

What we have

• Switched fat tree w/full bisection bandwidth

• Issue 1: Capacity of shared links

• Issue 2: Switch queue contention

Page 39

Congestion in a simple switch model

• Each transmit port on the switch is a collision domain

[Figure: an 8-port switch with tx/rx ports, a shared FIFO per transmit port, and the switch fabric connecting them; each transmit port is a separate collision domain.]

Page 40

Congestion in a simple switch model

• One of the packets arriving at the same switch transmit port is delayed on the queue

[Figure: ports 1 and 2 both send to port 5; 1 and 2 congest at that transmit port's shared queue.]

Page 41

Congestion in a simple switch model

• Delayed packets from unrelated streams affect each other on the queue

[Figure: ports 1 and 2 send to port 5 while ports 3 and 4 send to port 8; 1 and 2 congest, 3 and 4 congest, and the delayed packets of 2 and 4, though from unrelated streams, also affect each other in the queue.]

Page 42

TCP

• “Those who do not understand TCP are destined to reimplement it” (Jon Postel)

• Ack-clocked flow control

• Packet loss based congestion control

• Sawtooth throughput

• Incast throughput collapse

Page 43

Network resource usage measurements

• Round trip time RTTi = Ci - Si

• Combines queueing effects on forward and reverse path + response time

[Figure: the sender stamps send time Si and completion time Ci on its own clock (clock 1); the receiver's clock (clock 2) is not involved.]

Page 44

Network resource usage measurements

• One-way delay OWDi = Ri - Si

• Isolates queueing effects on the forward path, but

• Requires synchronized clocks

[Figure: the send time Si is stamped by the sender's clock (clock 1) and the receive time Ri by the receiver's clock (clock 2), so the two clocks must be synchronized.]

Page 45

Network resource usage measurements

• Relative forward delay RFDi,j = (Rj - Ri) - (Sj - Si)

• Isolates queueing effects on the forward path, and

• Does not require synchronized clocks

• But they must be relatively stable

[Figure: send times Si and Sj are stamped by the sender's clock (clock 1) and receive times Ri and Rj by the receiver's clock (clock 2); only the difference of differences matters.]
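A minimal sketch of the RFD computation from the definition above (sample values are illustrative):

    def rfd(s_i, r_i, s_j, r_j):
        """Relative forward delay between packets i and j: the growth in
        forward-path queueing delay, computable even though send times are
        on the sender's clock and receive times on the receiver's clock."""
        return (r_j - r_i) - (s_j - s_i)

    # Usage: compare each packet against a reference packet i seen when
    # the path was idle; rising values suggest a queue is building.
    s0, r0 = 0.000, 0.010                             # reference (clock 1, clock 2)
    samples = [(0.001, 0.0115), (0.002, 0.0135)]
    deltas = [rfd(s0, r0, s, r) for s, r in samples]  # [0.0005, 0.0015]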

Page 46

RADoN

• A reservation has a network share (utilization) and a time granularity (period)

• Two real-time scheduling algorithms

• Earliest Deadline First (EDF) - absolute deadlines

• Least Laxity First (LLF) - relative laxities

[Figure: a job's timeline from release to deadline; laxity is the slack remaining between now and the deadline after accounting for remaining work.]

Page 47

Approximating optimal scheduling

• Flow control - throttling senders

• Execution time (per period): e = utilization * period

• Budget in packets: m = e * packets_per_second

• Congestion control - avoiding switch contention (adjust wait time between packets)

• Percent budget: %budget = (1 - %laxity) = e/(d-t)

• Packet wait time: w = wmin / %budget

• Step size: w∆ = -|wi - wmin|/2

• New wait time: wi+1 = min(wmax, max(wmin, wi + w∆))
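A minimal sketch of these pacing rules; the halve-the-gap update and all names are reconstructions/assumptions for illustration, not the authors' implementation:

    def packet_budget(utilization, period, packets_per_second):
        """Flow control: per-period execution time and packet budget."""
        e = utilization * period           # e = utilization * period
        m = e * packets_per_second         # budget in packets
        return e, m

    def next_wait(w_i, w_min, w_max, e_remaining, deadline, now):
        """Congestion control: adapt the inter-packet wait time."""
        pct_budget = e_remaining / (deadline - now)   # %budget = e/(d - t)
        w_target = w_min / pct_budget                 # tight budget -> short wait
        w_delta = (w_target - w_i) / 2                # assumed: halve the gap
        return min(w_max, max(w_min, w_i + w_delta))  # clamp to [w_min, w_max]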

Page 48

Queue modeling: single network stream

• No contention: 765 Mbps w/no lost packets

[Figure: estimated queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models.]

Page 49

Queue modeling: punctuated stream

• Contention: 5 bursts of 250 Mbps

[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]

The median filter detects congestion before packets are lost, and detects the decreasing queue size after congestion.
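A minimal sketch of median-filtered queue-depth estimation (converting delay samples to packets via an assumed bottleneck rate; this illustrates the idea, not the exact models plotted above):

    from statistics import median

    def queue_depth(rfd_samples, link_bps, bits_per_packet, k=5):
        """Convert RFD samples (seconds of extra forward delay) into an
        estimated queue depth in packets, then smooth with a k-sample
        median filter so congestion onset is visible before drops."""
        raw = [max(0.0, d) * link_bps / bits_per_packet for d in rfd_samples]
        return [median(raw[max(0, i - k + 1):i + 1]) for i in range(len(raw))]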

Page 50

Queue modeling: punctuated adaptive stream

• Contention: 5 bursts of 250 Mbps

Adapting to median-filter model decreases packet loss

[Figure: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]

Page 51

Userspace RADoN prototype

• Detects congestion using Relative Forward Delay

• Responds to congestion using RAD real-time theory

• Decreases packet loss significantly

• Improves goodput

• Requires no global knowledge or synchronization

• Ongoing: RADoN kernel implementation

Page 52

Buffer management for I/O guarantees

• Goals

• Hard and soft performance guarantees

• Isolation between I/O streams

• Improved I/O performance

• Challenging because:

• Buffer is space-shared rather than time-shared

• Space limits time guarantees

• Best- and worst-case times are the opposite of the disk's

• Buffering affects performance in non-obvious ways


Page 53

Guarantees in the buffer cache

• Role

• Improve performance

• Preserve & enhance guarantees

• App-specific guarantees:

• Hard at core

• Soft when possible

• Predictable

• Hard isolation

• Device time utilization

[Figure: apps 1..n issue I/O through the buffer cache and disk scheduler to the disk; a resource broker coordinates their reservations.]

Page 54

Buffering roles in storage servers

• Staging and de-staging data

• Decouples sender and receiver

• Speed matching

• Allows slower and faster devices to communicate

• Traffic shaping

• Shapes traffic to optimize the performance of interfacing devices

• Assumption: reuse primarily occurs at the client


Page 55

Radium

• I/O into and out of buffer have rates and time granularities (periods)

• Period transformation: period into cache may be shorter than from cache to disk

• Rate transformation: rate into cache may be higher than disk can support

• Partition cache based on I/O characteristics and performance requirements

• Cache policies enhance performance within constraints determined by I/O requirements

• Use slack to prefetch reads and delay writes (see the sketch below)

[Figure: apps 1-3 issue I/O through the buffer cache and disk scheduler to the disk.]
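A minimal sketch of the rate-transformation arithmetic (names and numbers are assumptions for illustration):

    def staging_buffer_bytes(in_rate, out_rate, burst_seconds):
        """Buffer space needed so a burst arriving at in_rate can be
        accepted while the disk drains at its reserved out_rate; this is
        what lets the rate into the cache exceed what the disk sustains."""
        return max(0.0, (in_rate - out_rate) * burst_seconds)

    # Example: accepting a 100 MB/s burst for 2 s against a 40 MB/s
    # disk share needs (100 - 40) * 2 = 120 MB of reserved buffer.
    space = staging_buffer_bytes(100e6, 40e6, 2.0)   # 1.2e8 bytes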

Page 56

Enhancing guarantees in the buffer cache

• Reclaim unused resources (e.g., unused overhead)

• Use slack to prefetch reads and delay writes

• Allow more unguaranteed services

• Resource redistribution (buffer swapping) accommodates burstiness

• Period transformation: period into cache may be shorter than from cache to disk

• Rate transformation: rate into cache may be higher than disk can support


Page 57

Managing a sequential workload

[Figure: throughput (thousands of I/Os per second) of a sequential workload, executed in isolation (left) and combined with a random workload with reservations (right), under no-cache, monolithic, and Radium configurations with fifo, noop, deadline, anticipatory, cfq (50%/50%), quanta (50%/50%, 2 s), and RAD (50%/50%, 2 s) schedulers; target performance and reservation are marked.]

Page 58

Managing a random workload

[Figure: throughput (tens of I/Os per second) of a random workload, executed in isolation (left) and combined with a sequential workload with reservations (right), under no-cache, monolithic, and Radium configurations with the same schedulers; target performance and reservation are marked.]

Page 59

Managing combined workloads

[Figure: combined throughput (thousands of I/Os per second) of the random (top) and sequential (bottom) workloads under no cache, monolithic, and Radium, for fifo, noop, deadline, anticipatory, cfq, quanta, and RAD schedulers; target performance is marked.]

Page 60

Controlling throughput w/mixed workloads

[Figure: actual vs. target throughput (I/Os per second) for streams 1 and 2 (2 s periods) and their ideals, under Radium+FIFO, Radium+NOOP, Radium+deadline, Radium+Anticipatory, Radium+CFQ, and Radium+RAD.]

Consistent and predictable throughput for arbitrary reservations

Page 61

Controlling latency w/mixed workloads

[Figure: cumulative distribution (%) of latency (ms) under Radium+CFQ and Radium+RAD for reservation periods of 1 s, 750 ms, 500 ms, and 250 ms, with upper bounds marked.]

Precise control over the service times of each stream

Page 62

Results w/complex workloads

[Figure: average throughput (I/Os per second) over time under Radium+CFQ and Radium+RAD for soft streams 1 and 2 (500 ms periods), hard stream 3 (500 ms), and greedy streams 4 and 5 (1 s).]

Reasonable control with complex workloads

Page 63

Data center performance management

• Big distributed systems

• Serve many users/jobs

• Process petabytes of data

• Data center design

• Use rules of thumb

• Over-provision

• Isolate

• Ad hoc performance management creates marginal storage systems that cost more than necessary

• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network

Page 64

Data center performance mgmt. goals

1. A first-principles model for data center perf. mgmt.

2. Full-system performance metrics for client processing nodes, buffer cache, network, server buffer cache, and disk

3. Performance visualization by application, client node, reservation, or device

4. Application workload profiling and modeling

5. Full system performance provisioning and management based on all of the above

6. Online machine-learning based performance monitoring for real-time diagnostics

Page 65

RADIX

• $1 million from UC Lab Fee program

• Based on schedulers and workload-independent utilization metrics from our E2E QoS research

• Plan

1. Performance model and metrics

2. Tools for profiling, prediction, and planning

3. Operating systems components

4. Performance monitors and visualization tools

• Case study: LANL data centers

Page 66

Conclusion

• Distributed I/O performance management requires management of many separate components

• An integrated approach is needed

• RAD provides the basis for a solution

• It has been successfully applied to several resources: CPU, disk, network, and buffer cache

• We are on our way to an integrated solution

• There are many useful applications: Data center performance management, full storage virtualization, ...