Balancing Throughput and Latency to Improve Real-Time I/O Service in Commodity Systems
Mark Stanovich
Mar 23, 2016
Outline
• Motivation and Problem
• Approach
• Research Directions
  1) Multiple worst-case service times
  2) Preemption coalescing
• Conclusion
3
Overview
• Real-time I/O support using
  – Commercial off-the-shelf (COTS) devices
  – General purpose operating systems (OS)
• Benefits
  – Cost effective
  – Shorter time-to-market
    • Prebuilt components
    • Developer familiarity
  – Compatibility
4
Example: Video Surveillance System
– Receive video
– Intrusion detection
– Recording
– Playback
[Figure: cameras on a local network and the Internet feeding a server (CPU, Network)]
Changes to make the system work?
How do we know the system works?
5
Problem with Current I/O in Commodity Systems
• Commodity systems rely on heuristics
  – One size fits all
  – Not amenable to RT techniques
• RT too conservative
  – Considers a missed deadline catastrophic
  – Assumes a single worst case
• RT theoretical algorithms ignore practical considerations
  – Time on a device vs. service provided
  – Effects of implementation
    • Overheads
    • Restrictions
6
Approach
• Balancing throughput and latency
• Variability in provided service
  – More distant deadlines allow for higher throughput
  – Tight deadlines require low latency
• Trade-off
  – Latency and throughput are not independent
  – Maximize throughput while keeping latency low enough to meet deadlines
7
Latency and Throughput
[Figure: request arrivals over time, with smaller vs. larger scheduling windows]
8
Observation #1: WCST(1) * N > WCST(N)
• Sharing the cost of I/O overheads
• I/O service overhead examples
  – Positioning the hard disk head
  – Erasures required when writing to flash
• Less overhead → higher throughput
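As a toy illustration of the inequality, here is a minimal sketch assuming a fixed positioning overhead (seek plus rotational latency) that can be shared across a batch of requests; all cost numbers are hypothetical, not measurements from the experiments in this work:

```python
# Hypothetical per-request costs, loosely modeled on a hard disk: a request
# pays a large positioning overhead unless it is batched with others, in
# which case the overhead is paid once for the whole batch.
SEEK_MS = 10.0      # assumed worst-case seek + rotational latency
TRANSFER_MS = 1.0   # assumed per-request transfer time

def wcst(n):
    """Worst-case service time for a batch of n requests when the
    positioning overhead is shared across the batch."""
    return SEEK_MS + n * TRANSFER_MS

n = 8
print(wcst(1) * n)  # cost if each request pays the overhead alone
print(wcst(n))      # cost when the overhead is amortized over the batch
```

With these assumed numbers the batched worst case is far below N times the single-request worst case, which is exactly the slack the scheduling algorithm tries to exploit.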
9
Device Service Profile Too Pessimistic
• Service rate is workload dependent
  – Sequential vs. random
  – Fragmented vs. bulk
• Variable levels of achievable service by issuing multiple requests
[Figure: worst-case access (seek time + rotational latency + min access size) vs. an average movie access pattern]
10
Overloaded?
[Figure: timelines for RT1, RT2, and combined RT1+RT2 with service times 25, 500, and 15, each request budgeted at a worst case of 75]
11
Increased System Performance
[Figure: RT1, RT2, and combined RT1+RT2 timelines with service times 25, 500, and 15]
Small Variations Complicate Analysis
[Figure: RT1+RT2 timeline with arrivals and deadlines; service times 25, 500, and 15, plus a small variation of 5]
13
Current Research
• Scheduling algorithm to balance latency and throughput
  – Sharing the cost of I/O overheads
  – RT and NRT
• Analyzing the amortization effect
  – How much improvement?
  – Guarantees
    • Maximum lateness
    • Number of missed deadlines
• Effects considering sporadic tasks
14
Observation #2: Preemption, a double-edged sword
• Reduces latency
  – Arriving work can begin immediately
• Reduces throughput
  – Consumes time without providing service
  – Examples
    • Context switches
    • Cache/TLB misses
• Tradeoff
  – Too often: reduces throughput
  – Not often enough: increases latency
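The tradeoff can be put in rough numbers. The sketch below assumes a fixed per-preemption overhead (context switch plus cache/TLB reload); the overhead value is an illustrative assumption, not a measurement:

```python
# Each preemption costs a fixed overhead, so a shorter scheduling quantum
# means lower worst-case latency for newly arrived work but a smaller
# fraction of the CPU doing useful work.
PREEMPT_COST_US = 20.0  # assumed context-switch + cache-miss penalty (us)

def useful_fraction(quantum_us):
    """Fraction of CPU time spent on useful work, one preemption per quantum."""
    return quantum_us / (quantum_us + PREEMPT_COST_US)

def worst_case_wait_us(quantum_us):
    """A newly arrived urgent job may wait up to one full quantum to run."""
    return quantum_us

for q in (100.0, 1000.0, 10000.0):
    print(f"quantum={q:>8} us  useful={useful_fraction(q):.3f}  wait<={worst_case_wait_us(q)} us")
```

Growing the quantum pushes the useful fraction toward 1 while the worst-case wait grows linearly, which is the latency/throughput tension the coalescing work targets.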
15
Preemption
[Figure: job arrivals and a deadline on a timeline]
16
Cost of Preemption
[Figure: CPU time for a job; each preemption adds context switch time and cache misses]
19
Current Research: How much preemption?
[Figure: network packet arrivals on a timeline, shown with different preemption (interrupt) groupings]
22
Current Research: Coalescing
• Without breaking RT analysis
• Balancing the overhead of preemptions against requests serviced
• Interrupts
  – Good: service begins immediately
  – Bad: can be costly if they occur too often
• Polling
  – Good: batches work
  – Bad: may unnecessarily delay service
Can we get the best of both?
• Sporadic Server
  – Light load
  – Low response time
• Polling Server
  – Heavy load
  – Low response time
  – No dropped pkts
[Figure: average response time under each scheme]
27
Conclusion
• Implementation effects force a tradeoff between throughput and latency
• Existing RT I/O support is artificially limited
  – One-size-fits-all approach
  – Assumes a single worst case
• Balancing throughput and latency uncovers a broader range of RT I/O capabilities
• Several promising directions to explore
28
Extra Slides
29
Latency and Throughput
• Timeliness depends on min throughput and max latency
• Tight timing constraints
  – Smaller number of requests to consider
  – Fewer possible service orders
  – Low latency, low throughput
• Relaxed timing constraints
  – Larger number of requests
  – Larger number of possible service orders
  – High throughput, high latency
[Figure: resource (service provided) over a time interval; lengthening latency increases throughput]
30
Observation #3: RT Interference on Non-RT
• Non-real-time != unimportant
• Isolating RT from NRT is important
• RT can impact NRT throughput
[Figure: RT workload alongside NRT anti-virus, backup, and maintenance tasks]
31
Current Research: Improving Throughput of NRT
• Pre-allocation
  – NRT applications as a single RT entity
• Group multiple NRT requests
  – Apply throughput techniques to NRT
• Interleave NRT requests with RT requests
• Mechanism to split RT resource allocation
  – POSIX sporadic server (high, low priority)
  – Specify low priority to be any priority, including NRT
32
Research
• Description
  – One real-time application
  – Multiple non-real-time applications
• Limit NRT interference
• Provide good throughput for non-real-time
• Treat hard disk as a black box
[Figure: OS scheduler mediating real-time and non-real-time requests]
Amortization Reducing Expected Completion Time
[Figure: higher throughput (more jobs serviced, queue size decreases) vs. lower throughput (fewer jobs serviced, queue size increases)]
34
Livelock
• All CPU time spent handling interrupts
• System not performing useful work
• First interrupt is useful
  – Until the packet(s) for an interrupt are processed, further interrupts provide no benefit
  – Disable interrupts until no more packets (work) are available
• Interrupts still provide the notification needed for scheduling decisions
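The disable-until-drained idea can be sketched as follows, in the style of Linux's NAPI receive path; the `Driver` class and its method names are hypothetical simplifications, not an actual kernel API:

```python
from collections import deque

class Driver:
    """Toy model: the first interrupt schedules a poll and masks further
    interrupts; interrupts are re-enabled only once the queue is drained."""
    def __init__(self):
        self.rx_queue = deque()
        self.interrupts_enabled = True
        self.poll_scheduled = False

    def irq(self):
        # Hardware raises an interrupt; only the first one does work.
        if self.interrupts_enabled:
            self.interrupts_enabled = False  # mask further interrupts
            self.poll_scheduled = True       # defer processing to a poll loop

    def packet_arrives(self, pkt):
        self.rx_queue.append(pkt)
        self.irq()  # ignored while interrupts are masked

    def poll(self, budget=64):
        # Run from the scheduler: drain up to `budget` packets, and only
        # re-enable interrupts once no work remains.
        done = []
        while self.rx_queue and len(done) < budget:
            done.append(self.rx_queue.popleft())
        if not self.rx_queue:
            self.interrupts_enabled = True
            self.poll_scheduled = False
        return done
```

Under a packet flood the system takes one interrupt and then does bounded work per poll invocation, so the CPU is never monopolized by interrupt handling.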
35
Other Approaches
• Only account for time on device [Kaldewey 2008]
• Group based on deadlines [SCAN-EDF, G-EDF]
• Require device-internal knowledge
  – [Cheng 1996]
  – [Reuther 2003]
  – [Bosch 1999]
36
“Amortized” Cost of I/O Operations
• WCST(n) << n * WCST(1)
• Cost of some ops can be shared amongst requests
  – Hard disk seek time
  – Parallel access to flash packages
• Improved minimum available resource
[Figure: timeline comparing WCST(5) to 5 * WCST(1)]
37
Amount of CPU Time?
[Figure: host A sends ping traffic to host B, which receives and responds; arrival, interrupt, and deadline marked on a timeline]
38
Measured Worst-Case Load
[Figure: measured worst-case load curve]
39
Some Preliminary Numbers
• Experiment
  – Send n random read requests simultaneously
  – Measure longest time to complete n requests
• Amortized cost per request should decrease for larger values of n
  – Amortization of the seek operation
[Figure: n random requests issued to a hard disk]
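The experiment might be sketched as below. A temporary file stands in for the raw disk, and the file path, sizes, and thread-pool approach are assumptions for illustration; on a cached filesystem the amortization effect will be far weaker than on real hardware:

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

REQ_SIZE = 50 * 1024          # 50 Kbyte requests, as in the plots
FILE_SIZE = 4 * 1024 * 1024   # small stand-in for the disk

def batch_time(path, n):
    """Issue n random reads simultaneously; return (elapsed, bytes per read)."""
    offsets = [random.randrange(0, FILE_SIZE - REQ_SIZE) for _ in range(n)]
    def read_one(off):
        with open(path, "rb") as f:
            f.seek(off)
            return len(f.read(REQ_SIZE))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        sizes = list(pool.map(read_one, offsets))
    return time.perf_counter() - start, sizes

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(FILE_SIZE))
elapsed, sizes = batch_time(tmp.name, 8)
print(elapsed, sizes)
os.unlink(tmp.name)
```

Dividing `elapsed` by `n` for increasing `n` gives the amortized cost per request that the following plots report.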
40
[Chart: requests/sec vs. number of requests, for 50 Kbyte requests]
41
[Chart: worst-case service time vs. number of requests, for 50 Kbyte requests]
42
Observation #1: I/O Service Requires CPU Time
• Examples
  – Device drivers
  – Network protocol processing
  – Filesystem
• RT analysis must consider OS CPU time
[Figure: Apps, OS, and device (e.g., network adapter, HDD) stack]
43
Example System
• Web services
  – Multimedia
  – Website
• Video surveillance
  – Receive video
  – Intrusion detection
  – Recording
  – Playback
[Figure: local network and Internet clients served by an all-in-one server (CPU, Network)]
44
Example
[Figure: application execution on a timeline, from arrival to deadline]
45
Example: Network Receive
[Figure: arrival, interrupt, OS and App execution segments on a timeline, with the deadline]
46
OS CPU Time
• Interrupt mechanism outside the control of the OS
• Make interrupts schedulable threads [Kleiman 1995]
  – Implemented by RT Linux
47
Example: Network Receive
[Figure: arrival, interrupt, OS and App execution on a timeline, with OS interrupt work scheduled as a thread]
48
Other Approaches
• Mechanisms
  – Enable/disable interrupts
  – Hardware mechanism (e.g., Motorola 68xxx)
  – Schedulable thread [Kleiman 1995]
  – Aperiodic servers (e.g., sporadic server [Sprunt 1991])
• Policies
  – Highest priority with budget [Facchinetti 2005]
  – Limit number of interrupts [Regehr 2005]
  – Priority inheritance [Zhang 2006]
  – Switch between interrupts and schedulable thread [Mogul 1997]
49
Problems Still Exist
• Analysis?
• Requires a known maximum on the amount of priority inversion
  – What is the maximum amount?
• Is enforcement of the maximum amount needed?
  – How much CPU time?
  – Limit using a POSIX-defined aperiodic server
• Is an aperiodic server sufficient?
• Practical considerations?
  – Overhead
  – Imprecise control
• Can we back-charge an application?
  – No priority inversion: charge to the application
  – Priority inversion: charge to a separate entity
50
Concrete Research Tasks
• CPU
  – I/O workload characterization [RTAS 2007]
  – Tunable demand [RTAS 2010, RTLWS 2011]
  – Effect of reducing availability on I/O service
• Device
  – Improved schedulability due to amortization [RTAS 2008]
  – Analysis for multiple RT tasks
• End-to-end I/O guarantees
  – Fit into analyzable framework [RTAS 2007]
  – Guarantees including both CPU and device components
51
Feature Comparison
[Table: approaches (Linux RT preempt, Interrupt Acct, GEDF, SCAN-EDF, RT Calculus, Modeling DD, Fitting Linux DD, Throttling HDD, POSIX SS, Linux SS, OPSCHED) compared on:
• CPU: methods to fit into analysis; configurable bound on interference; time accounting/correlation; effect on I/O service; multiple min service profiles
• Device: multiple min service profiles; improved schedulability; fit into analysis; bounded interference; works with black box
• End-to-end: methods to fit into analysis; improved average-case performance]
52
New Approach
• Better model
  – Include OS CPU consumption in the analysis
  – Enhance OS mechanisms to allow better system design
• Models built on empirical observations
  – Timing information unavailable
  – Static analysis not practical and too pessimistic
• Resources operate at a variety of service rates
  – Tighter deadlines == lower throughput
  – Longer deadlines == higher throughput
53
Example: Rate-Latency Curve Convolution
[Figure: convolving a rate-latency curve (rate1, Latency1) with another (rate2, Latency2) yields a rate-latency curve with rate1 and latency Latency1 + Latency2]
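The figure's rule can be written in min-plus algebra. For rate-latency service curves, convolution adds the latencies and keeps the smaller rate (rate1 in the figure, assuming rate1 ≤ rate2):

```latex
\beta_{R,T}(\Delta) = R \cdot \max(0,\, \Delta - T), \qquad
\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\; T_1 + T_2}
```

This is the standard concatenation result from network calculus, specialized to the rate-latency curve family used throughout these slides.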
54
A Useful Tool: Real-Time Calculus
• Based on network calculus, derived from queueing theory
  – Provides an analytical framework to compose systems
• More precise analysis (bounds), especially for end-to-end analysis
• Can be used with existing models (e.g., periodic)
• Provides a very general representation for modeling systems
55
End-to-End Analysis
• I/O service time includes multiple components
• Analysis must consider all components
  – Worst-case delay for each?
  – Is this bound tight?
• Framework to “compose” individual resources
[Figure: a request flows from Apps through CPU, Tx, Device, and Rx, returning a response]
56
Real-Time Calculus
[Figure: max arrival curve α and min service curve β over Δ; the maximum horizontal distance is the worst-case response time]
57
Real-Time Calculus [Thiele 2000]
[Figure: workload (arrival curves α1, α2) and resources (service curve β1) flow through components, producing α1′, α2′, β1′, β1′′]
58
Composing RT I/O Service
[Figure: Apps, CPU, Tx, Device, and Rx components composed in sequence]
59
Constraint on Output Arrival
• Deconvolution
  – Envelope arrival curve
    • γ – maximum service curve
    • β – min service curve
    • α – input
    • α′ – output
60
Timing Bounds
[Figure: frequency vs. response time; distribution of measured possible times, with observable upper bound, empirical upper bound, actual upper bound, and analytical upper bound marked]
61
Job
[Figure: a job's arrival/release time, start time, completion time, absolute and relative deadlines, response time, and lateness (tardiness) on a timeline]
62
Task
[Figure: a task's jobs on a timeline, showing worst-case execution time (WCET), inter-arrival time, and deadline]
63
Theoretical Analysis
• Non-preemptive job scheduling reduces to bin packing (NP-hard)
[Figure: jobs packed onto a timeline]
64
Real-Time Calculus [Thiele 2000]
• Resource availability in the time interval [s,t) is C[s,t)
[Figure: resource availability between times s and t]
65
Real-Time Calculus [Thiele 2000]
[Figure: arrival curves α1, α2 and service curve β1 flow through components, producing α1′, α2′, β1′, β1′′]

\beta'^{l}(t) = \sup_{0 \le \lambda \le t} \{ \beta^{l}(\lambda) - \alpha^{u}(\lambda) \}
\beta'^{u}(t) = \inf_{\lambda \ge 0} \{ \beta^{u}(\lambda) - \alpha^{l}(\lambda) \}
\alpha'^{l} = \min \{ (\alpha^{l} \oslash \beta^{u}) \otimes \beta^{l},\ \beta^{l} \}
\alpha'^{u} = \min \{ (\alpha^{u} \otimes \beta^{u}) \oslash \beta^{l},\ \beta^{u} \}

(f \otimes g)(t) = \inf_{0 \le \lambda \le t} \{ f(t - \lambda) + g(\lambda) \}
(f \oslash g)(t) = \sup_{\lambda \ge 0} \{ f(t + \lambda) - g(\lambda) \}
66
Real-Time Calculus
[Figure: arrival curve α and min service curve β over Δ; the maximum horizontal distance is the worst-case response time, and the maximum vertical distance is the maximum queue length]

d_{max} \le \sup_{\lambda \ge 0} \{ \inf \{ \tau \ge 0 : \alpha^{u}(\lambda) \le \beta^{l}(\lambda + \tau) \} \}
buf_{max} \le \sup_{\lambda \ge 0} \{ \alpha^{u}(\lambda) - \beta^{l}(\lambda) \}
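The two bounds can be evaluated numerically. This sketch assumes a token-bucket arrival curve and a rate-latency service curve with illustrative parameter values (burst 4, arrival rate 1, service rate 2, service latency 3):

```python
# Worst-case delay (max horizontal distance) and backlog (max vertical
# distance) between an upper arrival curve and a lower service curve,
# computed by brute-force sampling.
B, R_ARR = 4.0, 1.0   # token bucket: burst and sustained arrival rate
R_SRV, T = 2.0, 3.0   # rate-latency service: rate and latency
STEP, HORIZON = 0.01, 20.0

def alpha_u(d):
    return 0.0 if d <= 0 else B + R_ARR * d

def beta_l(d):
    return max(0.0, R_SRV * (d - T))

lams = [i * STEP for i in range(int(HORIZON / STEP))]

def horiz(lam):
    """Smallest tau with alpha_u(lam) <= beta_l(lam + tau)."""
    tau = 0.0
    while alpha_u(lam) > beta_l(lam + tau):
        tau += STEP
    return tau

d_max = max(horiz(lam) for lam in lams)                      # delay bound
buf_max = max(alpha_u(lam) - beta_l(lam) for lam in lams)    # backlog bound
print(d_max, buf_max)
```

For these curves the closed forms are T + B/R_SRV = 5 for the delay and B + R_ARR·T = 7 for the backlog, which the sampled computation reproduces to within the step size.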
67
Network Calculus
[Figure: input x(t) through a system σ(t) produces output y(t); two systems σ1(t), σ2(t) in series behave as a single system]

\sigma_c = \sigma_1 \otimes \sigma_2
y(t) = (\sigma \otimes x)(t) = \inf_{0 \le \lambda \le t} \{ \sigma(t - \lambda) + x(\lambda) \}
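A discrete-time sketch of the min-plus convolution defined above, applied to two assumed rate-latency service curves sampled at integer times (the result has the smaller rate and the sum of the latencies):

```python
def minplus_conv(f, g):
    """(f ⊗ g)(t) = min over 0 <= k <= t of f(t-k) + g(k), on sampled curves."""
    n = min(len(f), len(g))
    return [min(f[t - k] + g[k] for k in range(t + 1)) for t in range(n)]

# Two rate-latency service curves (assumed parameters):
sigma1 = [max(0, 2 * (t - 1)) for t in range(10)]  # rate 2, latency 1
sigma2 = [max(0, 1 * (t - 2)) for t in range(10)]  # rate 1, latency 2
sigma_c = minplus_conv(sigma1, sigma2)
print(sigma_c)
```

The composed curve behaves like rate 1 with latency 3, matching the concatenation rule σ_c = σ1 ⊗ σ2.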
68
69
Real-Time Background
• Explicit timing constraints
  – Finish computation before a deadline
  – Retrieve sensor reading every 5 msecs
  – Display image every 1/30th of a second
• Schedule (online) access to resources to meet timing constraints
• Schedulability analysis (offline)
  – Abstract models
    • Workloads
    • Resources
  – Scheduling algorithm
[Figure: applications App1..Appn sharing resources]
70
Current Research: Analyzing CPU Time for I/Os
• Applications demand CPU time
• Measure the interference
• Ratio of max demand to interval length defines load
• Schedulability (fixed-task priority)
• Characterize I/O CPU time in terms of a load function
[Figure: task under consideration delayed by interference from higher-priority tasks]
71
How to Measure Load
• I/O CPU component at high priority
• Measurement task at low priority
[Figure: high-priority I/O work preempting the low-priority measurement task over time]
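The low-priority measurement idea can be sketched as a task that spins and timestamps each loop iteration; any gap much longer than one iteration is CPU time consumed by higher-priority (e.g., interrupt or I/O) work. This is a user-space approximation; the real experiment would pin the task at a low real-time priority:

```python
import time

def measure_gaps(duration_s=0.1, threshold_s=0.001):
    """Spin for duration_s; record (timestamp, length) of every pause longer
    than threshold_s, i.e., intervals where something else held the CPU."""
    gaps = []
    end = time.perf_counter() + duration_s
    last = time.perf_counter()
    while last < end:
        now = time.perf_counter()
        if now - last > threshold_s:
            gaps.append((last, now - last))
        last = now
    return gaps

interference = measure_gaps()
print(sum(g for _, g in interference))  # total time stolen from the task
```

Summing the gaps over sliding windows of each interval length yields the empirical load function used in the analysis.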
72
Measured Worst-Case Load
[Figure: measured worst-case load curve]
73
Analyzing

\frac{e_k}{d_k} + \sum_{i=1}^{k-1} load^{max}_{\tau_i}(d_k) \le 1

[Figure: task under consideration and interference from higher-priority tasks]
τ1 is a periodic task (WCET = 2, Period = 10)
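The condition above can be checked numerically. In this sketch, `periodic_load_max` is a hypothetical helper that bounds a periodic interferer's load with a coarse ceiling argument; τ1 is the periodic task from the slide (WCET 2, period 10):

```python
import math

def periodic_load_max(wcet, period, interval):
    """Coarse bound on the fraction of `interval` a periodic task can consume:
    at most ceil(interval/period) jobs of `wcet` each, capped at 1."""
    return min(1.0, math.ceil(interval / period) * wcet / interval)

def schedulable(e_k, d_k, higher_priority):
    """Check e_k/d_k + sum of higher-priority max loads over d_k <= 1."""
    total = e_k / d_k
    for wcet, period in higher_priority:
        total += periodic_load_max(wcet, period, d_k)
    return total <= 1.0

# Task with execution time 3 and deadline 8, interfered with by tau1:
print(schedulable(e_k=3.0, d_k=8.0, higher_priority=[(2.0, 10.0)]))
```

In the thesis the load term comes from the measured I/O load function rather than a periodic model; the inequality is evaluated the same way.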
74
Bounding
75
Adjusting the Interference
• May have missed the worst case
• CPU time consumed too high
• Aperiodic servers
  – Force the workload into a specific workload model
  – Example: sporadic server
76
Future Research
• Combine bounding and accounting
  – Accounting
    • Charge the user of services
    • Cannot always charge the correct account
  – Bounding
    • Set aside a separate account
    • If exhausted, disable I/O until the account is replenished
77
Future Research: Practicality of Aperiodic Servers
• Practical considerations
  – Is the implementation correct?
  – Overhead
    • Context switches
    • Latency vs. throughput
[Figure: OS scheduler mediating real-time and non-real-time requests]
Past Research: Throttling
79
Seek Time Amortization
[Figure: seek time amortized across batched requests]
83
[Chart: amortized service time vs. number of simultaneous requests, for 50 Kbyte requests]
84
Example System
• Web services
  – Multimedia
  – Website
• Video surveillance
  – Receive video
  – Intrusion detection
  – Recording
  – Playback
[Figure: local network and Internet clients served by an all-in-one server (CPU, Network)]
How do we make the system work?