Balancing Throughput and Latency to Improve Real-Time I/O Service in Commodity Systems
Mark Stanovich
Mar 23, 2016
Outline
• Motivation and Problem
• Approach
• Research Directions
  1) Multiple worst-case service times
  2) Preemption coalescing
• Conclusion
3
Overview
• Real-time I/O support using
  – Commercial off-the-shelf (COTS) devices
  – General purpose operating systems (OS)
• Benefits
  – Cost effective
  – Shorter time-to-market
    • Prebuilt components
    • Developer familiarity
  – Compatibility
4
Example: Video Surveillance System
– Receive video
– Intrusion detection
– Recording
– Playback
[Figure: cameras on a local network and the Internet feeding a server (CPU, Network)]
Changes to make the system work?
How do we know the system works?
5
Problem with Current I/O in Commodity Systems
• Commodity systems rely on heuristics
  – One size fits all
  – Not amenable to RT techniques
• RT too conservative
  – Considers a missed deadline catastrophic
  – Assumes a single worst case
• RT theoretical algorithms ignore practical considerations
  – Time on a device vs. service provided
  – Effects of implementation
    • Overheads
    • Restrictions
6
Approach
• Balancing throughput and latency
• Variability in provided service
  – More distant deadlines allow for higher throughput
  – Tight deadlines require low latency
• Trade-off
  – Latency and throughput are not independent
  – Maximize throughput while keeping latency low enough to meet deadlines
7
Latency and Throughput
[Figure: request arrivals over time, with smaller vs. larger scheduling windows]
8
Observation #1: WCST(1) * N > WCST(N)
• Sharing the cost of I/O overheads
• I/O service overhead examples
  – Positioning the hard disk head
  – Erasures required when writing to flash
• Less overhead → higher throughput
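As a toy illustration of the inequality, here is a minimal sketch assuming a fixed positioning overhead (seek plus rotational latency) that can be shared across a batch of requests; all cost numbers are hypothetical, not measurements from the experiments in this work:

```python
# Hypothetical per-request costs, loosely modeled on a hard disk: a request
# pays a large positioning overhead unless it is batched with others, in
# which case the overhead is paid once for the whole batch.
SEEK_MS = 10.0      # assumed worst-case seek + rotational latency
TRANSFER_MS = 1.0   # assumed per-request transfer time

def wcst(n):
    """Worst-case service time for a batch of n requests when the
    positioning overhead is shared across the batch."""
    return SEEK_MS + n * TRANSFER_MS

n = 8
print(wcst(1) * n)  # cost if each request pays the overhead alone
print(wcst(n))      # cost when the overhead is amortized over the batch
```

With these assumed numbers the batched worst case is far below N times the single-request worst case, which is exactly the slack the scheduling algorithm tries to exploit.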
9
Device Service Profile Too Pessimistic
• Service rate is workload dependent
  – Sequential vs. random
  – Fragmented vs. bulk
• Variable levels of achievable service by issuing multiple requests
[Figure: worst-case access (seek time + rotational latency + min access size) vs. an average movie access pattern]
10
Overloaded?
[Figure: timelines for RT1, RT2, and combined RT1+RT2 with service times 25, 500, and 15, each request budgeted at a worst case of 75]
11
Increased System Performance
[Figure: RT1, RT2, and combined RT1+RT2 timelines with service times 25, 500, and 15]
Small Variations Complicate Analysis
[Figure: RT1+RT2 timeline with arrivals and deadlines; service times 25, 500, and 15, plus a small variation of 5]
13
Current Research
• Scheduling algorithm to balance latency and throughput
  – Sharing the cost of I/O overheads
  – RT and NRT
• Analyzing the amortization effect
  – How much improvement?
  – Guarantees
    • Maximum lateness
    • Number of missed deadlines
• Effects considering sporadic tasks
14
Observation #2: Preemption, a double-edged sword
• Reduces latency
  – Arriving work can begin immediately
• Reduces throughput
  – Consumes time without providing service
  – Examples
    • Context switches
    • Cache/TLB misses
• Tradeoff
  – Too often: reduces throughput
  – Not often enough: increases latency
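The tradeoff can be put in rough numbers. The sketch below assumes a fixed per-preemption overhead (context switch plus cache/TLB reload); the overhead value is an illustrative assumption, not a measurement:

```python
# Each preemption costs a fixed overhead, so a shorter scheduling quantum
# means lower worst-case latency for newly arrived work but a smaller
# fraction of the CPU doing useful work.
PREEMPT_COST_US = 20.0  # assumed context-switch + cache-miss penalty (us)

def useful_fraction(quantum_us):
    """Fraction of CPU time spent on useful work, one preemption per quantum."""
    return quantum_us / (quantum_us + PREEMPT_COST_US)

def worst_case_wait_us(quantum_us):
    """A newly arrived urgent job may wait up to one full quantum to run."""
    return quantum_us

for q in (100.0, 1000.0, 10000.0):
    print(f"quantum={q:>8} us  useful={useful_fraction(q):.3f}  wait<={worst_case_wait_us(q)} us")
```

Growing the quantum pushes the useful fraction toward 1 while the worst-case wait grows linearly, which is the latency/throughput tension the coalescing work targets.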
15
Preemption
[Figure: job arrivals and a deadline on a timeline]
16
Cost of Preemption
[Figure: CPU time for a job; each preemption adds context switch time and cache misses]
19
Current Research: How much preemption?
[Figure: network packet arrivals on a timeline, shown with different preemption (interrupt) groupings]
22
Current Research: Coalescing
• Without breaking RT analysis
• Balancing the overhead of preemptions against requests serviced
• Interrupts
  – Good: service begins immediately
  – Bad: can be costly if they occur too often
• Polling
  – Good: batches work
  – Bad: may unnecessarily delay service
Can we get the best of both?
• Sporadic Server
  – Light load
  – Low response time
• Polling Server
  – Heavy load
  – Low response time
  – No dropped pkts
[Figure: average response time under each scheme]
27
Conclusion
• Implementation effects force a tradeoff between throughput and latency
• Existing RT I/O support is artificially limited
  – One-size-fits-all approach
  – Assumes a single worst case
• Balancing throughput and latency uncovers a broader range of RT I/O capabilities
• Several promising directions to explore
28
Extra Slides
29
Latency and Throughput
• Timeliness depends on min throughput and max latency
• Tight timing constraints
  – Smaller number of requests to consider
  – Fewer possible service orders
  – Low latency, low throughput
• Relaxed timing constraints
  – Larger number of requests
  – Larger number of possible service orders
  – High throughput, high latency
[Figure: resource (service provided) over a time interval; lengthening latency increases throughput]
30
Observation #3: RT Interference on Non-RT
• Non-real-time != unimportant
• Isolating RT from NRT is important
• RT can impact NRT throughput
[Figure: RT workload alongside NRT anti-virus, backup, and maintenance tasks]
31
Current Research: Improving Throughput of NRT
• Pre-allocation
  – NRT applications as a single RT entity
• Group multiple NRT requests
  – Apply throughput techniques to NRT
• Interleave NRT requests with RT requests
• Mechanism to split RT resource allocation
  – POSIX sporadic server (high, low priority)
  – Specify low priority to be any priority, including NRT
32
Research
• Description
  – One real-time application
  – Multiple non-real-time applications
• Limit NRT interference
• Provide good throughput for non-real-time
• Treat hard disk as a black box
[Figure: OS scheduler mediating real-time and non-real-time requests]
Amortization Reducing Expected Completion Time
[Figure: higher throughput (more jobs serviced, queue size decreases) vs. lower throughput (fewer jobs serviced, queue size increases)]
34
Livelock
• All CPU time spent handling interrupts
• System not performing useful work
• First interrupt is useful
  – Until the packet(s) for an interrupt are processed, further interrupts provide no benefit
  – Disable interrupts until no more packets (work) are available
• Interrupts still provide the notification needed for scheduling decisions
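The disable-until-drained idea can be sketched as follows, in the style of Linux's NAPI receive path; the `Driver` class and its method names are hypothetical simplifications, not an actual kernel API:

```python
from collections import deque

class Driver:
    """Toy model: the first interrupt schedules a poll and masks further
    interrupts; interrupts are re-enabled only once the queue is drained."""
    def __init__(self):
        self.rx_queue = deque()
        self.interrupts_enabled = True
        self.poll_scheduled = False

    def irq(self):
        # Hardware raises an interrupt; only the first one does work.
        if self.interrupts_enabled:
            self.interrupts_enabled = False  # mask further interrupts
            self.poll_scheduled = True       # defer processing to a poll loop

    def packet_arrives(self, pkt):
        self.rx_queue.append(pkt)
        self.irq()  # ignored while interrupts are masked

    def poll(self, budget=64):
        # Run from the scheduler: drain up to `budget` packets, and only
        # re-enable interrupts once no work remains.
        done = []
        while self.rx_queue and len(done) < budget:
            done.append(self.rx_queue.popleft())
        if not self.rx_queue:
            self.interrupts_enabled = True
            self.poll_scheduled = False
        return done
```

Under a packet flood the system takes one interrupt and then does bounded work per poll invocation, so the CPU is never monopolized by interrupt handling.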
35
Other Approaches
• Only account for time on device [Kaldewey 2008]
• Group based on deadlines [SCAN-EDF, G-EDF]
• Require device-internal knowledge
  – [Cheng 1996]
  – [Reuther 2003]
  – [Bosch 1999]
36
“Amortized” Cost of I/O Operations
• WCST(n) << n * WCST(1)
• Cost of some ops can be shared amongst requests
  – Hard disk seek time
  – Parallel access to flash packages
• Improved minimum available resource
[Figure: timeline comparing WCST(5) to 5 * WCST(1)]
37
Amount of CPU Time?
[Figure: host A sends ping traffic to host B, which receives and responds; arrival, interrupt, and deadline marked on a timeline]
38
Measured Worst-Case Load
[Figure: measured worst-case load curve]
39
Some Preliminary Numbers
• Experiment
  – Send n random read requests simultaneously
  – Measure longest time to complete n requests
• Amortized cost per request should decrease for larger values of n
  – Amortization of the seek operation
[Figure: n random requests issued to a hard disk]
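The experiment might be sketched as below. A temporary file stands in for the raw disk, and the file path, sizes, and thread-pool approach are assumptions for illustration; on a cached filesystem the amortization effect will be far weaker than on real hardware:

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

REQ_SIZE = 50 * 1024          # 50 Kbyte requests, as in the plots
FILE_SIZE = 4 * 1024 * 1024   # small stand-in for the disk

def batch_time(path, n):
    """Issue n random reads simultaneously; return (elapsed, bytes per read)."""
    offsets = [random.randrange(0, FILE_SIZE - REQ_SIZE) for _ in range(n)]
    def read_one(off):
        with open(path, "rb") as f:
            f.seek(off)
            return len(f.read(REQ_SIZE))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        sizes = list(pool.map(read_one, offsets))
    return time.perf_counter() - start, sizes

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(FILE_SIZE))
elapsed, sizes = batch_time(tmp.name, 8)
print(elapsed, sizes)
os.unlink(tmp.name)
```

Dividing `elapsed` by `n` for increasing `n` gives the amortized cost per request that the following plots report.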
40
[Chart: requests/sec vs. number of requests, for 50 Kbyte requests]
41
[Chart: worst-case service time vs. number of requests, for 50 Kbyte requests]
42
Observation #1: I/O Service Requires CPU Time
• Examples
  – Device drivers
  – Network protocol processing
  – Filesystem
• RT analysis must consider OS CPU time
[Figure: Apps, OS, and device (e.g., network adapter, HDD) stack]
43
Example System
• Web services
  – Multimedia
  – Website
• Video surveillance
  – Receive video
  – Intrusion detection
  – Recording
  – Playback
[Figure: local network and Internet clients served by an all-in-one server (CPU, Network)]
44
Example
[Figure: application execution on a timeline, from arrival to deadline]
45
Example: Network Receive
[Figure: arrival, interrupt, OS and App execution segments on a timeline, with the deadline]
46
OS CPU Time
• Interrupt mechanism outside the control of the OS
• Make interrupts schedulable threads [Kleiman 1995]
  – Implemented by RT Linux
47
Example: Network Receive
[Figure: arrival, interrupt, OS and App execution on a timeline, with OS interrupt work scheduled as a thread]
48
Other Approaches
• Mechanisms
  – Enable/disable interrupts
  – Hardware mechanism (e.g., Motorola 68xxx)
  – Schedulable thread [Kleiman 1995]
  – Aperiodic servers (e.g., sporadic server [Sprunt 1991])
• Policies
  – Highest priority with budget [Facchinetti 2005]
  – Limit number of interrupts [Regehr 2005]
  – Priority inheritance [Zhang 2006]
  – Switch between interrupts and schedulable thread [Mogul 1997]
49
Problems Still Exist
• Analysis?
• Requires a known maximum on the amount of priority inversion
  – What is the maximum amount?
• Is enforcement of the maximum amount needed?
  – How much CPU time?
  – Limit using a POSIX-defined aperiodic server
• Is an aperiodic server sufficient?
• Practical considerations?
  – Overhead
  – Imprecise control
• Can we back-charge an application?
  – No priority inversion: charge to the application
  – Priority inversion: charge to a separate entity
50
Concrete Research Tasks
• CPU
  – I/O workload characterization [RTAS 2007]
  – Tunable demand [RTAS 2010, RTLWS 2011]
  – Effect of reducing availability on I/O service
• Device
  – Improved schedulability due to amortization [RTAS 2008]
  – Analysis for multiple RT tasks
• End-to-end I/O guarantees
  – Fit into analyzable framework [RTAS 2007]
  – Guarantees including both CPU and device components
51
Feature Comparison
[Table: approaches (Linux RT preempt, Interrupt Acct, GEDF, SCAN-EDF, RT Calculus, Modeling DD, Fitting Linux DD, Throttling HDD, POSIX SS, Linux SS, OPSCHED) compared on:
• CPU: methods to fit into analysis; configurable bound on interference; time accounting/correlation; effect on I/O service; multiple min service profiles
• Device: multiple min service profiles; improved schedulability; fit into analysis; bounded interference; works with black box
• End-to-end: methods to fit into analysis; improved average-case performance]
52
New Approach
• Better model
  – Include OS CPU consumption in the analysis
  – Enhance OS mechanisms to allow better system design
• Models built on empirical observations
  – Timing information unavailable
  – Static analysis not practical and too pessimistic
• Resources operate at a variety of service rates
  – Tighter deadlines == lower throughput
  – Longer deadlines == higher throughput
53
Example: Rate-Latency Curve Convolution
[Figure: convolving a rate-latency curve (rate1, Latency1) with another (rate2, Latency2) yields a rate-latency curve with rate1 and latency Latency1 + Latency2]
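The figure's rule can be written in min-plus algebra. For rate-latency service curves, convolution adds the latencies and keeps the smaller rate (rate1 in the figure, assuming rate1 ≤ rate2):

```latex
\beta_{R,T}(\Delta) = R \cdot \max(0,\, \Delta - T), \qquad
\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\; T_1 + T_2}
```

This is the standard concatenation result from network calculus, specialized to the rate-latency curve family used throughout these slides.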
54
A Useful Tool: Real-Time Calculus
• Based on network calculus, derived from queueing theory
  – Provides an analytical framework to compose systems
• More precise analysis (bounds), especially for end-to-end analysis
• Can be used with existing models (e.g., periodic)
• Provides a very general representation for modeling systems
55
End-to-End Analysis
• I/O service time includes multiple components
• Analysis must consider all components
  – Worst-case delay for each?
  – Is this bound tight?
• Framework to “compose” individual resources
[Figure: a request flows from Apps through CPU, Tx, Device, and Rx, returning a response]
56
Real-Time Calculus
[Figure: max arrival curve α and min service curve β over Δ; the maximum horizontal distance is the worst-case response time]
57
Real-Time Calculus [Thiele 2000]
[Figure: workload (arrival curves α1, α2) and resources (service curve β1) flow through components, producing α1′, α2′, β1′, β1′′]
58
Composing RT I/O Service
[Figure: Apps, CPU, Tx, Device, and Rx components composed in sequence]
59
Constraint on Output Arrival
• Deconvolution
  – Envelope arrival curve
    • γ – maximum service curve
    • β – min service curve
    • α – input
    • α′ – output
60
Timing Bounds
[Figure: frequency vs. response time; distribution of measured possible times, with observable upper bound, empirical upper bound, actual upper bound, and analytical upper bound marked]
61
Job
[Figure: a job's arrival/release time, start time, completion time, absolute and relative deadlines, response time, and lateness (tardiness) on a timeline]
62
Task
[Figure: a task's jobs on a timeline, showing worst-case execution time (WCET), inter-arrival time, and deadline]
63
Theoretical Analysis
• Non-preemptive job scheduling reduces to bin packing (NP-hard)
[Figure: jobs packed onto a timeline]
64
Real-Time Calculus [Thiele 2000]
• Resource availability in the time interval [s,t) is C[s,t)
[Figure: resource availability between times s and t]
65
Real-Time Calculus [Thiele 2000]
[Figure: arrival curves α1, α2 and service curve β1 flow through components, producing α1′, α2′, β1′, β1′′]

\beta'^{l}(t) = \sup_{0 \le \lambda \le t} \{ \beta^{l}(\lambda) - \alpha^{u}(\lambda) \}
\beta'^{u}(t) = \inf_{\lambda \ge 0} \{ \beta^{u}(\lambda) - \alpha^{l}(\lambda) \}
\alpha'^{l} = \min \{ (\alpha^{l} \oslash \beta^{u}) \otimes \beta^{l},\ \beta^{l} \}
\alpha'^{u} = \min \{ (\alpha^{u} \otimes \beta^{u}) \oslash \beta^{l},\ \beta^{u} \}

(f \otimes g)(t) = \inf_{0 \le \lambda \le t} \{ f(t - \lambda) + g(\lambda) \}
(f \oslash g)(t) = \sup_{\lambda \ge 0} \{ f(t + \lambda) - g(\lambda) \}
66
Real-Time Calculus
[Figure: arrival curve α and min service curve β over Δ; the maximum horizontal distance is the worst-case response time, and the maximum vertical distance is the maximum queue length]

d_{max} \le \sup_{\lambda \ge 0} \{ \inf \{ \tau \ge 0 : \alpha^{u}(\lambda) \le \beta^{l}(\lambda + \tau) \} \}
buf_{max} \le \sup_{\lambda \ge 0} \{ \alpha^{u}(\lambda) - \beta^{l}(\lambda) \}
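The two bounds can be evaluated numerically. This sketch assumes a token-bucket arrival curve and a rate-latency service curve with illustrative parameter values (burst 4, arrival rate 1, service rate 2, service latency 3):

```python
# Worst-case delay (max horizontal distance) and backlog (max vertical
# distance) between an upper arrival curve and a lower service curve,
# computed by brute-force sampling.
B, R_ARR = 4.0, 1.0   # token bucket: burst and sustained arrival rate
R_SRV, T = 2.0, 3.0   # rate-latency service: rate and latency
STEP, HORIZON = 0.01, 20.0

def alpha_u(d):
    return 0.0 if d <= 0 else B + R_ARR * d

def beta_l(d):
    return max(0.0, R_SRV * (d - T))

lams = [i * STEP for i in range(int(HORIZON / STEP))]

def horiz(lam):
    """Smallest tau with alpha_u(lam) <= beta_l(lam + tau)."""
    tau = 0.0
    while alpha_u(lam) > beta_l(lam + tau):
        tau += STEP
    return tau

d_max = max(horiz(lam) for lam in lams)                      # delay bound
buf_max = max(alpha_u(lam) - beta_l(lam) for lam in lams)    # backlog bound
print(d_max, buf_max)
```

For these curves the closed forms are T + B/R_SRV = 5 for the delay and B + R_ARR·T = 7 for the backlog, which the sampled computation reproduces to within the step size.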
67
Network Calculus
[Figure: input x(t) through a system σ(t) produces output y(t); two systems σ1(t), σ2(t) in series behave as a single system]

\sigma_c = \sigma_1 \otimes \sigma_2
y(t) = (\sigma \otimes x)(t) = \inf_{0 \le \lambda \le t} \{ \sigma(t - \lambda) + x(\lambda) \}
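A discrete-time sketch of the min-plus convolution defined above, applied to two assumed rate-latency service curves sampled at integer times (the result has the smaller rate and the sum of the latencies):

```python
def minplus_conv(f, g):
    """(f ⊗ g)(t) = min over 0 <= k <= t of f(t-k) + g(k), on sampled curves."""
    n = min(len(f), len(g))
    return [min(f[t - k] + g[k] for k in range(t + 1)) for t in range(n)]

# Two rate-latency service curves (assumed parameters):
sigma1 = [max(0, 2 * (t - 1)) for t in range(10)]  # rate 2, latency 1
sigma2 = [max(0, 1 * (t - 2)) for t in range(10)]  # rate 1, latency 2
sigma_c = minplus_conv(sigma1, sigma2)
print(sigma_c)
```

The composed curve behaves like rate 1 with latency 3, matching the concatenation rule σ_c = σ1 ⊗ σ2.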
68
69
Real-Time Background
• Explicit timing constraints
  – Finish computation before a deadline
  – Retrieve sensor reading every 5 msecs
  – Display image every 1/30th of a second
• Schedule (online) access to resources to meet timing constraints
• Schedulability analysis (offline)
  – Abstract models
    • Workloads
    • Resources
  – Scheduling algorithm
[Figure: applications App1..Appn sharing resources]
70
Current Research: Analyzing CPU Time for I/Os
• Applications demand CPU time
• Measure the interference
• Ratio of max demand to interval length defines load
• Schedulability (fixed-task priority)
• Characterize I/O CPU time in terms of a load function
[Figure: task under consideration delayed by interference from higher-priority tasks]
71
How to Measure Load
• I/O CPU component at high priority
• Measurement task at low priority
[Figure: high-priority I/O work preempting the low-priority measurement task over time]
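The low-priority measurement idea can be sketched as a task that spins and timestamps each loop iteration; any gap much longer than one iteration is CPU time consumed by higher-priority (e.g., interrupt or I/O) work. This is a user-space approximation; the real experiment would pin the task at a low real-time priority:

```python
import time

def measure_gaps(duration_s=0.1, threshold_s=0.001):
    """Spin for duration_s; record (timestamp, length) of every pause longer
    than threshold_s, i.e., intervals where something else held the CPU."""
    gaps = []
    end = time.perf_counter() + duration_s
    last = time.perf_counter()
    while last < end:
        now = time.perf_counter()
        if now - last > threshold_s:
            gaps.append((last, now - last))
        last = now
    return gaps

interference = measure_gaps()
print(sum(g for _, g in interference))  # total time stolen from the task
```

Summing the gaps over sliding windows of each interval length yields the empirical load function used in the analysis.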
72
Measured Worst-Case Load
[Figure: measured worst-case load curve]
73
Analyzing

\frac{e_k}{d_k} + \sum_{i=1}^{k-1} load^{max}_{\tau_i}(d_k) \le 1

[Figure: task under consideration and interference from higher-priority tasks]
τ1 is a periodic task (WCET = 2, Period = 10)
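The condition above can be checked numerically. In this sketch, `periodic_load_max` is a hypothetical helper that bounds a periodic interferer's load with a coarse ceiling argument; τ1 is the periodic task from the slide (WCET 2, period 10):

```python
import math

def periodic_load_max(wcet, period, interval):
    """Coarse bound on the fraction of `interval` a periodic task can consume:
    at most ceil(interval/period) jobs of `wcet` each, capped at 1."""
    return min(1.0, math.ceil(interval / period) * wcet / interval)

def schedulable(e_k, d_k, higher_priority):
    """Check e_k/d_k + sum of higher-priority max loads over d_k <= 1."""
    total = e_k / d_k
    for wcet, period in higher_priority:
        total += periodic_load_max(wcet, period, d_k)
    return total <= 1.0

# Task with execution time 3 and deadline 8, interfered with by tau1:
print(schedulable(e_k=3.0, d_k=8.0, higher_priority=[(2.0, 10.0)]))
```

In the thesis the load term comes from the measured I/O load function rather than a periodic model; the inequality is evaluated the same way.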
74
Bounding
75
Adjusting the Interference
• May have missed the worst case
• CPU time consumed too high
• Aperiodic servers
  – Force the workload into a specific workload model
  – Example: sporadic server
76
Future Research
• Combine bounding and accounting
  – Accounting
    • Charge the user of services
    • Cannot always charge the correct account
  – Bounding
    • Set aside a separate account
    • If exhausted, disable I/O until the account is replenished
77
Future Research: Practicality of Aperiodic Servers
• Practical considerations
  – Is the implementation correct?
  – Overhead
    • Context switches
    • Latency vs. throughput
[Figure: OS scheduler mediating real-time and non-real-time requests]
Past Research: Throttling
79
Seek Time Amortization
[Figure: seek time amortized across batched requests]
83
[Chart: amortized service time vs. number of simultaneous requests, for 50 Kbyte requests]
84
Example System
• Web services
  – Multimedia
  – Website
• Video surveillance
  – Receive video
  – Intrusion detection
  – Recording
  – Playback
[Figure: local network and Internet clients served by an all-in-one server (CPU, Network)]
How do we make the system work?