Bridging the Tenant-Provider Gap in Cloud Services
Virajith Jalaparti, Hitesh Ballani, Paolo Costa, Thomas Karagiannis, Ant Rowstron
Feb 10, 2016
04/22/2023 Bridging the Tenant-Provider Gap in Cloud Services 2
Today’s Interface to the Cloud
• Resource-centric interface
  – “I want 100 small VMs”
• Per-VM, per-hour pricing
  – E.g.: $0.08 per hour in Amazon EC2
Simple but problematic!
Using the Resource-centric Interface
[Figure: a user runs a job on a private cluster of 40 machines; it is done in T hrs. With 40 VMs from a cloud provider, the same job may take T hrs (cost $40T) or 2T hrs (cost $80T).]
Unpredictable/Variable Performance and Costs
Proposal: Job-centric Interface
[Figure: the user asks the cloud provider to "Finish in T hrs"; the provider runs the job on dedicated resources and it is done in T hrs.]
• Tenant specifies the high-level goals they care about
  – Completion time, cost to run a job, etc.
• Provider determines the resources to use to meet the tenant's goal
  – VMs, network bandwidth, etc.
Proposal: Job-centric Interface
Guaranteed performance for tenants
Proposal: Job-centric Interface
Incentive for provider?
Exploit multi-resource tradeoff
[Figure: Resource trade-off curve for LinkGraph finishing in 300 sec. x-axis: Number of Compute Instances (N), 0 to 40; y-axis: Bandwidth (B) in Mbps, 0 to 300. Marked points: <N,B> = <10, 150> and <N,B> = <20, 100>.]
Increases Goodput/Revenue
Outline
• Motivation
• Job-centric Interface
  – Multi-resource tradeoff
• Bazaar: A Job-centric Cloud Framework
  – Performance Prediction
  – Resource Selection
• Evaluation
• Bazaar: Extensions and Opportunities
• Conclusion
Bazaar: A Job-centric Cloud Framework
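[Figure: the user submits a job specification and a completion time to Bazaar. Performance Prediction produces candidate resource tuples <N1,B1>, <N2,B2>, …, <Nk,Bk>; Resource Selection uses the datacenter state to pick one tuple <N,B>.]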
Focus on MapReduce Applications
Focus on two resources: Compute (N) and Bandwidth (B)
Notation: the resource tuple <N,B> denotes the resources allocated
Performance Prediction
• Well-studied area
  – Run-time profiling, static analysis, simulations
• Bazaar requirements
  – Fast prediction (trades off with accuracy)
  – Account for network along with compute
    • Not addressed by Jockey, MRPerf, Aria
Dedicated N and B make the problem tractable
MRCuTE: Performance Prediction in Bazaar
[Figure: MRCuTE takes the job specification (program P, input data I, sample data Is) and the resource parameters <N, B>, and outputs the completion time. Internally it combines a profiler with an analytical model.]
Analytical Modeling + Profiling based approach
[Figure: a MapReduce job with map tasks feeding reduce tasks, divided into a map phase, a shuffle phase and a reduce phase.]
Completion time is determined by (a) input data size and (b) rate of progress
Job specific: program (P) is profiled using sample data (Is) on one machine
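The slide names only the two inputs to the model: input data size and rate of progress. A rough sketch of how they might combine per phase follows, assuming each phase runs at the aggregate rate of N VMs and the three phases run back to back. `Profile`, its fields and the formula are hypothetical stand-ins for illustration, not MRCuTE's actual model.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical per-job rates, e.g. measured on sample data on one machine."""
    map_rate_mbps: float     # megabits of map input one VM processes per second
    reduce_rate_mbps: float  # megabits of reduce input one VM processes per second
    selectivity: float       # map output size / map input size


def completion_time(input_mbit: float, profile: Profile, n: int, b_mbps: float) -> float:
    """Toy estimate in seconds: each phase takes data size / aggregate rate."""
    intermediate_mbit = input_mbit * profile.selectivity
    map_time = input_mbit / (n * profile.map_rate_mbps)
    # Assumption: every VM can move shuffle data at its full bandwidth B.
    shuffle_time = intermediate_mbit / (n * b_mbps)
    reduce_time = intermediate_mbit / (n * profile.reduce_rate_mbps)
    # Assumption: phases do not overlap.
    return map_time + shuffle_time + reduce_time


if __name__ == "__main__":
    sort_like = Profile(map_rate_mbps=100.0, reduce_rate_mbps=50.0, selectivity=0.5)
    print(completion_time(8000.0, sort_like, n=10, b_mbps=100.0))
```

Because N and B are dedicated, both appear as plain divisors here; with shared resources the rates themselves would vary over time.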
Resource Prediction
MRCuTE(P, I, Is, N, B) → Completion Time
N = MRCuTE^-1(P, I, Is, Completion Time, B)
P, I, Is and Completion Time are user specified
Provider can determine multiple <N, B> resource tuples:
B1 → N1
B2 → N2
B3 → N3
…
Which <N,B> to use?
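One way to picture the inversion is a search over N for each bandwidth value. The sketch below assumes the predicted completion time never increases as N grows, so a binary search finds the smallest feasible N. `min_vms`, `candidate_tuples` and `toy_predict` are hypothetical names, and `toy_predict` is a made-up stand-in for the predictor, not MRCuTE.

```python
from typing import Callable, List, Optional, Tuple

# A predictor maps (N, B in Mbps) to a completion time in seconds.
Predictor = Callable[[int, float], float]


def min_vms(predict: Predictor, deadline_s: float, b_mbps: float, max_n: int) -> Optional[int]:
    """Smallest N in [1, max_n] with predict(N, B) <= deadline, or None.

    Assumes predict is non-increasing in N.
    """
    if predict(max_n, b_mbps) > deadline_s:
        return None
    lo, hi = 1, max_n
    while lo < hi:
        mid = (lo + hi) // 2
        if predict(mid, b_mbps) <= deadline_s:
            hi = mid
        else:
            lo = mid + 1
    return lo


def candidate_tuples(predict: Predictor, deadline_s: float,
                     bandwidths: List[float], max_n: int) -> List[Tuple[int, float]]:
    """One <N, B> tuple per bandwidth value for which the deadline is reachable."""
    tuples = []
    for b in bandwidths:
        n = min_vms(predict, deadline_s, b, max_n)
        if n is not None:
            tuples.append((n, b))
    return tuples


def toy_predict(n: int, b_mbps: float) -> float:
    # Made-up job: 3000 s of compute and 300,000 Mb of shuffle, split over N VMs.
    return 3000.0 / n + 300000.0 / (n * b_mbps)


if __name__ == "__main__":
    print(candidate_tuples(toy_predict, 300.0, [50.0, 100.0, 150.0], max_n=40))
```

Each returned tuple meets the same deadline, which is what leaves the provider free to choose among them.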
Resource Selection
Which <N,B> tuple maximizes the provider’s ability to accept future requests?
Increases Goodput/Revenue
Resource Selection: Example
[Figure: 4 physical machines with 2 VM slots each, under two TOR switches, shown as two replicas (Replica 1 and Replica 2). Tuples in the figure: R1 = <3 VMs, 500Mbps>, R2 = <6 VMs, 200Mbps>, and <4 VMs, 400Mbps>.]
Select the resource tuple (R1 or R2) leading to better goodput
Greedy packing allocation algorithm: Oktopus [Sigcomm’11]
Replica 1 will accept more requests than Replica 2
Resource Selection
• Similar to Multi-dimensional Bin Packing
• Heuristic: Minimize Resource Imbalance Metric
  – Select the <N,B> which balances the remaining capacity across resources
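The heuristic can be sketched as follows. The slides do not define the metric, so this sketch assumes an aggregate view of the datacenter (no topology), charges a tuple N VM slots and N × B Mbps of bandwidth, and scores imbalance as the spread between the largest and smallest remaining capacity fractions. The function names, the resource keys and the 4000 Mbps capacity are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def imbalance(free: Dict[str, float], total: Dict[str, float]) -> float:
    """Spread between the most and least available resource, as capacity fractions."""
    fractions = [free[r] / total[r] for r in total]
    return max(fractions) - min(fractions)


def select_tuple(candidates: List[Tuple[int, float]],
                 free: Dict[str, float],
                 total: Dict[str, float]) -> Optional[Tuple[int, float]]:
    """Pick the feasible <N, B> that leaves the remaining capacity best balanced."""
    best, best_score = None, float("inf")
    for n, b in candidates:
        # Assumption: each of the N VMs reserves B Mbps, so demand is N * B.
        demand = {"vm_slots": n, "bandwidth_mbps": n * b}
        remaining = {r: free[r] - demand[r] for r in total}
        if min(remaining.values()) < 0:
            continue  # this tuple does not fit
        score = imbalance(remaining, total)
        if score < best_score:
            best, best_score = (n, b), score
    return best


if __name__ == "__main__":
    total = {"vm_slots": 8.0, "bandwidth_mbps": 4000.0}  # bandwidth figure is made up
    free = dict(total)                                   # empty datacenter
    print(select_tuple([(3, 500.0), (6, 200.0)], free, total))
```

With these made-up capacities the first tuple leaves 62.5% of both resources free, while the second leaves 25% of the slots and 70% of the bandwidth, so the first is chosen. A different bandwidth capacity could flip the choice.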
Outline
• Motivation
• Job-centric Interface
  – Resource Malleability
• Bazaar: A Job-centric Cloud Framework
  – Performance Prediction
  – Resource Selection
• Evaluation
• Bazaar: Extensions and Opportunities
• Conclusion
Evaluation
• MRCuTE: Prediction accuracy
• Benefits of Bazaar
  – Testbed Deployment
  – Simulations
MRCuTE: Prediction Accuracy
• Setup: Hadoop on a 35-node Emulab cluster, Sort with 200GB of random data
Average prediction error = 8.9%
MRCuTE: Prediction Accuracy
[Figure: % average error (y-axis, 0 to 50) for 5 MapReduce jobs: Sort, WordCount, GridMix, TF-IDF, LinkGraph.]
Average Error < 12%
Overcome prediction inaccuracies using slack
Evaluation: Benefits of Bazaar
• Metrics
  – Fraction of rejected/accepted requests
  – Datacenter Goodput
• Strategies
  – Bazaar: Select <N,B> using resource imbalance metric
  – Baseline: Select <N,B> randomly
• Workload
  – Poisson job arrival process with a target arrival rate
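A Poisson arrival process can be generated by drawing exponentially distributed gaps between consecutive jobs. The sketch below shows one way to do that with Python's standard library; the function name, the seed and the example rate are assumptions, not values from the evaluation.

```python
import random
from typing import List


def poisson_arrivals(rate_per_s: float, horizon_s: float, seed: int = 0) -> List[float]:
    """Arrival times in [0, horizon_s] for a Poisson process with the given rate."""
    rng = random.Random(seed)  # seeded so a workload can be replayed
    t = 0.0
    arrivals: List[float] = []
    while True:
        # Inter-arrival gaps of a Poisson process are exponential with mean 1/rate.
        t += rng.expovariate(rate_per_s)
        if t > horizon_s:
            return arrivals
        arrivals.append(t)


if __name__ == "__main__":
    times = poisson_arrivals(rate_per_s=0.5, horizon_s=60.0)
    print(len(times), "jobs arrive in the first minute")
```

Raising `rate_per_s` makes requests arrive faster, which is how a target arrival rate would translate into load on the datacenter.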
Bazaar: Testbed Deployment
• Working prototype on a 26-node Emulab cluster
  – Workload: 100 Sort jobs
[Figure: gain of Bazaar (%) relative to Baseline (y-axis, 0 to 20) for Accepted Jobs and Goodput. Bar labels: 11.4% more, 15.5% more.]
Bazaar: Simulations
• Datacenter scale: 16,000 machines
• Cross-validated using testbed
Operational occupancy range for services like Amazon EC2 is 70-80%
Bazaar is ~50% better than baseline
Requests arrive faster
Outline
• Motivation
• Job-centric Interface
  – Resource Malleability
• Bazaar: A Job-centric Cloud Framework
  – Performance Prediction
  – Resource Selection
• Evaluation
• Bazaar: Extensions and Opportunities
• Conclusion
04/22/2023 SOCC 2012 - Bazaar 23
Bazaar-T: An extension of Bazaar
• Bazaar trades off N and B
  – Finish jobs “on time”
• Bazaar-T: Exploits flexibility with time
  – Finish jobs “before time”
  – More resources available in future
• Extend resource imbalance metric to time domain
Bazaar-T: More Flexibility, More Gains
[Figure: Bazaar vs. Bazaar-T]
Bazaar-T has 10-20% more goodput than Bazaar
Bazaar: Pricing implications
• Today: Resource-based pricing
  – E.g.: Using 20 VMs for 4hrs costs $80
  – Extendable to multiple resources
  – No incentive for provider to finish in time
• Bazaar enables job-based pricing
  – E.g.: Finish Sort over 200GB in 4hrs costs $100
  – Tenants pay based on job characteristics
  – Aligns tenant and provider interests
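The two pricing examples can be compared with a little arithmetic. The sketch below assumes $1 per VM-hour, which is what 20 VMs for 4 hrs at $80 implies, and treats the job-based price as a flat quote. The function names are hypothetical, and any penalty for a late job is not modeled.

```python
PRICE_PER_VM_HOUR = 1.0  # implied by the slide example: 20 VMs x 4 hrs = $80
JOB_PRICE = 100.0        # slide example: Sort over 200GB finished in 4 hrs


def resource_based_bill(num_vms: int, hours_used: float,
                        price_per_vm_hour: float = PRICE_PER_VM_HOUR) -> float:
    """Tenant pays for every VM-hour consumed, however long the job takes."""
    return num_vms * hours_used * price_per_vm_hour


def job_based_bill(quoted_price: float, hours_used: float) -> float:
    """Tenant pays the quoted price; the run time does not change the bill."""
    return quoted_price


if __name__ == "__main__":
    for hours in (4.0, 8.0):  # on time vs. twice as slow
        print(hours, resource_based_bill(20, hours), job_based_bill(JOB_PRICE, hours))
```

If the job runs twice as long, the resource-based bill doubles from $80 to $160 while the job-based bill stays at $100, so only in the second case does the slowdown cost the provider rather than the tenant.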
Conclusion
• Bazaar: Job-centric Framework for MapReduce
• Win-win situation for provider and tenant
  – Tenants get predictable performance
  – Providers get increased revenue
• Provides new avenues for pricing
Thank You!
Back-up Slides
Related Work
• Performance Prediction
  – MRPerf [Mascots 2009], Mumak
    • Detailed Simulations
  – Elastisizer [SOCC 2011]
    • Detailed Modeling of MapReduce
• SLOs
  – Jockey [Eurosys 2012]
    • Simulations; Runtime monitoring to meet deadline
  – Conductor [NSDI 2012]
    • Solves optimization problem to meet goals
  – Proteus [Sigcomm 2012]
    • Time varying network reservations
  – Aria [ICAC 2011]
    • Profiling and modeling
Hadoop Jobs Details
MRCuTE: Profiling Time
MRCuTE: Accounting for heterogeneity
Addressing Skew- Slack
% of Late Jobs vs. Slack
% of Rejected Requests vs. Slack
Goodput vs. Oversubscription
Rejected Requests vs. Mean BW
Rejected requests vs. Occupancy
Bazaar vs. Fair Sharing