USENIX NSDI 2016, Session: Resource Sharing (notes, 2016-05-29, @oraccha)
Co-located events:
- ACM Symposium on SDN Research 2016 (SOSR), March 13-17
- 2016 Open Networking Summit (ONS), March 14-17
- The 12th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '16), March 17-19
- The 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI '16)
- The USENIX Workshop on Cool Topics in Sustainable Data Centers (CoolDC '16), March 19
Session: Resource Sharing
- Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica (UC Berkeley)
- Cliffhanger: Scaling Performance Cliffs in Web Memory Caches. Asaf Cidon and Assaf Eisenman (Stanford); Mohammad Alizadeh (MIT CSAIL); Sachin Katti (Stanford)
- FairRide: Near-Optimal, Fair Cache Sharing. Qifan Pu and Haoyuan Li (UC Berkeley); Matei Zaharia (MIT); Ali Ghodsi and Ion Stoica (UC Berkeley)
- HUG: Multi-Resource Fairness for Correlated and Elastic Demands. Mosharaf Chowdhury (University of Michigan); Zhenhua Liu (Stony Brook University); Ali Ghodsi and Ion Stoica (UC Berkeley and Databricks Inc.)
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
Who? UC Berkeley AMPLab (the group behind Spark and Mesos); prior papers at SoCC '12, EuroSys '13, OSDI '14, SIGMOD '16.
What? Predict the performance of large-scale analytics jobs under different cluster configurations (instance type and count) before running them at full scale, so users can pick a good configuration.
Do choices matter?
[Figure: time (s) on five same-cost configurations: 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large. Left: matrix multiply (400K by 1K), network bound. Right: QR factorization (1M by 1K), memory-bandwidth bound. The two workloads favor different configurations.]
Do choices matter? Matrix multiply:
[Figure: time (s) for a 400K-by-1K matrix multiply on the same five configurations (1 r3.8xlarge through 16 r3.large); every configuration has 16 cores, 244 GB of memory, and costs $2.66/hr, yet running times differ substantially.]
Keystone-ML TIMIT pipeline: RawData -> Cosine Transform -> Normalization -> Linear Solver, ~100 iterations.
Properties: iterative (each iteration runs many jobs), long running and expensive, numerically intensive.
[Figure: actual vs. ideal scaling of time (s) as cores grow from 0 to ~600 (r3.4xlarge instances, QR factorization 1M by 1K); actual scaling falls short of ideal.]
Do choices matter? Yes: jobs combine computation and communication, and the mix scales non-linearly with the number of machines.
Ernest: How? Run a handful of cheap training jobs on small samples of the input with small numbers of machines, then fit a performance model and predict at scale.
Optimal design of experiments: rather than running every combination of input fraction (1%, 2%, 4%, 8%) and machine count (1, 2, 4, 8), Ernest picks the most informative (input, machines) training points via optimal experiment design, solved with an off-the-shelf solver (CVX).
Using Ernest: given the job binary and the target machine counts and input sizes, run the training jobs chosen by experiment design, then fit the linear model. For iterative workloads, a few iterations suffice for training.
Ernest's basic model:

    time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines

The terms correspond to serial execution (x1), linearly parallelizable computation (x2), a tree (aggregation) DAG (x3), and an all-to-one DAG (x4). Collect training data, then fit a linear regression.
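As a concrete illustration, the model can be fit with least squares (the paper uses a non-negative least squares solver; the pure-Python normal-equations fit below is a simplified sketch, and all function names are mine):

```python
import math

def features(input_size, machines):
    # Ernest's basic feature vector: serial, computation, tree DAG, all-to-one DAG.
    return [1.0, input_size / machines, math.log(machines), float(machines)]

def fit_least_squares(X, y):
    # Solve the normal equations (X^T X) w = X^T y by Gaussian elimination.
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    for col in range(n):                      # forward elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                             # back substitution
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def predict(w, input_size, machines):
    return sum(wi * fi for wi, fi in zip(w, features(input_size, machines)))
```

Given timings collected on the small (input, machines) grid above, `fit_least_squares` recovers the four coefficients, and `predict` extrapolates to larger clusters and inputs.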
Ernest: Results
Training time (Keystone-ML): TIMIT pipeline on r3.xlarge instances, 100 iterations, 42 machines. Experiment design needs only 7 data points, using up to 16 machines and up to 10% of the data. [Figure: training time vs. actual running time (s).]
Is experiment design useful? [Figure: prediction error (%) on Regression, Classification, KMeans, PCA, and TIMIT workloads, comparing experiment-design-based selection of training points against a cost-based baseline.]
Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
Who? Stanford CS; first author is CEO of Sookasa; prior papers at SIGCOMM '12 and USENIX ATC '13 and '15.
What? Performance cliffs in web memory caches (Memcached and its slab allocator): hit-rate curves are not smooth, so small changes in cache size can cause disproportionate changes in hit rate.
[Figure: hit-rate curve, hit rate vs. number of items in the LRU queue (Application 19, Slab 0), with its concave hull; the gap between the curve and its hull is a performance cliff (cf. Talus [HPCA '15]).]
Cliffs matter because hit rates are already very high: the cache hit rate of Facebook's Memcached pool is 98.2% [SIGMETRICS '12], and a +1% hit rate can translate into a ~35% speedup.
Cliffhanger: How? Shadow queues estimate the local gradient of each queue's hit-rate curve. Two algorithms:
- Hill climbing: incrementally move memory toward the queue (slab class) whose hit-rate curve is steepest.
- Cliff scaling: detect and remove performance cliffs.
Using shadow queues to estimate the local gradient: each physical queue gets a shadow queue that stores only the keys of recently evicted items. A hit in a queue's shadow queue signals that the queue would benefit from more memory, earning it credits (e.g., queue 1 at +2, queue 2 at -2); once the credit gap is large enough, the queues are resized accordingly.
Cliffhanger runs both algorithms in parallel: the original queue is partitioned into two queues around a pointer; the left and right of the pointer are tracked for cliff scaling, while hill climbing is tracked at the same time.
- Algorithm 1: incrementally optimizes memory across queues, both across slab classes and across applications.
- Algorithm 2: scales performance cliffs.
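The shadow-queue/hill-climbing idea can be sketched in a few lines of Python. This is a toy reconstruction, not the paper's implementation: `ShadowLRU`, the credit threshold, and the resize step size are all illustrative choices of mine.

```python
from collections import OrderedDict

class ShadowLRU:
    """An LRU queue plus a shadow queue holding only the keys of recently
    evicted items. A shadow-queue hit means this queue would have hit with
    slightly more memory: a local hit-rate gradient signal."""

    def __init__(self, capacity, shadow_capacity):
        self.capacity = capacity
        self.shadow_capacity = shadow_capacity
        self.items = OrderedDict()   # physical queue (key -> value)
        self.shadow = OrderedDict()  # shadow queue (keys only)

    def access(self, key):
        """Returns 'hit', 'shadow_hit', or 'miss'."""
        if key in self.items:
            self.items.move_to_end(key)
            result = "hit"
        else:
            result = "shadow_hit" if key in self.shadow else "miss"
            self.shadow.pop(key, None)
            self.items[key] = True
        while len(self.items) > self.capacity:       # evict into the shadow queue
            evicted, _ = self.items.popitem(last=False)
            self.shadow[evicted] = True
        while len(self.shadow) > self.shadow_capacity:
            self.shadow.popitem(last=False)
        return result

def hill_climb(queues, trace, step=1, threshold=4):
    """Shift memory toward the queue whose shadow queue is hit most often."""
    credits = [0] * len(queues)
    for qid, key in trace:
        if queues[qid].access(key) == "shadow_hit":
            credits[qid] += 1
        hi = max(range(len(queues)), key=lambda q: credits[q])
        lo = min(range(len(queues)), key=lambda q: credits[q])
        if credits[hi] - credits[lo] >= threshold and queues[lo].capacity > step:
            queues[lo].capacity -= step
            queues[hi].capacity += step
            credits = [0] * len(queues)
```

Driving two such queues with a trace where queue 0's working set (10 keys) exceeds its share while queue 1's (2 keys) fits comfortably shifts capacity from queue 1 to queue 0 while keeping the total constant.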
Cliffhanger vs. FairRide: Cliffhanger maximizes overall hit rate; fairness across users is the focus of the next paper, FairRide.
Cliffhanger reduces misses and can save memory: average misses reduced by 36.7%; average potential memory savings of 45%. Cliffhanger also outperforms default and optimized schemes, with an average hit-rate increase of 1.2%.
FairRide: Near-Optimal, Fair Cache Sharing
Who? UC Berkeley AMPLab; prior papers at MobiCom '13 and SIGCOMM '15.
What? Cache sharing with three desirable properties: isolation guarantee, strategy-proofness, and Pareto efficiency.
Setup: applications share a cache in front of a backend (storage/network). Two baselines:
- Statically allocated per-user caches: isolation and strategy-proofness, but low utilization and no data sharing.
- One globally shared cache: higher utilization and shared data, but no isolation.
What we want is all three properties at once: isolation guarantee, strategy-proofness, and Pareto efficiency ("SIP" in the slides). Candidate policies (max-min fairness, priority allocation, max-min rate, static allocation) each satisfy only a subset, and no policy can satisfy all three. FairRide is near-optimal: it trades a small amount of Pareto efficiency for the other two.
FairRide: How? Start from max-min fairness and add probabilistic blocking of free-riders as a dis-incentive to cheat. Implemented on Alluxio (formerly Tachyon) [SoCC '14].
USENIX Association 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16) 397
Figure 3: Example with 2 users, 3 files, and total cache size of 2. Numbers represent access frequencies. (a) Allocation under max-min fairness; (b) allocation under max-min fairness when the second user makes spurious accesses (red line) to file C; (c) blocking the free-riding access (blue dotted line).
3.3 Cheating
While max-min fairness is strategy-proof when users access different files, this is no longer the case when files are shared. There are two types of cheating that could break strategy-proofness: (1) intuitively, when files are shared, a user can free-ride on files that have already been cached by other users; (2) a thrifty user can choose to cache files that are shared by more users, as such files are more economic due to cost-sharing.
Free-riding. To illustrate free-riding, consider two users: user 1 accesses files A and B, and user 2 accesses files A and C. Assume the cache size is 2, and that we can cache a fraction of a file. Next, assume that every user uses the LFU replacement policy and that both users access A much more frequently than the other files. As a result, the system will cache file A and charge each user 1/2. In addition, each user will get half of their other file in the cache, i.e., half of file B for user 1 and half of file C for user 2, as shown in Figure 3(a). Each user gets a cache hit rate of 5 * 0.5 + 10 = 12.5 hits/sec.[1]
Now assume user 2 cheats by spuriously accessing file C to artificially increase its access rate so that it exceeds A's access rate (Figure 3(b)), effectively setting the priority of C higher than B. Since now C has the highest access rate for user 2, while A remains the most accessed file of user 1, the system will cache A for user 1 and C for user 2, respectively. The problem is that user 2 will still be able to benefit from accessing file A, which has already been cached by user 1. In the end, user 1 gets 10 hits/sec, and user 2 gets 15 hits/sec. In this way, user 2 free-rides on user 1's file A.
Thrifty-cheating. To explain the kind of cheating where a user carefully calculates cost-benefits and then changes file priorities accordingly, we first define cost/(hit/sec) as the amount of budget cost a user pays to get a 1 hit/sec access rate for a unit file. To optimize over the utility, which is defined as the total hit rate, a user's optimal strategy is not to cache the files she accesses most frequently, but the ones with the lowest cost/(hit/sec). Compare a file of 100MB shared by 2 users and another file of 100MB shared by 5 users. Even though a user accesses the former 10 times/sec and the latter only 8 times/sec, it is overall economic to cache the second file (comparing 5MB/(hit/sec) vs. 2.5MB/(hit/sec)).

[1] When half of a file is in the cache, half of the page-level accesses to the file will result in cache misses. Numerically, this is equal to missing the entire file 50% of the time. So hit rate is calculated as access rate multiplied by the percentage cached.
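The cost/(hit/sec) comparison above is easy to reproduce; a minimal sketch, with a helper name of my own choosing:

```python
def cost_per_hit(size_mb, n_sharers, hits_per_sec):
    # Each of the n sharers pays an equal fraction of the file's size;
    # cost per unit of hit rate is that share divided by the access rate.
    return (size_mb / n_sharers) / hits_per_sec

# 100MB file shared by 2 users, accessed 10 times/sec:
assert cost_per_hit(100, 2, 10) == 5.0   # 5 MB/(hit/sec)
# 100MB file shared by 5 users, accessed only 8 times/sec:
assert cost_per_hit(100, 5, 8) == 2.5    # 2.5 MB/(hit/sec): cheaper to cache
```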
The consequence of thrifty-cheating, however, is more complicated. While it might appear to improve user and system performance at first glance, it doesn't lead to an equilibrium where all users are content with their allocations. This can cause users to constantly game the system, which leads to a worse outcome.
In the above examples we have shown that one can experience utility loss due to another user's cheating. A natural question to ask is: how bad can it be? That is, what is the upper bound on what a user can lose when being cheated? By construction, one can show that for two-user cases, a user can lose up to 50% of cache/hit rate when all her files are shared and free-ridden by the other, strategic user. As the free-rider evades charges for shared files, the honest user double-pays. This can be extended to the more general case of n (n > 2) users, where the loss can increase linearly with the number of cheating users. Suppose that cached files are shared by n users, so each user pays 1/n of the file sizes. If n - 1 strategic users decide to cache other files, the only honest user left has to pay the total cost. In turn, the honest user has to evict up to (n-1)/n of her files to maintain the same budget.
It is also worth mentioning that for many applications, moderate or even minor cache loss can result in a drastic performance drop. For example, in many file systems with an overall high cache hit ratio, the effective I/O latency with caching can be approximated as T_IO = Ratio_miss * Latency_miss. A slight difference in the cache hit ratio, e.g., from 99.7% to 99.4%, means a 2x higher average I/O latency! This indeed necessitates strategy-proofness in cache policies.
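A quick arithmetic check of that claim (the helper name is hypothetical):

```python
def latency_ratio(hit_ratio_old, hit_ratio_new):
    # With T_IO ~= Ratio_miss * Latency_miss, the relative change in
    # effective I/O latency is just the ratio of the miss ratios.
    return (1 - hit_ratio_new) / (1 - hit_ratio_old)

# Dropping the hit ratio from 99.7% to 99.4% doubles the miss ratio,
# and hence the approximate effective I/O latency.
assert abs(latency_ratio(0.997, 0.994) - 2.0) < 1e-6
```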
3.4 Blocking Access to Avoid Cheating
At the heart of providing strategy-proofness is the question of how free-riding can be prevented. In the previous example, user 2 was incentivized to cheat because she was able to access the cached shared files regardless of her access patterns. Intuitively, if user 2 is blocked from accessing files that she tries to free-ride, she will be dis-incentivized from cheating.
Applying blocking to our previous example, user 2 will not be allowed to access A, despite the fact that user 1 has already cached A (Figure 3(c)). The system blocks
Probabilistic blocking: FairRide blocks a user with probability p(nj) = 1/(nj + 1), where nj is the number of other users caching file j; e.g., p(1) = 50%, p(4) = 20%. This is the best one can do in the general case: any less blocking does not prevent cheating.
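A minimal sketch of the blocking rule (function names are mine, not FairRide's API):

```python
import random

def blocking_probability(n_other_cachers):
    # FairRide blocks a non-caching user's access to a cached file j with
    # probability 1 / (n_j + 1), where n_j is the number of OTHER users
    # already caching file j.
    return 1.0 / (n_other_cachers + 1)

def serve(user_caches_file, n_other_cachers, rng=random.random):
    """Return True if the access is served from cache, False if blocked.

    Users who cache the file themselves are never blocked; free-riders
    are blocked probabilistically, which removes the incentive to cheat."""
    if user_caches_file:
        return True
    return rng() >= blocking_probability(n_other_cachers)

assert blocking_probability(1) == 0.5   # p(1) = 50%
assert blocking_probability(4) == 0.2   # p(4) = 20%
```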
FairRide: Results
[Figure: miss ratio (%) of user 1 and user 2 over time (0-1050 s), as first user 2 and then user 1 cheats.] FairRide dis-incentivizes users from cheating.
[Figure: average response time (ms) in the Facebook experiments.] FairRide outperforms max-min fairness by 29%.
[Figure: reduction in median job time (%) per job bin (#tasks 1-10, 11-50, 51-100, 101-500, 501-), comparing max-min fairness and FairRide.]
HUG: Multi-Resource Fairness for Correlated and Elastic Demands
Who? UC Berkeley AMPLab; works on coflow-based networking, multi-resource allocation in datacenters, compute and storage for big data, and network virtualization; SIGCOMM papers, DRF [NSDI '11], FairCloud [SIGCOMM '12].
What? [Figure: machines M1..MN behind a congestion-less core, with links L1..L2N shared by tenant-A's and tenant-B's VMs.] How to share the links between multiple tenants so as to (1) provide optimal performance guarantees and (2) maximize utilization?
HUG: Highest Utilization with the Optimal Isolation Guarantee.
[Figure: utilization vs. isolation guarantee. Per-flow fairness and PS-P are work-conserving but give a low isolation guarantee; DRF gives the optimal isolation guarantee but low utilization; HUG achieves both.]
- HUG in the cooperative setting: (1) optimal isolation guarantee, (2) work conservation.
- HUG in the non-cooperative setting: (1) optimal isolation guarantee, (2) highest utilization, (3) strategy-proofness.
Intuitively, we want to maximize the minimum progress over all tenants, i.e., maximize min_k M_k, where min_k M_k corresponds to the isolation guarantee of an allocation algorithm. We make three observations. First, when there is a single link in the system, this model trivially reduces to max-min fairness. Second, getting more aggregate bandwidth is not always better. For tenant-A in the example, (50 Mbps, 100 Mbps) is better than (90 Mbps, 90 Mbps) or (25 Mbps, 200 Mbps), even though the latter ones have more bandwidth in total. Third, simply applying max-min fairness to individual links is not enough. In our example, max-min fairness allocates equal resources to both tenants on both links, resulting in allocations of 1/2 on both links (Figure 1b). The corresponding progress (M_A = M_B = 1/2) results in a suboptimal isolation guarantee (min{M_A, M_B} = 1/2).
Dominant Resource Fairness (DRF) [33] extends max-min fairness to multiple resources and prevents such suboptimality. It equalizes the shares of dominant resources (link-2 for tenant-A, link-1 for tenant-B) across all tenants with correlated demands and maximizes the isolation guarantee in a strategyproof manner. As shown in Figure 1c, using DRF, both tenants have the same progress M_A = M_B = 2/3, 33% higher than using max-min fairness on individual links. Moreover, DRF's isolation guarantee (min{M_A, M_B} = 2/3) is optimal across all possible allocations and is strategyproof.
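For this two-tenant example, the equal-progress allocation that DRF arrives at can be reproduced with a tiny feasibility binary search. This is a sketch of the example only, not the DRF or HUG algorithms, and the function name is mine; demand vectors follow the paper's Figure 1 (tenant-A at (1, 1/2), tenant-B at (1/2, 1), unit link capacities).

```python
def isolation_guarantee(demands, capacity=1.0, iters=60):
    """Largest common progress M such that giving every tenant k the
    allocation M * d_k on each link fits within link capacities.

    demands: list of per-tenant correlation vectors, e.g. [[1, 0.5], [0.5, 1]].
    """
    n_links = len(demands[0])
    # Upper bound: no tenant's smallest positive demand can exceed capacity.
    hi = capacity / max(min(d for d in vec if d > 0) for vec in demands)
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        feasible = all(sum(vec[l] for vec in demands) * mid <= capacity
                       for l in range(n_links))
        lo, hi = (mid, hi) if feasible else (lo, mid)
    return lo
```

On the paper's example this yields M = 2/3, versus the progress of 1/2 that per-link max-min fairness achieves.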
However, DRF assumes inelastic demands [40], and it is not work-conserving. For example, the shaded bandwidth on link-2 is not allocated to either tenant. In fact, we show that DRF can result in arbitrarily low utilization (Lemma 6). This is wasteful, because unused bandwidth cannot be recovered.
We start by showing that strategy-proofness is a necessary condition for providing the optimal isolation guarantee, i.e., to maximize min_k M_k, in non-cooperative environments (Section 2). Next, we prove that work conservation, i.e., when tenants are allowed to use unallocated resources, such as the shaded area in Figure 1c, without constraints, spurs a race to the bottom. It incentivizes each tenant to continuously lie about her demand correlations, and in the process, it decreases the amount of useful work done by all tenants! Meaning, simply making DRF work-conserving can do more harm than good.
We propose a two-stage algorithm, High Utilization with Guarantees (HUG), to achieve our goals (Section 3). Figure 2 surveys the design space for cloud network sharing and places HUG in context by following the thick lines. At the highest level, unlike many alternatives [13, 14, 37, 44], HUG is a dynamic allocation algorithm. Next, HUG enforces its allocations at the tenant-/network-level, because flow- or (virtual) machine-level allocations [61, 62] do not provide an isolation guarantee.
Due to the hard tradeoff between optimal isolation
Figure 2: Design space for cloud network sharing (summarized):
- Dynamic sharing
  - Flow-level (per-flow fairness): no isolation guarantee
  - VM-level (Seawall, GateKeeper): no isolation guarantee
  - Tenant-/network-level
    - Non-cooperative environments (require strategy-proofness): highest utilization for optimal isolation guarantee (HUG)
    - Cooperative environments (do not require strategy-proofness): low utilization with optimal isolation guarantee (DRF); work-conserving with suboptimal isolation guarantee (PS-P, EyeQ, NetShare); work-conserving with optimal isolation guarantee (HUG)
- Reservation (SecondNet, Oktopus, Pulsar, Silo): uses admission control
guarantee and work conservation in non-cooperative environments, HUG ensures the highest utilization possible while maintaining the optimal isolation guarantee. It incentivizes tenants to expose their true demands, ensuring that they actually consume their allocations instead of causing collateral damage. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee. In contrast, existing solutions [33, 45, 51, 58, 59] are suboptimal in both environments. Overall, HUG generalizes single- [25, 43, 55] and multi-resource max-min fairness [27, 33, 38, 56] and multi-tenant network sharing solutions [45, 51, 58, 59, 61, 62] under a unifying framework.
HUG is easy to implement and scales well. Even with 100,000 machines, new allocations can be centrally calculated and distributed throughout the network in less than a second, faster than suggested in the literature [13]. Moreover, each machine can locally enforce HUG-calculated allocations using existing traffic control tools without any changes to the network (Section 4).
We demonstrate the effectiveness of our proposal using EC2 experiments and trace-driven simulations (Section 5). In non-cooperative environments, HUG provides the optimal isolation guarantee, which is 7.4x higher than existing network sharing solutions like PS-P [45, 58, 59] and 7000x higher than traditional per-flow fairness, and 1.4x better utilization than DRF for production traces. In cooperative environments, HUG outperforms PS-P and per-flow fairness by 1.48x and 17.35x in terms of the 95th percentile slowdown of job communication stages, and 70% of jobs experience lower slowdown w.r.t. DRF.
We discuss current limitations and future research in Section 6 and compare HUG to related work in Section 7.
HUG: evaluation on 100 EC2 machines. Tenants A and C use pairwise one-to-one communication; tenant B uses all-to-all communication.
Figure 10: [EC2] Bandwidth consumption of three tenants arriving over time in a 100-machine EC2 cluster. Each tenant has 100 VMs, but each uses a different communication pattern (Section 5.1.1). We observe that (a) using TCP, tenant-B dominates the network by creating more flows; (b) HUG isolates tenants A and C from tenant B.
flow fairness, PS-P [58], and DRF [33] (Section 5.2). Finally, we evaluate HUG's long-term impact on application performance using a 3000-machine Facebook cluster trace used by Chowdhury et al. [23] and compare against per-flow fairness, PS-P, DRF, as well as Varys, which focuses only on improving performance (Section 5.3).
5.1 Testbed Experiments
Methodology. We performed our experiments on 100 m2.4xlarge Amazon EC2 [2] instances running Linux kernel 3.4.37 and used the default htb and tc implementations. While there exist proposals for more accurate qdisc implementations [45, 57], the default htb worked sufficiently well for our purposes. Each of the machines had 1 Gbps NICs, and we could use close to the full 100 Gbps bandwidth simultaneously.
5.1.1 Network-Wide Isolation
We consider a cluster with 100 EC2 machines, divided between three tenants A, B, and C that arrive over time. Each tenant has 100 VMs; i.e., VMs Ai, Bi, and Ci are collocated on the i-th physical machine. However, they have different communication patterns: tenants A and C have pairwise one-to-one communication patterns (100 VM-VM flows each), whereas tenant-B follows an all-to-all pattern using 10,000 flows. Specifically, Ai communicates with A_((i+50) mod 100), Cj communicates with C_((j+25) mod 100), and any Bk communicates with all Bl, where i, j, k, l are in {1, ..., 100}. Each tenant demands the entire capacity at each machine; hence, the entire capacity of the cluster should be equally divided among the active tenants to maximize isolation guarantees.
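The one-to-one peer pattern is simple modular arithmetic; a sketch (0-indexed for simplicity, whereas the paper is 1-indexed):

```python
def one_to_one_peer(i, shift, n=100):
    # Tenant-A/C pairwise pattern from Section 5.1.1: VM i talks to
    # VM (i + shift) mod n. Tenant-A uses shift 50, tenant-C shift 25.
    return (i + shift) % n

assert one_to_one_peer(0, 50) == 50
assert one_to_one_peer(75, 50) == 25
assert one_to_one_peer(90, 25) == 15
```

With shift 50 and n = 100 the pattern is symmetric: applying it twice returns to the original VM, so each VM pair carries exactly one flow in each direction.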
Figure 10a shows that as soon as tenant-B arrives, she takes up the entire capacity in the absence of an isolation guarantee. Tenant-C receives only a marginal share, as she arrives after tenant-B and leaves before her. Note that tenant-A (when alone) uses only about 80% of the available capacity; this is simply because just one TCP flow per VM-VM pair often cannot saturate the link.
Figure 10b presents the allocation using HUG. As tenants arrive and depart, allocations are dynamically calculated, propagated, and enforced in each machine of the cluster. As before, tenants A and C use marginally less than their allocations because they create only one flow between each VM-VM pair.
5.1.2 Scalability
The key challenge in scaling HUG is its centralized resource allocator, which must recalculate tenant shares and redistribute them across the entire cluster whenever any tenant changes her correlation vector.
We found that the time to calculate new allocations using HUG is less than 5 microseconds in our 100-machine cluster. Furthermore, a recomputation due to a tenant's arrival, departure, or change of correlation vector would take about 8.6 milliseconds on average for a 100,000-machine datacenter.
Communicating a new allocation takes less than 10 milliseconds for 100 machines and around 1 second for 100,000 emulated machines (i.e., sending the same message 1000 times to each of the 100 machines).
5.2 Instantaneous Fairness
While Section 5.1 evaluated HUG in controlled, synthetic scenarios, this section focuses on HUG's instantaneous allocation characteristics in the context of a large-scale cluster.
Methodology. We use a one-hour snapshot with 100 concurrent jobs from a production MapReduce trace, which was extracted from a 3200-machine Facebook cluster by Popa et al. [58, Section 5.3]. Machines are connected to the network using 1 Gbps NICs. In the trace, a job with M mappers and R reducers, and hence a corresponding M x R shuffle, is described as a matrix with the amount of data to transfer between each M-R pair. We calculated the correlation vectors of individual shuffles from their communication matrices ourselves, using the optimal rate allocation algorithm for a single shuffle [22, 23] and ensuring all the flows of each shuffle finish simultaneously.
Given the workload, we calculate the progress of each job/shuffle using different allocation mechanisms and
Closing notes: UCB AMPLab has a strong presence in this NSDI session, and Facebook trace data shows up throughout the evaluations. The NSDI 2016 proceedings and slides are available online.