Scheduling and Resource Management for Next-generation Clusters
Yanyong Zhang, Penn State University
www.cse.psu.edu/~yyzhang
What is a Cluster?
•Cost effective
•Easily scalable
•Highly available
•Readily upgradeable
Scientific & Engineering Applications
• HPTi won a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)
• Sandia's expansion of their Alpha-based C-plant system.
• Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
• A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 ….
(http://www.swiss.ai.mit.edu/~pas/p/sc95.html)
• The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide ….
(http://www.osc.edu/press/releases/2001/approved.shtml)
Commercial Applications
• Business applications
  – Transaction Processing (IBM DB2, Oracle, …)
  – Decision Support Systems (IBM DB2, Oracle, …)
• Internet applications
  – Web serving / searching (Google.com, …)
  – Infowares (Yahoo.com, AOL.com)
  – Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
  – Computing portals
Resource Management
• Each application is demanding
• Several applications/users can be present at the same time

Resource management and quality-of-service become important.
System Model
• Each node is independent
• Maximum MPL (multiprogramming level) per node
• Arrival queue
[Figure: nodes P0 through P4 connected by a high-speed network, fed from an arrival queue]
Two Phases in Resource Management
• Allocation issues
  – Admission control
  – Arrival-queue principle
• Scheduling issues (CPU scheduling)
  – Resource isolation
  – Co-allocation
Co-allocation / Co-scheduling
[Figure: time-space diagram of a SEND on P0 crossing the switch to a RECV on P1 between t0 and t1, illustrating scheduling skewness]
Outline
• From the OS's perspective
  – Contribution 1: boosting CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications
• From the application's perspective
  – Contribution 4: optimizing clustered DB2
Contribution 1: Boosting CPU Utilization at Supercomputing Centers

Objective: minimize slowdown, where

Response Time = Wait Time (in the arrival queue and the ready/blocked queue) + Execute Time
slowdown = Response Time / Execute Time in Isolation
Existing Techniques
• Backfilling (BF)
• Gang Scheduling (GS)
• Migration (M)
[Figure: space-time chart packing jobs of various widths onto 14 CPUs]
Proposed Scheme
• MBGS = GS + BF + M
  – Use GS as the basic framework
  – At each row of the GS matrix, apply the BF technique
  – Whenever the GS matrix is re-calculated, consider M
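The BF step can be sketched as EASY backfilling: a queued job may start out of order only if it cannot delay the reserved start of the job at the head of the queue. This is a hypothetical illustration (the job tuples and function name are invented for the sketch), not the thesis scheduler:

```python
def easy_backfill(total_cpus, running, queue, now):
    """EASY backfilling sketch.
    running: list of (cpus, finish_time); queue: FIFO list of (cpus, runtime).
    Mutates queue; returns the jobs started at time `now`."""
    free = total_cpus - sum(c for c, _ in running)
    started = []

    # Start jobs from the head of the queue while they fit.
    while queue and queue[0][0] <= free:
        cpus, runtime = queue.pop(0)
        free -= cpus
        running.append((cpus, now + runtime))
        started.append((cpus, runtime))
    if not queue:
        return started

    # Shadow time: earliest moment the blocked head job is guaranteed to start.
    head_cpus = queue[0][0]
    avail, shadow = free, now
    for cpus, finish in sorted(running, key=lambda j: j[1]):
        avail += cpus
        if avail >= head_cpus:
            shadow = finish
            break
    extra = avail - head_cpus  # CPUs still spare once the head job starts

    # Backfill later jobs that fit now and cannot delay the head job:
    # either they end before the shadow time, or they use only spare CPUs.
    for job in list(queue[1:]):
        cpus, runtime = job
        if cpus <= free and (now + runtime <= shadow or cpus <= extra):
            queue.remove(job)
            free -= cpus
            if now + runtime > shadow:
                extra -= cpus  # it occupies spare CPUs past the shadow time
            started.append(job)
    return started
```

For example, with 14 CPUs, one running job of width 8 finishing at t=10, and a queue [(10, 5), (2, 3), (6, 20)], only the short (2, 3) job is backfilled: it finishes before the 10-wide head job's reserved start.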
How Does MBGS Perform?
Contribution 2: Reducing Response Times for Commercial Applications

Objective: minimize wait time and response time, where

Response Time = Wait Time (in the arrival queue and the ready/blocked queue) + Execute Time
Previous Work I: Gang Scheduling (GS)
GS is not responsive enough! Its time quanta run for MINUTES, and CPU time is wasted on idle slots.
Previous Work II: Dynamic Co-scheduling
[Figure: nodes P0 through P3 running jobs B, D, A, C. B just gets a message, C just finishes I/O, everybody else is blocked, and it is A's turn.]
The scheduler on each node makes independent decisions based on local events, without global synchronizations.
Dynamic Co-scheduling Heuristics

Columns: how do you wait for a message? Rows: what do you do on message arrival?

                          Busy Wait   Spin Block   Spin Yield
No Explicit Reschedule    Local       SB           SY
Interrupt & Reschedule    DCS         DCS-SB       DCS-SY
Periodically Reschedule   PB          PB-SB        PB-SY
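The Spin-Block (SB) cell, for instance, can be sketched as follows. This is an illustrative model (poll, block_wait, and the spin limit are invented names), not the actual kernel implementation:

```python
import time

def sb_receive(poll, block_wait, spin_limit_us=50):
    """Spin-Block: spin-poll for up to spin_limit_us microseconds, then block.
    poll() returns a message or None; block_wait() blocks until one arrives."""
    deadline = time.monotonic() + spin_limit_us / 1e6
    while True:
        msg = poll()
        if msg is not None:
            return msg              # arrived during the spin window: no context switch
        if time.monotonic() >= deadline:
            break
    return block_wait()             # yield the CPU so another process can run
```

The spin limit trades wasted cycles (spin too long) against context-switch overhead (block too eagerly); picking it well is exactly what the analytical model later in the talk addresses.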
Simulation Study
• A detailed simulator at a microsecond granularity
• System parameters
  – System configurations (maximum MPL, to partition or not)
  – System overheads (context-switch overheads, interrupt costs, costs associated with manipulating queues)

Simulation Study (Cont'd)
• Application parameters
  – Injection load
  – Characteristics (CPU intensive, I/O intensive, communication intensive, or somewhere in the middle)
Impact of Load
[Figure]
Impact of Workload Characteristics
[Figure: two panels, communication intensive and I/O intensive]
Periodic Boost Heuristics

Process states:
• S1: Compute Phase
• S2: S1 + Unconsumed Msg.
• S3: Recv. + Msg. Arrived
• S4: Recv. + No Msg.

Boosting orders compared:
• A: S3 -> {S2, S1}
• B: S3 -> S2 -> S1
• C: {S3, S2, S1}
• D: {S3, S2} -> S1
• E: S2 -> S3 -> S1

[Figure: average job response time (x10,000 seconds, roughly 2.3 to 2.9) for heuristics A through E]
Analytical Modeling Study
• The exact state space is impossible to handle.
[Figure: nodes P0 through Pp on a high-speed network, with dynamic job arrivals]

Analysis Description
• Original state space (impossible to handle!!): the global state X jointly tracks, for every node k, the jobs in its arrival, ready, and blocked queues and their phases.
• Assumption: the state of each processor is stochastically independent and identical to the state of the other processors.
• Reduced state space (much more tractable!!): a per-node state Y tracks only the number and phases of the jobs on a single node.
Analysis Description (Cont'd)
• Derive the state-transition rates using a continuous-time Markov model, and build the generator matrix Q.
• Obtain the invariant probability vector π by solving πQ = 0 and πe = 1.
• Use fixed-point iteration to get the solution.
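The solve step can be illustrated with a two-state chain whose stationary vector is known in closed form (the rates a and b are made up; the thesis model is far larger and closed by fixed-point iteration):

```python
# For a 2-state continuous-time Markov chain with generator
#   Q = [[-a, a], [b, -b]],
# solving pi Q = 0 together with pi0 + pi1 = 1 gives pi = (b, a) / (a + b).
a, b = 3.0, 1.0                     # hypothetical transition rates
Q = [[-a, a], [b, -b]]

pi = (b / (a + b), a / (a + b))     # stationary distribution

# Check the balance equations pi Q = 0 and the normalization pi e = 1.
for j in range(2):
    assert abs(sum(pi[i] * Q[i][j] for i in range(2))) < 1e-12
assert abs(sum(pi) - 1.0) < 1e-12
```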
SB Example
[Figure: Markov transition diagram for the Spin-Block model. Each state records the phases of the two jobs on a node (Compute, I/O, Spin, Blocked, Queued); transitions carry arrival and service rates together with the fixed-point rates r1, r2, ..., each a weighted sum of the stationary probabilities of the partner processes' states on the other nodes.]
Results
[Figure: optimal PB frequency; optimal spin time for SB]

Results: Optimal Quantum Length
[Figure: optimal quantum length for communication-intensive, CPU-intensive, and I/O-intensive workloads]
Contribution 3: Scheduling Multiple Classes of Applications

Real-time, interactive, and batch applications share the cluster.

Objective
• Best-effort (BE): how long did it take me to finish? -> response time
• Real-time (RT): how many deadlines have been missed? -> miss rate
• Fairness ratio (x:y): RT gets x/(x+y) of the cluster resources and BE gets y/(x+y).
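Read as a schedule, the x:y split can be realized by weighted round-robin over fixed quanta. A minimal sketch (the function name and slot encoding are invented for illustration):

```python
def tdm_slots(x, y, n_slots):
    """Assign each scheduling quantum to RT or BE so that, over any window of
    x+y slots, RT receives x of them and BE receives y. Integer arithmetic
    avoids floating-point drift in the credit accounting."""
    total = x + y
    return ["RT" if ((i + 1) * x) // total > (i * x) // total else "BE"
            for i in range(n_slots)]
```

For example, tdm_slots(2, 1, 6) interleaves RT and BE quanta in a 2:1 pattern, matching the x:y = 2:1 case discussed next.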
How to Adhere to the Fairness Ratio? (x:y = 2:1)
[Figure: three time-space schedules of RT jobs RT1, RT2 and a BE job on P0/P1, comparing GS, DCS-TDM, and DCS-PS]

BE Response Time
[Figure: BE response time for RT:BE = 2:1, 1:9, and 9:1]

RT Deadline Miss Rate
[Figure: RT deadline miss rate for RT:BE = 2:1, 1:9, and 9:1]
Outline
• From the OS's perspective
  – Contribution 1: boosting CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications
• From the application's perspective
  – Characterizing decision-support workloads on the clustered database server
  – Resource management for transaction-processing workloads on the clustered database server
Experiment Setup
• IBM DB2 Universal Database for Linux, EEE, Version 7.2
• An 8-node cluster of dual-processor Linux/Pentium machines, each with 256 MB RAM and an 18 GB disk
• TPC-H workload; queries are run sequentially (Q1 to Q20), and the completion time of each query is measured
[Figure: the server platform is a Myrinet-connected cluster serving a client. A table T is partitioned across the nodes (rows 001A, 002B, 003C, 004D). A "select * from T" query arrives at a coordinator node, which fans it out to the partitions and merges the partial results.]
Methodology
• Identify the components with high system overhead.
• For each such component, characterize the request distribution.
• Come up with ways of optimization.
• Quantify potential benefits from the optimization.
Sampling OS Statistics
• Sample the statistics provided by /proc (stat, net/dev, and the per-process stat files):
  – user/system CPU %
  – # of page faults
  – # of blocks read/written
  – # of reads/writes
  – # of packets sent/received
  – CPU utilization during I/O
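The sampling amounts to parsing cumulative counters and differencing two snapshots. A sketch of the CPU part (the helper names are invented; the field layout follows the documented /proc/stat format):

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (user, system, idle)
    jiffies. Layout: 'cpu user nice system idle iowait irq softirq ...'."""
    parts = line.split()
    user, nice, system, idle = map(int, parts[1:5])
    return user + nice, system, idle

def cpu_percent(prev, curr):
    """Percent of non-idle CPU time between two successive samples."""
    busy = (curr[0] - prev[0]) + (curr[1] - prev[1])
    total = busy + (curr[2] - prev[2])
    return 100.0 * busy / total if total else 0.0
```

On a live system the two samples would come from reading /proc/stat a sampling interval apart.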
Kernel Instrumentation
• Instrument each system call in the kernel, recording when the call is entered, when it blocks, unblocks, resumes execution, and exits the system call.
Operating System Profile
• A considerable part of the execution time is taken by the pread system call.
• There is good overlap of computation with I/O for some queries.
• There are more reads than writes.
TPC-H pread Overhead

Query   % of exec time      Query   % of exec time
Q6      20.0                Q13     10.0
Q14     19.0                Q3       9.6
Q19     16.9                Q4       9.1
Q12     15.4                Q18      9.0
Q15     13.4                Q20      7.9
Q7      12.1                Q2       5.2
Q17     10.8                Q9       5.2
Q8      10.5                Q5       4.6
Q10     10.3                Q16      4.1
Q1      10.0                Q11      3.5

pread overhead = # of preads x overhead per pread
pread Optimization

The existing path copies every page from the page cache into the user-space buffer:

pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in the cache
            bring it in from disk
        copy the page into dest
    }
}

Optimization:
• Re-map the page-cache pages into the buffer instead of copying
• Mark them copy-on-write

Copy-on-Write
[Figure: the user-space buffer maps the page-cache pages read-only; a private copy is made only if the buffer is written.]
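The "map, don't copy" idea can be contrasted with the copying path from user space using mmap (illustrative only; the thesis proposes a kernel-level remapping with copy-on-write, not this user-level approximation):

```python
import mmap
import os
import tempfile

# Create a one-page file to read (a stand-in for a DB2 table chunk).
path = os.path.join(tempfile.mkdtemp(), "table.dat")
with open(path, "wb") as f:
    f.write(b"x" * 4096)

# Copying path: os.pread copies page-cache contents into a fresh buffer.
fd = os.open(path, os.O_RDONLY)
copied = os.pread(fd, 4096, 0)
os.close(fd)

# Mapping path: the same bytes are viewed in place, with no per-page copy.
with open(path, "rb") as f:
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    assert bytes(view[:4096]) == copied
    view.close()
```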
Query   % reduction     Query   % reduction
Q1      98.9            Q11     96.1
Q2      85.7            Q12     87.1
Q3      96.0            Q13    100.0
Q4      80.9            Q14     96.1
Q5     100.0            Q15     96.8
Q6     100.0            Q16     70.7
Q7      79.7            Q17     94.5
Q8      79.3            Q18    100.0
Q9      88.7            Q19     95.7
Q10     77.8            Q20     94.4

% reduction = 1 - (# of copy-on-writes / # of preads)
Operating System Profile
• Socket calls are the next dominant system calls.
Message Characteristics
[Figure: for Q11 and Q16, distributions of message size (bytes), message inter-injection time (milliseconds), and message destination]
Observations on Messages
• Only a small set of message sizes is used.
• Many messages are sent in a short period.
• Message destination distribution is uniform.
• Many messages are point-to-point implementations of multicast/broadcast messages.
• Multicast can reduce # of messages.
Potential % Reduction in Messages

Query   Total   Small   Large       Query   Total   Small   Large
Q1      44.7    71.4    38.7        Q11      9.6    28.6     0.1
Q2      20.4    58.7     0.2        Q12      8.3     7.8     2.9
Q3      48.2    64.3    38.0        Q13     24.5    75.2     0.1
Q4      22.6    58.6     0.1        Q14     27.9    80.4     0.7
Q5       8.0     7.1     8.4        Q15     46.6    56.5     0.7
Q6      76.4    78.6    45.5        Q16     59.1    63.0    56.9
Q7      57.5    71.4    56.2        Q17     41.5    66.7    27.3
Q8      29.1    75.5     4.8        Q18     11.4    32.3     0.0
Q9      66.8    78.5    61.1        Q19     26.7    79.4     0.2
Q10     25.0    73.6     0.1        Q20     21.1    62.8     0.1
Online Algorithm

Baseline send:

Send ( msg, dest ) {
    send msg to node dest;
}

Batching send:

Send ( msg, dest ) {
    if ( msg == buffered_msg && dest ∉ dest_set )
        dest_set = dest_set ∪ { dest };
    else
        buffer the msg;
}

Send_bg () {
    foreach buffered_msg
        if ( it has been buffered longer than threshold )
            send multicast msg to nodes in dest_set;
}
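The batching pseudo-code can be fleshed out as follows; the transport interface and the threshold default are invented for the sketch:

```python
import time

class BatchingSender:
    """Sketch of the online multicast-batching algorithm described above:
    identical point-to-point sends issued within a threshold window are
    coalesced into one multicast."""

    def __init__(self, transport, threshold_s=0.005):
        self.transport = transport      # must provide multicast(msg, dests)
        self.threshold_s = threshold_s
        self.pending = {}               # msg -> (set of dests, time first buffered)

    def send(self, msg, dest):
        if msg in self.pending:
            self.pending[msg][0].add(dest)   # same payload: widen the destination set
        else:
            self.pending[msg] = ({dest}, time.monotonic())

    def flush(self):
        """Background step (Send_bg): multicast messages buffered past the threshold."""
        now = time.monotonic()
        for msg, (dests, t0) in list(self.pending.items()):
            if now - t0 >= self.threshold_s:
                self.transport.multicast(msg, frozenset(dests))
                del self.pending[msg]
```

The threshold is the knob studied next: too small and nothing is coalesced, too large and messages sit in the buffer, adding latency.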
Impact of Threshold
[Figure: message reduction vs. threshold (milliseconds) for Q7 and Q16]
Outline
• From the OS's perspective
  – Contribution 1: boosting CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications
• From the application's perspective
  – Characterizing decision-support workloads on the clustered database server
  – Resource management for clustered database applications
Ongoing/Near-term Work
• What is the optimal number of jobs which should be admitted?
• Can we dynamically pause some processes based on resource requirement and resource availability?
• Which dynamic co-scheduling scheme works best here?
• How do we exploit application level information in scheduling?
• Some next-generation applications
  – Real-time medical imaging and collaborative surgery

Future Work
Application requirements:
• vast processing power, disk capacity, and network bandwidth
• absolute availability
• deterministic performance

Future Work: E-business on Demand
Requirements:
• performance: more users, responsiveness, quality-of-service
• availability
• security
• power consumption
• pricing model
Future Work
• What does it take to get there?
  – Hardware innovations
  – Resource management and isolation
  – Good scalability
  – High availability
  – Deterministic performance
• Not only high performance:
  – Energy consumption
  – Security
  – Pricing for service
  – User satisfaction
  – System management
  – Ease of use
Related Work
• Parallel job scheduling:
  – Gang Scheduling [Ousterhout82]
  – Backfilling ([Lifka95], [Feitelson98])
  – Migration ([Epima96])
• Dynamic co-scheduling:
  – Spin Block ([Arpaci-Dusseau98], [Anglano00])
  – Periodic Boost ([Nagar99])
  – Demand-based Coscheduling ([Sobalvarro97])
Related Work (Cont’d)
• Real-time scheduling:
  – Earliest Deadline First
  – Rate Monotonic
  – Least Laxity First
• Single-node multi-class scheduling:
  – Hierarchical scheduling ([Goyal96])
  – Proportional share ([Waldspurger95])
• Commercial clustered servers ([Pai98], reserve)
Related Work (Cont’d)
• Commercial Workloads (CAECW, [Barford99], Kant[99])
• Database Characterizing ([Keeton99], [Ailamaki99], [Rosenblum97])
• OS support for database ([Stonebraker81], [Gray78], [Christmann87])
• Reducing copies in IO ([Pai00], [Druschel93], [Thadani95])
Publications
• IEEE Transactions on Parallel and Distributed Systems
• International Parallel and Distributed Processing Symposium (IPDPS 2000)
• ACM International Conference on Supercomputing (ICS 2000)
• International Euro-Par Conference (Euro-Par 2000)
• ACM Symposium on Parallel Algorithms and Architectures (SPAA 2001)
• Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2001)
• Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 2002)
Publications I: Batch Applications
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. 7th Workshop on Job Scheduling Strategies for Parallel Processing.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. Proceedings of the 6th International Euro-Par Conference, Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS'2000), pages 133-142, May 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.
Publications II: Interactive Applications
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE tech report CSE-01-004.
• Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.
• Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS'2000), pages 100-109, May 2000.
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).
Publications III: Multi-class Applications
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. 13th Annual ACM Symposium on Parallel Algorithms and Architectures.
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. Submitted to IEEE Transactions on Parallel and Distributed Systems.
Publications IV: Database
• Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Technical Report CSE-01-003.
Thank You !
I/O Characteristics (Q6)
[Figure]