The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems Alexandru Iosup , Ozan Sonmez, Shanny Anoep, and Dick Epema ACM/IEEE Int’l. Symposium on High Performance Distributed Computing Parallel and Distributed Systems Group, TU Delft
The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems
Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema
ACM/IEEE Int’l. Symposium on High Performance Distributed Computing
Parallel and Distributed Systems Group, TU Delft
The VL-e project
• A grid project in the Netherlands (2004-)
• Natural gas money: VL-e 45 MEuro / 800 MEuro total research package
• Overall aim: … to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments).
• Goals:
  1. create prototypes of application-specific e-science environments
  2. design and develop re-usable ICT/grid components
  3. validate with real-life applications in testbeds
• Complete scientific work better, …
  • User-oriented performance metrics (time a critical performance component)
  • Bags-of-tasks for ease-of-use
• … in real systems
  • Workloads (now that real traces are available)
  • Information unavailability
• What to do?
  • Hint: the next 10% improvement won’t cut it!
The Challenge (cont’d.)
• System model
  What is a good model for the study of large-scale distributed computing systems that run bags-of-tasks?
• Input model
  What is a good model for bag-of-tasks workloads in large-scale distributed computing systems?
• What is the best setup for such a system/input?
  • How to find the best?
  • If a best is found, can there be another?
The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
1. Introduction and Motivation
2. Context: System Model
3. Workload Model
4. Design Space Exploration
5. Conclusion
Context: System Model [1/4]
Overview
• System Model
  1. Clusters execute jobs
  2. Resource managers coordinate job execution
  3. Resource management architectures route jobs among resource managers
  4. Task selection policies create the eligible set
  5. Task scheduling policies schedule the eligible set
Context: System Model [2/4]
Resource Management Architectures: route jobs among resource managers
• Separated Clusters (sep-c)
• Centralized (csp)
• Decentralized (fcondor)
Context: System Model [3/4]
Task Selection Policies: create the eligible set
• Age-based:
  1. S-T: Select Tasks in the order of their arrival.
  2. S-BoT: Select BoTs in the order of their arrival.
• User priority based:
  3. S-U-Prio: Select the tasks of the User with the highest Priority.
• Based on fairness in resource consumption:
  4. S-U-T: Select the Tasks of the User with the lowest resource consumption.
  5. S-U-BoT: Select the BoTs of the User with the lowest resource consumption.
  6. S-U-GRR: Select the User Round-Robin/all tasks for this user.
  7. S-U-RR: Select the User Round-Robin/one task for this user.
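A minimal sketch of how a few of these selection policies could be expressed, assuming a flat queue of tasks tagged with their user and BoT. The `Task` fields, the function names, and the `priority` and `consumed` dictionaries are hypothetical illustrations; the slides do not specify data structures or tie-breaking rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    bot_id: str
    user: str
    arrival: float      # arrival time of this task (assumed field)
    bot_arrival: float  # arrival time of the BoT the task belongs to (assumed field)


def select_s_t(queue):
    """S-T: order tasks by their own arrival time."""
    return sorted(queue, key=lambda t: t.arrival)


def select_s_bot(queue):
    """S-BoT: order BoTs by arrival; tasks of one BoT stay together."""
    return sorted(queue, key=lambda t: (t.bot_arrival, t.bot_id, t.arrival))


def _tasks_of(queue, user):
    """All queued tasks of one user, oldest first."""
    return sorted((t for t in queue if t.user == user), key=lambda t: t.arrival)


def select_s_u_prio(queue, priority):
    """S-U-Prio: tasks of the queued user with the highest priority value."""
    if not queue:
        return []
    users = sorted({t.user for t in queue})  # sorted for deterministic ties
    return _tasks_of(queue, max(users, key=lambda u: priority[u]))


def select_s_u_t(queue, consumed):
    """S-U-T: tasks of the queued user with the lowest resource consumption."""
    if not queue:
        return []
    users = sorted({t.user for t in queue})
    return _tasks_of(queue, min(users, key=lambda u: consumed.get(u, 0.0)))
```

Each function returns an ordered eligible set that a task scheduling policy would then place on resources; how the real system bounds the size of that set is not stated in the slides.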
Context: System Model [4/4]
Task Scheduling Policies: schedule the eligible set
• Information availability:
  • Known
  • Unknown
  • Historical records
• Sample policies:
  • Earliest Completion Time (with Prediction of Runtimes) (ECT(-P))
  • Fastest Processor First (FPF)
  • (Dynamic) Fastest Processor Largest Task ((D)FPLT)
  • Shortest Task First w/ Replication (STFR)
  • Work Queue w/ Replication (WQR)
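[Table: task scheduling policies by information availability. Axes: Task Information (K, H, U) × Resource Information (K, H, U), where K = Known, H = Historical records, U = Unknown. Policies shown: ECT, FPLT, FPF, ECT-P, DFPLT, MQD, STFR, RR, WQR.]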
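To make the role of information availability concrete, here is a small sketch of two of the named policies under simplifying assumptions: every task is a single-processor job, a runtime is `work / speed`, and no tasks are already running. The function names and input dictionaries are hypothetical, and the policies as defined in the article may differ in detail.

```python
def schedule_ect(task_work, proc_speed):
    """Earliest Completion Time: needs known task work AND known processor speeds.

    task_work:  {task name: amount of work}
    proc_speed: {processor name: work units per second}
    Returns a list of (task, processor, finish time) in scheduling order.
    """
    free_at = {p: 0.0 for p in proc_speed}  # time at which each processor is idle
    plan = []
    for task, work in task_work.items():
        # Pick the processor on which this task would finish first.
        best = min(sorted(proc_speed),
                   key=lambda p: free_at[p] + work / proc_speed[p])
        finish = free_at[best] + work / proc_speed[best]
        free_at[best] = finish
        plan.append((task, best, finish))
    return plan


def schedule_fpf(tasks, proc_speed, busy=()):
    """Fastest Processor First: needs processor speeds only, no task runtimes.

    Pairs tasks, in eligible-set order, with idle processors from fastest
    to slowest; tasks left without a processor simply wait.
    """
    idle = sorted((p for p in proc_speed if p not in busy),
                  key=lambda p: -proc_speed[p])
    return list(zip(tasks, idle))
```

For example, `schedule_ect({"a": 10.0, "b": 8.0, "c": 10.0}, {"fast": 2.0, "slow": 1.0})` places `a` and `c` on `fast` and `b` on `slow`, whereas `schedule_fpf` would decide without looking at the work amounts at all.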
Workload Modeling 101: What Matters
• Job arrival process & job service time:
  • Self-similarity (burstiness) vs. Poisson [Leland & Ott ToN’94]
• Job grouping: bags-of-tasks dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar’07]
• Job size: almost always 1 CPU [IosupDELW Grid’06]
[Figure: No. Packets/Time Unit vs. Time Units, two panels (Time Unit = 0.01 s and Time Unit = 100 s); annotation: Longer queues]
• Model:
  • Users, Bags-of-Tasks, Tasks
  • Heavy-tailed distributions for inter-arrival time, job service time → can model self-similar workloads
• More details (e.g., parameter values): see article
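As a rough illustration of the Users / Bags-of-Tasks / Tasks structure, the sketch below generates a synthetic workload with heavy-tailed inter-arrival and service times. The choice of a Pareto distribution, the shape parameters, the uniform task count, and all names are assumptions made for this example only; the article gives the actual distributions and parameter values.

```python
import random


def generate_bots(n_bots, users, seed=0,
                  alpha_arrival=1.5, alpha_runtime=1.8, max_tasks=5):
    """Generate n_bots bags-of-tasks with heavy-tailed timing (illustrative only).

    alpha_arrival / alpha_runtime are hypothetical Pareto shape parameters;
    smaller values give heavier tails, i.e. burstier arrivals and longer tasks.
    """
    rng = random.Random(seed)  # seeded for reproducible workloads
    now = 0.0
    bots = []
    for bot_id in range(n_bots):
        # Heavy-tailed gap since the previous BoT arrival (Pareto samples are >= 1).
        now += rng.paretovariate(alpha_arrival)
        n_tasks = rng.randint(1, max_tasks)
        bots.append({
            "bot_id": bot_id,
            "user": rng.choice(users),
            "arrival": now,
            # Heavy-tailed service time for each task in the bag.
            "runtimes": [rng.paretovariate(alpha_runtime) for _ in range(n_tasks)],
        })
    return bots
```

Calling `generate_bots(100, ["u1", "u2", "u3"])` yields a list of BoT records that could be fed to a simulator; fitting the distributions to real traces is outside the scope of this sketch.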
Design Space Exploration [1/5]
Overview
• Design space exploration: time to understand how our solutions fit into the complete system.
• Study the impact of:
  • The Task Scheduling Policy (s policies)
  • The Workload Characteristics (P characteristics)
  • The Dynamic System Information (I levels)
  • The Task Selection Policy (S policies)
  • The Resource Management Architecture (A policies)
s x 7P x I x S x A x (environment) → >2M design points
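The size of the design space follows from taking the Cartesian product of the dimensions. The sketch below enumerates that product for the policy names that appear in these slides; the two workload labels are hypothetical placeholders, and the environment parameters that push the real study beyond 2M points are not modeled here.

```python
from itertools import product

# Values per dimension. Scheduling, information, selection, and architecture
# names come from the slides; the workload entries are hypothetical stand-ins.
dimensions = {
    "scheduling": ["ECT", "ECT-P", "FPF", "FPLT", "DFPLT", "STFR", "WQR"],
    "workload": ["workload-A", "workload-B"],
    "information": ["K", "H", "U"],
    "selection": ["S-T", "S-BoT", "S-U-Prio", "S-U-T",
                  "S-U-BoT", "S-U-GRR", "S-U-RR"],
    "architecture": ["sep-c", "csp", "fcondor"],
}


def design_points(dims):
    """Yield one dict per combination of dimension values."""
    names = list(dims)
    for combo in product(*(dims[name] for name in names)):
        yield dict(zip(names, combo))


def count_design_points(dims):
    """Number of combinations: the product of the dimension sizes."""
    total = 1
    for values in dims.values():
        total *= len(values)
    return total
```

Even this reduced example has several hundred points, which suggests why exhaustive exploration of the full space calls for simulation rather than experiments on a real system.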