Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling
M. Islam, P. Balaji, G. Sabin and P. Sadayappan
Computer Science and Engineering, Ohio State University
Mathematics and Computer Science, Argonne National Laboratory
RNet Technologies
• Publicly Usable Supercomputer Centers
  – Becoming increasingly common (OSC, SDSC, etc.)
  – Jobs submitted with resource requirements
    • CPUs, Memory, Estimated Runtime
• Scheduler maps the requirements of the jobs to available resources
  – If resources are available, job is scheduled immediately
  – Else, queued and scheduled to execute at a later time
  – Several job schedulers exist today: PBS, Maui, Silver
• Independent Parallel Job Scheduling Model
  – Dynamically arriving Independent Parallel Jobs
  – Popular model in most supercomputers
Previous Research in Job Scheduling
• Significant prior research on best-effort scheduling
• Optimizations proposed for different metrics
  – Utilization (U): What fraction of the resources is actually utilized
    • U = Resource Used / Resource Provided
  – Response Time (RT): Time from submission to completion
    • RT = Job’s completion time - Job’s arrival time
  – Slowdown (SD): How much slower the system is compared to a dedicated system
    • SD = Job’s Response Time / Job’s Runtime
  – Prioritization: Static (user or group based) and Dynamic (how long the job was in the queue)
    • NERSC cluster provides static prioritization based on job cost
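The three metric definitions above can be written out directly. The following is a minimal sketch, not part of the original slides: the `FinishedJob` record, its field names, and the `makespan` argument are hypothetical, and "Resource Provided" is assumed to mean total processors multiplied by the length of the observed interval.

```python
from dataclasses import dataclass


@dataclass
class FinishedJob:
    # Hypothetical record of one completed job (all times in the same unit).
    arrival: float      # submission time
    start: float        # time the job began running
    completion: float   # time the job finished
    procs: int          # processors used


def runtime(job: FinishedJob) -> float:
    return job.completion - job.start


def response_time(job: FinishedJob) -> float:
    # RT = completion time - arrival time
    return job.completion - job.arrival


def slowdown(job: FinishedJob) -> float:
    # SD = response time / runtime
    return response_time(job) / runtime(job)


def utilization(jobs, total_procs: int, makespan: float) -> float:
    # U = resource used / resource provided, both in processor-time units.
    used = sum(j.procs * runtime(j) for j in jobs)
    return used / (total_procs * makespan)
```

For a job that arrives at time 0, starts at 2 and completes at 6, the response time is 6, the runtime is 4, and the slowdown is 1.5.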
QoS in Job Scheduling
• Users can request guarantees in turnaround time
  – E.g., submit a job before leaving work at 5pm and request a deadline of 8am the next morning
• Two Components for QoS in Job Scheduling
  – Job Scheduling Component [islam03:qops]
    • Admission Control: Can we meet the specified deadline?
    • Once admitted, cannot miss the specified deadline
  – Revenue Management
    • Appropriate charging model
    • Urgent jobs cost more than non-urgent jobs
    • Need to prioritize jobs such that the incoming revenue is maximized
[islam03:qops] “QoPS: A QoS based scheme for Parallel Job Scheduling”, M. Islam, P. Balaji, P. Sadayappan and D. K. Panda. Published in JSSPP ’03 and LNCS ’04.
Opportunity Cost in Job Scheduling
[Figure: schedule plotted as Processors versus Time, showing running jobs J1, J2 and J3 at the current time, job J4 (10$) with marker D4, and job J5 (500$) with marker D5]
By scheduling J4, we lost the future opportunity to schedule the more expensive job J5
J4 has an opportunity cost of at least 500$
Problem Statement
• When the user submits a job, she pays an explicit cost
• However, the system also pays an implicit opportunity cost
• Accepting a job is beneficial if its explicit cost is greater than its opportunity cost
• How do we determine the opportunity cost?
  – It depends on future jobs, so there is no way to know it in advance
• How do we design a predictive algorithm to estimate the opportunity cost of a job?
Presentation Layout
• Introduction and Motivation
• Background on QoPS and QoS Cost Models
• Minimizing Opportunity Cost with Value-aware QoPS
• Dynamic “Self-learning” Value-aware QoPS
• Performance Results
• Conclusions
QoPS: QoS for Parallel Job Scheduling
• Advance Reservation (before QoPS)
  – Before QoPS, the only way to guarantee a turnaround time
    • Execution time window statically decided upfront
  – Resources underutilized due to fragmentation
  – If resources are available early, the job can’t be rescheduled
• Primary Goals of QoPS:
  – Provide admission control
    • When a new job arrives:
      – Reorder existing jobs to find feasible schedules
      – Select the best feasible schedule
  – Ensure deadline guarantees for the accepted jobs
    • A later arriving job cannot force an existing job to miss its deadline!
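The admission-control idea (admit a new job only if some reordering keeps every accepted job within its deadline) can be illustrated with a toy check. This is a sketch under strong assumptions and is not the QoPS algorithm itself: it tries only one ordering (earliest deadline first) with simple list scheduling and no backfilling, and the `Job` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description; times share one unit.
    name: str
    procs: int
    runtime: float
    deadline: float


def feasible_schedule(jobs, total_procs, now=0.0):
    """Return {name: (start, finish)} if every job meets its deadline
    under an earliest-deadline-first ordering, otherwise None."""
    free_at = [now] * total_procs  # time at which each processor becomes free
    schedule = {}
    for job in sorted(jobs, key=lambda j: j.deadline):
        if job.procs > total_procs:
            return None
        free_at.sort()
        # The job needs job.procs processors; using the earliest-free ones,
        # it can start once the last of those is free.
        start = free_at[job.procs - 1]
        finish = start + job.runtime
        if finish > job.deadline:
            return None
        for i in range(job.procs):
            free_at[i] = finish
        schedule[job.name] = (start, finish)
    return schedule


def admit(new_job, accepted, total_procs, now=0.0):
    # Admit only if the new job and all already accepted jobs stay feasible,
    # so a later arrival can never push an accepted job past its deadline.
    return feasible_schedule(accepted + [new_job], total_procs, now) is not None
```

A fuller implementation would search over several orderings and pick the best feasible schedule, as the slide describes; the single ordering here only keeps the example short.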
Supercomputer Cost Model
• Most supercomputer centers today do not provide QoS
  – Jobs are scheduled in a best-effort manner
  – Thus, no special cost models for QoS either
• Some supercomputers provide prioritization (e.g., NERSC)
  – Different queues of jobs exist
  – More expensive queues get higher priority
• For QoS-driven supercomputers, a new model is required
  – Provider-centric: Supercomputer center determines the charge
  – User-centric: User offers the price / bid
Market-based User-centric Cost Model
• User offers a price to the system
  – Market-based bidding system
  – Proposed by Culler and Chase
• Price offered reduces with time (decay factor)
• Offered price reaches zero at the job deadline time
[Figure: Revenue versus Time, with Maximum Revenue and Deadline marked]
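One way to picture a price that decays over time and reaches zero at the deadline is a straight line from the maximum revenue down to zero. The linear shape, the function name, and the choice of the job's arrival as the point where decay begins are all assumptions made for illustration; the slide states only that the price decays and is zero at the deadline.

```python
def offered_price(max_price: float, arrival: float, deadline: float, t: float) -> float:
    """Price offered at time t, assuming linear decay (an assumption)
    from max_price at arrival down to zero at the deadline."""
    if deadline <= arrival:
        raise ValueError("deadline must be later than arrival")
    if t <= arrival:
        return max_price
    if t >= deadline:
        return 0.0
    remaining_fraction = (deadline - t) / (deadline - arrival)
    return max_price * remaining_fraction
```

Under this assumption, a job offering 500$ with a window from time 0 to time 10 would be worth 250$ at time 5.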
Value-aware QoPS (VQoPS)
• Job acceptance based on two criteria:
  – The deadline should be achievable (evaluated using QoPS)
  – The job should provide enough revenue so as to offset a statically assumed opportunity cost
    • Product of a fixed opportunity cost factor (OC-Factor) and the size of the job (i.e., number of processor-hours requested)
    • Large jobs (more nodes or long running) have a higher opportunity cost since they can potentially impact more later arriving jobs
• The OC-Factor has to be tuned by the system administrator based on the expected workload!
  – Complicated to evaluate
  – Difficult to adapt if workload changes
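The two acceptance criteria can be sketched as a single predicate. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the deadline check is passed in as a boolean that a QoPS-style feasibility test would supply, and treating revenue equal to the opportunity cost as acceptable is an assumption.

```python
def static_opportunity_cost(oc_factor: float, procs: int, hours: float) -> float:
    # Opportunity cost = OC-Factor x job size in processor-hours.
    return oc_factor * procs * hours


def vqops_accept(offered_revenue: float, procs: int, hours: float,
                 oc_factor: float, deadline_feasible: bool) -> bool:
    # Criterion 1: the deadline must be achievable.
    if not deadline_feasible:
        return False
    # Criterion 2: the revenue must offset the static opportunity cost
    # (equality counts as acceptable here, which is an assumption).
    return offered_revenue >= static_opportunity_cost(oc_factor, procs, hours)
```

With an OC-Factor of 0.4, a 10$ job requesting 16 processors for 2 hours would be rejected, while a 500$ job of the same size would be accepted.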
VQoPS: An Example Scenario
[Figure: schedule plotted as Processors versus Time, showing running jobs J1, J2 and J3 at the current time, job J4 (10$) with marker D4, and job J5 (500$) with marker D5; annotation: "Less than static opportunity cost (C)"]
By not scheduling J4, we retained the future opportunity to schedule the more expensive job J5
Choosing the right OC-Factor is important for the scheme to be effective
VQoPS performance for different traces

Relative       Urgent     Offered    OC-Factors
Urgency Cost   Jobs (%)   Load       0.00    0.05    0.1     0.2     0.4
10X            80%        Original   21%     26%     37%     37%     39%
5X             80%        Original   20%     25%     34%     35%     30%
2X             80%        Original   19%     26%     27%     -47%    -100%

10X            80%        Original   21%     26%     37%     37%     39%
10X            50%        Original   23%     34%     46%     45%     45%
10X            20%        Original   26%     38%     22%     22%     22%

10X            80%        Original   21%     26%     37%     37%     39%
10X            80%        High       63%     90%     135%    144%    160%

• No single static OC-Factor is best for all cases.
• Best OC-Factor is dependent on trace characteristics.
Dynamic “Self-learning” Value-aware QoPS
• Estimate OC-Factor dynamically for best revenue gain
• OC-Factor depends on
  – System Load
  – Relative frequency of urgent jobs
  – Relative price of urgent jobs
• DVQoPS uses a history-based adaptive technique to account for all of these factors
  – Perform a what-if simulation by rolling back and find the best OC-Factor
What-if Simulations in DVQoPS
[Figure: timeline of successive rollback intervals; in each interval the candidate OC-Factors (O1, O2, O3, ON) are evaluated. Starting from OC Factor = O: "O3 gave us the best revenue, pick O3" (OC Factor = O3), then "O2 gave us the best revenue, pick O2" (OC Factor = O2)]
We dynamically pick the OC-Factor that gave the best revenue in the previous roll-back interval
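The selection step can be illustrated by replaying the jobs seen in the last rollback interval once per candidate OC-Factor and keeping the candidate with the highest revenue. The replay model below is a deliberately crude stand-in for a real scheduler simulation, and everything in it is an assumption: jobs are (price, processor-hours) pairs, the system is reduced to a single processor-hour budget `capacity`, and the function names are hypothetical.

```python
def simulate_revenue(history, oc_factor, capacity):
    """Replay jobs in arrival order under one OC-Factor.
    history: list of (offered_price, proc_hours) tuples (hypothetical format).
    capacity: processor-hours available in the interval (toy resource model)."""
    remaining = capacity
    revenue = 0.0
    for price, size in history:
        # Accept only if the job fits and its price offsets OC-Factor x size.
        if size <= remaining and price >= oc_factor * size:
            remaining -= size
            revenue += price
    return revenue


def pick_oc_factor(history, candidates, capacity):
    # Choose the candidate that would have earned the most revenue over the
    # rollback interval; ties go to the earliest candidate in the list.
    return max(candidates, key=lambda oc: simulate_revenue(history, oc, capacity))
```

For a history of a 10$ job followed by a 500$ job, each of 100 processor-hours, and a capacity of 100, the low OC-Factors admit the cheap job and earn 10$, while 0.2 and above hold out for the 500$ job, so `pick_oc_factor` over the candidates 0.0, 0.05, 0.1, 0.2, 0.4 selects 0.2.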
Impact of Rollback Window Size
• Balancing Sensitivity and Stability
  – Sensitivity: Too long a rollback window loses sensitivity to small changes in the workload
  – Stability: Too short a rollback window loses stability and causes the results to be noisy
• Need to calculate rollback window dynamically

Rollback       Average Instability   Load Variance   Revenue
Window Size    in OC-Factor          Sensitivity
4              6.18                  2.89            508341077
32             2.99                  0.34            692266945
48             1.36                  0.24            715606095
128            1.13                  0.04            701476009
• Two categories of jobs
  – Urgent Jobs
  – Normal Jobs