Symbiotic Job Scheduling on the IBM POWER8

J. Feliu 1, S. Eyerman 2, J. Sahuquillo 1, S. Petit 1

1 Department of Computing Engineering (DISCA), Universitat Politècnica de València
[email protected], {jsahuqui,spetit}@disca.upv.es
2 Intel Belgium
[email protected]

March 16th, 2016

2 This work was done while Stijn Eyerman was at Ghent University

Outline: Introduction, Predicting Job Symbiosis, SMT Interference-Aware Scheduler, Experimental Evaluation, Conclusions

J. Feliu, S. Eyerman, J. Sahuquillo, S. Petit HPCA’16 @ Barcelona, Spain March 16th, 2016 1 / 24
Previous work on symbiotic scheduling
- Uses sampling to explore the space of possible schedules (Snavely et al., ASPLOS’00)
- Relies on novel hardware (Eyerman et al., ASPLOS’10)
- Performs an offline analysis with µbenchmarks to predict the interference between applications (Zhang et al., MICRO’14)
Components
- C_j represents thread j's own component in ST mode
- C_k represents the ST component of the other threads in the schedule
- C_j' identifies the SMT component of thread j
Parameters
- α_C reflects a constant increase in SMT over ST
- β_C reflects the fraction (relative increase) of the original ST component that appears in SMT execution
- γ_C models the impact of the sum of the ST components of the other co-scheduled threads
- δ_C models extra interactions that may occur between threads

The meaningful parameters are determined using regression
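The component and parameter descriptions above suggest a per-component regression model. The following is a minimal sketch, assuming the model has the form C_j' = α_C + β_C·C_j + γ_C·ΣC_k + δ_C·C_j·ΣC_k, where the sum runs over the other co-scheduled threads. That form is inferred from the four parameter descriptions and is not stated on the slide. The function name and all numeric values are hypothetical.

```python
from typing import Sequence


def predict_smt_component(
    j: int,
    st_components: Sequence[float],
    alpha: float,
    beta: float,
    gamma: float,
    delta: float,
) -> float:
    """Predict the SMT value C_j' of one CPI component for thread j.

    st_components holds the ST value of the same component for every
    thread in the schedule (index j is the thread of interest).
    The model form is an assumption, see the lead-in text.
    """
    own = st_components[j]
    # Sum of the ST components of the other co-scheduled threads.
    others = sum(c for k, c in enumerate(st_components) if k != j)
    return alpha + beta * own + gamma * others + delta * own * others


# Hypothetical example: two threads sharing a core, one component.
smt_value = predict_smt_component(0, [0.4, 0.3], alpha=0.1, beta=1.2, gamma=0.5, delta=0.0)
```

In this sketch a parameter that regression finds not to be meaningful would simply be set to zero, which removes its term from the prediction.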
Obtaining the ST CPI stacks is not a trivial issue
- Offline profiling of CPI stacks (impractical)
- Sampling CPI stacks at runtime (overhead)
- Specific hardware to collect ST CPI stacks online (unavailable)

Measure the SMT CPI stacks and invert the model to obtain ST CPI stacks
- Not trivial: ST CPI not available in SMT execution
- Solved with an approximate approach
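[Figure: two bar charts for App1 and App2, y-axis normalized CPI (0 to 1.4). One shows the ST CPI stacks normalized to ST CPI, the other the SMT CPI stacks normalized to ST CPI, annotated with the slowdown. The model maps the ST stacks to the SMT stacks; the inverted model maps the SMT stacks back to the ST stacks.]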
45 events form the full CPI stack of the IBM POWER8
- 6 thread-level counters are implemented (4 programmable)
- Structural conflicts prevent some events from being measured together
- 19 time slices are required to build the full CPI stack
  - Unacceptable for scheduling
  - Obtaining an up-to-date CPI stack is not possible

Fortunately, the CPI stack model is built hierarchically
- Top level with 5 components
- The model accuracy is reduced, but it has lower complexity and uses up-to-date CPI stacks
- 10-core IBM POWER8
- SPEC CPU2006 benchmarks (reference input set)
- 105 multiprogram workloads
  - From 8-application combinations on 4 cores to 20-application combinations on 10 cores

Metrics:
- System throughput (STP), by means of the weighted speedup metric
- System fairness: $\text{Unfairness} = \dfrac{\text{Max Slowdown}_i}{\text{Min Slowdown}_j}, \quad \forall\, i, j \in \{1, \dots, N\}$

Four schedulers are compared:
- Random
- Linux, default Completely Fair Scheduler (CFS)
- L1-bandwidth aware scheduler, which balances the L1 bandwidth utilization among cores (Feliu et al., PACT’13)
- Symbiotic scheduler
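The two metrics above can be computed from per-application CPI values. The following is a minimal sketch, assuming that slowdown is the multiprogram CPI divided by the ST (isolated) CPI and that weighted speedup is the sum of the inverse slowdowns. These are the usual definitions, but the slide does not spell them out. Function names and input values are hypothetical.

```python
from typing import List, Sequence


def slowdowns(cpi_st: Sequence[float], cpi_mp: Sequence[float]) -> List[float]:
    """Per-application slowdown: multiprogram CPI over isolated (ST) CPI."""
    return [mp / st for st, mp in zip(cpi_st, cpi_mp)]


def system_throughput(cpi_st: Sequence[float], cpi_mp: Sequence[float]) -> float:
    """Weighted speedup: sum over applications of CPI_ST / CPI_multiprogram."""
    return sum(st / mp for st, mp in zip(cpi_st, cpi_mp))


def unfairness(cpi_st: Sequence[float], cpi_mp: Sequence[float]) -> float:
    """Maximum slowdown divided by minimum slowdown across the workload."""
    s = slowdowns(cpi_st, cpi_mp)
    return max(s) / min(s)


# Hypothetical two-application workload.
cpi_isolated = [1.0, 2.0]
cpi_shared = [2.0, 3.0]
stp = system_throughput(cpi_isolated, cpi_shared)
unf = unfairness(cpi_isolated, cpi_shared)
```

With these definitions, a higher STP is better, and an unfairness of 1 means every application is slowed down by the same factor.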
Frequency matrices of the job co-schedules
- The darker the color, the more frequently the couple is scheduled together on an SMT core
- In workload 5 4, two couples are scheduled very frequently (> 65%) ⇒ High symbiosis
- In workload 5 3, there is no such predominant couple ⇒ High phase behavior
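A frequency matrix of this kind can be built by counting how often each pair of jobs shares a core across scheduling quanta. The following is a minimal sketch, assuming the schedule history is available as one list of cores per quantum, each core being a tuple of job names. The data layout, the function name and the job names are hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Schedule = Sequence[Tuple[str, ...]]  # one tuple of job names per SMT core


def coschedule_frequency(history: List[Schedule]) -> Dict[Tuple[str, str], float]:
    """Fraction of quanta in which each pair of jobs ran on the same core."""
    if not history:
        return {}
    counts: Counter = Counter()
    for schedule in history:
        for core in schedule:
            # Sort so that (A, B) and (B, A) count as the same couple.
            for pair in combinations(sorted(core), 2):
                counts[pair] += 1
    return {pair: n / len(history) for pair, n in counts.items()}


# Hypothetical history of four quanta on two 2-way SMT cores.
history = [
    [("A", "B"), ("C", "D")],
    [("A", "B"), ("C", "D")],
    [("A", "C"), ("B", "D")],
    [("B", "A"), ("D", "C")],
]
freq = coschedule_frequency(history)
```

In this example the couple (A, B) has a frequency of 0.75, which would correspond to a dark cell in the matrix.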
Scheduling has a considerable impact on the performance of SMT multicores

Novel symbiotic job scheduler for SMT multicores
- Quick estimation of the performance of schedules to select the optimal one
- Using CPI stacks, it can quickly adapt to phase behavior
- No need for additional hardware or for sampling schedules

Improves system throughput over the random and Linux schedulers by 10.3% and 4.7% on average, respectively
Approximate approach
- SMT components normalized to SMT CPI ≈ ST components normalized to ST CPI
  - Both add to one
  - If the relative increase of the components is the same, then both stacks are equal
  - However, it is not accurate enough
- Estimate the slowdown by applying the model to the estimated normalized ST CPI
- Renormalize the measured SMT CPI stacks using the estimated slowdown
- Apply the inverse model to obtain new estimates for the ST CPI stacks
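[Figure: CPI stacks of App 1 and App 2 at each step of the approximate approach. (a) Measured SMT CPI stacks (CPI axis, 0 to 3). (b) Normalized SMT CPI stacks (0 to 1). (c) Predicted normalized SMT CPI stacks (0 to 1.4), obtained with the forward model and annotated with the estimated slowdown. (d) Adjusted normalized SMT CPI stacks (0 to 1.4). (e) Predicted normalized ST CPI stacks (0 to 1), obtained with the inverse model.]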
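The steps of the approximate approach can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it handles two threads per core, uses a simplified linear model with the δ term dropped so that the inverse has a closed form, and relies on made-up parameter values in `PARAMS`. All function names are hypothetical.

```python
from typing import List, Tuple

Stack = List[float]  # one value per CPI component

# Hypothetical (alpha, beta, gamma) per component; delta is assumed to be 0.
PARAMS = [(0.02, 1.10, 0.05), (0.01, 1.30, 0.20), (0.00, 1.05, 0.10)]


def forward(st_a: Stack, st_b: Stack) -> Tuple[Stack, Stack]:
    """Normalized ST stacks -> SMT stacks normalized to ST CPI."""
    smt_a, smt_b = [], []
    for (alpha, beta, gamma), ca, cb in zip(PARAMS, st_a, st_b):
        smt_a.append(alpha + beta * ca + gamma * cb)
        smt_b.append(alpha + beta * cb + gamma * ca)
    return smt_a, smt_b


def inverse(smt_a: Stack, smt_b: Stack) -> Tuple[Stack, Stack]:
    """Solve the 2x2 linear system per component to recover the ST stacks."""
    st_a, st_b = [], []
    for (alpha, beta, gamma), xa, xb in zip(PARAMS, smt_a, smt_b):
        total = (xa + xb - 2 * alpha) / (beta + gamma)  # ca + cb
        diff = (xa - xb) / (beta - gamma)               # ca - cb
        st_a.append((total + diff) / 2)
        st_b.append((total - diff) / 2)
    return st_a, st_b


def normalize(stack: Stack) -> Stack:
    total = sum(stack)
    return [c / total for c in stack]


def estimate_st_stacks(measured_a: Stack, measured_b: Stack) -> Tuple[Stack, Stack]:
    # (b) Initial guess: SMT stack normalized to SMT CPI ~ normalized ST stack.
    guess_a, guess_b = normalize(measured_a), normalize(measured_b)
    # (c) Forward model on the guess; the stack total is the estimated slowdown.
    pred_a, pred_b = forward(guess_a, guess_b)
    slow_a, slow_b = sum(pred_a), sum(pred_b)
    # (d) Renormalize the measured SMT stacks so they sum to the slowdown.
    adj_a = [c * slow_a for c in guess_a]
    adj_b = [c * slow_b for c in guess_b]
    # (e) Inverse model yields new estimates of the normalized ST stacks.
    return inverse(adj_a, adj_b)


# Hypothetical ground truth, used only to generate "measured" SMT stacks.
true_a, true_b = [0.5, 0.3, 0.2], [0.2, 0.6, 0.2]
meas_a, meas_b = forward(true_a, true_b)
est_a, est_b = estimate_st_stacks(meas_a, meas_b)
```

In this toy setup the estimate for the first component of App 1 lands much closer to the true value than the initial guess from step (b), which is the effect the renormalization step is meant to achieve.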