Hardware-aware thread scheduling: the case of asymmetric multicore processors
Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder
* [email protected]
http://sosoa.inf.unisi.ch
CONTEXT AND OVERALL IDEA
Introduction
Context
• Modern CPUs increase computational power through additional cores
• HW architectures are becoming increasingly complex
– Shared caches
– Non-Uniform Memory Access (NUMA)
– Single Instruction Multiple Data (SIMD) registers
– Simultaneous MultiThreading (SMT) units
Context
• The Operating System (OS) kernel and scheduler try to automatically optimize application performance according to the available resources
– Based on the underlying HW
– Using a limited set of performance indicators (CPU time, memory usage, etc.)
“Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”
Performance Anxiety: Performance analysis in the new millennium
Joshua Bloch, Google Inc.
Contributions
1) Automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers
2) Hardware-aware optimized scheduler making decisions based on hardware resource usage and on the output of the workload analysis
– to improve processing-unit occupancy on SMT/asymmetric processors
The big picture
[Figure: monitoring daemon observing OS threads and processes, producing an FPU/INT workload characterization]
The big picture
[Figure: the FPU/INT workload characterization feeding the hardware-aware scheduler]
AMD BULLDOZER PROCESSOR
Target architecture
AMD Bulldozer
• AMD Bulldozer architecture
– Each CPU is implemented as a series of modules (a.k.a. “cores”), each with two cores (a.k.a. “processing” or “SMT units”)
– Dedicated Arithmetic-Logic Units (ALUs) are available per SMT unit
– A module is more similar to:
• A dual core when doing integer ops
• A single core with SMT=2 when doing floating point ops
AMD Bulldozer
[Figures over three slides: the module layout, a thread placement marked “X”, and a thread placement marked “ok”]
WORKLOAD CHARACTERIZATION
Workload characterization
• Sorts processes and threads by how floating-point intensive they are
– Among the X most active threads (where X = the number of available cores)
• Based on a real-time monitoring system using Hardware Performance Counters (HPCs)
…about HPCs…
• Registers embedded into processors to keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resources: only a few of them can be used at the same time
– This limits their wide adoption (yet) on a large scale
• HW-specific
Workload characterization
• HPCs used:
– PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
– CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating point units are not being used by a thread
– L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
MONITORING AND SCHEDULING INFRASTRUCTURE DESIGN
BulldOver design
• Bulldozer Overseer -> BulldOver
• Client-server architecture
BulldOver design
• Server
– Daemon
– Scans the underlying architecture
– Time-based HPC monitoring (once per second)
• We target scientific workloads; short-lived threads are not well suited
– Applies scheduling policies
– Uses libHpcOverseer, hwloc, libpfm
BulldOver design
• Client
– Command-line tool
• prompt> bulldover java myprogram
– Traces the creation/termination of threads/processes
– Shares information with the server through shared memory
– Uses libmonitor, boost
BulldOver design
[Figure: BulldOver components in user space]
EVALUATION
Testing environment
• Dell PowerEdge M915
– 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each)
• Limited to 1 CPU with 8 cores/4 modules
– Test limited to a single NUMA node
• Avoiding latencies and other well-known NUMA-related effects
– Turbo mode and frequency scaling disabled
Benchmark suites
• SPEC CPU 2006
– A perfect match for evaluating integer vs. floating-point behavior
• SciMark 2.0
– Java-based
– Noisy environment (additional threads for garbage collection, JIT, etc.)
– Mainly FPU-oriented, with different levels of stress
– Modified multi-threaded version running several random benchmarks over a thread pool
Workload characterization: SPEC CPU 2006
[Chart: empty FPU cycles vs. total CPU cycles per benchmark]
Workload characterization: SciMark 2.0
[Chart: empty FPU cycles vs. total CPU cycles per benchmark]
FPU usage and caches
[Charts: FPU usage and L2 cache miss ratio]
Results for SPEC CPU 2006
[Chart: inefficient baseline vs. default OS scheduling vs. improved scheduling]
Running 4x Int and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores)
Discussion
• BulldOver avoids the worst-case scenario
– The default OS scheduler is not aware of the workload characterization
• Benefits come both from improved cache usage AND from better FPU/integer unit occupancy
Results for SciMark 2.0
[Chart: default OS scheduling vs. improved scheduling]
Running 8x benchmarks changing randomly over time on a single NUMA node (4 modules/8 cores)
Discussion
• All the threads are FPU-intensive
– But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since FPU usage intensity varies over time
– BulldOver reacts accordingly
Conclusions
- We show how thread scheduling that is unaware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty
- We presented a monitoring system able to characterize the most active threads according to their FPU/integer usage
- Thanks to the real-time analysis, improved scheduling can be applied and performance improved
- Our system is minimally intrusive:
- Low overhead (below 2%)
- No kernel patching required
- No code instrumentation
- Works on any application
Conclusions
• Currently tuned for a specific HW architecture
• Good for scientific workloads
– A sampling interval is required (1 sec in our case; it could be shorter but can’t be 0…)
• Based on a very simple scheduling policy
– More sophisticated policies could be used
“Pow7Over”
• Work in progress on IBM Power7 processors
– 1 CPU, 8 cores, up to 4 SMT units per core
– Completely different…
• …operating system: RHEL 6.3
• …architecture: PowerPC
• …HPCs: IBM-specific ones (more than 500 available…)
• …compiler: autotools 6.0
• Similar approach
• Slightly less significant speedup
– But this is a full SMT architecture
– Similar overall behavior both for the PUs and the L2 caches