
Hardware-aware thread scheduling: the case of asymmetric multicore processors

May 25, 2015


Talk given at ICPADS 2012 in Singapore.
Transcript
Page 1: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Hardware-aware thread scheduling: the case of asymmetric multicore processors

Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder

* [email protected]://sosoa.inf.unisi.ch

Page 2: Hardware-aware thread scheduling: the case of asymmetric multicore processors

CONTEXT AND OVERALL IDEA
Introduction

Page 3: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Context

• Modern CPUs increase computational power by adding more cores
• HW architectures are becoming increasingly complex:
  – Shared caches
  – Non-Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units

Page 4: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Context

• The Operating System (OS) kernel and scheduler try to automatically optimize application performance according to the available resources:
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU time, memory usage, etc.)

Page 5: Hardware-aware thread scheduling: the case of asymmetric multicore processors

“Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”

Performance Anxiety: Performance analysis in the new millennium
Joshua Bloch, Google Inc.

Page 6: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Contributions

1) Automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers

2) Hardware-aware optimized scheduler making decisions based on hardware resource usage and the output of the workload analysis
   – to improve processing-unit occupancy on SMT/asymmetric processors

Page 7: Hardware-aware thread scheduling: the case of asymmetric multicore processors

The big picture

[Diagram: a monitoring daemon observes the OS threads and processes and produces the workload characterization (FPU vs. INT usage)]

Page 8: Hardware-aware thread scheduling: the case of asymmetric multicore processors

The big picture

[Diagram: the workload characterization (FPU vs. INT usage) feeds the hardware-aware scheduler]

Page 9: Hardware-aware thread scheduling: the case of asymmetric multicore processors

AMD BULLDOZER PROCESSOR
Target architecture

Page 10: Hardware-aware thread scheduling: the case of asymmetric multicore processors

AMD Bulldozer

• AMD Bulldozer architecture
  – Each CPU is implemented as a series of modules (a.k.a. "cores"), each with two cores (a.k.a. "processing" or "SMT units")
  – Separate Arithmetic-Logic Units (ALUs) are available per SMT unit, while the floating-point unit is shared within a module
  – A module is therefore more similar to:
    • A dual core when doing integer ops
    • A single core with SMT=2 when doing floating-point ops

(A minimal thread-pinning sketch follows.)
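Not part of the slides: a minimal sketch, assuming Linux and a CPU numbering in which SMT units 2k and 2k+1 share the same Bulldozer module, of how FPU-intensive threads could be spread one per module so that they do not compete for the shared floating-point unit. The function names are illustrative, not BulldOver's actual code.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin thread `tid` to SMT unit `cpu` (tid 0 = the calling thread).
 * Assumes Linux and that SMT units 2k and 2k+1 share one module. */
static int pin_to_cpu(pid_t tid, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

/* Spread FPU-intensive threads one per module: use only even-numbered
 * SMT units, so no two of them share a floating-point unit. */
static void spread_fpu_threads(const pid_t *tids, int n, int n_modules) {
    for (int i = 0; i < n && i < n_modules; i++)
        if (pin_to_cpu(tids[i], 2 * i) != 0)
            perror("sched_setaffinity");
}

int main(void) {
    pid_t self[] = { 0 };            /* 0 = the calling thread            */
    spread_fpu_threads(self, 1, 4);  /* e.g. 4 modules on one NUMA node   */
    return 0;
}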

Page 11: Hardware-aware thread scheduling: the case of asymmetric multicore processors


AMD Bulldozer

Page 12: Hardware-aware thread scheduling: the case of asymmetric multicore processors

AMD Bulldozer

[Figure: thread placement example marked "X"]

Page 13: Hardware-aware thread scheduling: the case of asymmetric multicore processors

AMD Bulldozer

[Figure: thread placement example marked "ok"]

Page 14: Hardware-aware thread scheduling: the case of asymmetric multicore processors


WORKLOAD CHARACTERIZATION

Page 15: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Workload characterization

• Used to identify the floating-point-intensive processes and threads
  – Among the X busiest threads (where X = the number of available cores)
• Based on a real-time monitoring system using Hardware Performance Counters (HPCs)

Page 16: Hardware-aware thread scheduling: the case of asymmetric multicore processors

…about HPCs…

• Registers embedded in processors that keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resources: only a few of them can be used at the same time
  – This (still) limits their wide adoption on a large scale
• HW-specific

(A sketch of reading an HPC on Linux follows.)
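The slides do not show any code; as a hedged illustration, on Linux a hardware counter can be accessed through the perf_event_open(2) system call (libraries such as libpfm help translate symbolic event names into the attribute structure this call expects). A minimal sketch that reads PERF_COUNT_HW_CPU_CYCLES for the calling thread:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open a counter for PERF_COUNT_HW_CPU_CYCLES on thread `tid`
 * (tid 0 = the calling thread), counted on any CPU. */
static int open_cycles_counter(pid_t tid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int) syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);
}

int main(void) {
    int fd = open_cycles_counter(0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                     /* workload to be measured */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    printf("CPU cycles: %llu\n", (unsigned long long) cycles);
    close(fd);
    return 0;
}

The AMD-specific events listed on the next slide (CYCLES_FPU_EMPTY, L2_CACHE_MISSES) would be programmed as raw, HW-specific events rather than the generic one shown here.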

Page 17: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Workload characterization

• HPCs used (combined as sketched below):
  – PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
  – CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating-point units are not used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
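A hedged sketch of how these three counters could be combined to rank the most active threads by floating-point intensity; the exact metric and data structures used by BulldOver are not shown on the slide, so the names below are illustrative:

#include <stdint.h>
#include <stdlib.h>

/* One per-thread counter sample (illustrative layout). */
typedef struct {
    int      tid;
    uint64_t cpu_cycles;       /* PERF_COUNT_HW_CPU_CYCLES */
    uint64_t fpu_empty_cycles; /* CYCLES_FPU_EMPTY         */
    uint64_t l2_misses;        /* L2_CACHE_MISSES          */
} thread_sample;

/* FPU intensity: fraction of cycles in which the FPU was busy.
 * Higher values identify floating-point-intensive threads. */
static double fpu_intensity(const thread_sample *s) {
    if (s->cpu_cycles == 0) return 0.0;
    return 1.0 - (double) s->fpu_empty_cycles / (double) s->cpu_cycles;
}

/* qsort comparator: most FPU-intensive threads first. */
static int by_fpu_intensity_desc(const void *a, const void *b) {
    double da = fpu_intensity(a), db = fpu_intensity(b);
    return (da < db) - (da > db);
}

int main(void) {
    thread_sample s[2] = {
        { .tid = 101, .cpu_cycles = 1000000, .fpu_empty_cycles = 900000, .l2_misses = 50 },
        { .tid = 102, .cpu_cycles = 1000000, .fpu_empty_cycles = 200000, .l2_misses = 80 },
    };
    qsort(s, 2, sizeof(s[0]), by_fpu_intensity_desc);
    /* s[0] is now the most FPU-intensive thread (tid 102 here). */
    return 0;
}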

Page 18: Hardware-aware thread scheduling: the case of asymmetric multicore processors

MONITORING AND SCHEDULING INFRASTRUCTURE DESIGN

Page 19: Hardware-aware thread scheduling: the case of asymmetric multicore processors

BulldOver design

• Bulldozer Overseer -> BulldOver
• Client-server architecture

Page 20: Hardware-aware thread scheduling: the case of asymmetric multicore processors

BulldOver design

• Server
  – Daemon
  – Scans the underlying architecture (see the hwloc sketch below)
  – Time-based HPC monitoring (once per second)
    • We target scientific workloads; short-lived threads are not a good fit
  – Applies scheduling policies
  – Built on libHpcOverseer, hwloc, libpfm
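Not from the slides: a minimal sketch, using only standard hwloc calls, of how a daemon could scan the underlying architecture before starting the monitoring loop (compile with -lhwloc; how Bulldozer modules map onto hwloc cores and PUs depends on the kernel and hwloc versions):

#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;

    /* Build the hardware topology of this machine. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Count cores and processing units (logical CPUs / SMT units). */
    int n_cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int n_pus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);

    printf("cores: %d, processing units: %d, PUs per core: %d\n",
           n_cores, n_pus, n_cores ? n_pus / n_cores : 0);

    hwloc_topology_destroy(topo);
    return 0;
}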

Page 21: Hardware-aware thread scheduling: the case of asymmetric multicore processors

BulldOver design

• Client
  – Command-line tool
    • prompt> bulldover java myprogram
  – Traces the creation/termination of threads/processes
  – Shares information with the server through shared memory (see the sketch below)
  – Built on libmonitor, boost
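Not from the slides: a minimal POSIX shared-memory sketch of how the client could publish thread information for the server to read. The segment name and record layout are illustrative, not BulldOver's actual protocol (link with -lrt on older glibc):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_TIDS 256
typedef struct {
    int n_threads;
    int tids[MAX_TIDS];
} thread_table;                      /* illustrative client->server record */

int main(void) {
    /* Client side: create/open a named segment the daemon can map too. */
    int fd = shm_open("/bulldover_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(thread_table)) != 0) { perror("ftruncate"); return 1; }

    thread_table *tab = mmap(NULL, sizeof(*tab), PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    if (tab == MAP_FAILED) { perror("mmap"); return 1; }

    memset(tab, 0, sizeof(*tab));
    tab->tids[tab->n_threads++] = (int) getpid();  /* e.g. the main thread */

    munmap(tab, sizeof(*tab));
    close(fd);
    /* shm_unlink("/bulldover_demo") removes the segment when done. */
    return 0;
}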

Page 22: Hardware-aware thread scheduling: the case of asymmetric multicore processors

BulldOver design

[Diagram: client and server components of BulldOver, both running in user space]

Page 23: Hardware-aware thread scheduling: the case of asymmetric multicore processors


EVALUATION

Page 24: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Testing environment

• Dell PowerEdge M915
  – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each)
    • Limited to 1 CPU with 8 cores/4 modules
  – Tests limited to a single NUMA node
    • Avoiding latencies and other well-known NUMA-related effects
  – Turbo mode and frequency scaling disabled

Page 25: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Benchmark suites

• SPEC CPU 2006
  – Perfect match for evaluating integer vs. floating-point behaviors
• SciMark 2.0
  – Java-based
  – Noisy environment (additional threads for garbage collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several random benchmarks over a thread pool

Page 26: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Workload characterization: SPEC CPU 2006

[Chart: empty FPU cycles vs. total CPU cycles per benchmark]

Page 27: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Workload characterization: SciMark 2.0

[Chart: empty FPU cycles vs. total CPU cycles per benchmark]

Page 28: Hardware-aware thread scheduling: the case of asymmetric multicore processors

FPU usage and caches

[Charts: FPU usage and L2 cache miss ratio]

Page 29: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Results for SPEC CPU 2006

[Chart comparing an inefficient baseline, the default OS scheduling, and the improved scheduling]

Running 4x integer and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores)

Page 30: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Discussion

• BulldOver avoids the worst-case scenario
  – The default OS scheduler is not aware of the workload characterization
• Benefits come both from improved cache usage AND from better FPU/integer unit occupancy

Page 31: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Results for SciMark 2.0

[Chart comparing the default OS scheduling and the improved scheduling]

Running 8x benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores)

Page 32: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Discussion

• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup "for free"
• Dynamic adaptation, since the FPU usage intensity varies over time
  – BulldOver reacts accordingly

Page 33: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Conclusions

- We showed how thread scheduling that is not aware of the shared HW resources of the AMD Bulldozer processor can incur a significant performance penalty
- We presented a monitoring system able to characterize the most active threads according to their FPU/integer usage
- Thanks to the real-time analysis, improved scheduling can be applied and performance improved
- Our system is minimally intrusive:
  - Low overhead (below 2%)
  - No kernel patching required
  - No code instrumentation
  - Works on any application

Page 34: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Conclusions

• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 sec in our case; it could be shorter, but not 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used

Page 35: Hardware-aware thread scheduling: the case of asymmetric multicore processors

Thanks!

Achille Peternier
[email protected]
http://sosoa.inf.unisi.ch

Page 36: Hardware-aware thread scheduling: the case of asymmetric multicore processors

"Pow7Over"

• Work in progress on IBM Power7 processors
  – 1 CPU, 8 cores, up to 4 SMT units per core
  – Completely different…
    • …operating system: RHEL 6.3
    • …architecture: PowerPC
    • …HPCs: IBM-specific ones (more than 500 available…)
    • …compiler: autotools 6.0
• Similar approach
• Slightly less significant speedup
  – But this is a full SMT processor
  – Similar overall behavior both for the PUs and the L2 caches