When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency


Gaurav Chadha, Scott Mahlke, Satish Narayanasamy

University of Michigan

Motivation

• Hardware trends
  o CMPs are ubiquitous.
  o More and more cores in a system
    • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3
    • Server: Tilera
• Multi-threaded applications are pervasive.
• But, do we always want to maximize the number of threads? NO

Run fewer threads: DVFS

• Most multi-threaded applications stop scaling beyond a certain number of cores.
• It becomes counter-productive to run more threads.
• Maximum power budget is fixed for a system.
• Fewer cores can "borrow" power from disabled cores.
  o Intel Turbo Boost

[Figure: frequency vs. active cores under Intel Turbo Boost; frequency increases in steps of 133 MHz as cores are disabled.]
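As a rough illustration of this borrowing, here is a minimal sketch assuming each disabled core frees one 133 MHz boost step for the cores that remain on; the 2.66 GHz base clock is a made-up example, not a specific part:

    #include <stdio.h>

    /* Illustrative Turbo Boost model: base clock plus one 133 MHz step
     * per disabled core. Real bin tables are more complicated. */
    static double boosted_ghz(double base_ghz, int disabled_cores) {
        return base_ghz + disabled_cores * 0.133;
    }

    int main(void) {
        for (int disabled = 0; disabled <= 3; disabled++)
            printf("%d core(s) disabled -> %.3f GHz\n",
                   disabled, boosted_ghz(2.66, disabled));
        return 0;
    }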

Scalability: Problems

• Too many threads
  o Increased contention for shared resources.
  o Increased synchronization costs.
• Too few threads
  o Underutilization of resources.

[Figure: PARSEC speedup over 1 thread vs. #threads (1 to 32, with #threads = #cores) for blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, streamcluster, swaptions, vips, and x264.]

Scalability: Fewer threads are better

• 4 threads is best for streamcluster


Scalability: Fewer threads are as good


• ferret, facesim, x264, and dedup show poor scalability

Scalability: Opportunities


• Run fewer threads
  o Disable some cores and increase frequency of the active ones.


Run fewer threads: DVFS

[Figure, left panel: PARSEC speedup over 1 thread vs. #threads (#threads = #cores), with every configuration at 1.1 GHz.]

[Figure, right panel: the same plot with DVFS; frequency (GHz) at 1/2/4/8/16/32 threads: 3.6 / 2.8 / 2.2 / 1.8 / 1.4 / 1.1. The animation first highlights streamcluster, then shows all benchmarks.]


• DVFS makes the case for fewer threads more compelling.
• With fewer threads
  o increase frequency
  o reduce contention.


• 5 out of 11 benchmarks perform best with fewer threads under DVFS.

Who can decide the best number of threads?

DVFS in current systems

[Diagram: execution progress of the threads; 10 threads stalled, then 12, then 16, with the frequency stepping 1.1 GHz, 1.1 GHz, 1.1 GHz, 1.4 GHz as Turbo Boost reacts to stalled threads.]

• Programmer decides how many threads to run (e.g. 32 threads on 32 cores).
• Turbo Boost increases frequency as threads stall.
• But the best number of threads varies:
  o Inputs change
  o System resources change
  o Different hardware configurations
  o Program characteristics change

Our system

[Diagram: execution progress; 10 threads stalled, then 12; detection logic pro-actively disables more threads (16 threads stalled/disabled), and Turbo Boost raises the frequency from 1.1 GHz to 1.4 GHz.]

Less Is MOre (LIMO)

• Less Is MOre for efficiency
• Observation:
  o Most programs do not scale after a certain limit
  o DVFS can help provide better performance
• A runtime system (sketched below)
  o Monitors shared resource contention (shared cache, shared program variables)
  o Pro-actively disables threads
  o Employs DVFS
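A minimal, self-contained sketch of that control loop, assuming hypothetical monitor hooks (sample_l2_missrate, sample_sync_stalls) and illustrative thresholds in place of the hardware counters and policies the paper actually uses; the DVFS table is the one from the methodology slide:

    #include <stdio.h>

    #define MAX_THREADS 32

    /* Stubbed monitors: a real runtime would read hardware performance
     * counters here. These stand-ins simply grow with the thread count. */
    static double sample_l2_missrate(int active) { return active / 40.0; }
    static double sample_sync_stalls(int active) { return active / 50.0; }

    /* DVFS table from the methodology slide: per-core frequency when
     * 4/8/16/32 cores are active. */
    static double dvfs_ghz(int active) {
        if (active <= 4)  return 2.268;
        if (active <= 8)  return 1.8;
        if (active <= 16) return 1.429;
        return 1.134;
    }

    int main(void) {
        int active = MAX_THREADS;
        for (int interval = 0; interval < 4; interval++) {
            /* Pro-actively disable threads when contention is high,
             * then let DVFS raise the remaining cores' frequency. */
            if (sample_l2_missrate(active) > 0.5 ||
                sample_sync_stalls(active) > 0.4)
                active /= 2;
            printf("interval %d: %2d threads @ %.3f GHz\n",
                   interval, active, dvfs_ghz(active));
        }
        return 0;
    }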

Outline

• Roadblocks to scalability
• LIMO
• Methodology
• Results
• Conclusion

Roadblocks

• Physical shared resources
  o Shared cache
• Program level shared resources

Roadblocks: Shared Cache

[Figure: speedup over 1 thread vs. #threads for an abstract program; performance is best while the working set fits in the shared cache, degrades once the working set does not fit, and collapses when the working set is too large.]

• Abstract representation of most multi-threaded programs
• The peak performance point shifts depending on working set size and shared cache size (see the sketch below).
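A toy sketch of that shift, with made-up per-thread working-set and cache sizes: choose the largest thread count whose aggregate working set still fits in the shared cache.

    #include <stdio.h>

    /* Toy model of the curve above: more threads help only while their
     * combined working set fits in the shared cache. Inputs are
     * illustrative, not measured. */
    static int best_thread_count(double ws_per_thread_mb, double cache_mb,
                                 int max_threads) {
        int best = 1;
        for (int t = 1; t <= max_threads; t *= 2)
            if (t * ws_per_thread_mb <= cache_mb)
                best = t;   /* still fits: more parallelism helps */
        return best;
    }

    int main(void) {
        /* 8 MB shared cache, 1.5 MB of working set per thread -> 4. */
        printf("best #threads = %d\n", best_thread_count(1.5, 8.0, 32));
        return 0;
    }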

Roadblocks: Program Resources

• Physical shared resources
  o Shared cache
• Program level shared resources
  o Synchronization stalls (locks)

[Figure: speedup over 1 thread vs. #threads (1 to 32) for a lock-bound program; increased parallelism gives more performance up to a best-performance point, after which increased synchronization costs hurt performance.]
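The same hump shape falls out of any model in which synchronization cost grows with thread count; a toy model (ours, not the paper's):

    #include <stdio.h>

    /* speedup(t) = t / (1 + overhead * t * (t - 1)): linear gains from
     * parallelism, quadratic losses from lock contention. */
    static double speedup(int t, double overhead) {
        return t / (1.0 + overhead * t * (t - 1));
    }

    int main(void) {
        int best = 1;
        for (int t = 1; t <= 32; t *= 2)
            if (speedup(t, 0.01) > speedup(best, 0.01))
                best = t;
        printf("best #threads = %d (speedup %.2f)\n",
               best, speedup(best, 0.01));   /* peaks at 8 threads */
        return 0;
    }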

LIMO

[Diagram: execution progress; 10, then 12 threads stall, and LIMO's detection logic pro-actively disables more threads (16 threads stalled/disabled) while the frequency rises from 1.1 GHz to 1.4 GHz. After 100 million instructions a working set size estimate is calculated; the working set of 10 threads fits in the cache, so 6 threads are disabled, and LIMO pro-actively disables more until 8 threads run at 1.8 GHz.]

• 20 threads at 1.1 GHz: 20 * 1.1 = 22

• 16 threads at 1.4 GHz: 16 * 1.4 = 22.4

• 10 threads at 1.4 GHz: 10 * 1.4 = 14

• 8 threads at 1.8 GHz: 8 * 1.8 = 14.4
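These products are aggregate thread-GHz; the implied rule is to disable threads only when the smaller configuration's product at its boosted frequency beats the larger one's. A minimal check using the slide's numbers:

    #include <stdio.h>

    /* The slide's two decisions, as threads * GHz comparisons. */
    int main(void) {
        struct { int threads; double ghz; } cfg[] = {
            {20, 1.1}, {16, 1.4},  /* 22.0 vs 22.4: drop to 16 threads */
            {10, 1.4}, { 8, 1.8},  /* 14.0 vs 14.4: drop to 8 threads  */
        };
        for (int i = 0; i < 4; i += 2)
            printf("%2d @ %.1f GHz = %4.1f  vs  %2d @ %.1f GHz = %4.1f\n",
                   cfg[i].threads, cfg[i].ghz, cfg[i].threads * cfg[i].ghz,
                   cfg[i+1].threads, cfg[i+1].ghz,
                   cfg[i+1].threads * cfg[i+1].ghz);
        return 0;
    }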

Methodology: Configuration

• Modified timing simulator FeS2, which uses Simics.
• Hardware configuration:

  Cores                          32, out-of-order
  Caches                         Inclusive
  Coherence protocol             MOESI directory
  Topology                       Mesh
  Off-chip memory bandwidth      5 Gbps
  L1 data cache                  Private
  L2 cache                       Shared
  Main memory latency            156 cycles
  L1 hit latency                 3 cycles
  L2 hit latency                 11 cycles
  Router + network link latency  5 cycles

• DVFS settings:

  Cores   Frequency (GHz)
  4       2.268
  8       1.8
  16      1.429
  32      1.134
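As a sanity check (our observation, not stated on the slide), the DVFS table matches cube-root frequency scaling under a fixed power budget, f(n) = 1.134 GHz * (32/n)^(1/3):

    #include <math.h>
    #include <stdio.h>

    /* Reproduces the DVFS table: 2.268, 1.800, 1.429, 1.134 GHz. */
    int main(void) {
        for (int n = 4; n <= 32; n *= 2)
            printf("%2d cores -> %.3f GHz\n", n, 1.134 * cbrt(32.0 / n));
        return 0;
    }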

Methodology: Simulation

• 9 evenly spaced checkpoints
• Timing simulations starting from these checkpoints
• 80 million useful instructions simulated per checkpoint
  o Statistics cleared after the first 20 million
  o Useful instructions: committed in user mode, excluding spin loops.
• Benchmarks from the PARSEC benchmark suite, the Apache web server (httpd), and a speech recognition benchmark (sphinx) from ALP.

Example perf. breakdown: Ferret

[Figure: instructions per ns vs. execution interval (1 to 9) for 8, 16, and 32 threads.]

Example perf. breakdown: Ferret

[Figure: per-interval comparison of 8 threads, 32 threads, and LIMO; panels show instructions per ns, % synchronization stalls, L2 load misses, and the number of active cores (numProcs), with LIMO adapting the active-core count each interval.]

% Performance Improvement

[Figure: speedup over 32 threads for TB_DVFS and LIMO across blackscholes, dedup, facesim, swaptions, vips, fluidanimate, httpd, sphinx, ferret, streamcluster, and the mean (y-axis 0.6 to 2.2). Annotations: good scalability; reduced synchronization stalls; reduced thrashing in shared cache.]

Conclusion

• Scalability is difficult to achieve and predict.
• Determining the best number of threads is hard.
  o Contention in shared hardware resources
  o Contention in program level shared objects
• LIMO frees the programmer from this burden.
  o Monitors shared resource contention (shared cache, shared program variables)
  o Pro-actively disables threads
  o Employs DVFS
• 14% average performance improvement over running all threads (the 32-thread baseline).

Thank you!
