Multicores, Manycores and Amdahl’s Law

1

2012

2

Amdahl’s Law – Reminder

• Original Amdahl’s Law for n identical cores
  – f – fraction of parallelizable execution time
  – (1-f) – fraction of totally sequential execution time

$$\text{speedup} = \frac{1}{(1-f) + \frac{f}{n}}$$

• Sequential runs on a single core
• Parallel runs on all n cores
• Q: What are the hidden assumptions?
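A minimal sketch of the classic formula, assuming the slide's definitions of f and n. The function name `amdahl_speedup` is introduced here for illustration and is not part of the original deck.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Classic Amdahl speedup for parallel fraction f on n identical cores."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The sequential share (1 - f) runs on one core;
    # the parallel share f is divided evenly among n cores.
    return 1.0 / ((1.0 - f) + f / n)


# Print the speedup on 16 cores for a few parallel fractions.
for f in (0.5, 0.9, 0.99):
    print(f, amdahl_speedup(f, 16))
```

Even with 16 cores, a parallel fraction of 0.9 yields a speedup of only about 6.4, which is why the sequential part dominates the rest of the discussion.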

3

Multicore CPU

Intel’s Sandy Bridge

• Manycore – Tens or hundreds of cores
• Why don’t we have Sandy Bridge with 100 cores?

4

Core Performance Constraints

• Manufacturing technology

• Area (for more logic)
  – Area = Money; Manufacturing constraints

• Power (for more logic, higher frequencies)
  – Sub-threshold leakage current
  – More power requires better cooling solutions

5

So Why Not One Single Core?


6

Large Core Performance

• We have a baseline core (BCE) with area=1, performance=1

• We can add microarchitectural features
  – New core area is then r (r>1)
  – Large core is faster, with performance of perf(r)

• Q: For which perf(r) function is a large core better than multiple small ones?

• So what is perf(r)?

[Figure: a BCE (e.g., a simple in-order core) next to a large core that adds big data caches, OOOE, accurate branch prediction and a uOp cache]

7

Area: Pollack’s Rule

• An empirical rule: $\mathrm{perf}(r) \sim \sqrt{r}$
• Multicore implications. For example: double the CPU logic and get
  – 40% more performance with a larger single-core
  – For purely parallel code – 100% more performance with dual-core
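The arithmetic behind the 40% vs. 100% comparison can be sketched as follows, assuming perf(r) is exactly sqrt(r) (the rule is empirical, so treat this as an approximation). The variable names are illustrative only.

```python
import math


def perf(r: float) -> float:
    """Pollack's Rule (assumed exact here): performance of a core with area r BCEs."""
    return math.sqrt(r)


# Spend a budget of 2 BCEs in two different ways.
single_large = perf(2)       # one core of area 2
dual_core = 2 * perf(1)      # two baseline cores, purely parallel code

gain_large = single_large - 1.0   # roughly 0.41, i.e. about 40% more performance
gain_dual = dual_core - 1.0       # 1.0, i.e. 100% more performance

print(gain_large, gain_dual)
```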

8

Power

• Power is usually considered proportional to area
• In this presentation we consider area as the main constraint
• Not completely true [Esmaeilzadeh’11]
• For simplicity we keep $\mathrm{perf}(r) \sim \sqrt{r}$

9

Why Multicore/Manycore?

• More performance per mm² and per watt for parallel code

• Less power (& heat)
  – Save power by turning each CPU on and off
  – Run each core at an optimized frequency/power
  – Load balance to distribute heat
  – Lower die temperatures

• New performance constraint: parallel fraction

10

Cost Model

• To find the best performing CPU configuration we need a cost model

• Basic core - Baseline Core Equivalent (BCE)
• Chip is limited to have no more than n BCEs
• Performance
  – Performance of each BCE is 1
  – Architects can expand the resources of r BCEs to create a powerful core with performance of perf(r)
• f – fraction of the parallelizable execution time

11

Symmetric Multicore Chips

• Run the sequential part on one core
• Run the parallel part on all cores

[Figure: two symmetric chips with n=16: sixteen 1-BCE cores (r=1) and four 4-BCE cores (r=4)]

12

Symmetric Multicore Chips

• n/r identical cores
• Each core performance perf(r)
• Execution
  – Sequential part – 1 core; performance - perf(r)
  – Parallel part – all cores; performance - perf(r) * n/r
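The execution model above translates directly into a speedup formula. The sketch below assumes perf(r) = sqrt(r) and treats n/r as a continuous quantity; `symmetric_speedup` is a name chosen here for illustration.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: perf(r) = sqrt(r)."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: float, r: float) -> float:
    """Speedup of a chip with n/r identical cores of r BCEs each."""
    # Sequential part runs on one core at speed perf(r).
    seq_time = (1.0 - f) / perf(r)
    # Parallel part runs on all n/r cores at combined speed perf(r) * n / r.
    par_time = f / (perf(r) * n / r)
    return 1.0 / (seq_time + par_time)


print(symmetric_speedup(0.9, 16, 2))
```

With f=0.9, n=16 and r=2 this gives a value close to the 6.7 reported on the n=16 chart, and with r=1 it reduces to the classic Amdahl formula.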

13

Symmetric, n=16

[Figure: symmetric speedup vs. R BCEs for n=16, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5. Highlighted point: F=0.9, R=2, Cores=8, Speedup=6.7]

As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?

14

Symmetric, n=256

[Figure: symmetric speedup vs. R BCEs for n=256, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5]

• F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7). Core enhancements!
• F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9). Core enhancements and more cores!
• F→1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16). More cores!

15

Symmetric Multicores

• In symmetric multicores with fixed n, perf(r)=sqrt(r), maximum performance is achieved when:

$$r_{opt} = \frac{(1-f)\,n}{f}$$

• Q1: When will a single core perform better than any symmetric multicore?

• Q2: In the optimal configuration, what is the ratio between the execution times of the sequential and parallel parts?
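The optimum on this slide can be recovered from the symmetric execution model, assuming perf(r) = sqrt(r) and treating r as a continuous variable. The symbol T(r), the normalized execution time, is introduced here for the derivation and does not appear in the original deck.

```latex
\begin{align*}
T(r) &= \frac{1-f}{\sqrt{r}} + \frac{f\,\sqrt{r}}{n} \\
\frac{dT}{dr} &= -\frac{1-f}{2\,r^{3/2}} + \frac{f}{2\,n\,\sqrt{r}} = 0 \\
\Rightarrow\quad r_{opt} &= \frac{(1-f)\,n}{f}
\end{align*}
```

For n=256 and f=0.9 this gives r_opt ≈ 28.4, consistent with the R=28 reported on the n=256 chart. In practice r is also limited to the range 1 ≤ r ≤ n.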

16

Asymmetric Multicore Chips

• Run the sequential part on the big core
• Run the parallel part on all cores

[Figure: asymmetric chip with n=16: one 4-BCE core and twelve 1-BCE cores]

17

Asymmetric Multicore Chips

• One large r-BCE core with performance of perf(r)
• n-r small 1-BCE cores with performance of 1
• Execution:
  – Sequential part – 1 core; performance - perf(r)
  – Parallel part – all cores; performance - perf(r) + n - r
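The asymmetric execution model can be sketched in the same way, again assuming perf(r) = sqrt(r). The function name `asymmetric_speedup` is illustrative and not taken from the deck.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: perf(r) = sqrt(r)."""
    return math.sqrt(r)


def asymmetric_speedup(f: float, n: float, r: float) -> float:
    """Speedup of one large r-BCE core plus (n - r) baseline cores."""
    # Sequential part runs on the large core only.
    seq_time = (1.0 - f) / perf(r)
    # Parallel part uses the large core and all n - r small cores together.
    par_time = f / (perf(r) + n - r)
    return 1.0 / (seq_time + par_time)


print(asymmetric_speedup(0.99, 256, 41))
```

For f=0.99, n=256 and r=41 the result is close to the speedup of 166 quoted on the asymmetric n=256 chart.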

18

Asymmetric, n=256

• Is the potential of an asymmetric architecture greater than that of a symmetric one?

[Figure: asymmetric speedup vs. R BCEs for n=256, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5. Highlighted point: F=0.99, R=41, Cores=216, Speedup=166]

19

Dynamic (Composed) Multicore Chips

• Combine up to r cores to boost sequential performance
  – Helper threads
  – Thread Level Speculation
  – Hardware support may be required

• Q: Why “up to r cores”?

20

Dynamic (Composed) Multicore Chips

• Execution:
  – Sequential part – 1 big core; performance - perf(r)
  – Parallel part – all cores; performance – n
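A sketch of the dynamic model under the same assumption, perf(r) = sqrt(r). The function name `dynamic_speedup` is chosen here for illustration.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: perf(r) = sqrt(r)."""
    return math.sqrt(r)


def dynamic_speedup(f: float, n: float, r: float) -> float:
    """Speedup when r BCEs fuse into one big core for sequential code only."""
    # Sequential part runs on the composed core at speed perf(r).
    seq_time = (1.0 - f) / perf(r)
    # Parallel part runs on all n baseline cores at combined speed n.
    par_time = f / n
    return 1.0 / (seq_time + par_time)


print(dynamic_speedup(0.99, 256, 256))
```

Because the parallel term no longer depends on r, increasing r never hurts in this model, so the best choice is r = n.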

21

Dynamic, n=256

• Q: How does dynamic multicore scale relative to symmetric and asymmetric?

[Figure: dynamic speedup vs. R BCEs for n=256, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5. Highlighted point: F=0.99, R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)]

Note: #Cores always N=256
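To compare the three designs side by side, the sketch below searches integer values of r for the best speedup of each model at F=0.99 and n=256. It assumes perf(r) = sqrt(r) throughout; the helper `best` and the model function names are hypothetical and not taken from the deck.

```python
import math


def perf(r):
    """Assumed core performance model: perf(r) = sqrt(r)."""
    return math.sqrt(r)


def symmetric(f, n, r):
    # n/r identical cores, each with performance perf(r).
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))


def asymmetric(f, n, r):
    # One large r-BCE core plus n - r baseline cores.
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))


def dynamic(f, n, r):
    # r BCEs compose into one core for sequential code; all n run parallel code.
    return 1.0 / ((1 - f) / perf(r) + f / n)


def best(model, f, n):
    """Return the maximum speedup of `model` over integer r in 1..n."""
    return max(model(f, n, r) for r in range(1, n + 1))


best_sym = best(symmetric, 0.99, 256)
best_asym = best(asymmetric, 0.99, 256)
best_dyn = best(dynamic, 0.99, 256)
print(best_sym, best_asym, best_dyn)
```

Under these assumptions the ordering is symmetric < asymmetric < dynamic, in line with the speedups of 80, 166 and 223 quoted on the charts.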

22

Manufacturing Technology

• New manufacturing technology will not save us

23

The Future…

24

Summary

• Multicores and manycores are required due to the diminishing returns of large cores

• Amdahl’s Law allows us to predict the performance of various architectures

• Dynamic (composed) architecture is promising
• To take advantage of future CPUs, the parallel fraction of the code must be very high
• …and still we are going to have a problem

25

References

• Amdahl’s Law in the Multicore Era [Hill’08]
• Thousand Core Chips: A Technology Perspective [Borkar’07]
• Dark Silicon and the End of Multicore Scaling [Esmaeilzadeh’11]
• Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors [Morad’05]
