Multicores, Manycores and Amdahl’s Law, 2012
Feb 24, 2016
Amdahl’s Law – Reminder
• Original Amdahl’s Law for n identical cores
– f – fraction of parallelizable execution time
– (1-f) – fraction of totally sequential execution time
• Sequential part runs on a single core
• Parallel part runs on all n cores
• Q: What are the hidden assumptions?
$$\text{speedup} = \frac{1}{(1-f) + \frac{f}{n}}$$
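Amdahl's Law for n identical cores can be evaluated directly. The following is a minimal Python sketch; the function name `amdahl_speedup` is an illustrative choice and does not come from the slides.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Classic Amdahl speedup.

    f: fraction of execution time that is parallelizable (0..1)
    n: number of identical cores; the sequential part runs on one of them
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    # Even with many cores, the sequential fraction (1-f) bounds the speedup by 1/(1-f).
    for n in (2, 16, 256):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With f fixed, increasing n only shrinks the f/n term, so the speedup can never exceed 1/(1-f).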
Multicore CPU
Intel’s Sandy Bridge
• Manycore – tens or hundreds of cores
• Why don’t we have Sandy Bridge with 100 cores?
Core Performance Constraints
• Manufacturing technology
• Area (for more logic)
– Area = money; manufacturing constraints
• Power (for more logic, higher frequencies)
– Sub-threshold leakage current
– More power requires better cooling solutions
So Why Not One Single Core?
Large Core Performance
• We have a baseline core (BCE) with area=1, performance=1
• We can add microarchitectural features
– New core area is then r (r>1)
– Large core is faster, with performance of perf(r)
• Q: For which perf(r) function is a large core better than multiple small ones?
• So what is perf(r)?
[Figure: a BCE (e.g., a simple in-order core) next to a large core that adds OOOE, accurate branch prediction, a uOp cache and big data caches]
Area: Pollack’s Rule
• An empirical rule: $\text{perf}(r) \sim \sqrt{r}$
• Multicore implications. For example, double the CPU logic and get:
– 40% more performance with a larger single core
– For purely parallel code – 100% more performance with a dual-core
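The arithmetic behind the "40% vs. 100%" comparison can be checked with a short sketch. It assumes Pollack's rule in the form perf(r) = sqrt(r); the helper names `one_big_core` and `many_small_cores` are hypothetical and only used for illustration.

```python
import math


def perf(r: float) -> float:
    """Pollack's rule (assumed form): performance grows as the square root of area."""
    return math.sqrt(r)


def one_big_core(area: float) -> float:
    """Performance when the whole area budget is spent on a single larger core."""
    return perf(area)


def many_small_cores(area: float) -> float:
    """Aggregate performance of `area` baseline cores on purely parallel code."""
    return area * perf(1.0)


if __name__ == "__main__":
    area = 2.0  # double the CPU logic
    print(f"single larger core: +{(one_big_core(area) - 1) * 100:.0f}%")
    print(f"dual-core (parallel code): +{(many_small_cores(area) - 1) * 100:.0f}%")
```

The second figure only holds for purely parallel code; any sequential fraction reduces it according to Amdahl's Law.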
Power
• Power is usually considered as proportional to area
• In this presentation we consider area as the main constraint
• Not completely true [Esmaeilzadeh’11]
• For simplicity we keep $\text{perf}(r) \sim \sqrt{r}$
Why Multicore/Manycore?
• More performance per mm² and per watt for parallel code
• Less power (& heat)
– Save power by turning each CPU on and off
– Run each core at its optimal frequency/power
– Load balance to distribute heat
– Lower die temperatures
• New performance constraint: parallel fraction
Cost Model
• To find the best performing CPU configuration we need a cost model
• Basic core – Baseline Core Equivalent (BCE)
• Chip is limited to no more than n BCEs
• Performance
– Performance of each BCE is 1
– Architects can expand the resources of r BCEs to create a powerful core with performance of perf(r)
• f – fraction of the parallelizable execution time
Symmetric Multicore Chips
[Figure: two symmetric chips with n=16: sixteen 1-BCE cores (r=1) and four 4-BCE cores (r=4)]
• Run the sequential part on one core
• Run the parallel part on all cores
Symmetric Multicore Chips
• n/r identical cores
• Each core has performance perf(r)
• Execution
– Sequential part – 1 core; performance – perf(r)
– Parallel part – all cores; performance – perf(r) * n/r
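Plugging the two execution rates of the symmetric chip into Amdahl's Law gives a speedup of 1 / ((1-f)/perf(r) + f·r/(perf(r)·n)). The sketch below assumes perf(r) = sqrt(r), as used throughout these slides; the function names are illustrative.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: perf(r) = sqrt(r) (Pollack's rule)."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: float) -> float:
    """Speedup of a symmetric chip with n BCEs split into n/r cores of r BCEs each."""
    seq_time = (1.0 - f) / perf(r)        # sequential part on one r-BCE core
    par_time = f / (perf(r) * n / r)      # parallel part on all n/r cores
    return 1.0 / (seq_time + par_time)


if __name__ == "__main__":
    # Sweep the core size for a 16-BCE chip and a 90% parallel workload.
    for r in (1, 2, 4, 8, 16):
        print(r, round(symmetric_speedup(0.9, 16, r), 2))
```

Sweeping r this way shows the trade-off: larger cores speed up the sequential part but leave fewer cores for the parallel part.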
Symmetric, n=16
[Figure: symmetric speedup vs. r (BCEs per core) for n=16, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5]
• F=0.9: R=2, Cores=8, Speedup=6.7
• As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?
Symmetric, n=256
[Figure: symmetric speedup vs. r (BCEs per core) for n=256, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5]
• F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7) – CORE ENHANCEMENTS!
• F→1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16) – MORE CORES!
• F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9) – CORE ENHANCEMENTS & MORE CORES!
Symmetric Multicores
• In symmetric multicores with fixed n and perf(r)=sqrt(r), maximum performance is achieved when:
$$r_{opt} = \frac{(1-f)\,n}{f}$$
• Q1: When will a single core perform better than any symmetric multicore?
• Q2: In the optimal configuration, how is the execution time split between the sequential and parallel parts?
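The optimal core size follows from minimizing the normalized execution time of the symmetric chip. The short derivation below is a sketch that assumes perf(r) = sqrt(r) and treats r as a continuous variable; the symbol T(r) is introduced here for illustration.

```latex
\begin{aligned}
T(r) &= \frac{1-f}{\sqrt{r}} + \frac{f\,\sqrt{r}}{n}
  && \text{(sequential time + parallel time)} \\
\frac{dT}{dr} &= -\frac{1-f}{2\,r^{3/2}} + \frac{f}{2\,n\,\sqrt{r}} = 0 \\
\Rightarrow\quad r_{opt} &= \frac{(1-f)\,n}{f}
\end{aligned}
```

As a sanity check, n=256 and f=0.9 give r_opt ≈ 28.4, and f=0.99 gives r_opt ≈ 2.6, in line with R=28 and R=3 on the n=256 plot. The result is only meaningful for 1 ≤ r_opt ≤ n; outside that range the best feasible r lies at the boundary.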
Asymmetric Multicore Chips
[Figure: asymmetric chip with one 4-BCE core and twelve 1-BCE cores (n=16)]
• Run the sequential part on the big core
• Run the parallel part on all cores
Asymmetric Multicore Chips
• One large r-BCE core with performance of perf(r)
• n-r small 1-BCE cores with performance of 1
• Execution:
– Sequential part – 1 core; performance – perf(r)
– Parallel part – all cores; performance – perf(r) + n - r
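For the asymmetric chip the same reasoning gives a speedup of 1 / ((1-f)/perf(r) + f/(perf(r) + n - r)). A minimal sketch, again assuming perf(r) = sqrt(r); the function names are illustrative.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: perf(r) = sqrt(r) (Pollack's rule)."""
    return math.sqrt(r)


def asymmetric_speedup(f: float, n: int, r: float) -> float:
    """Speedup of a chip with one r-BCE core plus (n - r) single-BCE cores."""
    seq_time = (1.0 - f) / perf(r)            # sequential part on the big core
    par_time = f / (perf(r) + (n - r))        # parallel part on big core + small cores
    return 1.0 / (seq_time + par_time)


if __name__ == "__main__":
    # Search integer big-core sizes for the best configuration at n=256, f=0.99.
    best_r = max(range(1, 257), key=lambda r: asymmetric_speedup(0.99, 256, r))
    print(best_r, round(asymmetric_speedup(0.99, 256, best_r), 1))
```

Unlike the symmetric case, growing the big core costs only r - perf(r) units of parallel throughput, which is why much larger r values pay off.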
Asymmetric, n=256
• Is the potential of the asymmetric architecture greater than that of the symmetric one?
[Figure: asymmetric speedup vs. r (BCEs in the large core) for n=256, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5]
• Recall F=0.99: R=41, Cores=216, Speedup=166
Dynamic (Composed) Multicore Chips
• Combine up to r cores to boost sequential performance
– Helper threads
– Thread Level Speculation
– Hardware support may be required
• Q: Why “up to r cores”?
Dynamic (Composed) Multicore Chips
• Execution:
– Sequential part – 1 big core; performance – perf(r)
– Parallel part – all cores; performance – n
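For the dynamic chip the speedup becomes 1 / ((1-f)/perf(r) + f/n), since all n BCEs stay available for the parallel part. A minimal sketch, assuming perf(r) = sqrt(r) and ignoring any overhead of composing cores; the function names are illustrative.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: perf(r) = sqrt(r) (Pollack's rule)."""
    return math.sqrt(r)


def dynamic_speedup(f: float, n: int, r: float) -> float:
    """Speedup when r BCEs fuse into one core for sequential code and split for parallel code."""
    seq_time = (1.0 - f) / perf(r)   # sequential part on the composed r-BCE core
    par_time = f / n                 # parallel part on all n single-BCE cores
    return 1.0 / (seq_time + par_time)


if __name__ == "__main__":
    # With no penalty for composing cores, a larger r never hurts.
    for r in (1, 16, 64, 256):
        print(r, round(dynamic_speedup(0.99, 256, r), 1))
```

Because r only appears in the sequential term, the speedup under this model increases monotonically with r up to r = n.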
Dynamic, n=256
• Q: How does the dynamic multicore scale relative to symmetric and asymmetric?
[Figure: dynamic speedup vs. r (BCEs) for n=256, with curves for F=0.999, 0.99, 0.975, 0.9 and 0.5]
• F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)
• Note: #Cores is always N=256
Manufacturing Technology
• New manufacturing technology will not save us
The Future…
Summary
• Multicores and manycores are required due to the diminishing returns of large cores
• Amdahl’s Law allows us to predict the performance of various architectures
• Dynamic (composed) architecture is promising
• To take advantage of future CPUs, the parallel fraction of the code must be very high
• …and still we are going to have a problem
References
• Amdahl’s Law in the Multicore Era [Hill’08]
• Thousand Core Chips: A Technology Perspective [Borkar’07]
• Dark Silicon and the End of Multicore Scaling [Esmaeilzadeh’11]
• Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors [Morad’05]