Advanced Microarchitecture
Lecture 17: Multi-This, Multi-That, …
Dec 19, 2015
Limits on IPC
• Lam92
  – This paper focused on the impact of control flow on ILP
  – Speculative execution can expose 10–400 IPC
    • assumes no machine limitations except for control dependencies and actual dataflow dependencies
• Wall91
  – This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1–2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4–45 IPC
    • Perfect bpred, register renaming and memory disambiguation: 7–60 IPC
  – This paper did not consider “control independent” instructions
Practical Limits
• Today, 1–2 IPC sustained
  – far from the 10’s–100’s reported by limit studies
• Limited by:
  – branch prediction accuracy
  – underlying DFG
    • influenced by algorithms, compiler
  – memory bottleneck
  – design complexity
    • implementation, test, validation, manufacturing, etc.
  – power
  – die area
Differences Between Real Hardware and Limit Studies?
• Real branch predictors aren’t 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  – can’t have infinite register renaming without an infinite PRF
  – would need an infinite-entry ROB, RS, and LSQ
  – would need 10’s–100’s of execution units for 10’s–100’s of IPC
• Bandwidths/latencies are limited
  – studies assumed single-cycle execution
  – infinite fetch/commit bandwidth
  – infinite memory bandwidth (perfect caching)
Bridging the Gap
[Figure: IPC on a log scale (1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), Superscalar Out-of-Order (Hypothetical-Aggressive), and the Limits. There are diminishing returns w.r.t. larger instruction windows and higher issue-width, and power (Watts) has been growing exponentially as well.]
Past the Knee of the Curve?
[Figure: Performance vs. “Effort” curve through Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO. It made sense to go Superscalar/OOO: good ROI. Beyond that, very little gain for substantial effort.]
So how do we get more Performance?
• Keep pushing IPC and/or frequency?
  – possible, but too costly
    • design complexity (time to market), cooling (cost), power delivery (cost), etc.
• Look for other parallelism
  – ILP/IPC: fine-grained parallelism
  – Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible
User Visible/Invisible
• All microarchitecture performance gains up to this point were “free”
  – free in that no user intervention was required beyond buying the new processor/system
    • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
• Multi-processing pushes the problem of finding the parallelism to above the ISA interface
Workload Benefits
[Figure: runtime comparison for two tasks, A and B. A 4-wide OOO CPU finishes A then B somewhat sooner than a 3-wide OOO CPU does (a modest benefit); two 3-wide OOO CPUs run A and B in parallel and finish much sooner; even two 2-wide OOO CPUs beat the single wider core. This assumes you have two tasks/programs to execute…]
… If Only One Task
[Figure: runtime comparison for a single task A. A 4-wide OOO CPU gives some benefit over a 3-wide OOO CPU. With two 3-wide OOO CPUs, the second sits idle: no benefit over 1 CPU. With two 2-wide OOO CPUs, the narrower core yields a performance degradation!]
Sources of (Coarse) Parallelism
• Different applications
  – MP3 player in the background while you work on Office
  – Other background tasks: OS/kernel, virus check, etc.
  – Piped applications
    • gunzip -c foo.gz | grep bar | perl some-script.pl
• Within the same application
  – Java (scheduling, GC, etc.)
  – Explicitly coded multi-threading
    • pthreads, MPI, etc. (see the sketch below)
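As a concrete illustration of the explicit multi-threading bullet above, here is a minimal Python sketch (a hypothetical example, not from the slides; note that CPython’s GIL limits true CPU-bound parallelism, so treat this as a sketch of the pthreads-style programming model rather than a benchmark):

```python
# Toy illustration of explicitly coded multi-threading: two independent
# tasks in flight at once -- the coarse-grained parallelism an MP exploits.
import threading

def checksum(data):
    # Stand-in for a compute-bound task (e.g., scanning or compressing).
    total = 0
    for b in data:
        total = (total + b) % 65521
    return total

results = {}

def worker(name, data):
    results[name] = checksum(data)

t1 = threading.Thread(target=worker, args=("taskA", bytes(range(256)) * 100))
t2 = threading.Thread(target=worker, args=("taskB", b"hello world" * 1000))
t1.start(); t2.start()   # both tasks runnable at the same time
t1.join(); t2.join()     # wait for both to finish
print(results)
```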
(Execution) Latency vs. Bandwidth
• Desktop processing
  – typically want an application to execute as quickly as possible (minimize latency)
• Server/Enterprise processing
  – often throughput oriented (maximize bandwidth)
  – latency of an individual task is less important
    • ex. Amazon processing thousands of requests per minute: it’s OK if an individual request takes a few seconds more, so long as the total number of requests is processed in time
Benefit of MP Depends on Workload
• Limited number of parallel tasks to run on a PC
  – adding more CPUs than tasks provides zero performance benefit
• Even for parallel code, Amdahl’s Law will likely result in sub-linear speedup
[Figure: runtime bars for 1 CPU, 2 CPUs, 3 CPUs, and 4 CPUs; only the parallelizable portion shrinks as CPUs are added.]
• In practice, the parallelizable portion may not be evenly divisible
Cache Coherency Protocols
• Not covered in this course
  – You should have seen a bunch of this in CS6290
• Many different protocols
  – different numbers of states
  – different bandwidth/performance/complexity tradeoffs
  – current protocols are usually referred to by their states
    • ex. MESI, MOESI, etc.
Shared Memory Focus
• Most small-to-medium multi-processors (these days) use some sort of shared memory
  – shared memory doesn’t scale as well to larger numbers of nodes
    • communications are broadcast-based
    • the bus becomes a severe bottleneck
  – or you have to deal with directory-based implementations
  – message passing doesn’t need a centralized bus
    • can arrange a multi-processor like a graph
      – nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time
SMP Machines
• SMP = Symmetric Multi-Processing
  – Symmetric = all CPUs are “equal”
  – Equal = any process can run on any CPU
    • contrast with older parallel systems with a master CPU and multiple worker CPUs
[Figure: four CPUs (CPU0–CPU3) in an SMP configuration; pictures found via Google Images.]
Hardware Modifications for SMP
• Processor
  – mainly support for cache coherence protocols
    • includes caches, write buffers, LSQ
    • control complexity increases, as memory latencies may be substantially more variable
• Motherboard
  – multiple sockets (one per CPU)
  – datapaths between CPUs and the memory controller
• Other
  – Case: larger for a bigger mobo, better airflow
  – Power: bigger power supply for N CPUs
  – Cooling: need to remove N CPUs’ worth of heat
Chip-Multiprocessing
• Simple SMP on the same chip
[Figures: Intel “Smithfield” block diagram and AMD dual-core Athlon FX; pictures found via Google Images.]
Shared Caches
• Resources can be shared between CPUs
  – ex. IBM Power 5
[Figure: two CPUs (CPU0, CPU1) with the L2 cache shared between both (no need to keep two copies coherent); the L3 cache is also shared (only tags are on-chip; data are off-chip).]
Benefits?
• Cheaper than mobo-based SMP
  – all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  – less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
• Performance
  – on-chip communication is faster
• Efficiency
  – potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU
Performance vs. Power
• 2x CPUs is not necessarily 2x performance
• 2x CPUs ⇒ ½ power for each
  – maybe a little better than ½ if resources can be shared
• Back-of-the-envelope calculation:
  – 3.8 GHz CPU at 100W
  – Dual-core: 50W per CPU
  – P ∝ V³: V_orig³ / V_CMP³ = 100W / 50W ⇒ V_CMP ≈ 0.8 · V_orig
  – f ∝ V: f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz
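A worked version of this back-of-the-envelope argument (a sketch assuming, as the slide does, that dynamic power scales as P ∝ C·V²·f with f ∝ V, hence P ∝ V³):

```latex
% Back-of-the-envelope: halve per-core power via voltage/frequency scaling.
% Assumes dynamic power P \propto C V^2 f and f \propto V, so P \propto V^3.
\begin{align*}
\frac{V_{\mathrm{orig}}^{3}}{V_{\mathrm{CMP}}^{3}} &= \frac{100\,\mathrm{W}}{50\,\mathrm{W}} = 2
  \quad\Rightarrow\quad V_{\mathrm{CMP}} = 2^{-1/3}\, V_{\mathrm{orig}} \approx 0.8\, V_{\mathrm{orig}} \\
f_{\mathrm{CMP}} &\approx 0.8 \times 3.8\,\mathrm{GHz} \approx 3.0\,\mathrm{GHz}
\end{align*}
```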
Simultaneous Multi-Threading
• Uni-Processor: 4–6 wide, lucky if you get 1–2 IPC
  – poor utilization
• SMP: 2–4 CPUs, but need independent tasks
  – else poor utilization as well
• SMT: the idea is to use a single large uni-processor as a multi-processor
SMT (2)
[Figure: utilization comparison of a regular CPU, a CMP (2x HW cost), and an SMT with 4 threads (approx. 1x HW cost).]
Overview of SMT Hardware Changes
• For an N-way (N threads) SMT, we need:
  – the ability to fetch from N threads
  – N sets of registers (including PCs)
  – N rename tables (RATs)
  – N virtual memory spaces
• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
SMT Fetch
• Duplicate fetch logic
  [Figure: N PCs (PC0, PC1, PC2), each with its own fetch unit, all reading the shared I$ and feeding Decode, Rename, Dispatch, and the RS.]
• Cycle-multiplexed fetch logic (see the sketch below)
  [Figure: N PCs (PC0, PC1, PC2) multiplexed by (cycle % N) into a single fetch unit reading the I$, then Decode, etc., and the RS.]
• Alternatives
  – other-multiplexed fetch logic
  – duplicate the I$ as well
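To make the cycle-multiplexed option concrete, here is a minimal Python sketch (a behavioral model under the slide’s assumptions, not RTL; the helper names are made up for illustration):

```python
# Cycle-multiplexed SMT fetch: one shared fetch port, round-robin
# arbitration among thread PCs by (cycle % N).
N_THREADS = 3
pcs = [0x1000, 0x2000, 0x3000]          # one PC per hardware thread

def icache_fetch(pc):
    # Stand-in for an I$ access: return a fake "instruction" at pc.
    return f"insn@{pc:#x}"

def fetch_stage(cycle):
    tid = cycle % N_THREADS              # round-robin thread select
    insn = icache_fetch(pcs[tid])
    pcs[tid] += 4                        # sequential next-PC (no branches)
    return tid, insn

for cycle in range(6):
    tid, insn = fetch_stage(cycle)
    print(f"cycle {cycle}: thread {tid} fetched {insn}")
```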
SMT Rename
• Thread #1’s R12 != Thread #2’s R12
  – separate name spaces
  – need to disambiguate
[Figure: two options. (a) Separate tables: RAT0 indexed by Thread0’s register # and RAT1 indexed by Thread1’s register #, both mapping into the PRF. (b) A single RAT indexed by the concatenation of Thread-ID and register #, mapping into the PRF. Option (b) is sketched below.]
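A minimal sketch of option (b), a single RAT indexed by concat(thread-ID, register #). The structure is hypothetical (a real RAT is an SRAM/CAM structure, not a dictionary), but it shows why the two threads’ architectural registers never collide:

```python
# Single RAT indexed by the concatenation of thread-ID and arch reg #.
NUM_ARCH_REGS = 32

def rat_index(tid, arch_reg):
    # concat(thread-id, register #): tid supplies the upper index bits
    return (tid << 5) | arch_reg        # 5 bits cover 32 arch regs

rat = {}                                # rat_index -> physical register tag
free_list = list(range(64))             # physical register free list

def rename_dest(tid, arch_reg):
    preg = free_list.pop(0)             # allocate a fresh physical reg
    rat[rat_index(tid, arch_reg)] = preg
    return preg

# Thread 0's R12 and Thread 1's R12 map to distinct physical registers:
print(rename_dest(0, 12), rename_dest(1, 12))   # -> 0 1
```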
SMT Issue, Exec, Bypass, …
• No change needed

Before renaming:
  Thread 0:                 Thread 1:
    Add  R1 = R2 + R3         Add  R1 = R2 + R3
    Sub  R4 = R1 – R5         Sub  R4 = R1 – R5
    Xor  R3 = R1 ^ R4         Xor  R3 = R1 ^ R4
    Load R2 = 0[R3]           Load R2 = 0[R3]

After renaming:
  Thread 0:                 Thread 1:
    Add  T12 = T20 + T8       Add  T17 = T29 + T3
    Sub  T19 = T12 – T16      Sub  T5  = T17 – T2
    Xor  T14 = T12 ^ T19      Xor  T31 = T17 ^ T5
    Load T23 = 0[T14]         Load T25 = 0[T31]

• Both threads’ renamed instructions occupy shared RS entries; after renaming, there are no register conflicts between threads
SMT Cache
• Each process has its own virtual address space
  – the TLB must be thread-aware
    • translate (thread-id, virtual page) → physical page
  – the virtual portion of caches must also be thread-aware
    • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged (see the sketch below)
    • similar for a VIPT cache
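A toy behavioral sketch of a (virtual addr, thread-id)-indexed, (virtual addr, thread-id)-tagged VIVT lookup (hypothetical model; the index hash is made up for illustration):

```python
# Thread-aware VIVT cache: both the index and the tag incorporate the
# thread-id, so the same virtual address from two threads never aliases.
NUM_SETS = 64
BLOCK_BITS = 6                              # 64-byte blocks

def vivt_slot(tid, vaddr):
    block = vaddr >> BLOCK_BITS
    index = (block + tid * 17) % NUM_SETS   # made-up (vaddr, tid) index hash
    tag = (tid, block)                      # (thread-id, virtual addr) tag
    return index, tag

def vivt_fill(cache, tid, vaddr, data):
    index, tag = vivt_slot(tid, vaddr)
    cache[(index, tag)] = data

def vivt_lookup(cache, tid, vaddr):
    index, tag = vivt_slot(tid, vaddr)
    return cache.get((index, tag))          # None on a miss

cache = {}
vivt_fill(cache, 0, 0x4000, b"thread0 data")
print(vivt_lookup(cache, 0, 0x4000))        # hit: b'thread0 data'
print(vivt_lookup(cache, 1, 0x4000))        # miss (None): different thread
```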
SMT Commit
• One “commit PC” per thread
• Register file management
  – ARF/PRF organization
    • need one ARF per thread
  – Unified PRF
    • need one “architected RAT” per thread
• Need to handle interrupts, exceptions, and faults on a per-thread basis
  – just as an OOO core needs to appear to the outside world as if it is in-order, an SMT core needs to appear as if it is actually N CPUs
SMT Design Space
• Number of threads
• Full-SMT vs. hard-partitioned SMT
  – full-SMT: ROB entries can be allocated arbitrarily between the threads
  – hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
• Amount of duplication
  – Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
  – There’s a continuum of possibilities between SMT and CMP
    • ex. could have a CMP where the FP unit is shared SMT-style
SMT Performance
• When it works, it fills idle “issue slots” with work from other threads; throughput improves
• But sometimes it can cause performance degradation!
  – Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
• Cache thrashing
[Figure: Thread0 just fits in the Level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates. After a context switch, Thread1 also fits nicely in the caches. Under SMT, the caches were just big enough to hold one thread’s data, but not two threads’ worth: both threads now have significantly higher cache miss rates (spilling into the L2).]
Fairness
• Consider two programs
  – By themselves:
    • Program A: runtime = 10 seconds
    • Program B: runtime = 10 seconds
  – On SMT:
    • Program A: runtime = 14 seconds
    • Program B: runtime = 18 seconds
• Standard Deviation of Speedups (lower = better)
  – A’s speedup: 10/14 = 0.71
  – B’s speedup: 10/18 = 0.56
  – SDS = 0.11
Fairness (2)
• SDS encourages everyone to be punished similarly
  – does not account for actual performance, so if everyone is 1000x slower, it’s still “fair”
• Alternative: Harmonic Mean of Weighted IPCs (HMWIPC); see the sketch below
  – IPC_i = achieved IPC for thread i
  – SingleIPC_i = IPC when thread i runs alone
  – HMWIPC = N / (SingleIPC_1/IPC_1 + SingleIPC_2/IPC_2 + … + SingleIPC_N/IPC_N)
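Both metrics are easy to compute. A minimal Python sketch, using the runtime numbers from the previous slide (the IPC inputs to HMWIPC below are made-up illustrative values):

```python
# Two SMT fairness metrics: SDS (std. dev. of speedups, lower = better)
# and HMWIPC (harmonic mean of weighted IPCs).
from statistics import stdev

def sds(single_times, smt_times):
    # Per-thread speedups relative to running alone, then their std. dev.
    speedups = [s / m for s, m in zip(single_times, smt_times)]
    return stdev(speedups)

def hmwipc(single_ipcs, smt_ipcs):
    # HMWIPC = N / sum(SingleIPC_i / IPC_i)
    n = len(smt_ipcs)
    return n / sum(s / a for s, a in zip(single_ipcs, smt_ipcs))

print(round(sds([10, 10], [14, 18]), 2))        # 0.11, as on the slide
# Hypothetical IPC numbers, purely for illustration:
print(round(hmwipc([2.0, 2.0], [1.4, 1.1]), 2))
```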
This is all combinable
• Can have a system that supports SMP, CMP, and SMT at the same time
• Take a dual-socket SMP motherboard…
• Insert two chips, each a dual-core CMP…
• Where each core supports two-way SMT
• This example provides 8 threads’ worth of execution, shared on 4 actual “cores”, split across two physical packages
OS Confusion
• SMT/CMP is supposed to look like multiple CPUs to the software/OS
[Figure: two cores (either SMP or CMP), each with 2-way SMT, appear to the OS as four virtual CPUs (CPU0–CPU3). Say the OS has two tasks to run, A and B: a naive scheduler may place A and B on the two SMT contexts of the same core (A/B on one core, the other core idle). Performance is worse than if SMT were turned off and the system were used as a 2-way SMP only.]
OS Confusion (2)
• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  – need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
  – Distinct applications should be scheduled to physically different CPUs
    • no cache contention, no power contention
  – Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
    • reduces latency of inter-thread communication, possibly reduces duplication if a shared L2 is used
  – Use SMT as the last choice (see the sketch below)
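A minimal sketch of this placement policy (a hypothetical toy scheduler, not a real OS implementation): prefer an idle package, then an idle core, and use an SMT sibling only as a last resort:

```python
# Toy topology-aware placement: distinct tasks go to different packages
# first, then different cores; SMT contexts are used last.
from itertools import product

# (package, core, smt_context): 2 packages x 1 core each x 2-way SMT
cpus = list(product(range(2), range(1), range(2)))

def pick_cpu(free, used):
    def cost(cpu):
        core_shared = any(u[:2] == cpu[:2] for u in used)   # SMT sibling busy
        pkg_shared  = any(u[0]  == cpu[0]  for u in used)   # same package busy
        return (core_shared, pkg_shared, cpu)  # avoid SMT sharing above all
    return min(free, key=cost)

free, used = list(cpus), []
for task in ["A", "B", "C"]:
    cpu = pick_cpu(free, used)
    free.remove(cpu); used.append(cpu)
    print(f"task {task} -> package {cpu[0]}, core {cpu[1]}, smt {cpu[2]}")
# A and B land on different packages; only C is forced onto an SMT sibling.
```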
Multi-* is Happening
• Intel Pentium 4 already had “Hyperthreading” (SMT)
  – went away for a while, but is back in Core i7
• IBM Power 5 and later have SMT
• Dual- and quad-core are already here; octo-core soon
  – Intel Core i7: 8 cores, each with 2-thread SMT
• So is single-thread performance dead?
• Is single-thread microarchitecture performance dead?
The following slides are adapted from Mark Hill’s HPCA’08 keynote talk.

Recall Amdahl’s Law
• Begins with a simple software assumption (limit argument)
  – Fraction F of execution time is perfectly parallelizable
  – No overhead for
    • scheduling
    • synchronization
    • communication, etc.
  – Fraction 1 – F is completely serial
• Time on 1 core = (1 – F)/1 + F/1 = 1
• Time on N cores = (1 – F)/1 + F/N
Recall Amdahl’s Law [1967]

  Amdahl’s Speedup = 1 / ((1 – F)/1 + F/N)

• For mainframes, Amdahl expected 1 – F = 35%
  – For a 4-processor system, speedup ≈ 2 (checked in the sketch below)
  – For an infinite-processor system, speedup < 3
  – Therefore, stay with mainframes with one/few processors
• Do multicore chips repeal Amdahl’s Law?
• Answer: No. But.
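A minimal sketch of the formula above, checking Amdahl’s own mainframe numbers (serial fraction 1 – F = 35%):

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f / n)
def amdahl_speedup(f, n):
    # f = parallelizable fraction, n = number of processors
    return 1.0 / ((1.0 - f) + f / n)

f = 0.65
print(round(amdahl_speedup(f, 4), 2))   # ~1.95: about 2 on 4 processors
print(round(1.0 / (1.0 - f), 2))        # 2.86: < 3 even with infinite n
```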
Designing Multicore Chips Is Hard
• Designers must confront single-core design options
  – Instruction fetch, wakeup, select
  – Execution unit configuration & operand bypass
  – Load/store queue(s) & data cache
  – Checkpoint, log, runahead, commit
• As well as additional design degrees of freedom
  – How many cores? How big is each?
  – Shared caches: how many levels? How many banks?
  – Memory interface: how many banks?
  – On-chip interconnect: bus, switched, ordered?
Want a Simple Multicore Hardware Model to Complement Amdahl’s Simple Software Model

(1) Chip hardware roughly partitioned into:
  – Multiple cores (with L1 caches)
  – The rest (L2/L3 cache banks, interconnect, pads, etc.)
  – Changing core size/number does NOT change “the rest”

(2) Resources for multiple cores are bounded:
  – Bound of N resources per chip for cores
  – Due to area, power, cost ($$$), or multiple factors
  – Bound = power? (but our pictures use area)
Want a Simple Multicore Hardware Model, cont.

(3) Micro-architects can improve single-core performance using more of the bounded resource

• A simple base core
  – Consumes 1 Base Core Equivalent (BCE) of resources
  – Provides performance normalized to 1
• An enhanced core (in the same process generation)
  – Consumes R BCEs
  – Performance is a function Perf(R)
• What does the function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R) consuming R BCEs of resources)
• If Perf(R) > R → always enhance the core
  – this cost-effectively speeds up both sequential & parallel code
  – therefore, the equations assume Perf(R) < R
• Graphs assume Perf(R) = √R
  – 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
  – Why? Models diminishing returns with “no coefficients”
• How to speed up the enhanced core?
  – <Insert favorite or TBD micro-architectural ideas here>
How Many (Symmetric) Cores per Chip?
• Each chip is bounded to N BCEs (for all cores)
• Each core consumes R BCEs
• Assume symmetric multicore = all cores identical
• Therefore, N/R cores per chip, since (N/R)·R = N
• For an N = 16 BCE chip: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core
Performance of Symmetric Multicore Chips
• Serial fraction 1 – F uses 1 core at rate Perf(R)
  – Serial time = (1 – F) / Perf(R)
• Parallel fraction F uses N/R cores at rate Perf(R) each
  – Parallel time = F / (Perf(R) · (N/R)) = F·R / (Perf(R)·N)
• Therefore, w.r.t. one base core:

  Symmetric Speedup = 1 / ((1 – F)/Perf(R) + F·R/(Perf(R)·N))

• Implications? See the sketch below.
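A minimal Python sketch of this model (assuming, as the graphs that follow do, Perf(R) = √R); it reproduces the annotated points on those charts:

```python
# Symmetric-multicore speedup model with the slides' Perf(R) = sqrt(R).
from math import sqrt

def perf(r):
    return sqrt(r)                      # slides' assumed Perf(R)

def symmetric_speedup(f, n, r):
    # f = parallel fraction, n = chip budget in BCEs, r = BCEs per core;
    # the chip has N/R cores (the charts round cores down to an integer).
    serial   = (1.0 - f) / perf(r)
    parallel = f * r / (perf(r) * n)    # N/R cores at rate Perf(R) each
    return 1.0 / (serial + parallel)

print(round(symmetric_speedup(0.5,  16, 16), 1))   # 4.0  (1 big core)
print(round(symmetric_speedup(0.9,  16,  2), 1))   # 6.7  (8 cores)
print(round(symmetric_speedup(0.9, 256, 28), 1))   # 26.7 (9 cores)
print(round(symmetric_speedup(0.99, 256, 3), 1))   # 80.2 (85 cores)
```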
Enhanced cores speed up both serial & parallel execution

Symmetric Multicore Chip, N = 16 BCEs
[Graph: Symmetric Speedup vs. R BCEs for F = 0.5, over configurations from sixteen 1-BCE cores down to one 16-BCE core.]
• F = 0.5: optimal at R = 16, Cores = 1, Speedup = 4 = 1/(0.5/4 + 0.5·16/(4·16))
• Need to increase parallelism to make multicore optimal!
Symmetric Multicore Chip, N = 16 BCEs
[Graph: Symmetric Speedup vs. R BCEs for F = 0.9 and F = 0.5.]
• F = 0.5 ⇒ R = 16, Cores = 1, Speedup = 4
• F = 0.9 ⇒ R = 2, Cores = 8, Speedup = 6.7
• At F = 0.9, multicore is optimal, but speedup is limited
• Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs
[Graph: Symmetric Speedup vs. R BCEs for F = 0.999, 0.99, 0.975, 0.9, and 0.5.]
• F → 1 ⇒ R = 1, Cores = 16, Speedup → 16
• F matters: Amdahl’s Law applies to multicore chips
• Researchers should target parallelism F first
Symmetric Multicore Chip, N = 16 BCEs
• As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?
[Graph: the N = 16 speedup curves again for F = 0.999, 0.99, 0.975, 0.9, and 0.5. Recall F = 0.9 ⇒ R = 2, Cores = 8, Speedup = 6.7.]
Symmetric Multicore Chip, N = 256 BCEs
• As Moore’s Law increases N, we often need enhanced core designs; some researchers should target single-core performance
[Graph: Symmetric Speedup vs. R BCEs for F = 0.999, 0.99, 0.975, 0.9, and 0.5 with N = 256.]
• F = 0.9: R = 28 (vs. 2), Cores = 9 (vs. 8), Speedup = 26.7 (vs. 6.7) → CORE ENHANCEMENTS!
• F → 1: R = 1 (vs. 1), Cores = 256 (vs. 16), Speedup = 204 (vs. 16) → MORE CORES!
• F = 0.99: R = 3 (vs. 1), Cores = 85 (vs. 16), Speedup = 80 (vs. 13.9) → CORE ENHANCEMENTS & MORE CORES!