Advanced Microarchitecture
Lecture 17: Multi-This, Multi-That, …
Dec 19, 2015
Limits on IPC
• Lam92
  – This paper focused on the impact of control flow on ILP
  – Speculative execution can expose 10–400 IPC
    • assumes no machine limitations except for control dependencies and actual dataflow dependencies
• Wall91
  – This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1–2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4–45 IPC
    • Perfect bpred, register renaming and memory disambiguation: 7–60 IPC
  – This paper did not consider “control independent” instructions
Practical Limits
• Today, 1–2 IPC sustained
  – far from the 10’s–100’s reported by limit studies
• Limited by:
  – branch prediction accuracy
  – underlying DFG
    • influenced by algorithms, compiler
  – memory bottleneck
  – design complexity
    • implementation, test, validation, manufacturing, etc.
  – power
  – die area
Differences Between Real Hardware and Limit Studies?
• Real branch predictors aren’t 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  – can’t have infinite register renaming without an infinite PRF
  – would need an infinite-entry ROB, RS, and LSQ
  – would need 10’s–100’s of execution units for 10’s–100’s of IPC
• Bandwidths/latencies are limited
  – studies assumed single-cycle execution
  – infinite fetch/commit bandwidth
  – infinite memory bandwidth (perfect caching)
Bridging the Gap
[Figure: IPC on a log scale (1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), Superscalar Out-of-Order (Hypothetical-Aggressive), and the Limits. There are diminishing returns w.r.t. larger instruction windows and higher issue-width, and power (Watts) has been growing exponentially as well.]
Past the Knee of the Curve?
[Figure: Performance vs. “Effort” curve through Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO. It made sense to go Superscalar/OOO: good ROI. Beyond that, very little gain for substantial effort.]
So how do we get more Performance?
• Keep pushing IPC and/or frequency?
  – possible, but too costly
    • design complexity (time to market), cooling (cost), power delivery (cost), etc.
• Look for other parallelism
  – ILP/IPC: fine-grained parallelism
  – Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible
User Visible/Invisible
• All microarchitecture performance gains up to this point were “free”
  – free in that no user intervention was required beyond buying the new processor/system
    • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
• Multi-processing pushes the problem of finding the parallelism to above the ISA interface
Workload Benefits
[Figure: runtime comparison for two tasks, A and B. A 4-wide OOO CPU finishes A then B somewhat sooner than a 3-wide OOO CPU does (a modest benefit); two 3-wide OOO CPUs run A and B in parallel and finish much sooner; even two 2-wide OOO CPUs beat the single wider core. This assumes you have two tasks/programs to execute…]
… If Only One Task
[Figure: runtime comparison for a single task A. A 4-wide OOO CPU gives some benefit over a 3-wide OOO CPU. With two 3-wide OOO CPUs, the second sits idle: no benefit over 1 CPU. With two 2-wide OOO CPUs, the narrower core yields a performance degradation!]
Sources of (Coarse) Parallelism
• Different applications
  – MP3 player in the background while you work on Office
  – Other background tasks: OS/kernel, virus check, etc.
  – Piped applications
    • gunzip -c foo.gz | grep bar | perl some-script.pl
• Within the same application
  – Java (scheduling, GC, etc.)
  – Explicitly coded multi-threading
    • pthreads, MPI, etc. (see the sketch below)
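As a concrete illustration of the explicit multi-threading bullet above, here is a minimal Python sketch (a hypothetical example, not from the slides; note that CPython’s GIL limits true CPU-bound parallelism, so treat this as a sketch of the pthreads-style programming model rather than a benchmark):

```python
# Toy illustration of explicitly coded multi-threading: two independent
# tasks in flight at once -- the coarse-grained parallelism an MP exploits.
import threading

def checksum(data):
    # Stand-in for a compute-bound task (e.g., scanning or compressing).
    total = 0
    for b in data:
        total = (total + b) % 65521
    return total

results = {}

def worker(name, data):
    results[name] = checksum(data)

t1 = threading.Thread(target=worker, args=("taskA", bytes(range(256)) * 100))
t2 = threading.Thread(target=worker, args=("taskB", b"hello world" * 1000))
t1.start(); t2.start()   # both tasks runnable at the same time
t1.join(); t2.join()     # wait for both to finish
print(results)
```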
(Execution) Latency vs. Bandwidth
• Desktop processing
  – typically want an application to execute as quickly as possible (minimize latency)
• Server/Enterprise processing
  – often throughput oriented (maximize bandwidth)
  – latency of an individual task is less important
    • ex. Amazon processing thousands of requests per minute: it’s OK if an individual request takes a few seconds more, so long as the total number of requests is processed in time
Benefit of MP Depends on Workload
• Limited number of parallel tasks to run on a PC
  – adding more CPUs than tasks provides zero performance benefit
• Even for parallel code, Amdahl’s Law will likely result in sub-linear speedup
[Figure: runtime bars for 1 CPU, 2 CPUs, 3 CPUs, and 4 CPUs; only the parallelizable portion shrinks as CPUs are added.]
• In practice, the parallelizable portion may not be evenly divisible
Cache Coherency Protocols
• Not covered in this course
  – You should have seen a bunch of this in CS6290
• Many different protocols
  – different numbers of states
  – different bandwidth/performance/complexity tradeoffs
  – current protocols are usually referred to by their states
    • ex. MESI, MOESI, etc.
Shared Memory Focus
• Most small-to-medium multi-processors (these days) use some sort of shared memory
  – shared memory doesn’t scale as well to larger numbers of nodes
    • communications are broadcast-based
    • the bus becomes a severe bottleneck
  – or you have to deal with directory-based implementations
  – message passing doesn’t need a centralized bus
    • can arrange a multi-processor like a graph
      – nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time
SMP Machines
• SMP = Symmetric Multi-Processing
  – Symmetric = all CPUs are “equal”
  – Equal = any process can run on any CPU
    • contrast with older parallel systems with a master CPU and multiple worker CPUs
[Figure: four CPUs (CPU0–CPU3) in an SMP configuration; pictures found via Google Images.]
Hardware Modifications for SMP
• Processor
  – mainly support for cache coherence protocols
    • includes caches, write buffers, LSQ
    • control complexity increases, as memory latencies may be substantially more variable
• Motherboard
  – multiple sockets (one per CPU)
  – datapaths between CPUs and the memory controller
• Other
  – Case: larger for a bigger mobo, better airflow
  – Power: bigger power supply for N CPUs
  – Cooling: need to remove N CPUs’ worth of heat
Chip-Multiprocessing
• Simple SMP on the same chip
[Figures: Intel “Smithfield” block diagram and AMD dual-core Athlon FX; pictures found via Google Images.]
Shared Caches
• Resources can be shared between CPUs
  – ex. IBM Power 5
[Figure: two CPUs (CPU0, CPU1) with the L2 cache shared between both (no need to keep two copies coherent); the L3 cache is also shared (only tags are on-chip; data are off-chip).]
Benefits?
• Cheaper than mobo-based SMP
  – all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  – less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
• Performance
  – on-chip communication is faster
• Efficiency
  – potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU
Performance vs. Power
• 2x CPUs is not necessarily 2x performance
• 2x CPUs ⇒ ½ power for each
  – maybe a little better than ½ if resources can be shared
• Back-of-the-envelope calculation:
  – 3.8 GHz CPU at 100W
  – Dual-core: 50W per CPU
  – P ∝ V³: V_orig³ / V_CMP³ = 100W / 50W ⇒ V_CMP ≈ 0.8 · V_orig
  – f ∝ V: f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz
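A worked version of this back-of-the-envelope argument (a sketch assuming, as the slide does, that dynamic power scales as P ∝ C·V²·f with f ∝ V, hence P ∝ V³):

```latex
% Back-of-the-envelope: halve per-core power via voltage/frequency scaling.
% Assumes dynamic power P \propto C V^2 f and f \propto V, so P \propto V^3.
\begin{align*}
\frac{V_{\mathrm{orig}}^{3}}{V_{\mathrm{CMP}}^{3}} &= \frac{100\,\mathrm{W}}{50\,\mathrm{W}} = 2
  \quad\Rightarrow\quad V_{\mathrm{CMP}} = 2^{-1/3}\, V_{\mathrm{orig}} \approx 0.8\, V_{\mathrm{orig}} \\
f_{\mathrm{CMP}} &\approx 0.8 \times 3.8\,\mathrm{GHz} \approx 3.0\,\mathrm{GHz}
\end{align*}
```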
Simultaneous Multi-Threading
• Uni-Processor: 4–6 wide, lucky if you get 1–2 IPC
  – poor utilization
• SMP: 2–4 CPUs, but need independent tasks
  – else poor utilization as well
• SMT: the idea is to use a single large uni-processor as a multi-processor
SMT (2)
[Figure: utilization comparison of a regular CPU, a CMP (2x HW cost), and an SMT with 4 threads (approx. 1x HW cost).]
Overview of SMT Hardware Changes
• For an N-way (N threads) SMT, we need:
  – the ability to fetch from N threads
  – N sets of registers (including PCs)
  – N rename tables (RATs)
  – N virtual memory spaces
• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
SMT Fetch
• Duplicate fetch logic
  [Figure: N PCs (PC0, PC1, PC2), each with its own fetch unit, all reading the shared I$ and feeding Decode, Rename, Dispatch, and the RS.]
• Cycle-multiplexed fetch logic (see the sketch below)
  [Figure: N PCs (PC0, PC1, PC2) multiplexed by (cycle % N) into a single fetch unit reading the I$, then Decode, etc., and the RS.]
• Alternatives
  – other-multiplexed fetch logic
  – duplicate the I$ as well
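To make the cycle-multiplexed option concrete, here is a minimal Python sketch (a behavioral model under the slide’s assumptions, not RTL; the helper names are made up for illustration):

```python
# Cycle-multiplexed SMT fetch: one shared fetch port, round-robin
# arbitration among thread PCs by (cycle % N).
N_THREADS = 3
pcs = [0x1000, 0x2000, 0x3000]          # one PC per hardware thread

def icache_fetch(pc):
    # Stand-in for an I$ access: return a fake "instruction" at pc.
    return f"insn@{pc:#x}"

def fetch_stage(cycle):
    tid = cycle % N_THREADS              # round-robin thread select
    insn = icache_fetch(pcs[tid])
    pcs[tid] += 4                        # sequential next-PC (no branches)
    return tid, insn

for cycle in range(6):
    tid, insn = fetch_stage(cycle)
    print(f"cycle {cycle}: thread {tid} fetched {insn}")
```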
SMT Rename
• Thread #1’s R12 != Thread #2’s R12
  – separate name spaces
  – need to disambiguate
[Figure: two options. (a) Separate tables: RAT0 indexed by Thread0’s register # and RAT1 indexed by Thread1’s register #, both mapping into the PRF. (b) A single RAT indexed by the concatenation of Thread-ID and register #, mapping into the PRF. Option (b) is sketched below.]
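A minimal sketch of option (b), a single RAT indexed by concat(thread-ID, register #). The structure is hypothetical (a real RAT is an SRAM/CAM structure, not a dictionary), but it shows why the two threads’ architectural registers never collide:

```python
# Single RAT indexed by the concatenation of thread-ID and arch reg #.
NUM_ARCH_REGS = 32

def rat_index(tid, arch_reg):
    # concat(thread-id, register #): tid supplies the upper index bits
    return (tid << 5) | arch_reg        # 5 bits cover 32 arch regs

rat = {}                                # rat_index -> physical register tag
free_list = list(range(64))             # physical register free list

def rename_dest(tid, arch_reg):
    preg = free_list.pop(0)             # allocate a fresh physical reg
    rat[rat_index(tid, arch_reg)] = preg
    return preg

# Thread 0's R12 and Thread 1's R12 map to distinct physical registers:
print(rename_dest(0, 12), rename_dest(1, 12))   # -> 0 1
```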
SMT Issue, Exec, Bypass, …
• No change needed

Before renaming:
  Thread 0:                 Thread 1:
    Add  R1 = R2 + R3         Add  R1 = R2 + R3
    Sub  R4 = R1 – R5         Sub  R4 = R1 – R5
    Xor  R3 = R1 ^ R4         Xor  R3 = R1 ^ R4
    Load R2 = 0[R3]           Load R2 = 0[R3]

After renaming:
  Thread 0:                 Thread 1:
    Add  T12 = T20 + T8       Add  T17 = T29 + T3
    Sub  T19 = T12 – T16      Sub  T5  = T17 – T2
    Xor  T14 = T12 ^ T19      Xor  T31 = T17 ^ T5
    Load T23 = 0[T14]         Load T25 = 0[T31]

• Both threads’ renamed instructions occupy shared RS entries; after renaming, there are no register conflicts between threads
SMT Cache
• Each process has its own virtual address space
  – the TLB must be thread-aware
    • translate (thread-id, virtual page) → physical page
  – the virtual portion of caches must also be thread-aware
    • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged (see the sketch below)
    • similar for a VIPT cache
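A toy behavioral sketch of a (virtual addr, thread-id)-indexed, (virtual addr, thread-id)-tagged VIVT lookup (hypothetical model; the index hash is made up for illustration):

```python
# Thread-aware VIVT cache: both the index and the tag incorporate the
# thread-id, so the same virtual address from two threads never aliases.
NUM_SETS = 64
BLOCK_BITS = 6                              # 64-byte blocks

def vivt_slot(tid, vaddr):
    block = vaddr >> BLOCK_BITS
    index = (block + tid * 17) % NUM_SETS   # made-up (vaddr, tid) index hash
    tag = (tid, block)                      # (thread-id, virtual addr) tag
    return index, tag

def vivt_fill(cache, tid, vaddr, data):
    index, tag = vivt_slot(tid, vaddr)
    cache[(index, tag)] = data

def vivt_lookup(cache, tid, vaddr):
    index, tag = vivt_slot(tid, vaddr)
    return cache.get((index, tag))          # None on a miss

cache = {}
vivt_fill(cache, 0, 0x4000, b"thread0 data")
print(vivt_lookup(cache, 0, 0x4000))        # hit: b'thread0 data'
print(vivt_lookup(cache, 1, 0x4000))        # miss (None): different thread
```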
SMT Commit
• One “commit PC” per thread
• Register file management
  – ARF/PRF organization
    • need one ARF per thread
  – Unified PRF
    • need one “architected RAT” per thread
• Need to handle interrupts, exceptions, and faults on a per-thread basis
  – just as an OOO core needs to appear to the outside world as if it is in-order, an SMT core needs to appear as if it is actually N CPUs
SMT Design Space
• Number of threads
• Full-SMT vs. hard-partitioned SMT
  – full-SMT: ROB entries can be allocated arbitrarily between the threads
  – hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
• Amount of duplication
  – Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
  – There’s a continuum of possibilities between SMT and CMP
    • ex. could have a CMP where the FP unit is shared SMT-style
SMT Performance
• When it works, it fills idle “issue slots” with work from other threads; throughput improves
• But sometimes it can cause performance degradation!
  – Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
• Cache thrashing
[Figure: Thread0 just fits in the Level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates. After a context switch, Thread1 also fits nicely in the caches. Under SMT, the caches were just big enough to hold one thread’s data, but not two threads’ worth: both threads now have significantly higher cache miss rates (spilling into the L2).]
Fairness
• Consider two programs
  – By themselves:
    • Program A: runtime = 10 seconds
    • Program B: runtime = 10 seconds
  – On SMT:
    • Program A: runtime = 14 seconds
    • Program B: runtime = 18 seconds
• Standard Deviation of Speedups (lower = better)
  – A’s speedup: 10/14 = 0.71
  – B’s speedup: 10/18 = 0.56
  – SDS = 0.11
Fairness (2)
• SDS encourages everyone to be punished similarly
  – does not account for actual performance, so if everyone is 1000x slower, it’s still “fair”
• Alternative: Harmonic Mean of Weighted IPCs (HMWIPC); see the sketch below
  – IPC_i = achieved IPC for thread i
  – SingleIPC_i = IPC when thread i runs alone
  – HMWIPC = N / (SingleIPC_1/IPC_1 + SingleIPC_2/IPC_2 + … + SingleIPC_N/IPC_N)
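Both metrics are easy to compute. A minimal Python sketch, using the runtime numbers from the previous slide (the IPC inputs to HMWIPC below are made-up illustrative values):

```python
# Two SMT fairness metrics: SDS (std. dev. of speedups, lower = better)
# and HMWIPC (harmonic mean of weighted IPCs).
from statistics import stdev

def sds(single_times, smt_times):
    # Per-thread speedups relative to running alone, then their std. dev.
    speedups = [s / m for s, m in zip(single_times, smt_times)]
    return stdev(speedups)

def hmwipc(single_ipcs, smt_ipcs):
    # HMWIPC = N / sum(SingleIPC_i / IPC_i)
    n = len(smt_ipcs)
    return n / sum(s / a for s, a in zip(single_ipcs, smt_ipcs))

print(round(sds([10, 10], [14, 18]), 2))        # 0.11, as on the slide
# Hypothetical IPC numbers, purely for illustration:
print(round(hmwipc([2.0, 2.0], [1.4, 1.1]), 2))
```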
This is all combinable
• Can have a system that supports SMP, CMP, and SMT at the same time
• Take a dual-socket SMP motherboard…
• Insert two chips, each a dual-core CMP…
• Where each core supports two-way SMT
• This example provides 8 threads’ worth of execution, shared on 4 actual “cores”, split across two physical packages
OS Confusion
• SMT/CMP is supposed to look like multiple CPUs to the software/OS
[Figure: two cores (either SMP or CMP), each with 2-way SMT, appear to the OS as four virtual CPUs (CPU0–CPU3). Say the OS has two tasks to run, A and B: a naive scheduler may place A and B on the two SMT contexts of the same core (A/B on one core, the other core idle). Performance is worse than if SMT were turned off and the system were used as a 2-way SMP only.]
OS Confusion (2)
• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  – need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
  – Distinct applications should be scheduled to physically different CPUs
    • no cache contention, no power contention
  – Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
    • reduces latency of inter-thread communication, possibly reduces duplication if a shared L2 is used
  – Use SMT as the last choice (see the sketch below)
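A minimal sketch of this placement policy (a hypothetical toy scheduler, not a real OS implementation): prefer an idle package, then an idle core, and use an SMT sibling only as a last resort:

```python
# Toy topology-aware placement: distinct tasks go to different packages
# first, then different cores; SMT contexts are used last.
from itertools import product

# (package, core, smt_context): 2 packages x 1 core each x 2-way SMT
cpus = list(product(range(2), range(1), range(2)))

def pick_cpu(free, used):
    def cost(cpu):
        core_shared = any(u[:2] == cpu[:2] for u in used)   # SMT sibling busy
        pkg_shared  = any(u[0]  == cpu[0]  for u in used)   # same package busy
        return (core_shared, pkg_shared, cpu)  # avoid SMT sharing above all
    return min(free, key=cost)

free, used = list(cpus), []
for task in ["A", "B", "C"]:
    cpu = pick_cpu(free, used)
    free.remove(cpu); used.append(cpu)
    print(f"task {task} -> package {cpu[0]}, core {cpu[1]}, smt {cpu[2]}")
# A and B land on different packages; only C is forced onto an SMT sibling.
```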
Multi-* is Happening
• Intel Pentium 4 already had “Hyperthreading” (SMT)
  – went away for a while, but is back in Core i7
• IBM Power 5 and later have SMT
• Dual- and quad-core are already here; octo-core soon
  – Intel Core i7: 8 cores, each with 2-thread SMT
• So is single-thread performance dead?
• Is single-thread microarchitecture performance dead?
The following slides are adapted from Mark Hill’s HPCA’08 keynote talk.

Recall Amdahl’s Law
• Begins with a simple software assumption (limit argument)
  – Fraction F of execution time is perfectly parallelizable
  – No overhead for
    • scheduling
    • synchronization
    • communication, etc.
  – Fraction 1 – F is completely serial
• Time on 1 core = (1 – F)/1 + F/1 = 1
• Time on N cores = (1 – F)/1 + F/N
Recall Amdahl’s Law [1967]

  Amdahl’s Speedup = 1 / ((1 – F)/1 + F/N)

• For mainframes, Amdahl expected 1 – F = 35%
  – For a 4-processor system, speedup ≈ 2 (checked in the sketch below)
  – For an infinite-processor system, speedup < 3
  – Therefore, stay with mainframes with one/few processors
• Do multicore chips repeal Amdahl’s Law?
• Answer: No. But.
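A minimal sketch of the formula above, checking Amdahl’s own mainframe numbers (serial fraction 1 – F = 35%):

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f / n)
def amdahl_speedup(f, n):
    # f = parallelizable fraction, n = number of processors
    return 1.0 / ((1.0 - f) + f / n)

f = 0.65
print(round(amdahl_speedup(f, 4), 2))   # ~1.95: about 2 on 4 processors
print(round(1.0 / (1.0 - f), 2))        # 2.86: < 3 even with infinite n
```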
Designing Multicore Chips Is Hard
• Designers must confront single-core design options
  – Instruction fetch, wakeup, select
  – Execution unit configuration & operand bypass
  – Load/store queue(s) & data cache
  – Checkpoint, log, runahead, commit
• As well as additional design degrees of freedom
  – How many cores? How big is each?
  – Shared caches: how many levels? How many banks?
  – Memory interface: how many banks?
  – On-chip interconnect: bus, switched, ordered?
Want a Simple Multicore Hardware Model to Complement Amdahl’s Simple Software Model

(1) Chip hardware roughly partitioned into:
  – Multiple cores (with L1 caches)
  – The rest (L2/L3 cache banks, interconnect, pads, etc.)
  – Changing core size/number does NOT change “the rest”

(2) Resources for multiple cores are bounded:
  – Bound of N resources per chip for cores
  – Due to area, power, cost ($$$), or multiple factors
  – Bound = power? (but our pictures use area)
Want a Simple Multicore Hardware Model, cont.

(3) Micro-architects can improve single-core performance using more of the bounded resource

• A simple base core
  – Consumes 1 Base Core Equivalent (BCE) of resources
  – Provides performance normalized to 1
• An enhanced core (in the same process generation)
  – Consumes R BCEs
  – Performance is a function Perf(R)
• What does the function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R) consuming R BCEs of resources)
• If Perf(R) > R → always enhance the core
  – this cost-effectively speeds up both sequential & parallel code
  – therefore, the equations assume Perf(R) < R
• Graphs assume Perf(R) = √R
  – 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
  – Why? Models diminishing returns with “no coefficients”
• How to speed up the enhanced core?
  – <Insert favorite or TBD micro-architectural ideas here>
How Many (Symmetric) Cores per Chip?
• Each chip is bounded to N BCEs (for all cores)
• Each core consumes R BCEs
• Assume symmetric multicore = all cores identical
• Therefore, N/R cores per chip, since (N/R)·R = N
• For an N = 16 BCE chip: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core
Performance of Symmetric Multicore Chips
• Serial fraction 1 – F uses 1 core at rate Perf(R)
  – Serial time = (1 – F) / Perf(R)
• Parallel fraction F uses N/R cores at rate Perf(R) each
  – Parallel time = F / (Perf(R) · (N/R)) = F·R / (Perf(R)·N)
• Therefore, w.r.t. one base core:

  Symmetric Speedup = 1 / ((1 – F)/Perf(R) + F·R/(Perf(R)·N))

• Implications? See the sketch below.
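A minimal Python sketch of this model (assuming, as the graphs that follow do, Perf(R) = √R); it reproduces the annotated points on those charts:

```python
# Symmetric-multicore speedup model with the slides' Perf(R) = sqrt(R).
from math import sqrt

def perf(r):
    return sqrt(r)                      # slides' assumed Perf(R)

def symmetric_speedup(f, n, r):
    # f = parallel fraction, n = chip budget in BCEs, r = BCEs per core;
    # the chip has N/R cores (the charts round cores down to an integer).
    serial   = (1.0 - f) / perf(r)
    parallel = f * r / (perf(r) * n)    # N/R cores at rate Perf(R) each
    return 1.0 / (serial + parallel)

print(round(symmetric_speedup(0.5,  16, 16), 1))   # 4.0  (1 big core)
print(round(symmetric_speedup(0.9,  16,  2), 1))   # 6.7  (8 cores)
print(round(symmetric_speedup(0.9, 256, 28), 1))   # 26.7 (9 cores)
print(round(symmetric_speedup(0.99, 256, 3), 1))   # 80.2 (85 cores)
```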
Enhanced cores speed up both serial & parallel execution

Symmetric Multicore Chip, N = 16 BCEs
[Graph: Symmetric Speedup vs. R BCEs for F = 0.5, over configurations from sixteen 1-BCE cores down to one 16-BCE core.]
• F = 0.5: optimal at R = 16, Cores = 1, Speedup = 4 = 1/(0.5/4 + 0.5·16/(4·16))
• Need to increase parallelism to make multicore optimal!
Symmetric Multicore Chip, N = 16 BCEs
[Graph: Symmetric Speedup vs. R BCEs for F = 0.9 and F = 0.5.]
• F = 0.5 ⇒ R = 16, Cores = 1, Speedup = 4
• F = 0.9 ⇒ R = 2, Cores = 8, Speedup = 6.7
• At F = 0.9, multicore is optimal, but speedup is limited
• Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs
[Graph: Symmetric Speedup vs. R BCEs for F = 0.999, 0.99, 0.975, 0.9, and 0.5.]
• F → 1 ⇒ R = 1, Cores = 16, Speedup → 16
• F matters: Amdahl’s Law applies to multicore chips
• Researchers should target parallelism F first
Symmetric Multicore Chip, N = 16 BCEs
• As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?
[Graph: the N = 16 speedup curves again for F = 0.999, 0.99, 0.975, 0.9, and 0.5. Recall F = 0.9 ⇒ R = 2, Cores = 8, Speedup = 6.7.]
Symmetric Multicore Chip, N = 256 BCEs
• As Moore’s Law increases N, we often need enhanced core designs; some researchers should target single-core performance
[Graph: Symmetric Speedup vs. R BCEs for F = 0.999, 0.99, 0.975, 0.9, and 0.5 with N = 256.]
• F = 0.9: R = 28 (vs. 2), Cores = 9 (vs. 8), Speedup = 26.7 (vs. 6.7) → CORE ENHANCEMENTS!
• F → 1: R = 1 (vs. 1), Cores = 256 (vs. 16), Speedup = 204 (vs. 16) → MORE CORES!
• F = 0.99: R = 3 (vs. 1), Cores = 85 (vs. 16), Speedup = 80 (vs. 13.9) → CORE ENHANCEMENTS & MORE CORES!