Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM Corporation

Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor

Ron Kalla, Balaram Sinharoy, Joel Tendler
IBM Systems Group

Hot Chips 15, August 2003

SMT Implementation in POWER5

Outline

- Motivation
- Background
- Threading Fundamentals
- Enhanced SMT POWER5 Implementation
- Additional SMT Considerations
- Summary



Microprocessor Design Optimization Focus Areas

- Memory latency
  - Increased processor speeds make memory appear further away
  - Longer stalls possible
- Branch processing
  - Mispredict more costly as pipeline depth increases, resulting in stalls and wasted power
  - Predication drives increased power and larger chip area
- Execution unit utilization
  - Currently 20-25% execution unit utilization common
- Simultaneous multi-threading (SMT) and POWER architecture address these areas



POWER4 --- Shipped in Systems December 2001

- Technology: 180nm lithography, Cu, SOI
  - POWER4+ shipping in 130nm today
- Dual processor core
- 8-way superscalar
  - Out of Order execution
  - 2 Load / Store units
  - 2 Fixed Point units
  - 2 Floating Point units
  - Logical operations on Condition Register
  - Branch Execution unit
- > 200 instructions in flight
- Hardware instruction and data prefetch

[Figure: POWER4 chip floorplan. Labeled blocks: L3 Directory/Control, three L2 slices, and per core IFU, IDU, ISU, BXU, FXU, FPU, LSU.]



POWER5 --- The Next Step

- Technology: 130nm lithography, Cu, SOI
- Dual processor core
- 8-way superscalar
- Simultaneous multithreaded (SMT) core
  - Up to 2 virtual processors per real processor
  - 24% area growth per core for SMT
  - Natural extension to POWER4 design



Multi-threading Evolution

[Figure: execution unit occupancy over time (units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) in four panels: Single Thread, Coarse Grain Threading, Fine Grain Threading, Simultaneous Multi-Threading. Legend: Thread 0 Executing, Thread 1 Executing, No Thread Executing.]
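The four threading schemes on this slide can be compared with a toy issue-slot model. This is a minimal sketch, not IBM's simulator: the per-cycle "ready instruction" traces `T0` and `T1`, the zero-cost thread switch, and all function names are assumptions made for illustration. Only the eight unit names come from the slide.

```python
# Toy model: each trace entry is how many instructions a thread has ready
# in that cycle. All traces and helper names are illustrative assumptions.
WIDTH = 8  # issue slots per cycle: FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL

T0 = [2, 0, 0, 3, 1, 0]  # hypothetical thread 0 trace (0 = stalled)
T1 = [1, 2, 0, 0, 4, 2]  # hypothetical thread 1 trace


def single_thread(t0, width=WIDTH):
    """Only thread 0 runs; stalled cycles issue nothing."""
    return [min(r, width) for r in t0]


def coarse_grain(t0, t1, width=WIDTH):
    """Stay on one thread until it stalls, then switch (switch cost ignored)."""
    issued, current = [], 0
    for ready in zip(t0, t1):
        if ready[current] == 0:
            current = 1 - current
        issued.append(min(ready[current], width))
    return issued


def fine_grain(t0, t1, width=WIDTH):
    """Alternate threads every cycle, whether or not the chosen one is ready."""
    return [min(ready[c % 2], width) for c, ready in enumerate(zip(t0, t1))]


def smt(t0, t1, width=WIDTH):
    """Both threads may issue in the same cycle, sharing the issue slots."""
    return [min(r0 + r1, width) for r0, r1 in zip(t0, t1)]


def utilization(issued, width=WIDTH):
    """Fraction of issue slots that were filled over the whole trace."""
    return sum(issued) / (width * len(issued))


if __name__ == "__main__":
    for name, issued in [
        ("single thread", single_thread(T0)),
        ("coarse grain", coarse_grain(T0, T1)),
        ("fine grain", fine_grain(T0, T1)),
        ("SMT", smt(T0, T1)),
    ]:
        print(f"{name}: {utilization(issued):.3f}")
```

In this model SMT can never issue fewer instructions in a cycle than either single-threaded choice, which is the qualitative point of the figure. The absolute numbers depend entirely on the made-up traces.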



Changes Going From ST to SMT Core

- SMT easily added to superscalar micro-architecture
  - Second Program Counter (PC) added to share I-fetch bandwidth
  - GPR/FPR rename mapper expanded to map second set of registers (high-order address bit indicates thread)
  - Completion logic replicated to track two threads
  - Thread bit added to most address/tag buses

[Figure: pipeline block diagram. Two PCs feed the Fetch Unit and I-Cache, followed by Decode, Register Rename, the Integer, FP, and BR/CRU Issue Queues, the register files (GPRs, FPRs, CR/LR/CTR), the execution units (FXUs, LSUs, FPUs, BR and CRL units), and the Data Cache.]
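The rename mapper change ("high-order address bit indicates thread") can be sketched as a table indexed by thread bit plus architected register number. This is a minimal illustration under assumptions: the class and function names are hypothetical, the free-list policy is invented, and the pool size of 120 physical registers is only an illustrative default. The 32 architected GPRs per thread follow the POWER ISA.

```python
ARCH_REGS = 32     # architected GPRs per thread in the POWER ISA
THREAD_SHIFT = 5   # log2(ARCH_REGS): the thread id sits above the register bits


def mapper_index(thread: int, reg: int) -> int:
    """Form the mapper address: high-order bit = thread, low bits = register."""
    if thread not in (0, 1) or not 0 <= reg < ARCH_REGS:
        raise ValueError("thread must be 0 or 1 and reg in 0..31")
    return (thread << THREAD_SHIFT) | reg


class RenameMapper:
    """Hypothetical shared rename mapper for two threads (illustration only)."""

    def __init__(self, num_physical: int = 120):  # pool size is an assumption
        entries = 2 * ARCH_REGS
        if num_physical < entries:
            raise ValueError("need at least one physical register per entry")
        # Start with every architected register mapped to its own physical one.
        self.table = list(range(entries))
        self.free = list(range(entries, num_physical))

    def rename_dest(self, thread: int, reg: int):
        """Map a destination register to a fresh physical register.

        Returns (new_physical, old_physical); the old one is released by the
        caller once the writing instruction completes.
        """
        if not self.free:
            raise RuntimeError("rename pool exhausted; dispatch would stall")
        idx = mapper_index(thread, reg)
        old = self.table[idx]
        new = self.free.pop(0)
        self.table[idx] = new
        return new, old

    def lookup_src(self, thread: int, reg: int) -> int:
        """Find the physical register currently holding a source operand."""
        return self.table[mapper_index(thread, reg)]

    def release(self, phys: int) -> None:
        """Return a physical register to the shared free pool."""
        self.free.append(phys)
```

Because both threads draw from one free pool, a write by thread 1 to GPR 3 never disturbs thread 0's mapping for GPR 3; only the table index differs by the thread bit.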



Resource Sizes

- Analysis done to optimize every micro-architectural resource size
  - GPR/FPR rename pool size
  - I-fetch buffers
  - Reservation Station
  - SLB/TLB/ERAT
  - I-cache/D-cache
- Many workloads examined
- Associativity also examined

[Figure: IPC versus Number of GPR Renames for ST and SMT. Results based on simulation of an online transaction processing application. Vertical axis does not originate at 0.]



Resource Sharing

Results based on simulation of an online transaction processing application

- Threads share many resources
  - Global Completion Table, BHT, TLB, ...
- Higher performance realized when resources balanced across threads
  - Tendency to drift toward extremes accompanied by reduced performance
- Solution: Dynamically adjust resource utilization

[Figure: Global Completion Table Occupancy. Two panels, with and without dynamic resource utilization adjustment, each plotting Relative Occurrence against the number of entries held by Thread 0 and by Thread 1 (0 to 20 each).]
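The drift toward extremes, and the effect of a dynamic limit, can be reproduced in a toy occupancy model. The slides do not describe the actual hardware mechanism, so everything here is an assumption: the per-thread entry limit, the retirement periods, the alternating allocation order, and the function name `simulate`. The table size of 20 is taken from the axes of the occupancy figure.

```python
GCT_ENTRIES = 20  # assumed table size, matching the 0-20 axes of the figure


def simulate(cycles, retire_period, limit=None):
    """Toy model of two threads sharing a completion table.

    retire_period[t]: thread t frees one entry every that many cycles.
    limit: if set, a thread holding `limit` or more entries is held back
           from allocating (a stand-in for dynamic resource adjustment).
    Returns (final occupancy per thread, total entries allocated per thread).
    """
    occupancy = [0, 0]
    allocated = [0, 0]
    for cycle in range(1, cycles + 1):
        # Retirement: a thread with long-latency work frees entries rarely.
        for t in (0, 1):
            if occupancy[t] > 0 and cycle % retire_period[t] == 0:
                occupancy[t] -= 1
        # Allocation: alternate which thread gets first pick of free entries.
        order = (0, 1) if cycle % 2 else (1, 0)
        for t in order:
            throttled = limit is not None and occupancy[t] >= limit
            if not throttled and sum(occupancy) < GCT_ENTRIES:
                occupancy[t] += 1
                allocated[t] += 1
    return occupancy, allocated


if __name__ == "__main__":
    # Thread 0 retires slowly (every 8 cycles); thread 1 retires every cycle.
    print("no limit:", simulate(200, (8, 1)))
    print("limit 12:", simulate(200, (8, 1), limit=12))
```

Without a limit, the slow thread ends up holding nearly the whole table and the fast thread is starved of entries. With the limit, the fast thread allocates every cycle while the slow thread loses nothing it could have used productively in this model.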



Thread Priority

- Instances when unbalanced execution desirable
  - No work for opposite thread
  - Thread waiting on lock
  - Software determined non-uniform balance
  - Power management
  - ...
- Solution: Control instruction decode rate
  - Software/hardware controls 8 priority levels for each thread

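[Figure: Thread 0 IPC and Thread 1 IPC versus priority difference (Thread 1 Priority - Thread 0 Priority: -5, -3, -1, 0, 1, 3, 5), with additional points at priority pairs 0,7, 7,0, and 1,1 and annotations "Single Thread Mode" and "Power Save Mode".]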
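Controlling the decode rate by priority can be illustrated with a simple share calculation. The slide only says that decode rate is controlled through 8 priority levels per thread; the power-of-two rule below (the lower-priority thread gets 1 of every 2^(|difference|+1) decode cycles) is an assumption chosen for illustration, and `decode_shares` is a hypothetical name. The special pairs shown in the figure (0,7, 7,0, and 1,1) are not modeled.

```python
from fractions import Fraction


def decode_shares(p0: int, p1: int):
    """Return the fraction of decode cycles given to (thread 0, thread 1).

    Assumed rule, not taken from the slides: with priority difference d,
    the lower-priority thread gets 1 of every 2**(d + 1) decode cycles and
    the higher-priority thread gets the rest.
    """
    for p in (p0, p1):
        if not 1 <= p <= 7:
            raise ValueError("this sketch only models priorities 1..7")
    period = 2 ** (abs(p0 - p1) + 1)
    low = Fraction(1, period)
    high = 1 - low
    if p0 >= p1:
        return high, low  # equal priorities give 1/2 each
    return low, high


if __name__ == "__main__":
    for pair in [(4, 4), (5, 4), (6, 4), (2, 7)]:
        share0, share1 = decode_shares(*pair)
        print(pair, share0, share1)
```

Under this rule a small priority difference already shifts most decode bandwidth to one thread, which is consistent in shape with the IPC curves diverging as the priority difference grows.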



Dynamic Thread Switching

- Used if no task ready for second thread to run
- Allocates all machine resources to one thread
- Initiated by software
- Dormant thread wakes up on:
  - External interrupt
  - Decrementer interrupt
  - Special instruction from active thread

[Figure: Thread States. State diagram with states Active, Dormant, and Null; transitions are labeled "software" (three of them) and "hardware or software" (one).]
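The Active/Dormant behavior listed on this slide can be written as a small state machine. This is a sketch under assumptions: the event names are hypothetical, and the Null state is left out because the extracted text does not say which states its transitions connect. The wake-up events follow the slide's list.

```python
from enum import Enum, auto


class State(Enum):
    ACTIVE = auto()
    DORMANT = auto()


class Event(Enum):
    SOFTWARE_SLEEP = auto()         # software requests the thread go dormant
    EXTERNAL_INTERRUPT = auto()     # hardware wake-up
    DECREMENTER_INTERRUPT = auto()  # hardware wake-up
    WAKE_INSTRUCTION = auto()       # special instruction from the active thread


# Only transitions described on the slide; any other pair leaves state as is.
TRANSITIONS = {
    (State.ACTIVE, Event.SOFTWARE_SLEEP): State.DORMANT,
    (State.DORMANT, Event.EXTERNAL_INTERRUPT): State.ACTIVE,
    (State.DORMANT, Event.DECREMENTER_INTERRUPT): State.ACTIVE,
    (State.DORMANT, Event.WAKE_INSTRUCTION): State.ACTIVE,
}


def step(state: State, event: Event) -> State:
    """Apply one event; events that do not apply in this state are ignored."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    s = State.ACTIVE
    for e in (Event.SOFTWARE_SLEEP, Event.DECREMENTER_INTERRUPT):
        s = step(s, e)
        print(e.name, "->", s.name)
```

Going dormant is software initiated only, while waking can come from hardware (interrupts) or software (the special instruction), matching the "hardware or software" label in the diagram.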



Single Thread Operation

- Advantageous for execution unit limited applications
  - Floating or fixed point intensive workloads
- Execution unit limited applications provide minimal performance leverage for SMT
  - Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
- Determined dynamically on a per processor basis

[Figure: Matrix Multiply IPC for POWER4+, POWER5 ST, and POWER5 SMT.]



Other SMT Considerations

- Power Management
  - SMT increases execution unit utilization
  - Dynamic power management does not impact performance
- Debug tools / Lab bring-up
  - Instruction tracing
  - Hang detection
  - Forward progress monitor
- Performance Monitoring
- Serviceability



Summary

- POWER5 SMT implementation is more than SMT
  - Good ROI for silicon area: Performance gain > Area increase
  - Resource sizes optimized
  - Dynamic feedback enhances instruction throughput
  - Software controlled priority exploits machine architecture
  - Dynamic ST to/from SMT mode capability optimizes system resources
- SMT impact is pervasive throughout the chip
- Operating in laboratory
  - AIX, Linux and OS/400 booted and running
