© 2003 IBM Corporation Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group
© 2003 IBM Corporation, Hot Chips 15, August 2003
SMT Implementation in POWER5
Outline
§ Motivation
§ Background
§ Threading Fundamentals
§ Enhanced SMT POWER5 Implementation
§ Additional SMT Considerations
§ Summary
Microprocessor Design Optimization Focus Areas
§ Memory latency
   - Increased processor speeds make memory appear further away
   - Longer stalls possible
§ Branch processing
   - Mispredict more costly as pipeline depth increases, resulting in stalls and wasted power
   - Predication drives increased power and larger chip area
§ Execution unit utilization
   - Currently 20-25% execution unit utilization common
§ Simultaneous multi-threading (SMT) and POWER architecture address these areas
POWER4 --- Shipped in Systems December 2001
§ Technology: 180nm lithography, Cu, SOI
   - POWER4+ shipping in 130nm today
§ Dual processor core
§ 8-way superscalar
   - Out of Order execution
   - 2 Load / Store units
   - 2 Fixed Point units
   - 2 Floating Point units
   - Logical operations on Condition Register
   - Branch Execution unit
§ > 200 instructions in flight
§ Hardware instruction and data prefetch
[Figure: POWER4 chip floorplan. Labeled regions: L3 Directory/Control, three L2 blocks, and two each of LSU, IFU, BXU, IDU, FPU, FXU, ISU.]
POWER5 --- The Next Step
§ Technology: 130nm lithography, Cu, SOI
§ Dual processor core
§ 8-way superscalar
§ Simultaneous multithreaded (SMT) core
   - Up to 2 virtual processors per real processor
   - 24% area growth per core for SMT
   - Natural extension to POWER4 design
Multi-threading Evolution
[Figure: execution unit occupancy under four threading models, one panel each: Single Thread, Coarse Grain Threading, Fine Grain Threading, Simultaneous Multi-Threading. Each panel has rows for the units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL. Legend: Thread 0 Executing, Thread 1 Executing, No Thread Executing.]
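The four panels can be compared with a toy model. The sketch below is an illustration only, not the POWER5 issue logic. It assumes each thread offers a fixed number of ready instructions per cycle (0 when stalled), ignores carry-over of unissued work, and uses hypothetical names (`slots_used`, `utilization`). The default width of 8 mirrors the eight execution units listed in the figure.

```python
def slots_used(model, ready0, ready1, width=8):
    """Return the issue slots filled in each cycle under a toy threading model.

    ready0/ready1: per-cycle count of instructions each thread could issue.
    """
    used = []
    current = 0  # thread that owns the pipeline in coarse-grain mode
    for cycle, (r0, r1) in enumerate(zip(ready0, ready1)):
        ready = (r0, r1)
        if model == "single":
            n = min(r0, width)                 # only thread 0 ever runs
        elif model == "coarse":
            n = min(ready[current], width)
            if n == 0:
                current = 1 - current          # switch after a stalled cycle
        elif model == "fine":
            n = min(ready[cycle % 2], width)   # strict cycle-by-cycle alternation
        elif model == "smt":
            n = min(r0 + r1, width)            # both threads share one cycle
        else:
            raise ValueError(f"unknown model: {model}")
        used.append(n)
    return used


def utilization(model, ready0, ready1, width=8):
    """Fraction of all issue slots that were filled."""
    used = slots_used(model, ready0, ready1, width)
    return sum(used) / (width * len(used))


if __name__ == "__main__":
    t0 = [2, 0, 0, 0]  # thread 0 stalls after the first cycle
    t1 = [1, 3, 3, 1]
    for model in ("single", "coarse", "fine", "smt"):
        print(model, slots_used(model, t0, t1, width=4))
```

In this model only the SMT case lets one thread's instructions fill slots that the other thread leaves empty within the same cycle.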
Changes Going From ST to SMT Core
§ SMT easily added to Superscalar Micro-architecture
   - Second Program Counter (PC) added to share I-fetch bandwidth
   - GPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread)
   - Completion logic replicated to track two threads
   - Thread bit added to most address/tag buses
[Figure: core block diagram. Blocks: Fetch Unit with two PCs, I-Cache, Decode, Register Rename, Integer Issue Qs, FP Issue Qs, BR/CRU Issue Qs, register files (GPRs, FPRs, CR, LR, CTR), execution units (FXUs, LSUs, FPUs, BR and CRL Units), Data Cache.]
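The idea of using a high-order address bit to select the thread in the rename mapper can be sketched as follows. This is a minimal illustration and not the POWER5 design: the names `mapper_index`, `split_index`, and `RenameMapper` are hypothetical, the physical pool size passed in is arbitrary, and the sketch never returns old physical registers to the free list. The only architectural fact used is that the POWER architecture defines 32 GPRs per thread, so a register number fits in 5 bits.

```python
NUM_ARCH_GPRS = 32  # architected GPRs per thread
REG_BITS = 5        # bits needed to address 32 registers


def mapper_index(thread, reg):
    """Form a mapper index with the thread id as the high-order bit."""
    if thread not in (0, 1):
        raise ValueError("thread must be 0 or 1")
    if not 0 <= reg < NUM_ARCH_GPRS:
        raise ValueError("register out of range")
    return (thread << REG_BITS) | reg


def split_index(index):
    """Recover (thread, reg) from a mapper index."""
    return index >> REG_BITS, index & (NUM_ARCH_GPRS - 1)


class RenameMapper:
    """Toy mapper: both threads draw from one shared physical pool."""

    def __init__(self, num_physical):
        # One table entry per (thread, architected register) pair.
        self.table = [None] * (2 * NUM_ARCH_GPRS)
        self.free = list(range(num_physical))

    def rename(self, thread, reg):
        """Allocate a new physical register for a write to (thread, reg)."""
        phys = self.free.pop(0)  # raises IndexError if the pool is empty
        self.table[mapper_index(thread, reg)] = phys
        return phys

    def lookup(self, thread, reg):
        """Return the current physical register for (thread, reg)."""
        return self.table[mapper_index(thread, reg)]
```

Because the thread bit is part of the index, register 3 of thread 0 and register 3 of thread 1 occupy different table entries while sharing the same pool of physical registers.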
Resource Sizes
§ Analysis done to optimize every micro-architectural resource size
   - GPR/FPR rename pool size
   - I-fetch buffers
   - Reservation Station
   - SLB/TLB/ERAT
   - I-cache/D-cache
§ Many Workloads examined
§ Associativity also examined
[Figure: IPC versus Number of GPR Renames (x-axis ticks from 50 to 130), with curves for ST and SMT. Results based on simulation of an online transaction processing application. Vertical axis does not originate at 0.]
Resource Sharing
Results based on simulation of an online transaction processing application
§ Threads share many resources
   - Global Completion Table, BHT, TLB, . . .
§ Higher performance realized when resources balanced across threads
   - Tendency to drift toward extremes accompanied by reduced performance
§ Solution: Dynamically adjust resource utilization
[Figure: Global Completion Table Occupancy. Two panels, "With dynamic resource utilization adjustment" and "Without dynamic resource utilization adjustment". Each panel plots Relative Occurrence (0 to 10) against Thread 0 and Thread 1 occupancy (0 to 20 each).]
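One way to picture dynamic adjustment of a shared resource is a decode arbiter that holds back a thread once it occupies too much of the table. The sketch below is an assumption-laden illustration, not the POWER5 mechanism, which the slides do not describe. The function name `pick_decode_thread`, the `limit` threshold, and the use of 20 as the total table capacity (taken from the 0 to 20 occupancy axes) are all hypothetical choices.

```python
GCT_ENTRIES = 20  # assumed total capacity, read off the occupancy axes


def pick_decode_thread(occupancy, last, limit=12):
    """Choose which thread decodes this cycle.

    occupancy: (entries held by thread 0, entries held by thread 1)
    last:      thread that decoded in the previous cycle
    limit:     hypothetical per-thread occupancy threshold
    Returns 0 or 1, or None when the table is full.
    """
    if sum(occupancy) >= GCT_ENTRIES:
        return None  # no free entry, nobody decodes
    preferred = 1 - last  # alternate threads by default
    other = 1 - preferred
    if occupancy[preferred] > limit and occupancy[other] <= limit:
        return other  # throttle the thread that is hogging the table
    return preferred
```

With balanced occupancy the arbiter simply alternates. When one thread drifts above the threshold, the other thread receives decode cycles until occupancy moves back toward balance.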
Thread Priority
§ Instances when unbalanced execution desirable
   - No work for opposite thread
   - Thread waiting on lock
   - Software determined non-uniform balance
   - Power management
   - …
§ Solution: Control instruction decode rate
   - Software/hardware controls 8 priority levels for each thread
[Figure: Thread 0 IPC and Thread 1 IPC plotted against Thread 1 Priority - Thread 0 Priority (-5 to 5), with additional points at the priority pairs 0,7, 7,0 and 1,1. Annotations: Power Save Mode, Single Thread Mode.]
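The link between priority difference and decode rate can be illustrated with a small calculation. The slide does not give the rule, so the sketch below assumes a power-of-two scheme: in a window of 2^(|difference| + 1) decode cycles, the lower-priority thread gets one cycle and the higher-priority thread gets the rest. The function name `decode_shares` is hypothetical, and the priority pairs the chart marks as special modes (0,7, 7,0 and 1,1) are not treated specially here.

```python
from fractions import Fraction


def decode_shares(prio0, prio1):
    """Return the (thread 0, thread 1) share of decode cycles.

    Assumed rule: the lower-priority thread receives 1 of every
    2 ** (abs(prio0 - prio1) + 1) decode cycles.
    """
    if not (0 <= prio0 <= 7 and 0 <= prio1 <= 7):
        raise ValueError("priority levels run from 0 to 7")
    window = 2 ** (abs(prio0 - prio1) + 1)
    low = Fraction(1, window)
    high = 1 - low
    # Equal priorities give window == 2, so both shares are 1/2.
    return (high, low) if prio0 >= prio1 else (low, high)


if __name__ == "__main__":
    for diff in range(0, 6):
        print(diff, decode_shares(1 + diff, 1))
```

Under this assumed rule a difference of one level already gives the favored thread three quarters of the decode cycles, and each further level halves the other thread's share.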
Dynamic Thread Switching
§ Used if no task ready for second thread to run
§ Allocates all machine resources to one thread
§ Initiated by software
§ Dormant thread wakes up on:
   - External interrupt
   - Decrementer interrupt
   - Special instruction from active thread
[Figure: Thread States. State diagram with the states Active, Dormant and Null. Three transitions are labeled "software" and one is labeled "hardware or software".]
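The thread states and wake-up events can be written as a small state machine. This is a sketch under stated assumptions: the slide names the three states and the wake-up events, but the extracted text does not show which arrow carries which label. The code assumes that Dormant to Active is the "hardware or software" transition and that all other transitions are software-initiated. The event names and the function `next_state` are hypothetical.

```python
from enum import Enum


class ThreadState(Enum):
    ACTIVE = "active"
    DORMANT = "dormant"
    NULL = "null"


# Wake-up sources listed on the slide (event names are made up here).
WAKE_EVENTS = {"external_interrupt", "decrementer_interrupt", "wake_instruction"}


def next_state(state, event):
    """Return the thread state after an event; unknown events change nothing."""
    if state is ThreadState.DORMANT and event in WAKE_EVENTS:
        return ThreadState.ACTIVE    # hardware or software wake-up
    if state is ThreadState.ACTIVE and event == "software_sleep":
        return ThreadState.DORMANT   # software puts the thread to sleep
    if state is not ThreadState.NULL and event == "software_disable":
        return ThreadState.NULL      # assumed software-only transition
    if state is ThreadState.NULL and event == "software_enable":
        return ThreadState.DORMANT   # assumed software-only transition
    return state
```

A thread in the Null state ignores interrupts in this sketch, which matches the idea that only a dormant thread wakes up on the listed events.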
Single Thread Operation
§ Advantageous for execution unit limited applications
   - Floating or fixed point intensive workloads
§ Execution unit limited applications provide minimal performance leverage for SMT
   - Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
§ Determined dynamically on a per processor basis
[Figure: bar chart of Matrix Multiply IPC for POWER4+, POWER5 ST and POWER5 SMT.]
Other SMT Considerations
§ Power Management
   - SMT increases execution unit utilization
   - Dynamic power management does not impact performance
§ Debug tools / Lab bring-up
   - Instruction tracing
   - Hang detection
   - Forward progress monitor
§ Performance Monitoring
§ Serviceability
Summary
§ POWER5 SMT implementation is more than SMT
   - Good ROI for silicon area: Performance gain > Area increase
   - Resource sizes optimized
   - Dynamic feedback enhances instruction throughput
   - Software controlled priority exploits machine architecture
   - Dynamic ST to/from SMT mode capability optimizes system resources
§ SMT impacts are pervasive throughout the chip
§ Operating in laboratory
   - AIX, Linux and OS/400 booted and running