8/2/2019 HE01_Tendler - Power 7 Architecture
1/58
Materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.3 Copyright IBM Corporation 2011
2011
IBM Power Systems Technical University
October 10-14 | Fontainebleau Miami Beach | Miami, FL
Title: An Inside Look at POWER7 Architecture
Session ID: HE01
Joel M. Tendler
IBM Power Systems value proposition
Deliver business value by leveraging technology
Reliability, Performance, Flexibility, Affordability
. . . the highest value at the lowest risk with leading technology
Over 20 Years of POWER Processors
[Figure: processor timeline, 1990 to 2010, with process technology shrinking from 1.0um to 45nm. Processors shown: POWER1 (AMERICA's), RSC, 601, 603, 604e, POWER2, P2SC, POWER3 (630), RS64I Apache (BiCMOS), RS64II North Star, RS64III Pulsar, RS64IV Sstar, Cobra A10 (64 bit), Muskie A35, POWER4 (Dual Core), POWER5 (SMT), POWER6 (Ultra High Frequency), POWER7 (Multi-core), Next Gen.]
Major POWER Innovation
- 1990 RISC Architecture
- 1994 SMP
- 1995 Out of Order Execution
- 1996 64 Bit Enterprise Architecture
- 1997 Hardware Multi-Threading
- 2001 Dual Core Processors
- 2001 Large System Scaling
- 2001 Shared Caches
- 2003 On Chip Memory Control
- 2003 SMT
- 2006 Ultra High Frequency
- 2006 Dual Scope Coherence Mgmt
- 2006 Decimal Float/VSX
- 2006 Processor Recovery/Sparing
- 2009 Balanced Multi-core Processor
- 2009 On Chip EDRAM
* Dates represent approximate processor power-on dates, not system availability
IBM POWER Processor Roadmap
-3 Year Revolution
2001: POWER4/4+ (First Dual Core in Industry)
- Dual Core
- Chip Multi Processing
- Distributed Switch
- Shared L2
- Dynamic LPARs (32)
- 180nm,
2004: POWER5/5+ (Hardware Virtualization for Unix & Linux)
- Dual Core & Quad Core Md
- Micropartitioning
- Enhanced Scaling
- 2 Thread SMT
- Distributed Switch +
- Core Parallelism +
- FP Performance +
- Memory bandwidth +
- 130nm, 90nm
2007: POWER6/6+ (Fastest Processor In Industry)
- Dual Core
- High Frequencies
- Virtualization +
- Live Partition Migration
- Memory Subsystem +
- Altivec
- Instruction Retry
- Alt Proc Recovery
- Dyn Energy Mgmt
- 2 Thread SMT +
- Protection Keys
- 65nm
2010: POWER7 (Most POWERful & Scalable Processor in Industry)
- 4,6,8 Core
- 32MB On-Chip eDRAM
- Power Optimized Cores
- Mem Subsystem ++
- Advanced Memory Expansion
- 4 Thread SMT++
- Reliability +
- VSM & VSX
- Protection Keys+
- 45nm
Future: POWER8
IBM is the leader in Processor and Server design
POWER7 Processor Chip
- 567mm2
- Technology: 45nm lithography, Cu, SOI, eDRAM
- Eight processor cores
- 12 execution units per core
- 4 Way SMT per core
- 32 Threads per chip
- 256KB L2 per core
- 32MB on chip eDRAM shared L3
- 1.2B transistors
- Equivalent function of 2.7B
- eDRAM efficiency
- Dual DDR3 Memory Controllers
- 100GB/s Memory bandwidth per chip sustained
- Scalability up to 32 Sockets
- 360GB/s SMP bandwidth/chip
- 20,000 coherent operations in flight
- Advanced pre-fetching Data and Instruction
- Binary Compatibility with POWER6 and prior systems
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
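The per-chip totals on this slide follow from the per-core figures. The short Python sketch below only redoes that arithmetic; the function and constant names are illustrative and do not come from the presentation.

```python
# Per-core figures quoted on the slide.
CORES_PER_CHIP = 8
SMT_WAYS = 4
L2_KB_PER_CORE = 256
L3_MB_SHARED = 32


def chip_totals(cores=CORES_PER_CHIP, smt=SMT_WAYS):
    """Derive per-chip totals from the per-core numbers."""
    return {
        "threads": cores * smt,                  # hardware threads per chip
        "l2_total_kb": cores * L2_KB_PER_CORE,   # private L2 summed over all cores
        "l3_mb_per_core": L3_MB_SHARED / cores,  # even share of the 32MB L3
    }


if __name__ == "__main__":
    print(chip_totals())
```

With eight cores the even L3 share is 4MB per core, which lines up with the "up to 4M" fast local L3 region described later in the deck.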
POWER7 Simultaneous Multithreading Support: Intelligent Threads
Standard Cache Option
- All cores active
Requires POWER7 Mode
- POWER6 Mode supports single thread (ST) and 2-way simultaneous multithreading (SMT2)
- POWER7 Mode also supports 4-way simultaneous multithreading (SMT4)
Operating System Support
- AIX 6.1 and AIX 7.1
- IBM i 6.1 and 7.1
- Linux
Dynamic Runtime SMT scheduling
- Spread work among cores to execute in appropriate threaded mode
- Can dynamically shift between modes as required: ST / SMT2 / SMT4
LPAR-wide SMT controls
- ST, SMT2, SMT4 modes
[Figure: bar chart comparing ST, SMT2, and SMT4; vertical axis from 0 to 2]
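The mode rules on this slide can be written down as a small lookup. The sketch below is illustrative only: the names are hypothetical, it is not an operating system or hypervisor interface, and it encodes just the two compatibility rules stated above.

```python
# Hardware threads per core in each SMT mode.
THREADS_PER_CORE = {"ST": 1, "SMT2": 2, "SMT4": 4}

# SMT modes available in each processor compatibility mode, per the slide.
SUPPORTED_MODES = {
    "POWER6": ("ST", "SMT2"),
    "POWER7": ("ST", "SMT2", "SMT4"),
}


def hardware_threads(cores, processor_mode, smt_mode):
    """Return the hardware thread count for a partition, or raise if unsupported."""
    if smt_mode not in SUPPORTED_MODES[processor_mode]:
        raise ValueError(f"{smt_mode} is not available in {processor_mode} mode")
    return cores * THREADS_PER_CORE[smt_mode]


if __name__ == "__main__":
    for mode in SUPPORTED_MODES["POWER7"]:
        print(mode, hardware_threads(8, "POWER7", mode))
```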
Computer Systems Memory: DRAM
Dynamic RAM
- Uses MOSFETs and Capacitors
- Charge, i.e., 0 or 1, decays over time (capacitor discharges), losing its state
- To be useful, cell needs periodic refreshing (dynamic)
- Volatile memory (data lost when memory is not powered)
- Each cell stores 1 bit
- Memory cell: 1 x MOSFET + 1 x Capacitor
Comparison with SRAM
- Fewer components, much more dense
- Higher latency, variable access timings
- Charge refresh & amplification complexity
[Figure: 4x4 Dynamic RAM memory. Diagram courtesy of Wikipedia]
Computer Systems Memory: SRAM
Static RAM
- Uses transistors only, e.g., MOSFET
- Charge does not need to be refreshed (static)
- Uses bi-stable latching circuitry
- Volatile memory (data lost when memory is not powered)
- Each cell stores 1 bit
- 6 or more MOSFETs
Comparison with DRAM
- Uses more components per bit
- Lower latency
- Higher frequency access
- Simpler access interface
[Figure: Static RAM memory cell: 6 x MOSFET. Diagram courtesy of Wikipedia]
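The cell sizes above give a rough feel for why a DRAM-based cache needs far fewer transistors than an SRAM one. The sketch below is a back-of-the-envelope estimate only: it counts storage cells for a 32MB array (the L3 size quoted elsewhere in this deck) and ignores tags, ECC, sense amplifiers, and control logic. The presentation does not say how its own "1.2B transistors, equivalent function of 2.7B" figure was derived, so this is not a reconstruction of that number.

```python
SRAM_TRANSISTORS_PER_BIT = 6  # 6 x MOSFET cell
DRAM_TRANSISTORS_PER_BIT = 1  # 1 x MOSFET + 1 x capacitor cell


def cell_transistors(capacity_mb, transistors_per_bit):
    """Transistors needed for the storage cells alone."""
    bits = capacity_mb * 2**20 * 8
    return bits * transistors_per_bit


if __name__ == "__main__":
    sram = cell_transistors(32, SRAM_TRANSISTORS_PER_BIT)
    dram = cell_transistors(32, DRAM_TRANSISTORS_PER_BIT)
    print(f"32MB as SRAM cells: {sram / 1e9:.2f}B transistors")
    print(f"32MB as DRAM cells: {dram / 1e9:.2f}B transistors")
    print(f"Difference:         {(sram - dram) / 1e9:.2f}B transistors")
```

Under these assumptions the difference is a little over 1.3 billion transistors, the same order of magnitude as the gap between the two figures on the chip slide.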
POWER7 Design Principles: Balanced Design
Multiple optimization points
- Improved energy efficiency
- RAS improvements
Improved Thread Performance
- Dynamic allocation of resources
- Shared L3
Increased Core parallelism
- 4 Way SMT
- Aggressive out of order execution
Extreme Increase in Socket Throughput
- Continued growth in socket bandwidth
- Balanced core, cache, memory improvements
System
- Scalable interconnect
- Reduced coherence traffic
[Figure: two charts, "Traditional Performance View" and "Balanced View", comparing POWER6 and POWER7 on a log scale (1 to 10000) at the Thread, Core, Socket, and 32 Chip System levels; labels: Single Thread Performance, System Thruput. Graphs for illustration purposes only (not actual data).]
POWER7 Design Principles: Flexibility and Adaptability
Cores:
- 4, 6, and 8-core offerings with up to 32MB of L3 Cache
- Dynamically turn cores on and off, reallocating energy
- Dynamically vary individual core frequencies, reallocating energy
- Dynamically enable and disable up to 4 threads per core
Memory Subsystem:
- 4 or 8 channel configurations
System Topologies:
- Standard, half-width, and double-width SMP busses supported
Multiple System Packages:
- Power 70x, 710-755: Single Chip Organic; 1 Memory Controller; 3 4B local links
- Power 770-795: Single Chip Glass Ceramic; 2 Memory Controllers; 3 8B local links; 2 8B Remote links
- Compute Intensive: Quad-chip MCM; 8 Memory Controllers; 3 16B local links (on MCM)
POWER7: Core
Execution Units
- 2 Fixed point units
- 2 Load store units
- 4 Double precision floating point
- 1 Vector unit
- 1 Branch
- 1 Condition register
- 1 Decimal floating point unit
6 Wide dispatch / 8 Wide Issue
Recovery Function Distributed
1,2,4 Way SMT Support
Out of Order Execution
32KB I-Cache
32KB D-Cache
256KB L2
- Tightly coupled to core
[Figure: core floorplan showing IFU, CRU/BRU, ISU, DFU, FXU, VSX FPU, LSU, and the 256KB L2]
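As a quick cross-check, the unit counts listed above add up to the "12 execution units per core" quoted on the chip overview slide. The dictionary keys below are illustrative labels, not names taken from IBM documentation.

```python
# Execution units per POWER7 core, as listed on the slide.
EXECUTION_UNITS = {
    "fixed_point": 2,
    "load_store": 2,
    "double_precision_floating_point": 4,
    "vector": 1,
    "branch": 1,
    "condition_register": 1,
    "decimal_floating_point": 1,
}


def total_execution_units(units=EXECUTION_UNITS):
    """Sum the per-type unit counts."""
    return sum(units.values())


if __name__ == "__main__":
    print(total_execution_units())
```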
Computer Systems Memory: Hierarchy
Memory wall
- Differential in performance between addressable memory and the microprocessor
- Implement a memory hierarchy and intelligence to reduce access times/latency
- Combination of memory-cell technology and distance from the processor
[Figure: Typical Memory Hierarchy. A microprocessor integrated circuit (instructions, input data, output data) containing registers, L1 cache (instruction cache and data cache), L2 cache, and L3 cache, followed by the DIMM buffer ASIC and addressable memory. Toward the processor: processor intimacy, smaller capacity. Away from it: increased latency, greater capacity.]
Challenge: Beating Physics to Realize Multi-core Potential
[Figure: POWER7 chip block diagram: eight cores, each with an L2 cache, around the L3 cache and chip interconnect, with two memory controllers, local SMP links, and remote SMP + I/O links]
POWER7 is an 8-core, high performance Server chip. A solid chip is a good start.
But to win the race, you need a balanced system. POWER7 enables that balance.
Challenge: Beating Physics to Realize Multi-core Potential
[Figure: across multi-core evolution, Compute Throughput Potential rises above the Socket Throughput Limitation (physical signal economics)]
Need to amplify effective socket throughput to close the gap and achieve potential.
Trends in Server Evolution
[Figure: server platforms over time, moving from Single Image to Virtualized/Cloud. Enabled by: technology, innovation. Driven by: IT evolution, economics.]
- Traditional Entry Server, Single Image Platform: 2 to 4 socket, 4 to 8-way SMP Server (2-core chips)
- Traditional High-End Server, Virtualized Consolidation Platform: 8 to 32 socket, 16 to 64-way SMP Server
- Emerging Entry Server, Virtualized/Cloud Platform: 2 to 4 socket, 16 to 32-way SMP Server (8-core chips)
Questions raised:
- A simple matter of riding the multi-core trend?
- Add more cores to the die, beef up some interfaces, and scale to a large SMP?
Not so simple:
- Emerging entry servers have characteristics similar to traditional high-end large SMP servers (similar challenge)
- Achieving solid virtual machine performance requires a Balanced System Structure.
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
Trends in Server Evolution
[Figure: server platforms over time, moving from Single Image to Virtualized/Cloud to UltraScale Cloud. Enabled by: technology, innovation. Driven by: IT evolution, economics.]
- Traditional Entry Server, Single Image Platform: 2 to 4 socket, 4 to 8-way SMP Server (2-core chips)
- Traditional High-End Server, Virtualized Consolidation Platform: 8 to 32 socket, 16 to 64-way SMP Server
- Emerging Entry Server, Virtualized/Cloud Platform: 2 to 4 socket, 16 to 32-way SMP Server (8-core chips)
- Emerging High-End Server, UltraScale Cloud Platform: 8 to 32 socket, 64 to 256-way SMP Server
Same enablers and driving factors apply at larger scale.
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
Challenge: How does POWER7 maintain the Balance?
[Figure: across multi-core evolution, Compute Throughput Potential rises above the Socket Throughput Limitation (physical signal economics)]
Need to amplify effective socket throughput to close the gap and achieve potential:
- Cache Hierarchy Technology and Innovation
Cache Hierarchy Technology and Innovation
Cache Hierarchy Requirement for POWER Servers
- Per core: low latency, 2M to 4M per core cache footprint
- Shared: large, shared, 30+ MB cache footprint, much closer than local memory
Challenge for Multi-core POWER7
- POWER4, POWER5, and POWER6 systems derive huge benefit from high bandwidth access to large, off-chip cache.
- But socket pin count constraints prevent scaling the off-chip cache interface to support 8 cores.
Cache Hierarchy Technology and Innovation
Cache Hierarchy Requirement for POWER Servers
- Per core: low latency, 2M to 4M per core cache footprint
- Shared: large, shared, 30+ MB cache footprint, much closer than local memory
Challenge for Multi-core POWER7
- Need to satisfy both caching requirements with one cache.
Cache Hierarchy Technology and Innovation
Cache Hierarchy Requirement for POWER Servers
- Per core: low latency, 2M to 4M per core cache footprint
- Shared: large, shared, 30+ MB cache footprint, much closer than local memory
Challenge for Multi-core POWER7
- IBM Custom eDRAM: dense, low power, lower speed/bandwidth; used for the on-processor 30+ MB cache
- Custom Fast SRAM: high area/power, high speed/bandwidth; used for the private core sub-MB cache
Low power, dense eDRAM value enhanced with low latency, high bandwidth, fast SRAM structures.
Cache Hierarchy Technology and Innovation
Solution: High speed eDRAM on the processor die
Technologies, ordered from dense, low power, low speed/bandwidth to high area/power, high speed/bandwidth:
- Conventional Memory DRAM (off uP chip): Conventional Memory DIMMs
- IBM ASIC eDRAM (off uP chip): Large, Off-chip 30+ MB Cache
- IBM Custom eDRAM (on uP chip): On-processor 30+ MB Cache
- Custom Dense SRAM (on uP chip): On-processor Multi-MB Cache
- Custom Fast SRAM (on uP chip): Private core Sub-MB Cache
Industry Standard Caching and Memory Technologies: Conventional DIMMs, Dense and Fast SRAMs.
IBM's POWER Servers have leveraged large off-chip eDRAM caches in POWER4, 5, and 6.
With POWER7, IBM introduces on-processor, high-speed, custom eDRAM, combining the dense, low power attributes of eDRAM with the speed and bandwidth of SRAM.
L3 Cache Structure
[Figure: POWER7 chip block diagram: eight cores, each with an L2 cache, around the L3 cache and chip interconnect, with two memory controllers, local SMP links, and remote SMP + I/O links]
Hybrid L3 Fluid Cache Structure: Intelligent Cache
[Figure: POWER7 chip block diagram: eight cores, each with an L2 cache, around the L3 cache and chip interconnect, with two memory controllers, local SMP links, and remote SMP + I/O links]
Cache Hierarchy Technology and Innovation
Hybrid L3 Fluid Cache Structure: Intelligent Cache
- Keeps multiple footprints at ~3X lower latency than local memory.
- Automatically migrates private footprints (up to 4M) to fast local region (per core) at ~5X lower latency than full L3 cache.
- Automatically clones shared data to multiple private regions.
[Figure: eight cores over a large, shared 32M L3 cache holding private and shared working set footprints; each core has a fast, local L3 region, and shared data is cloned into several of those regions]
Cache Hierarchy Technology and Innovation
Hybrid L3 Fluid Cache Structure: Intelligent Cache
- Enables a subset of the cores to utilize the entire large shared L3 cache when the remaining cores are not using it.
[Figure: eight cores over the large, shared 32M L3 cache; private footprints fill the fast, local L3 regions and the rest of the shared L3]
Cache Hierarchy Technology and Innovation
[Figure: POWER7 chip block diagram with the fast local L3 region highlighted: eight cores, each with an L2 cache, around the L3 cache and chip interconnect, with two memory controllers, local SMP links, and remote SMP + I/O links]
Cache Hierarchy Technology and Innovation
Solution: L2 Turbo Cache
- L2 Turbo cache keeps a tight 256K working set with extremely low latency (~3X lower than local L3 region) and high bandwidth, reducing L3 power and boosting performance.
[Figure: eight cores, each with an L2 Turbo Cache and a fast, local L3 region, over the large, shared 32M L3 cache holding private, shared, and cloned footprints]
Cache Hierarchy Technology and Innovation
Cache Hierarchy Summary

Cache Level     | Capacity | Array     | Policy         | Comment
L1 Data         | 32K      | Fast SRAM | Store-thru     | Local thread storage update
Private L2      | 256K     | Fast SRAM | Store-In       | De-coupled global storage update
Fast L3 Region  | Up to 4M | eDRAM     | Partial Victim | Reduced power footprint (up to 4M)
Shared L3       | 32M      | eDRAM     | Adaptive       | Large 32M shared footprint

[Figure: eight cores, each with an L2 Turbo Cache and a fast, local L3 region, over the large, shared 32M L3 cache holding private, shared, and cloned footprints]
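The approximate latency ratios quoted on the preceding cache slides can be chained to compare levels. The sketch below assumes that "~NX lower latency" means roughly 1/N of the latency of the slower level, and it normalizes local memory to 1. The deck gives no absolute latencies, so the results are relative figures only, and the level names are illustrative.

```python
from fractions import Fraction

# (faster level, slower level, approximate factor) as quoted on the slides.
LATENCY_RATIOS = [
    ("shared L3", "local memory", 3),
    ("fast local L3 region", "shared L3", 5),
    ("L2", "fast local L3 region", 3),
]


def relative_latencies(ratios=LATENCY_RATIOS):
    """Latency of each level relative to local memory (local memory = 1)."""
    latency = {"local memory": Fraction(1)}
    for faster, slower, factor in ratios:
        latency[faster] = latency[slower] / factor
    return latency


if __name__ == "__main__":
    for level, value in relative_latencies().items():
        print(f"{level}: {value} of local memory latency")
```

Read this way, a hit in the private L2 costs on the order of 1/45 of a local memory access, which is the kind of gap the hierarchy is built to exploit.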
Challenge: How does POWER7 maintain the Balance?
[Figure: across multi-core evolution, Compute Throughput Potential rises above the Socket Throughput Limitation (physical signal economics)]
Need to amplify effective socket throughput to close the gap and achieve potential:
- Cache Hierarchy Technology and Innovation
- Advances in Memory Subsystem
Advances in Memory Subsystem
Memory subsystem requirement for POWER7 processor-based servers
POWER7 Requirements
- Core: 10GB/s to 20GB/s sustained memory bandwidth per core; 16GB to 32GB of memory storage per core
- Socket: 4 times growth in memory bandwidth & capacity
- System: Packaging more memory into similar volume, with similar energy and cooling constraints
[Figure: a core with its requirements: 10-20GB/s sustained bandwidth per core, 16-32GB storage per core, energy constraints]
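Scaling the per-core requirements to a full socket is simple multiplication. The sketch below assumes an eight-core chip, as described earlier in the deck; the function name and parameters are illustrative.

```python
def socket_requirement(cores=8, bandwidth_gb_s=(10, 20), capacity_gb=(16, 32)):
    """Scale the (low, high) per-core ranges up to one socket."""
    return {
        "bandwidth_gb_s": tuple(cores * x for x in bandwidth_gb_s),
        "capacity_gb": tuple(cores * x for x in capacity_gb),
    }


if __name__ == "__main__":
    print(socket_requirement())
```

For eight cores this gives 80 to 160 GB/s of sustained bandwidth and 128 to 256 GB of memory per socket, which is the scale the memory subsystem has to serve.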
Advances in Memory Subsystem
Multi-faceted Solution
1) Dual Integrated DDR3 Controllers
- Massive 16KB scheduling window per POWER7 chip ensures high channel and DIMM utilization
- Sparse access acceleration
- Advanced Energy Management
- Numerous RAS advances
2) Eight high speed 6.4 GHz channels
- New low power differential signaling
- Sustained 100+ GB/s per socket
3) New DDR3 buffer chip architecture
- Larger capacity support (32 GB / core)
- Energy Management support
- RAS enablement
4) DDR3 DRAMs
- Supports 800, 1066, 1333, and 1600
[Figure: POWER7 chip with two memory controllers connected to advanced buffer chips]
Challenge: How does POWER7 maintain the Balance?
[Figure: across multi-core evolution, Compute Throughput Potential rises above the Socket Throughput Limitation (physical signal economics)]
Need to amplify effective socket throughput to close the gap and achieve potential:
- Cache Hierarchy Technology and Innovation
- Advances in Memory Subsystem
- Advances in Off-Chip Signaling Technology
Advances in Off chip Signaling Technology
1) Enhanced Single-ended Elastic Interface Technology
2) New high speed, low power Differential Technology

Interface        | Signal Type  | Info Width | Frequency | Bandwidth
Off-chip Cache   | none         | none       | none      | none
SMP Interconnect | Single-ended | 120 bytes  | 3.0 GHz   | 360 GB/s
I/O Bridge       | Single-ended | 20 bytes   | 2.5 GHz   | 50 GB/s
Memory Channels  | Differential | 28 bytes   | 6.4 GHz   | 180 GB/s
Total Bandwidth  |              |            |           | 590 GB/s
(Note that bandwidths shown are raw, peak signal bandwidths)

Moving L3 onto POWER7 along with advances in signaling technology enables significant raw bandwidth growth for both memory and I/O subsystems. Note that advanced scheduling improves POWER7's ability to utilize memory bandwidth.
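The raw bandwidth figures in the table are consistent with info width multiplied by signaling frequency. The sketch below reproduces that arithmetic; the dictionary and function names are illustrative. The memory channel product is 179.2 GB/s, which the slide appears to round to 180 GB/s before summing to 590 GB/s.

```python
# Interface name -> (info width in bytes, frequency in GHz), from the table.
INTERFACES = {
    "SMP Interconnect": (120, 3.0),
    "I/O Bridge": (20, 2.5),
    "Memory Channels": (28, 6.4),
}


def raw_bandwidth_gb_s(width_bytes, frequency_ghz):
    """Raw peak signal bandwidth: bytes per transfer times transfers per nanosecond."""
    return width_bytes * frequency_ghz


def total_bandwidth_gb_s(interfaces=INTERFACES):
    """Sum of the raw peak bandwidths of all listed interfaces."""
    return sum(raw_bandwidth_gb_s(w, f) for w, f in interfaces.values())


if __name__ == "__main__":
    for name, (width, freq) in INTERFACES.items():
        print(f"{name}: {raw_bandwidth_gb_s(width, freq):.1f} GB/s")
    print(f"Total: {total_bandwidth_gb_s():.1f} GB/s")
```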
Challenge: How does POWER7 maintain the Balance?
[Figure: across multi-core evolution, Compute Throughput Potential rises above the Socket Throughput Limitation (physical signal economics)]
Need to amplify effective socket throughput to close the gap and achieve potential:
- Cache Hierarchy Technology and Innovation
- Advances in Memory Subsystem
- Advances in Off-Chip Signaling Technology
- Exploit Long Term Investment in Coherence Innovation
Exploit Long Term Investment in Coherence Innovation
[Figure: POWER7 chip block diagram: eight cores, each with an L2 cache, around the L3 cache and chip interconnect, with two memory controllers, local SMP links, and remote SMP + I/O links]
Using local and remote SMP links, up to 32 POWER7 chips are connected.
Exploit Long Term Investment in Coherence Innovation
Up to 32 POWER7 chips form a massive SMP system.
* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
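The "64 to 256-way" range quoted in this deck for 8 to 32 sockets follows from eight cores per chip. A minimal sketch of that arithmetic, with illustrative names, and with hardware threads assuming SMT4 on every core:

```python
def smp_size(sockets, cores_per_chip=8, smt_ways=4):
    """Return (cores, hardware threads) for an SMP built from POWER7 chips."""
    cores = sockets * cores_per_chip
    return cores, cores * smt_ways


if __name__ == "__main__":
    for sockets in (8, 32):
        cores, threads = smp_size(sockets)
        print(f"{sockets} sockets: {cores}-way SMP, {threads} hardware threads")
```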
Exploit Long Term Investment in Coherence Innovation
Challenge: As system size grows, Coherence broadcast traffic increases
- POWER6 High-End Server, Virtualized Consolidation Platform: 8 to 32 socket, 16 to 64-way SMP Server; Compute Throughput 1X; Global Coherence Throughput 320GB/s
- POWER7 High-End Server, UltraScale Cloud Platform: 8 to 32 socket, 64 to 256-way SMP Server; Compute Throughput ~5X; Global Coherence Throughput 450GB/s
[Figure: Global Scope Coherence Broadcast spanning the whole system]
Exploit Long Term Investment in Coherence Innovation
Solution: Speculative limited scope Coherence broadcast
- In 2003, recognized emerging trend
- Developed Dual-Scope Broadcast Coherence Protocol for POWER6
- Utilizes 13 cache states and integrated scope indicator in memory
Provides value for POWER6
- Latency reduction
- Near Perfect Scaling for extreme memory intensive workloads
[Figure: POWER6 High-End Server (8 to 32 socket, 16 to 64-way SMP Server, Virtualized Consolidation Platform) and POWER7 High-End Server (8 to 32 socket, 64 to 256-way SMP Server, UltraScale Cloud Platform), showing Global Scope Coherence Broadcast alongside Nodal Scope Speculative Coherence Broadcast]
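The idea of a speculative limited-scope broadcast can be illustrated with a toy cost model. The sketch below is not the POWER6 or POWER7 protocol: it ignores the 13 cache states, stands in for the scope indicator with a simple check of where copies live, and uses a made-up cost unit (nodes snooped). All names are hypothetical.

```python
def resolve_request(requesting_node, nodes_with_copy, num_nodes):
    """Return (final_scope, nodes_snooped) for one request in the toy model."""
    # Speculative first attempt: broadcast only inside the requester's own node.
    snooped = 1
    # Stand-in for the scope indicator: does any copy live outside this node?
    needs_global = any(node != requesting_node for node in nodes_with_copy)
    if not needs_global:
        return "nodal", snooped
    # Speculation failed: fall back to a broadcast across every node.
    return "global", snooped + num_nodes


def average_nodes_snooped(requests, num_nodes):
    """Mean snoop cost over a list of (requesting_node, nodes_with_copy) pairs."""
    costs = [resolve_request(r, c, num_nodes)[1] for r, c in requests]
    return sum(costs) / len(costs)


if __name__ == "__main__":
    # Mostly node-local sharing on an 8-node system.
    workload = [(0, {0}), (1, {1}), (2, set()), (3, {3, 5})]
    print(average_nodes_snooped(workload, num_nodes=8))
```

In this model, workloads whose data stays within a node pay a constant cost regardless of system size, while a purely global scheme would pay for every node on every request.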
Energy Management: Architected Idle Modes
Two Design Points Chosen for Technology
Nap (optimized for wake-up time)
- Turn off clocks to execution units
- Reduce frequency to core
- Caches and TLB remain coherent
- Fast wake-up
Sleep (optimized for power reduction)
- Purge caches and TLB
- Turn off clocks to full core and caches
- Reduce voltage to V-retention
- Leakage current reduced substantially
- Voltage ramps up on wake up
- No core re-initialization required
[Figure: Wake-Up Latency versus Energy Reduction for the 4 PowerPC Architected States: Doze, Nap, Sleep, and RV Winkle (power gate)]
Adaptive Energy Management: EnergyScale(TM)
Chip FO4 Tuned for Optimal Performance/Watt in Technology
DVFS (Dynamic Voltage and Frequency Slewing)
- -50% to +10% frequency slew independent per core
- Frequency and voltage adjusted based on work load and utilization, and on board activity monitors
Turbo-Mode
- Up to 10% frequency boost
- Leverages excess energy capacity from non worst case work loads and idle cores
Processor and Memory Energy Usage can be independently Balanced.
- Real time hardware performance monitors used.
- On board power proxy logic estimates power
Power Capping Support
- Allows budgeting of power to different parts of system
[Figure: SPECPower: Mean System Power per Load Level; AC power plotted against load level (%) from 100 down to 0, with Vmin marked]
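The stated slew range of -50% to +10% amounts to clamping each core's frequency inside a window around its nominal value. The sketch below shows only that clamp; the 4.0 GHz nominal frequency used in the example is a hypothetical value, and the function is not an EnergyScale interface.

```python
def clamp_frequency(nominal_ghz, requested_ghz, low=-0.50, high=0.10):
    """Limit a requested core frequency to the slew window around nominal."""
    floor = nominal_ghz * (1 + low)     # 50% below nominal
    ceiling = nominal_ghz * (1 + high)  # 10% above nominal (turbo headroom)
    return min(max(requested_ghz, floor), ceiling)


if __name__ == "__main__":
    nominal = 4.0  # hypothetical nominal frequency in GHz
    for request in (1.0, 3.0, 5.0):
        print(f"requested {request} GHz -> {clamp_frequency(nominal, request):.2f} GHz")
```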
Power Systems Reliability, Availability, Serviceability (RAS)
OS Downtime Comparison Survey
[Figure: bar chart of downtime in hours (axis 0 to 10) by operating system: Win2000, Win2003, RHEL, SOLARIS, HP-UX, SUSE, AIX; 400 participants in 27 countries]
Source: The Yankee Group 2007-2008 Global Server Operating Systems Reliability Survey, as quoted in "Windows Server: The New King of Downtime" by Mark Joseph Edwards at www.windowsitpro.com/article/articleid/98475/windows-server-the-new-king-of-downtime.html , March 5, 2008, and in http://www.sunbeltsoftware.com/stu/Yankee-Group-2007-2008-Server-Reliability.pdf
ITIC Survey says Power Systems with AIX deliver 99.997% uptime
- 54% of IT executives and managers say that they require 99.99% or better availability for their applications
Power Systems with AIX delivers the best RAS of UNIX, Linux, Windows choices
1. Availability: The least amount of downtime
- 15 minutes a year
- 2.3 times better than the closest UNIX competitor
- more than 10X better than Windows
2. Reliability: The fewest unscheduled outages
- less than one outage per year
3. Serviceability: The fastest patch time
- 11 minutes to apply a patch
Source: Network World, dated July 14, 2009, reports on the 2009 ITIC Global Server Hardware & Server OS Reliability Survey Results
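The uptime percentage and the downtime figure on this slide are two views of the same number. The sketch below converts one to the other, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_minutes_per_year(uptime_percent):
    """Convert an availability percentage to minutes of downtime per year."""
    return (100 - uptime_percent) / 100 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for uptime in (99.99, 99.997):
        minutes = downtime_minutes_per_year(uptime)
        print(f"{uptime}% uptime -> {minutes:.1f} minutes of downtime per year")
```

Under that assumption, 99.997% uptime works out to just under 16 minutes a year, in line with the "15 minutes a year" quoted above, while the 99.99% threshold allows a little under 53 minutes.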
POWER7: Reliability and Availability Features
Fabric Bus Interface to other Chips and Nodes
- ECC protected
- Node hot add / repair
Core Recovery
- Leverage speculative execution resources to enable recovery
- Error detected in GPRs, FPRs, VSR, flushed and retried
- Stacked latches to improve SER
- Alternate Processor Recovery
- Partition isolation for core checkstops
L3 eDRAM
- ECC protected
- SUE handling
- Line delete
- Spare rows and columns
GX IO Bus
- ECC protected
- Hot add
InfiniBand Interface
- Redundant paths
64 Byte ECC on Memory
- Corrects full chip kill on X8 DIMMs
- Spare X8 devices implemented
- Dual memory chip failures do not cause outage
- Selective memory mirror capability to recover partition from DIMM failures
- Hardware assisted scrubbing
- SUE handling
- Dynamic sparing on channel interface
- PowerVM Hypervisor protected from full DIMM failures
Dynamic Oscillator Failover
- OSC0, OSC1
[Figure: POWER7 chip connected to buffer chips (BUF) and X8 DIMMs, the fabric interface, and an IO hub with PCI bridge and PCI adapter]
IBM Power Systems value proposition
Deliver business value by leveraging technology
Reliability, Performance, Flexibility, Affordability
. . . the highest value at the lowest risk with leading technology
Session Evals in 3 Easy Steps
- Go to: http://ibmtechu.com/up1
- Select Register button and complete the form (One time only)
- Select Session Evals button and complete the online form
Session Code: HE01
Stay Connected & Continue Skills Transfer via IBM Training
- Training paths: What to take, when to take it
- Social media: Join the conversation
- Custom catalog: Create a catalog that meets your interest areas
- RSS feeds: Up-to-date information on the training you need
- IBM Training News: Targeted to your needs
- New to Instructor Led Online (ILO)? Take a free test drive!
- Education Packs: Online discount program for ALL IBM Training courses for your company
Questions? Email Lisa Ryan ([email protected])