HE01_Tendler - Power 7 Architecture

Apr 05, 2018

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    1/58

    Materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.3 Copyright IBM Corporation 2011

    IBM Power Systems Technical University, October 10-14, 2011 | Fontainebleau Miami Beach | Miami, FL

    Title: An Inside Look at POWER7 Architecture

    Session ID: HE01

    Joel M. Tendler

    [email protected]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    2/58

    Deliver business value by leveraging technology

    IBM Power Systems value proposition

    Reliability, Performance, Flexibility, Affordability

    +

    . . . the highest value at the lowest risk with leading technology

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    3/58

    Over 20 Years of POWER Processors

    [Figure: processor timeline, 1990 to 2010, annotated with process nodes from 1.0um down to 45nm. Processors shown: POWER1 (AMERICA's), RSC, POWER2, P2SC, POWER3 (630), POWER4 (Dual Core), POWER5 (SMT), POWER6 (Ultra High Frequency), POWER7 (Multi-core), Next Gen.; 601, 603, 604e; Cobra A10 (64 bit), Muskie A35; RS64I Apache (BiCMOS), RS64II North Star, RS64III Pulsar, RS64IV Sstar.]

    Major POWER Innovation
    - 1990 RISC Architecture
    - 1994 SMP
    - 1995 Out of Order Execution
    - 1996 64 Bit Enterprise Architecture
    - 1997 Hardware Multi-Threading
    - 2001 Dual Core Processors
    - 2001 Large System Scaling
    - 2001 Shared Caches
    - 2003 On Chip Memory Control
    - 2003 SMT
    - 2006 Ultra High Frequency
    - 2006 Dual Scope Coherence Mgmt
    - 2006 Decimal Float/VSX
    - 2006 Processor Recovery/Sparing
    - 2009 Balanced Multi-core Processor
    - 2009 On Chip EDRAM

    * Dates represent approximate processor power-on dates, not system availability

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    4/58

    IBM POWER Processor Roadmap

    - 3 Year Revolution

    2001: POWER4/4+ (First Dual Core in Industry)
    Dual Core; Chip Multi Processing; Distributed Switch; Shared L2; Dynamic LPARs (32); 180nm,

    2004: POWER5/5+ (Hardware Virtualization for Unix & Linux)
    Dual Core & Quad Core Md; Micropartitioning; Enhanced Scaling; 2 Thread SMT; Distributed Switch +; Core Parallelism +; FP Performance +; Memory bandwidth +; 130nm, 90nm

    2007: POWER6/6+ (Fastest Processor In Industry)
    Dual Core; High Frequencies; Virtualization +; Live Partition Migration; Memory Subsystem +; Altivec; Instruction Retry; Alt Proc Recovery; Dyn Energy Mgmt; 2 Thread SMT +; Protection Keys; 65nm

    2010: POWER7 (Most POWERful & Scalable Processor in Industry)
    4,6,8 Core; 32MB On-Chip eDRAM; Power Optimized Cores; Mem Subsystem ++; Advanced Memory Expansion; 4 Thread SMT++; Reliability +; VSM & VSX; Protection Keys+; 45nm

    Future: POWER8

    IBM is the leader in Processor and Server design

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    5/58

    POWER7 Processor Chip

    567mm2

    Technology: 45nm lithography, Cu, SOI, eDRAM

    Eight processor cores

    12 execution units per core

    4 Way SMT per core

    32 Threads per chip

    256KB L2 per core

    32MB on chip eDRAM shared L3

    1.2B transistors

    Equivalent function of 2.7B

    eDRAM efficiency

    Dual DDR3 Memory Controllers

    100GB/s Memory bandwidth per chip sustained

    Scalability up to 32 Sockets

    360GB/s SMP bandwidth/chip

    20,000 coherent operations in flight

    Advanced pre-fetching Data and Instruction

    Binary Compatibility with POWER6 and prior systems

    * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    6/58

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    7/58

    POWER7 Simultaneous Multithreading Support: Intelligent Threads

    Standard Cache Option

    All cores active

    Requires POWER7 Mode

    POWER6 Mode supports single thread (ST) and 2-way simultaneous multithreading (SMT2)

    POWER7 Mode also supports 4-way simultaneous multithreading (SMT4)

    Operating System Support

    AIX 6.1 and AIX 7.1

    IBM i 6.1 and 7.1

    Linux

    Dynamic Runtime SMT scheduling

    Spread work among cores to execute in appropriate threaded mode

    Can dynamically shift between modes as required: ST / SMT2 / SMT4

    LPAR-wide SMT controls

    ST, SMT2, SMT4 modes

    [Chart: bars for ST, SMT2, and SMT4; vertical axis from 0 to 2.]
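The mode rules on this slide can be written down as a small lookup. The sketch below is only an illustration: the dictionary and function names are hypothetical, and the only facts taken from the slide are which SMT modes each compatibility mode supports and that a POWER7 chip has up to eight cores.

```python
# Hypothetical names; the mode table itself comes from the slide text.
SMT_MODES = {
    "POWER6": ("ST", "SMT2"),          # POWER6 mode: single thread and 2-way SMT
    "POWER7": ("ST", "SMT2", "SMT4"),  # POWER7 mode adds 4-way SMT
}
THREADS_PER_CORE = {"ST": 1, "SMT2": 2, "SMT4": 4}


def max_threads(processor_mode: str, cores: int) -> int:
    """Largest hardware thread count available in a compatibility mode."""
    widest = max(THREADS_PER_CORE[m] for m in SMT_MODES[processor_mode])
    return widest * cores


print(max_threads("POWER6", 8))  # 16
print(max_threads("POWER7", 8))  # 32
```

With eight cores, POWER7 mode reaches the 32 threads per chip quoted on the chip overview slide, while POWER6 mode tops out at 16.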


  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    9/58

    Computer Systems Memory: DRAM

    Dynamic RAM

    Uses MOSFETs and Capacitors

    Charge, i.e., 0 or 1, decays over time (capacitor discharges), losing its state

    To be useful, cell needs periodic refreshing (dynamic)

    Volatile memory (data lost when memory is not powered)

    Each cell stores 1 bit

    Memory cell: 1 x MOSFET + 1 x Capacitor

    Comparison with SRAM

    Fewer components, much more dense

    Higher latency, variable access timings

    Charge refresh & amplification complexity

    [Figure: 4x4 Dynamic RAM memory. Diagram courtesy of Wikipedia.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    10/58

    Computer Systems Memory: SRAM

    Static RAM

    Uses transistors only, e.g., MOSFET

    Charge does not need to be refreshed (static)

    Uses bi-stable latching circuitry

    Volatile memory (data lost when memory is not powered)

    Each cell stores 1 bit

    6 or more MOSFETs

    Comparison with DRAM

    Uses more components per bit

    Lower latency

    Higher frequency access

    Simpler access interface

    [Figure: Static RAM memory cell: 6 x MOSFET. Diagram courtesy of Wikipedia.]
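To make the density comparison concrete, the sketch below counts cell transistors for a 32MB array built from 6-MOSFET SRAM cells versus 1-MOSFET DRAM cells. It is a back-of-the-envelope illustration only: it assumes 32MB means 32 x 1024 x 1024 bytes, counts data cells alone (no tags, ECC, or peripheral circuitry), and is not presented as the source of any transistor figure quoted in the deck.

```python
BITS_PER_BYTE = 8
MIB = 1024 * 1024


def cell_transistors(capacity_bytes: int, transistors_per_bit: int) -> int:
    """Transistors needed for the data cells alone."""
    return capacity_bytes * BITS_PER_BYTE * transistors_per_bit


l3_bytes = 32 * MIB                     # assumed size of the on-chip L3
sram = cell_transistors(l3_bytes, 6)    # 6 MOSFETs per SRAM cell
dram = cell_transistors(l3_bytes, 1)    # 1 MOSFET (plus 1 capacitor) per DRAM cell

print(f"SRAM cells: {sram / 1e9:.2f}B transistors")  # 1.61B
print(f"DRAM cells: {dram / 1e9:.2f}B transistors")  # 0.27B
```

Under these assumptions the DRAM array needs one sixth of the cell transistors, which is the sense in which DRAM is "much more dense".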


  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    13/58

    POWER7 Design Principles: Balanced Design

    Multiple optimization points

    Improved energy efficiency

    RAS improvements

    Improved Thread Performance
    - Dynamic allocation of resources
    - Shared L3

    Increased Core parallelism
    - 4 Way SMT
    - Aggressive out of order execution

    Extreme Increase in Socket Throughput
    - Continued growth in socket bandwidth
    - Balanced core, cache, memory improvements

    System
    - Scalable interconnect
    - Reduced coherence traffic

    [Figure: two charts, "Traditional Performance View" and "Balanced View", comparing POWER6 and POWER7 at the Thread, Core, Socket, and 32 Chip System levels on a log scale from 1 to 10000; axis labels: Single Thread Performance, System Thruput. Graphs for illustration purposes only (not actual data).]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    14/58

    POWER7 Design Principles: Flexibility and Adaptability

    Cores:
    - 4, 6, and 8-core offerings with up to 32MB of L3 Cache
    - Dynamically turn cores on and off, reallocating energy
    - Dynamically vary individual core frequencies, reallocating energy
    - Dynamically enable and disable up to 4 threads per core

    Memory Subsystem:
    - 4 or 8 channel configurations

    System Topologies:
    - Standard, half-width, and double-width SMP busses supported

    Multiple System Packages

    Power 70x, 710-755: Single Chip Organic; 1 Memory Controller; 3 4B local links

    Power 770-795: Single Chip Glass Ceramic; 2 Memory Controllers; 3 8B local links; 2 8B Remote links

    Compute Intensive: Quad-chip MCM; 8 Memory Controllers; 3 16B local links (on MCM)

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    15/58

    POWER7: Core

    Execution Units
    - 2 Fixed point units
    - 2 Load store units
    - 4 Double precision floating point
    - 1 Vector unit
    - 1 Branch
    - 1 Condition register
    - 1 Decimal floating point unit

    6 Wide dispatch / 8 Wide Issue

    Recovery Function Distributed

    1,2,4 Way SMT Support

    Out of Order Execution

    32KB I-Cache

    32KB D-Cache

    256KB L2
    - Tightly coupled to core

    [Figure: core floorplan with blocks IFU, CRU/BRU, ISU, DFU, FXU, VSX FPU, LSU, and 256KB L2.]
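As a quick consistency check, the unit counts listed for the core add up to the "12 execution units per core" quoted on the chip overview slide, and eight cores at 4-way SMT give the 32 threads per chip. The dictionary keys below are just labels chosen for this sketch.

```python
# Unit counts copied from the core slide; key names are illustrative labels.
EXECUTION_UNITS = {
    "fixed point": 2,
    "load store": 2,
    "double precision floating point": 4,
    "vector": 1,
    "branch": 1,
    "condition register": 1,
    "decimal floating point": 1,
}
CORES_PER_CHIP = 8
SMT_WAYS = 4

units_per_core = sum(EXECUTION_UNITS.values())
threads_per_chip = CORES_PER_CHIP * SMT_WAYS

print(units_per_core)    # 12
print(threads_per_chip)  # 32
```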

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    16/58

    Computer Systems Memory: Hierarchy

    Memory wall
    - Differential in performance between addressable memory and the microprocessor

    Implement a memory hierarchy and intelligence to reduce access times/latency
    - Combination of memory-cell technology and distance from the processor

    [Figure: Typical Memory Hierarchy. Inside the microprocessor integrated circuit: registers, L1 cache (instruction cache and data cache), L2 cache, L3 cache. Outside it: DIMM buffer ASIC and addressable memory. Instructions and input data flow in; output data flows out. Toward the processor: greater intimacy, smaller capacity. Away from it: increased latency, greater capacity.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    17/58

    Challenge: Beating Physics to Realize Multi-core Potential

    [Figure: POWER7 chip block diagram. Eight cores, each with an L2 cache, around the L3 Cache and Chip Interconnect; two memory controllers; local SMP links; remote SMP + I/O links.]

    POWER7 is an 8-core, high performance Server chip. A solid chip is a good start.

    But to win the race, you need a balanced system. POWER7 enables that balance.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    18/58

    Challenge: Beating Physics to Realize Multi-core Potential

    [Figure: two curves plotted against multi-core evolution, Compute Throughput Potential and Socket Throughput Limitation (physical signal economics), with a gap between them.]

    Need to Amplify Effective Socket Throughput to Close Gap and Achieve Potential

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    19/58

    Trends in Server Evolution

    [Figure: server platforms over time, from Single Image to Virtualized/Cloud, built from 2-core and 8-core chips.]

    Traditional Entry Server, Single Image Platform: 2 to 4 socket, 4 to 8-way SMP Server

    Traditional High-End Server, Virtualized Consolidation Platform: 8 to 32 socket, 16 to 64-way SMP Server

    Emerging Entry Server, Virtualized/Cloud Platform: 2 to 4 socket, 16 to 32-way SMP Server

    Enabled by:
    - Technology
    - Innovation

    Driven by:
    - IT Evolution
    - Economics

    - A simple matter of riding the multi-core trend?
    - Add more cores to the die, beef up some interfaces, and scale to a large SMP?

    * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    20/58

    Trends in Server Evolution

    [Figure: server platforms over time, from Single Image to Virtualized/Cloud, built from 2-core and 8-core chips. The Traditional High-End Server and the Emerging Entry Server are marked "Similar Challenge".]

    Traditional Entry Server, Single Image Platform: 2 to 4 socket, 4 to 8-way SMP Server

    Traditional High-End Server, Virtualized Consolidation Platform: 8 to 32 socket, 16 to 64-way SMP Server

    Emerging Entry Server, Virtualized/Cloud Platform: 2 to 4 socket, 16 to 32-way SMP Server

    Enabled by:
    - Technology
    - Innovation

    Driven by:
    - IT Evolution
    - Economics

    - A simple matter of riding the multi-core trend?
    - Add more cores to the die, beef up some interfaces, and scale to a large SMP?

    Not so simple:
    - Emerging entry servers have characteristics similar to traditional high-end large SMP servers
    - Achieving solid virtual machine performance requires a Balanced System Structure.

    * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    21/58

    Trends in Server Evolution

    [Figure: server platforms over time, from Single Image to Virtualized/Cloud to UltraScale Cloud, built from 2-core and 8-core chips.]

    Traditional Entry Server, Single Image Platform: 2 to 4 socket, 4 to 8-way SMP Server

    Traditional High-End Server, Virtualized Consolidation Platform: 8 to 32 socket, 16 to 64-way SMP Server

    Emerging Entry Server, Virtualized/Cloud Platform: 2 to 4 socket, 16 to 32-way SMP Server

    Emerging High-End Server, UltraScale Cloud Platform: 8 to 32 socket, 64 to 256-way SMP Server

    Enabled by:
    - Technology
    - Innovation

    Driven by:
    - IT Evolution
    - Economics

    Same enablers and driving factors apply at larger scale

    * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.
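The "N-way" figures on these slides follow from multiplying the socket range by the cores per chip. The sketch below reproduces them, assuming one chip per socket; the function and variable names are hypothetical.

```python
def smp_ways(socket_range, cores_per_chip):
    """Return the (min, max) SMP width for a socket range, one chip per socket."""
    low, high = socket_range
    return (low * cores_per_chip, high * cores_per_chip)


ENTRY = (2, 4)      # 2 to 4 socket
HIGH_END = (8, 32)  # 8 to 32 socket

print(smp_ways(ENTRY, 2))     # (4, 8)    traditional entry server
print(smp_ways(HIGH_END, 2))  # (16, 64)  traditional high-end server
print(smp_ways(ENTRY, 8))     # (16, 32)  emerging entry server
print(smp_ways(HIGH_END, 8))  # (64, 256) emerging high-end server
```

The emerging entry server (16 to 32-way) lands in the same range as the traditional high-end server (16 to 64-way), which is the overlap the slides call a similar challenge.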

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    22/58

    Challenge: How does POWER7 maintain the Balance?

    [Figure: two curves plotted against multi-core evolution, Compute Throughput Potential and Socket Throughput Limitation (physical signal economics), with a gap between them.]

    Need to Amplify Effective Socket Throughput to Close Gap and Achieve Potential
    - Cache Hierarchy Technology and Innovation

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    23/58

    Cache Hierarchy Technology and Innovation

    Cache Hierarchy Requirement for POWER Servers

    [Figure: each core needs a low latency cache footprint of 2M to 4M per core; all cores share a large 30+ MB cache footprint much closer than local memory.]

    Challenge for Multi-core POWER7

    POWER4, POWER5, and POWER6 systems derive huge benefit from high bandwidth access to large, off-chip cache.

    But socket pin count constraints prevent scaling the off-chip cache interface to support 8 cores.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    24/58

    Cache Hierarchy Technology and Innovation

    Cache Hierarchy Requirement for POWER Servers

    [Figure: each core needs a low latency cache footprint of 2M to 4M per core; all cores share a large 30+ MB cache footprint much closer than local memory.]

    Challenge for Multi-core POWER7

    Need to satisfy both caching requirements with one cache.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    25/58

    Cache Hierarchy Technology and Innovation

    Cache Hierarchy Requirement for POWER Servers

    [Figure: each core needs a low latency cache footprint of 2M to 4M per core; all cores share a large 30+ MB cache footprint much closer than local memory.]

    Challenge for Multi-core POWER7

    IBM Custom eDRAM: dense, low power, lower speed/bandwidth; on-processor 30+ MB cache

    Custom Fast SRAM: high area/power, high speed/bandwidth; private core sub-MB cache

    Low power, dense eDRAM value enhanced with low latency, high bandwidth, fast SRAM structures

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    26/58

    Cache Hierarchy Technology and Innovation

    Solution: High speed eDRAM on the processor die

    Technologies, ordered from dense, low power, low speed/bandwidth to high area/power, high speed/bandwidth:

    Conventional Memory DRAM: conventional memory DIMMs (off uP chip)

    IBM ASIC eDRAM: large, off-chip 30+ MB cache (off uP chip)

    IBM Custom eDRAM: on-processor 30+ MB cache (on uP chip)

    Custom Dense SRAM: on-processor multi-MB cache (on uP chip)

    Custom Fast SRAM: private core sub-MB cache (on uP chip)

    Industry Standard Caching and Memory Technologies: Conventional DIMMs, Dense and Fast SRAMs.

    IBM's POWER Servers have leveraged large off-chip eDRAM caches in POWER4, 5, and 6.

    With POWER7, IBM introduces on-processor, high-speed, custom eDRAM, combining the dense, low power attributes of eDRAM with the speed and bandwidth of SRAM.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    27/58

    L3 Cache Structure

    [Figure: POWER7 chip block diagram. Eight cores, each with an L2 cache, around the L3 Cache and Chip Interconnect; two memory controllers; local SMP links; remote SMP + I/O links.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    28/58

    Hybrid L3 Fluid Cache Structure: Intelligent Cache

    [Figure: POWER7 chip block diagram. Eight cores, each with an L2 cache, around the L3 Cache and Chip Interconnect; two memory controllers; local SMP links; remote SMP + I/O links.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    29/58

    Cache Hierarchy Technology and Innovation

    Hybrid L3 Fluid Cache Structure: Intelligent Cache

    - Keeps multiple footprints at ~3X lower latency than local memory.

    [Figure: eight cores above a large, shared 32M L3 cache holding a mix of private and shared working set footprints.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    30/58

    Cache Hierarchy Technology and Innovation

    Hybrid L3 Fluid Cache Structure: Intelligent Cache

    - Keeps multiple footprints at ~3X lower latency than local memory.

    - Automatically migrates private footprints (up to 4M) to fast local region (per core) at ~5X lower latency than full L3 cache.

    - Automatically clones shared data to multiple private regions.

    [Figure: eight cores above a large, shared 32M L3 cache holding private and shared working set footprints; two cores are shown with their fast, local L3 regions, each holding a cloned copy of shared data.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    31/58

    Cache Hierarchy Technology and Innovation

    Hybrid L3 Fluid Cache Structure: Intelligent Cache

    - Enables a subset of the cores to utilize the entire large shared L3 cache when the remaining cores are not using it.

    [Figure: eight cores above the large, shared 32M L3 cache and a fast, local L3 region; the cache is filled with private footprints.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    32/58

    Cache Hierarchy Technology and Innovation

    [Figure: POWER7 chip block diagram with the Fast Local L3 Region highlighted. Eight cores, each with an L2 cache, around the L3 Cache and Chip Interconnect; two memory controllers; local SMP links; remote SMP + I/O links.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    33/58

    Cache Hierarchy Technology and Innovation

    Solution: L2 Turbo Cache

    - L2 Turbo cache keeps a tight 256K working set with extremely low latency (~3X lower than local L3 region) and high bandwidth, reducing L3 power and boosting performance.

    [Figure: eight cores, each with an L2 Turbo Cache, above fast, local L3 regions and the large, shared 32M L3 cache; the L3 holds private, shared, and cloned footprints.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    34/58

    Cache Hierarchy Technology and Innovation

    [Figure: POWER7 chip block diagram. Eight cores, each with an L2 cache, around the L3 Cache and Chip Interconnect; two memory controllers; local SMP links; remote SMP + I/O links.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    35/58

    Challenge: How does POWER7 maintain the Balance?

    Cache Hierarchy Technology and Innovation

    Cache Hierarchy Summary

    Cache Level      Capacity   Array       Policy           Comment
    L1 Data          32K        Fast SRAM   Store-thru       Local thread storage update
    Private L2       256K       Fast SRAM   Store-In         De-coupled global storage update
    Fast L3 Region   Up to 4M   eDRAM       Partial Victim   Reduced power footprint (up to 4M)
    Shared L3        32M        eDRAM       Adaptive         Large 32M shared footprint

    [Figure: eight cores, each with an L2 Turbo Cache, above fast, local L3 regions and the large, shared 32M L3 cache; the L3 holds private, shared, and cloned footprints.]
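The latency ratios quoted on the cache slides (~3X, ~5X, ~3X) can be chained to see how far each level sits from local memory. The sketch below does that arithmetic; treating the approximate ratios as exact and compounding them is an assumption made here for illustration, not a figure given in the deck.

```python
from fractions import Fraction

# Latency relative to local memory (memory = 1). Ratios are the approximate
# "~NX lower latency" figures from the slides, compounded as an assumption.
memory = Fraction(1)
shared_l3 = memory / 3      # ~3X lower latency than local memory
local_l3 = shared_l3 / 5    # ~5X lower latency than the full L3 cache
l2 = local_l3 / 3           # ~3X lower latency than the local L3 region

for name, latency in [("shared L3", shared_l3),
                      ("fast local L3 region", local_l3),
                      ("L2", l2)]:
    print(f"{name}: about {memory / latency}X faster than local memory")

# Eight per-core regions of up to 4M each account for the full 32M L3.
print(8 * 4)  # 32
```

Under these assumptions the fast local L3 region is roughly 15X and the L2 roughly 45X lower latency than local memory.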

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    36/58

    Challenge: How does POWER7 maintain the Balance?

    [Figure: two curves plotted against multi-core evolution, Compute Throughput Potential and Socket Throughput Limitation (physical signal economics), with a gap between them.]

    Need to Amplify Effective Socket Throughput to Close Gap and Achieve Potential
    - Cache Hierarchy Technology and Innovation
    - Advances in Memory Subsystem

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    37/58

    Advances in Memory Subsystem

    Memory subsystem requirement for POWER7 processor-based servers

    POWER7 Requirements

    Core:
    - 10GB/s to 20GB/s sustained memory bandwidth per core
    - 16GB to 32GB of storage per core

    Socket:
    - 4 times growth in memory bandwidth & capacity

    System:
    - Packaging more memory into similar volume, with similar energy and cooling constraints

    [Figure: a core annotated with 10-20GB/s sustained bandwidth per core, 16-32GB storage per core, and energy constraints.]
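Scaling the per-core requirements to an eight-core socket shows what the memory subsystem has to deliver. This is a rough sketch with hypothetical names; it assumes the per-core figures simply multiply by the core count.

```python
CORES_PER_SOCKET = 8  # eight-core POWER7 chip


def per_socket(per_core_range):
    """Scale a (min, max) per-core requirement to a full socket."""
    low, high = per_core_range
    return (low * CORES_PER_SOCKET, high * CORES_PER_SOCKET)


bandwidth_gbs = per_socket((10, 20))  # sustained GB/s per core
capacity_gb = per_socket((16, 32))    # GB of storage per core

print(bandwidth_gbs)  # (80, 160)
print(capacity_gb)    # (128, 256)
```

The 100GB/s sustained memory bandwidth per chip quoted on the chip overview slide falls inside the resulting 80 to 160 GB/s range.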

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    38/58

    Advances in Memory Subsystem

    Multi-faceted Solution

    1) Dual Integrated DDR3 Controllers
    - Massive 16KB scheduling window per POWER7 chip ensures high channel and DIMM utilization
    - Sparse access acceleration
    - Advanced Energy Management
    - Numerous RAS advances

    2) Eight high speed 6.4 GHz channels
    - New low power differential signaling
    - Sustained 100+ GB/s per socket

    3) New DDR3 buffer chip architecture
    - Larger capacity support (32 GB / core)
    - Energy Management support
    - RAS enablement

    4) DDR3 DRAMs
    - Supports 800, 1066, 1333, and 1600

    [Figure: POWER7 chip with two memory controllers connected to Advanced Buffer Chips.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    39/58

    Advances in Memory Subsystem

    [Figure: POWER7 chip block diagram. Eight cores, each with an L2 cache, around the L3 Cache and Chip Interconnect; two memory controllers; local SMP links; remote SMP + I/O links.]

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    40/58

    Challenge: How does POWER7 maintain the Balance?

    [Figure: two curves plotted against multi-core evolution, Compute Throughput Potential and Socket Throughput Limitation (physical signal economics), with a gap between them.]

    Need to Amplify Effective Socket Throughput to Close Gap and Achieve Potential
    - Cache Hierarchy Technology and Innovation
    - Advances in Memory Subsystem
    - Advances in Off-Chip Signaling Technology

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    41/58

    Challenge: How does POWER7 maintain the Balance?

    Advances in Off-chip Signaling Technology

    1) Enhanced Single-ended Elastic Interface Technology

    2) New high speed, low power Differential Technology

    Interface            Signal Type    Info Width   Frequency   Bandwidth
    Off-chip Cache       none           none         none        none
    Memory Channels      Differential   28 bytes     6.4 GHz     180 GB/s
    I/O Bridge           Single-ended   20 bytes     2.5 GHz     50 GB/s
    SMP Interconnect     Single-ended   120 bytes    3.0 GHz     360 GB/s
    Total Bandwidth                                              590 GB/s

    (Note that bandwidths shown are raw, peak signal bandwidths)

    Moving L3 onto POWER7 along with advances in signaling technology enables significant raw bandwidth growth for both memory and I/O subsystems. Note that advanced scheduling improves POWER7's ability to utilize memory bandwidth.
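The raw bandwidth figures on this slide are consistent with info width multiplied by signaling frequency. The sketch below recomputes them from the width and frequency values given; the names are hypothetical, and the small difference from the quoted 590 GB/s total comes from the slide rounding 179.2 up to 180.

```python
# (info width in bytes, frequency in GHz), as listed on the slide.
INTERFACES = {
    "Memory Channels": (28, 6.4),
    "I/O Bridge": (20, 2.5),
    "SMP Interconnect": (120, 3.0),
}


def raw_bandwidth_gbs(width_bytes: float, freq_ghz: float) -> float:
    """Raw peak bandwidth in GB/s: bytes per transfer times transfers per ns."""
    return width_bytes * freq_ghz


total = 0.0
for name, (width, freq) in INTERFACES.items():
    bandwidth = raw_bandwidth_gbs(width, freq)
    total += bandwidth
    print(f"{name}: {bandwidth:.1f} GB/s")

print(f"Total: {total:.1f} GB/s")
```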

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    42/58

    Challenge: How does POWER7 maintain the Balance?

    [Figure: two curves plotted against multi-core evolution, Compute Throughput Potential and Socket Throughput Limitation (physical signal economics), with a gap between them.]

    Need to Amplify Effective Socket Throughput to Close Gap and Achieve Potential
    - Cache Hierarchy Technology and Innovation
    - Advances in Memory Subsystem
    - Advances in Off-Chip Signaling Technology
    - Exploit Long Term Investment in Coherence Innovation

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    43/58

    Exploit Long Term Investment in Coherence Innovation

    [Diagram: POWER7 chip with eight cores, each with its own L2 cache; two memory controllers (Mem Ctrl); a central "L3 Cache and Chip Interconnect" region; Local SMP Links; Remote SMP + I/O Links.]

    Using local and remote SMP links, up to 32 POWER7 chips are connected.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    44/58

    Exploit Long Term Investment in Coherence Innovation

    Up to 32 POWER7 chips form a massive SMP system.

    * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    45/58

    Exploit Long Term Investment in Coherence Innovation

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    46/58

    Exploit Long Term Investment in Coherence Innovation

    Challenge: As system size grows, Coherence broadcast traffic increases

    POWER6 High-End Server (Virtualized Consolidation Platform)
    - 8 to 32 socket, 16 to 64-way SMP Server
    - Compute Throughput: 1X
    - Global Coherence Throughput: 320 GB/s

    POWER7 High-End Server (UltraScale Cloud Platform)
    - 8 to 32 socket, 64 to 256-way SMP Server
    - Compute Throughput: ~5X
    - Global Coherence Throughput: 450 GB/s

    [Diagram: Global Scope Coherence Broadcast]
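The challenge on this slide can be expressed as a ratio: compute throughput grows roughly 5X from POWER6 to POWER7, while global coherence throughput grows from 320 GB/s to 450 GB/s. The sketch below only restates that arithmetic; the function name is illustrative, and "~5X" is treated as exactly 5 for the purpose of the example.

```python
def growth(old, new):
    """Generation-over-generation growth factor."""
    return new / old


# Figures from the slide; "~5X" is approximated as 5.0 here.
compute_growth = growth(1.0, 5.0)
coherence_growth = growth(320.0, 450.0)

# How much faster compute demand grew than global coherence bandwidth.
shortfall = compute_growth / coherence_growth

if __name__ == "__main__":
    print(f"Compute growth:   {compute_growth:.2f}x")
    print(f"Coherence growth: {coherence_growth:.2f}x")
    print(f"Shortfall factor: {shortfall:.2f}x")
```

Under these assumptions, broadcast bandwidth per unit of compute drops by a factor of about 3.6 if every request uses global scope, which is the gap that limited-scope broadcast is meant to address.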

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    47/58

    Exploit Long Term Investment in Coherence Innovation

    Solution: Speculative limited scope Coherence broadcast
    - In 2003, recognized emerging trend
    - Developed Dual-Scope Broadcast Coherence Protocol for POWER6
    - Utilizes 13 cache states and integrated scope indicator in memory

    Provides value for POWER6
    - Latency reduction
    - Near Perfect Scaling for extreme memory intensive workloads

    [Diagram: Global Scope Coherence Broadcast versus Nodal Scope Speculative Coherence Broadcast, shown for the POWER6 High-End Server (Virtualized Consolidation Platform; 8 to 32 socket, 16 to 64-way SMP Server) and the POWER7 High-End Server (UltraScale Cloud Platform; 8 to 32 socket, 64 to 256-way SMP Server).]
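The idea of a speculative nodal-scope broadcast that falls back to global scope can be illustrated with a toy model. This is a heavily simplified sketch and not IBM's protocol: the slides mention 13 cache states and a scope indicator in memory but give no further detail, so the class names, the boolean `may_be_remote` indicator, and the snoop counting are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One SMP node, modeled only as the set of line addresses it caches."""
    cached: set = field(default_factory=set)


@dataclass
class SmpSystem:
    nodes: list
    # Hypothetical scope indicator stored with memory: True means a copy of
    # the line may exist outside the requester's node.
    may_be_remote: dict = field(default_factory=dict)

    def read(self, requester, addr):
        """Return (scope that resolved the request, number of node snoops)."""
        snoops = 1  # speculative nodal-scope broadcast: local node only
        if addr in self.nodes[requester].cached:
            return "nodal", snoops
        if not self.may_be_remote.get(addr, False):
            # Indicator says no remote copy: local memory can supply the line.
            return "nodal", snoops
        # Speculation failed: repeat the request as a global-scope broadcast.
        snoops += len(self.nodes)
        return "global", snoops


if __name__ == "__main__":
    smp = SmpSystem(
        nodes=[Node({0x10}), Node({0x20}), Node(), Node()],
        may_be_remote={0x20: True},
    )
    for addr in (0x10, 0x30, 0x20):
        print(hex(addr), smp.read(0, addr))
```

In this model, requests that resolve within the node never consume global broadcast bandwidth, which is one way to read the latency and scaling benefits the slide lists.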

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    48/58

    Energy Management: Architected Idle Modes

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    49/58

    Energy Management: Architected Idle Modes

    Two Design Points Chosen for Technology

    Nap (optimized for wake-up time)
    - Turn off clocks to execution units
    - Reduce frequency to core
    - Caches and TLB remain coherent
    - Fast wake-up

    Sleep (optimized for power reduction)
    - Purge caches and TLB
    - Turn off clocks to full core and caches
    - Reduce voltage to V-retention
    - Leakage current reduced substantially
    - Voltage ramps up on wake-up
    - No core re-initialization required

    [Chart: Wake-Up Latency versus Energy Reduction for the 4 PowerPC Architected States: Doze, Nap, Sleep, RV Winkle. Additional label: Power gate.]
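The trade-off between the two design points (Nap favors wake-up time, Sleep favors power reduction) can be sketched as a simple selection rule. The latency and power numbers below are placeholders invented for illustration, since the slides give no figures, and `pick_idle_mode` is a hypothetical helper rather than any real firmware or OS interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleMode:
    name: str
    wake_latency_us: float  # placeholder value, not from the slides
    relative_power: float   # placeholder fraction of active power


# Nap wakes quickly; Sleep saves more power but takes longer to wake.
MODES = [
    IdleMode("nap", 5.0, 0.60),
    IdleMode("sleep", 1000.0, 0.15),
]


def pick_idle_mode(max_wake_latency_us):
    """Pick the lowest-power mode whose wake-up latency fits the budget."""
    fits = [m for m in MODES if m.wake_latency_us <= max_wake_latency_us]
    if not fits:
        return None  # no idle mode is fast enough; stay active
    return min(fits, key=lambda m: m.relative_power)


if __name__ == "__main__":
    for budget in (1.0, 10.0, 5000.0):
        mode = pick_idle_mode(budget)
        print(budget, mode.name if mode else "active")
```

A real policy would also weigh the energy cost of entering and leaving each state, such as the cache and TLB purge that Sleep requires.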

    Adaptive Energy Management: EnergyScale™

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    50/58

    Chip FO4 Tuned for Optimal Performance/Watt in Technology

    DVFS (Dynamic Voltage and Frequency Slewing)
    - -50% to +10% frequency slew, independent per core
    - Frequency and voltage adjusted based on:
      - Workload and utilization
      - On-board activity monitors

    Turbo-Mode
    - Up to 10% frequency boost
    - Leverages excess energy capacity from:
      - Non-worst-case workloads
      - Idle cores

    Processor and Memory Energy Usage can be independently Balanced
    - Real-time hardware performance monitors used
    - On-board power proxy logic estimates power

    Power Capping Support
    - Allows budgeting of power to different parts of system

    [Chart: SPECPower: Mean System Power per Load Level. X-axis: Load Level (%), ticks 0, 20, 40, 60, 80, 100. Y-axis: AC Power. Marker: Vmin.]
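The DVFS range quoted on the slide (-50% to +10% of nominal frequency, per core) amounts to clamping a requested adjustment. The sketch below shows only that clamping; the function name and the 4.0 GHz nominal frequency in the usage lines are assumptions for illustration, and the actual EnergyScale control policy is not described in the slides.

```python
MIN_SLEW = -0.50  # slide: frequency can be slewed down by up to 50%
MAX_SLEW = 0.10   # slide: and up by up to 10% (Turbo-Mode boost)


def slewed_frequency_ghz(nominal_ghz, requested_slew):
    """Apply a requested fractional slew, clamped to the supported range."""
    slew = max(MIN_SLEW, min(MAX_SLEW, requested_slew))
    return nominal_ghz * (1.0 + slew)


if __name__ == "__main__":
    nominal = 4.0  # hypothetical nominal core frequency in GHz
    for request in (-0.80, -0.25, 0.0, 0.25):
        print(f"{request:+.2f} -> {slewed_frequency_ghz(nominal, request):.2f} GHz")
```

Because the slew is independent per core, a controller could call such a function once per core with a different request for each.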

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    51/58

    Power Systems Reliability, Availability, Serviceability (RAS)

    OS Downtime Comparison Survey

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    52/58

    [Bar chart: Hours (axis 0 to 10) by operating system: Win2000, Win2003, RHEL, SOLARIS, HP-UX, SUSE, AIX. 400 participants in 27 countries.]

    Source: The Yankee Group 2007-2008 Global Server Operating Systems Reliability Survey, as quoted in "Windows Server: The New King of Downtime" by Mark Joseph Edwards at www.windowsitpro.com/article/articleid/98475/windows-server-the-new-king-of-downtime.html, March 5, 2008, and in http://www.sunbeltsoftware.com/stu/Yankee-Group-2007-2008-Server-Reliability.pdf

    ITIC Survey says Power Systems with AIX deliver 99.997% uptime

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    53/58

    - 54% of IT executives and managers say that they require 99.99% or better availability for their applications

    Power Systems with AIX delivers the best RAS of UNIX, Linux, Windows choices

    1. Availability: The least amount of downtime
       - 15 minutes a year
       - 2.3 times better than the closest UNIX competitor
       - more than 10X better than Windows

    2. Reliability: The fewest unscheduled outages
       - less than one outage per year

    3. Serviceability: The fastest patch time
       - 11 minutes to apply a patch

    Source: Network World, dated July 14, 2009, reports on the 2009 ITIC Global Server Hardware & Server OS Reliability Survey Results
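The 15 minutes of downtime per year and the 99.997% uptime figure quoted for the ITIC survey are consistent with each other, as a quick conversion shows. This is a small sketch of that arithmetic, assuming a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_pct(downtime_min_per_year):
    """Availability percentage implied by a yearly downtime in minutes."""
    return 100.0 * (1.0 - downtime_min_per_year / MINUTES_PER_YEAR)


def downtime_budget_min(availability_percent):
    """Yearly downtime in minutes allowed by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    print(f"15 min/year  -> {availability_pct(15):.3f}% availability")
    print(f"99.99% target -> {downtime_budget_min(99.99):.2f} min/year")
```

By the same conversion, the 99.99% requirement cited by 54% of respondents allows a little under 53 minutes of downtime per year.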

    POWER7: Reliability and Availability Features

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    54/58

    Fabric Bus Interface to other Chips and Nodes
    - ECC protected
    - Node hot add / repair

    Core Recovery
    - Leverage speculative execution resources to enable recovery
    - Error detected in GPRs, FPRs, VSR, flushed and retried
    - Stacked latches to improve SER
    - Alternate Processor Recovery
    - Partition isolation for core checkstops

    L3 eDRAM
    - ECC protected
    - SUE handling
    - Line delete
    - Spare rows and columns

    GX IO Bus
    - ECC protected
    - Hot add

    InfiniBand Interface
    - Redundant paths

    64 Byte ECC on Memory
    - Corrects full chip kill on X8 DIMMs
    - Spare X8 devices implemented
    - Dual memory chip failures do not cause outage
    - Selective memory mirror capability to recover partition from DIMM failures

    - Hardware assisted scrubbing
    - SUE handling
    - Dynamic sparing on channel interface
    - PowerVM Hypervisor protected from full DIMM failures

    Dynamic Oscillator Failover (OSC0, OSC1)

    [Diagram labels: Fabric Interface; IO Hub; PCI Bridge; PCI Adapter; four memory buffers (BUF); X8 DIMMs.]

    IBM Power Systems value proposition

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    55/58

    Deliver business value by leveraging technology

    Reliability, Performance, Flexibility, Affordability

    +

    . . . the highest value at the lowest risk with leading technology

    Session Evals in 3 Easy Steps

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    56/58

    1. Go to: http://ibmtechu.com/up1

    2. Select Register button and complete the form (One time only)

    Session Evals in 3 Easy Steps

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    57/58

    3. Select Session Evals button and complete the online form

    Session Code: HE01

    Fantastic

    Stay Connected & Continue Skills Transfer via IBM Training

  • 8/2/2019 HE01_Tendler - Power 7 Architecture

    58/58

    - Training paths: What to take, when to take it
    - Social media: Join the conversation
    - Custom catalog: Create a catalog that meets your interest areas
    - RSS feeds: Up-to-date information on the training you need
    - IBM Training News: Targeted to your needs
    - New to Instructor Led Online (ILO)? Take a free test drive!
    - Education Packs: Online discount program for ALL IBM Training courses for your company

    Questions? Email Lisa Ryan ([email protected])