
Piranha: Designing a Scalable CMP-based System for Commercial Workloads

Luiz André Barroso
Western Research Laboratory

April 27, 2001 Asilomar Microcomputer Workshop

What is Piranha?

• A scalable shared-memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and the Compaq NonStop Hardware Development Group
• A departure from ever-increasing processor complexity and system design/verification cycles

Importance of Commercial Applications

• Total server market size in 1999: ~$55-60B
  – technical applications: less than $6B
  – commercial applications: ~$40B

Worldwide Server Customer Spending (IDC 1999)

Infrastructure              29%
Business processing         22%
Decision support            14%
Software development        14%
Collaborative               12%
Scientific & engineering     6%
Other                        3%

Price Structure of Servers

• IBM eServer 680 (220 KtpmC; $43/tpmC)
  – 24 CPUs
  – 96 GB DRAM, 18 TB disk
  – $9M price tag
• Compaq ProLiant ML570 (32 KtpmC; $12/tpmC)
  – 4 CPUs
  – 8 GB DRAM, 2 TB disk
  – $240K price tag

- Software maintenance/management costs even higher (up to $100M)
- Storage prices dominate (50%-70% in customer installations)
- Price of expensive CPUs/memory system amortized

Figure: Normalized breakdown of HW cost (Base, CPU, DRAM, I/O) for the IBM eServer 680 and the Compaq ProLiant ML570.

Price per component

System                   $/CPU      $/MB DRAM   $/GB Disk
IBM eServer 680          $65,417    $9          $359
Compaq ProLiant ML570    $6,048     $4          $64

Outline

• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Studies of Commercial Workloads

• Collaboration with Kourosh Gharachorloo (Compaq WRL)
  – ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion)
  – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
  – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
  – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
  – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)

Studies of Commercial Workloads: summary

• Memory system is the main bottleneck
  – astronomically high CPI
  – dominated by memory stall times
  – instruction stalls as important as data stalls
  – fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
  – frequent hard-to-predict branches
  – large L1 miss ratios
  – Ld-Ld dependencies
  – disappointing gains from wide-issue out-of-order techniques!
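The claim that memory stalls dominate CPI can be illustrated with a simple additive stall model. This is only a sketch: the function name `cpi_breakdown` and every numeric input are hypothetical placeholders chosen to show the arithmetic, not measurements from the cited studies.

```python
def cpi_breakdown(base_cpi, misses_per_instr, miss_penalty_cycles):
    """Additive CPI model: base CPI plus memory stall cycles per instruction.

    misses_per_instr and miss_penalty_cycles are dicts keyed by stall source
    (here "instr" and "data"). All inputs are illustrative assumptions.
    """
    stalls = {
        source: misses_per_instr[source] * miss_penalty_cycles[source]
        for source in misses_per_instr
    }
    total = base_cpi + sum(stalls.values())
    return total, stalls


# Hypothetical inputs: equal instruction and data miss rates and penalties.
total, stalls = cpi_breakdown(
    base_cpi=1.0,
    misses_per_instr={"instr": 0.02, "data": 0.02},
    miss_penalty_cycles={"instr": 100, "data": 100},
)
memory_share = sum(stalls.values()) / total
print(f"CPI = {total:.1f}, memory stall share = {memory_share:.0%}")
```

With inputs like these, a core that could otherwise retire one instruction per cycle spends most of its time waiting on memory, so widening the issue logic changes only the small base term.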






Increasing Complexity of Processor Designs

• Pushing limits of instruction-level parallelism
  – multiple instruction issue
  – speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size
• Yielding diminishing returns in performance

Processor     Year      Transistor Count   Design      Design Time   Verification Team Size
(SGI MIPS)    Shipped   (millions)         Team Size   (months)      (% of total)
R2000         1985      0.10               20          15            15%
R4000         1991      1.40               55          24            20%
R10000        1996      6.80               >100        36            >35%

courtesy: John Hennessy, IEEE Computer, 32(8)

Exploiting Higher Levels of Integration

• lower latency, higher bandwidth
• reuse of existing CPU core addresses complexity issues
• incrementally scalable glueless multiprocessing

Figure: Alpha 21364 single-chip block diagram (1 GHz 21264 CPU, 64KB I$, 64KB D$, 1.5MB L2$, two MEM-CTL blocks, Coherence Engine, Network Interface, I/O), and a multiprocessor built from several 21364 chips, each with its own memory (M) and I/O (IO).

Exploiting Parallelism in Commercial Apps

Chip Multiprocessing (CMP)
Example: IBM Power4

Figure: CMP chip with two CPUs (each with I$ and D$) sharing an L2$, with MEM-CTL, Coherence, Network, and I/O blocks; timeline showing threads 1 to 4.

Simultaneous Multithreading (SMT)
Example: Alpha 21464

Figure: SMT timeline showing threads 1 to 4.

• SMT superior in single-thread performance
• CMP addresses complexity by using simpler cores

Outline

• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
  – Architecture
  – Performance
• Design Methodology
• Summary

Piranha Project

• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
  – simple processor cores
  – standard ASIC methodology

Give up on ILP, embrace TLP

Piranha Team Members

Research
  – Luiz André Barroso (WRL)
  – Kourosh Gharachorloo (WRL)
  – David Lowell (WRL)
  – Joel McCormack (WRL)
  – Mosur Ravishankar (WRL)
  – Rob Stets (WRL)
  – Yuan Yu (SRC)

NonStop Hardware Development ASIC Design Center
  – Tom Heynemann
  – Dan Joyce
  – Harland Maxwell
  – Harold Miller
  – Sanjay Singh
  – Scott Smith
  – Jeff Sprouse
  – … several contractors

Former Contributors
  – Robert McNamara
  – Basem Nayfeh
  – Andreas Nowatzyk
  – Joan Pendleton
  – Shaz Qadeer
  – Brian Robinson
  – Barton Sano
  – Daniel Scales
  – Ben Verghese

Piranha Processing Node

• Alpha core: 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): µprog., 1K µinstr., even/odd interleaving
• System Interconnect: 4-port Xbar router
  – topology independent
  – 32GB/sec total bandwidth

Figure: Single-chip block diagram with eight CPUs (each with I$ and D$), eight L2$ banks, eight MEM-CTL blocks, the ICS, the HE and RE protocol engines, and the Router.

Piranha I/O Node

Figure: I/O node block diagram with a Router (2 links @ 8GB/s), one CPU with I$ and D$, L2$, MEM-CTL, FB, the HE and RE protocol engines, the ICS, and a PCI-X interface with its own D$.

• I/O node is a full-fledged member of system interconnect
  – CPU indistinguishable from Processing Node CPUs
  – participates in global coherence protocol

Example Configuration

Figure: Example system built from processing nodes (P) and I/O nodes (P-I/O).

• Arbitrary topologies
• Match ratio of Processing to I/O nodes to application requirements

L2 Cache and Intra-Node Coherence

• No inclusion between L1s and L2 cache
  – total L1 capacity equals L2 capacity
  – L2 misses go directly to L1
  – L2 filled by L1 replacements
• L2 keeps track of all lines in the chip
  – sends Invalidates, Forwards
  – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  – cooperates with Protocol Engines to enforce system-wide coherence
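The non-inclusive policy listed here (memory fills bypass the L2, and the L2 is filled only by L1 replacements) can be sketched with a toy model. This is a minimal illustration under stated assumptions: the class `NonInclusiveHierarchy`, its LRU replacement, and the rule "write a victim to L2 only if no other L1 still holds it" are simplifications invented for this example, not the actual Piranha controller logic.

```python
from collections import OrderedDict


class NonInclusiveHierarchy:
    """Toy model: per-CPU L1s plus a shared L2 that holds only L1 victims."""

    def __init__(self, num_cpus, l1_lines, l2_lines):
        self.l1 = [OrderedDict() for _ in range(num_cpus)]  # LRU order per L1
        self.l2 = OrderedDict()                              # LRU order for L2
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, cpu, addr):
        """Return where the line was found: 'L1', 'L2', 'L1-fwd' or 'MEM'."""
        l1 = self.l1[cpu]
        if addr in l1:
            l1.move_to_end(addr)       # refresh LRU position
            return "L1"
        if addr in self.l2:
            del self.l2[addr]          # line moves up; no copy stays in L2
            source = "L2"
        elif any(addr in other for other in self.l1):
            source = "L1-fwd"          # another on-chip L1 forwards the line
        else:
            source = "MEM"             # memory fill goes straight to the L1
        self._fill_l1(cpu, addr)
        return source

    def _fill_l1(self, cpu, addr):
        l1 = self.l1[cpu]
        if len(l1) >= self.l1_lines:
            victim, _ = l1.popitem(last=False)   # evict least recently used
            # Assumed rule: keep the victim on chip only if it was the last copy.
            if not any(victim in other for other in self.l1):
                self.l2[victim] = True
                if len(self.l2) > self.l2_lines:
                    self.l2.popitem(last=False)
        l1[addr] = True


h = NonInclusiveHierarchy(num_cpus=2, l1_lines=2, l2_lines=4)
trace = [h.access(0, a) for a in (0xA, 0xB, 0xC, 0xA)]
print(trace)
```

Because a line lives in either an L1 or the L2 but is not duplicated between them, the on-chip capacity is the sum of both levels. That matters here since, per the processing-node parameters, eight cores with 64KB instruction and 64KB data caches add up to the same 1MB as the shared L2.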

Inter-Node Coherence Protocol

• ‘Stealing’ ECC bits for memory directory
• Directory (2b state + 40b sharing info)
• Dual representation: limited pointer + coarse vector
• “Cruise Missile” Invalidations (CMI)
  – limit fan-out/fan-in serialization with CV
• Several new protocol optimizations
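The dual directory representation can be sketched as follows. Only the 40-bit sharing field comes from the slide; the system size of 1024 nodes, the 10-bit pointer width, the grouping of nodes per coarse-vector bit, and the function names are assumptions made for illustration.

```python
SHARING_BITS = 40                                   # from the slide: 40b sharing info
NUM_NODES = 1024                                    # assumption: illustrative system size
POINTER_BITS = 10                                   # assumption: log2(NUM_NODES)
MAX_POINTERS = SHARING_BITS // POINTER_BITS         # exact pointers that fit in the field
NODES_PER_CV_BIT = -(-NUM_NODES // SHARING_BITS)    # ceiling division: nodes per coarse bit


def encode_sharers(sharers):
    """Return ('LP', [node ids]) for few sharers, else ('CV', bitmask)."""
    sharers = sorted(set(sharers))
    if len(sharers) <= MAX_POINTERS:
        return "LP", sharers
    mask = 0
    for node in sharers:
        mask |= 1 << (node // NODES_PER_CV_BIT)     # one bit stands for a group of nodes
    return "CV", mask


def invalidation_targets(encoding):
    """Nodes that must receive an invalidation for this directory entry."""
    kind, value = encoding
    if kind == "LP":
        return list(value)
    targets = []
    for bit in range(SHARING_BITS):
        if (value >> bit) & 1:
            first = bit * NODES_PER_CV_BIT
            targets.extend(range(first, min(first + NODES_PER_CV_BIT, NUM_NODES)))
    return targets


few = encode_sharers([3, 700])
many = encode_sharers([1, 2, 3, 30, 60])
print(few, many[0], len(invalidation_targets(many)))
```

The trade-off is visible in the second case: once the entry falls back to the coarse vector, the set of invalidation targets is a superset of the true sharers, which is the situation in which the number of invalidation messages, and the serialization the CMI bullet refers to, starts to matter.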

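Figure: directory entry layout, drawn as two fields, each labelled state (2b) and info on sharers (20b).

ECC granularity and directory bits per 512-bit line

Organization (data bits + ECC + directory bits)   Total directory bits
8 x (64 + 8)                                      0
4 x (128 + 9 + 7)                                 28
2 x (256 + 10 + 22)                               44
1 x (512 + 11 + 53)                               53

Figure: CMI example with bit pattern 010000001000.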
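The ECC and directory-bit figures on this slide can be reproduced with a short calculation. The sketch assumes the ECC is a standard SEC-DED Hamming code, which the slide does not state; the check-bit counts it yields (8, 9, 10, 11) do match the slide. The function and variable names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a single-error-correct, double-error-detect Hamming code."""
    r = 0
    while (1 << r) < data_bits + r + 1:   # Hamming bound for single-error correction
        r += 1
    return r + 1                          # one extra overall parity bit for detection


LINE_DATA_BITS = 512
# Baseline: conventional ECC over eight 64-bit words, i.e. 8 x (64 + 8).
BASELINE_ECC_BITS = 8 * secded_check_bits(64)

directory_bits = {}
for words in (8, 4, 2, 1):
    width = LINE_DATA_BITS // words
    ecc = secded_check_bits(width)
    freed = BASELINE_ECC_BITS - words * ecc       # bits no longer needed for ECC
    directory_bits[words] = freed
    print(f"{words} x ({width} + {ecc} + {freed // words}) -> {freed} directory bits")
```

Protecting wider words needs fewer check bits in total, so the same 64 bits of ECC storage per 512-bit line leave room for directory state. The 2 x (256 + 10 + 22) organization frees 44 bits, enough for the 2b state plus 40b sharing info described on the slide.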

Simulated Architectures

Single-Chip Piranha Performance

Figure: Normalized execution time, broken down into CPU, L2 Hit, and L2 Miss components (OOO = 100).

Configuration             OLTP   DSS
P1  (500 MHz, 1-issue)    233    350
INO (1 GHz, 1-issue)      145    191
OOO (1 GHz, 4-issue)      100    100
P8  (500 MHz, 1-issue)     34     44

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes memory system
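The quoted margins follow from the normalized execution times in the chart, since a speedup is the ratio of two execution times. A small sketch of that arithmetic, where the dictionary layout and the function name `speedup_over` are just illustrative choices:

```python
# Normalized execution times read off the chart (OOO = 100).
normalized_time = {
    "OLTP": {"P1": 233, "INO": 145, "OOO": 100, "P8": 34},
    "DSS": {"P1": 350, "INO": 191, "OOO": 100, "P8": 44},
}


def speedup_over(baseline, config, workload):
    """Speedup of `config` relative to `baseline`: ratio of execution times."""
    times = normalized_time[workload]
    return times[baseline] / times[config]


for workload in normalized_time:
    print(workload, round(speedup_over("OOO", "P8", workload), 2))
```

The ratios come out at about 2.9 for OLTP and just under 2.3 for DSS, in line with the rounded margins quoted on the slide. The same function shows that a single 500 MHz Piranha core (P1) is well behind the OOO design, so the advantage comes from running eight cores at once.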

Single-Chip Performance (Cont.)

• Near-linear scalability
  – low memory latencies
  – effectiveness of highly associative L2 and non-inclusive caching

Figure: Speedup versus Number of Cores (up to 8).

Figure: Normalized Breakdown of L1 Misses (%) into L2 Hit, L2 Fwd, and L2 Miss for P1, P2, P4, and P8 (500 MHz, 1-issue).

Potential of a Full-Custom Piranha

• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from boost in core speed

Figure: Normalized execution time, broken down into CPU, L2 Hit, and L2 Miss components (OOO = 100).

Configuration               OLTP   DSS
OOO (1 GHz, 4-issue)        100    100
P8  (500 MHz, 1-issue)       34     43
P8F (1.25 GHz, 1-issue)      20     19






Managing Complexity in the Architecture

• Use of many simpler logic modules
  – shorter design
  – easier verification
  – only short wires*
  – faster synthesis
  – simpler chip-level layout
• Simplify intra-chip communication
  – all traffic goes through ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement sub-set of Alpha ISA
  – no VAX floating point, no multimedia instructions, etc.

Methodology Challenges

• Isolated sub-module testing
  – need to create robust bus functional models (BFM)
  – sub-modules’ behavior highly inter-dependent
  – not feasible with a small team
• System-level (integrated) testing
  – much easier to create tests
  – only one BFM at the processor interface
  – simpler to assert correct operation
  – Verilog simulation is too slow for comprehensive testing

Our Approach:

• Design in stylized C++ (synthesizable RTL level)
  – use mostly system-level, semi-random testing
  – simulations in C++ (faster & cheaper than Verilog)
    – simulation speed ~1000 clocks/second
  – employ directed tests to fill test coverage gaps
• Automatic C++ to Verilog translation
  – single design database
  – reduce translation errors
  – faster turnaround of design changes
  – risk: untested methodology
• Using industry-standard synthesis tools
• IBM ASIC process (Cu11)
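The idea of a cycle-accurate model exercised by semi-random tests can be sketched in a few lines. The example below is written in Python for brevity rather than in the stylized C++ the slides describe, and the module (`FifoRtl`), its interface, and the reference model are all hypothetical; it only illustrates the testing style, not the Piranha code base.

```python
import random
from collections import deque


class FifoRtl:
    """RTL-style 4-entry FIFO: state held in registers, one clock() call per cycle."""

    DEPTH = 4

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.head = 0      # read-pointer register
        self.count = 0     # occupancy register

    def clock(self, push, data, pop):
        """Advance one cycle and return the popped value, or None."""
        can_pop = pop and self.count > 0
        can_push = push and (self.count < self.DEPTH or can_pop)
        out = self.mem[self.head] if can_pop else None
        # Compute next state from current-state values, then commit it.
        next_head = (self.head + 1) % self.DEPTH if can_pop else self.head
        next_count = self.count + int(can_push) - int(can_pop)
        if can_push:
            tail = (self.head + self.count) % self.DEPTH
            self.mem[tail] = data
        self.head, self.count = next_head, next_count
        return out


def random_test(cycles, seed=0):
    """Drive the FIFO with random stimulus and compare against a deque reference."""
    rng = random.Random(seed)
    dut, ref = FifoRtl(), deque()
    for cycle in range(cycles):
        push, pop = rng.random() < 0.5, rng.random() < 0.5
        data = rng.randrange(256)
        # Reference behavior: pop first, then push if there is room.
        expected = ref.popleft() if pop and ref else None
        if push and len(ref) < FifoRtl.DEPTH:
            ref.append(data)
        got = dut.clock(push, data, pop)
        assert got == expected, f"cycle {cycle}: got {got}, expected {expected}"
    return cycles


print(random_test(10_000), "cycles passed")
```

Random stimulus of this kind tends to reach corner cases, such as a push and pop in the same cycle on a full queue, without a hand-written test for each one. Directed tests then cover whatever the random runs miss.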

Piranha Methodology: Overview

Figure: Tool flow from C++ RTL Models through cxx to PS1 and PS1V, and through CLevel to Verilog Models and Physical Design.

• C++ RTL Models: cycle-accurate and “synthesizable”
• cxx: C++ compiler
• PS1: fast (C++) logic simulator
• PS1V: can “co-simulate” C++ and Verilog module versions and check correspondence
• CLevel: C++-to-Verilog translator
• Verilog Models: machine-translated from C++ models
• Physical Design: leverages industry-standard Verilog-based tools

Summary

• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design
  – many simple cores
• Piranha has a large architectural advantage over complex single-core designs (> 3x) for database applications
• Piranha methodology enables faster design turnaround
• Key to Piranha is application focus:
  – One-size-fits-all solutions may soon be infeasible

Reference

• Papers on commercial workload performance & Piranha: research.compaq.com/wrl/projects/Database
