Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso, Western Research Laboratory
April 27, 2001, Asilomar Microcomputer Workshop

What is Piranha?
• A scalable shared-memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever-increasing processor complexity and system design/verification cycles

Importance of Commercial Applications
• Total server market size in 1999: ~$55-60B
  – technical applications: less than $6B
  – commercial applications: ~$40B

Worldwide Server Customer Spending (IDC 1999)
  Infrastructure              29%
  Business processing         22%
  Decision support            14%
  Software development        14%
  Collaborative               12%
  Other                        3%
  Scientific & engineering     6%

Price Structure of Servers
• IBM eServer 680

Studies of Commercial Workloads
• Collaboration with Kourosh Gharachorloo (Compaq WRL)
  – ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion)
  – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
  – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
  – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
  – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)

Studies of Commercial Workloads: summary
• Memory system is the main bottleneck
  – astronomically high CPI
  – dominated by memory stall times
  – instruction stalls as important as data stalls
  – fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
  – frequent hard-to-predict branches
  – large L1 miss ratios
  – Ld-Ld dependencies
  – disappointing gains from wide-issue out-of-order techniques!
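
To make the "dominated by memory stall times" point concrete, CPI can be written as a base term plus the stall cycles per instruction contributed by each class of miss. The sketch below uses that textbook decomposition and assumes every miss stalls the core for its full penalty. All numbers in it (base CPI, miss rates, penalties) and all names are hypothetical placeholders chosen for illustration, not measurements from the studies listed above.

```python
def cpi(base_cpi, misses_per_instr, penalty_cycles):
    """CPI = base CPI + sum over miss classes of (misses/instr * stall cycles/miss)."""
    stall = sum(misses_per_instr[k] * penalty_cycles[k] for k in misses_per_instr)
    return base_cpi + stall

# Hypothetical values for illustration only (not from the talk).
BASE_CPI = 0.5
misses = {"instr_L1_miss_L2_hit": 0.05, "data_L1_miss_L2_hit": 0.04, "L2_miss": 0.01}
penalty = {"instr_L1_miss_L2_hit": 15, "data_L1_miss_L2_hit": 15, "L2_miss": 150}

total = cpi(BASE_CPI, misses, penalty)
stall_share = (total - BASE_CPI) / total
print(f"CPI = {total:.2f}, memory stalls = {stall_share:.0%} of cycles")
```

With placeholder inputs like these, instruction misses contribute about as much as data misses and the rare L2 misses still dominate, which is the shape of the argument for fast, large L2 caches.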

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Increasing Complexity of Processor Designs

Exploiting Higher Levels of Integration
• lower latency, higher bandwidth
• reuse of existing CPU core addresses complexity issues
• incrementally scalable glueless multiprocessing

[Figure: Alpha 21364 single-chip block diagram (1GHz 21264 CPU, 64KB I$, 64KB D$, 1.5MB L2$, MEM-CTL, Coherence Engine, Network Interface, I/O) and a multiprocessor built from 21364 nodes, each with memory (M) and I/O (IO)]

Exploiting Parallelism in Commercial Apps
Chip Multiprocessing (CMP)
  Example: IBM Power4
Simultaneous Multithreading (SMT)
  Example: Alpha 21464
• SMT superior in single-thread performance
• CMP addresses complexity by using simpler cores

[Figure: CMP block diagram (two CPUs, each with I$ and D$, sharing L2$, MEM-CTL, Coherence, Network, I/O) and execution timelines for threads 1-4 under CMP and under SMT]

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
  – Architecture
  – Performance
• Design Methodology
• Summary

Piranha Project
• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
  – simple processor cores
  – standard ASIC methodology

Give up on ILP, embrace TLP

Piranha Team Members
Research
  – Luiz André Barroso (WRL)
  – Kourosh Gharachorloo (WRL)
  – David Lowell (WRL)
  – Joel McCormack (WRL)
  – Mosur Ravishankar (WRL)
  – Rob Stets (WRL)
  – Yuan Yu (SRC)
NonStop Hardware Development ASIC Design Center
  – Tom Heynemann
  – Dan Joyce
  – Harland Maxwell
  – Harold Miller
  – Sanjay Singh
  – Scott Smith
  – Jeff Sprouse
  – … several contractors
Also listed
  – Robert McNamara
  – Basem Nayfeh
  – Andreas Nowatzyk
  – Joan Pendleton
  – Shaz Qadeer
  – Brian Robinson
  – Barton Sano
  – Daniel Scales
  – Ben Verghese

• Near-linear scalability
  – low memory latencies
  – effectiveness of highly associative L2 and non-inclusive caching

[Figure: Speedup vs. Number of Cores (both axes 0-8)]
[Figure: Normalized Breakdown of L1 Misses (%) for P1, P2, P4, P8 (500 MHz, 1-issue); legend: L2 Miss, L2 Fwd, L2 Hit]

Potential of a Full-Custom Piranha
• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from boost in core speed

[Figure: Normalized Execution Time, broken down into CPU, L2 Hit, and L2 Miss]
  Configuration               OLTP   DSS
  OOO (1GHz, 4-issue)          100   100
  P8 (500MHz, 1-issue)          34    43
  P8F (1.25GHz, 1-issue)        20    19
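
The "5x margin" can be checked directly from the chart's normalized execution times, since speedup over OOO is just the ratio of execution times. The short sketch below does that arithmetic. It assumes the six bar labels in the transcript (100, 34, 20, 100, 43, 19) map, in order, to OOO, P8, P8F for OLTP and then for DSS; the variable and function names are illustrative.

```python
# Normalized execution times as read from the chart (OOO = 100 by definition).
# The value-to-bar mapping is inferred from the order in the transcript.
normalized_time = {
    "OLTP": {"OOO": 100, "P8": 34, "P8F": 20},
    "DSS": {"OOO": 100, "P8": 43, "P8F": 19},
}

def speedup_over_ooo(times):
    """Speedup relative to OOO is the ratio of execution times."""
    return {cfg: times["OOO"] / t for cfg, t in times.items()}

speedups = {wl: speedup_over_ooo(t) for wl, t in normalized_time.items()}
for wl, s in speedups.items():
    # Gain of the full-custom part over the ASIC part, for a 2.5x clock increase.
    p8f_over_p8 = normalized_time[wl]["P8"] / normalized_time[wl]["P8F"]
    print(wl, {cfg: round(x, 2) for cfg, x in s.items()},
          "P8F over P8:", round(p8f_over_p8, 2))
```

Under that mapping, P8F comes out at 5.0x (OLTP) and about 5.26x (DSS) over OOO, while the 2.5x clock boost from P8 to P8F yields roughly 1.7x on OLTP and 2.26x on DSS.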

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Managing Complexity in the Architecture
• Use of many simpler logic modules
  – shorter design
  – easier verification
  – only short wires*
  – faster synthesis
  – simpler chip-level layout
• Simplify intra-chip communication
  – all traffic goes through ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement subset of Alpha ISA
  – no VAX floating point, no multimedia instructions, etc.

  – need to create robust bus functional models (BFM)
  – sub-modules’ behavior highly inter-dependent
  – not feasible with a small team
• System-level (integrated) testing
  – much easier to create tests
  – only one BFM at the processor interface
  – simpler to assert correct operation
  – Verilog simulation is too slow for comprehensive testing
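
One way to picture system-level testing at the processor interface is a randomized load/store test: several cores issue random operations, and every load is checked against a flat reference memory, so correctness is asserted with a single simple model. The sketch below is a toy illustration of that idea, not the Piranha test environment. `ToyCachedMemory` is a hypothetical stand-in for the design under test, and all names and parameters are assumptions.

```python
import random

class ToyCachedMemory:
    """Hypothetical stand-in for the design under test: per-core write-through
    caches kept coherent by invalidating other cores' copies on a store."""
    def __init__(self, num_cores):
        self.memory = {}
        self.caches = [dict() for _ in range(num_cores)]

    def store(self, core, addr, value):
        self.memory[addr] = value
        self.caches[core][addr] = value
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(addr, None)  # invalidate stale copies

    def load(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            cache[addr] = self.memory.get(addr, 0)  # fill on miss
        return cache[addr]

def run_random_test(num_cores=8, num_ops=10_000, num_addrs=16, seed=0):
    """Drive random loads/stores at the processor interface and check every
    load against a flat reference memory. Returns the number of loads checked."""
    rng = random.Random(seed)
    dut = ToyCachedMemory(num_cores)
    reference = {}
    checked = 0
    for _ in range(num_ops):
        core = rng.randrange(num_cores)
        addr = rng.randrange(num_addrs)  # few addresses, to force sharing
        if rng.random() < 0.5:
            value = rng.randrange(1 << 32)
            dut.store(core, addr, value)
            reference[addr] = value
        else:
            assert dut.load(core, addr) == reference.get(addr, 0), (core, addr)
            checked += 1
    return checked

print("loads checked:", run_random_test())
```

Keeping the address range small makes cores collide on the same locations, which is where coherence bugs would show up. Removing the invalidation loop in `store` makes the assertion fire quickly.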

Our Approach:
• Design in stylized C++ (synthesizable RTL level)
  – use mostly system-level, semi-random testing
  – simulations in C++ (faster & cheaper than Verilog)
    ▪ simulation speed ~1000 clocks/second
  – employ directed tests to fill test coverage gaps
• Automatic C++ to Verilog translation
  – single design database
  – reduce translation errors
  – faster turnaround of design changes
  – risk: untested methodology
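
To put the quoted simulation speed in perspective, the arithmetic below estimates how much wall-clock time is needed to simulate a given stretch of chip time. It takes the ~1000 clocks/second figure from this slide and the 500 MHz core clock quoted for the P8 configuration elsewhere in the talk; the function and constant names are illustrative.

```python
SIM_CLOCKS_PER_SECOND = 1_000   # "~1000 clocks/second" from the slide
CORE_CLOCK_HZ = 500e6           # 500 MHz core clock quoted for P8

def wall_clock_seconds(target_seconds, core_hz=CORE_CLOCK_HZ,
                       sim_rate=SIM_CLOCKS_PER_SECOND):
    """Wall-clock time needed to simulate `target_seconds` of chip time."""
    return target_seconds * core_hz / sim_rate

one_ms = wall_clock_seconds(1e-3)
print(f"1 ms of chip time takes about {one_ms:.0f} s "
      f"({one_ms / 60:.1f} min) to simulate")
```

At these rates a single simulator instance covers only about a millisecond of chip time every eight minutes or so, which suggests why test selection and coverage matter even with the faster C++ simulation.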