FEC: 1 Oct 26, 2005 Custom vs Commodity Processors Bill Dally October 26, 2005
FEC: 1 Oct 26, 2005
Custom vs Commodity Processors
Bill DallyOctober 26, 2005
FEC: 2 Oct 26, 2005
95% of top 500 machines use commodity processors
Why?
FEC: 3 Oct 26, 2005
Why 95% Commodity
• Are they faster?• Are they more cost effective?• Do they “ride the Moore’s law curve”?
• Or is it just the “easy path”
FEC: 4 Oct 26, 2005
Definitions
Custom Processor: Processor built specifically for high-end scientific computing. Incorporates high-bandwidth memory system, latency hiding mechanisms, and ability to exploit data- and thread-level parallelism. Intended to scale to large numbers of processors.
Commodity Processor: Processor build primarily for mass market – workstation and enterprise database/web server. Incorporates cache-based memory system, and ability to exploit instruction-level parallelism
FEC: 5 Oct 26, 2005
Objective
Capacity: Deliver maximum sustained performance per lifetime $ at modest scale ($1M-10M)
Capability: Deliver maximum sustained performance per lifetime $ at large scale (>$100M).
(You pay for scalability, but not necessarily at small scales)
FEC: 6 Oct 26, 2005
Custom vs. Commodity – Pros and Cons
• Custom+ High bandwidth memory
sys– 2-8x raw bandwidth
+ High bandwidth gather– 16x bandwidth on irregular
acc.
+ Latency hiding– 100s of outstanding refs– 10x bandwidth*
+ Data/Thread parallelism– 100s of FPUs per chip (20x)
- Lower frequency (0.5x)– Circuits and process
- Non recurring costs– 10K to 100K units
Commodity+ Better process
– 1.4x freq
+ Better circuits– 1.4x freq
+ Amortized development costs– 100M units for desktops– 1M units for servers
- 128-byte memory access– 1/16 performance on gather
- 4-8 outstanding memory refs– Need 100s to hide latency
- Little DP or TLP- 2-4 FPUs
FEC: 7 Oct 26, 2005
Balance in Machine Design
• Each phase of an application is limited by one of:– Arithmetic bandwidth– Local memory bandwidth– Global memory bandwidth (network bandwidth)
Local Memory
Processor(FPUs)
Network
FEC: 8 Oct 26, 2005
Technology makes arithmetic cheap and bandwidth expensive
90nm Chip$2001GHz
64-bit FPU(to scale)
12mm
0.5mm
1 clock50pJ/FLOP
100s of FPUs per chip$0.50/GFLOPS50mW/GFLOPS
40GB/s off-chip BW$5/GB/s
0.25W/GB/s
Cost of BW increases with distance
4x over backplane6x over cable
25x VSR optics
FEC: 9 Oct 26, 2005
Cost is dominated by bandwidth (and memory)
• Arithmetic is cheap $0.50/GFLOPS, – (200GFLOPS chips)
• Memory is $200/GByte, ~$10/GB/s– 1GByte of memory costs 400GFLOPS– 1GB/s of bandwidth costs 20GFLOPS
• Global bandwidth moderate cost– $1 (board), $4 (backplane), $25 (fiber)
per GB/s– 2GFLOPS (board), 8GFLOPS (backplane),
50GFLOPS (global)
FEC: 10 Oct 26, 2005
Recurring vs. Non-recurring costs
• Developing a custom processor costs $5-10M– Several examples– Quotes on Merrimac processor from 3 vendors ($6M)– Standard-cell design with semi-custom datapaths– Two mask sets in 90nm or 65nm
• Recurring costs $100-200 per unit• Overall costs depend on volume
– $1,200 per processor for 10K processors (20PFLOPS)– $300 per processor for 100K processors
• Costs $10M for the first one, then $200 per node
• NRE less than 10% the cost of a $100M Capability machine
FEC: 11 Oct 26, 2005
Frequency can be misleading
• Commodity processors operate at 3+ GHz– The result of aggressive process and circuit design.
• However, what matters is– Total arithmetic performance
• 200 FPUs at 1GHz (200GF) is better than 2 FPUs at 3GHz (6GF)
– Latency around critical loops• Roughly the same
– Memory bandwidth + latency hiding• Much better for custom
– Performance per unit power• Better for custom
• Bottom line– A custom processor at 1GHz may greatly outperform
a commodity processor at 3GHz (20x or more)
FEC: 12 Oct 26, 2005
Example: The Merrimac Stream Processor
• 64 64-bit MULADD FPUs– Arranged in 16 clusters
• Capable memory system• Designed for reliability• 1 GHz in 90nm
– 128 GFLOPS
• Area efficient– ~150mm2 in 90nm– Pentium 4 is ~120mm2 in
90nmbut only 6.4 GFLOPS
• Efficient at ~25W– Pentium 4 is 100W– 28% of energy in
ALUs
Merrimac processor is tuned for scientific computing
MIPS6420kc
(sample)
MIPS6420kc
(sample)
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Microcontroller
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Clu
ste
r
Network Interface
16 D
RAM
Inte
rface
s
Forw
ard
EC
C
Cache Bank
Cache Bank
Cache Bank
Cache Bank
Cache Bank
Cache Bank
Cache Bank
Cache Bank
Mem
ory
Sw
itch Ad
drG
en
Reord
er B
uffer
Ad
drG
en
Reord
er B
uffer
12.5 mm
12.5
mm
Merrimac Bandwidth Hierarchy
Register hierarchy and data-parallel executionenable high performance and efficiency – 128 GFLOPS ~25
W
~61 KB
6464-bitMADDs
(16 clusters)
100wire
cluste
r switch
cluste
r switch
SR
F lane
SR
F lane
1Kswitch
1 MB
3,840 GB/s512 GB/s
Inte
r-cluste
r and m
em
ory
switch
es
10Kswitch
cach
e b
an
kcach
e b
an
k
DR
AM
ban
kD
RA
M b
an
k
I/O p
ins
chipcrossing
512 KB2 GB
64 GB/s<64 GB/s
High memory bandwidth provided
64 GB/s sequential access bandwidth
~61 KB
6464-bitMADDs
(16 clusters)
100wire
cluste
r switch
cluste
r switch
SR
F lane
SR
F lane
1Kswitch
1 MB
3,840 GB/s512 GB/s
Inte
r-cluste
r and m
em
ory
switch
es
10Kswitch
cach
e b
an
kcach
e b
an
k
DR
AM
ban
kD
RA
M b
an
k
I/O p
ins
chipcrossing
512 KB2 GB
64 GB/s<64 GB/s
Latency hiding via stream transfers
1,000s of words tranferred in one instruction
~61 KB
6464-bitMADDs
(16 clusters)
100wire
cluste
r switch
cluste
r switch
SR
F lane
SR
F lane
1Kswitch
1 MB
3,840 GB/s512 GB/s
Inte
r-cluste
r and m
em
ory
switch
es
10Kswitch
cach
e b
an
kcach
e b
an
k
DR
AM
ban
kD
RA
M b
an
k
I/O p
ins
chipcrossing
512 KB2 GB
64 GB/s<64 GB/s
FEC: 16 Oct 26, 2005
SRF Decouples Execution from Memory
Decoupling allows efficient SRF allocationDecoupling allows efficient scheduling of instructions on
FPUs
cluste
r switch
cluste
r switch
SR
F lane
SR
F lane
Inte
r-cluste
r and m
em
ory
switch
es
cach
e b
an
kcach
e b
an
k
DR
AM
ban
kD
RA
M b
an
k
I/O p
ins
Unpredictable I/O Latencies Static latencies
FEC: 17 Oct 26, 2005
Enables high utilization of large numbers of FPUs
0
10
20
30
40
50
60
70
80
90
100
110
120
0
10
20
30
40
50
60
70
80
90
100
110
120
20
30
40
50
60
70
80
90
100
110
120
20
30
40
50
60
70
80
90
100
110
120
ComputeCellInt kernel from StreamFem3D
Over 95% of peak with simple hardware
Depends on explicit communication to make delays predictable
One iteration SW Pipeline
FEC: 18 Oct 26, 2005
And efficient use of on-chip storage
StreamFEM application
Prefetching, reuse, use/def, limited spilling
ComputeFlux
States
ComputeNumerical
Flux
ElementFaces
GatheredElements
NumericalFlux
GatherCell
ComputeCell
Interior
AdvanceCell
Elements(Current)
Elements(New)
Read-Only Table Lookup Data(Master Element)
FaceGeometry
CellOrientations
CellGeometry
FEC: 19 Oct 26, 2005
Merrimac Application Results
Simulated on a machine with 64GFLOPS peak performance and no fused MADD* The low numbers are a result of many divide and square-root operations
Application SustainedGFLOPS
FP Ops / Mem Ref
LRF Refs SRF Refs Mem Refs
StreamFEM3D(Euler, quadratic)
31.6 17.1 153.0M(95.0%)
6.3M(3.9%)
1.8M(1.1%)
StreamFEM3D(MHD, constant)
39.2 13.8 186.5M(99.4%)
7.7M(0.4%)
2.8M(0.2%)
StreamMD(grid algorithm)
14.2* 12.1* 90.2M(97.5%)
1.6M(1.7%)
0.7M(0.8%)
GROMACS 38.8* 9.7* 108M(95.0%)
4.2M(2.9%)
1.5M(1.3%)
StreamFLO 12.9* 7.4* 234.3M(95.7%)
7.2M(2.9%)
3.4M(1.4%)
Applications achieve high performance and make good use of the bandwidth hierarchy
FEC: 20 Oct 26, 2005
What about software?
• Software costs are typically much greater than hardware costs– Particularly applications software
• Custom processors can make software easier– High local and global bandwidth– Less sensitivity to “cache issues”– Fewer “performance surprises”
• Compilers for custom processors are not difficult– Leverage existing compiler infrastructure
• However, little application software is written to take advantage of such processors– MPI encourages LCD applications
FEC: 21 Oct 26, 2005
Summary
• Custom processors can provide more sustained performance per $ than commodity processors.– Better at Capability and Capacity
• Bandwidth is expensive and scarce• Tailor memory system to characteristics of scientific
applications– Lots of bandwidth (64 GB/s per node)– Latency hiding (1000s of cycles)– Good gather/scatter performance (about 20GB/s per node)
• Provide explicit on-chip storage hierarchy to reduce BW demand and enable FPU scheduling
• Overprovision inexpensive FPUs• Design in RAS for large configurations• Can deliver 128GFLOPS node for $1K (parts cost)
– 1 PFLOPS at 8K nodes and $8M