Specialized Supercomputers
Piero Vicini, INFN
Istituto Nazionale di Fisica Nucleare (Italian National Institute for Nuclear Physics)
Dedicated SuperComputing
• WHY
– The Scientific Case
– Custom vs Commodity
– Italian Experience
• APE project
• HOW TO
– The international scenario
– Petaflops machine
• Some ideas
• TOOLS
– EU funding
– National funding
SuperComputing: the Scientific Case
Large Scale numerical applications
– Astrophysics and Plasma Physics
• Today: 70-100 TF/s; 2009: >500 TF/s
• Dedicated architecture: Grape (Japan/Europe)
– High-Energy Physics (LQCD)
• Today: 10-50 TF/s, several projects; 2009: 500-1000 TF/s
• Dedicated architecture: APE (Europe), QCDOC (USA/UK)
– Weather, Climatology, Earth sciences
• Today: 10-30 TF/s; 2009: several projects totalling 200-300 TF/s of aggregate power
• Dedicated architecture: Earth Simulator (Japan)
– Life Sciences (molecular dynamics, protein folding, in silico drug design,…)
• Today:…., 2009-2010: > N*Petaflops
• Dedicated architecture: IBM Blue/Gene (USA)
– .........
Dedicated vs General Purpose Parallel Machine
• Processor level -> very well balanced architecture
– Computing unit designed to be very efficient on the kernels of (several) classes of applications
– Integration of “unusual” memory interfaces based on large register file, huge multiport, …
– Integration of an optimized interconnection network (low latency, high bandwidth)
[Table: QCD benchmarks. Eff. (H): 0.56, 0.53, 0.27*, 0.11, 0.42, 0.05; the machine labels of the six columns are missing]
* Communication overhead not included
Dedicated vs General Purpose Parallel Machine (2)
• System level: dense, safe and cheap systems
– Very high ratio of Flops/Watt
– Very high ratio of Flops/Volume
– Cost effective systems
• 0.5 €/Mflops
• Very low cost maintenance
[Figure: two charts featuring apeNEXT; plotted values 3670, 80, 46 and 3670, 72, 50972; axis and series labels are missing]
The Italian experience: APE project
Our line of home-made computers …
                        APE (1988)    APE100 (1993)   APEmille (1999)                       apeNEXT (2004)
Research team           Italian       Italian         European + industry (QSW, Eurotech)   European + industry (Eurotech)
Architecture            SIMD          SIMD            SIMD                                  SIMD++
Computing nodes         16            2048            2048                                  4096
Interconnect topology   flexible 1D   rigid 3D        flexible 3D                           flexible 3D
Memory size             256 MB        8 GB            64 GB                                 1 TB
Registers (word size)   64 (x32)      128 (x32)       512 (x32)                             512 (x64)
Clock speed             8 MHz         25 MHz          66 MHz                                200 MHz
Peak computing power    1 GFlops      100 GFlops      1 TFlops                              7 TFlops
apeNEXT architecture
• 3D mesh of computing nodes
• Custom VLSI processor - 200 MHz (J&T)
• 1.6 GFlops per node (complex “normal” operation, a*b+c)
• 256 MB (1 GB) memory per node
• First-neighbor communication network, “loosely synchronous”
• Y and Z links internal (backplane), X links on cables
• r = 8/16 => 200 MB/s per channel
• Scalable 25 GFlops -> 6 TFlops
– Processing Board: 4 x 2 x 2, ~26 GF
– Crate (16 PB): 4 x 8 x 8, ~0.5 TF
– Rack (32 PB): 8 x 8 x 8, ~1 TF
– Large systems: (8*n) x 8 x 8
• Linux PCs as host system
[Figure: processing board with 16 nodes (numbered 0-15), each a J&T processor with DDR memory; X+ links on cables, Y+ and Z+ links on the backplane (bp)]
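The per-node and per-system peak figures on this slide follow from simple arithmetic. The sketch below is only an illustration: it assumes that one complex a*b+c per clock cycle counts as 8 real floating-point operations (4 multiplies + 4 adds), and the function names are hypothetical. The exact products for the crate and rack come out somewhat below the rounded values quoted on the slide.

```python
CLOCK_HZ = 200e6        # J&T clock frequency quoted on the slide
FLOPS_PER_CYCLE = 8     # assumption: one complex a*b+c per cycle = 4 mul + 4 add


def node_peak_gflops(clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak GFlops of a single computing node."""
    return clock_hz * flops_per_cycle / 1e9


def mesh_peak_gflops(nx, ny, nz):
    """Peak GFlops of an nx x ny x nz mesh of computing nodes."""
    return nx * ny * nz * node_peak_gflops()


if __name__ == "__main__":
    # Mesh sizes taken from the slide; the last entry is (8*n) x 8 x 8 with n = 8.
    systems = {
        "Processing board": (4, 2, 2),
        "Crate (16 PB)": (4, 8, 8),
        "Rack (32 PB)": (8, 8, 8),
        "Large system": (64, 8, 8),
    }
    for name, dims in systems.items():
        nodes = dims[0] * dims[1] * dims[2]
        print(f"{name:17s} {nodes:5d} nodes  {mesh_peak_gflops(*dims):8.1f} GFlops")
```

With these assumptions a 4096-node system reaches about 6.5 TFlops, which matches the "6 TFlops" upper end of the scaling range.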
Evaluating the success of APE (1)
APEmille (2000), installations:
Italy 1365 GF
Germany 650 GF
UK 65 GF
France 16 GF
Total 2 TF
apeNEXT (2005):
Development costs = 2000 k€
– 1100 k€ VLSI NRE
– 250 k€ non-VLSI NRE
– 650 k€ prototype procurement
Manpower = 20 man-years
Mass production cost ~ 0.5 €/Mflops
Installations:
Italy 10.6 TF
Germany 8.0 TF
France 1.6 TF
Total 20.2 TF
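The totals on this slide can be cross-checked with a few sums. The sketch below uses only the numbers quoted above; the last function is an extrapolation (an assumption, not a figure from the slide) that prices the whole installed apeNEXT base at the quoted mass-production cost.

```python
# Installed peak performance per country, as quoted on the slide.
apemille_gf = {"Italy": 1365, "Germany": 650, "UK": 65, "France": 16}
apenext_tf = {"Italy": 10.6, "Germany": 8.0, "France": 1.6}

# apeNEXT development cost breakdown in kEuro, as quoted on the slide.
dev_costs_keuro = {
    "VLSI NRE": 1100,
    "non-VLSI NRE": 250,
    "prototype procurement": 650,
}

apemille_total_gf = sum(apemille_gf.values())    # slide rounds this to 2 TF
apenext_total_tf = sum(apenext_tf.values())
dev_total_keuro = sum(dev_costs_keuro.values())


def production_cost_meuro(tflops, euro_per_mflops=0.5):
    """Hypothetical hardware cost in MEuro at a given Euro/Mflops price."""
    mflops = tflops * 1e6          # 1 TFlops = 1e6 MFlops
    return mflops * euro_per_mflops / 1e6


if __name__ == "__main__":
    print(f"APEmille total: {apemille_total_gf} GF")
    print(f"apeNEXT total:  {apenext_total_tf:.1f} TF")
    print(f"Development:    {dev_total_keuro} kEuro")
    print(f"Installed base at 0.5 Euro/Mflops: "
          f"{production_cost_meuro(apenext_total_tf):.1f} MEuro")
```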
Evaluating the success of APE (2)
• Scientific, technological and social impacts:
– APE is the de facto standard in the European LQCD computing area
– Huge number of scientific and technological (HW, SW, architecture) papers
– Establishment of an international computing facility fully dedicated to scientific numerical computing
• Laboratorio di Calcolo apeNEXT: 12 TF installed, opening on February 8th
– Strategic opportunities to increase national (European) industry capability
• Eurotech-INFN collaboration -> HPC division, market expansion, international visibility
• Finmeccanica/QSW
– Training, dissemination and establishment of spin-off companies
• Atmel/Ipitec
• Nergal
• Digital Video
• Venere
What’s next after apeNEXT? Scenario
• In the future (2010), the computing platforms required for “large-scale” numerical applications will be of the order of PetaFlops
• The international scenario
– Today (www.top500.org):
• IBM Blue/Gene: dedicated architecture (very similar to APE…), N*100 TFlops
• Earth Simulator: N*10 TFlops
• PC cluster approach: N*10 TFlops
– Future (2010 and beyond):
• USA: IBM, Blue/Gene evolution, N*Petaflops
• Japan: NEC/Hitachi/University, 3 Petaflops for biotech and nanotech, custom silicon, custom interconnect
• Japan: Fujitsu, 3 Petaflops, cluster approach with optical interconnection
• Europe?
Brainstorming
• Silicon shrink
– apeNEXT: 0.18 um
– Today: 0.13 um
– Next years: 0.09 - 0.065 um
[Figure: Die area per FP node. x-axis: silicon process (um): 0.18, 0.13, 0.09, 0.06; y-axis: area (mm2), 0 to 90. Data labels: 78, 39, 19, 13]
Worst case: 6 computing nodes per chip (tiled architecture)
Brainstorming (2)
• Performance scaling
– Clock frequency scales with silicon process
– Power consumption decreases with silicon process (est. 0.3 W/Gflops)
– Architecture: multi-tile versus single-tile
[Figure: peak performance (GFs) vs silicon process (um)]
Silicon process (um)       0.18   0.13   0.09   0.06
Multi-tile perf (GFs)      1.6    4.8    13.1   24.0
Single-tile perf (GFs)     1.6    2.4    3.2    4.0
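The two chart series can be read together to see what the projection implies. The sketch below divides the multi-tile figure by the single-tile figure to get the implied number of tiles per chip, and derives an implied clock frequency under the assumption (not stated on this slide) that each tile performs 8 floating-point operations per cycle, as the 200 MHz, 1.6 GFlops apeNEXT node does. Function names are illustrative.

```python
PROCESS_UM = [0.18, 0.13, 0.09, 0.06]
MULTI_TILE_GF = [1.6, 4.8, 13.1, 24.0]    # chart values
SINGLE_TILE_GF = [1.6, 2.4, 3.2, 4.0]     # chart values
FLOPS_PER_CYCLE = 8                       # assumption: same as the apeNEXT node


def implied_tiles(multi_gf, single_gf):
    """Number of tiles per chip implied by the two performance figures."""
    return multi_gf / single_gf


def implied_clock_mhz(single_gf, flops_per_cycle=FLOPS_PER_CYCLE):
    """Clock frequency (MHz) implied by the single-tile peak performance."""
    return single_gf * 1e9 / flops_per_cycle / 1e6


if __name__ == "__main__":
    for um, multi, single in zip(PROCESS_UM, MULTI_TILE_GF, SINGLE_TILE_GF):
        print(f"{um:.2f} um: {implied_tiles(multi, single):4.1f} tiles, "
              f"{implied_clock_mhz(single):5.0f} MHz")
```

Under these assumptions the 0.06 um point corresponds to 6 tiles per chip, in line with the "6 computing nodes per chip" quoted for the tiled architecture.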
Brainstorming (3)
• “Smart memory architecture” and new “3D engineering”
– Large, hierarchical on-chip memory buffers -> reduction of components per board
– Processing board “sandwich” (stacked) -> surface-distributed network connectors
– 512 FP nodes per board
Worst case: factor 100 in 5 years…
[Figure: rack computing power (TFs, log scale from 0.1 to 1000) vs silicon process (um)]
Silicon process (um)                  0.18   0.13   0.09    0.06
3D Eng. multi-tile rack (TFs)         26.2   78.6   215.2   393.2
3D Eng. rack comp. power (TFs)        26.2   39.3   52.4    65.5
apeNEXT eq. rack comp. power (TFs)    0.8    1.2    1.6     2.0
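The rack figures in the chart are consistent with a simple product of boards, nodes per board and per-node performance. The sketch below is a reconstruction under one assumption that the slide does not state: a "3D engineering" rack holds 32 boards, the same board count as an apeNEXT rack. The 512 nodes per board and the per-node GFlops values come from the slides; the function name is hypothetical.

```python
NODES_PER_BOARD = 512    # "512 FP nodes per board" from the slide
BOARDS_PER_RACK = 32     # assumption: same board count as an apeNEXT rack


def rack_tflops(gflops_per_node,
                nodes_per_board=NODES_PER_BOARD,
                boards_per_rack=BOARDS_PER_RACK):
    """Peak TFlops of a rack built from identical nodes."""
    return gflops_per_node * nodes_per_board * boards_per_rack / 1000.0


if __name__ == "__main__":
    process_um = [0.18, 0.13, 0.09, 0.06]
    single_tile_gf = [1.6, 2.4, 3.2, 4.0]     # per-node values from the charts
    multi_tile_gf = [1.6, 4.8, 13.1, 24.0]
    for um, single, multi in zip(process_um, single_tile_gf, multi_tile_gf):
        # An apeNEXT-equivalent rack has 32 boards x 16 nodes = 512 nodes.
        apenext_eq = rack_tflops(single, nodes_per_board=16)
        print(f"{um:.2f} um: apeNEXT eq. {apenext_eq:5.2f} TF, "
              f"3D single-tile {rack_tflops(single):6.1f} TF, "
              f"3D multi-tile {rack_tflops(multi):6.1f} TF")
```

The products reproduce the chart values to within rounding, except at 0.09 um, where 13.1 GFlops per node gives about 214.6 TF against the 215.2 TF shown.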
PetaFlops class computer proposal
• Leverage European leadership in embedded processor technology
• European collaboration (research + industry) to design a new computing architecture for scientific and engineering numerical applications
• Parameters:
– (Less) dedicated architecture suitable for future grand-challenge applications
– 0.5/1 PetaFlops system (factor 50 better than apeNEXT)
– 300 W/TeraFlops
– 10 kEuro/TeraFlops (factor 50 better than apeNEXT)
– Programming environment to produce parallel code with very high efficiency
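The proposal parameters translate directly into a power and cost envelope. The sketch below is a back-of-the-envelope check, assuming the quoted 300 W/TeraFlops and 10 kEuro/TeraFlops hold for the full system; it also converts the apeNEXT production cost of 0.5 Euro/Mflops to kEuro/TeraFlops to confirm the quoted factor of 50. Function names are illustrative.

```python
def system_power_kw(pflops, watt_per_tflops=300.0):
    """Total power in kW for a system of the given size."""
    tflops = pflops * 1000.0
    return tflops * watt_per_tflops / 1000.0


def system_cost_meuro(pflops, keuro_per_tflops=10.0):
    """Total hardware cost in MEuro for a system of the given size."""
    tflops = pflops * 1000.0
    return tflops * keuro_per_tflops / 1000.0


def apenext_keuro_per_tflops(euro_per_mflops=0.5):
    """apeNEXT mass-production cost expressed in kEuro per TFlops."""
    return euro_per_mflops * 1e6 / 1e3     # 1 TFlops = 1e6 MFlops


if __name__ == "__main__":
    for size in (0.5, 1.0):
        print(f"{size} PFlops: {system_power_kw(size):.0f} kW, "
              f"{system_cost_meuro(size):.0f} MEuro")
    factor = apenext_keuro_per_tflops() / 10.0
    print(f"Cost improvement over apeNEXT: factor {factor:.0f}")
```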
Tools (1)
• EU level
– FP6 and beyond
• SHAPES (Scalable Software/Hardware Architecture Platform for Embedded Systems)
– FP6-2004-IST-4 2.3.4(viii) Advanced Computing Architectures
– Partners: INFN-Roma, ATMEL-Roma, ST, TIMA (FR), TARGET COMPILER (BE), …
– Target: technology R&D to study the feasibility of a 2 TF board in 4 years (tiled architecture, NoC, off-chip network and 3D-engineered multi-board system)
• HPC Europe Initiative
– Joint action at EU level (France, Germany, UK, Spain + Netherlands, Finland, Italy) to consolidate the European role in supercomputing applications and to ensure the availability of the most advanced supercomputer systems in the EU
– Main target: in 2010, 4 computing centres in Europe equipped with “general purpose” (-> non-European) supercomputers
– 800 MEuro (!) partially funded by the EU and national governments
Tools (2)
• National level
– PNR
• “High Performance Computing for scientific and engineering applications: architecture, hardware, development software and selected applications”
– Partners: INFN, EUROTECH, CNR (MI), CILEA, SISSA, UNI MI BICOCCA, UNI PADOVA
– Target: Petaflops supercomputer suitable for engineering and scientific applications
– 200 W/Teraflops, 10 kEuro/Teraflops
– Development cost (+ prototype procurement): 20 MEuro + 40 man-years
– Project duration: 4 years
Backup slides
apeNET: status and future activities
• Successfully tested a cluster of 16 PCs interconnected via apeNET
– Measured performance >800 MB/s (send-receive) per direction
• SW/HW/FW optimizations
– RDMA, network driver, LAM/MPI
• INFN Roma2 has funded a 128-node cluster (dual Xeon + apeNET)
– Delivery expected in September 2005
• Future activities:
– PCI-X -> PCI Express
– Integration of a uP core on FPGA
– Development of QCD and molecular dynamics applications
apeNEXT: the latest generation
• Loosely synchronous first-neighbor communication network
• Scalable system from 25 GF to 6 TF
– 16 processors per board (PB)
– Systems of 8x8x16 = 1024 nodes or 8x8x64 = 4096 nodes
• “Host system” built with PCs (Linux)
[Figure: processing board with 16 nodes (numbered 0-15), each a J&T processor with DDR memory; X+ links on cables, Y+ and Z+ links on the backplane (bp)]
• 3D lattice of 4096 “computing nodes” (6.5 TF)
– Custom VLSI processor - 200 MHz (J&T)
– 1.6 GFlops per node (a*b+c on complex data)
• 4Q03: !! apeNEXT running !!
apeNEXT
[Pictures: J&T chip layout; PCI host interface; processing board; Next Rack]
APENET
• Interconnection network for PC clusters with 3D toroidal topology
– apeLINK: PCI-X (133 MHz) board
• 6 LVDS links, bidirectional and full-duplex
• 700 MB/s per link per direction (-> 8.4 GByte/s)
• Links based on National Instr. SERDES
– Integrated routing and switching capability
– High bandwidth and low latency thanks to the adoption of a “lightweight” protocol
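A 3D toroidal topology with six links per node means that every node talks to its two nearest neighbours along each axis, with wrap-around at the edges. The sketch below illustrates that addressing scheme and the aggregate bandwidth arithmetic behind the 8.4 GByte/s figure. It is a generic illustration, not the apeNET routing logic; the function names and the 4x4x4 example size are assumptions.

```python
def torus_neighbors(coord, dims):
    """Return the six first neighbours of a node in a 3D torus.

    Order: X+, X-, Y+, Y-, Z+, Z-. Coordinates wrap around at the edges.
    """
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def aggregate_bandwidth_gbs(links=6, mb_per_s_per_direction=700, directions=2):
    """Aggregate per-board bandwidth in GByte/s over all links and directions."""
    return links * mb_per_s_per_direction * directions / 1000.0


if __name__ == "__main__":
    dims = (4, 4, 4)                       # hypothetical 64-node cluster
    print(torus_neighbors((0, 0, 0), dims))
    print(f"{aggregate_bandwidth_gbs():.1f} GByte/s")
```

Because of the wrap-around, a node on the edge of the lattice, such as (0, 0, 0), still has six distinct neighbours as long as every dimension has at least three nodes.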