TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data Satoshi Matsuoka Professor Global Scientific Information and Computing (GSIC) Center Tokyo Institute of Technology Fellow, Association for Computing Machinery (ACM) Rakuten Technology Conference 2013 2013/10/26 Tokyo, Japan
Supercomputers from the Past
Fast, Big, Special, Inefficient, Evil device to conquer the world…
Let us go back to the mid '70s: birth of "microcomputers" and arrival of commodity computing (start of my career)
• Commodity 8-bit CPUs…
• Led to hobbyist computing…
  – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, …
  – System kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, …
• … and led to early personal computers
  – Commodore PET, Tandy TRS-80, Apple II
  – (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
Supercomputing vs. Personal Computing in the late 1970s
• Hitachi Basic Master (1978)
  – "The first PC in Japan"
  – Motorola 6802 @ 1 MHz, 16KB ROM, 16KB RAM
  – Linpack in BASIC: approx. 70-80 FLOPS (1/1,000,000 of a Cray-1)
• We got "simulation" done (in assembly language)
  – Nintendo NES (1983)
    • MOS Technology 6502 @ 1 MHz (same as the Apple II)
  – "Pinball" by Matsuoka & Iwata (Iwata is now CEO of Nintendo)
    • Realtime dynamics + collision + lots of shortcuts
    • Average ~a few KFLOPS
• Cf. Cray-1 (1976): 80-90 MFlops (est.) running Linpack
Then things got accelerated from the mid '80s to the mid '90s (rapid commoditization towards what we use now)
• PC CPUs: Intel 8086/286/386/486/Pentium (superscalar & fast FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phis
  – C.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, …
• Storage evolution: cassettes and floppies to HDDs, optical disks, and now Flash
• Network evolution: RS-232C to Ethernet, now to FDR InfiniBand
• PC (incl. I/O): IBM PC "clones" and Macintoshes: ISA to VLB to PCIe
• Software evolution: CP/M to MS-DOS to Windows, Linux, …
• WAN evolution: RS-232 + modem + BBS, to modem + Internet, to ISDN/ADSL/FTTH broadband, DWDM backbones, LTE, …
• Internet evolution: email + ftp to Web, Java, Ruby, …
• Then clusters, Grid/Clouds, 3-D gaming, and the Top500 all started in the mid '90s(!), and commoditized supercomputing
Modern Day Supercomputers
Now supercomputers "look like" IDC servers
• High-end COTS dominate
• Linux-based machines with a standard + HPC OSS software stack
TSUBAME Wins Awards…
• "Greenest Production Supercomputer in the World", the Green 500, Nov. 2010 & June 2011 (#4 on the Top500, Nov. 2010); 3 times more power efficient than a laptop!
• ACM Gordon Bell Prize 2011, 2.0 Petaflops Dendrite Simulation
  – Special Achievements in Scalability and Time-to-Solution: "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer"
• Commendation for Science and Technology by the Ministry of Education 2012 (文部科学大臣表彰), Prize for Science & Technology, Development Category: Development of the Greenest Production Peta-scale Supercomputer
  – Satoshi Matsuoka, Toshio Endo, Takayuki Aoki
• Precise Bloodflow Simulation of an Artery on TSUBAME2.0
• Checkpointing, fault prediction, hybrid algorithms
• Scientific "Extreme" Big Data: ultra-fast I/O, Hadoop acceleration, large graphs
• New memory systems: pushing the envelope of low power vs. capacity vs. BW; exploit the deep hierarchy with new algorithms to decrease Bytes/Flops
• Post-petascale programming: OpenACC and other many-core programming substrates, task parallelism
• Scalable algorithms for many-core: apps/system/HW co-design
Bayesian fusion of model and measurements
• Bayes model and prior distributions (the model-estimated execution time $x_i$ serves as the prior mean):
  – $y_i \sim N(\mu_i, \sigma_i^2)$ (measured execution-time data)
  – $\mu_i \mid \sigma_i^2 \sim N(x_i, \sigma_i^2/n_0)$ (execution time estimated by the model)
  – $\sigma_i^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$
• Posterior predictive distribution after $n$ measurements $y_{i1}, \ldots, y_{in}$:
  – $\mu_i \mid (y_{i1}, \ldots, y_{in}) \sim t_{\nu_n}(\mu_n, \sigma_n^2/n_n)$
  – $n_n = n_0 + n$, $\nu_n = \nu_0 + n$, $\mu_n = (n_0 x_i + n \bar{y}_i)/(n_0 + n)$
  – $\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + \sum_{m=1}^{n} (y_{im} - \bar{y}_i)^2 + \frac{n_0 n}{n_0 + n}(\bar{y}_i - x_i)^2$
  – $\bar{y}_i = \frac{1}{n} \sum_{m=1}^{n} y_{im}$
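As a hypothetical numeric illustration of this fusion (all values below are made up): the model estimate acts as a prior mean with pseudo-count n0, and the posterior mean after n measurements is the weighted average of the model prediction and the measured mean.

```python
import statistics

# Hypothetical numbers illustrating the Bayesian fusion of a cost-model
# estimate with measured execution times (all values are made up).
x_model = 12.0                       # execution time predicted by the model (s)
n0 = 4                               # prior strength in pseudo-observations (assumed)
y = [10.1, 9.8, 10.4, 10.0, 9.9]     # measured execution times (s)

n = len(y)
y_bar = statistics.fmean(y)          # sample mean of the measurements
# Posterior mean: weighted average of model estimate and measured mean.
mu_n = (n0 * x_model + n * y_bar) / (n0 + n)
print(f"model = {x_model:.2f} s, measured mean = {y_bar:.2f} s, fused = {mu_n:.2f} s")
```

With few measurements the fused estimate stays close to the model prediction; as n grows, the measurements dominate.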
ABCLibScript: algorithm selection

!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
      Target 1 (Algorithm 1)
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
      Target 2 (Algorithm 2)
!ABCLib$ select sub region end
!ABCLib$ static select region end

• "static select region": specifies auto-tuning before execution and the algorithm-selection process
• Input variables used in the cost-definition functions: CacheS, NB, NPrc
• Cost-definition functions: the expressions after "according estimated"
• Target regions 1 and 2: the candidate algorithms
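The selection these directives express can be mimicked in plain code: evaluate each cost-definition function with the input variables (CacheS, NB, NPrc) and run the algorithm with the lower estimated cost. A minimal sketch in Python; the function names and sample values are hypothetical:

```python
import math

# Hypothetical sketch of the algorithm selection that the ABCLibScript
# directives express: evaluate each cost-definition function and pick
# the algorithm with the lower estimated cost.
def cost_alg1(CacheS, NB, NPrc):
    # (2.0d0*CacheS*NB)/(3.0d0*NPrc) from the first sub region
    return (2.0 * CacheS * NB) / (3.0 * NPrc)

def cost_alg2(CacheS, NB, NPrc):
    # (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc) from the second sub region
    return (4.0 * CacheS * math.log(NB)) / (2.0 * NPrc)

def select_algorithm(CacheS, NB, NPrc):
    c1 = cost_alg1(CacheS, NB, NPrc)
    c2 = cost_alg2(CacheS, NB, NPrc)
    return ("algorithm1", c1) if c1 <= c2 else ("algorithm2", c2)

# Sample input variables (made-up values):
print(select_algorithm(CacheS=256.0, NB=64, NPrc=16))
```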
JST-CREST "Ultra Low Power (ULP)-HPC" Project 2007-2012
[Diagram: power optimization using novel components in HPC (MRAM/PRAM/Flash, ultra multi-core: slow & parallel & ULP, ULP-HPC SIMD-vector such as GPGPU, ULP-HPC networks) combined with power-aware and optimizable applications, performance models & algorithms, and auto-tuning for performance & power; power-vs-performance plot marking the optimization point: x10 power efficiency, x1000 improvement in 10 years]
Aggressive Power Saving in HPC: Methodologies

Methodology                               | Enterprise/Business Clouds  | HPC
Server Consolidation                      | Good                        | NG!
DVFS (Dynamic Voltage/Frequency Scaling)  | Good                        | Poor
New Devices                               | Poor (Cost & Continuity)    | Good
New HW & SW Architecture                  | Poor (Cost & Continuity)    | Good
Novel Cooling                             | Limited (Cost & Continuity) | Good (high thermal density)
How do we achieve x1000?
Process shrink: x100
× Many-core GPU usage: x5
× DVFS & other low-power SW: x1.5
× Efficient cooling: x1.4
≈ x1000 !!!
ULP-HPC Project 2007-12 → Ultra Green Supercomputing Project 2011-15
Statistical Power Modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically from GPU performance counters
• Linear regression model using performance counters as explanatory variables: estimated power $= \sum_{i=1}^{n} c_i p_i$, where the $p_i$ are counter values and the $c_i$ regression coefficients
• Prevents overtraining by ridge regression
• Determines optimal parameters by cross validation
• Trained against average power consumption from a high-resolution power meter
• High accuracy (avg. err. 4.7%); accurate even with DVFS; a linear model shows sufficient accuracy
• Future: model-based power optimization; possibility of optimizing Exascale systems with O(10^8) processors
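A minimal sketch of such a model, with synthetic data standing in for real GPU performance counters (all sizes and coefficients below are made up): fit the linear model power = Σ cᵢ·pᵢ by closed-form ridge regression and check the average error.

```python
import numpy as np

# Synthetic stand-in for GPU performance-counter data (made-up values):
# each row is one measurement interval, each column a normalized counter rate.
rng = np.random.default_rng(0)
n_samples, n_counters = 200, 8
X = rng.random((n_samples, n_counters))
true_c = np.array([30, 12, 8, 5, 3, 2, 1, 0.5])        # hypothetical coefficients (W)
power = X @ true_c + rng.normal(0, 1.0, n_samples)      # "measured" power + meter noise

# Closed-form ridge regression: c = (X^T X + lambda I)^-1 X^T y.
# The penalty lambda guards against overtraining, as on the slide.
lam = 0.1
c_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_counters), X.T @ power)

pred = X @ c_hat
avg_err = np.mean(np.abs(pred - power)) / np.mean(power) * 100
print(f"average relative error: {avg_err:.1f}%")
```

In practice the counters would come from a profiler and the target from a high-resolution power meter; the point of the sketch is that a plain linear model recovers the coefficients well.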
Power Efficiency in Dendrite Applications, on TSUBAME1.0 through the JST-CREST ULP-HPC prototype, running the Gordon Bell Dendrite App
Infrastructure Technologies Towards Yottabyte/Year
Principal Investigator: Satoshi Matsuoka
Global Scientific Information and Computing Center, Tokyo Institute of Technology
The current "Big Data" are not really that big…
• Typical "real" definition: "mining people's privacy data to make money"
• Corporate data usually sit in data-warehouse silos → limited volume: gigabytes to terabytes, seldom petabytes
• Processing involves simple O(n) algorithms, or those that can be accelerated with DB-inherited indexing algorithms
• Executed on re-purposed commodity "web" servers linked with 1Gbps networks running Hadoop/HDFS
• Vicious cycle of stagnation in innovations…
• NEW: breaking down of silos ⇒ convergence of supercomputing with Extreme Big Data
But "Extreme Big Data" will change everything
• "Breaking down of silos" (Rajeeb Hazra, Intel VP of Technical Computing)
• Already happening in science & engineering due to the Open Data movement
• More complex analysis algorithms: O(n log n), O(m × n), …
• Will become the NORM for competitiveness reasons
We will have tons of unknown genes
• Metagenome analysis: directly sequencing uncultured microbiomes obtained from the target environment and analyzing the sequence data
  – Finding novel genes from unculturable microorganisms
  – Elucidating the composition of species/genes in environments
• Examples of microbiomes: human body (gut microbiome), sea, soil
[Slide courtesy Yutaka Akiyama @ Tokyo Tech]
Results from the Akiyama group @ Tokyo Tech: ultra high-sensitive "big data" metagenome sequence analysis of the human oral microbiome
[Figure: metabolic pathway map; samples from inside the tooth row, outside the tooth row, and dental plaque]
• Required > 1 million node-hours on the K computer
• World's most sensitive sequence analysis (based on an amino-acid similarity matrix)
• Discovered at least three microbiome clusters with functional differences (integrating 422 experiment samples taken from 9 different oral parts)
• 572.8 M reads/hour on 82,944 nodes (663,552 cores) of the K computer (2012)
Extreme Big Data in Genomics
• Impact of new-generation sequencers: sequencing data (bp)/$ grows x4000 every 5 years; c.f. HPC: x33 in 5 years
• Lincoln Stein, Genome Biology, vol. 11(5), 2010
[Slide courtesy Yutaka Akiyama @ Tokyo Tech]
Extremely "Big" Graphs
• Large-scale graphs in various fields
  – US road network: 58 million edges
  – Twitter follow-ship: 1.47 billion edges
  – Neuronal network: 100 trillion edges
Towards Continuous Billion-Scale Social Simulation with Real-Time Streaming Data (Toyotaro Suzumura / IBM & Tokyo Tech)
• Application target area: the planet (Open Street Map), 7 billion people
• Input data
  – Road network (Open Street Map) for the planet: 300 GB (XML)
  – Trip data for 7 billion people: 10 KB (1 trip) × 7 billion = 70 TB
  – Real-time streaming data (e.g. social sensors, physical data)
• Simulated output for 1 iteration: 700 TB
Graph500 "Big Data" Benchmark
• Kronecker graph BFS (breadth-first search) problem
• Generator parameters A: 0.57, B: 0.19, C: 0.19, D: 0.05
• November 15, 2010, "Graph 500 Takes Aim at a New Kind of HPC", Richard Murphy (Sandia NL, now Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of the list."
• Reality: Top500 supercomputers dominate; no cloud IDCs at all
  – TSUBAME2.0: #3 (Nov. 2011), #4 (Jun. 2012)
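The Kronecker generator behind the benchmark can be sketched in a few lines: each edge is drawn by recursively picking one of the four adjacency-matrix quadrants with the probabilities above. This is a simplified illustration in the spirit of the Graph500 R-MAT-style generator, not the reference implementation; the edgefactor of 16 matches the benchmark's default.

```python
import random

# Graph500-style Kronecker initiator probabilities from the slide.
A, B, C, D = 0.57, 0.19, 0.19, 0.05

def kronecker_edge(scale, rng=random):
    """Draw one edge (u, v) in a 2^scale-vertex graph by choosing a
    quadrant of the adjacency matrix at each of `scale` recursion levels."""
    u = v = 0
    for _ in range(scale):
        r = rng.random()
        u <<= 1
        v <<= 1
        if r < A:              # top-left quadrant: neither bit set
            pass
        elif r < A + B:        # top-right: destination bit set
            v |= 1
        elif r < A + B + C:    # bottom-left: source bit set
            u |= 1
        else:                  # bottom-right: both bits set
            u |= 1
            v |= 1
    return u, v

# Graph500 default edgefactor is 16: edges = 16 * 2^scale.
scale, edgefactor = 10, 16
edges = [kronecker_edge(scale) for _ in range(edgefactor * 2**scale)]
print(len(edges), "edges in a", 2**scale, "vertex graph")
```

The skew in A produces the heavy-tailed degree distribution that makes BFS on these graphs a very different workload from dense linear algebra.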
A Major Northern Japanese Cloud Datacenter (2013)
[Network diagram: 2× Juniper MX480 edge routers facing the Internet (10GbE, LACP), 2× Juniper EX8208 core zone switches (Virtual Chassis), and Juniper EX4200 switches per zone of 700 nodes, linked by 10GbE]
• 8 zones, 5600 nodes total; injection 1 Gbps/node; bisection 160 Gbps
• TSUBAME2.5: #1 in Japan, 17 Petaflops SFP
  – Template for future supercomputers and IDC machines
• TSUBAME3.0 in early 2016
  – New supercomputing leadership
  – Tremendous power efficiency, extreme big data, extremely high reliability
• Lots of background R&D for TSUBAME3.0 and towards Exascale
  – Green computing: ULP-HPC & TSUBAME-KFC
  – Extreme Big Data: convergence of HPC and IDC!
  – Exascale resilience
  – Programming with millions of cores
  – …