China’s HPC development: a brief review and perspectives
Depei Qian, Beihang University / Sun Yat-sen University
International Symposium on Impact of Extreme Scale Computing, Tokyo, Japan, Nov. 2, 2017
Outline
• A brief review
• The new HPC key project in China
• Issues in exascale system development
Three 863 key projects on HPC
• 2002–2005: High Performance Computer and Core Software
  – Research on resource sharing and collaborative work
  – Grid-enabled applications in multiple areas
  – TFlops computers and the China National Grid (CNGrid) testbed
• 2006–2010: High Productivity Computer and Grid Service Environment
  – High productivity
    • application performance
    • efficiency in program development
    • portability of programs
    • robustness of the system
  – Emphasizing service features of the HPC environment
  – Developing peta-scale computers
• 2010–2016: High Productivity Computer and Application Service Environment
  – Developing 100 PF computers
  – Developing large-scale HPC applications
  – Upgrading CNGrid
High performance computers
• 2013: Tianhe-2
  – CPU + MIC heterogeneous accelerated architecture
  – 54.9 PF peak, 33.9 PF Linpack; No. 1 in Top500 six times from 2013 to 2015
  – Installed at the National Supercomputing Center in Guangzhou
  – Will be upgraded to 100 PF this year
• 2016: Sunway TaihuLight
  – Implemented with home-grown Shenwei many-core processors, 10 million cores in total
  – 125 PF peak, 93 PF Linpack; No. 1 in Top500 in June and Nov. 2016
  – Installed at the National Supercomputing Center in Wuxi
[Photos: Sunway BlueLight and Tianhe-2]
Tianhe-2 upgrade

Item                   | Tianhe-2                       | Tianhe-2A
Nodes                  | 16000 nodes, Intel CPU + KNC   | 17792 nodes, Intel CPU + Matrix-2000
Peak performance       | 54.9 Pflops                    | 94.97 Pflops
Interconnect           | 10 Gbps, 1.57 us               | 14 Gbps, 1 us
Memory                 | 1.4 PB                         | 3.4 PB
Storage                | 12.4 PB, 512 GB/s              | 19 PB, 1 TB/s
Energy efficiency      | 17.8 MW, 1.9 Gflops/W          | about 18 MW, >5 Gflops/W
Heterogeneous software | MPSS for Intel KNC             | OpenMP/OpenCL for Matrix-2000
Matrix-2000 accelerator
Chip specification
  – 4 super-nodes (SN)
  – 8 clusters per SN
  – 4 cores per cluster
  – Core
    • self-defined 256-bit vector ISA
    • 16 DP flops/cycle per core
  – Peak performance: 2.4576 TF @ 1.2 GHz
  – Peak power dissipation: ~240 W
  – Interface
    • 8 DDR4-2400 channels
    • x16 PCIe 3.0 EP port
4 SNs × 8 clusters × 4 cores × 16 flops/cycle × 1.2 GHz = 2.4576 Tflops
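The arithmetic above can be checked directly; the function and its parameter names are illustrative, with the figures taken from this slide.

```python
# Peak DP performance of the Matrix-2000 from the figures above:
# 4 super-nodes x 8 clusters x 4 cores x 16 flops/cycle x 1.2 GHz.
def matrix2000_peak_gflops(super_nodes=4, clusters=8, cores=4,
                           flops_per_cycle=16, ghz=1.2):
    total_cores = super_nodes * clusters * cores  # 128 cores
    return total_cores * flops_per_cycle * ghz    # Gflops

print(f"{matrix2000_peak_gflops() / 1000:.4f} Tflops")  # 2.4576 Tflops
```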
[Block diagram: four super-nodes (SN0–SN3), each containing 8 clusters of 4 cores, linked by an on-chip interconnection to the PCIe interface and four DDR4 channels]
Heterogeneous compute nodes
[Node diagram: two Intel Xeon CPUs linked by QPI, each driving a Matrix-2000 with DDR4 memory over x16 PCIe; a NIC on x16 PCIe connects to the interconnect, with GbE and IPMB/CPLD ports for management]
– Intel Xeon CPU ×2
– Matrix-2000 ×2
– Memory: 192 GB
– Interconnect: 14 Gbps proprietary network
– Peak performance: 5.34 Tflops
HPC environment
• 2016 – China National Grid (CNGrid), composed of 17 national supercomputing centers and HPC centers; world-leading computing resources
HPC applications
• 2016
  – HPC applications in many domains
  – 10-million-core parallelism reached; Gordon Bell Prize in 2016
  – Developed a number of application software packages, adopted by production systems
    • aircraft design
    • high-speed train design
    • oil & gas exploration
    • new drug discovery
    • ensemble weather forecasting
    • bio-informatics
    • car development
    • design optimization of large fluid machinery
    • electromagnetic computation
    • …
Problems identified
• Lack of a long-term national program for high performance computing
• Weak in core HPC technologies
  – processor/accelerator
  – novel devices (new memory, storage, and network)
  – large-scale parallel algorithms and program implementation
• Application software is the bottleneck
  – applications rely on imported commercial software
    • expensive
    • small-scale parallelism
    • restricted by export regulation
• Shortage of cross-disciplinary talent
  – not enough people with both domain and IT knowledge
• Lack of multi-disciplinary collaboration
Reform of research system in China
• The national research and development system is undergoing a reform
  – 100+ different national R&D programs/initiatives are merged into 5 tracks of national programs
    • Basic research program (NSFC)
    • Mega-science and technology programs
    • Key R&D program (former 863, 973, and enabling programs)
    • Enterprise innovation program
    • Facility/talent program
A new key project on HPC
• High performance computing has been identified as a priority subject under the key R&D program (track 3)
• Strategic studies and planning have been conducted since 2013
• A proposal on HPC in the 13th five-year plan was submitted in early 2015
• The key R&D project was approved in Oct. 2015 by a multi-government agency committee led by the MOST
Motivations
• The key value of exascale computers identified
  – Addressing grand challenge problems
    • energy shortage, pollution, climate change…
  – Enabling industry transformation
    • supporting development of important products: high-speed trains, commercial aircraft, automobiles…
    • promoting economic transformation
  – Serving social development and people’s well-being
    • new drug discovery, precision medicine, digital media…
  – Enabling scientific discovery
    • high energy physics, computational chemistry, new materials, astrophysics…
• Promoting the computer industry by technology transfer
• Developing HPC systems with self-controllable technologies
  – a lesson learnt from the recent embargo regulation
Major tasks
• Exa-scale computer development
  – R&D on novel architectures and key technologies of the exa-scale computer
  – Developing the exa-scale computer based on home-grown processors
  – Technology transfer to promote development of high-end servers
• HPC applications development
  – Basic research on exa-scale modeling methods and parallel algorithms
  – Developing high performance application software
  – Establishing the HPC application eco-system
• HPC environment development
  – Developing software and platforms for the national HPC environment
  – Upgrading the national HPC environment CNGrid
  – Developing service systems on the national HPC environment
• Each task will cover basic research, key technology development, and application demonstration
• Basic research
  – Novel high performance interconnect
    • theoretical work on the novel interconnect, based on the enabling technologies of 3D chips, silicon photonics, and on-chip networks
  – Programming & execution models for exa-scale systems
    • new programming models for heterogeneous systems
    • improving programming efficiency
Task 1: Exa-scale computer development
• Key technology
  – prototype systems for verifying the exa-scale system technologies
    • 3 typical applications to verify the design
  – exa-scale computer technologies
    • architecture optimized for multiple objectives
    • highly efficient computing nodes
    • high performance processor/accelerator design
    • exa-scale system software
    • scalable interconnect
    • parallel I/O
    • exa-scale infrastructure
    • energy efficiency
    • exa-scale system reliability
Task 1: Exa-scale computer development
• Exa-scale computer development targets
  – exaflops peak performance
  – Linpack efficiency > 60%
  – 10 PB memory
  – EB-scale storage
  – 30 GF/W energy efficiency
  – interconnect > 500 Gbps
  – large-scale system management and resource scheduling
  – easy-to-use parallel programming environment
  – system monitoring and fault tolerance
  – support for large-scale applications
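The 30 GF/W target above implies a total power budget; a back-of-the-envelope sketch (the variable names are illustrative):

```python
# Implied power budget for an exaflops machine at the 30 GF/W target.
peak_flops = 1e18            # 1 exaflops
target_gf_per_w = 30.0       # energy-efficiency target above
power_mw = peak_flops / (target_gf_per_w * 1e9) / 1e6  # watts -> megawatts
print(f"{power_mw:.1f} MW")  # 33.3 MW
```

So meeting the efficiency goal keeps the machine in the tens-of-megawatts range rather than hundreds.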
Task 2: HPC application development
• Basic research
  – computable modeling and computational methods for exa-scale systems
  – scalable, highly efficient parallel algorithms and parallel libraries for exa-scale systems
• Key technology
  – programming frameworks for exa-scale software development
Task 2: HPC application development
• Application software
  – Numerical devices
    • numerical nuclear reactor
    • numerical aircraft
    • numerical earth system
    • numerical engine
  – High performance domain application software
    • complex engineering projects and critical equipment
    • numerical simulation of the ocean
    • design of energy-efficient large fluid machinery
    • drug discovery
    • electromagnetic environment simulation
    • ship design
    • oil exploration
    • digital media rendering
  – High performance application software for research
    • material science
    • high energy physics
    • astrophysics
    • life science
Task 2: HPC application development
• HPC application software development
  – establishing a national-level R&D center for HPC application software
  – building a platform for HPC software development and optimization
  – tools for performance/energy efficiency and pre-/post-processing
  – building a software resource repository
  – developing typical domain application software
  – a joint effort involving national supercomputing centers, universities, and institutes
Task 3: HPC environment development
• Basic research
  – models and architecture for computational services
  – virtual data space
• Key technology
  – mechanisms and a platform for the national HPC environment, providing technical support for service-mode operation
  – upgrading the national HPC environment (CNGrid)
Task 3: HPC environment development
• Services
  – integrated business platforms, e.g.
    • complex product design
    • HPC-enabled EDA platform
  – application villages
    • innovation and optimization of industrial products
    • drug discovery
    • SME computing and simulation platform
  – platform for HPC education
    • providing computing resources and services to undergraduate and graduate students
Projects supported
• The first call for proposals was issued in Feb. 2016; 19 projects supported
• The second call was issued in Oct. 2016; 18 projects supported, mainly application software
• The third call was issued in Oct. 2017; the review process will begin soon
Sugon exa-prototype: specification
Metric                    | Prototype | Exascale  | Ratio
Node peak (TF)            | 10        | 32        | 3.2
No. of nodes              | 512       | ≤ 32768   | 64
No. of silicon units      | 6         | ≤ 384     | 64
System peak (PF)          | 5.12      | ≥ 1024    | 200
Memory (PB)               | 0.065     | ≥ 10      | 153.8
Storage (PB)              | 10        | ≥ 100     | 10
Silicon switches          | 6         | ≤ 384     | 64
Global network dimensions | 2*1*3     | ≤ 8*8*6   | 4*8*2
Local network dimensions  | 2*3*2     | 2*3*2     | 1
Power consumption (MW)    | 0.5       | ≤ 30      | 60
Energy efficiency (GF/W)  | 10.24     | ≤ 34.13   | 3.33
Size W*D*H (m)            | 6*6*6     | ≤ 24*24*6 | 16
Total cabinets            | 27        | ≤ 400     | 25
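The efficiency rows in the table are just peak divided by power; a quick cross-check (function name is illustrative):

```python
# Since 1 Pflops / 1 MW = 1e15 / 1e6 = 1e9 flops/W = 1 Gflops/W,
# Gflops/W is numerically Pflops divided by MW.
def gflops_per_watt(peak_pflops, power_mw):
    return peak_pflops / power_mw

print(gflops_per_watt(5.12, 0.5))           # 10.24 (prototype row)
print(round(gflops_per_watt(1024, 30), 2))  # 34.13 (exascale row)
```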
Sugon exa-prototype: general design
• Computing sub-system
  – home-grown x86 processor + DCU accelerator in 2019
  – CPU > 1 TF, DCU > 15 TF
• Network sub-system
  – 400 Gbps 6D-torus, 384 routers
• Storage sub-system
  – distributed storage architecture, extensible to EB scale
• Infrastructure sub-system
  – immersive phase-change cooling
  – high voltage DC power supply
  – hierarchical 3D assembly
• Software sub-system
  – mature and complete libraries and programming tools
  – light-weight virtualization and software-defined architecture
Sugon exa-prototype: hierarchical 3D structure
Level         | Nodes per unit | Units in prototype | Units in exascale system
Node pair     | 2              | 256                | 16384
Super node    | 16             | 32                 | 2048
Silicon block | 96             | 6                  | 384
Silicon cubic | 512            | 1                  | 1
• Node: 2 CPUs and 2 DCUs; CPU and DCU interconnected by a GOP high-speed bus
• Memory bandwidth: DDR4-2667
• Memory capacity: ≥ 128 GB DDR4
• Interconnect: 200 Gbps fast fabric
[Node diagram: 4 CPUs and 4 DCUs; each CPU pairs with a DCU over 16x GOP×2 links and carries U/R/LR DDR4 DIMMs, XGKR×2 and PCIe 16x links, SATA/PCIe 4x M.2 drives, and BIOS; a midplane hosts the 2×200G NIC and AIU]
Sugon exa-prototype: computing node
Tianhe exa-prototype: flexible architecture
• Reconfigurable flexible architecture, meeting the requirements of different applications
• Virtualized OS, providing a configurable computing environment
• Software-defined interconnect, guaranteeing bandwidth and fault isolation
• Hierarchical storage QoS guarantee technology, providing stable and independent storage bandwidth
• Dynamic optimization, providing architecture-aware optimization
[Diagram: application / compiler / runtime / OS stack running over computing nodes, the computing sub-system, and the I/O storage sub-system]
Tianhe exa-prototype: technical route
[Chart: trading off performance, energy efficiency, and ease of use across many-core, special-purpose accelerator, and customized designs]
General-purpose many-core is adopted by the prototype.
Tianhe exa-prototype: technical features
• Flexible architecture to meet the requirements of different applications
• New-generation many-core processor, pursuing balanced computing and memory access
• Optoelectronic integrated high-speed interconnect, greatly improving performance and energy efficiency
• Fault tolerance based on new storage media
• Accurate heat dissipation, trading off manufacturing cost against operational cost
Tianhe exa-prototype: interconnect
• High-radix router for low power consumption, low cost, and high density
• Exascale communication need: single node > 400 Gbps
• Chip power budget < 200 W; at most 12 ports of 400 Gbps
• Co-design of ultra-short-distance SerDes PHY, PHY coding, and link layer
• Optoelectronic integration for the interconnect
Sunway exa-prototype: hardware system
[Photos: DC power supply system, water-cooling unit, two-level fat-tree interconnect, cold-plate node assemblies with enhanced heat transfer, new-generation many-core processors, and the compute cabinet]
• System composed of computing, interconnect, storage, power supply, and cooling sub-systems
• New-generation many-core based system; 512 nodes; performance > 4 Pflops
• Self-developed network chip, fat-tree interconnect, point-to-point bandwidth > 200 Gbps
• Storage sub-system based on Shenwei storage servers
• Self-developed high voltage (300 V) DC power supply
• Highly efficient water cooling with enhanced-heat-transfer copper cold plates
Sunway exa-prototype: computing node
[Node diagram: four core groups (0–3) with DDR4 memory channels, PCI-E network interfaces to the high-speed compute network and the Ethernet management network, and a BMC handling clock, processor, and power management plus node monitoring]
• Connection to the interconnect: 2 × 25 Gbps × 4
• Point-to-point one-way bandwidth: 200 Gbps
• Peak performance: > 8 Tflops
• Memory: > 64 GB
Sunway exa-prototype: software system
• Basic software for the home-grown many-core processor
  – parallel OS
  – high performance storage management system
  – parallel compiler
  – parallel program development environment
• Highly efficient compiler for heterogeneous many-core
• SIMD auto-vectorization
• High performance basic math libraries
• Integrated multi-domain OS for heterogeneous many-core
• Dynamic storage management
• Support for MPI-1, MPI-2, MPI-3, OpenMP 3.0; compatible with OpenACC 2.0
• Debugger for heterogeneous many-core
Sunway exa-prototype: demo applications
• Demo applications: ocean model, aircraft design, seismic processing, floating platform design
• Applications are being ported to TaihuLight, and performance optimization is being conducted
2016:
• Fully Implicit Solver for Atmospheric Dynamics
• Surface Wave Modeling
• Phase Field Simulations of Coarsening Dynamics
• Atomistic Simulation of Silicon Nanowires
• Run-away Electron Trajectory Simulation
• Genome Functional Annotation and Homeotic Gene Building
• Spacecraft CFD Numerical Simulation

2017:
• Extreme-scale Graph Processing Framework
• Simulation of Planetary Rings
• Simulations of Quantum Spin Liquid States via PEPS++
• Molecular Dynamics Simulation of Condensed Covalent Materials
• cryo-EM Macromolecule Structure Determination
• Redesigning CAM-SE
• Nonlinear Earthquake Simulation
Sunway exa-prototype: applications
10-million-core applications on TaihuLight
Major Challenges to exa-scale systems
• Power consumption
• Performance obtained by applications
• Programmability
• Resilience

• How to make tradeoffs among performance, power consumption, and programmability?
• How to achieve continuous non-stop operation?
• How to adapt to a wide range of applications with reasonable efficiency?
Architecture
• Novel architectures beyond the current heterogeneous accelerated / many-core-based designs are expected
• Co-processor or partitioned heterogeneous architecture?
  – low utilization of the co-processor in some applications, which use the CPU only
  – bottleneck in moving data between CPU and co-processor
• Application-aware architecture
  – on-chip integration of special purpose units (idea from Prof. Andrew Chien)
  – using the right tool to do the right things
  – dynamically reconfigurable? how to program it?
Memory system
• Pursuing large capacity, low latency, and high bandwidth
• Increasing capacity and lowering power consumption by combining DRAM and NVM
  – data placement issue
• Improving bandwidth and latency with 3D stacking technology
• Reducing data movement by placing data closer to processing
  – HBM/HMC near the processor
  – on-chip DRAM
  – simple functions in memory
• Reducing data copy cost with a unified memory space in heterogeneous architectures
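Why bandwidth matters as much as peak flops can be seen with a simple roofline estimate. The numbers below are illustrative assumptions, not specs of any machine in this talk:

```python
def attainable_gflops(peak_gflops, mem_bw_gb_s, flops_per_byte):
    # Roofline model: performance is capped by compute or by memory traffic.
    return min(peak_gflops, mem_bw_gb_s * flops_per_byte)

# A bandwidth-bound kernel at 0.25 flops/byte on a hypothetical 10 Tflops chip:
print(attainable_gflops(10_000, 200, 0.25))   # 50.0  (DDR-class, 200 GB/s)
print(attainable_gflops(10_000, 1000, 0.25))  # 250.0 (HBM-class, 1 TB/s)
```

For such kernels, moving from DDR-class to HBM-class bandwidth raises attainable performance fivefold while the peak is untouched.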
Interconnect
• Pursuing low latency, high bandwidth, and low energy consumption
• Adopting new technologies
  – silicon photonics communication between components
  – optical interconnect / communication
  – miniature optical devices
• High scalability adapting to exa-scale interconnect requirements
  – connecting 10,000+ nodes
  – low-hop, low-latency topology
  – reliable and intelligent routing
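The 10,000+ node requirement shapes the topology choice. As one sketch, the standard sizing formula for a folded-Clos (fat-tree), the topology used in the Sunway prototype, shows why high-radix routers are needed (this is a textbook formula, not a design figure from the talk):

```python
def fat_tree_hosts(k, levels=3):
    # Max end points of an l-level folded-Clos of radix-k switches: 2*(k/2)^l.
    return 2 * (k // 2) ** levels

print(fat_tree_hosts(24))  # 3456  -- too small for an exa-scale machine
print(fat_tree_hosts(48))  # 27648 -- comfortably over 10,000 nodes
```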
Programming the heterogeneous systems
• Addressing the issues in programming heterogeneous parallel systems
  – efficient expression of parallelism, dependence, data sharing, and execution semantics
  – problem decomposition appropriate for heterogeneous systems
• Improving programming by means of a holistic approach
  – new programming models
  – programming language extensions and compilers
  – parallel debugging
  – runtime support and optimization
  – architectural support
Computational models and algorithms
• Full-chain innovation
  – mathematical methods
  – computer algorithms
  – algorithm implementation and optimization
• A good mathematical method is often more effective than hardware improvement and algorithm optimization
• Architecture-aware algorithm implementation and optimization is necessary for heterogeneous systems
• Domain-specific libraries for improving software productivity and performance
Resilience
• Resilience is one of the key issues of the exa-scale system
  – large scale of the system
    • 50K to 100K nodes
    • huge number of components
    • very short MTBF
    • long non-stop operation required for solving large-scale problems
• Reliability measures required at different levels: device, node, and system
• Software / hardware coordination is necessary
  – fast context saving and recovery for checkpointing in case of short MTBF
  – fault tolerance at the algorithm and application software level
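The pressure a short MTBF puts on checkpointing can be quantified with Young's classic first-order approximation for the optimal checkpoint interval; the numeric inputs below are illustrative, not figures from the talk:

```python
import math

def optimal_checkpoint_interval_s(ckpt_cost_s, mtbf_s):
    # Young's approximation: interval = sqrt(2 * checkpoint_cost * MTBF).
    return math.sqrt(2 * ckpt_cost_s * mtbf_s)

# Illustrative: a 60 s global checkpoint on a system with a 1 h MTBF
# should checkpoint roughly every 11 minutes.
print(round(optimal_checkpoint_interval_s(60, 3600)))  # 657
```

As MTBF shrinks, the interval shrinks with its square root, so checkpoint overhead grows quickly; hence the slide's call for fast context saving and algorithm-level fault tolerance.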
Importance of the tools
• Development and optimization of large-scale parallel software requires scalable tools
• Particularly important for systems implemented with home-grown processors
  – current commercial and research tools do not support them
• Three kinds of tools required by default
  – parallel debugger for correctness
  – performance tuner for performance
  – energy optimizer for energy efficiency
Urgent need for eco-system
• The eco-system for exa-scale systems based on home-grown processors is urgently needed
  – languages, compilers, OS, runtime
  – tools
  – application development support
  – application software
• Need to attract hardware manufacturers and third-party software developers
  – a product family instead of a single machine
• Collaboration among industry, academia, and end-users is required