China’s HPC development: a brief review and perspectives
Depei Qian, Beihang University / Sun Yat-sen University
International Symposium on Impact of Extreme Scale Computing, Tokyo, Japan, Nov. 2, 2017
Outline
• A brief review
• The new HPC key project in China
• Issues in exascale system development
Three 863 key projects on HPC
• 2002–2005: High Performance Computer and Core Software
  – Research on resource sharing and collaborative work
  – Grid-enabled applications in multiple areas
  – TFlops computers and the China National Grid (CNGrid) testbed
• 2006–2010: High Productivity Computer and Grid Service Environment
  – High productivity
    • application performance
    • efficiency in program development
    • portability of programs
    • robustness of the system
  – Emphasizing service features of the HPC environment
  – Developing peta-scale computers
• 2010–2016: High Productivity Computer and Application Service Environment
  – Developing 100 PF computers
  – Developing large-scale HPC applications
  – Upgrading CNGrid
High performance computers
• 2013: Tianhe-2
  – CPU + MIC heterogeneous accelerated architecture
  – 54.9 PF peak, 33.9 PF Linpack; No. 1 in Top500 six times from 2013 to 2015
  – Installed at the National Supercomputing Center in Guangzhou
  – Will be upgraded to 100 PF this year
• 2016: Sunway TaihuLight
  – Implemented with home-grown Shenwei many-core processors, 10 million cores in total
  – 125 PF peak, 93 PF Linpack; No. 1 in Top500 in June and Nov. 2016
  – Installed at the National Supercomputing Center in Wuxi
[Photos: Sunway BlueLight and Tianhe-2]
Tianhe-2 upgrade

Item                   | Tianhe-2                       | Tianhe-2A
Nodes                  | 16000 nodes, Intel CPU + KNC   | 17792 nodes, Intel CPU + Matrix-2000
Peak performance       | 54.9 Pflops                    | 94.97 Pflops
Interconnect           | 10 Gbps, 1.57 us               | 14 Gbps, 1 us
Memory                 | 1.4 PB                         | 3.4 PB
Storage                | 12.4 PB, 512 GB/s              | 19 PB, 1 TB/s
Energy efficiency      | 17.8 MW, 1.9 Gflops/W          | about 18 MW, >5 Gflops/W
Heterogeneous software | MPSS for Intel KNC             | OpenMP/OpenCL for Matrix-2000
Matrix-2000 accelerator
Chip specification
  – 4 super-nodes (SN)
  – 8 clusters per SN
  – 4 cores per cluster
  – Core
    • self-defined 256-bit vector ISA
    • 16 DP flops/cycle per core
  – Peak performance: 2.4576 TF @ 1.2 GHz
  – Peak power dissipation: ~240 W
  – Interface
    • 8 DDR4-2400 channels
    • x16 PCIe 3.0 EP port
4 SNs × 8 clusters × 4 cores × 16 flops/cycle × 1.2 GHz = 2.4576 Tflops
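The arithmetic above can be checked directly; the function and its parameter names are illustrative, with the figures taken from this slide.

```python
# Peak DP performance of the Matrix-2000 from the figures above:
# 4 super-nodes x 8 clusters x 4 cores x 16 flops/cycle x 1.2 GHz.
def matrix2000_peak_gflops(super_nodes=4, clusters=8, cores=4,
                           flops_per_cycle=16, ghz=1.2):
    total_cores = super_nodes * clusters * cores  # 128 cores
    return total_cores * flops_per_cycle * ghz    # Gflops

print(f"{matrix2000_peak_gflops() / 1000:.4f} Tflops")  # 2.4576 Tflops
```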
[Block diagram: four super-nodes (SN0–SN3), each containing 8 clusters of 4 cores, linked by an on-chip interconnection to the PCIe interface and four DDR4 channels]
Heterogeneous compute nodes
[Node diagram: two Intel Xeon CPUs linked by QPI, each driving a Matrix-2000 with DDR4 memory over x16 PCIe; a NIC on x16 PCIe connects to the interconnect, with GbE and IPMB/CPLD ports for management]
– Intel Xeon CPU ×2
– Matrix-2000 ×2
– Memory: 192 GB
– Interconnect: 14 Gbps proprietary network
– Peak performance: 5.34 Tflops
HPC environment
• 2016 – China National Grid (CNGrid), composed of 17 national supercomputing centers and HPC centers; world-leading computing resources
HPC applications
• 2016
  – HPC applications in many domains
  – 10-million-core parallelism reached; Gordon Bell Prize in 2016
  – Developed a number of application software packages, adopted by production systems
    • aircraft design
    • high-speed train design
    • oil & gas exploration
    • new drug discovery
    • ensemble weather forecasting
    • bio-informatics
    • car development
    • design optimization of large fluid machinery
    • electromagnetic computation
    • …
Problems identified
• Lack of a long-term national program for high performance computing
• Weak in core HPC technologies
  – processor/accelerator
  – novel devices (new memory, storage, and network)
  – large-scale parallel algorithms and program implementation
• Application software is the bottleneck
  – applications rely on imported commercial software
    • expensive
    • small-scale parallelism
    • restricted by export regulation
• Shortage of cross-disciplinary talent
  – not enough people with both domain and IT knowledge
• Lack of multi-disciplinary collaboration
Reform of research system in China
• The national research and development system is undergoing a reform
  – 100+ different national R&D programs/initiatives are merged into 5 tracks of national programs
    • Basic research program (NSFC)
    • Mega-science and technology programs
    • Key R&D program (former 863, 973, and enabling programs)
    • Enterprise innovation program
    • Facility/talent program
A new key project on HPC
• High performance computing has been identified as a priority subject under the key R&D program (track 3)
• Strategic studies and planning have been conducted since 2013
• A proposal on HPC in the 13th five-year plan was submitted in early 2015
• The key R&D project was approved in Oct. 2015 by a multi-government agency committee led by the MOST
Motivations
• The key value of exascale computers identified
  – Addressing grand challenge problems
    • energy shortage, pollution, climate change…
  – Enabling industry transformation
    • supporting development of important products: high-speed trains, commercial aircraft, automobiles…
    • promoting economic transformation
  – Serving social development and people’s well-being
    • new drug discovery, precision medicine, digital media…
  – Enabling scientific discovery
    • high energy physics, computational chemistry, new materials, astrophysics…
• Promoting the computer industry by technology transfer
• Developing HPC systems with self-controllable technologies
  – a lesson learnt from the recent embargo regulation
Major tasks
• Exa-scale computer development
  – R&D on novel architectures and key technologies of the exa-scale computer
  – Developing the exa-scale computer based on home-grown processors
  – Technology transfer to promote development of high-end servers
• HPC applications development
  – Basic research on exa-scale modeling methods and parallel algorithms
  – Developing high performance application software
  – Establishing the HPC application eco-system
• HPC environment development
  – Developing software and platforms for the national HPC environment
  – Upgrading the national HPC environment CNGrid
  – Developing service systems on the national HPC environment
• Each task will cover basic research, key technology development, and application demonstration
• Basic research
  – Novel high performance interconnect
    • theoretical work on the novel interconnect, based on the enabling technologies of 3D chips, silicon photonics, and on-chip networks
  – Programming & execution models for exa-scale systems
    • new programming models for heterogeneous systems
    • improving programming efficiency
Task 1: Exa-scale computer development
• Key technology
  – prototype systems for verifying the exa-scale system technologies
    • 3 typical applications to verify the design
  – exa-scale computer technologies
    • architecture optimized for multiple objectives
    • highly efficient computing nodes
    • high performance processor/accelerator design
    • exa-scale system software
    • scalable interconnect
    • parallel I/O
    • exa-scale infrastructure
    • energy efficiency
    • exa-scale system reliability
Task 1: Exa-scale computer development
• Exa-scale computer development targets
  – exaflops peak performance
  – Linpack efficiency > 60%
  – 10 PB memory
  – EB-scale storage
  – 30 GF/W energy efficiency
  – interconnect > 500 Gbps
  – large-scale system management and resource scheduling
  – easy-to-use parallel programming environment
  – system monitoring and fault tolerance
  – support for large-scale applications
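The 30 GF/W target above implies a total power budget; a back-of-the-envelope sketch (the variable names are illustrative):

```python
# Implied power budget for an exaflops machine at the 30 GF/W target.
peak_flops = 1e18            # 1 exaflops
target_gf_per_w = 30.0       # energy-efficiency target above
power_mw = peak_flops / (target_gf_per_w * 1e9) / 1e6  # watts -> megawatts
print(f"{power_mw:.1f} MW")  # 33.3 MW
```

So meeting the efficiency goal keeps the machine in the tens-of-megawatts range rather than hundreds.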
Task 2: HPC application development
• Basic research
  – computable modeling and computational methods for exa-scale systems
  – scalable, highly efficient parallel algorithms and parallel libraries for exa-scale systems
• Key technology
  – programming frameworks for exa-scale software development
Task 2: HPC application development
• Application software
  – Numerical devices
    • numerical nuclear reactor
    • numerical aircraft
    • numerical earth system
    • numerical engine
  – High performance domain application software
    • complex engineering projects and critical equipment
    • numerical simulation of the ocean
    • design of energy-efficient large fluid machinery
    • drug discovery
    • electromagnetic environment simulation
    • ship design
    • oil exploration
    • digital media rendering
  – High performance application software for research
    • material science
    • high energy physics
    • astrophysics
    • life science
Task 2: HPC application development
• HPC application software development
  – establishing a national-level R&D center for HPC application software
  – building a platform for HPC software development and optimization
  – tools for performance/energy efficiency and pre-/post-processing
  – building a software resource repository
  – developing typical domain application software
  – a joint effort involving national supercomputing centers, universities, and institutes
Task 3: HPC environment development
• Basic research
  – models and architecture for computational services
  – virtual data space
• Key technology
  – mechanisms and a platform for the national HPC environment, providing technical support for service-mode operation
  – upgrading the national HPC environment (CNGrid)
Task 3: HPC environment development
• Services
  – integrated business platforms, e.g.
    • complex product design
    • HPC-enabled EDA platform
  – application villages
    • innovation and optimization of industrial products
    • drug discovery
    • SME computing and simulation platform
  – platform for HPC education
    • providing computing resources and services to undergraduate and graduate students
Projects supported
• The first call for proposals was issued in Feb. 2016; 19 projects supported
• The second call was issued in Oct. 2016; 18 projects supported, mainly application software
• The third call was issued in Oct. 2017; the review process will begin soon
Sugon exa-prototype: specification
Metric                    | Prototype | Exascale  | Ratio
Node peak (TF)            | 10        | 32        | 3.2
No. of nodes              | 512       | ≤ 32768   | 64
No. of silicon units      | 6         | ≤ 384     | 64
System peak (PF)          | 5.12      | ≥ 1024    | 200
Memory (PB)               | 0.065     | ≥ 10      | 153.8
Storage (PB)              | 10        | ≥ 100     | 10
Silicon switches          | 6         | ≤ 384     | 64
Global network dimensions | 2*1*3     | ≤ 8*8*6   | 4*8*2
Local network dimensions  | 2*3*2     | 2*3*2     | 1
Power consumption (MW)    | 0.5       | ≤ 30      | 60
Energy efficiency (GF/W)  | 10.24     | ≤ 34.13   | 3.33
Size W*D*H (m)            | 6*6*6     | ≤ 24*24*6 | 16
Total cabinets            | 27        | ≤ 400     | 25
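The efficiency rows in the table are just peak divided by power; a quick cross-check (function name is illustrative):

```python
# Since 1 Pflops / 1 MW = 1e15 / 1e6 = 1e9 flops/W = 1 Gflops/W,
# Gflops/W is numerically Pflops divided by MW.
def gflops_per_watt(peak_pflops, power_mw):
    return peak_pflops / power_mw

print(gflops_per_watt(5.12, 0.5))           # 10.24 (prototype row)
print(round(gflops_per_watt(1024, 30), 2))  # 34.13 (exascale row)
```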
Sugon exa-prototype: general design
• Computing sub-system
  – home-grown x86 processor + DCU accelerator in 2019
  – CPU > 1 TF, DCU > 15 TF
• Network sub-system
  – 400 Gbps 6D-torus, 384 routers
• Storage sub-system
  – distributed storage architecture, extensible to EB scale
• Infrastructure sub-system
  – immersive phase-change cooling
  – high voltage DC power supply
  – hierarchical 3D assembly
• Software sub-system
  – mature and complete libraries and programming tools
  – light-weight virtualization and software-defined architecture
Sugon exa-prototype: hierarchical 3D structure
Level         | Nodes per unit | Units in prototype | Units in exascale system
Node pair     | 2              | 256                | 16384
Super node    | 16             | 32                 | 2048
Silicon block | 96             | 6                  | 384
Silicon cubic | 512            | 1                  | 1
• Node: 2 CPUs and 2 DCUs; CPU and DCU interconnected by a GOP high-speed bus
• Memory bandwidth: DDR4-2667
• Memory capacity: ≥ 128 GB DDR4
• Interconnect: 200 Gbps fast fabric
[Node diagram: 4 CPUs and 4 DCUs; each CPU pairs with a DCU over 16x GOP×2 links and carries U/R/LR DDR4 DIMMs, XGKR×2 and PCIe 16x links, SATA/PCIe 4x M.2 drives, and BIOS; a midplane hosts the 2×200G NIC and AIU]
Sugon exa-prototype: computing node
Tianhe exa-prototype: flexible architecture
• Reconfigurable flexible architecture, meeting the requirements of different applications
• Virtualized OS, providing a configurable computing environment
• Software-defined interconnect, guaranteeing bandwidth and fault isolation
• Hierarchical storage QoS guarantee technology, providing stable and independent storage bandwidth
• Dynamic optimization, providing architecture-aware optimization
[Diagram: application / compiler / runtime / OS stack running over computing nodes, the computing sub-system, and the I/O storage sub-system]
Tianhe exa-prototype: technical route
[Chart: trading off performance, energy efficiency, and ease of use across many-core, special-purpose accelerator, and customized designs]
General-purpose many-core is adopted by the prototype.
Tianhe exa-prototype: technical features
• Flexible architecture to meet the requirements of different applications
• New-generation many-core processor, pursuing balanced computing and memory access
• Optoelectronic integrated high-speed interconnect, greatly improving performance and energy efficiency
• Fault tolerance based on new storage media
• Accurate heat dissipation, trading off manufacturing cost against operational cost
Tianhe exa-prototype: interconnect
• High-radix router for low power consumption, low cost, and high density
• Exascale communication need: single node > 400 Gbps
• Chip power budget < 200 W; at most 12 ports of 400 Gbps
• Co-design of ultra-short-distance SerDes PHY, PHY coding, and link layer
• Optoelectronic integration for the interconnect
Sunway exa-prototype: hardware system
[Photos: DC power supply system, water-cooling unit, two-level fat-tree interconnect, cold-plate node assemblies with enhanced heat transfer, new-generation many-core processors, and the compute cabinet]
• System composed of computing, interconnect, storage, power supply, and cooling sub-systems
• New-generation many-core based system; 512 nodes; performance > 4 Pflops
• Self-developed network chip, fat-tree interconnect, point-to-point bandwidth > 200 Gbps
• Storage sub-system based on Shenwei storage servers
• Self-developed high voltage (300 V) DC power supply
• Highly efficient water cooling with enhanced-heat-transfer copper cold plates
Sunway exa-prototype: computing node
[Node diagram: four core groups (0–3) with DDR4 memory channels, PCI-E network interfaces to the high-speed compute network and the Ethernet management network, and a BMC handling clock, processor, and power management plus node monitoring]
• Connection to the interconnect: 2 × 25 Gbps × 4
• Point-to-point one-way bandwidth: 200 Gbps
• Peak performance: > 8 Tflops
• Memory: > 64 GB
Sunway exa-prototype: software system
• Basic software for the home-grown many-core processor
  – parallel OS
  – high performance storage management system
  – parallel compiler
  – parallel program development environment
• Highly efficient compiler for heterogeneous many-core
• SIMD auto-vectorization
• High performance basic math libraries
• Integrated multi-domain OS for heterogeneous many-core
• Dynamic storage management
• Support for MPI-1, MPI-2, MPI-3, OpenMP 3.0; compatible with OpenACC 2.0
• Debugger for heterogeneous many-core
Sunway exa-prototype: demo applications
• Demo applications: ocean model, aircraft design, seismic processing, floating platform design
• Applications are being ported to TaihuLight, and performance optimization is being conducted
2016:
• Fully Implicit Solver for Atmospheric Dynamics
• Surface Wave Modeling
• Phase Field Simulations of Coarsening Dynamics
• Atomistic Simulation of Silicon Nanowires
• Run-away Electron Trajectory Simulation
• Genome Functional Annotation and Homeotic Gene Building
• Spacecraft CFD Numerical Simulation

2017:
• Extreme-scale Graph Processing Framework
• Simulation of Planetary Rings
• Simulations of Quantum Spin Liquid States via PEPS++
• Molecular Dynamics Simulation of Condensed Covalent Materials
• cryo-EM Macromolecule Structure Determination
• Redesigning CAM-SE
• Nonlinear Earthquake Simulation
Sunway exa-prototype: applications
10-million-core applications on TaihuLight
Major Challenges to exa-scale systems
• Power consumption
• Performance obtained by applications
• Programmability
• Resilience

• How to make tradeoffs among performance, power consumption, and programmability?
• How to achieve continuous non-stop operation?
• How to adapt to a wide range of applications with reasonable efficiency?
Architecture
• Novel architectures beyond the current heterogeneous accelerated / many-core-based designs are expected
• Co-processor or partitioned heterogeneous architecture?
  – low utilization of the co-processor in some applications, which use the CPU only
  – bottleneck in moving data between CPU and co-processor
• Application-aware architecture
  – on-chip integration of special purpose units (idea from Prof. Andrew Chien)
  – using the right tool to do the right things
  – dynamically reconfigurable? how to program it?
Memory system
• Pursuing large capacity, low latency, and high bandwidth
• Increasing capacity and lowering power consumption by combining DRAM and NVM
  – data placement issue
• Improving bandwidth and latency with 3D stacking technology
• Reducing data movement by placing data closer to processing
  – HBM/HMC near the processor
  – on-chip DRAM
  – simple functions in memory
• Reducing data copy cost with a unified memory space in heterogeneous architectures
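Why bandwidth matters as much as peak flops can be seen with a simple roofline estimate. The numbers below are illustrative assumptions, not specs of any machine in this talk:

```python
def attainable_gflops(peak_gflops, mem_bw_gb_s, flops_per_byte):
    # Roofline model: performance is capped by compute or by memory traffic.
    return min(peak_gflops, mem_bw_gb_s * flops_per_byte)

# A bandwidth-bound kernel at 0.25 flops/byte on a hypothetical 10 Tflops chip:
print(attainable_gflops(10_000, 200, 0.25))   # 50.0  (DDR-class, 200 GB/s)
print(attainable_gflops(10_000, 1000, 0.25))  # 250.0 (HBM-class, 1 TB/s)
```

For such kernels, moving from DDR-class to HBM-class bandwidth raises attainable performance fivefold while the peak is untouched.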
Interconnect
• Pursuing low latency, high bandwidth, and low energy consumption
• Adopting new technologies
  – silicon photonics communication between components
  – optical interconnect / communication
  – miniature optical devices
• High scalability adapting to exa-scale interconnect requirements
  – connecting 10,000+ nodes
  – low-hop, low-latency topology
  – reliable and intelligent routing
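The 10,000+ node requirement shapes the topology choice. As one sketch, the standard sizing formula for a folded-Clos (fat-tree), the topology used in the Sunway prototype, shows why high-radix routers are needed (this is a textbook formula, not a design figure from the talk):

```python
def fat_tree_hosts(k, levels=3):
    # Max end points of an l-level folded-Clos of radix-k switches: 2*(k/2)^l.
    return 2 * (k // 2) ** levels

print(fat_tree_hosts(24))  # 3456  -- too small for an exa-scale machine
print(fat_tree_hosts(48))  # 27648 -- comfortably over 10,000 nodes
```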
Programming the heterogeneous systems
• Addressing the issues in programming heterogeneous parallel systems
  – efficient expression of parallelism, dependence, data sharing, and execution semantics
  – problem decomposition appropriate for heterogeneous systems
• Improving programming by means of a holistic approach
  – new programming models
  – programming language extensions and compilers
  – parallel debugging
  – runtime support and optimization
  – architectural support
Computational models and algorithms
• Full-chain innovation
  – mathematical methods
  – computer algorithms
  – algorithm implementation and optimization
• A good mathematical method is often more effective than hardware improvement and algorithm optimization
• Architecture-aware algorithm implementation and optimization is necessary for heterogeneous systems
• Domain-specific libraries for improving software productivity and performance
Resilience
• Resilience is one of the key issues of the exa-scale system
  – large scale of the system
    • 50K to 100K nodes
    • huge number of components
    • very short MTBF
    • long non-stop operation required for solving large-scale problems
• Reliability measures required at different levels: device, node, and system
• Software / hardware coordination is necessary
  – fast context saving and recovery for checkpointing in case of short MTBF
  – fault tolerance at the algorithm and application software level
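The pressure a short MTBF puts on checkpointing can be quantified with Young's classic first-order approximation for the optimal checkpoint interval; the numeric inputs below are illustrative, not figures from the talk:

```python
import math

def optimal_checkpoint_interval_s(ckpt_cost_s, mtbf_s):
    # Young's approximation: interval = sqrt(2 * checkpoint_cost * MTBF).
    return math.sqrt(2 * ckpt_cost_s * mtbf_s)

# Illustrative: a 60 s global checkpoint on a system with a 1 h MTBF
# should checkpoint roughly every 11 minutes.
print(round(optimal_checkpoint_interval_s(60, 3600)))  # 657
```

As MTBF shrinks, the interval shrinks with its square root, so checkpoint overhead grows quickly; hence the slide's call for fast context saving and algorithm-level fault tolerance.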
Importance of the tools
• Development and optimization of large-scale parallel software requires scalable tools
• Particularly important for systems implemented with home-grown processors
  – current commercial and research tools do not support them
• Three kinds of tools required by default
  – parallel debugger for correctness
  – performance tuner for performance
  – energy optimizer for energy efficiency
Urgent need for eco-system
• The eco-system for exa-scale systems based on home-grown processors is urgently needed
  – languages, compilers, OS, runtime
  – tools
  – application development support
  – application software
• Need to attract hardware manufacturers and third-party software developers
  – a product family instead of a single machine
• Collaboration among industry, academia, and end-users is required