Update of “Fugaku”
Mitsuhisa Sato
Team Leader of Architecture Development Team
Deputy project leader, FLAGSHIP 2020 project
Deputy Director, RIKEN Center for Computational Science (R-CCS)
Professor (Cooperative Graduate School Program), University of Tsukuba
AHUG, 18th Nov 2019 @ SC 2019, Denver
FLAGSHIP 2020 Project “Fugaku” Missions
• Building the Japanese national flagship supercomputer Fugaku (a.k.a. post-K), and
• Developing a wide range of HPC applications, running on Fugaku, in order to solve social and scientific issues in Japan (the application development project will end at the end of March)
Overview of Fugaku architecture
Node: Manycore architecture
• Armv8-A + SVE (Scalable Vector Extension)
• SIMD length: 512 bits
• # of cores: 48 + (2/4 for OS) (> 2.7 TF / 48 cores)
• Co-design with application developers and high memory
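The "> 2.7 TF / 48 cores" figure can be sanity-checked with a short peak-performance calculation. The sketch below is only an illustration: the 512-bit SIMD length and the 48 compute cores come from the slide, the 2.0 GHz and 2.2 GHz clocks are the ones quoted in the status slide, and the two FMA pipelines per core are an assumption not stated in this deck.

```python
# Theoretical double-precision peak of one node (compute cores only).
SIMD_BITS = 512        # from the slide
CORES = 48             # compute cores, from the slide
FMA_PIPES = 2          # ASSUMPTION: two SIMD FMA pipelines per core
FLOPS_PER_FMA = 2      # one fused multiply-add counts as two flops


def peak_tflops(clock_ghz: float) -> float:
    """Return the theoretical peak in TFLOPS at the given clock."""
    lanes = SIMD_BITS // 64                               # 64-bit lanes
    flops_per_cycle = lanes * FMA_PIPES * FLOPS_PER_FMA   # per core
    return CORES * flops_per_cycle * clock_ghz / 1000.0


if __name__ == "__main__":
    for ghz in (2.0, 2.2):
        print(f"{ghz} GHz: {peak_tflops(ghz):.3f} TF")
```

Under these assumptions both clock settings land above the 2.7 TF quoted on the slide.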
Status and Update
• March 2019: The official contract with Fujitsu to manufacture, ship, and install hardware for Fugaku was completed.
• RIKEN revealed that # of nodes > 150K.
• March 2019: The name of the system was decided as “Fugaku”.
• Aug. 2019: The K computer stopped its services and was shut down (removed from the computer room).
• Oct. 2019: Access to the test chips started.
• Nov. 2019: Fujitsu announced the FX1000 and FX700, and business with Cray.
• Nov. 2019: Fugaku clock frequency will be 2.0 GHz, with boost to 2.2 GHz.
• Nov. 2019: Green500 1st position!
• Oct.-Nov. 2019: MEXT announced the Fugaku “early access program”, to begin around Q2/CY2020.
• Around Jan. 2020: Installation of “Fugaku” will start.
KPIs on Fugaku development in FLAGSHIP 2020 project
Three KPIs (key performance indicators) were defined for Fugaku development.
1. Extreme power-efficient system
• Maximum performance under power consumption of 30 - 40 MW (for the system)
• Approx. 15 GF/W (dgemm) confirmed by the prototype CPU
2. Effective performance of target applications
• Expected to exceed 100 times the K computer’s performance in some applications
• 125 times faster in GENESIS (MD application) and 120 times faster in NICAM+LETKF (climate simulation and data assimilation) were estimated
3. Ease-of-use system for a wide range of users
• A shared memory system with high-bandwidth on-package memory must make existing OpenMP-MPI programs easy to port.
• No programming effort for accelerators such as GPUs is required.
• Co-design with application developers
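To relate the first KPI's two numbers, the sketch below multiplies the 15 GF/W dgemm efficiency by the 30 - 40 MW power envelope. This is only a naive back-of-the-envelope bound, assuming (hypothetically) that the whole system budget ran at the prototype CPU's efficiency; it is not a projected or measured system result.

```python
GF_PER_WATT = 15.0   # dgemm efficiency of the prototype CPU, from the slide


def bound_pflops(power_mw: float) -> float:
    """Naive bound in PFLOPS if every watt delivered GF_PER_WATT.

    ASSUMPTION: the full system power budget runs at CPU-level dgemm
    efficiency, which ignores network, storage, and cooling overheads.
    """
    watts = power_mw * 1e6
    gflops = GF_PER_WATT * watts
    return gflops / 1e6          # 1 PFLOPS = 1e6 GFLOPS


if __name__ == "__main__":
    for mw in (30, 40):
        print(f"{mw} MW -> {bound_pflops(mw):.0f} PF (naive bound)")
```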
ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.
4 NUMA Nodes
TofuD Interconnect
8B Put latency 0.49 – 0.54 usec
1MiB Put throughput 6.35 GB/s
ref. Yuichiro Ajima, et al., “The Tofu Interconnect D,” IEEE Cluster 2018, 2018.
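The two TofuD figures above (8 B Put latency and 1 MiB Put throughput) can be put on a common scale by converting the throughput into a transfer time. The sketch below does only that arithmetic and adds a simple "latency + size / throughput" estimate; that additive model is an assumption for illustration, not something stated in the slides or the cited paper.

```python
LATENCY_US = (0.49, 0.54)    # 8 B Put latency range in usec, from the slide
THROUGHPUT_GB_S = 6.35       # 1 MiB Put throughput in GB/s, from the slide


def wire_time_us(nbytes: int) -> float:
    """Time in usec to move nbytes at the quoted throughput."""
    bytes_per_us = THROUGHPUT_GB_S * 1e3      # 1 GB/s = 1e3 bytes per usec
    return nbytes / bytes_per_us


def put_time_us(nbytes: int, latency_us: float) -> float:
    """ASSUMED additive model: fixed latency plus size over throughput."""
    return latency_us + wire_time_us(nbytes)


if __name__ == "__main__":
    for size in (8, 4096, 2**20):
        lo = put_time_us(size, LATENCY_US[0])
        hi = put_time_us(size, LATENCY_US[1])
        print(f"{size:>8} B: {lo:.2f} to {hi:.2f} usec (model estimate)")
```

In this model small messages are dominated by the latency term and 1 MiB messages by the throughput term.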
Disclaimer: The software used for the evaluation, such as the compiler, is still under development, and its performance may be different when the supercomputer Fugaku starts operation.
CloverLeaf
• Evaluation using one CMG (NUMA node) without MPI
• Good scalability when increasing the number of threads within a CMG
• One CMG's performance is comparable to Intel's. (The chip contains 4 CMGs!)
[Figure: two bar charts for Fujitsu A64FX, Arm TX2, and Intel Broadwell with 1, 4, 8, and 12 threads. Left: execution time [sec]. Right: relative performance (to 1T/A64FX).]
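The "relative performance (to 1T/A64FX)" panel is the baseline execution time divided by each measured time. A minimal sketch of that normalization follows, together with a parallel-efficiency helper that is added here for illustration. The timing values are hypothetical placeholders, not numbers read from the charts.

```python
# HYPOTHETICAL execution times in seconds, keyed by (system, threads).
# Placeholders only; they are NOT values taken from the slides.
times = {
    ("A64FX", 1): 24.0,
    ("A64FX", 4): 6.5,
    ("A64FX", 12): 2.4,
}


def relative_performance(times, baseline):
    """Performance of each run relative to the baseline run (= 1.0)."""
    base = times[baseline]
    return {key: base / t for key, t in times.items()}


def parallel_efficiency(times, system, threads):
    """Speedup over the 1-thread run divided by the thread count."""
    return times[(system, 1)] / (threads * times[(system, threads)])


if __name__ == "__main__":
    rel = relative_performance(times, ("A64FX", 1))
    for key in sorted(rel):
        print(key, f"{rel[key]:.2f}x")
    print(f"efficiency @12T: {parallel_efficiency(times, 'A64FX', 12):.2f}")
```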
TeaLeaf
• Evaluation of an MPI program within one chip (up to 4 MPI processes)
• Changing the # of threads within a CMG
• The speedup is limited for more than 4 threads due to the memory bandwidth (?)
• We need more performance analysis.
[Figure: two bar charts for A64FX, TX2, and Xeon with 1, 4, 8, and 12 threads for each of 1, 2, and 4 processes. Left: execution time (wall clock time [sec]). Right: relative performance (to A64FX, 1 process, 1 thread).]
Xeon @ Cygnus, Univ. of Tsukuba: Intel Xeon Gold 6126, 2.6 GHz; 12 cores x 2 sockets
LULESH
• Evaluation using one CMG (NUMA node) without MPI
• One CMG's performance is lower than that of TX2 and Intel.
• We found low vectorization (the ratio of SIMD (SVE) instructions is a few %).
• We need more code tuning for more vectorization using SIMD.
[Figure: bar chart of FOM (z/s) for Fujitsu A64FX, Arm TX2, and Intel Broadwell with 1, 4, 8, and 12 threads.]
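Why a SIMD ratio of only a few percent matters can be illustrated with an Amdahl-style estimate. The sketch below assumes, hypothetically, that a fraction f of the scalar work is vectorized perfectly over 8 double-precision lanes (512 bits / 64 bits); the slide reports an instruction ratio, which is not the same quantity, so this is only a rough guide and not an analysis of LULESH itself.

```python
def simd_speedup(vector_fraction: float, lanes: int = 8) -> float:
    """Amdahl-style speedup over purely scalar code.

    vector_fraction: ASSUMED fraction of scalar work that is vectorized.
    lanes: 8 double-precision lanes for 512-bit SIMD.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    scalar_part = 1.0 - vector_fraction
    return 1.0 / (scalar_part + vector_fraction / lanes)


if __name__ == "__main__":
    for f in (0.03, 0.5, 0.9):
        print(f"vectorized fraction {f:.2f}: {simd_speedup(f):.2f}x")
```

With only a few percent vectorized, the estimate stays close to 1x, which is consistent with the slide's call for more code tuning.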
Fugaku (post-K) Fujitsu A64FX processor simulator based on gem5
• The processor simulator gives detailed performance results, including estimated execution time, cache misses, and the number of instructions executed in O3.
• The user can see how the compiled code for SVE is executed on the A64FX processor, for optimization.
• An NDA with RIKEN/Fujitsu is required.
Open version of Arm-SVE gem5 simulator in docker file (x86)
• Arm-SVE gem5 with “open parameters” (free) and gcc for Arm-SVE included
• Can be used for architecture exploration
• Available on Linaro docker hub: https://hub.docker.com/r/linaro/gem5-riken-open
Compilers for Fujitsu A64FX processor
• Fujitsu Compilers: Fortran, C, C++. Fully tuned for the “A64FX” architecture.
• Arm Compiler: LLVM-based compiler to generate code for Armv8-A + SVE.