Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu , S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November, 2018 SC18

Mar 10, 2021


Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi

Tohoku University
14 November, 2018

SC18


Outline

• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions

14 November, 2018 SC18 2


Background

• Supercomputers have become important infrastructure
  • Widely used for scientific research as well as in various industries
  • The top-ranked system, Summit, reaches 143.5 Pflop/s
• Big gap between theoretical performance and sustained performance
  ◎ Compute-intensive applications stand to benefit from high peak performance
  ✖ Memory-intensive applications are limited by lower memory performance

Memory performance has gained more and more attention

A new vector supercomputer SX-Aurora TSUBASA

• Two important concepts of its design
  • High usability
  • High sustained performance
• New memory integration technology
  • Realizes the world's highest memory bandwidth
• New architecture
  • Vector host (VH) is attached to vector engines (VEs)
    • VE is responsible for executing an entire application
    • VH is used for processing system calls invoked by the applications

[Figure: a vector host (VH, x86 Linux) with attached vector engines (VEs), each a vector processor]

New execution model

• Conventional model
• New execution model

[Figure: timelines of the two models. Conventional model (Host, GPU): Start processing, Kernel execution, Kernel execution, Finish processing. New execution model (VH, VE): Exe module load, Start processing, System call (I/O, etc.), Transparent offload, OS function, Finish processing.]

Highlights of the execution model

• Two advantages over the conventional execution model
  • Avoids frequent data transfers between VE and VH
    • Applications are entirely executed on the VE
    • Only the data necessary for system calls needs to be transferred
    → High sustained performance
  • No special programming
    • Explicit specification of computation kernels is not necessary
    • System calls are transparently offloaded to the VH
    • Programmers do not need to care about system calls
    → High usability

[Figure: VH/VE timeline of the new execution model: Exe module load, Start processing, Offload, System call (I/O, etc.), OS function, Finish processing]

Specification of SX-Aurora TSUBASA

• Memory bandwidth
  • 1.22 TB/s, the world's highest memory bandwidth
    • Integration of six HBM2 memory modules
  • 3.0 TB/s LLC bandwidth
    • LLC is connected to cores via a 2D mesh network
• Computational capability
  • 2.15 Tflop/s @ 1.4 GHz
  • 8 powerful vector cores
  • 16 nm FinFET process technology
  • 4.8 billion transistors
  • 14.96 mm x 33.00 mm

[Figure: block diagram of a vector processor]


Experimental environments

• SX-Aurora TSUBASA A300-2

• 2x VEs Type 10B

• 1x VH

VE Type 10B
  Frequency                1.4 GHz
  Peak FP / core           268.8 Gflop/s
  # cores                  8
  Peak DP Flops / socket   2.15 Tflop/s
  Memory BW                1.2 TB/s
  Memory capacity          48 GB
  Memory config            HBM2

VH Intel Xeon Gold 6126
  Frequency                2.60 GHz / 3.70 GHz (Turbo)
  Peak FP / core           83.2 Gflop/s
  # cores                  12
  Peak DP Flops            998.4 Gflop/s
  Mem BW                   128 GB/s
  Mem capacity             96 GB
  Mem config               DDR4-2666 DIMM 16GB x 6
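The peak figures in the two tables above are related by simple arithmetic, which the following minimal sketch reproduces. It uses only numbers listed on this slide; the function and variable names are illustrative, and the flops-per-cycle values are derived here rather than stated in the slides.

```python
# Figures copied from the VE and VH specification tables on this slide.
VE_FREQ_GHZ = 1.4
VE_PEAK_PER_CORE_GFLOPS = 268.8   # VE Type 10B
VE_CORES = 8
VH_FREQ_GHZ = 2.6                 # base frequency of the Xeon Gold 6126
VH_PEAK_PER_CORE_GFLOPS = 83.2
VH_CORES = 12


def socket_peak_gflops(per_core_gflops: float, cores: int) -> float:
    """Peak of one socket = per-core peak times the number of cores."""
    return per_core_gflops * cores


def flops_per_cycle(per_core_gflops: float, freq_ghz: float) -> float:
    """Floating-point operations one core must retire per clock cycle."""
    return per_core_gflops / freq_ghz


ve_peak = socket_peak_gflops(VE_PEAK_PER_CORE_GFLOPS, VE_CORES)
vh_peak = socket_peak_gflops(VH_PEAK_PER_CORE_GFLOPS, VH_CORES)
ve_fpc = flops_per_cycle(VE_PEAK_PER_CORE_GFLOPS, VE_FREQ_GHZ)
vh_fpc = flops_per_cycle(VH_PEAK_PER_CORE_GFLOPS, VH_FREQ_GHZ)

print(f"VE peak: {ve_peak:.1f} Gflop/s")   # the table rounds this to 2.15 Tflop/s
print(f"VH peak: {vh_peak:.1f} Gflop/s")   # the table lists 998.4 Gflop/s
print(f"VE flops/cycle/core: {ve_fpc:.0f}")
print(f"VH flops/cycle/core: {vh_fpc:.0f}")
```

The per-core and per-socket rows of each table are therefore consistent with each other at the listed core counts.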

[Figure: one VH (x86 Linux) with attached vector engines (VEs)]

Experimental environments cont.

Processor               SX-Aurora Type 10B   Xeon Gold 6126         SX-ACE         Tesla V100     Xeon Phi KNL 7290
Frequency               1.4 GHz              2.6 GHz                1.0 GHz        1.245 GHz      1.5 GHz
# of cores              8                    12                     4              5120           72
DP flop/s (SP flop/s)   2.15 T (4.30 T)      998.4 GF (1996.8 GF)   256 GF         7 TF (14 TF)   3.456 TF (6.912 TF)
Memory subsystem        HBM2 x6              DDR4 x6ch              DDR3 x16ch     HBM2 x4        MCDRAM, DDR4
Memory BW               1.22 TB/s            128 GB/s               256 GB/s       900 GB/s       450+ GB/s, 115.2 GB/s
Memory capacity         48 GB                96 GB                  64 GB          16 GB          16 GB, 96 GB
LLC BW                  2.66 TB/s            N/A                    1.0 TB/s       N/A            N/A
LLC capacity            16 MB shared         19.25 MB shared        1 MB private   6 MB shared    1 MB shared by 2 cores

Applications used for evaluation

• SGEMM/DGEMM
  • Matrix-matrix multiplications to evaluate the peak flop/s
• Stream benchmark
  • Simple kernels (copy, scale, add, triad) to measure sustained memory performance
• Himeno benchmark
  • Jacobi kernels with a 19-point stencil as memory-intensive kernels
• Application kernels
  • Kernels of practical applications of Tohoku Univ. in earthquake, CFD, and electromagnetic fields
• Microbenchmark to evaluate the execution model
  • Mixture of vector-friendly Jacobi kernels and I/O kernels

Overview of application kernels

Kernel           Field             Method          Memory access   Mesh size        Code B/F   Actual B/F
Land mine        Electromagnetic   FDTD            Sequential      100x750x750      6.22       5.15
Earthquake       Seismology        Friction Law    Sequential      2047x2047x256    4.00       4.00
Turbulent flow   CFD               Navier-Stokes   Sequential      512x16384x512    1.91       0.35
Antenna          Electromagnetic   FDTD            Sequential      252755x9x97336   1.73       0.98
Plasma           Physics           Lax-Wendroff    Indirect        20,048,000       1.12       0.075
Turbine          CFD               LU-SGS          Indirect        480x80x80x10     0.96       0.0084


SGEMM/DGEMM Performance

• High scalability up to 8 threads
  • High vectorization ratio of 99.36%, good vector length of 253.8
• High efficiency
  • Efficiency of 97.8% to 99.2%

[Figure: GEMM performance (Gflop/s) and efficiency (%) versus number of threads (1 to 8); series: DGEMM Gflop/s, SGEMM Gflop/s, DGEMM efficiency, SGEMM efficiency]

Memory performance (Stream Triad)

• High sustained memory bandwidth of SX-Aurora TSUBASA
  • Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81%
• Scalability
  • Bandwidth saturates even when fewer than half of the threads are used
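As an illustration of what the Triad kernel measures and how an efficiency figure like those above is formed, here is a minimal NumPy sketch. It is not the STREAM benchmark used in the talk: the array size, repeat count, and function names are assumptions, and NumPy needs two passes over the output array, so the real memory traffic is higher than the three-arrays-per-element count that STREAM uses.

```python
import time

import numpy as np


def triad_bandwidth_gbs(n: int = 5_000_000, scalar: float = 3.0, repeats: int = 5) -> float:
    """Return the best observed Triad bandwidth in GB/s for arrays of n doubles."""
    b = np.full(n, 1.0)
    c = np.full(n, 2.0)
    a = np.empty(n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Triad: a = b + scalar * c, written without temporary arrays.
        np.multiply(c, scalar, out=a)
        np.add(a, b, out=a)
        best = min(best, time.perf_counter() - start)
    assert np.allclose(a, 1.0 + scalar * 2.0)
    # STREAM convention: count read b, read c, write a for every element.
    bytes_moved = 3 * n * a.itemsize
    return bytes_moved / best / 1e9


def efficiency(sustained_gbs: float, peak_gbs: float) -> float:
    """Sustained bandwidth as a fraction of the theoretical peak bandwidth."""
    return sustained_gbs / peak_gbs


if __name__ == "__main__":
    measured = triad_bandwidth_gbs()
    print(f"Triad bandwidth on this machine: {measured:.1f} GB/s")
```

Dividing the measured value by the theoretical memory bandwidth of the machine under test gives an efficiency comparable in definition to the percentages listed on the slide.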

[Figure: Stream bandwidth (GB/s) for SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL; ratio labels above the four other processors, in that order: x 4.72, x 11.7, x 1.37, x 2.23]

[Figure: Stream bandwidth (GB/s) versus number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake]

Himeno (Jacobi) performance

• Higher performance than the other processors, except the GPU
  • Vector reduction becomes a bottleneck due to copies among vector pipes
• Nice thread scalability
  • 6.9x speedup with 8 threads => 86% parallel efficiency
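The parallel efficiency quoted above is the speedup divided by the number of threads. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal (linear) speedup achieved with the given worker count."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Himeno on SX-Aurora TSUBASA: 6.9x speedup with 8 threads (from the slide).
himeno = parallel_efficiency(6.9, 8)
print(f"{himeno:.0%}")  # prints "86%"
```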

[Figure: Himeno performance (Gflop/s) for SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL; ratio labels above the four other processors, in that order: x 3.4, x 7.96, x 0.93, x 2.1]

[Figure: Himeno performance (Gflop/s) versus number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake; label: x 6.9]

Application kernel performance

• SX-Aurora TSUBASA can achieve high performance
  • Plasma, Turbine => indirect access, memory latency-bound
  • Antenna => computation-bound to memory BW-bound
  • Land mine, Earthquake, Turbulent flow => memory or LLC BW-bound

[Figure: execution time (sec) of the six kernels (Land mine, Earthquake, Turbulent flow, Antenna, Plasma, Turbine) on SX-Aurora TSUBASA, SX-ACE, and Skylake; speedup labels: x 4.3, x 2.9, x 3.4, x 9.9, x 2.6, x 3.4]

[Figure: attainable performance (Gflop/s) versus arithmetic intensity (Flops/Byte) for SX-Aurora TSUBASA and SX-ACE, with the kernels marked]

Memory bound? or LLC bound?

• Further analysis using 4 types of Bytes/Flop ratio
  • Memory B/F = (memory BW) / (peak performance)
  • LLC B/F = (LLC BW) / (peak performance)
  • Code B/F = (necessary data in bytes) / (# FP operations)
  • Actual B/F = (# block memory accesses) * (block size) / (# FP operations)
• Code B/F > Actual B/F * LLC BW / Memory BW => LLC bound
• Code B/F < Actual B/F * LLC BW / Memory BW => memory bound

B/F ratio    Actual < Memory     Actual > Memory
Code < LLC   Computation-bound   Memory BW-bound
Code > LLC   LLC BW-bound        Memory or LLC bound *


Application kernel performance

• Memory B/F = 1.22 TB/s / 2.15 TF = 0.57
• LLC B/F = 2.66 TB/s / 2.15 TF = 1.24
• Land mine (Code 6.22, Actual 5.79) => LLC bound
• Earthquake (Code 6.00, Actual 2.00) => LLC bound
• Turbulent flow (Code 1.91, Actual 0.35) => memory BW bound
• Antenna (Code 1.73, Actual 0.98) => memory BW bound
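The two machine ratios above, and the classification rule stated on the preceding slide, can be sketched as follows. This is one possible reading of the slides, not the authors' tool: the function name is hypothetical, the 2x2 table is applied first and the Code B/F versus Actual B/F * LLC BW / Memory BW comparison is used only for the cell marked "Memory or LLC bound". Only the Antenna values are used as an example here.

```python
# Machine figures for SX-Aurora TSUBASA as given on this slide.
MEMORY_BW_TBS = 1.22
LLC_BW_TBS = 2.66
PEAK_TFLOPS = 2.15

memory_bf = MEMORY_BW_TBS / PEAK_TFLOPS  # bytes per flop the memory can feed
llc_bf = LLC_BW_TBS / PEAK_TFLOPS        # bytes per flop the LLC can feed


def classify(code_bf: float, actual_bf: float) -> str:
    """Classify a kernel from its Code B/F and Actual B/F (assumed reading)."""
    exceeds_llc = code_bf > llc_bf
    exceeds_memory = actual_bf > memory_bf
    if not exceeds_llc and not exceeds_memory:
        return "computation-bound"
    if not exceeds_llc and exceeds_memory:
        return "memory BW-bound"
    if exceeds_llc and not exceeds_memory:
        return "LLC BW-bound"
    # Both limits exceeded: compare the demand on the LLC with the demand
    # on memory, scaled by the ratio of the two bandwidths.
    threshold = actual_bf * LLC_BW_TBS / MEMORY_BW_TBS
    return "LLC BW-bound" if code_bf > threshold else "memory BW-bound"


print(f"Memory B/F = {memory_bf:.2f}, LLC B/F = {llc_bf:.2f}")
print("Antenna:", classify(1.73, 0.98))  # prints "Antenna: memory BW-bound"
```

For Antenna both limits are exceeded, and 1.73 is smaller than 0.98 * 2.66 / 1.22, so the sketch reports the kernel as memory bandwidth bound, in line with the slide.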

[Figure: execution time (sec) of the six kernels (Land mine, Earthquake, Turbulent flow, Antenna, Plasma, Turbine) on SX-Aurora TSUBASA, SX-ACE, and Skylake; speedup labels: x 4.3, x 2.9, x 3.4, x 9.9, x 2.6, x 3.4]

B/F ratio    Actual < Memory     Actual > Memory
Code < LLC   Computation-bound   Memory BW-bound
Code > LLC   LLC BW-bound        Memory or LLC bound

Evaluation of the execution model

• (Transparent/Explicit) offload from VE to VH
• Offload from VH to VE
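The slides do not show the microbenchmark itself, only that it mixes vector-friendly Jacobi kernels with I/O kernels. The sketch below is a hypothetical stand-in for that structure (a Jacobi phase, an I/O phase, a second Jacobi phase, each timed separately); the 1D stencil, problem size, file format, and all names are assumptions made for illustration.

```python
import os
import tempfile
import time


def jacobi_1d(u, sweeps):
    """Run Jacobi sweeps on a 1D grid whose two end values stay fixed."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u


def write_checkpoint(u, path):
    """I/O phase: system calls like these are what the VH would service."""
    with open(path, "w") as f:
        f.write("\n".join(f"{x:.17g}" for x in u))
        f.flush()
        os.fsync(f.fileno())


def run_phases(n=2000, sweeps=50):
    """Time Jacobi 1, I/O, and Jacobi 2 and return the grid and the timings."""
    u = [0.0] * n
    u[0], u[-1] = 1.0, 1.0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.txt")
        t0 = time.perf_counter()
        u = jacobi_1d(u, sweeps)
        t1 = time.perf_counter()
        write_checkpoint(u, path)
        t2 = time.perf_counter()
        u = jacobi_1d(u, sweeps)
        t3 = time.perf_counter()
    timings = {"Jacobi 1": t1 - t0, "I/O": t2 - t1, "Jacobi 2": t3 - t2}
    return u, timings


if __name__ == "__main__":
    _, phase_times = run_phases()
    for phase, seconds in phase_times.items():
        print(f"{phase}: {seconds:.4f} s")
```

A program of this shape contains no offload directives; under the transparent model described in the talk, the I/O phase would be the part handled on the VH.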

[Figure: execution time (sec) for three cases (VE execution, VH offload, VE offload), broken down into Data transfer, Jacobi 1, I/O, and Jacobi 2; accompanying VH/VE diagrams labelled Transparent VE to VH, explicit VE to VH, and VH to VE]

Multi-VE performance on A300-8

• Stream VE-level scalability
  • Almost ideal scalability up to 8 VEs
• Himeno VE-level scalability
  • Good scalability up to 4 VEs
  • Lack of vector length with more than 5 VEs
    • Problem size is too small

[Figure: Stream bandwidth (GB/s) versus number of VEs (1 to 8); series: Aurora, Ideal]

[Figure: Himeno performance (Gflop/s) versus number of VEs (1 to 8); series: Aurora, Ideal]


Application evaluation

• Applications
  • Tsunami simulation code
    • Mainly consists of ground fluctuation and tsunami inundation prediction
    • Used for a real-time tsunami inundation system
    • A coastal region of Japan (1244 x 826 km) with an 810 m resolution
  • Direct numerical simulation code of turbulent flows
    • Incompressible Navier-Stokes equations and a continuum equation are solved by 3D FFT
    • MPI parallelization on up to 4 VEs (1 to 32 processes)
    • 128^3, 256^3, and 512^3 grid points

Performance of the Tsunami code

• Core performance is 11.4x, 4.1x, and 2.4x that of KNL, Skylake, and SX-ACE, respectively
• Socket performance is 2.5x, 1.9x, and 3.5x, respectively

Performance of the DNS code

• Speedups of about 8.14x, 6.73x, and 5.48x with 32 cores over 1 core for the 128^3, 256^3, and 512^3 grids
  • LLC hit ratios for the 256^3 and 512^3 grids are lower than for the 128^3 grid
  • 10% is bank conflicts

Conclusions

• A vector supercomputer SX-Aurora TSUBASA
  • The highest bandwidth, 1.22 TB/s, by integrating six HBM2 modules
  • New execution model to achieve high usability and high sustained performance
• Performance evaluation
  • Benchmark programs
    → High sustained memory performance
    → Effectiveness of the new execution model
  • Application programs
    → High sustained performance
• Future work
  • Further optimizations for SX-Aurora TSUBASA

Acknowledgements

• People
  • Hiroyuki Takizawa
  • Ryusuke Egawa
  • Souya Fujimoto
  • Yasuhisa Masaoka
• Projects
  • The closed beta program for early access to Aurora
  • High performance computing division (jointly-organized with NEC), Cyberscience Center, Tohoku University