Leading Computational Methods on Scalar and Vector HEC Platforms

Computational Research Division


Leonid Oliker, Jonathan Carter, Michael Wehner, Andrew Canning
Lawrence Berkeley National Laboratory

Stephane Ethier
Princeton Plasma Physics Laboratory

Art Mirin, Govindasamy Bala
Lawrence Livermore National Laboratory

David Parks
NEC Solutions America

Patrick Worley
Oak Ridge National Laboratory

Shigemune Kitawaki, Yoshinori Tsuda
Earth Simulator Center

Overview

Stagnating application performance is a well-known problem in scientific computing

By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels

Many HEC platforms are poorly balanced for the demands of leading applications
  • Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology

Traditional superscalar trends are slowing down
  • Most benefits of ILP and pipelining have already been mined
  • Clock frequency is limited by power concerns

To continue increasing computing power and reap its benefits, major strides are necessary in architecture development, software infrastructure, and application development

Application Evaluation

Microbenchmarks, algorithmic kernels, and performance modeling and prediction are important components of understanding and improving architectural efficiency

However, full-scale application performance is the final arbiter of system utility and is necessary as a baseline to support all complementary approaches

Our evaluation work emphasizes full applications, with real input data, at the appropriate scale

Requires coordination of computer scientists and application experts from highly diverse backgrounds

Our initial efforts have focused on comparing performance between high-end vector and scalar platforms

Effective code vectorization is an integral part of the process

First US team to conduct Earth Simulator performance study

Benefits of Evaluation

Full-scale application evaluation leads to more efficient use of community resources
  • For both current installations and future designs

Head-to-head comparisons on full applications:
  • Help identify the suitability of a particular architecture for a given application class
  • Give application scientists information about how well various numerical methods perform across systems
  • Reveal performance-limiting system bottlenecks that can aid designers of next-generation systems
    • Science Driven Architecture

In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale to achieve high performance

Architectural Comparison

Node Type  Where  Network     CPU/Node  Clock MHz  Peak GFlop/s  Stream BW GB/s/P  Peak byte/flop  MPI BW GB/s/P  MPI Latency μsec  Network Topology
Power3     NERSC  Colony      16        375        1.5           0.4               0.26            0.13           16.3              Fat-tree
Itanium2   LLNL   Quadrics    4         1400       5.6           1.1               0.19            0.25           3.0               Fat-tree
Opteron    NERSC  InfiniBand  2         2200       4.4           2.3               0.51            0.59           6.0               Fat-tree
X1         ORNL   Custom      4         800        12.8          14.9              1.16            6.3            7.1               4D-Hypercube
X1E        ORNL   Custom      4         1130       18.0          9.7               0.54            2.9            5.0               4D-Hypercube
ES         ESC    IN          8         1000       8.0           26.3              3.29            1.5            5.6               Crossbar
SX-8       HLRS   INX         8         2000       16.0          41.0              2.56            2.0            5.0               Crossbar

Custom vector architectures: high memory bandwidth relative to peak, superior interconnects
  • ES shows the best balance between memory and peak performance
  • Data caches of superscalar systems and X1(E) can potentially reduce memory costs
  • X1E: 2 MSPs per MCM, which increases contention for memory and interconnect

A key ‘balance point’ for vector systems is the scalar:vector ratio

Opteron/IB shows the best balance among superscalar systems; Itanium2/Quadrics has the lowest latency

Cost is a critical metric; however, we are unable to provide such data
  • Proprietary; pricing varies based on customer and time frame

Poorly balanced systems cannot solve important problems/resolutions
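The "Peak byte/flop" column in the table above is the ratio of STREAM bandwidth to peak floating-point rate. A minimal sketch that recomputes it from the two tabulated columns is given here; the dictionary and function names are illustrative only. Because the tabulated inputs are already rounded, the recomputed ratios can differ from the printed column by about 0.01.

```python
# (peak GFlop/s per processor, STREAM bandwidth GB/s per processor),
# copied from the architectural comparison table.
SYSTEMS = {
    "Power3":   (1.5, 0.4),
    "Itanium2": (5.6, 1.1),
    "Opteron":  (4.4, 2.3),
    "X1":       (12.8, 14.9),
    "X1E":      (18.0, 9.7),
    "ES":       (8.0, 26.3),
    "SX-8":     (16.0, 41.0),
}
SUPERSCALAR = ("Power3", "Itanium2", "Opteron")


def bytes_per_flop(peak_gflops, stream_gbs):
    """Memory balance: bytes deliverable from memory per peak flop."""
    return stream_gbs / peak_gflops


balance = {name: bytes_per_flop(*spec) for name, spec in SYSTEMS.items()}
best_overall = max(balance, key=balance.get)
best_superscalar = max(SUPERSCALAR, key=balance.get)

for name, value in sorted(balance.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {value:.2f} byte/flop")
```

Under these inputs the ranking matches the observations on this slide: ES has the highest balance overall and Opteron the highest among the superscalar systems.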

Application Overview

NAME     Discipline        Problem/Method                              Structure
LBMHD    Plasma Physics    Magneto-Hydrodynamics, Lattice-Boltzmann    Lattice/Grid
GTC      Magnetic Fusion   Particle in Cell, Vlasov-Poisson            Particle/Grid
PARATEC  Material Science  Density Functional Theory, Kohn-Sham, FFT   Fourier/Grid
FVCAM    Climate Modeling  AGCM, Finite Volume, Navier-Stokes, FFT     Grid

Examining candidate ultra-scale applications with abundant data parallelism

Codes were designed for superscalar architectures and required vectorization effort
  • ES use requires clearing minimum vectorization and parallelization hurdles

Climate: FVCAM

Atmospheric component of CCSM

AGCM: consists of physics (PS) and dynamical core (DC)
  • DC approximates the Navier-Stokes equations to describe the dynamics of the atmosphere
  • PS calculates source terms to the equations of motion: turbulence, radiative transfer, clouds, etc.

Default uses spectral transform, which maps onto the sphere
  • Allows 1D decomposition in latitude

Finite volume (FV) grid is rectangular (longitude, latitude, level)
  • Allows 2D decomposition (latitude, level) in the dynamics phase
  • Requires remapping between Lagrangian surfaces and Eulerian reference frame

Experiments/vectorization: Art Mirin, Dave Parks, Michael Wehner, Pat Worley

Figure: Simulated Class IV hurricane at 0.5°. This storm was produced solely through the chaos of the atmospheric model. It is one of the many events produced by FVCAM at a resolution of 0.5°.

Hybrid (MPI/OpenMP) programming
  • MPI tasks limited by number of latitude lines (minimum 3 per domain)
  • Increases potential parallelism
  • Improves surface-to-volume ratio
  • Not available on Thunder
  • Did not increase performance on X1/X1E
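The latitude constraint above puts a ceiling on MPI concurrency. The sketch below estimates that ceiling for the 1D and 2D decompositions, assuming the only limit is "at least 3 latitude lines per domain" and that the 2D case simply multiplies by the number of vertical subdomains. The function name is hypothetical, and the 361 latitude lines are taken from the 361x576x26 grid quoted on the performance slide; FVCAM itself may impose further constraints.

```python
def max_mpi_tasks(nlat, vertical_domains=1, min_lat_per_domain=3):
    """Upper bound on MPI tasks for a (latitude, level) decomposition.

    Assumption: each latitude band needs at least `min_lat_per_domain`
    lines, and vertical subdomains multiply the task count.
    """
    if nlat < min_lat_per_domain:
        raise ValueError("grid has too few latitude lines")
    return (nlat // min_lat_per_domain) * vertical_domains


NLAT = 361  # latitude lines of the 0.5 degree grid

limit_1d = max_mpi_tasks(NLAT)                        # 1D: latitude only
limit_2d_4 = max_mpi_tasks(NLAT, vertical_domains=4)  # 2D with 4 vertical subdomains
limit_2d_7 = max_mpi_tasks(NLAT, vertical_domains=7)  # 2D with 7 vertical subdomains

print(limit_1d, limit_2d_4, limit_2d_7)
```

With these assumptions the 1D bound is 120 tasks, while the 2D bound with 7 vertical subdomains is 840, which is consistent with the P=672 (2D:7) runs reported later.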

FVCAM Decomposition and Vectorization

Processor communication topology and volume for 1D Spectral and 2D FVCAM
  • Generated by the IPM profiling tool; used to understand interconnect requirements

1D approach: straightforward nearest-neighbor communication

2D communication: bulk is nearest neighbor, however:
  • Complex pattern due to vertical decomposition and transposition during remapping
  • Total volume in 2D remap is reduced due to improved surface/volume ratio

Vectorization
  • Move latitude calculation to inner loops to maximize parallelism
  • Reduce number of branches by performing logical tests in advance (indirect indexing)
  • Vectorize across (not within) FFTs for polar filters
  • Finer domain decomposition of a fixed-size problem limits performance of vectorized FFTs
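To illustrate "vectorize across (not within) FFTs", here is a minimal sketch of a radix-2 FFT applied to several independent signals at once. It is not the FVCAM polar filter; the function names and the `x[j][s]` data layout are assumptions made for the example. The point is the loop order: the innermost loop runs over the independent signals, so its length equals the number of FFTs and its iterations do not depend on each other.

```python
import cmath


def fft_across(x):
    """In-place radix-2 FFT of m independent signals.

    x[j][s] holds sample j of signal s; n = len(x) must be a power of two.
    """
    n, m = len(x), len(x[0])
    bits = n.bit_length() - 1
    # Bit-reversal reordering of the sample index (whole rows move at once).
    rev = [int(format(j, "b").zfill(bits)[::-1], 2) for j in range(n)]
    x[:] = [[complex(v) for v in x[rev[j]]] for j in range(n)]
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-1j * cmath.pi * k / half)  # twiddle factor
                top, bot = x[start + k], x[start + k + half]
                # Innermost loop: one butterfly applied to every signal.
                for s in range(m):
                    a, b = top[s], w * bot[s]
                    top[s], bot[s] = a + b, a - b
        half *= 2
    return x


def dft(signal):
    """Naive O(n^2) transform of one signal, used only as a reference."""
    n = len(signal)
    return [sum(signal[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]


# Five made-up signals of length 8, stored signal-fastest for fft_across.
signals = [[float((3 * s + j * j) % 7) for j in range(8)] for s in range(5)]
spectra = fft_across([list(col) for col in zip(*signals)])
```

This also shows why a finer domain decomposition hurts: fewer latitude lines per processor means fewer signals `m`, and therefore a shorter innermost loop.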

FVCAM3.1: Performance

FVCAM 2D decomposition allows effective use of >2X as many processors
  • Increasing vertical discretizations (1, 4, 7) allows higher concurrencies

First results showing high-resolution vector performance: 361x576x26 (0.5° x 0.625°)
  • X1E achieves speedup of over 4500 on P=672, the highest ever achieved
  • Power3 limited to speedup of 600 regardless of concurrency
  • Factor of at least 1000x necessary for simulation to be tractable

Raw speed X1E: 1.14X X1, 1.4X ES, 3.7X Thunder, 13X Seaborg

At high concurrencies (P=672) all platforms achieve low % of peak (< 7%)
  • ES achieves highest sustained performance (over 10% at P=256)
  • Vectors suffer from short vector length of fixed problem size, especially FFTs
  • Superscalars generally achieve lower efficiencies/performance than vectors

Finer resolutions require an increased number of more powerful processors

Figure: Simulated Speedup. Simulated years per wallclock year (0 to 4500) versus number of processors (0 to 800) for Power3, Itanium2, ES, X1, X1E.

Figure: Percent of Theoretical Peak (3% to 17%) by configuration, P=32 (1D), P=256 (2D:4), P=336 (2D:7), P=672 (2D:7), for Power3, Itanium2, ES, X1, X1E.
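The speedup on this slide is measured in simulated years per wallclock year. A small sketch of what the quoted speedups mean in wallclock time for a century-long run follows; it assumes a 365-day year, and the function name is illustrative.

```python
def wallclock_days(simulated_years, speedup, days_per_year=365.0):
    """Wallclock days needed to simulate `simulated_years` at a given
    speedup (simulated years per wallclock year)."""
    return simulated_years * days_per_year / speedup


century_at_tractable = wallclock_days(100, 1000)  # the 1000x threshold
century_on_x1e = wallclock_days(100, 4500)        # X1E at P=672
century_on_power3 = wallclock_days(100, 600)      # Power3 ceiling

print(f"1000x: {century_at_tractable:.1f} days")
print(f"4500x: {century_on_x1e:.1f} days")
print(f" 600x: {century_on_power3:.1f} days")
```

At the 1000x threshold a simulated century takes about five weeks of wallclock time; at the Power3 ceiling of 600x it takes about two months.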

Magnetic Fusion: GTC

Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)

The goal of magnetic fusion is a burning plasma power plant producing cleaner energy

GTC solves 3D gyroaveraged gyrokinetic system w/ particle-in-cell approach (PIC)

PIC scales as N instead of N^2: particles interact w/ electromagnetic field on grid

Allows solving equation of particle motion with ODEs (instead of nonlinear PDEs)

Vectorization inhibited since multiple particles may attempt to concurrently update same grid point

Whole volume and cross section of electrostatic potential field, showing elongated turbulence eddies

Developed at PPPL, vectorized/optimized by Stephane Ethier
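The vectorization hazard described above (several particles updating the same grid point) can be reproduced with a toy charge deposition. The sketch below emulates a vector gather/add/scatter in plain Python and then shows one common remedy, a private grid copy per vector lane that is summed at the end. The slides do not say this is how GTC was vectorized; all names, the 4-point grid, and the vector length of 4 are illustrative assumptions.

```python
def deposit_scalar(cells, charges, ngrid):
    """Reference: one particle at a time, always correct."""
    grid = [0.0] * ngrid
    for c, q in zip(cells, charges):
        grid[c] += q
    return grid


def deposit_naive_vector(cells, charges, ngrid, vlen):
    """Emulates gather / add / scatter on chunks of `vlen` particles.
    Lanes that target the same cell overwrite each other's update."""
    grid = [0.0] * ngrid
    for start in range(0, len(cells), vlen):
        idx = cells[start:start + vlen]
        gathered = [grid[c] for c in idx]                 # vector gather
        updated = [g + q for g, q in
                   zip(gathered, charges[start:start + vlen])]
        for c, u in zip(idx, updated):                    # vector scatter
            grid[c] = u
    return grid


def deposit_private_copies(cells, charges, ngrid, vlen):
    """Each lane owns a private grid copy, so no two lanes ever write
    the same location; the copies are reduced at the end."""
    copies = [[0.0] * ngrid for _ in range(vlen)]
    for start in range(0, len(cells), vlen):
        chunk = zip(cells[start:start + vlen], charges[start:start + vlen])
        for lane, (c, q) in enumerate(chunk):
            copies[lane][c] += q
    return [sum(column) for column in zip(*copies)]


cells = [0, 0, 1, 2, 2, 2, 3, 1]   # grid point hit by each particle
charges = [1.0] * len(cells)

exact = deposit_scalar(cells, charges, ngrid=4)
lossy = deposit_naive_vector(cells, charges, ngrid=4, vlen=4)
safe = deposit_private_copies(cells, charges, ngrid=4, vlen=4)
```

The private-copy approach trades memory for vectorizability: grid storage grows by a factor of the vector length.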

GTC Particle Decomposition

GTC originally optimized for superscalar SMPs using MPI/OpenMP
  • OpenMP achieved limited performance and severely increased memory use on vector systems
  • Vectorization and thread-level parallelism compete w/ each other

Previous vector experiments limited to only 64-way MPI parallelism
  • 64 is the optimal number of domains for the 1D toroidal decomposition (independent of # particles)

New GTC version introduces a third level of parallelism:
  • Algorithm splits particles between several processors (within 1D domain)
  • Allows increased concurrency and number of studied particles

Larger particle simulations allow increased-resolution studies
  • Particles not subject to Courant condition (same timestep)
  • Allows multiple species calculations
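The particle decomposition described above can be pictured as a two-level rank layout: 64 toroidal domains, each shared by several processors that hold different subsets of the particles. The sketch below is one possible mapping, not GTC's actual one; the function names and the choice to place the processors of one domain on consecutive ranks are assumptions.

```python
def decompose(rank, nprocs, ntoroidal=64):
    """Map an MPI rank to (toroidal domain, particle group).

    Assumption: nprocs is a multiple of ntoroidal and the ranks sharing
    a toroidal domain are consecutive.
    """
    if nprocs % ntoroidal != 0:
        raise ValueError("nprocs must be a multiple of the domain count")
    groups_per_domain = nprocs // ntoroidal
    return rank // groups_per_domain, rank % groups_per_domain


def split_particles(nparticles, ngroups):
    """Divide the particles of one domain as evenly as possible."""
    base, extra = divmod(nparticles, ngroups)
    return [base + (1 if g < extra else 0) for g in range(ngroups)]


# With 4096 processors each of the 64 toroidal domains is shared by 64 ranks.
groups_at_4096 = 4096 // 64
first = decompose(0, 128)
last = decompose(127, 128)
shares = split_particles(1000, 3)
```

Because the number of toroidal domains stays fixed at 64, added processors only increase the number of particle groups, which is how the new version reaches concurrencies beyond 64.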

GTC: Performance

New decomposition algorithm efficiently utilizes high P (as opposed to 64 on ES)

Breakthrough of Tflop barrier on ES for important SciDAC code
  • 7.2 Tflop/s on 4096 processors
  • SX8 highest raw performance (ever) but lower efficiency than ES
  • Opens possibility of new set of high phase-space-resolution simulations

Scalar architectures suffer from low computational intensity, irregular data access, and register spilling

Opteron/IB is 50% faster than Itanium2/Quadrics and only 1/2 speed of X1
  • Opteron: on-chip memory controller and caching of FP data in L1

X1 suffers from overhead of scalar code portions

Original (unmodified) X1 version performed 12% *slower* on X1E
  • Recent additional optimizations increased performance by 50%!

Chosen as HPCS benchmark

                 Power3       Itanium2     Opteron      X1           X1E          SX6          SX8
                 Seaborg      Thunder      Jacquard     Phoenix      Phoenix      ES           HLRS
P     Part/Cell  GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk
128   200        0.14   9%    0.39   7%    0.59   13%   1.2    9%    1.7    10%   1.9    23%   2.3    14%
256   400        0.14   9%    0.39   7%    0.57   13%   1.2    9%    1.7    10%   1.8    22%   2.3    15%
512   800        0.14   9%    0.38   7%    0.51   12%   -      -     1.7    9%    1.8    22%   -      -
1024  1600       0.14   9%    0.37   7%    -      -     -      -     -      -     1.8    22%   -      -

(- = not reported)

Plasma Physics: LBMHD

LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)

Performs 2D/3D simulation of high-temperature plasma

Evolves from initial conditions, decaying to form current sheets

Spatial grid coupled to octagonal streaming lattice

Block distributed over processor grid

Main computational components: Collision, Stream, Interpolation

Vectorization: loop interchange, unrolling

Developed by George Vahala’s group (College of William & Mary), ported by Jonathan Carter

Evolution of vorticity into turbulent structures
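Loop interchange, listed above as a vectorization step, can be illustrated on a lattice Boltzmann stream step. The sketch below is a stand-in and not LBMHD code: it uses the standard nine-velocity square lattice (D2Q9) with periodic boundaries instead of the octagonal streaming lattice, so no interpolation is needed, and every name in it is hypothetical. Both functions compute the same result; they differ only in which loop is innermost.

```python
# D2Q9 velocity set (assumption: square-lattice stand-in).
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]


def empty_like(nx, ny):
    return [[[0.0] * ny for _ in range(nx)] for _ in VELOCITIES]


def stream_directions_inner(f, nx, ny):
    """Grid loops outside, short 9-iteration direction loop innermost."""
    out = empty_like(nx, ny)
    for x in range(nx):
        for y in range(ny):
            for d, (cx, cy) in enumerate(VELOCITIES):
                out[d][(x + cx) % nx][(y + cy) % ny] = f[d][x][y]
    return out


def stream_grid_inner(f, nx, ny):
    """After interchange: direction loop outside, the innermost loop
    sweeps a whole grid line with independent iterations."""
    out = empty_like(nx, ny)
    for d, (cx, cy) in enumerate(VELOCITIES):
        for x in range(nx):
            src, dst = f[d][x], out[d][(x + cx) % nx]
            for y in range(ny):
                dst[(y + cy) % ny] = src[y]
    return out


nx, ny = 4, 3
f = [[[float(100 * d + 10 * x + y) for y in range(ny)]
      for x in range(nx)] for d in range(len(VELOCITIES))]
before = stream_directions_inner(f, nx, ny)
after = stream_grid_inner(f, nx, ny)
```

The interchanged form gives the innermost loop a trip count equal to the grid extent rather than 9, which is the property a vectorizing compiler needs.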

LBMHD-3D: Performance

                 Power3       Itanium2     Opteron      X1           X1E          SX6          SX8
                 Seaborg      Thunder      Jacquard     Phoenix      Phoenix      ES           HLRS
Grid Size  P     GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk
256^3      16    0.14   9%    0.26   5%    0.70   16%   5.2    41%   6.6    37%   5.5    69%   7.9    49%
512^3      64    0.15   9%    0.35   6%    0.68   15%   5.2    41%   5.8    32%   5.3    66%   8.1    51%
1024^3     256   0.14   9%    0.32   6%    0.60   14%   5.2    41%   6.0    33%   5.5    68%   9.6    60%
2048^3     512   0.14   9%    0.35   6%    0.59   13%   -      -     5.8    32%   5.2    65%   -      -

(- = not reported)

Not unusual to see vector achieve > 40% of peak while superscalar architectures achieve < 10%

There exists plenty of computation; however, the large working set causes register spilling on scalar systems

Opteron shows impressive superscalar performance, 2X speed vs. Itanium2
  • Opteron has >2x STREAM BW, and Itanium2 cannot store FP data in L1 cache

Large vector register sets hide latency

ES sustains 68% of peak up to 4800 processors: 26 Tflop/s, by far the highest performance ever attained for this code

SX8 shows highest raw performance, but lags behind ES in terms of efficiency
  • SX8: commodity DDR2-SDRAM vs. ES: high-performance custom FPLRAM

X1E achieved same performance as X1 using original code version
  • Turning off caching resulted in about 10% improvement over X1
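The "%pk" columns and the aggregate figure quoted above follow from two simple ratios. The sketch below recomputes them for the ES, using the 8.0 GFlop/s per-processor peak from the architectural comparison; the function names are illustrative, and small differences from the printed percentages come from rounding in the tabulated GFs/P values.

```python
ES_PEAK_GFLOPS = 8.0  # per-processor peak from the architecture table


def percent_of_peak(gflops_per_proc, peak_gflops):
    """Sustained rate as a percentage of the per-processor peak."""
    return 100.0 * gflops_per_proc / peak_gflops


def aggregate_tflops(gflops_per_proc, nprocs):
    """Total sustained rate across all processors, in Tflop/s."""
    return gflops_per_proc * nprocs / 1000.0


# 26 Tflop/s on 4800 ES processors, as quoted on this slide.
per_proc = 26.0 * 1000.0 / 4800
es_efficiency = percent_of_peak(per_proc, ES_PEAK_GFLOPS)
es_aggregate_peak = aggregate_tflops(ES_PEAK_GFLOPS, 4800)

print(f"{per_proc:.2f} GFlop/s per processor, {es_efficiency:.0f}% of peak")
print(f"aggregate peak on 4800 processors: {es_aggregate_peak:.1f} Tflop/s")
```

The 26 Tflop/s figure corresponds to roughly 5.4 GFlop/s per processor against an aggregate peak of 38.4 Tflop/s, in line with the 68% quoted.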

Material Science: PARATEC

PARATEC performs first-principles quantum mechanical total energy calculation using pseudopotentials & plane wave basis set

Density Functional Theory to calculate structure & electronic properties of new materials

DFT calculations are one of the largest consumers of supercomputer cycles in the world

33% 3D FFT, 33% BLAS3, 33% hand-coded F90

Part of calculation in real space, other in Fourier space
  • Uses specialized 3D FFT to transform wavefunction

Figure: Conduction band minimum electron state for CdSe quantum dot

Developed by Andrew Canning with Louie and Cohen’s groups (UCB, LBNL)

PARATEC: Performance

All architectures generally perform well due to computational intensity of code (BLAS3, FFT)

ES achieves highest overall performance to date: 5.5 Tflop/s on 2048 procs
  • Main ES advantage for this code is fast interconnect
  • Allows never-before-possible, high-resolution simulations
  • Qdot: largest cell-size atomistic experiment ever run using PARATEC

SX8 achieves highest per-processor performance

X1/X1E shows lowest % of peak
  • Non-vectorizable code much more expensive on X1/X1E (32:1)
  • Lower bisection bandwidth to computational ratio (4D-hypercube)
  • X1 performance is comparable to Itanium2

Itanium2 outperforms Opteron (unlike LBMHD/GTC) because
  • PARATEC less sensitive to memory access issues (BLAS3)
  • Opteron lacks FMA unit
  • Quadrics shows better scaling of all-to-all at large concurrencies

Problem: 488 Atom CdSe Quantum Dot

      Power3       Itanium2     Opteron      X1           X1E          SX6          SX8
      Seaborg      Thunder      Jacquard     Phoenix      Phoenix      ES           HLRS
P     GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk   GFs/P  %pk
128   0.93   62%   2.8    51%   -      -     3.2    25%   3.8    21%   5.1    64%   7.5    49%
256   0.85   57%   2.6    47%   2.0    45%   3.0    24%   3.3    18%   5.0    62%   6.8    43%
512   0.73   49%   2.4    44%   1.0    22%   -      -     2.2    12%   4.4    55%   -      -
1024  0.60   40%   1.8    32%   -      -     -      -     -      -     3.6    46%   -      -

(- = not reported)

Performance Overview

Tremendous potential of vector systems: unprecedented aggregate performance
  • >4500x simulation speedup of FVCAM on 672 processors of X1E
  • New GTC decomposition algorithm achieves 7.2 TF/s on 4096 ES processors
  • LBMHD-3D achieves 26 TF/s using 4800 ES procs (68% of peak), GB finalist
  • PARATEC achieves 5.5 TF/s on 2048 processors of ES

ES highest efficiency, SX8 achieves highest raw performance (X1E for FVCAM)

X1E has faster absolute performance than X1, but lower sustained % of peak
  • SSP vs MSP experiments: tradeoffs between computational granularity and scalar work

Opteron vs Itanium2
  • Opteron faster for GTC, LBMHD: low CI, register spilling, irregular memory access
  • Itanium2 faster for PARATEC: high CI, FMA support, all-to-all on Quadrics

Future: sparse, unstructured, AMR codes on latest Power5, BG/*, XT3

Figure: Speedup vs. ES (0.0 to 1.8) by application (FVCAM, GTC, LBMHD3D, PARATEC) for Power3, Itanium2, Opteron, ES, SX-8, X1, X1E.

Figure: % of Theoretical Peak (0% to 70%) by application (FVCAM, GTC, LBMHD3D, PARATEC) for Power3, Itanium2, Opteron, ES, SX-8, X1, X1E.
