Lattice QCD and GPU-s Robert Edwards, Theory Group Chip Watson, HPC & CIO Jie Chen & Balint Joo, HPC Jefferson Lab

Dec 27, 2015

Blaise York

Lattice QCD and GPU-s

Robert Edwards, Theory Group
Chip Watson, HPC & CIO
Jie Chen & Balint Joo, HPC
Jefferson Lab

Outline

Will describe how:

• Capability computing + Capacity computing + SciDAC
  – Deliver science & NP milestones

• Collaborative efforts involve USQCD + JLab & DOE+NSF user communities

Hadronic & Nuclear Physics with LQCD

• Hadronic spectroscopy
  – Hadron resonance determinations
  – Exotic meson spectrum (JLab 12GeV)

• Hadronic structure
  – 3-D picture of hadrons from gluon & quark spin+flavor distributions
  – Ground & excited E&M transition form-factors (JLab 6GeV+12GeV+Mainz)
  – E&M polarizabilities of hadrons (Duke+CERN+Lund)

• Nuclear interactions
  – Nuclear processes relevant for stellar evolution
  – Hyperon-hyperon scattering
  – 3 & 4 nucleon interaction properties [Collab. w/LLNL] (JLab+LLNL)

• Beyond the Standard Model
  – Neutron decay constraints on BSM from Ultra Cold Neutron source (LANL)

Bridges in Nuclear Physics

NP Exascale

Spectroscopy

Spectroscopy reveals fundamental aspects of hadronic physics
  – Essential degrees of freedom?
  – Gluonic excitations in mesons - exotic states of matter?

• Status
  – Can extract excited hadron energies & identify spins
  – Pursuing full QCD calculations with realistic quark masses

• New spectroscopy programs world-wide
  – E.g., BES III (Beijing), GSI/Panda (Darmstadt)
  – Crucial complement to 12 GeV program at JLab

• Excited nucleon spectroscopy (JLab)
• JLab GlueX: search for gluonic excitations

USQCD National Effort

US Lattice QCD effort: Jefferson Laboratory, BNL and FNAL

• FNAL: Weak matrix elements
• BNL: RHIC Physics
• JLAB: Hadronic Physics

SciDAC – R&D Vehicle: Software R&D

INCITE resources (~20 TF-yr) + USQCD cluster facilities (17 TF-yr): Impact on DOE’s High Energy & Nuclear Physics Program

QCD: Theory of Strong Interactions

• QCD: theory of quarks & gluons
• Lattice QCD: approximate with a grid
  – Systematically improvable
• Gluon (Gauge) generation:
  – “Configurations” via importance sampling
  – Rewrite as diff. eqns. – sparse matrix solve per step – avoid “determinant” problem
• Analysis:
  – Compute observables via averages over configurations
• Requires large scale computing resources
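
As a rough illustration of the two steps above (generate configurations by importance sampling, then average observables over them), here is a minimal Python sketch. It is not lattice QCD: a single real variable with the hypothetical action S(x) = x^2/2 stands in for the gauge field, and plain Metropolis updates are swapped in for the differential-equation-based update the slide refers to. All function names and parameters are made up for this example.

```python
import math
import random


def metropolis_configs(action, n_configs, step=1.5, n_thermalize=500, seed=0):
    """Sample values x with probability proportional to exp(-action(x))."""
    rng = random.Random(seed)
    x = 0.0
    configs = []
    for sweep in range(n_thermalize + n_configs):
        trial = x + rng.uniform(-step, step)
        delta = action(trial) - action(x)
        # Metropolis test: accept with probability min(1, exp(-delta)).
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            x = trial
        # Keep one "configuration" per update once thermalization is over.
        if sweep >= n_thermalize:
            configs.append(x)
    return configs


def ensemble_average(observable, configs):
    """Estimate <O> as a plain average over the sampled configurations."""
    return sum(observable(c) for c in configs) / len(configs)


def toy_action(x):
    # Hypothetical one-variable action; a real calculation uses the QCD action.
    return 0.5 * x * x


configs = metropolis_configs(toy_action, n_configs=20000)
x2 = ensemble_average(lambda x: x * x, configs)
print(f"<x^2> = {x2:.3f} (exact value for this toy action is 1)")
```

The split into a generation function and an averaging function mirrors the slide's separation of gauge generation from analysis: the stored configurations can be reused for any number of observables.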

Gauge Generation: Cost Scaling

• Cost: reasonable statistics, box size and “physical” pion mass
• Extrapolate in lattice spacings: 10 ~ 100 PF-yr

[Plot: cost in PF-years. Labels: State-of-Art; Today, 10 TF-yr; 2011 (100 TF-yr)]

Typical LQCD Workflow

Generate the configurations
• Leadership level
• 24k cores, 10 TF-yr
• Few big jobs
• Few big files

t=0 t=T

Analyze
• Typically mid-range level
• 256 cores
• Many small jobs
• Many big files
• I/O movement

Extract
• Extract information from measured observables

Computational Requirements

Gauge generation (INCITE) : Analysis (LQCD)

Current calculations
• Weak matrix elements: 1 : 1
• Baryon spectroscopy: 1 : 10
• Nuclear structure: 1 : 4

Computational Requirements: INCITE : LQCD Computing
1 : 1 (2005)
1 : 3 (2010)

Current availability: INCITE (~20 TF) : LQCD (17 TF)

Core work: solve sparse matrix equation (iteratively)
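
To make "solve sparse matrix equation (iteratively)" concrete, here is a minimal conjugate-gradient sketch in Python. CG is one common iterative method; the slides do not say which solver is used. The operator is an assumption: a symmetric positive-definite nearest-neighbour stencil on a periodic 1-D lattice stands in for the Dirac matrix, and the function names, `mass_sq`, and the lattice size are all illustrative.

```python
def apply_operator(x, mass_sq):
    """Periodic 1-D stencil: (m^2 + 2) x_i - x_{i-1} - x_{i+1}."""
    n = len(x)
    return [(mass_sq + 2.0) * x[i] - x[i - 1] - x[(i + 1) % n] for i in range(n)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(b, mass_sq, tol=1e-10, max_iter=1000):
    """Solve A x = b without ever storing A; returns (x, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the starting guess x = 0
    p = list(r)          # search direction
    rr = dot(r, r)
    target = tol * tol * dot(b, b)
    for iteration in range(1, max_iter + 1):
        ap = apply_operator(p, mass_sq)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        if rr_new <= target:
            return x, iteration
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# Point source on a 64-site lattice, analogous to a propagator calculation.
source = [0.0] * 64
source[0] = 1.0
solution, iterations = conjugate_gradient(source, mass_sq=0.1)
residual = [bi - ai for bi, ai in zip(source, apply_operator(solution, 0.1))]
print(iterations, dot(residual, residual) ** 0.5)
```

The solver only needs a routine that applies the operator to a vector, so the matrix itself is never stored. Applying the operator moves a lot of data per floating-point operation, which is why memory bandwidth matters so much for this kind of work.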

SciDAC Impact

• Software development
  – QCD friendly API’s and libraries: enables high user productivity
  – Allows rapid prototyping & optimization
  – Significant software effort for GPU-s

• Algorithm improvements
  – Operators & contractions: clusters (Distillation: PRL (2009))
  – Mixed-precision Dirac-solvers: INCITE+clusters+GPU-s, 2-3X
  – Adaptive multi-grid solvers: clusters, ~8X (?)

• Hardware development via USQCD Facilities
  – Adding support for new hardware
  – GPU-s
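
The mixed-precision idea listed above can be sketched as a defect-correction loop: residuals are computed in double precision, while the bulk of the work (an approximate inner solve) runs in lower precision. The Python sketch below is a toy under stated assumptions, not the USQCD solver: the operator is a 1-D stand-in for the Dirac matrix, a simple Jacobi iteration is swapped in as the inner solver, single precision is emulated by rounding through `struct`, and all names are hypothetical.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def apply_operator(x, mass_sq):
    """Periodic 1-D stencil: (m^2 + 2) x_i - x_{i-1} - x_{i+1}."""
    n = len(x)
    return [(mass_sq + 2.0) * x[i] - x[i - 1] - x[(i + 1) % n] for i in range(n)]


def jacobi_single(rhs, mass_sq, sweeps):
    """Approximate solve of A e = rhs; every stored value is rounded to single."""
    n = len(rhs)
    diag = to_single(mass_sq + 2.0)
    rhs = [to_single(v) for v in rhs]
    e = [0.0] * n
    for _ in range(sweeps):
        e = [to_single((rhs[i] + e[i - 1] + e[(i + 1) % n]) / diag) for i in range(n)]
    return e


def mixed_precision_solve(b, mass_sq, tol=1e-12, inner_sweeps=30, max_outer=50):
    """Return (x, number of low-precision corrections applied)."""
    x = [0.0] * len(b)
    b_norm = sum(v * v for v in b) ** 0.5
    for outer in range(1, max_outer + 1):
        # The true residual is always evaluated in double precision.
        r = [bi - ai for bi, ai in zip(b, apply_operator(x, mass_sq))]
        r_norm = sum(v * v for v in r) ** 0.5
        if r_norm <= tol * b_norm:
            return x, outer - 1
        # Cheap low-precision solve for the correction, accumulated in double.
        correction = jacobi_single(r, mass_sq, inner_sweeps)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x, max_outer


source = [0.0] * 32
source[0] = 1.0
solution, corrections = mixed_precision_solve(source, mass_sq=1.0)
residual = [bi - ai for bi, ai in zip(source, apply_operator(solution, 1.0))]
rel_residual = sum(v * v for v in residual) ** 0.5  # the source has unit norm
print(corrections, rel_residual)
```

The final accuracy is set by the double-precision residual, not by the precision of the inner solve. This is what lets hardware with fast low-precision arithmetic do most of the work.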

Modern GPU Characteristics

• Hundreds of simple cores: high flop rate
• SIMD architecture (single instruction, multiple data)
• Complex (high bandwidth) memory hierarchy
• Fast context switching -> hides memory access latency
• Gaming cards: no memory Error-Correction (ECC) – reliability issue
• I/O bandwidth << Memory bandwidth

Commodity Processors    x86 CPU                NVIDIA GT200             New Fermi GPU
#cores                  8                      240                      480
Clock speed             3.2 GHz                1.4 GHz                  1.4 GHz
Main memory bandwidth   20 GB/s                160 GB/s (gaming card)   180 GB/s (gaming card)
I/O bandwidth           7 GB/s (dual QDR IB)   3 GB/s                   4 GB/s
Power                   80 watts               200 watts                250 watts

Inverter Strong Scaling: V = 32^3 x 256

Local volume on GPU too small (I/O bottleneck)

3 Tflops

Science / Dollar for (Some) LQCD Capacity Apps

A Large Capacity Resource

530 GPUs at Jefferson Lab (July)
200,000 cores (1,600 million core hours / year)
600 Tflops peak single precision
100 Tflops aggregate sustained in the inverter (mixed half / single precision)
Significant increase in dedicated USQCD resources

All this for only $1M with hosts, networking, etc.

Disclaimer:
• To exploit this performance, code has to be run on the GPUs, not the CPU (Amdahl’s Law problem).
• SciDAC-2 (& 3) software effort: move more inverters & other code to GPU

New Science Reach in 2010-2011

QCD Spectrum

• Gauge generation: (next dataset)
  – INCITE: Crays & BG/P-s, ~ 16K – 24K cores
  – Double precision

• Analysis (existing dataset): two-classes
  – Propagators (Dirac matrix inversions)
    • Few GPU level
    • Single + half precision
    • No memory error-correction
  – Contractions:
    • Clusters: few cores
    • Double precision + large memory footprint

Cost (TF-yr)

New: 10 TF-yr
Old: 1 TF-yr

10 TF-yr

1 TF-yr

Isovector Meson Spectrum

Isovector Meson Spectrum

Exotic matter?

Can we observe exotic matter? Excited string

• QED

• QCD

Exotic matter

Exotics: world summary

Exotic matter

Suggests (many) exotics within range of JLab Hall D

Previous work: photo-production rates high

Current GPU work: (strong) decays - important experimental input

Exotics: first GPU results

Baryon Spectrum

“Missing resonance problem”
• What are collective modes?
• What is the structure of the states?
  – Major focus of (and motivation for) JLab Hall B
  – Not resolved experimentally @ 6GeV

Nucleon & Delta Spectrum

First results from GPU-s

< 2% error bars

Plot labels: [56,2+] D-wave and [70,1-] P-wave

Discern structure: wave-function overlaps

Change at light quark mass? Decays!

Suggests spectrum at least as dense as quark model

Towards resonance determinations

• Augment with multi-particle operators
  – Needs “annihilation diagrams” – provided by Distillation
  – Ideally suited for (GPU-s)

• Resonance determination
  – Scattering in a finite box – discrete energy levels
  – Lüscher finite volume techniques
  – Phase shifts -> Width

• First results (partially from GPU-s)
  – Seems practical

arxiv:0905.2160

Phase Shifts: demonstration

Extending science reach

• USQCD:
  – Next calculations: physical quark masses: 100 TF – 1 PF-yr
  – New INCITE+Early Science application (ANL+ORNL+NERSC)
  – NSF Blue Waters Petascale (PRAC)

• Need SciDAC-3
  – Significant software effort for next generation GPU-s & heterogeneous environments
  – Participate in emerging ASCR Exascale initiatives

• INCITE + LQCD synergy:
  – ARRA GPU system well matched to current leadership facilities

Path to Exascale

• Enabled by some hybrid GPU system?
  – Cray + Nvidia ??

• NSF GaTech: Tier 2 (experimental facility)
  – Phase 1: HP cluster+GPU (Nvidia Tesla)
  – Phase 2: hybrid GPU+<partner>

• ASCR Exascale facility
  – Case studies for Science, Software+Runtime, Hardware

• Exascale capacity resource will be needed

Summary

Capability + Capacity + SciDAC – Deliver science & HEP+NP milestones

Petascale (leadership) + Petascale (capacity) + SciDAC-3
  Spectrum + decays
  First contact with experimental resolution

Exascale (leadership) + Exascale (capacity) + SciDAC-3
  Full resolution
  Spectrum + transitions
  Nuclear structure

Collaborative efforts: USQCD + JLab user communities

Backup slides

• The end

JLab ARRA: Phase 1

JLab ARRA: Phase 2

Hardware: ARRA GPU Cluster

• Host:
  • 2.4 GHz Nehalem
  • 48 GB memory / node
  • 65 nodes, 200 GPUs

• Original configuration:
  • 40 nodes w/ 4 GTX-285 GPUs
  • 16 nodes w/ 2 GTX-285 + QDR IB
  • 2 nodes w/ 4 Tesla C1050 or S1070

• One quad GPU node = one rack of conventional nodes

SciDAC Software Stack

QCD friendly API’s/libs

• http://www.usqcd.org

Architectural level (Data parallel)

High-level (lapack-like)

GPU-s

Application level

Dirac Inverter with Parallel GPU-s

Divide problem among nodes:

• Trade-offs
  – On-node vs off-node bandwidths
  – Locality vs memory bandwidth

• Efficient at large problem size per node
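
One way to see why a large problem size per node helps: when the lattice is divided among nodes, the sites on cut faces must be exchanged over the slower off-node link, and their share of the local volume grows as the local sub-lattice shrinks. The Python sketch below computes that surface-to-volume fraction. The 32^3 x 256 lattice is the one on the strong-scaling slide; the node grids and function names are hypothetical choices for illustration.

```python
import math


def local_dims(global_dims, grid):
    """Sub-lattice size per node when direction i is cut into grid[i] pieces."""
    if any(g % p for g, p in zip(global_dims, grid)):
        raise ValueError("each grid entry must divide the lattice extent")
    return tuple(g // p for g, p in zip(global_dims, grid))


def halo_fraction(global_dims, grid):
    """Boundary-face sites (two faces per cut direction) divided by local volume."""
    local = local_dims(global_dims, grid)
    volume = math.prod(local)
    halo_sites = sum(
        2 * (volume // extent)
        for extent, pieces in zip(local, grid)
        if pieces > 1  # uncut directions need no off-node communication
    )
    return halo_sites / volume


lattice = (32, 32, 32, 256)
for grid in [(1, 1, 1, 4), (1, 1, 1, 64), (2, 2, 2, 8)]:
    print(grid, local_dims(lattice, grid), halo_fraction(lattice, grid))
```

Cutting only the long direction into 4 pieces leaves about 3% of the local sites on a boundary. Cutting it into 64 pieces raises that to 50%, at which point off-node bandwidth rather than GPU memory bandwidth is likely to set the pace.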

Amdahl’s Law (Problem)

• A major challenge in exploiting GPUs is Amdahl’s Law:
  – If 60% of the code is GPU accelerated by 6x, the net gain is only 2x.

Also disappointing: the GPU is idle 80% of the time!

Conclusion: need to move more code to the GPU, and/or need task level parallelism (overlap CPU and GPU)

Jefferson Lab has split this workload into two jobs (red and black), for 2 machines (conventional, GPU)

• 2x clock time improvement
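
The 60% / 6x example follows directly from Amdahl's Law, speedup = 1 / ((1 - f) + f / s), where f is the accelerated fraction of the run time and s the speedup of that fraction. A short Python check is below. The function names are made up here, and the idle-time estimate assumes the CPU and GPU parts run strictly one after the other, with no overlap.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


def accelerator_idle_fraction(accelerated_fraction, kernel_speedup):
    """Share of the new wall-clock time the GPU spends waiting for CPU-only code."""
    cpu_time = 1.0 - accelerated_fraction
    gpu_time = accelerated_fraction / kernel_speedup
    return cpu_time / (cpu_time + gpu_time)


print(f"{amdahl_speedup(0.6, 6.0):.2f}")             # prints 2.00
print(f"{accelerator_idle_fraction(0.6, 6.0):.0%}")  # prints 80%
```

Under these assumptions the same two numbers reproduce both slide statements: a net gain of 2x, and a GPU that sits idle 80% of the time unless more code moves to it or CPU and GPU work are overlapped.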

Considerable Software R&D is Needed

Up until now: O/S & RTS form a 'thin layer' between Application & H/W

Stack, bottom to top:
• Hardware
• Device Drivers
• Linux or μKernel
• RTS, MPI (?)
• User Application Space

Exascale X-Stack (?)

Stack, bottom to top:
• Hardware
• Device Drivers
• Power, RAS, Memory
• RTS (scheduling, load balancing, work stealing, program model coexistence), MPI (?)
• Programming Model (hybrid MPI + node parallelism, PGAS? Chapel?)
• Libraries (BLAS, PETSc, Trilinos...)
• User Application Space

Need SciDAC-3 to move to Exascale

SciDAC software stack:
• Application Layer: Chroma, CPS, MILC
• Level 3: Optimization: MDWF, Dirac Operators, QOP
• Level 2: Data Parallel: QDP++, QDP/C, QIO
• Level 1: Basics: QLA, QMP Message Passing, QMT Threads
• QA0, GCC-BGL, Workflow, Viz. Tools
• + tools from collaborations with other SciDAC projects e.g. PERI

Need SciDAC-3

• Application porting to new programming models/languages
  – Node abstraction – portability (like QDP++ now?)
  – Interactions with more restrictive (liberating?) exascale stack?
• Performance libraries for Exascale hardware
  – like level 3 currently
  – will need productivity tools
• Domain Specific Languages (QDP++ is almost this)
• Code Generators (More QA0, BAGEL etc)
• Performance monitoring
• Debugging, Simulation
• Algorithms for greater concurrency/reduced synchronization

NP Exascale Report