Page 1: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

CS 267 Applications of Parallel Computers

Supercomputing: The Past and Future

Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_s07

Page 2: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Outline

• Historical perspective (1985 to 2005) from Horst Simon

• Recent past: what’s new in 2007

• Major challenges and opportunities for the future

Slide source: Horst Simon

Page 3: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Signpost System 1985

Cray-2

• 244 MHz (4.1 nsec)

• 4 processors

• 1.95 Gflop/s peak

• 2 GB memory (256 MW)

• 1.2 Gflop/s LINPACK R_max

• 1.6 m2 floor space

• 0.2 MW power

Slide source: Horst Simon

Page 4: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Signpost System in 2005

IBM BG/L @ LLNL

• 700 MHz (x 2.86)

• 65,536 nodes (x 16,384)

• 180 (360) Tflop/s peak (x 92,307)

• 32 TB memory (x 16,000)

• 135 Tflop/s LINPACK (x 110,000)

• 250 m2 floor space (x 156)

• 1.8 MW power (x 9)

Slide source: Horst Simon

Page 5: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

1985 versus 2005

1985:
• custom-built vector mainframe platforms
• 30 Mflop/s sustained is good performance
• vector Fortran
• proprietary operating system
• remote batch only
• no visualization
• no tools, hand tuning only
• dumb terminals
• remote access via 9600 baud
• single software developer develops and codes everything
• serial, vectorized algorithms

2005:
• commodity massively parallel platforms
• 1 Tflop/s sustained is good performance
• Fortran/C with MPI, object orientation
• Unix, Linux
• interactive use
• visualization
• parallel debugger, development tools
• high-performance desktop
• remote access via 10 Gb/s; grid tools
• large group-developed software, code sharing and reuse
• parallel algorithms

Slide source: Horst Simon

Page 6: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

The Top 10 Major Accomplishments in Supercomputing in the Past 20 Years

• Horst Simon’s list from 2005
• Selected by “impact” and “change in perspective”

10) The TOP500 list
9) NAS Parallel Benchmarks
8) The “grid”
7) Hierarchical algorithms: multigrid and fast multipole
6) HPCC initiative and Grand Challenge applications
5) Attack of the killer micros

Slide source: Horst Simon

Page 7: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#10) TOP500

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax = b, dense problem); TPP performance characterized by rate and problem size (a small illustration follows this list)
- Updated twice a year: ISC'xy in Germany in June xy, SC'xy in the USA in November xy
- All data available from www.top500.org
- Good and bad effects of this list/competition
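As a small illustration of the yardstick, the sketch below times a dense solve of Ax = b with NumPy/LAPACK on a single node and converts it to a flop rate using the operation count HPL credits. The problem size and single-node setting are assumptions for illustration only; this is not the distributed HPL code used for actual TOP500 submissions.

```python
# Hedged sketch of the TOP500 yardstick: time a dense solve of Ax = b and
# convert it to a flop rate, the way Rmax is derived from LINPACK/HPL runs.
# Single-node NumPy/LAPACK illustration only; n is an arbitrary choice.
import time
import numpy as np

n = 4000                                   # assumed problem size (real Nmax is far larger)
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.time()
x = np.linalg.solve(A, b)                  # LU factorization plus triangular solves
elapsed = time.time() - t0

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2    # operation count HPL credits for dense Ax = b
print(f"n = {n}: {elapsed:.2f} s, {flops / elapsed / 1e9:.1f} Gflop/s")
print("scaled residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```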

Page 8: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

TOP500 list - Data shown

• Manufacturer: manufacturer or vendor
• Computer: type indicated by manufacturer or vendor
• Installation Site: customer
• Location: location and country
• Year: year of installation/last major update
• Customer Segment: academic, research, industry, vendor, classified
• # Processors: number of processors
• Rmax: maximal LINPACK performance achieved
• Rpeak: theoretical peak performance
• Nmax: problem size for achieving Rmax
• N1/2: problem size for achieving half of Rmax
• Nworld: position within the TOP500 ranking

Page 9: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future
Page 10: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future
Page 11: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

[Chart: TOP500 performance development from 1993 to 2014 on a log scale from 10 Mflop/s to 1 Eflop/s, showing the Sum, #1, and #500 lines. Annotations: “Petaflop with ~1M cores by 2008”, “1 PFlop system in 2008”, a 6-8 year lag between #1 and #500, and “Common by 2015?”. Data from top500.org.]

Slide source: Horst Simon

Page 12: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Petaflop with ~1M Cores in your PC by 2025?

Page 13: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#4 Beowulf Clusters

• Thomas Sterling et al. established vision of low cost, high end computing

• Demonstrated effectiveness of PC clusters for some (not all) classes of applications

• Provided software and conveyed findings to broad community (great PR) through tutorials and book (1999)

• Made parallel computing accessible to large community worldwide; broadened and democratized HPC; increased demand for HPC

• However, effectively stopped HPC architecture innovation for at least a decade; narrower market for commodity systems

Slide source: Horst Simon

Page 14: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#3 Scientific Visualization

• NSF Report, “Visualization in Scientific Computing,” established the field in 1987 (edited by B.H. McCormick, T.A. DeFanti, and M.D. Brown)

• Change in point of view: transformed computer graphics from a technology driven subfield of computer science into a medium for communication

• Added artistic element

• The role of visualization is “to reveal concepts that are otherwise invisible” (Krzysztof Lenk)

Slide source: Horst Simon

Page 15: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Before Scientific Visualization (1985)

Computer graphics typical of the time:
– 2 dimensional
– line drawings
– black and white
– “vectors” used to display vector field

Images from a CFD report at Boeing (1985).

Slide source: Horst Simon

Page 16: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

After scientific visualization (1992)

The impact of scientific visualization seven years later:
– 3 dimensional
– use of “ribbons” and “tracers” to visualize flow field
– color used to characterize updraft and downdraft

Images from “Supercomputing and the Transformation of Science” by Kauffman and Smarr, 1992; visualization by NCSA; simulation by Bob Wilhelmson, NCSA

Slide source: Horst Simon

Page 17: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#2 Message Passing Interface (MPI)

MPI

Slide source: Horst Simon

Page 18: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Parallel Programming 1988

• At the 1988 “Salishan” conference there was a bake-off of parallel programming languages trying to solve five scientific problems

• The “Salishan Problems” (ed. John Feo, published 1992) investigated four programming languages:
– Sisal, Haskell, Unity, LGDF

• Significant research activity at the time

• The early work on parallel languages is all but forgotten today

Slide source: Horst Simon

Page 19: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Parallel Programming 1990

• The availability of real parallel machines moved the discussion from the domain of theoretical CS to the pragmatic application area

• In this presentation (ca. 1990) Jack Dongarra lists six approaches to parallel processing

• Note that message passing libraries are a sub-item on 2)

Slide source: Horst Simon

Page 20: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Parallel Programming 1994

Page 21: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#1 Scaled Speed-Up

Page 22: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

The argument against massive parallelism (ca. 1988)

Slide source: Horst Simon

Amdahl’s Law: speed = base_speed / ((1 - f) + f/nprocs)

              infinitely parallel    Cray YMP
base_speed            0.1               2.4
nprocs              infinity             8

Then speed(infinitely parallel) > speed(Cray) only if f > .994
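A quick numerical check of this crossover, as a minimal sketch using the slide's numbers (base speeds 0.1 and 2.4, 8 Cray processors, and a very large processor count standing in for "infinitely parallel"):

```python
# Numerical check of the crossover above, using Amdahl's Law
#   speed = base_speed / ((1 - f) + f / nprocs)
# with the slide's numbers: the "infinitely parallel" machine has base_speed 0.1
# (a huge processor count stands in for infinity); the Cray YMP has base_speed 2.4
# and nprocs = 8.
def amdahl(base_speed, f, nprocs):
    return base_speed / ((1.0 - f) + f / nprocs)

f = 0.0
while f < 1.0 and amdahl(0.1, f, 1e12) <= amdahl(2.4, f, 8):
    f += 0.0001
print(f"massively parallel machine wins only for f > {f:.4f}")   # ~0.9946
```

The scan lands at f of roughly 0.9946, matching the slide's threshold of .994.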

Page 23: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Challenges for the Future

• Petascale computing

• Multicore and the memory wall

• Performance understanding at scale

• Topology-sensitive interconnects

• Programming models for the masses

Page 24: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Application Status in 2005

• A few Teraflop/s sustained performance

• Scaled to 512 - 1024 processors

Parallel job size at NERSC

Page 25: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

How to Waste Machine $

2) Use a programming model in which you can’t utilize bandwidth or “low” latency

[Chart: 8-byte roundtrip latency (usec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

[Chart: flood bandwidth for 4 KB messages, as a percentage of hardware peak, for MPI vs. GASNet on the same six networks.]
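For concreteness, here is a minimal sketch of the kind of measurement behind the latency chart above: an 8-byte MPI ping-pong between two ranks. Writing it with mpi4py is an assumption for illustration; the original numbers came from compiled test codes, and the GASNet put+sync side uses a different API that is not shown here.

```python
# Hedged sketch of the MPI side of the latency measurement above: an 8-byte
# ping-pong between ranks 0 and 1, written with mpi4py.
# Run with, e.g.:  mpirun -n 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype=np.float64)        # one 8-byte payload
reps = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)      # ping
        comm.Recv(buf, source=1, tag=0)    # pong
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
t1 = MPI.Wtime()

if rank == 0:
    print(f"8-byte roundtrip latency: {(t1 - t0) / reps * 1e6:.2f} usec")
```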

Page 26: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Integrated Performance Monitoring (IPM)

• brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application

• maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage

• open source, relies on portable software technologies and is scalable to thousands of tasks

• developed by David Skinner at NERSC (see http://www.nersc.gov/projects/ipm/ )
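To make the "fixed memory footprint via hashing" point concrete, here is a toy sketch of the general idea: per-event statistics are aggregated into a fixed-size, linearly probed table keyed by (call, byte count, partner rank). This illustrates the approach only; it is not IPM's actual data structure, and all names below are made up for the example.

```python
# Toy sketch of the fixed-footprint idea described above: event statistics are
# aggregated into a table of fixed size keyed by a hash of (call, byte count,
# partner rank), so profiling memory does not grow with the number of events.
# This illustrates the general approach only; it is not IPM's implementation.
TABLE_SIZE = 4096
table = [None] * TABLE_SIZE                   # slots hold (key, count, total_time)

def record(call, nbytes, partner, elapsed):
    key = (call, nbytes, partner)
    start = hash(key) % TABLE_SIZE
    for probe in range(TABLE_SIZE):           # linear probing within the fixed table
        i = (start + probe) % TABLE_SIZE
        if table[i] is None or table[i][0] == key:
            count, total = (0, 0.0) if table[i] is None else table[i][1:]
            table[i] = (key, count + 1, total + elapsed)
            return
    # Table full: a real tool would coarsen keys rather than allocate more memory.

record("MPI_Send", 8, 1, 4.5e-6)
record("MPI_Send", 8, 1, 4.7e-6)
record("MPI_Recv", 8, 1, 5.1e-6)
print([entry for entry in table if entry is not None])
```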

Page 27: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Scaling Portability: Profoundly Interesting

A high-level description of the performance of the cosmology code MADCAP on four well-known architectures.

Source: David Skinner, NERSC

Page 28: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

16 Way for 4 seconds

(About 20 timestamps per second per task) × (1…4 contextual variables)

Page 29: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

64 way for 12 seconds

Page 30: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Applications on Petascale Systems will need to deal with

(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)

Three major issues:

• Scaling to 100,000 processors and multi-core processors

• Topology sensitive interconnection network

• Memory Wall

Page 31: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Even today’s machines are interconnect topology sensitive

Four (16 processor) IBM Power 3 nodes with Colony switch

Page 32: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Application Topology

1024 way MILC

1024 way MADCAP

336 way FVCAM

If the interconnect is topology sensitive,

mapping will become an issue (again)

“Characterizing Ultra-Scale Applications Communications Requirements”, by John Shalf et al., submitted to SC05
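A hedged sketch of why mapping matters on a topology-sensitive interconnect: place a 2D nearest-neighbor communication pattern (stencil-like, as in the codes above) onto a 3D torus and compare average hop counts under a naive rank-order mapping and a blocked mapping. The torus and grid sizes and both mappings are illustrative assumptions, not results from the cited study.

```python
# Hedged sketch of task placement on a topology-sensitive network: a 2D
# nearest-neighbor pattern is placed onto a 3D torus and the average hop
# count is compared for two hypothetical mappings.
import itertools

TORUS = (8, 8, 16)           # assumed torus dimensions (1024 nodes)
GRID = (32, 32)              # assumed 2D application process grid (1024 tasks)

def torus_hops(a, b, dims=TORUS):
    """Minimal hop count between two torus coordinates, with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def coords(rank, mapping):
    """Map an application rank to torus coordinates under a given mapping."""
    X, Y, _ = TORUS
    if mapping == "row_major":                 # fill the torus in plain rank order
        return (rank % X, (rank // X) % Y, rank // (X * Y))
    bx, by = rank // GRID[1], rank % GRID[1]   # "blocked": keep 2D neighbors together
    return (bx % X, by % Y, (bx // X) * (GRID[1] // Y) + by // Y)

def avg_neighbor_hops(mapping):
    gx, gy = GRID
    total = pairs = 0
    for i, j in itertools.product(range(gx), range(gy)):
        for di, dj in ((1, 0), (0, 1)):                 # right and down neighbors
            ni, nj = (i + di) % gx, (j + dj) % gy       # periodic application grid
            total += torus_hops(coords(i * gy + j, mapping),
                                coords(ni * gy + nj, mapping))
            pairs += 1
    return total / pairs

for m in ("row_major", "blocked"):
    print(f"{m}: average hops per neighbor message = {avg_neighbor_hops(m):.2f}")
```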

Page 33: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Interconnect Topology BG/L

Page 34: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Applications on Petascale Systems will need to deal with

(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)

Three major issues:

• Scaling to 100,000 processors and multi-core processors

• Topology sensitive interconnection network

• Memory Wall

Page 35: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

The Memory Wall

Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004

Page 36: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Characterizing Memory Access

[Diagram: HPCchallenge benchmarks (HPL, PTRANS, STREAM, FFT, RandomAccess) and mission partner applications plotted by the spatial and temporal locality of their memory access patterns; HPL sits at the high/high corner and RandomAccess at the low/low corner.]

Source: David Koester, MITRE
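The corners of this locality space can be probed with very small kernels. Below is a hedged sketch in NumPy of a STREAM-style triad (streaming access, high spatial locality) next to a RandomAccess-style scattered update (little locality); array sizes are arbitrary assumptions and these are not the HPC Challenge reference codes.

```python
# Hedged sketch of tiny kernels that probe the corners of the locality space.
import time
import numpy as np

N = 1 << 24                                      # ~16M doubles per array (~128 MB each)
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty(N)

t0 = time.time()
a[:] = b + 3.0 * c                               # triad: read b and c, write a
triad_s = time.time() - t0
print(f"triad: ~{3 * 8 * N / triad_s / 1e9:.1f} GB/s (counting 2 reads + 1 write)")

table = np.zeros(1 << 22, dtype=np.int64)        # 32 MB update table
idx = np.random.randint(0, table.size, size=1 << 20)
t0 = time.time()
np.add.at(table, idx, 1)                         # scattered read-modify-write updates
print(f"random updates: ~{idx.size / (time.time() - t0) / 1e6:.1f} MUP/s")
```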

Page 37: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-MAP characterizes architectures through a synthetic benchmark

[Diagram: the Apex-MAP parameter space. Temporal locality (1/re-use: 0 = high, 1 = low) vs. spatial locality (1/L: 0 = high, 1 = low), with regions labeled “HPL”, “Global Streams”, “Short indirect”, and “Small working set”.]
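A minimal sketch of such a synthetic probe, assuming a single data array, a contiguous block length L for spatial locality, and a power-law exponent alpha that concentrates re-use for temporal locality. This illustrates the idea only and is not the actual Apex-MAP benchmark; the parameterization below is my own.

```python
# Minimal sketch of a synthetic access-pattern probe in the Apex-MAP spirit.
import time
import numpy as np

M = 1 << 22                                      # data array size in words (assumed)
data = np.ones(M)

def probe(L, alpha, n_accesses=1 << 18, seed=0):
    rng = np.random.default_rng(seed)
    n_blocks = n_accesses // L
    # Larger alpha pushes start addresses toward 0, so a small working set is
    # re-used heavily; alpha = 1 gives a near-uniform (low re-use) pattern.
    starts = ((rng.random(n_blocks) ** alpha) * (M - L)).astype(np.int64)
    t0 = time.time()
    acc = 0.0
    for st in starts:
        acc += data[st:st + L].sum()             # touch one contiguous block of length L
    return (time.time() - t0) / (n_blocks * L) * 1e9   # ns per touched element

for L in (1, 16, 256, 4096):
    for alpha in (1.0, 8.0):
        print(f"L = {L:5d}, alpha = {alpha:3.1f}: {probe(L, alpha):7.2f} ns/element")
```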

Page 38: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.001

0.0100.100

1.0000.1

1.0

10.0

100.0

1000.0

Cycles

L

a

Seaborg Sequential2.00-3.00

1.00-2.00

0.00-1.00

-1.00-0.00

Page 39: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.001

0.0100.100

1.0000.10

1.00

10.00

100.00

1000.00

Cycles

L

a

Power4 Sequential2.00-3.00

1.00-2.00

0.00-1.00

-1.00-0.00

Page 40: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.00

0.010.10

1.000.10

1.00

10.00

100.00

Cycles

L

a

X1 Sequential 1.00-2.00

0.00-1.00

-1.00-0.00

Page 41: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.00

0.010.10

1.000.10

1.00

10.00

100.00

1000.00

Cycles

L

a

SX6 Sequential2.00-3.00

1.00-2.00

0.00-1.00

-1.00-0.00

Page 42: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Multicore: Is MPI Really that Bad?

[Chart: FLOP rates per core (single vs. dual core); sustained GFLOP/s per core for several application codes.]

Moving from single to dual core nearly doubles performance. The worst case is MILC, which is 40% below this doubling.

Experiments by the NERSC SDSA group (Shalf, Carter, Wasserman, et al)

Single Core vs. Dual-core AMD Opteron.

Data collected on jaguar (ORNL XT3 system)

Small pages used except for MADCAP

Page 43: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

How to Waste Machine $

1) Build a memory system in which you can’t utilize bandwidth that is there

[Chart: percent of peak arithmetic performance (0% to 40%) on Niagara, Clovertown, Opteron, Cell (PS3), and Cell (Blade), for one core, a full socket, and the full system.]

[Chart: percent of peak memory bandwidth utilized (0% to 100%) on the same platforms, for one core, a full socket, and the full system.]

Page 44: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Challenge 2010 - 2018: Developing a New Ecosystem for HPC

From the NRC Report on “The Future of Supercomputing”:

• Platforms, software, institutions, applications, and people who solve supercomputing applications can be thought of collectively as an ecosystem

• Research investment in HPC should be informed by the ecosystem point of view - progress must come on a broad front of interrelated technologies, rather than in the form of individual breakthroughs.

Pond ecosystem image from http://www.tpwd.state.tx.us/expltx/eft/txwild/pond.htm

Page 45: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Exaflop Programming?

• Start with two Exaflop apps
– One easy: if anything scales, this will
– One hard: plenty of parallelism, but it’s irregular, adaptive, asynchronous
• Rethink algorithms
– Scalability at all levels (including algorithmic)
– Reducing bandwidth (compress data structures); reducing latency requirements
• Design programming model to express this parallelism
– Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization)
• Consider spectrum of hardware possibilities
– Analyze at various levels of detail (eliminating those that are clearly infeasible)
– Early prototypes (expect 90% failures) to validate

Page 46: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Technical Challenges in Programming Models

• Open problems in language runtimes
– Virtualization: away from the SPMD model for load balance, fault tolerance, OS noise, etc.
– Resource management: thread scheduler
• What we do know how to do:
– Build systems with dynamic load balancing (Cilk) that do not respect locality
– Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component (a toy contrast of these two approaches is sketched below)
– Put the programmer in control of resources: message buffers, dynamic load balancing
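The trade-off in those last two items can be seen even in a toy setting: statically partitioned tasks finish at the speed of the unluckiest worker, while a dynamic pool keeps idle workers busy. The sketch below uses Python process pools with made-up task costs; it does not model Cilk, MPI, or UPC, and all sizes are illustrative assumptions.

```python
# Toy contrast: static partitioning vs. a dynamic task pool on uneven tasks.
import time
from concurrent.futures import ProcessPoolExecutor

def work(n):                       # a task whose cost varies widely (irregular parallelism)
    s = 0
    for i in range(n):
        s += i * i
    return s

def work_chunk(chunk):             # run a statically assigned chunk to completion
    return [work(n) for n in chunk]

TASKS = [2_000_000 if i % 8 == 0 else 20_000 for i in range(64)]   # a few heavy tasks

def static_partition(nworkers=4):
    chunks = [TASKS[i::nworkers] for i in range(nworkers)]  # all heavy tasks land on one worker
    with ProcessPoolExecutor(nworkers) as pool:
        list(pool.map(work_chunk, chunks))

def dynamic_pool(nworkers=4):
    with ProcessPoolExecutor(nworkers) as pool:
        list(pool.map(work, TASKS, chunksize=1))             # tasks handed out as workers free up

if __name__ == "__main__":
    for run in (static_partition, dynamic_pool):
        t0 = time.time()
        run()
        print(f"{run.__name__}: {time.time() - t0:.2f} s")
```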

Page 47: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Challenge 2015 - 2025: Winning the Endgame of Moore’s Law

(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)

Three major issues:

• Scaling to 100,000 processors and multi-core processors

• Topology sensitive interconnection network

• Memory Wall

Page 48: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Summary

• Applications will face (at least) three challenges

– Scaling to 100,000s of processors
– Interconnect topology
– Memory access

• Three sets of tools (applications benchmarks, performance monitoring, quantitative architecture characterization) have been shown to provide critical insight into applications performance

Page 49: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Vanishing Electrons (2016)

[Chart: electrons per device (10^-1 to 10^4, log scale) versus year (1985 to 2020), annotated with the corresponding transistors per chip (4M, 16M, 64M, 256M, 1G, 4G, 16G).]

Source: Joel Birnbaum, HP, Lecture at APS Centennial, Atlanta, 1999

Page 50: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

ITRS Device Review 2016

Technology | Speed (min-max) | Dimension (min-max) | Energy per gate-op | Comparison
CMOS       | 30 ps - 1 s     | 8 nm - 5 µm         | 4 aJ               |
RSFQ       | 1 ps - 50 ps    | 300 nm - 1 µm       | 2 aJ               | Larger
Molecular  | 10 ns - 1 ms    | 1 nm - 5 nm         | 10 zJ              | Slower
Plastic    | 100 µs - 1 ms   | 100 µm - 1 mm       | 4 aJ               | Larger + Slower
Optical    | 100 as - 1 ps   | 200 nm - 2 µm       | 1 pJ               | Larger + Hotter
NEMS       | 100 ns - 1 ms   | 10 nm - 100 nm      | 1 zJ               | Slower + Larger
Biological | 100 fs - 100 s  | 6 µm - 50 µm        | 0.3 yJ             | Slower + Larger
Quantum    | 100 as - 1 fs   | 10 nm - 100 nm      | 1 zJ               | Larger

Data from ITRS ERD Section, quoted from Erik DeBenedictis, Sandia Lab.

Page 51: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

New Programming Models

• Science is currently limited by the difficulty of programming
– Codes get written somehow, but the barrier to algorithm experimentation is too high
– The multicore shift will make this worse: application scientists have other things to do than worry about the next hardware revolution
• Want an integrated programming model:
– Express many levels and kinds of parallelism for multiple machine generations
– Painful reprogramming should only be done once

Page 52: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Programming Models: What We’ll Get if DOE Does Nothing

• An exascale machine of 1K-core chips is not your thesis advisor’s cluster
– Memory per core on chip will not be as large as memory per CPU on-node today
– Clusters have led us down the road of memory-hungry OS, runtime, and libraries (MPI)
• Absurd but scalable programming models
– MPI with only “expected” messages: coordination nightmare
– PGAS on shared memory without synchronization
• Need a new mechanism to:
– Tie synchronization to data transfer (without CPU help)
– Only pay for it when you need it
• Challenges
– Single programming model within and between chips
– Single programming model across markets

Page 53: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

How to Waste Machine $s

Build a machine that is so complicated we can’t understand its performance

[Heatmap: performance on a Sun Ultra 2i/333 MHz as a function of the number of rows per tile (r) and the number of columns per tile (c).]

Page 54: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Technical Challenges in Programming Models

• Open problems in language runtimes
– Virtualization: away from the SPMD model for load balance, fault tolerance, OS noise, etc.
– Resource management: thread scheduler
• What we do know how to do:
– Build systems with dynamic load balancing (Cilk) that do not respect locality
– Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component
– Put the programmer in control of resources: message buffers, dynamic load balancing

Page 55: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Research Approach for Exaflop Program

• Start with two Exaflop apps
– One easy: if anything scales, this will
– One hard: plenty of parallelism, but it’s irregular, adaptive, asynchronous
• Rethink algorithms
– Scalability at all levels (including algorithmic)
– Reducing bandwidth (compress data structures); reducing latency requirements
• Design programming model to express this parallelism
– Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization)
• Consider spectrum of hardware possibilities
– Analyze at various levels of detail (eliminating those that are clearly infeasible)
– Early prototypes (expect 90% failures) to validate