Page 1: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

CS 267 Applications of Parallel Computers

Supercomputing: The Past and Future

Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_s07

Page 2: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Outline

• Historical perspective (1985 to 2005) from Horst Simon

• Recent past: what’s new in 2007

• Major challenges and opportunities for the future

Slide source: Horst Simon

Page 3: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Signpost System 1985

Cray-2

• 244 MHz (4.1 nsec)

• 4 processors

• 1.95 Gflop/s peak

• 2 GB memory (256 MW)

• 1.2 Gflop/s LINPACK R_max

• 1.6 m2 floor space

• 0.2 MW power

Slide source: Horst Simon

Page 4: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Signpost System in 2005

IBM BG/L @ LLNL

• 700 MHz (x 2.86)

• 65,536 nodes (x 16,384)

• 180 (360) Tflop/s peak (x 92,307)

• 32 TB memory (x 16,000)

• 135 Tflop/s LINPACK (x 110,000)

• 250 m2 floor space (x 156)

• 1.8 MW power (x 9)

Slide source: Horst Simon

Page 5: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

1985 versus 2005

1985:
• custom-built vector mainframe platforms
• 30 Mflop/s sustained is good performance
• vector Fortran
• proprietary operating system
• remote batch only
• no visualization
• no tools, hand tuning only
• dumb terminals
• remote access via 9600 baud
• single software developer develops and codes everything
• serial, vectorized algorithms

2005:
• commodity massively parallel platforms
• 1 Tflop/s sustained is good performance
• Fortran/C with MPI, object orientation
• Unix, Linux
• interactive use
• visualization
• parallel debugger, development tools
• high-performance desktop
• remote access via 10 Gb/s; grid tools
• large group-developed software, code sharing and reuse
• parallel algorithms

Slide source: Horst Simon

Page 6: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

The Top 10 Major Accomplishments in Supercomputing in the Past 20 Years

• Horst Simon’s list from 2005
• Selected by “impact” and “change in perspective”

10) The TOP500 list
9) NAS Parallel Benchmarks
8) The “grid”
7) Hierarchical algorithms: multigrid and fast multipole
6) HPCC initiative and Grand Challenge applications
5) Attack of the killer micros

Slide source: Horst Simon

Page 7: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#10) TOP500

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax = b, dense problem); TPP performance characterized by rate and problem size (a small illustration follows this list)
- Updated twice a year: ISC'xy in Germany in June xy, SC'xy in the USA in November xy
- All data available from www.top500.org
- Good and bad effects of this list/competition
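As a small illustration of the yardstick, the sketch below times a dense solve of Ax = b with NumPy/LAPACK on a single node and converts it to a flop rate using the operation count HPL credits. The problem size and single-node setting are assumptions for illustration only; this is not the distributed HPL code used for actual TOP500 submissions.

```python
# Hedged sketch of the TOP500 yardstick: time a dense solve of Ax = b and
# convert it to a flop rate, the way Rmax is derived from LINPACK/HPL runs.
# Single-node NumPy/LAPACK illustration only; n is an arbitrary choice.
import time
import numpy as np

n = 4000                                   # assumed problem size (real Nmax is far larger)
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.time()
x = np.linalg.solve(A, b)                  # LU factorization plus triangular solves
elapsed = time.time() - t0

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2    # operation count HPL credits for dense Ax = b
print(f"n = {n}: {elapsed:.2f} s, {flops / elapsed / 1e9:.1f} Gflop/s")
print("scaled residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```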

Page 8: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

TOP500 list - Data shown

• Manufacturer: manufacturer or vendor
• Computer: type indicated by manufacturer or vendor
• Installation Site: customer
• Location: location and country
• Year: year of installation/last major update
• Customer Segment: academic, research, industry, vendor, classified
• # Processors: number of processors
• Rmax: maximal LINPACK performance achieved
• Rpeak: theoretical peak performance
• Nmax: problem size for achieving Rmax
• N1/2: problem size for achieving half of Rmax
• Nworld: position within the TOP500 ranking

Page 9: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future
Page 10: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future
Page 11: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

[Chart: TOP500 performance development from 1993 to 2014 on a log scale from 10 Mflop/s to 1 Eflop/s, showing the Sum, #1, and #500 lines. Annotations: “Petaflop with ~1M cores by 2008”, “1 PFlop system in 2008”, a 6-8 year lag between #1 and #500, and “Common by 2015?”. Data from top500.org.]

Slide source: Horst Simon

Page 12: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Petaflop with ~1M Cores in your PC by 2025?

Page 13: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#4 Beowulf Clusters

• Thomas Sterling et al. established vision of low cost, high end computing

• Demonstrated effectiveness of PC clusters for some (not all) classes of applications

• Provided software and conveyed findings to broad community (great PR) through tutorials and book (1999)

• Made parallel computing accessible to large community worldwide; broadened and democratized HPC; increased demand for HPC

• However, effectively stopped HPC architecture innovation for at least a decade; narrower market for commodity systems

Slide source: Horst Simon

Page 14: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#3 Scientific Visualization

• NSF Report, “Visualization in Scientific Computing,” established the field in 1987 (edited by B.H. McCormick, T.A. DeFanti, and M.D. Brown)

• Change in point of view: transformed computer graphics from a technology driven subfield of computer science into a medium for communication

• Added artistic element

• The role of visualization is “to reveal concepts that are otherwise invisible” (Krzysztof Lenk)

Slide source: Horst Simon

Page 15: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Before Scientific Visualization (1985)

Computer graphics typical of the time:
– 2 dimensional
– line drawings
– black and white
– “vectors” used to display vector field

Images from a CFD report at Boeing (1985).

Slide source: Horst Simon

Page 16: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

After scientific visualization (1992)

The impact of scientific visualization seven years later:
– 3 dimensional
– use of “ribbons” and “tracers” to visualize flow field
– color used to characterize updraft and downdraft

Images from “Supercomputing and the Transformation of Science” by Kauffman and Smarr, 1992; visualization by NCSA; simulation by Bob Wilhelmson, NCSA

Slide source: Horst Simon

Page 17: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#2 Message Passing Interface (MPI)

MPI

Slide source: Horst Simon

Page 18: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Parallel Programming 1988

• At the 1988 “Salishan” conference there was a bake-off of parallel programming languages trying to solve five scientific problems

• The “Salishan Problems” (ed. John Feo, published 1992) investigated four programming languages:
– Sisal, Haskell, Unity, LGDF

• Significant research activity at the time

• The early work on parallel languages is all but forgotten today

Slide source: Horst Simon

Page 19: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Parallel Programming 1990

• The availability of real parallel machines moved the discussion from the domain of theoretical CS to the pragmatic application area

• In this presentation (ca. 1990) Jack Dongarra lists six approaches to parallel processing

• Note that message passing libraries are a sub-item on 2)

Slide source: Horst Simon

Page 20: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Parallel Programming 1994

Page 21: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

#1 Scaled Speed-Up

Page 22: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

The argument against massive parallelism (ca. 1988)

Slide source: Horst Simon

Amdahl’s Law: speed = base_speed / ((1 - f) + f/nprocs)

              infinitely parallel    Cray YMP
base_speed            0.1               2.4
nprocs              infinity             8

Then speed(infinitely parallel) > speed(Cray) only if f > .994
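A quick numerical check of this crossover, as a minimal sketch using the slide's numbers (base speeds 0.1 and 2.4, 8 Cray processors, and a very large processor count standing in for "infinitely parallel"):

```python
# Numerical check of the crossover above, using Amdahl's Law
#   speed = base_speed / ((1 - f) + f / nprocs)
# with the slide's numbers: the "infinitely parallel" machine has base_speed 0.1
# (a huge processor count stands in for infinity); the Cray YMP has base_speed 2.4
# and nprocs = 8.
def amdahl(base_speed, f, nprocs):
    return base_speed / ((1.0 - f) + f / nprocs)

f = 0.0
while f < 1.0 and amdahl(0.1, f, 1e12) <= amdahl(2.4, f, 8):
    f += 0.0001
print(f"massively parallel machine wins only for f > {f:.4f}")   # ~0.9946
```

The scan lands at f of roughly 0.9946, matching the slide's threshold of .994.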

Page 23: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Challenges for the Future

• Petascale computing

• Multicore and the memory wall

• Performance understanding at scale

• Topology-sensitive interconnects

• Programming models for the masses

Page 24: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Application Status in 2005

• A few Teraflop/s sustained performance

• Scaled to 512 - 1024 processors

Parallel job size at NERSC

Page 25: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

How to Waste Machine $

2) Use a programming model in which you can’t utilize bandwidth or “low” latency

[Chart: 8-byte roundtrip latency (usec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

[Chart: flood bandwidth for 4 KB messages, as a percentage of hardware peak, for MPI vs. GASNet on the same six networks.]
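For concreteness, here is a minimal sketch of the kind of measurement behind the latency chart above: an 8-byte MPI ping-pong between two ranks. Writing it with mpi4py is an assumption for illustration; the original numbers came from compiled test codes, and the GASNet put+sync side uses a different API that is not shown here.

```python
# Hedged sketch of the MPI side of the latency measurement above: an 8-byte
# ping-pong between ranks 0 and 1, written with mpi4py.
# Run with, e.g.:  mpirun -n 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype=np.float64)        # one 8-byte payload
reps = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)      # ping
        comm.Recv(buf, source=1, tag=0)    # pong
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
t1 = MPI.Wtime()

if rank == 0:
    print(f"8-byte roundtrip latency: {(t1 - t0) / reps * 1e6:.2f} usec")
```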

Page 26: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Integrated Performance Monitoring (IPM)

• brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application

• maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage

• open source, relies on portable software technologies and is scalable to thousands of tasks

• developed by David Skinner at NERSC (see http://www.nersc.gov/projects/ipm/ )
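To make the "fixed memory footprint via hashing" point concrete, here is a toy sketch of the general idea: per-event statistics are aggregated into a fixed-size, linearly probed table keyed by (call, byte count, partner rank). This illustrates the approach only; it is not IPM's actual data structure, and all names below are made up for the example.

```python
# Toy sketch of the fixed-footprint idea described above: event statistics are
# aggregated into a table of fixed size keyed by a hash of (call, byte count,
# partner rank), so profiling memory does not grow with the number of events.
# This illustrates the general approach only; it is not IPM's implementation.
TABLE_SIZE = 4096
table = [None] * TABLE_SIZE                   # slots hold (key, count, total_time)

def record(call, nbytes, partner, elapsed):
    key = (call, nbytes, partner)
    start = hash(key) % TABLE_SIZE
    for probe in range(TABLE_SIZE):           # linear probing within the fixed table
        i = (start + probe) % TABLE_SIZE
        if table[i] is None or table[i][0] == key:
            count, total = (0, 0.0) if table[i] is None else table[i][1:]
            table[i] = (key, count + 1, total + elapsed)
            return
    # Table full: a real tool would coarsen keys rather than allocate more memory.

record("MPI_Send", 8, 1, 4.5e-6)
record("MPI_Send", 8, 1, 4.7e-6)
record("MPI_Recv", 8, 1, 5.1e-6)
print([entry for entry in table if entry is not None])
```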

Page 27: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Scaling Portability: Profoundly Interesting

A high-level description of the performance of the cosmology code MADCAP on four well-known architectures.

Source: David Skinner, NERSC

Page 28: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

16 Way for 4 seconds

(About 20 timestamps per second per task) × (1…4 contextual variables)

Page 29: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

64 way for 12 seconds

Page 30: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Applications on Petascale Systems will need to deal with

(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)

Three major issues:

• Scaling to 100,000 processors and multi-core processors

• Topology sensitive interconnection network

• Memory Wall

Page 31: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Even today’s machines are interconnect topology sensitive

Four (16 processor) IBM Power 3 nodes with Colony switch

Page 32: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Application Topology

1024 way MILC

1024 way MADCAP

336 way FVCAM

If the interconnect is topology sensitive,

mapping will become an issue (again)

“Characterizing Ultra-Scale Applications Communications Requirements”, by John Shalf et al., submitted to SC05
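A hedged sketch of why mapping matters on a topology-sensitive interconnect: place a 2D nearest-neighbor communication pattern (stencil-like, as in the codes above) onto a 3D torus and compare average hop counts under a naive rank-order mapping and a blocked mapping. The torus and grid sizes and both mappings are illustrative assumptions, not results from the cited study.

```python
# Hedged sketch of task placement on a topology-sensitive network: a 2D
# nearest-neighbor pattern is placed onto a 3D torus and the average hop
# count is compared for two hypothetical mappings.
import itertools

TORUS = (8, 8, 16)           # assumed torus dimensions (1024 nodes)
GRID = (32, 32)              # assumed 2D application process grid (1024 tasks)

def torus_hops(a, b, dims=TORUS):
    """Minimal hop count between two torus coordinates, with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def coords(rank, mapping):
    """Map an application rank to torus coordinates under a given mapping."""
    X, Y, _ = TORUS
    if mapping == "row_major":                 # fill the torus in plain rank order
        return (rank % X, (rank // X) % Y, rank // (X * Y))
    bx, by = rank // GRID[1], rank % GRID[1]   # "blocked": keep 2D neighbors together
    return (bx % X, by % Y, (bx // X) * (GRID[1] // Y) + by // Y)

def avg_neighbor_hops(mapping):
    gx, gy = GRID
    total = pairs = 0
    for i, j in itertools.product(range(gx), range(gy)):
        for di, dj in ((1, 0), (0, 1)):                 # right and down neighbors
            ni, nj = (i + di) % gx, (j + dj) % gy       # periodic application grid
            total += torus_hops(coords(i * gy + j, mapping),
                                coords(ni * gy + nj, mapping))
            pairs += 1
    return total / pairs

for m in ("row_major", "blocked"):
    print(f"{m}: average hops per neighbor message = {avg_neighbor_hops(m):.2f}")
```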

Page 33: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Interconnect Topology BG/L

Page 34: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Applications on Petascale Systems will need to deal with

(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)

Three major issues:

• Scaling to 100,000 processors and multi-core processors

• Topology sensitive interconnection network

• Memory Wall

Page 35: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

The Memory Wall

Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004

Page 36: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Characterizing Memory Access

[Diagram: HPCchallenge benchmarks (HPL, PTRANS, STREAM, FFT, RandomAccess) and mission partner applications plotted by the spatial and temporal locality of their memory access patterns; HPL sits at the high/high corner and RandomAccess at the low/low corner.]

Source: David Koester, MITRE
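The corners of this locality space can be probed with very small kernels. Below is a hedged sketch in NumPy of a STREAM-style triad (streaming access, high spatial locality) next to a RandomAccess-style scattered update (little locality); array sizes are arbitrary assumptions and these are not the HPC Challenge reference codes.

```python
# Hedged sketch of tiny kernels that probe the corners of the locality space.
import time
import numpy as np

N = 1 << 24                                      # ~16M doubles per array (~128 MB each)
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty(N)

t0 = time.time()
a[:] = b + 3.0 * c                               # triad: read b and c, write a
triad_s = time.time() - t0
print(f"triad: ~{3 * 8 * N / triad_s / 1e9:.1f} GB/s (counting 2 reads + 1 write)")

table = np.zeros(1 << 22, dtype=np.int64)        # 32 MB update table
idx = np.random.randint(0, table.size, size=1 << 20)
t0 = time.time()
np.add.at(table, idx, 1)                         # scattered read-modify-write updates
print(f"random updates: ~{idx.size / (time.time() - t0) / 1e6:.1f} MUP/s")
```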

Page 37: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-MAP characterizes architectures through a synthetic benchmark

[Diagram: the Apex-MAP parameter space. Temporal locality (1/re-use: 0 = high, 1 = low) vs. spatial locality (1/L: 0 = high, 1 = low), with regions labeled “HPL”, “Global Streams”, “Short indirect”, and “Small working set”.]
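A minimal sketch of such a synthetic probe, assuming a single data array, a contiguous block length L for spatial locality, and a power-law exponent alpha that concentrates re-use for temporal locality. This illustrates the idea only and is not the actual Apex-MAP benchmark; the parameterization below is my own.

```python
# Minimal sketch of a synthetic access-pattern probe in the Apex-MAP spirit.
import time
import numpy as np

M = 1 << 22                                      # data array size in words (assumed)
data = np.ones(M)

def probe(L, alpha, n_accesses=1 << 18, seed=0):
    rng = np.random.default_rng(seed)
    n_blocks = n_accesses // L
    # Larger alpha pushes start addresses toward 0, so a small working set is
    # re-used heavily; alpha = 1 gives a near-uniform (low re-use) pattern.
    starts = ((rng.random(n_blocks) ** alpha) * (M - L)).astype(np.int64)
    t0 = time.time()
    acc = 0.0
    for st in starts:
        acc += data[st:st + L].sum()             # touch one contiguous block of length L
    return (time.time() - t0) / (n_blocks * L) * 1e9   # ns per touched element

for L in (1, 16, 256, 4096):
    for alpha in (1.0, 8.0):
        print(f"L = {L:5d}, alpha = {alpha:3.1f}: {probe(L, alpha):7.2f} ns/element")
```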

Page 38: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.001

0.0100.100

1.0000.1

1.0

10.0

100.0

1000.0

Cycles

L

a

Seaborg Sequential2.00-3.00

1.00-2.00

0.00-1.00

-1.00-0.00

Page 39: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.001

0.0100.100

1.0000.10

1.00

10.00

100.00

1000.00

Cycles

L

a

Power4 Sequential2.00-3.00

1.00-2.00

0.00-1.00

-1.00-0.00

Page 40: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.00

0.010.10

1.000.10

1.00

10.00

100.00

Cycles

L

a

X1 Sequential 1.00-2.00

0.00-1.00

-1.00-0.00

Page 41: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Apex-Map Sequential

1 4

16 64

256

1024

4096

1638

4

6553

6

0.00

0.010.10

1.000.10

1.00

10.00

100.00

1000.00

Cycles

L

a

SX6 Sequential2.00-3.00

1.00-2.00

0.00-1.00

-1.00-0.00

Page 42: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Multicore: Is MPI Really that Bad?

[Chart: FLOP rates per core (single vs. dual core); sustained GFLOP/s per core for several application codes.]

Moving from single to dual core nearly doubles performance. The worst case is MILC, which is 40% below this doubling.

Experiments by the NERSC SDSA group (Shalf, Carter, Wasserman, et al)

Single Core vs. Dual-core AMD Opteron.

Data collected on jaguar (ORNL XT3 system)

Small pages used except for MADCAP

Page 43: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

How to Waste Machine $

1) Build a memory system in which you can’t utilize bandwidth that is there

[Chart: percent of peak arithmetic performance (0% to 40%) on Niagara, Clovertown, Opteron, Cell (PS3), and Cell (Blade), for one core, a full socket, and the full system.]

[Chart: percent of peak memory bandwidth utilized (0% to 100%) on the same platforms, for one core, a full socket, and the full system.]

Page 44: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Challenge 2010 - 2018: Developing a New Ecosystem for HPC

From the NRC Report on “The Future of Supercomputing”:

• Platforms, software, institutions, applications, and people who solve supercomputing applications can be thought of collectively as an ecosystem

• Research investment in HPC should be informed by the ecosystem point of view - progress must come on a broad front of interrelated technologies, rather than in the form of individual breakthroughs.

Pond ecosystem image from http://www.tpwd.state.tx.us/expltx/eft/txwild/pond.htm

Page 45: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Exaflop Programming?

• Start with two Exaflop apps
– One easy: if anything scales, this will
– One hard: plenty of parallelism, but it’s irregular, adaptive, asynchronous
• Rethink algorithms
– Scalability at all levels (including algorithmic)
– Reducing bandwidth (compress data structures); reducing latency requirements
• Design programming model to express this parallelism
– Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization)
• Consider spectrum of hardware possibilities
– Analyze at various levels of detail (eliminating those that are clearly infeasible)
– Early prototypes (expect 90% failures) to validate

Page 46: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Technical Challenges in Programming Models

• Open problems in language runtimes
– Virtualization: away from the SPMD model for load balance, fault tolerance, OS noise, etc.
– Resource management: thread scheduler
• What we do know how to do:
– Build systems with dynamic load balancing (Cilk) that do not respect locality
– Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component (a toy contrast of these two approaches is sketched below)
– Put the programmer in control of resources: message buffers, dynamic load balancing
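The trade-off in those last two items can be seen even in a toy setting: statically partitioned tasks finish at the speed of the unluckiest worker, while a dynamic pool keeps idle workers busy. The sketch below uses Python process pools with made-up task costs; it does not model Cilk, MPI, or UPC, and all sizes are illustrative assumptions.

```python
# Toy contrast: static partitioning vs. a dynamic task pool on uneven tasks.
import time
from concurrent.futures import ProcessPoolExecutor

def work(n):                       # a task whose cost varies widely (irregular parallelism)
    s = 0
    for i in range(n):
        s += i * i
    return s

def work_chunk(chunk):             # run a statically assigned chunk to completion
    return [work(n) for n in chunk]

TASKS = [2_000_000 if i % 8 == 0 else 20_000 for i in range(64)]   # a few heavy tasks

def static_partition(nworkers=4):
    chunks = [TASKS[i::nworkers] for i in range(nworkers)]  # all heavy tasks land on one worker
    with ProcessPoolExecutor(nworkers) as pool:
        list(pool.map(work_chunk, chunks))

def dynamic_pool(nworkers=4):
    with ProcessPoolExecutor(nworkers) as pool:
        list(pool.map(work, TASKS, chunksize=1))             # tasks handed out as workers free up

if __name__ == "__main__":
    for run in (static_partition, dynamic_pool):
        t0 = time.time()
        run()
        print(f"{run.__name__}: {time.time() - t0:.2f} s")
```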

Page 47: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Challenge 2015 - 2025: Winning the Endgame of Moore’s Law

(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)

Three major issues:

• Scaling to 100,000 processors and multi-core processors

• Topology sensitive interconnection network

• Memory Wall

Page 48: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Summary

• Applications will face (at least) three challenges

– Scaling to 100,000s of processors
– Interconnect topology
– Memory access

• Three sets of tools (applications benchmarks, performance monitoring, quantitative architecture characterization) have been shown to provide critical insight into applications performance

Page 49: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Vanishing Electrons (2016)

[Chart: electrons per device (10^-1 to 10^4, log scale) versus year (1985 to 2020), annotated with the corresponding transistors per chip (4M, 16M, 64M, 256M, 1G, 4G, 16G).]

Source: Joel Birnbaum, HP, Lecture at APS Centennial, Atlanta, 1999

Page 50: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

ITRS Device Review 2016

Technology | Speed (min-max) | Dimension (min-max) | Energy per gate-op | Comparison
CMOS       | 30 ps - 1 s     | 8 nm - 5 µm         | 4 aJ               |
RSFQ       | 1 ps - 50 ps    | 300 nm - 1 µm       | 2 aJ               | Larger
Molecular  | 10 ns - 1 ms    | 1 nm - 5 nm         | 10 zJ              | Slower
Plastic    | 100 µs - 1 ms   | 100 µm - 1 mm       | 4 aJ               | Larger + Slower
Optical    | 100 as - 1 ps   | 200 nm - 2 µm       | 1 pJ               | Larger + Hotter
NEMS       | 100 ns - 1 ms   | 10 nm - 100 nm      | 1 zJ               | Slower + Larger
Biological | 100 fs - 100 s  | 6 µm - 50 µm        | 0.3 yJ             | Slower + Larger
Quantum    | 100 as - 1 fs   | 10 nm - 100 nm      | 1 zJ               | Larger

Data from ITRS ERD Section, quoted from Erik DeBenedictis, Sandia Lab.

Page 51: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

New Programming Models

• Science is currently limited by the difficulty of programming
– Codes get written somehow, but the barrier to algorithm experimentation is too high
– The multicore shift will make this worse: application scientists have other things to do than worry about the next hardware revolution
• Want an integrated programming model:
– Express many levels and kinds of parallelism for multiple machine generations
– Painful reprogramming should only be done once

Page 52: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Programming Models: What We’ll Get if DOE Does Nothing

• An exascale machine of 1K-core chips is not your thesis advisor’s cluster
– Memory per core on chip will not be as large as memory per CPU on-node today
– Clusters have led us down the road of memory-hungry OS, runtime, and libraries (MPI)
• Absurd but scalable programming models
– MPI with only “expected” messages: coordination nightmare
– PGAS on shared memory without synchronization
• Need a new mechanism to:
– Tie synchronization to data transfer (without CPU help)
– Only pay for it when you need it
• Challenges
– Single programming model within and between chips
– Single programming model across markets

Page 53: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

How to Waste Machine $s

Build a machine that is so complicated we can’t understand its performance

[Heatmap: performance on a Sun Ultra 2i/333 MHz as a function of the number of rows per tile (r) and the number of columns per tile (c).]

Page 54: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Technical Challenges in Programming Models

• Open problems in language runtimes
– Virtualization: away from the SPMD model for load balance, fault tolerance, OS noise, etc.
– Resource management: thread scheduler
• What we do know how to do:
– Build systems with dynamic load balancing (Cilk) that do not respect locality
– Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component
– Put the programmer in control of resources: message buffers, dynamic load balancing

Page 55: CS 267 Applications of Parallel Computers Supercomputing: The Past and Future

Research Approach for Exaflop Program

• Start with two Exaflop apps
– One easy: if anything scales, this will
– One hard: plenty of parallelism, but it’s irregular, adaptive, asynchronous
• Rethink algorithms
– Scalability at all levels (including algorithmic)
– Reducing bandwidth (compress data structures); reducing latency requirements
• Design programming model to express this parallelism
– Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization)
• Consider spectrum of hardware possibilities
– Analyze at various levels of detail (eliminating those that are clearly infeasible)
– Early prototypes (expect 90% failures) to validate