CS 267 Applications of Parallel Computers Supercomputing: The Past and Future
CS 267 Applications of Parallel Computers
Supercomputing: The Past and Future
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_s07
Outline
• Historical perspective (1985 to 2005) from Horst Simon
• Recent past: what’s new in 2007
• Major challenges and opportunities for the future
Slide source: Horst Simon
Signpost System 1985
Cray-2
• 244 MHz (4.1 nsec)
• 4 processors
• 1.95 Gflop/s peak
• 2 GB memory (256 MW)
• 1.2 Gflop/s LINPACK R_max
• 1.6 m2 floor space
• 0.2 MW power
Slide source: Horst Simon
Signpost System in 2005
IBM BG/L @ LLNL
• 700 MHz (x 2.86)
• 65,536 nodes (x 16,384)
• 180 (360) Tflop/s peak (x 92,307)
• 32 TB memory (x 16,000)
• 135 Tflop/s LINPACK (x 110,000)
• 250 m2 floor space (x 156)
• 1.8 MW power (x 9)
Slide source: Horst Simon
1985 versus 2005

1985:
• custom-built vector mainframe platforms
• 30 Mflop/s sustained is good performance
• vector Fortran
• proprietary operating system
• remote batch only
• no visualization
• no tools, hand tuning only
• dumb terminals
• remote access via 9600 baud
• a single software developer designs and codes everything
• serial, vectorized algorithms

2005:
• commodity massively parallel platforms
• 1 Tflop/s sustained is good performance
• Fortran/C with MPI, object orientation
• Unix, Linux
• interactive use
• visualization
• parallel debuggers, development tools
• high-performance desktops
• remote access via 10 Gb/s; grid tools
• large groups develop software, with code sharing and reuse
• parallel algorithms
Slide source: Horst Simon
The Top 10 Major Accomplishments in Supercomputing in the Past 20 Years
• Horst Simon’s list from 2005
• Selected by “impact” and “change in perspective”

10) The TOP500 list
9) NAS Parallel Benchmarks
8) The “grid”
7) Hierarchical algorithms: multigrid and fast multipole
6) HPCC initiative and Grand Challenge applications
5) Attack of the killer micros
Slide source: Horst Simon
#10) TOP500
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax = b, dense problem)
- Updated twice a year: ISC‘xy in Germany (June) and SC‘xy in the USA (November)
- All data available from www.top500.org
- Good and bad effects of this list/competition
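The LINPACK yardstick behind Rmax is a dense solve of Ax = b. A minimal sketch of how such a rate is measured (the (2/3)n^3 + 2n^2 flop count is the standard LINPACK convention; the matrix size here is illustrative, while real TOP500 runs use an Nmax chosen to fill memory):

```python
import time
import numpy as np

def linpack_rate(n, seed=0):
    """Solve a random dense Ax = b and report the achieved flop rate."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                  # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2    # LINPACK flop-count convention
    residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
    return flops / elapsed / 1e9, residual     # Gflop/s and relative residual

gflops, residual = linpack_rate(500)
```

A real benchmark run also validates the answer via a scaled residual check; the relative residual above plays that role.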
TOP500 list - Data shown
• Manufacturer: manufacturer or vendor
• Computer: type indicated by manufacturer or vendor
• Installation Site: customer
• Location: location and country
• Year: year of installation / last major update
• Customer Segment: academic, research, industry, vendor, classified
• # Processors: number of processors
• Rmax: maximal LINPACK performance achieved
• Rpeak: theoretical peak performance
• Nmax: problem size for achieving Rmax
• N1/2: problem size for achieving half of Rmax
• Nworld: position within the TOP500 ranking
[Chart: TOP500 performance over time (1993 onward, log scale from 10 Mflop/s to 1 Eflop/s), with SUM, #1, and #500 trend lines. Annotations: a 1 Pflop/s system with ~1M cores by 2008; the #500 line trails the #1 line by 6-8 years, so petaflop systems could be common by 2015. Data from top500.org.]
Slide source: Horst Simon
Petaflop with ~1M Cores in your PC by 2025?
#4 Beowulf Clusters
• Thomas Sterling et al. established a vision of low-cost, high-end computing
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided software and conveyed findings to broad community (great PR) through tutorials and book (1999)
• Made parallel computing accessible to large community worldwide; broadened and democratized HPC; increased demand for HPC
• However, effectively stopped HPC architecture innovation for at least a decade; narrower market for commodity systems
Slide source: Horst Simon
#3 Scientific Visualization
• NSF Report, “Visualization in Scientific Computing” established the field in 1987 (edited by B.H. McCormick,
T.A. DeFanti, and M.D. Brown)
• Change in point of view: transformed computer graphics from a technology driven subfield of computer science into a medium for communication
• Added artistic element
• The role of visualization is “to reveal concepts that are otherwise invisible” (Krzysztof Lenk)
Slide source: Horst Simon
Before Scientific Visualization (1985)
Computer graphics typical of the time:
– 2-dimensional
– line drawings
– black and white
– “vectors” used to display vector fields
Images from a CFD report at Boeing (1985).
Slide source: Horst Simon
After scientific visualization (1992)
The impact of scientific visualization seven years later:
– 3-dimensional
– use of “ribbons” and “tracers” to visualize the flow field
– color used to characterize updraft and downdraft

Images from “Supercomputing and the Transformation of Science” by Kauffman and Smarr, 1992; visualization by NCSA; simulation by Bob Wilhelmson, NCSA
Slide source: Horst Simon
#2 Message Passing Interface (MPI)
MPI
Slide source: Horst Simon
Parallel Programming 1988
• At the 1988 “Salishan” conference there was a bake-off of parallel programming languages trying to solve five scientific problems
• The “Salishan Problems” (ed. John Feo, published 1992) investigated four programming languages:
– Sisal, Haskell, Unity, LGDF
• Significant research activity at the time
• The early work on parallel languages is all but forgotten today
Slide source: Horst Simon
Parallel Programming 1990
• The availability of real parallel machines moved the discussion from the domain of theoretical CS to the pragmatic application domain
• In this presentation (ca. 1990) Jack Dongarra lists six approaches to parallel processing
• Note that message passing libraries are a sub-item on 2)
Slide source: Horst Simon
Parallel Programming 1994
#1 Scaled Speed-Up
The argument against massive parallelism (ca. 1988)
Slide source: Horst Simon
Amdahl’s Law: speed = base_speed / ((1 - f) + f / nprocs)

                Infinitely parallel    Cray Y-MP
base_speed      0.1                    2.4
nprocs          infinity               8

Then speed(infinitely parallel) > speed(Cray) only if f > .994
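The slide’s arithmetic can be checked with a few lines of code (the base speeds 0.1 and 2.4 and nprocs = 8 are the slide’s numbers; the scan step is illustrative):

```python
def amdahl_speed(base_speed, f, nprocs):
    """Amdahl's Law: speed = base_speed / ((1 - f) + f / nprocs)."""
    return base_speed / ((1.0 - f) + f / nprocs)

def breakeven_fraction(steps=1_000_000):
    """Smallest parallel fraction f at which the infinitely parallel
    machine (base speed 0.1) beats the 8-processor Cray (base speed 2.4)."""
    for i in range(1, steps):
        f = i / steps
        infinitely_parallel = 0.1 / (1.0 - f)   # nprocs -> infinity
        cray = amdahl_speed(2.4, f, 8)
        if infinitely_parallel > cray:
            return f
    return 1.0

f_star = breakeven_fraction()   # about 0.9946, matching the slide's f > .994
```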
Challenges for the Future
• Petascale computing
• Multicore and the memory wall
• Performance understanding at scale
• Topology-sensitive interconnects
• Programming models for the masses
Application Status in 2005
• A few Teraflop/s sustained performance
• Scaled to 512 - 1024 processors
Parallel job size at NERSC
How to Waste Machine $
8-byte Roundtrip Latency

[Bar chart: 8-byte roundtrip latency (usec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; measured values range from 4.5 to 24.2 usec, with GASNet put+sync generally lower than MPI ping-pong.]
2) Use a programming model in which you can’t utilize bandwidth or “low” latency
Flood Bandwidth for 4KB messages

[Bar chart: flood bandwidth for 4 KB messages, shown as percent of hardware peak, for MPI vs. GASNet on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; labeled rates range from 152 to 763 MB/s, with GASNet generally sustaining a higher fraction of peak.]
Integrated Performance Monitoring (IPM)
• brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application
• maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage
• open source, relies on portable software technologies and is scalable to thousands of tasks
• developed by David Skinner at NERSC (see http://www.nersc.gov/projects/ipm/ )
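The fixed-footprint hashing idea can be illustrated with a small sketch (the event key fields and table size here are hypothetical, not IPM’s actual layout):

```python
TABLE_SIZE = 4096   # fixed size: memory footprint does not grow with event count

class HashedProfile:
    """Aggregate per-event-signature counters in a bounded hash table."""
    def __init__(self):
        self.table = [None] * TABLE_SIZE

    def record(self, call_name, msg_bytes, seconds):
        key = (call_name, msg_bytes)
        slot = hash(key) % TABLE_SIZE
        for probe in range(TABLE_SIZE):            # linear probing
            idx = (slot + probe) % TABLE_SIZE
            entry = self.table[idx]
            if entry is None:
                self.table[idx] = {"key": key, "count": 1, "time": seconds}
                return
            if entry["key"] == key:
                entry["count"] += 1
                entry["time"] += seconds
                return
        # table full: drop the event rather than grow memory

    def lookup(self, call_name, msg_bytes):
        key = (call_name, msg_bytes)
        slot = hash(key) % TABLE_SIZE
        for probe in range(TABLE_SIZE):
            entry = self.table[(slot + probe) % TABLE_SIZE]
            if entry is not None and entry["key"] == key:
                return entry
        return None

prof = HashedProfile()
prof.record("MPI_Send", 4096, 1.2e-5)
prof.record("MPI_Send", 4096, 1.1e-5)
entry = prof.lookup("MPI_Send", 4096)
```

In the real tool the key encodes more context than this, but the bounded-table principle, hash each event signature into a fixed set of aggregate counters, is the same.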
Scaling Portability: Profoundly Interesting

A high-level description of the performance of the cosmology code MADCAP on four well-known architectures.
Source: David Skinner, NERSC
[Trace plots: a 16-way run over 4 seconds and a 64-way run over 12 seconds, each with about 20 timestamps per second per task and 1-4 contextual variables.]
Applications on Petascale Systems will need to deal with
(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)
Three major issues:
• Scaling to 100,000 processors and multi-core processors
• Topology sensitive interconnection network
• Memory Wall
Even today’s machines are interconnect topology sensitive
Four (16 processor) IBM Power 3 nodes with Colony switch
Application Topology
1024 way MILC
1024 way MADCAP
336 way FVCAM
If the interconnect is topology sensitive,
mapping will become an issue (again)
“Characterizing Ultra-Scale Applications Communications Requirements”, by John Shalf et al., submitted to SC05
Interconnect Topology BG/L
Applications on Petascale Systems will need to deal with
(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)
Three major issues:
• Scaling to 100,000 processors and multi-core processors
• Topology sensitive interconnection network
• Memory Wall
The Memory Wall
Source: “Getting up to speed: The Future of Supercomputing”, NRC, 2004
Characterizing Memory Access

HPCchallenge Benchmarks

[Diagram: HPCchallenge benchmarks placed on a spatial-vs-temporal-locality plane. HPL: high spatial, high temporal locality; PTRANS and STREAM: high spatial, low temporal; FFT: low spatial, high temporal; RandomAccess: low spatial, low temporal. Mission-partner applications fall between these corner points.]
Source: David Koester, MITRE
Apex-MAP characterizes architectures through a synthetic benchmark
[Diagram: Apex-MAP parameter plane. Temporal locality is parameterized by 1/re-use (0 = high, 1 = low) and spatial locality by 1/L (0 = high, 1 = low); the corners correspond to “HPL”, “Global Streams”, “Short indirect”, and “Small working set” access patterns.]
[Surface plots: Apex-MAP sequential results (cycles per access, log scale) as a function of the spatial locality parameter L (1 to 65536) and the temporal locality parameter (0.001 to 1.0), on four architectures: Seaborg, Power4, Cray X1, and NEC SX6.]
Multicore: Is MPI Really that Bad?
[Bar chart: sustained FLOP rate per core for several codes, comparing single-core and dual-core runs.]
Moving from single to dual core nearly doubles performance; the worst case is MILC, which falls 40% short of this doubling
Experiments by the NERSC SDSA group (Shalf, Carter, Wasserman, et al)
Single Core vs. Dual-core AMD Opteron.
Data collected on jaguar (ORNL XT3 system)
Small pages used except for MADCAP
How to Waste Machine $
1) Build a memory system in which you can’t utilize bandwidth that is there
[Bar charts: percent of peak arithmetic performance (0-40%) and percent of peak memory bandwidth utilized (0-100%) on Niagara, Clovertown, Opteron, Cell (PS3), and Cell (Blade), each measured for one core, a full socket, and the full system.]
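Achieved memory bandwidth of the kind shown above is typically measured with a STREAM-style kernel. A pure-Python triad sketch follows; it illustrates the byte-counting convention, though the rates it reports vastly understate what a tuned C version achieves on the same machine:

```python
import time

def stream_triad(n=1_000_000, q=3.0):
    """STREAM triad a[i] = b[i] + q * c[i]; returns (a, achieved MB/s)."""
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    a = [b[i] + q * c[i] for i in range(n)]
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * 8 * n   # read b, read c, write a; 8 bytes per value
    return a, bytes_moved / elapsed / 1e6

a, mb_per_s = stream_triad()
```

Dividing the measured rate by the platform’s peak memory bandwidth gives the utilization percentages plotted above.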
Challenge 2010 - 2018: Developing a New Ecosystem for HPC
From the NRC Report on “The Future of Supercomputing”:
• Platforms, software, institutions, applications, and people who solve supercomputing applications can be thought of collectively as an ecosystem
• Research investment in HPC should be informed by the ecosystem point of view - progress must come on a broad front of interrelated technologies, rather than in the form of individual breakthroughs.
Pond ecosystem image from http://www.tpwd.state.tx.us/expltx/eft/txwild/pond.htm
Exaflop Programming?
• Start with two Exaflop apps
– One easy: if anything scales, this will
– One hard: plenty of parallelism, but it’s irregular, adaptive, asynchronous
• Rethink algorithms
– Scalability at all levels (including algorithmic)
– Reducing bandwidth (compress data structures); reducing latency requirements
• Design programming model to express this parallelism
– Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization)
• Consider spectrum of hardware possibilities
– Analyze at various levels of detail (eliminating options when they are clearly infeasible)
– Early prototypes (expect 90% failures) to validate
Technical Challenges in Programming Models
• Open problems in language runtimes
– Virtualization: away from the SPMD model for load balance, fault tolerance, OS noise, etc.
– Resource management: thread scheduler
• What we do know how to do:
– Build systems with dynamic load balancing (Cilk) that do not respect locality
– Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component
– Put the programmer in control of resources: message buffers, dynamic load balancing
Challenge 2015 - 2025: Winning the Endgame of Moore’s Law
(Assume nominal Petaflop/s system with 100,000 commodity processors of 10 Gflop/s each)
Three major issues:
• Scaling to 100,000 processors and multi-core processors
• Topology sensitive interconnection network
• Memory Wall
Summary
• Applications will face (at least) three challenges
– Scaling to 100,000s of processors
– Interconnect topology
– Memory access
• Three sets of tools (applications benchmarks, performance monitoring, quantitative architecture characterization) have been shown to provide critical insight into applications performance
Vanishing Electrons (2016)

[Chart: electrons per device (log scale, 10^-1 to 10^4) vs. year, 1985 to 2020, annotated with transistors per chip at each generation (4M, 16M, 64M, 256M, 1G, 4G, 16G); the extrapolation reaches roughly one electron per device around 2016.]
Source: Joel Birnbaum, HP, Lecture at APS Centennial, Atlanta, 1999
ITRS Device Review 2016
Technology | Speed (min-max)  | Dimension (min-max) | Energy per gate-op | Comparison
CMOS       | 30 ps - 1 μs     | 8 nm - 5 μm         | 4 aJ               |
RSFQ       | 1 ps - 50 ps     | 300 nm - 1 μm       | 2 aJ               | Larger
Molecular  | 10 ns - 1 ms     | 1 nm - 5 nm         | 10 zJ              | Slower
Plastic    | 100 μs - 1 ms    | 100 μm - 1 mm       | 4 aJ               | Larger + Slower
Optical    | 100 as - 1 ps    | 200 nm - 2 μm       | 1 pJ               | Larger + Hotter
NEMS       | 100 ns - 1 ms    | 10 - 100 nm         | 1 zJ               | Slower + Larger
Biological | 100 fs - 100 μs  | 6 - 50 μm           | 0.3 yJ             | Slower + Larger
Quantum    | 100 as - 1 fs    | 10 - 100 nm         | 1 zJ               | Larger
Data from ITRS ERD Section, quoted from Erik DeBenedictis, Sandia Lab.
New Programming Models
• Science is currently limited by the difficulty of programming
– Codes get written somehow, but the barrier to algorithm experimentation is too high
– The multicore shift will make this worse: application scientists have other things to do than worry about the next hardware revolution
• Want an integrated programming model:
– Express many levels and kinds of parallelism for multiple machine generations
– Painful reprogramming should only be done once
Programming Models: What We’ll Get if DOE Does Nothing
• An exascale machine of 1K-core chips is not your thesis advisor’s cluster
– Memory per core on chip will not be as large as memory per CPU on-node today
– Clusters have led us down the road of memory-hungry OS, runtime, and libraries (MPI)
• Absurd but scalable programming models
– MPI with only “expected” messages: coordination nightmare
– PGAS on shared memory without synchronization
• Need a new mechanism to:
– Tie synch to data transfer (w/out CPU help)
– Only pay for it when you need it
• Challenges
– Single programming model within and between chips
– Single programming model across markets
How to Waste Machine $s

Build a machine that is so complicated we can’t understand its performance

[Heatmap: performance on a Sun Ultra 2i/333 MHz as a function of the number of rows (r) and the number of columns (c) per tile.]
Technical Challenges in Programming Models
• Open problems in language runtimes
– Virtualization: away from the SPMD model for load balance, fault tolerance, OS noise, etc.
– Resource management: thread scheduler
• What we do know how to do:
– Build systems with dynamic load balancing (Cilk) that do not respect locality
– Build systems with rigid locality control (MPI, UPC, etc.) that run at the speed of the slowest component
– Put the programmer in control of resources: message buffers, dynamic load balancing
Research Approach for Exaflop Program
• Start with two Exaflop apps
– One easy: if anything scales, this will
– One hard: plenty of parallelism, but it’s irregular, adaptive, asynchronous
• Rethink algorithms
– Scalability at all levels (including algorithmic)
– Reducing bandwidth (compress data structures); reducing latency requirements
• Design programming model to express this parallelism
– Develop technology to automate as much as possible (parallelism, HL constructs, search-based optimization)
• Consider spectrum of hardware possibilities
– Analyze at various levels of detail (eliminating options when they are clearly infeasible)
– Early prototypes (expect 90% failures) to validate