An Overview of High Performance Computing and Challenges for the Future

Jack Dongarra
INNOVATIVE COMPUTING LABORATORY
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores). The Linpack run took about 13,000 seconds (roughly 3.6 hours) at a matrix size of n = 1.8 million. The system draws 1.6 MW, roughly the power used by 1,600 homes, and provides about 43,000 operations per second for every person on the planet.
Lower Voltage, Increase Clock Rate & Transistor Density
• We have seen an increasing number of gates on a chip and increasing clock speeds.
• Heat is becoming an unmanageable problem; Intel processors now exceed 100 Watts.
• We will not see dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase.
• Packing ever more gates into a tight space while decreasing the processor's cycle time is what drives the heat problem.
[Figure: progression from a chip with a single core and its cache, to a few cores each with their own cache, to a chip tiled with many four-core clusters (C1 to C4) sharing caches.]
Power Cost of Frequency
• Power ∝ Voltage² × Frequency (V²F)
• Frequency ∝ Voltage
• Power ∝ Frequency³
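To see why this cubic relationship pushed the industry toward multicore, here is a tiny numeric sketch (illustrative ratios of my choosing, not from the slides): two cores at a reduced clock can deliver more aggregate peak than one core at full clock, for roughly the same power.

```python
# Illustrative sketch: relative power and peak under the P ∝ f^3 rule (constants dropped).
def relative_power(freq_ratio, num_cores=1):
    return num_cores * freq_ratio ** 3

def relative_peak(freq_ratio, num_cores=1):
    return num_cores * freq_ratio

# One core at full clock vs. two cores at 75% clock:
print(relative_power(1.00, 1), relative_peak(1.00, 1))   # power 1.00, peak 1.00
print(relative_power(0.75, 2), relative_peak(0.75, 2))   # power ~0.84, peak 1.50
```

The numbers only show the shape of the trade-off the slide describes: frequency is expensive in power, while extra cores are comparatively cheap.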
What's Next?
Possible directions for future chips:
• All large cores
• Mixed large and small cores
• All small cores / many small cores
• Many floating-point cores
• SRAM plus 3D stacked memory
Different classes of chips: home, games/graphics, business, scientific.
Novel Opportunities in Multicores
• We no longer have to contend with uniprocessors
• This is not your same old multiprocessor problem
• How does going from multiprocessors to multicores impact programs? What changed, and where is the impact?
• Communication bandwidth
• Communication latency
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed? The number of wires, the clock rate, and multiplexing.
• Impact on the programming model? Massive data exchange is possible; data movement is not the bottleneck; processor affinity is not that important.
• From roughly 32 Gbit/s (between processors) to roughly 300 Tbit/s (between cores): about 10,000x.
Communication Latency
• How long does a round-trip communication take?
• What changed? The length of the wires and the number of pipeline stages.
• Impact on the programming model? Ultra-fast synchronization; real-time apps can run across multiple cores.
• From roughly 200 cycles down to roughly 4 cycles: about 50x.
80 Core
• Intel's 80-core chip: 1 Tflop/s at 62 Watts, with 1.2 TB/s of internal bandwidth
NSF Track 1 – NCSA/UIUC
• $200M; 10 Pflop/s
• 40K 8-core 4 GHz IBM Power7 chips
• 1.2 PB of memory
• 5 PB/s global bandwidth; interconnect bandwidth of 0.55 PB/s
• 18 PB of disk at 1.8 TB/s I/O bandwidth
• For use by a few people

NSF Track 2 – University of Tennessee
• $65M over 5 years for a 1 Pflop/s system: $30M over 5 years for equipment and $35M over 5 years for operations
• 36 cabinets of a Cray XT5 (AMD 8-core/chip, 12 sockets/board, 3 GHz, 4 flops/cycle/core)
Major Changes to Software
• We must rethink the design of our software: this is another disruptive technology, similar to what happened with cluster computing and message passing
• Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
• LINPACK (1980s), vector operations: relies on Level-1 BLAS operations
• LAPACK (1990s), blocking, cache friendly: relies on Level-3 BLAS operations
• PLASMA (2000s), new many-core-friendly algorithms: rely on a DAG/scheduler, a block data layout, and some extra kernels
These new algorithms have very low granularity and scale very well (multicore, petascale computing, ...); they remove many of the dependencies among tasks (multicore, distributed computing); they avoid latency (distributed computing, out-of-core); and they rely on fast kernels. They need new kernels and depend on efficient scheduling algorithms.
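To make the tile/DAG idea concrete, here is a minimal sequential sketch of a tiled Cholesky factorization (illustrative NumPy/SciPy code, not PLASMA's implementation; the function name, the tile size nb, and the assumption that n is a multiple of nb are mine). In PLASMA, each tile operation below (POTRF, TRSM, GEMM/SYRK) becomes a task in a DAG and is dispatched by a runtime scheduler rather than running in loop order.

```python
import numpy as np
from scipy.linalg import solve_triangular

def tiled_cholesky(A, nb):
    """Compute lower-triangular L with A = L @ L.T, working tile by tile.
    Assumes A is symmetric positive definite and n is a multiple of nb."""
    n = A.shape[0]
    nt = n // nb
    L = np.array(A, dtype=float, copy=True)
    tile = lambda i, j: L[i*nb:(i+1)*nb, j*nb:(j+1)*nb]  # view into L

    for k in range(nt):
        # POTRF: factor the diagonal tile
        tile(k, k)[:] = np.linalg.cholesky(tile(k, k))
        # TRSM: update the tiles below the diagonal tile
        for i in range(k + 1, nt):
            tile(i, k)[:] = solve_triangular(tile(k, k), tile(i, k).T, lower=True).T
        # GEMM/SYRK: update the trailing submatrix
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                tile(i, j)[:] -= tile(i, k) @ tile(j, k).T
    return np.tril(L)

# Example: factor a random SPD matrix in 128x128 tiles
n, nb = 512, 128
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)
L = tiled_cholesky(A, nb)
assert np.allclose(L @ L.T, A)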
• PowerPC (PPE) at 3.2 GHz: DGEMM at 5 Gflop/s, AltiVec peak at 25.6 Gflop/s, achieved 10 Gflop/s SGEMM
• 8 SPUs: 204.8 Gflop/s peak! The catch is that this is for 32-bit (single precision, SP) floating point; 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs
• Divide the SP peak by 14: a factor of 2 because of DP and a factor of 7 because of latency issues
Moving Data Around on the Cell
• 256 KB of local store per SPE; injection bandwidth of 25.6 GB/s
• Worst case, a memory-bound operation with no reuse of data: 3 data movements (2 in and 1 out) per 2 operations (SAXPY). For the Cell that would be 4.6 Gflop/s (25.6 GB/s × 2 ops / 12 B).
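As a back-of-the-envelope check of that kind of bound, here is a tiny helper (illustrative only; the function and the SAXPY traffic count are my assumptions, and the exact figure depends on how the memory traffic is counted): a memory-bound kernel cannot run faster than bandwidth times its flops-per-byte ratio.

```python
def bandwidth_bound_gflops(bandwidth_gb_per_s, flops_per_element, bytes_per_element):
    """Upper bound on Gflop/s for a kernel limited purely by memory bandwidth."""
    return bandwidth_gb_per_s * flops_per_element / bytes_per_element

# SAXPY (y = a*x + y) in single precision: 2 flops per element,
# 12 bytes of traffic per element (load x, load y, store y at 4 bytes each).
print(bandwidth_bound_gflops(25.6, 2, 12))  # a few Gflop/s, versus the 204.8 Gflop/s SP peak
```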
[Figure: IBM Cell 3.2 GHz, Ax = b. Gflop/s versus matrix size (up to 4500). Curves: SP peak (204 Gflop/s), SP Ax=b (IBM), DP peak (15 Gflop/s), DP Ax=b (IBM), and 8 concurrent SGEMMs (embarrassingly parallel). The SP solve takes 0.30 s versus 3.9 s for the DP solve.]
[Figure: IBM Cell 3.2 GHz, Ax = b, with mixed precision. Same curves as above plus DSGESV (single precision factorization with iterative refinement to double precision accuracy): 0.47 s versus 3.9 s for the full DP solve, an 8.3x speedup, close to the 0.30 s pure SP time.]
Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
On the SPEs the code uses standard C and C-language SIMD extensions (intrinsics). The curves show single precision performance and mixed precision performance using iterative refinement, a method achieving 64-bit accuracy.
Cholesky - Using 2 Cell Chips
Intriguing Potential
• Exploit lower precision as much as possible: the payoff is performance (faster floating point, less data to move)
• Automatically switch between SP and DP to match the desired accuracy: compute the solution in SP and then a correction to the solution in DP
• Potential for GPUs, FPGAs, and special purpose processors; what about 16-bit floating point?
• Use as little precision as you can get away with and then improve the accuracy
• Applies to sparse direct and iterative linear systems, and to eigenvalue and optimization problems where Newton's method is used
Correction = -A\(b - Ax)
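Here is a minimal sketch of that SP-factor/DP-refine idea in the spirit of routines like LAPACK's DSGESV (illustrative NumPy/SciPy code with names of my choosing, not the actual library routine): the matrix is factored once in single precision, and that cheap factorization is reused to compute double precision corrections.

```python
import numpy as np
import scipy.linalg as la

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: LU factor and solve in single precision, refine in double."""
    lu, piv = la.lu_factor(A.astype(np.float32))               # SP factorization (the fast part)
    x = la.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                           # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        c = la.lu_solve((lu, piv), r.astype(np.float32))        # correction = A \ (b - Ax), via the SP factors
        x += c.astype(np.float64)
    return x

# Example on a well-conditioned random system
n = 1000
A = np.random.rand(n, n) + n * np.eye(n)
b = np.random.rand(n)
x = mixed_precision_solve(A, b)
assert np.allclose(A @ x, b)
```

This is exactly the "compute the solution in SP and a correction in DP" pattern from the slide; on the Cell, where the SP peak is roughly 14x the DP peak, almost all the work lands in the fast single precision path.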
IBM/Mercury Cell Blade
• From IBM or Mercury
• 2 Cell chips, each with 8 SPEs
• 512 MB per Cell
• Roughly $8K to $17K
• Some software available
Sony PlayStation 3 Cluster (PS3-T)
• IBM/Mercury Cell blade: 2 Cell chips, each with 8 SPEs, 512 MB per Cell, roughly $8K to $17K, some software available
• From WAL*MART: PS3 with 1 Cell chip (6 usable SPEs), 256 MB per PS3, $600, downloadable software, dual boot
Cell Hardware Overview
• One PowerPC PPE plus 8 SPEs at 3.2 GHz (each SPE peaks at 25.6 Gflop/s in single precision)
• 25 GB/s injection bandwidth; 200 GB/s between the SPEs
• 32-bit peak: 8 × 25.6 Gflop/s = 204.8 Gflop/s
• 64-bit peak: 8 × 1.8 Gflop/s = 14.6 Gflop/s
• 512 MiB of memory
PS3 Hardware Overview
• One PowerPC PPE plus 6 usable SPEs at 3.2 GHz (the remaining SPEs are disabled for yield reasons or reserved by the GameOS hypervisor)
• 25 GB/s injection bandwidth; 200 GB/s between the SPEs
• 32-bit peak: 6 × 25.6 Gflop/s = 153.6 Gflop/s
• 64-bit peak: 6 × 1.8 Gflop/s = 10.8 Gflop/s
• 1 Gb/s NIC; 256 MiB of memory
[Figure: PlayStation 3 LU codes. Gflop/s versus matrix size (up to 2500). Curves: SP peak (153.6 Gflop/s), SP Ax=b (IBM), DP peak (10.9 Gflop/s), and 6 concurrent SGEMMs (embarrassingly parallel).]
[Figure: PlayStation 3 LU codes, with mixed precision. Same curves as above plus DSGESV (single precision factorization with iterative refinement to double precision accuracy).]
Cholesky on the PS3, Ax = b, A = A^T, x^T A x > 0
HPC in the Living Room
Matrix Multiply on a 4-Node PlayStation 3 Cluster
What's good:
• Very cheap: roughly $4 per Gflop/s (based on the 32-bit floating point theoretical peak)
• Fast local computations between SPEs
• Perfect overlap between communication and computation is possible (Open MPI running): the PPE does communication via MPI while the SPEs do computation via SGEMMs
What's bad:
• Gigabit network card: 1 Gb/s is too little for such computational power (150 Gflop/s per node)
• Linux can only run on top of the GameOS hypervisor
• Extremely high network access latencies (120 µs) and low bandwidth (600 Mb/s)
• Only 256 MB of local memory; only 6 SPEs
Timeline figure: gold = computation (8 ms), blue = communication (20 ms)
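A minimal sketch of that overlap pattern using mpi4py (illustrative only, not the PS3 cluster code; on the real machine the PPE drives MPI while the SPEs run the SGEMMs, whereas here both are just stand-in NumPy calls): non-blocking sends and receives are posted first, the local multiply runs while the messages are in flight, and the code waits on the requests before touching the received data.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1024
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
B_next = np.empty_like(B)

# Post non-blocking communication (a ring shift of the B panel)
right, left = (rank + 1) % size, (rank - 1) % size
reqs = [comm.Isend(B, dest=right), comm.Irecv(B_next, source=left)]

# Overlap: local SGEMM while the transfer is in flight
C = A @ B

# Wait for the communication before using the received panel
MPI.Request.Waitall(reqs)
```

Run with something like "mpiexec -n 4 python overlap.py"; as on the PS3 cluster, the multiply can hide the network transfer only if the computation takes at least as long as the communication, which the 8 ms versus 20 ms timing above shows was not the case over Gigabit Ethernet.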
Users Guide for SC on PS3
• SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
• See the webpage for details
Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable for software. Hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance computing ecosystem is out of balance: hardware, OS, compilers, software, algorithms, applications.
• There is no Moore's Law for software, algorithms, and applications.