ECMWF Slide 1
Introduction to Parallel Computing
George Mozdzynski
March 2004
ECMWF Slide 2
Outline
What is parallel computing?
Why do we need it?
Types of computer
Parallel Computing today
Parallel Programming Languages
OpenMP and Message Passing
Terminology
ECMWF Slide 3
What is Parallel Computing?
The simultaneous use of more than one processor or computer to solve a problem
ECMWF Slide 4
Why do we need Parallel Computing?
Serial computing is too slow
Need for large amounts of memory not accessible by a single processor
ECMWF Slide 5
The operational IFS TL511L60 forecast model takes about one hour of wall time for a 10-day forecast using 288 CPUs of our 1.3 GHz IBM Cluster 1600 system (1920 CPUs in total).
How long would this model take on a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?
ECMWF Slide 6
Answer: about 8 days.
This PC would need about 25 Gbytes of memory.
8 days is too long for a 10-day forecast!
2-3 hours is too long …
ECMWF Slide 7
IFS Forecast Model (TL511L60)

  CPUs   Wall time (secs)
    64      11355
   128       5932
   192       4230
   256       3375
   320       2806
   384       2338
   448       2054
   512       1842

Amdahl's Law:  Wall time = S + P/N_CPUs
For this model the fit gives Serial S = 574 secs and Parallel P = 690930 secs
(calculated using Excel's LINEST function).

(Named after Gene Amdahl) If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F + (1-F)/N).
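To sanity-check the fit, one can evaluate the model directly. Below is a minimal Fortran sketch (added for illustration, not from the slides; the program name is invented, and it assumes the fitted values S = 574 and P = 690930 quoted above), printing the modelled wall time, speedup and efficiency for each CPU count:

PROGRAM AMDAHL_CHECK
! Evaluate the fitted wall-time model T(N) = S + P/N for the
! TL511L60 forecast, using S = 574 secs and P = 690930 secs.
IMPLICIT NONE
REAL :: S = 574.0, P = 690930.0, T1, TN
INTEGER :: N
T1 = S + P                        ! modelled 1-CPU wall time: 691504 secs
DO N = 64, 512, 64
   TN = S + P / REAL(N)           ! modelled wall time on N CPUs
   PRINT '(I4, F10.0, F8.1, F7.1)', N, TN, T1/TN, 100.0*(T1/TN)/REAL(N)
END DO
END PROGRAM AMDAHL_CHECK

For example, at 512 CPUs the model gives 574 + 690930/512 ≈ 1923 secs, against the measured 1842 secs.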
ECMWF Slide 8
IFS Forecast Model (TL511L60)
  CPUs   Wall time   SpeedUp   Efficiency (%)
     1      691504       1        100.0
    64       11355      61         95.2
   128        5932     117         91.1
   192        4230     163         85.1
   256        3375     205         80.0
   320        2806     246         77.0
   384        2338     296         77.0
   448        2054     337         75.1
   512        1842     375         73.3
ECMWF Slide 9
IFS Forecast Model, TL511L60
[Figure: speed-up vs number of IBM Power 4 CPUs (64-512), showing Observed, Estimated and Ideal curves]
ECMWF Slide 10
Extrapolating Performance
[Figure: extrapolated speed-up vs number of IBM Power 4 CPUs (256-2048), showing Ideal and Estimated curves]
The IFS model would be inefficient on large numbers of CPUs, but it is OK up to 512.
ECMWF Slide 11
Types of Parallel Computer (P = Processor, M = Memory, S = Switch)
Shared memory: several processors (P … P) attached to a single memory (M).
Distributed memory: processor+memory pairs, each P with its own M, connected by a switch (S).
[Diagram: shared memory vs distributed memory architectures]
ECMWF Slide 12
IBM Cluster 1600 (at ECMWF)
(P = Processor, M = Memory, S = Switch)
[Diagram: nodes, each with several processors (P … P) sharing one memory (M), connected to each other by a switch (S): shared memory within a node, distributed memory between nodes]
ECMWF Slide 13
IBM Cluster 1600s at ECMWF (hpca + hpcb)
ECMWF Slide 14
ECMWF supercomputers
1979  CRAY 1A                             Vector
      CRAY XMP-2, XMP-4, YMP-8, C90-16    Vector + Shared Memory Parallel
      Fujitsu VPP700, VPP5000             Vector + MPI Parallel
2002  IBM p690                            Scalar + MPI + Shared Memory Parallel
ECMWF Slide 15
ECMWF’s first Supercomputer
CRAY-1A
1979
ECMWF Slide 16
Where have 25 years gone?
ECMWF Slide 17
Types of Processor

  DO J=1,1000
    A(J) = B(J) + C
  ENDDO

SCALAR PROCESSOR (a single instruction processes one element):
  LOAD B(J); FADD C; STORE A(J); INCR J; TEST

VECTOR PROCESSOR (a single instruction processes many elements):
  LOADV B -> V1; FADDV V1,C -> V2; STOREV V2 -> A
ECMWF Slide 18
Parallel Computing Today
Vector Systems
- NEC SX6
- CRAY X-1
- Fujitsu VPP5000

Scalar Systems
- IBM Cluster 1600
- Fujitsu PRIMEPOWER HPC2500
- HP Integrity rx2600 Itanium2

Cluster Systems (typically installed by an integrator)
- Virginia Tech, Apple G5 / Infiniband
- NCSA, Dell PowerEdge 1750, P4 Xeon / Myrinet
- LLNL, MCR Linux Cluster, Xeon / Quadrics
- LANL, Linux Networx, AMD Opteron / Myrinet
ECMWF Slide 19
The TOP500 project
started in 1993
Top 500 sites reported
Report produced twice a year
- in Europe in June
- in the USA in November
Performance based on the LINPACK benchmark
http://www.top500.org/
ECMWF Slide 20
Top 500 Supercomputers
ECMWF Slide 21
Where is ECMWF in the Top 500?
Rmax – Gflop/sec achieved using the Linpack benchmark
Rpeak – peak hardware Gflop/sec (which will never be reached!)
ECMWF Slide 22
What performance do Meteorological Applications achieve?

Vector computers
- about 30 to 50 percent of peak performance
- relatively more expensive
- also have front-end scalar nodes

Scalar computers
- about 5 to 10 percent of peak performance
- relatively less expensive

Both vector and scalar computers are being used in Met Centres around the world.

Is it harder to parallelize than to vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility
ECMWF Slide 23
Overview of Recent Supercomputers
Aad J. van der Steen and Jack J. Dongarra
http://www.top500.org/ORSC/2003/
ECMWF Slide 24
ECMWF Slide 25
ECMWF Slide 26
Parallel Programming Languages?
• High Performance Fortran (HPF)
  - directive-based extension to Fortran
  - works on both shared and distributed memory systems
  - not widely used (more popular in Japan?)
  - not suited to applications using irregular grids
  - http://www.crpc.rice.edu/HPFF/home.html

• OpenMP
  - directive-based
  - support for Fortran 90/95 and C/C++
  - shared memory programming only
  - http://www.openmp.org
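As a flavour of the directive-based style, here is a minimal OpenMP sketch (an illustration, not from the slides; the program name is invented), parallelising the simple loop that appears earlier on the "Types of Processor" slide:

PROGRAM OPENMP_SKETCH
! The loop iterations are divided among the threads of a single
! shared memory node; A, B and C live in memory visible to all threads.
IMPLICIT NONE
INTEGER, PARAMETER :: N = 1000
REAL :: A(N), B(N), C
INTEGER :: J
B = 1.0
C = 2.0
!$OMP PARALLEL DO PRIVATE(J) SHARED(A, B, C)
DO J = 1, N
   A(J) = B(J) + C
END DO
!$OMP END PARALLEL DO
PRINT *, A(1), A(N)
END PROGRAM OPENMP_SKETCH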
ECMWF Slide 27
Most Parallel Programmers use…
Fortran 90/95, C/C++ with MPI for communicating between tasks (processes)
- works for applications running on shared and distributed memory systems
Fortran 90/95, C/C++ with OpenMP
- for applications whose performance requirements can be satisfied by a single node (shared memory)
Hybrid combination of MPI/OpenMP
- ECMWF’s IFS uses this approach
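For comparison, a minimal MPI sketch in Fortran (illustrative only; the program name and print statement are invented). In the hybrid approach, each such task would additionally run OpenMP threads within its node:

PROGRAM MPI_SKETCH
! Each task runs as a separate process with its own memory;
! data is exchanged only through explicit MPI calls.
IMPLICIT NONE
INCLUDE 'mpif.h'
INTEGER :: IERR, RANK, NTASKS
CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NTASKS, IERR)
PRINT *, 'task', RANK, 'of', NTASKS
! ... decompose the data, exchange halos with MPI_SEND/MPI_RECV ...
CALL MPI_FINALIZE(IERR)
END PROGRAM MPI_SKETCH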
ECMWF Slide 28
The myth of automatic parallelization (2 common versions)
Compilers can do anything (but we may have to wait a while)
- Automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source
Compilers can't do anything (now or ever)
- Automatic parallelization is useless. It'll never work on real code. If you want to port an application to a parallel machine, you have to restructure it extensively. This is a fundamental limitation that will never be overcome.
ECMWF Slide 29
Terminology
Cache, Cache line
NUMA
false sharing
Data decomposition
Halo, halo exchange
FLOP
Load imbalance
Synchronization
ECMWF Slide 30
THANK YOU
ECMWF Slide 31
Cache (P = Processor, C = Cache, M = Memory)
[Diagram: a single processor (P) with one cache (C) between it and memory (M); and a two-level variant in which two processors each have a private level-1 cache (C1) and share a level-2 cache (C2) in front of memory]
ECMWF Slide 32
IBM node = 8 CPUs + 3 levels of $
[Diagram: 8 processors (P), each with a private level-1 cache (C1); each pair of processors shares a level-2 cache (C2); all share a single level-3 cache (C3) in front of memory]
ECMWF Slide 33
Cache is …
Small and fast memory
Cache lines are typically 128 bytes
A cache line has state (copy, exclusive owner)
Coherency protocol
Mapping, sets, ways
Replacement strategy
Write-through or not
Important for performance
- single (unit) stride access is always the best!
- try to avoid writes to the same cache line from different CPUs (false sharing)
But don't lose sleep over this
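To illustrate the single-stride point, a minimal Fortran sketch (an assumed example, not from the slides): Fortran stores arrays column-major, so the inner loop should run over the first index.

PROGRAM STRIDE_SKETCH
! Unit-stride access: the inner loop walks consecutive memory
! locations, so each 128-byte cache line is fully used.
IMPLICIT NONE
INTEGER, PARAMETER :: N = 1000
REAL :: A(N, N)
INTEGER :: I, J
DO J = 1, N             ! outer loop over columns
   DO I = 1, N          ! inner loop over first index: cache friendly
      A(I, J) = 0.0
   END DO
END DO
! Swapping the loops (inner loop over J) would stride N reals per
! iteration and touch a different cache line on almost every access.
END PROGRAM STRIDE_SKETCH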
ECMWF Slide 34
IFS blocking in grid space (IBM p690 / TL159L60)
Trade-off: optimal use of cache vs subroutine call overhead