Seminar on parallel computing
• Goal: provide an environment for exploration of parallel computing
• Driven by participants
• Weekly hour for discussion, show & tell
• Focus primarily on distributed memory computing on Linux PC clusters
• Target audience:
  – Experience with Linux computing & Fortran/C
  – Requires parallel computing for own studies
• 1 credit possible for completion of ‘proportional’ project
Main idea
• Distribute a job over multiple processing units
• Do bigger jobs than is possible on single machines
• Memory addressing:
  – 32-bit addresses in PCs: 4 GB RAM max.
Machine architecture: serial
– Single processor
– Hierarchical memory:
• Small number of registers on CPU
• Cache (L1/L2)
• RAM
• Disk (swap space)
– Operations require multiple steps:
  • Fetch two floating-point numbers from main memory
• Add and store
• Put back into main memory
Vector processing
• Speed up single instructions on vectors
  – E.g., while adding two floating-point numbers, fetch two new ones from main memory
  – Pushing vectors through the pipeline
• Useful in particular for long vectors
• Requires good memory control:
– Bigger cache is better
• Common on most modern CPUs
  – Implemented in both hardware and software
SIMD
• Same instruction works simultaneously on different data sets
• Extension of vector computing
• Example:

  DO IN PARALLEL
    for i = 1, n
      x(i) = a(i)*b(i)
    end
  DONE PARALLEL
MIMD
• Multiple instruction, multiple data
• Most flexible; encompasses SIMD/serial
• Often best for ‘coarse-grained’ parallelism
• Message passing
• Example: domain decomposition
– Divide the computational grid into equal chunks
– Work on each domain with one CPU
– Communicate boundary values when necessary
• 1976 Cray-1 at Los Alamos (vector)
• 1980s Control Data Cyber 205 (vector)
• 1980s Cray X-MP
  – 4 coupled Cray-1s
• 1985 Thinking Machines Connection Machine
  – SIMD, up to 64k processors