Page 1: More on Parallel Computing

Computational Science (jsumethod05), 30 January 2005
Spring Semester 2005
Geoffrey Fox
Community Grids Laboratory, Indiana University
505 N Morton, Suite 224, Bloomington IN
gcf@indiana.edu

Page 2: What is Parallel Architecture?

• A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
  – from a pile of PCs to a shared-memory multiprocessor
• Some broad issues:
  – Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  – Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  – Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?

Page 3: Parallel Computers -- Classic Overview

• Parallel computers allow several CPUs to contribute to a computation simultaneously.
• For our purposes, a parallel computer has three types of parts:
  – Processors
  – Memory modules
  – Communication / synchronization network
• Key points:
  – All processors must be busy for peak speed.
  – Local memory is directly connected to each processor.
  – Accessing local memory is much faster than other memory.
  – Synchronization is expensive, but necessary for correctness.

(Figure: colors used in the following pictures)

Page 4: Distributed Memory Machines

• Every processor has a memory others can't access.
• Advantages:
  – Relatively easy to design and build
  – Predictable behavior
  – Can be scalable
  – Can hide latency of communication
• Disadvantages:
  – Hard to program
  – Program and O/S (and sometimes data) must be replicated

Page 5: Communication on Distributed Memory Architecture

• On distributed memory machines, each chunk of decomposed data resides in a separate memory space -- a single processor is typically responsible for storing and processing it (the owner-computes rule)
• Information needed on the edges for an update must be communicated via explicitly generated messages -- see the sketch below

(Figure: messages passing between the processors' separate memories)
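A minimal sketch of the explicit message exchange just described, written with MPI (which the course discusses later); the 1D grid, array names and sizes are illustrative assumptions, not part of the original example.

```cpp
// Hypothetical sketch: each process owns a contiguous block of a 1D grid
// (owner-computes) and must receive one "ghost" value from each neighbour
// before it can update its edge points.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nlocal = 100;                  // points owned by this process
    std::vector<double> u(nlocal + 2, 0.0);  // +1 ghost cell at each end
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Explicitly generated messages: send my edge values, receive neighbours'
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[nlocal], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Now every owned point, including the edges, can be updated locally
    std::vector<double> unew(u);
    for (int i = 1; i <= nlocal; ++i)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);  // Laplace-style relaxation

    MPI_Finalize();
    return 0;
}
```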

Page 6: Distributed Memory Machines -- Notes

• Conceptually, the nCUBE, CM-5, Paragon, SP-2, Beowulf PC cluster and BlueGene are quite similar.
• The bandwidth and latency of their interconnects differ.
• The network topology is a two-dimensional torus for the Paragon, a three-dimensional torus for BlueGene, a fat tree for the CM-5, a hypercube for the nCUBE and a switch for the SP-2.
• To program these machines:
  – Divide the problem to minimize the number of messages while retaining parallelism
  – Convert all references to global structures into references to local pieces (explicit messages convert distant to local variables) -- see the sketch below
  – Optimization: pack messages together to reduce fixed overhead (almost always needed)
  – Optimization: carefully schedule messages (usually done by a library)
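A hypothetical sketch of the second programming step above -- converting global array references into references to local pieces -- assuming a simple 1D block distribution; the struct and its member names are invented for illustration.

```cpp
// Hypothetical helpers for a 1D block distribution of N global points over P
// processes: each process stores only its own chunk plus ghost cells, so a
// global index must be translated before it can be used locally.
struct BlockDistribution {
    int N, P, rank;
    int chunk() const { return (N + P - 1) / P; }       // points per process
    int globalStart() const { return rank * chunk(); }  // first global index owned here
    int owner(int g) const { return g / chunk(); }      // which rank owns global index g
    // local index includes a +1 offset for the left ghost cell
    int globalToLocal(int g) const { return g - globalStart() + 1; }
};
```

With this, a reference such as u[g] in the sequential code becomes u[dist.globalToLocal(g)] when dist.owner(g) equals the local rank; otherwise the value has to arrive in a message, as in the sketch on the previous page.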

Page 7: BlueGene/L has Classic Architecture

• The 32768-node BlueGene/L takes the #1 TOP500 position, 29 September 2004, with 70.7 Teraflops

Page 8: BlueGene/L Fundamentals

• Low-complexity nodes give more flops per transistor and per watt
• The 3D interconnect supports many scientific simulations, as nature as we see it is 3D

Page 9: 1987 MPP

(Figure: a 1024-node full system with hypercube interconnect -- a 1987 MPP)

Page 10: Shared-Memory Machines

• All processors access the same memory.
• Advantages:
  – Retain sequential programming languages such as Java or Fortran
  – Easy to program (correctly)
  – Can share code and data among processors
• Disadvantages:
  – Hard to program (optimally)
  – Not scalable, due to bandwidth limitations in the bus

Page 11: Communication on Shared Memory Architecture

• On a shared memory machine a CPU is responsible for processing a decomposed chunk of data but not for storing it
• The nature of the parallelism is identical to that for distributed memory machines, but communication is implicit as processors "just" access memory -- see the sketch below
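For contrast with the distributed-memory sketch, a minimal shared-memory version of the same kind of edge update, using OpenMP (discussed later in these slides); here the "communication" is nothing more than reading a neighbouring element of the shared array. The function and array names are illustrative.

```cpp
// Minimal OpenMP sketch: the whole grid lives in one shared array, so each
// thread's chunk of the loop simply reads its neighbours' values directly --
// no explicit messages; the memory system moves the data.
#include <omp.h>
#include <vector>

void relax(const std::vector<double>& u, std::vector<double>& unew) {
    const int n = static_cast<int>(u.size());
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);  // neighbour access is the implicit communication
}
```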

Page 12: Shared-Memory Machines -- Notes

• The interconnection network varies from machine to machine
• These machines share data by direct access.
  – Potentially conflicting accesses must be protected by synchronization.
  – Simultaneous access to the same memory bank will cause contention, degrading performance.
  – Some access patterns will collide in the network (or bus), causing contention.
  – Many machines have caches at the processors.
  – All these features make it profitable to have each processor concentrate on one area of memory that others access infrequently.

Page 13: Distributed Shared Memory Machines

• Combine the (dis)advantages of shared and distributed memory
• Lots of hierarchical designs.
  – Typically, "shared memory nodes" with 4 to 32 processors
  – Each processor has a local cache
  – Processors within a node access shared memory
  – Nodes can get data from or put data to other nodes' memories

Page 14: Summary on Communication etc.

• Distributed Shared Memory machines have the communication features of both distributed (messages) and shared (memory access) architectures
• Note that for distributed memory, the programming model must express data location (the HPF DISTRIBUTE command) and the invocation of messages (MPI syntax)
• For shared memory, one needs to express control (OpenMP) or processing parallelism and synchronization -- one must make certain that when a variable is updated, the "correct" version is used by other processors accessing it and that values living in caches are updated

Page 15: Seismic Simulation of Los Angeles Basin

• This is a (sophisticated) wave equation, similar to the Laplace example: you divide Los Angeles geometrically and assign a roughly equal number of grid points to each processor

(Figure: a computer with 4 processors; the problem, represented by grid points, is divided into 4 domains)

Page 16: Communication Must be Reduced

• 4 by 4 regions in each processor
  – 16 Green (Compute) and 16 Red (Communicate) points
• 8 by 8 regions in each processor
  – 64 Green and "just" 32 Red points
• Communication is an edge effect -- see the arithmetic sketch below
• Give each processor plenty of memory and increase the region in each machine
• Large problems parallelize best
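The counts above follow from simple surface-to-volume arithmetic; a small sketch that reproduces them, assuming an n by n block whose halo is one point wide on each side (corner points ignored).

```cpp
// Sketch of the edge-effect arithmetic: for an n x n block owned by one
// processor, the points it computes scale as n^2 while the halo points it
// must receive from its four neighbours scale as 4n.
#include <cstdio>

int main() {
    for (int n : {4, 8, 16, 32}) {
        int compute = n * n;      // "green" points
        int communicate = 4 * n;  // "red" halo points
        std::printf("n=%2d  compute=%4d  communicate=%3d  ratio=%.3f\n",
                    n, compute, communicate,
                    static_cast<double>(communicate) / compute);
    }
    return 0;
}
// n=4 gives 16 and 16; n=8 gives 64 and 32 -- doubling n halves the
// communication-to-computation ratio, which is why large problems parallelize best.
```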

Page 17: Irregular 2D Simulation -- Flow over an Airfoil

• The Laplace grid points become finite element mesh nodal points, arranged as triangles filling space
• All the action (triangles) is near the wing boundary
• Use domain decomposition, but the domains no longer have equal area; rather, they have equal triangle counts

Page 18: Heterogeneous Problems

• Simulation of a cosmological cluster (say 10 million stars)
• Lots of work per star where stars are very close together (may need a smaller time step)
• Little work per star where the force changes slowly and can be well approximated by a low-order multipole expansion

Page 19: Load Balancing Particle Dynamics

• Particle dynamics of this type (irregular, with sophisticated force calculations) always need complicated decompositions
• Equal-area decompositions, as shown here, lead to load imbalance
• If simpler algorithms (full O(N^2) forces) or FFT are used, then equal area is best

(Figure: equal-volume decomposition of a universe simulation -- galaxy or star or ... -- on 16 processors)

Page 20: Reduce Communication

• Consider a geometric problem with 4 processors
• In the top decomposition, we divide the domain into 4 blocks, with all points in a given block contiguous
• In the bottom decomposition, we give each processor the same amount of work, but divided into 4 separate domains
• edge/area(bottom) = 2 * edge/area(top)
• So minimizing communication implies we keep the points in a given processor together

(Figure: block decomposition (top) vs. cyclic decomposition (bottom))

Page 21: Minimize Load Imbalance

• But this has a flip side. Suppose we are decomposing the seismic wave problem and all the action is near a particular earthquake fault (marked in the figure).
• In the top decomposition only the white processor does any work while the other 3 sit idle.
  – Efficiency 25% due to load imbalance
• In the bottom decomposition all the processors do roughly the same work and so we get good load balance -- a sketch of the two mappings follows

(Figure: block decomposition (top) vs. cyclic decomposition (bottom))
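A hypothetical sketch of the two mappings for a 1D index space, showing why the block decomposition keeps a processor's points together while the cyclic decomposition spreads them out; the function names are illustrative.

```cpp
// Illustrative 1D mappings for P processors and N points.
// Block keeps each processor's points contiguous (less communication);
// cyclic deals points out like cards (better load balance when the work is
// concentrated in one region, as in the earthquake-fault example).
int blockOwner(int i, int N, int P)  { int chunk = (N + P - 1) / P; return i / chunk; }
int cyclicOwner(int i, int N, int P) { (void)N; return i % P; }
```

With P = 4 and N = 16, blockOwner gives processor 0 the contiguous points 0-3, while cyclicOwner gives it the scattered points 0, 4, 8 and 12: less communication in the first case, better load balance in the second when the work is concentrated in one region.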

Page 22: Parallel Irregular Finite Elements

• Here is a cracked plate; calculating stresses with an equal-area decomposition leads to terrible results
  – All the work is near the crack

(Figure: equal-area decomposition of the cracked plate, one region per processor)

Page 23: Irregular Decomposition for Crack

• Concentrating processors near the crack leads to good workload balance
• Regions have equal nodal-point counts -- not equal area -- but to minimize communication the nodal points assigned to a particular processor are contiguous
• This is an NP-complete (exponentially hard) optimization problem, but in practice there are many ways of getting good, though not exact, decompositions -- one such heuristic is sketched below

(Figure: the region assigned to one processor; the workload is not perfect!)
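One heuristic of the "good but not exact" kind mentioned above is recursive coordinate bisection; the sketch below is a simplified, generic version and is not claimed to be the partitioner used for this crack calculation.

```cpp
// Simplified recursive coordinate bisection: repeatedly split the nodal points
// at the median of their wider coordinate, so each processor ends up with an
// (almost) equal number of spatially contiguous points, however unevenly the
// mesh is refined near the crack.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node { double x, y; int proc = -1; };

void bisect(std::vector<Node*> pts, int firstProc, int numProcs) {
    if (pts.empty()) return;
    if (numProcs <= 1) {
        for (Node* p : pts) p->proc = firstProc;  // whole region -> one processor
        return;
    }
    // Pick the coordinate with the larger extent as the split direction.
    double xmin = pts[0]->x, xmax = pts[0]->x, ymin = pts[0]->y, ymax = pts[0]->y;
    for (const Node* p : pts) {
        xmin = std::min(xmin, p->x); xmax = std::max(xmax, p->x);
        ymin = std::min(ymin, p->y); ymax = std::max(ymax, p->y);
    }
    const bool splitX = (xmax - xmin) >= (ymax - ymin);
    auto less = [splitX](const Node* a, const Node* b) {
        return splitX ? a->x < b->x : a->y < b->y;
    };

    // Median split by node count (equal work), not by area.
    const std::size_t mid = pts.size() / 2;
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(), less);

    std::vector<Node*> lower(pts.begin(), pts.begin() + mid);
    std::vector<Node*> upper(pts.begin() + mid, pts.end());
    bisect(lower, firstProc, numProcs / 2);
    bisect(upper, firstProc + numProcs / 2, numProcs - numProcs / 2);
}
```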

Page 24: Further Decomposition Strategies

• Not all decompositions are quite the same
• In defending against missile attacks, you track each missile on a separate node -- geometric again
• In playing chess, you decompose the chess tree -- an abstract, not geometric, space

(Figure: computer chess tree -- the current position (a node in the tree), the first set of moves, the opponent's counter moves, ..., "California gets its independence")

Page 25: Summary of Parallel Algorithms

• A parallel algorithm is a collection of tasks and a partial ordering between them.
• Design goals:
  – Match tasks to the available processors (exploit parallelism).
  – Minimize ordering (avoid unnecessary synchronization points).
  – Recognize ways parallelism can be helped by changing the ordering.
• Sources of parallelism:
  – Data parallelism: updating array elements simultaneously.
  – Functional parallelism: conceptually different tasks which combine to solve the problem. This happens at fine and coarse grain size
    • fine is "internal", such as I/O and computation; coarse is "external", such as separate modules linked together

Page 26: Data Parallelism in Algorithms

• Data-parallel algorithms exploit the parallelism inherent in many large data structures.
  – A problem is an (identical) algorithm applied to multiple points in a data "array"
  – Usually one iterates over such "updates"
• Features of data parallelism:
  – Scalable parallelism -- can often get million-way or more parallelism
  – Hard to express when the "geometry" is irregular or dynamic
• Note that data-parallel algorithms can be expressed in ALL programming models (Message Passing, HPF-like, OpenMP-like)

Page 27: Functional Parallelism in Algorithms

• Functional parallelism exploits the parallelism between the parts of many systems.
  – Many pieces to work on, so many independent operations
  – Example: coarse-grain aeroelasticity (aircraft design)
    • CFD (fluids), CSM (structures) and others (acoustics, electromagnetics etc.) can be evaluated in parallel
• Analysis:
  – Parallelism limited in size -- tens, not millions
  – Synchronization is probably good, as the parallelism is natural from the problem and the usual way of writing software
  – The Web exploits functional parallelism, NOT data parallelism

Page 28: Pleasingly Parallel Algorithms

• Many applications are what is called (essentially) embarrassingly or, more kindly, pleasingly parallel
• These are made up of independent concurrent components
  – Each client independently accesses a Web server
  – Each roll of the Monte Carlo dice (random number) is an independent sample -- see the sketch below
  – Each stock can be priced separately in a financial portfolio
  – Each transaction in a database is almost independent (a given account is locked, but usually different accounts are accessed at the same time)
  – Different parts of seismic data can be processed independently
• In contrast, points in a finite difference grid (from a differential equation) canNOT be updated independently
• Such problems are often formally data-parallel but can be handled much more easily -- like functional parallelism
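A minimal sketch of the Monte Carlo case: every sample is independent, so the only cooperation is the final reduction of the per-thread counts. The pi estimate is just a stand-in example, not from the course materials.

```cpp
// Pleasingly parallel Monte Carlo estimate of pi: each thread draws its own
// independent samples; the only interaction is the reduction of the counts.
#include <omp.h>
#include <cstdio>
#include <random>

int main() {
    const long samples = 10000000;
    long inside = 0;

    #pragma omp parallel reduction(+:inside)
    {
        std::mt19937_64 gen(12345 + omp_get_thread_num());  // independent streams
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        #pragma omp for
        for (long i = 0; i < samples; ++i) {
            double x = uni(gen), y = uni(gen);
            if (x * x + y * y <= 1.0) ++inside;
        }
    }
    std::printf("pi ~ %f\n", 4.0 * inside / samples);
    return 0;
}
```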

Page 29: Parallel Languages

• A parallel language provides an executable notation for implementing a parallel algorithm.
• Design criteria:
  – How are parallel operations defined?
    • static tasks vs. dynamic tasks vs. implicit operations
  – How is data shared between tasks?
    • explicit communication/synchronization vs. shared memory
  – How is the language implemented?
    • low-overhead runtime systems vs. optimizing compilers
• Usually a language reflects a particular style of expressing parallelism.
• Data parallel expresses the concept of an identical algorithm on different parts of an array
• Message parallel expresses the fact that, at a low level, parallelism implies information is passed between different concurrently executing program parts

Page 30: Data-Parallel Languages

• Data-parallel languages provide an abstract, machine-independent model of parallelism.
  – Fine-grain parallel operations, such as element-wise operations on arrays
  – Shared data in large, global arrays with mapping "hints"
  – Implicit synchronization between operations
  – Partially explicit communication from operation definitions
• Advantages:
  – Global operations conceptually simple
  – Easy to program (particularly for certain scientific applications)
• Disadvantages:
  – Unproven compilers
  – As they express the "problem", they can be inflexible if a new algorithm arises that the language doesn't express well
• Examples: HPF
• Originated on SIMD machines, where parallel operations are in lock-step, but generalized (not so successfully, as the compilers are too hard) to MIMD

Page 31: Approaches to Parallel Programming

• Data Parallel is typified by CMFortran and its generalization, High Performance Fortran, which in previous years we discussed in detail but this year we will not discuss; see the Source Book for more on HPF
• Typical data parallel Fortran statements are full array statements -- see the sketch below
  – B = A1 + A2
  – B = EOSHIFT(A, -1)
  – Function operations on arrays representing the full data domain
• Message Passing is typified by the later discussion of the Laplace example; it specifies specific machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level, as it (tries to) specify a problem feature
• Note: we are always using "data parallelism" at the problem level, whether the software is "message passing" or "data parallel"
• Data parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine
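The two array statements above, written out as the explicit loops a compiler would generate for one (local) piece of the data; this is a hedged C++ rendering, assuming Fortran's default zero boundary value for EOSHIFT and illustrative array names.

```cpp
// C++ loops equivalent to the data-parallel array statements above
// (0-based indexing; the two loops correspond to two separate statements).
#include <vector>

void arrayStatements(const std::vector<double>& A1, const std::vector<double>& A2,
                     const std::vector<double>& A, std::vector<double>& B) {
    const std::size_t n = A.size();  // assume all arrays have the same extent

    // B = A1 + A2 : an element-wise operation over the whole data domain
    for (std::size_t i = 0; i < n; ++i)
        B[i] = A1[i] + A2[i];

    // B = EOSHIFT(A, -1) : every element takes its left neighbour's value; on a
    // distributed memory machine the compiler turns the boundary access into a
    // message from the neighbouring processor.
    for (std::size_t i = 1; i < n; ++i)
        B[i] = A[i - 1];
    B[0] = 0.0;  // end-off value shifted in at the boundary (Fortran default)
}
```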

Page 32: Shared Memory Programming Model

• Experts in Java are familiar with this, as it is built into the Java language through thread primitives
• We take "ordinary" languages such as Fortran, C++ or Java and add constructs to help compilers divide processing (automatically) into separate threads
  – indicate which DO/for loop instances can be executed in parallel and where there are critical sections with global variables etc. -- see the sketch below
• OpenMP is a recent set of compiler directives supporting this model
• This model tends to be inefficient on distributed memory machines, as the optimizations (data layout, communication blocking etc.) are not natural
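A minimal illustration of the directives described above: one pragma marks the loop whose iterations may run as separate threads, and a critical section protects a shared global variable. The code and names are hypothetical, not from the course materials.

```cpp
// Minimal OpenMP sketch: the directive tells the compiler which loop
// iterations may run in parallel, and the critical section guards the
// shared global variable.
#include <omp.h>
#include <vector>

double maxResidual = 0.0;  // shared global updated by all threads

void sweep(const std::vector<double>& u, std::vector<double>& unew) {
    const int n = static_cast<int>(u.size());
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i) {
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        double r = unew[i] - u[i];
        if (r < 0) r = -r;
        #pragma omp critical
        {
            if (r > maxResidual) maxResidual = r;  // conflicting update needs synchronization
        }
    }
}
```

In practice one would often use an OpenMP reduction clause instead, but the critical section shows the synchronization point explicitly.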

Page 33: Structure (Architecture) of Applications - I

• Applications are metaproblems with a mix of module (aka coarse-grain functional) and data parallelism
• Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be the
  – "10,000" separate programs (e.g. structures, CFD ...) used in the design of an aircraft
  – the various filters used in the Adobe Photoshop or Matlab image processing systems
  – the ocean-atmosphere components in an integrated climate simulation
  – the database or file system access of a data-intensive application
  – the objects in a distributed Forces Modeling Event Driven Simulation

Page 34: Structure (Architecture) of Applications - II

• Modules are "natural" message-parallel components of the problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
  – modules are what HPF needs task parallelism for
  – often modules are naturally distributed, whereas the parts of a data parallel decomposition may need to be kept on a tightly coupled MPP
• Assume that the primary goal of a metacomputing system is to add to existing parallel computing environments a higher level supporting module parallelism
  – now if one takes a large CFD problem and divides it into a few components, those "coarse grain data-parallel components" will be supported by computational grid technology
• Use Java/Distributed Object Technology for modules -- note Java is to a growing extent used to write servers for CORBA and COM object systems

Page 35: Multi Server Model for Metaproblems

• We have multiple supercomputers in the backend -- one doing a CFD simulation of airflow, another structural analysis -- while in more detail you have linear algebra servers (NetSolve), optimization servers (NEOS), image processing filters (Khoros), databases (NCSA Biology Workbench) and visualization systems (AVS, CAVEs)
  – One runs 10,000 separate programs to design a modern aircraft, which must be scheduled and linked .....
• All are linked to collaborative information systems in a sea of middle-tier servers (as on the previous page) to support design, crisis management and multi-disciplinary research

Page 36: Multi-Server Scenario

(Figure: a multi-server scenario. Back-end compute engines -- MPPs, an IBM SP2 proxy, an Origin 2000 proxy and a parallel database proxy -- sit behind middle-tier services: a NetSolve linear algebra server, NEOS optimization control, an optimization service, a matrix solver, a database and a data analysis server. A gateway provides control, agent-based choice of compute engine and multidisciplinary control (WebFlow).)